[buug] Fun (and practical) uses of named pipes (example)

Sat May 28 23:26:58 PDT 2011

So, not too long ago, I had a bit of a task I wanted to accomplish.  I
had two quite large compressed files (one gzip, one bzip2) of a rather
larger hard drive image (120,031,511,040 bytes).  Due to size, time,
etc., I wanted my manipulations of these files to be fairly I/O and time
efficient.  E.g. I didn't want to read a source file or write a target
file more than once, or unnecessarily read a target file.

I wanted to perform various hash calculations (md5, sha1, sha256 and
sha512) on the uncompressed data from each of the compressed images,
extract and save the uncompressed images as files, and compare them.  I
also wanted to do a bit of custom block-wise and running total hash
calculations on the uncompressed images.  The multisum.128M program I
reference below is a bit of perl code I wrote to do that: it
simultaneously does both block-by-block and running cumulative (up
through block) hash calculations, in this case by 128 MiB blocks, so
that if the files differed, I'd know within which 128 MiB blocks they
differed.  Finally, I wanted to know if the two uncompressed files were
byte-by-byte identical all the way through (regardless of whether or
not the various hash calculations also matched).

So, ... to do all that rather efficiently, I used a bunch of named
pipes.  The only processes that read from the compressed files were
their uncompressing programs, and a single program wrote the
uncompressed images - nothing else read those images.  All the other
I/O (for the comparison and for calculating all the various hashes) read
from, or wrote to, a named pipe.  tee(1) was also used heavily to
simultaneously write to multiple outputs (stdout and/or various "files"
- mostly named pipes in our case here).
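
To illustrate just the tee(1) part, here's a minimal sketch (not the
actual task), assuming mktemp(1) is available for a scratch directory;
the copy1/copy2/copy3 names are made up for illustration.  tee writes
its input to every file named on its command line and also to stdout:

```shell
#!/bin/sh
set -e
dir=$(mktemp -d)    # scratch directory (assumption: mktemp is available)

# one input stream, three identical outputs: two named as arguments to
# tee, the third via tee's stdout
printf 'some data\n' | tee "$dir/copy1" "$dir/copy2" > "$dir/copy3"

# check that all three copies are byte-for-byte identical
if cmp -s "$dir/copy1" "$dir/copy2" && cmp -s "$dir/copy2" "$dir/copy3"
then
    same=yes
else
    same=no
fi
echo "copies identical: $same"

rm -r "$dir"
```

In the real task the arguments to tee were mostly named pipes rather
than ordinary files, but the mechanics are the same.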

What's a named pipe?  A First In First Out (FIFO) file.  One of the
types of "special" files in Unix(/Linux, etc.) - but unlike block or
character special devices, no special privilege is needed to create
named pipes.  Named pipes are sort of like the shell's pipe (|), except
they exist in the filesystem (and thus have a name), and are read from
and written to, rather like ordinary files ... except they're not.  They
have no disk data blocks - they're just a buffer.  One generally has one
process read from a named pipe, and another process write to the named
pipe.  The data "comes out from" (is read from) the pipe with the bytes
in the same order they were written to the pipe.  mknod(1) can be used
to create a named pipe, e.g.:
$ mknod name p
would generally create a named pipe of name name, e.g.:
$ mknod name p && ls -ond name && rm name
prw-------  1 1003 0 May 28 23:14 name
$
That leading p in our ls -o (or -l) listing shows us that it's a named
pipe.
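
As an aside, mkfifo(1) is the POSIX utility for the same job, and
test(1)'s -p operator checks whether a path is a named pipe.  A minimal
sketch, assuming mktemp(1) is available; the name demo.fifo is just for
illustration:

```shell
#!/bin/sh
set -e
dir=$(mktemp -d)          # scratch directory (assumption: mktemp exists)
fifo="$dir/demo.fifo"     # hypothetical name, for illustration only

mkfifo "$fifo"            # same effect as: mknod "$fifo" p

if [ -p "$fifo" ]; then   # -p is true only for a named pipe (FIFO)
    is_fifo=yes
else
    is_fifo=no
fi
echo "is_fifo=$is_fifo"

rm "$fifo"                # a named pipe is removed like any other file
rmdir "$dir"
```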

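One generally needs to set something up to read from a named pipe before
writing to it: opening a named pipe for writing normally blocks until
some process has opened it for reading (and vice versa).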
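
A minimal sketch of that ordering, assuming mktemp(1) is available (the
names p and out are just for illustration): the reader is started in
the background first, then the writer runs, and wait collects the
reader once it has seen end-of-file.

```shell
#!/bin/sh
set -e
dir=$(mktemp -d)          # scratch directory (assumption: mktemp exists)
mkfifo "$dir/p"

# reader first, in background; its open of the pipe blocks until a
# writer shows up
cat < "$dir/p" > "$dir/out" &

# writer; the data passes through the pipe to the reader
echo hello > "$dir/p"

wait                      # reader gets EOF once the writer closes the pipe
result=$(cat "$dir/out")
echo "$result"

rm -r "$dir"
```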

Anyway, here's a bit of an example program I did a while back for the
task at hand.
I've tweaked it slightly (for readability - namely shortening some
names/paths and folding some lines), and added some comments to describe
a bit better what it does.

#!/bin/sh

set -e # exit non-zero if command fails

cd /tmp/hd

for f in gz bz2
do
     # create pair of named pipes for each of our gz / bz2 flavors
     mknod p-"$f" p
     mknod p-multisum.128M-"$f" p
     # launch our custom multisum.128M on each, saving stdout and stderr
     ./multisum.128M p-multisum.128M-"$f" \
     > P-multisum.128M-"$f".out \
     2> P-multisum.128M-"$f".err &
done

# launch our cmp, and have it report results to file CMP, and save
# stdout and stderr
if cmp p-gz p-bz2 >cmp.out 2>cmp.err; then
     echo matched > CMP
else
     echo not matched > CMP
fi &

for s in md5 sha1 sha256 sha512
do
     for f in gz bz2
     do
         # make named pipes for each of our hash and file type
         # combinations
         mknod p-"$s"-"$f" p
         # start the hash calculations on each, saving stdout
         "$s"sum < p-"$s"-"$f" > P-"$s"sum-"$f" &
     done
done

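# start our uncompressions, pipe to tee to write our pipes and file for
# each; p*-gz and p*-bz2 expand to all the named pipes created above
# (the output files start with capital P or hd, so they don't match)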
gzip -d < /tmp/sdb1/hd.gz | tee p*-gz > hd-gz &
bzip2 -d < /tmp/sdb2/hd.bz2 | tee p*-bz2 > hd-bz2 &

# essentially all the preceding read/write stuff was started in
# background, with reads started before writes

# we just then wait for the preceding background stuff to all finish,
# at which point we should be done
wait

I'll commonly use a similar technique when I wish to calculate multiple
hash values on a CD or DVD or image thereof.  E.g. I'll create named
pipe file(s) and start background process(es) to calculate hash(es) on
the named pipe(s), redirecting their output to file(s).  Then I'll read
the CD/DVD/image and, typically via tee(1), write it to the named pipes,
typically also piping (|) tee(1)'s stdout to one of the hash programs I
wish to use.  In that way, I read the input CD/DVD/image just once,
rather than rereading and doing that I/O on the media or disk repeatedly
for each hash type I wish to calculate.
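
A rough sketch of that, under some assumptions: the GNU md5sum, sha1sum
and sha256sum utilities and mktemp(1) are available, and a tiny sample
file stands in for the CD/DVD device or image (all the file names here
are made up for illustration):

```shell
#!/bin/sh
set -e
dir=$(mktemp -d)                   # scratch directory (assumes mktemp)
img="$dir/image"                   # stand-in for the CD/DVD device or image
printf 'example data\n' > "$img"   # tiny sample so the sketch runs anywhere

mkfifo "$dir/p-sha1" "$dir/p-sha256"

# readers first, in background, each saving its hash to a file
sha1sum   < "$dir/p-sha1"   > "$dir/sha1.out" &
sha256sum < "$dir/p-sha256" > "$dir/sha256.out" &

# read the input once: tee writes both pipes, and its stdout goes to md5sum
tee "$dir/p-sha1" "$dir/p-sha256" < "$img" | md5sum > "$dir/md5.out"

wait                               # let the background hashes finish

# sanity check against a separate pass (only sensible for a tiny sample)
if [ "$(cat "$dir/sha256.out")" = "$(sha256sum < "$img")" ]; then
    match=yes
else
    match=no
fi
cat "$dir/md5.out" "$dir/sha1.out" "$dir/sha256.out"
echo "one-pass sha256 matches separate pass: $match"

rm -r "$dir"
```

For real media one would replace "$img" with the device or image path
and drop the sanity check, since rereading the input is exactly what
the technique avoids.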