[buug] Why I tend to prefer pax(1) over tar(1) and cpio(1) ...

Michael Paoli Michael.Paoli at cal.berkeley.edu
Thu Oct 18 17:25:09 PDT 2012


Why I tend to prefer pax(1) over tar(1) and cpio(1) (and some relevant
history).

The (very) short version.  The pax(1) utility is very well standardized
(POSIX / Single Unix Specification, etc.), can read and write both
tar(1) and cpio(1) format archives, and has quite sufficiently flexible
means of specifying files to be archived or restored.  It's relatively
clean, compact, robust code (it's from OpenBSD) that does what it needs
to do well, is quite secure, and is not overburdened with features.  It
also lacks some nasty bugs that made it into GNU's implementation of
cpio, and into production releases of many Linux distributions, before
being caught and corrected.

Once upon a time, before pax(1) and cpio(1), there was tar(1) (and
before that ar(1), but we'll not go back *that* far).  Now, tar (Tape
ARchive), even from "way back then" - at least as far back as 1980 - was
quite good at the time for archiving directories and files of ordinary
type.  However, it couldn't handle archiving or restoring other file
types (e.g. character and block special device files - it would mostly
just ignore them).  Of course even
"way back then", tar wasn't limited to reading/writing archives from
tape - it could do so via standard input/output or character or block
device file or file of ordinary type.  It also was quite limited in how
one specified what it did and didn't back up.  Specified files were
simply archived, and directories, recursively so - with no exception
mechanism.  Extraction was similarly limited - default was everything,
otherwise pathname(s) given were extracted, and if they were an
archived directory, the directory was recursively extracted.  It used
to be the case that tar had some other specific limitations due to its
archive format and implementation, but for the most part those aren't
significant impediments in modern implementations of tar - or more
specifically, the archive formats typically now created by tar.

For the most part, cpio came along later.  Though its history also goes
back a fair ways, as far as I'm aware it was somewhere in the AT&T
System V UNIX days that cpio was released as part of the main/core part
of the operating system, rather than only as part of the "text
processing system" and associated tools.  Anyway, cpio - at least at
the time - had many advantages over tar.  Most notably, it would back
up and restore "special" files - in particular block and character
device files (not their contents, but the files themselves).  This was
very advantageous for doing, e.g. "full system backups", and making
recovery a fair bit simpler (as opposed to having to manage to figure
out how to recreate all the device files again, and get their
permissions/ownerships suitably back to what they were before).  The
cpio program also offered a much more convenient mechanism for
specifying what was to be backed up (archived), and what was to be
restored.  When archiving, it reads paths from standard input - very
handy when used with find(1), and also with any program(s) used as a
filter - so it was much more feasible to do, e.g. a very large system
incremental or differential backup with cpio (and suitable filter(s))
than with tar.  Use of cpio also gave a much more flexible means of
specifying what was to be restored.  In much of
that timeframe, the only other available tools for such were dump(1)
and restor(1), which, while rather/quite useful, also had significant
limitations (e.g. may not produce good backups or be safely used on
active rw mounted filesystems, difficult to do quite selective restores
or limited hierarchies, etc.).  Anyway, cpio was quite sufficiently
good and capable, and not only was it commonly used for backups (and
became my generally preferred backup archive format), it was even
chosen for other archive applications, e.g. rpm(8).
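
As a rough illustration of the find-plus-filter approach described
above, here's a minimal sketch of an incremental backup.  The demo
tree, the stamp file, the *.tmp exclusion and the archive name are all
made up for illustration; it assumes only find(1) with -depth and
-newer, grep(1), and a cpio that accepts -o and -it, and it skips the
cpio steps entirely if cpio isn't installed.

```shell
#!/bin/sh
# Sketch only: incremental backup by filtering find(1) output into cpio(1).
# Everything named here (demo tree, stamp file, *.tmp exclusion) is
# hypothetical, made up for illustration.
set -eu

listing=""
if command -v cpio >/dev/null 2>&1; then
    work=$(mktemp -d)
    mkdir -p "$work/tree/sub"

    # One file last changed before the previous (pretend) backup ...
    echo old > "$work/tree/old.txt"
    touch -t 202001010000 "$work/tree/old.txt"
    # ... a stamp file recording when that backup ran ...
    touch -t 202101010000 "$work/stamp"
    # ... and files changed since then, one of which we don't want.
    echo new > "$work/tree/sub/new.txt"
    echo junk > "$work/tree/scratch.tmp"

    # find selects what changed since the stamp; grep filters out scratch
    # files; cpio archives exactly the paths it is handed on stdin.
    (cd "$work/tree" &&
        find . -depth -newer "$work/stamp" -print |
        grep -v '\.tmp$' |
        cpio -o > "$work/incr.cpio")

    # List the archive's table of contents.
    listing=$(cpio -it < "$work/incr.cpio")
    rm -rf "$work"
fi
printf '%s\n' "$listing"
```

The point of the design is that cpio itself needs no selection logic at
all: whatever produces the list of paths decides what gets archived.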

So, with cpio, a highly common backup practice was roughly like this,
e.g. for a full system backup:
# cd / && find . -depth -print | cpio -o ...
or, with GNU cpio and find, to work around one cpio limitation:
# cd / && find . -depth -print0 | cpio -o0 ...
And that was all fine and good, and worked quite excellently, until ...
Well, bugs happen.  Along the way, GNU introduced a bug in cpio, and a
quite annoyingly problematic one: the common backup (noted above)
combined with the common restore:
# (cd directory && cpio -idmu < archive_file_or_device)
no longer worked.  GNU cpio doesn't have a "proper" bug
tracking system, but it does have a bug email list and archives thereof.
http://lists.gnu.org/archive/html/bug-cpio/
A bit easier to follow details on that bug (fortunately since fixed)
via Debian's bug tracking system:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=458079
In any case, the bug wasn't caught until it was in production/stable
releases of many Linux distributions.  I'm hoping GNU - and Debian -
have added suitable regression tests, so a bug of that nature never
makes it into production/stable releases again - particularly given how
commonly that backup/restore combination is used with cpio, and how
much it's depended upon (and not expected to break!).
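
For what it's worth, a round-trip check of that backup/restore
combination is easy to sketch.  This is only an illustration of the
kind of regression test meant above, not a reproduction of the specific
bug: the demo tree and file names are made up, and the check is skipped
if cpio isn't installed.

```shell
#!/bin/sh
# Sketch only: archive a small tree with find | cpio -o, restore it with
# cpio -idmu, and compare the result.  All paths here are hypothetical.
set -eu

roundtrip="skipped"
if command -v cpio >/dev/null 2>&1; then
    work=$(mktemp -d)
    mkdir -p "$work/src/a/b" "$work/dst"
    echo hello > "$work/src/a/b/file.txt"
    echo top > "$work/src/top.txt"

    # Backup: -depth lists directory contents before the directory itself.
    (cd "$work/src" && find . -depth -print | cpio -o > "$work/archive.cpio")

    # Restore: -d creates leading directories, -m keeps modification
    # times, -u overwrites unconditionally.
    (cd "$work/dst" && cpio -idmu < "$work/archive.cpio")

    # The restored tree should match the original.
    if diff -r "$work/src" "$work/dst" >/dev/null; then
        roundtrip=ok
    else
        roundtrip=FAILED
    fi
    rm -rf "$work"
fi
echo "cpio round trip: $roundtrip"
```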

Anyway, still bumping into that bug a lot in various Linux
installations (and getting repeatedly annoyed by it), I've come to
quite prefer pax.  It does not have, and never had, that bug :-), is
also much more standard :-), and is not overloaded with nonessential
and unimportant
features (gee, think GNU tar and GNU cpio have enough options yet?
Maybe way too many? - at quick approximate count, and not counting
alternative forms for same option, I find about 147 options for GNU
tar, about 70 for GNU cpio, and about 38 for OpenBSD pax (21 in the
standard itself)).  "Unix philosophy: Write programs that do one thing
and do it well."  I certainly think some programs have gotten "too fat"
for their own good - that not only often leads to bugs, but often
introduces bugs in core program functionality where there were no such
bugs before (like breaking cpio in annoying ways, where it was never
broken before).

Although pax isn't a drop-in replacement for tar or cpio, with suitable
adjustments to options, and perhaps adjustment in how one specifies
what's to be archived, pax makes a fine replacement for tar and cpio -
especially if one was only using quite standard (e.g. POSIX / Single
Unix Specification) options to tar and cpio.  Here's a partial "cheat
sheet" I still tend to refer to when using pax in place of cpio (or
tar), showing at least approximate equivalents:
cpio -it                   pax
cpio -idmu                 pax -r -p e
cpio -0o -H newc           pax -w -0d -x sv4cpio
cpio -0pdmu directory      pax -rw -0d -p e directory
cpio -0pdlmu directory     pax -rw -0dl -p e directory
tar -tf -                  pax
tar -xf - [file ...]       pax -r -p e [pattern ...]
tar -cf - file [...]       pax -w -x ustar file [...]

All the pax(1) options I show above are standard, except -0, which is an
OpenBSD / GNU style extension (which GNU nicely added to find, cpio,
xargs(1), etc., and very nicely solves the problem of dealing with
filenames that contain newlines).
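
To see a few of those equivalents in action, here's a minimal sketch
that sticks to standard options only (so no -0, and ustar rather than
sv4cpio as the archive format).  The directory and archive names are
made up for illustration, and the whole thing is skipped if pax isn't
installed.

```shell
#!/bin/sh
# Sketch only: write, list, read and copy with pax, using standard options.
# The directory names and archive name are hypothetical.
set -eu

pax_result="skipped"
if command -v pax >/dev/null 2>&1; then
    work=$(mktemp -d)
    mkdir -p "$work/src/a/b" "$work/restored" "$work/copied"
    echo hello > "$work/src/a/b/file.txt"
    echo top > "$work/src/top.txt"

    # Write (like find ... | cpio -o): paths come from stdin; -d keeps pax
    # from also descending into the directories find already walked.
    (cd "$work/src" &&
        find . -depth -print | pax -w -d -x ustar > "$work/archive.tar")

    # List (like cpio -it or tar -tf -): no -r or -w means list mode.
    pax < "$work/archive.tar"

    # Read (like cpio -idmu): -p e preserves everything it can.
    (cd "$work/restored" && pax -r -p e < "$work/archive.tar")

    # Copy (like cpio -pdmu directory): both -r and -w, no archive at all.
    (cd "$work/src" &&
        find . -depth -print | pax -rw -d -p e "$work/copied")

    # Both the restored and the copied tree should match the original.
    if diff -r "$work/src" "$work/restored" >/dev/null &&
       diff -r "$work/src" "$work/copied" >/dev/null
    then
        pax_result=ok
    else
        pax_result=FAILED
    fi
    rm -rf "$work"
fi
echo "pax round trip: $pax_result"
```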

pax(1) standards:
The Open Group Base Specifications Issue 7
IEEE Std 1003.1-2008
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html

OpenBSD man page for pax(1):
http://www.openbsd.org/cgi-bin/man.cgi?query=pax



