Most people are probably pretty familiar with creating, sending or receiving Zip files. Zip takes a collection of files and stores them in a Zip archive file, compressing the data in the process.
As well as storing the contents of the files, it also stores all their metadata, which is extra information associated with an object. In the case of files, it includes the modification times, their owners and permissions and, of course, the name of each file.
When you unzip an archive, all of this information is extracted, recreating the original set of files exactly as they were, which is pretty handy!
Archives have several uses, but the most popular include bundling up a set of files for download – a single file is easier to handle and the compression makes it faster to download – and creating backups. As Zip has been around for a long time, everyone can use it and all OSes can handle it. Nonetheless, Zip does have a number of drawbacks.
The main issue is that its compression is poor by modern standards. Over the past 25 years compression technology has moved on and even though there have been improvements to Zip, there are some better alternatives available.
Another disadvantage of Zip is that it is intended for archiving to a file, whereas sometimes we want to send the data to another device or service.
The standard archival program for Unix-like operating systems including Linux and Mac OS X is Tar, so called because Tar was originally used to store backups on tape drives (Tape ARchive).
It works in a different way from Zip: it can write the archived data to its standard output (or straight to a device such as a tape drive) rather than only to a named file, and it doesn't compress the data by default because many tape drives already had hardware compression built in.
The lack of compression code may seem like a disadvantage, but it's actually a convenience. As Tar is able to pipe its data via an external compression program, it can use any compressor it likes – even one that wasn't in existence when the Tar program was developed.
Compression programs work on one file or stream of data and produce one compressed file or stream, so this splits the job into two parts: archival and compression. While this may seem more complex, Tar is perfectly capable of handling the details itself.
Let's say we have a directory called foo. We want to create an archive of it, which is often referred to as a tarball. We can use any one of these commands:
tar cf foo.tar foo
tar czf foo.tar.gz foo
tar cjf foo.tar.bz2 foo
tar cJf foo.tar.xz foo
The c option tells the Tar program that we are creating an archive, while f tells it that we are storing the archive in a file using the given name. Therefore, the first command creates an uncompressed archive that is called foo.tar.
The subsequent commands add an extra option that tells Tar which particular type of compression to use: z uses gzip compression, j uses bzip2 compression and J uses xz compression. (Watch the capitalisation!)
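To see the effect of a compression option, the following sketch builds a throwaway foo directory, archives it with and without gzip, and compares the sizes. The sample file and its contents are made up for illustration, and the script assumes that tar, gzip and mktemp are installed.

```shell
# Work in a temporary directory so nothing real is touched.
work=$(mktemp -d)
cd "$work"
mkdir foo

# Hypothetical sample data: repetitive text compresses well,
# which makes the size difference easy to see.
i=0
while [ "$i" -lt 2000 ]; do
    echo "line $i of some very repetitive sample text"
    i=$((i + 1))
done > foo/sample.txt

tar cf foo.tar foo        # uncompressed archive
tar czf foo.tar.gz foo    # the same archive, compressed with gzip

# tr strips the padding that some versions of wc add.
plain_size=$(wc -c < foo.tar | tr -d ' ')
gzip_size=$(wc -c < foo.tar.gz | tr -d ' ')
echo "foo.tar: $plain_size bytes, foo.tar.gz: $gzip_size bytes"
```

The exact numbers depend on the data and on the Tar and gzip versions, but the compressed tarball should come out much smaller for text like this.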
Tar arguments and options
There are also long versions of these arguments that make the commands more readable, but most of us are lazy and use the version that is shorter to type. However, we could also have used this command line if we wanted to:
tar --create --gzip --file foo.tar.gz foo
The file extension is not required, but it's a convention that makes it easier for people to see exactly what type of archive it is – the system itself needs no such help as it can work all this out on its own. Unpacking an archive is simply a matter of replacing c with x, or --create with --extract. However, you don't need to give the compression type, as Tar figures it out:
tar xf foo.tar.gz
Another option you may want to add is v or --verbose, to show you what Tar is doing.
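As a rough check that packing and unpacking really is a round trip, this sketch archives a small made-up directory, extracts it somewhere else and compares the two trees. The file names are invented for the example. It assumes a Tar that detects the compression type on extraction, as GNU Tar and bsdtar do, and uses their -C option to choose where to unpack.

```shell
# Work in a temporary directory so nothing real is touched.
work=$(mktemp -d)
cd "$work"

# Hypothetical directory tree to archive.
mkdir -p foo/sub
echo "hello" > foo/a.txt
echo "world" > foo/sub/b.txt

tar czf foo.tar.gz foo    # create a gzip-compressed tarball

mkdir restored
# x extracts; no z is given, so tar has to work out the compression itself.
# -C changes into the restored directory before unpacking.
tar xf foo.tar.gz -C restored

# diff -r prints nothing and exits with status 0 when the trees match.
diff -r foo restored/foo && echo "round trip OK"
```

Note that diff only compares file contents here; it says nothing about ownership or timestamps.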
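If you have been given a tarball, you may want to see what is inside it without unpacking it. If you have created an archive, particularly a backup, you may want to check it's correct before relying on it. The t or --list option lists the contents of the archive. Because Tar has to read the whole archive to do so, a truncated or corrupt file will normally produce an error, although this is not a comparison against the original files.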
tar tvf foo.tar.gz
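The sketch below shows both uses of the listing: capturing the member names, and using Tar's exit status as a quick sanity check. The directory and the deliberately truncated broken.tar.gz are invented for the example, and the behaviour described is what GNU Tar and bsdtar are expected to do, not a guarantee for every implementation.

```shell
# Work in a temporary directory so nothing real is touched.
work=$(mktemp -d)
cd "$work"
mkdir foo
echo "data" > foo/a.txt
tar czf foo.tar.gz foo

# t lists the member names without extracting anything.
listing=$(tar tf foo.tar.gz)
echo "$listing"

# Simulate a damaged download by keeping only the first 40 bytes.
head -c 40 foo.tar.gz > broken.tar.gz

# A non-zero exit status from tar means the archive could not be read.
if tar tf broken.tar.gz > /dev/null 2>&1; then
    echo "broken archive was not detected"
else
    echo "broken archive detected"
fi
```

In a backup script, the same exit-status test can decide whether to keep or discard a freshly made tarball.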
Those are the main Tar options, but the program has many more, such as r or --append to add files to an existing uncompressed archive instead of creating a new one, and GNU Tar's A or --concatenate to append the contents of other tar archives to it.
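Adding a file to an existing archive can be sketched as follows, using the r (--append) option that GNU Tar and bsdtar provide. The file names are made up, and the archive is left uncompressed because appending does not work on compressed tarballs.

```shell
# Work in a temporary directory so nothing real is touched.
work=$(mktemp -d)
cd "$work"
mkdir foo
echo "first" > foo/a.txt
echo "late addition" > extra.txt

tar cf foo.tar foo          # create an uncompressed archive
tar rf foo.tar extra.txt    # r appends extra.txt to the end of foo.tar

# The listing should now include extra.txt after the foo entries.
tar tf foo.tar
```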
We mentioned that Tar can handle any new compression format that comes along because it passes compression to another program. There are command line options to do this automatically for gzip, bzip2 and xz, but what if someone comes up with a new compressor?
Say something like sdc - super-duper compressor? You could create an uncompressed tarball and then use sdc to compress it, but that's wasteful and slow. Instead use a pipe:
tar c foo | sdc >foo.tar.sdc
unsdc foo.tar.sdc | tar xv
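Here, we use only the --create option with Tar. With GNU Tar as it is usually built, the lack of a destination causes Tar to send the archive data to standard output, which is then piped to the sdc compressor program. Some other builds default to a tape device instead, in which case adding f - names standard output (or standard input) explicitly. The second command reverses the process, decompressing the archive and sending it to Tar for extraction.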
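Since sdc does not exist, here is a runnable sketch of the same pipeline with gzip standing in for it. The sdc and unsdc shell functions are purely hypothetical wrappers defined for this example; any program that reads standard input and writes standard output could take their place. The archive is named explicitly as f - so the script does not depend on how a particular Tar was built.

```shell
# Work in a temporary directory so nothing real is touched.
work=$(mktemp -d)
cd "$work"
mkdir foo
echo "piped data" > foo/a.txt

# Hypothetical stand-ins for the imaginary compressor, built on gzip.
sdc()   { gzip -c; }          # compress stdin to stdout
unsdc() { gzip -dc "$@"; }    # decompress the named file to stdout

# "f -" tells tar to write the archive to standard output.
tar cf - foo | sdc > foo.tar.sdc

mkdir restored
# "f -" on extraction reads the archive from standard input;
# -C (GNU Tar and bsdtar) unpacks into the restored directory.
unsdc foo.tar.sdc | tar xf - -C restored

diff -r foo restored/foo && echo "pipe round trip OK"
```

Because the archive never touches the disk in uncompressed form, this avoids the wasteful intermediate tarball.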