What Is tar and How Do I Install It?
As per the tar manual (which you can access by typing man tar once it is installed), tar is an archiving utility. It supports many features, including compressing and decompressing files on the fly when archiving them. Let’s get started by installing tar:
To install tar on a Debian/Apt-based Linux distribution (like Ubuntu and Mint), execute the following command in your terminal:
sudo apt install tar
To install tar on a RedHat/Yum-based Linux distribution (like RHEL, CentOS and Fedora), execute the following command in your terminal:
sudo yum install tar
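On many distributions tar is already present, so it can be worth checking before installing anything. A minimal check, assuming nothing beyond a standard shell:

```shell
# Show where tar lives and which version is installed.
# If either command fails, tar is not installed yet.
command -v tar
tar --version
```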
Next, we will create some sample data:
Here we created a directory test, and created six empty files in it by using the touch command. We also added some numbers to files a, e, and b, though notably file b has repetitive data, which will compress well.
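A sketch of commands that would produce sample data like that described above. The exact digits written to the files are assumptions made for illustration, and the scratch directory from mktemp is only there so that nothing existing gets overwritten:

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

mkdir test
cd test
touch a b c d e f              # six empty files
echo 1234567890 > a            # some numbers (assumed content)
echo 9876543210 > e
echo 1111111111111111 > b      # repetitive data, which compresses well
ls -l
```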
If you would like to learn more about how compression works, you can check out our How Does File Compression Work? article.
Creating an Uncompressed Archive
Here we created an uncompressed archive using the tar -hcf all_files.tar * command. Let’s have a look at the options used in this command.
Firstly, we have -h. Though it is not required in this particular case, I highly recommend always including it in your tar commands. This option stands for dereference: tar will dereference (or follow) symlinks, archiving and dumping the files they point to rather than the links themselves.
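To see what -h changes, we can archive the same symlink with and without it. This is a small sketch; the file names real.txt, link.txt, keep_link.tar and follow.tar are made up for the demonstration:

```shell
cd "$(mktemp -d)"
echo 12345 > real.txt
ln -s real.txt link.txt        # link.txt is a symlink to real.txt

tar -cf  keep_link.tar link.txt   # without -h: stores the symlink itself
tar -hcf follow.tar    link.txt   # with -h: stores the data the link points to

# The verbose listing shows an arrow (link.txt -> real.txt) only for
# the archive that stored the symlink.
tar -tvf keep_link.tar
tar -tvf follow.tar
```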
Next we have the -c and -f options. Note that they are simply written together with -h: instead of giving each one its own -, we tag them onto the other shorthand options. Quick and easy.
The -c option stands for create a new archive. Note that by default directories are archived recursively, unless the --no-recursion option is also used. The -f option allows us to specify the name of the archive. It thus has to come last in our option chain (as it requires an argument) so we can add the archive file name directly behind it. Using tar -fch test.tar * will not work, because tar then takes ch as the archive name and is left without an operation to perform.
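The effect of the option order can be tried out safely in a scratch directory. In this sketch the two files a and b stand in for the sample data:

```shell
cd "$(mktemp -d)"
touch a b

# Works: -f is last in the chain, so test.tar becomes its argument.
tar -hcf test.tar a b

# Fails: the letters after f ("ch") are taken as the archive name,
# so no operation such as -c is left and tar exits with an error.
tar -fch test.tar a b || echo "tar rejected the -fch ordering"
```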
After the tar file is generated, we use a modified ls output which clearly shows us the number of bytes per file. As you can see, the tar file is much larger than all of our files combined. The files are simply being archived, not compressed, and some overall overhead for tar is being added.
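The size difference is easy to reproduce. This sketch uses wc -c in place of the modified ls output, and the file contents are again assumed. The overhead comes from tar storing a 512-byte header per file and padding the data to 512-byte blocks; GNU tar also pads the whole archive to a multiple of 10240 bytes by default.

```shell
cd "$(mktemp -d)"
touch a b c d e f
echo 1234567890 > a; echo 9876543210 > e; echo 1111111111111111 > b

# The shell expands * before tar runs, so all_files.tar is not in the list yet.
tar -hcf all_files.tar *

wc -c a b c d e f        # per-file byte counts, with a total on the last line
wc -c all_files.tar      # size of the uncompressed archive
```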
As an interesting side note, we can also see what types of files we are working with by simply using the file command at the command prompt:
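A minimal sketch of such a check, with assumed file contents. The file utility is a separate package on some minimal systems, hence the guard:

```shell
cd "$(mktemp -d)"
touch a c
echo 1234567890 > a
tar -hcf all_files.tar a c

# file inspects the content of each file, not its name or extension.
if command -v file >/dev/null; then
    file a c all_files.tar
fi
```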
Creating a Compressed Archive
A very common compression algorithm is GZIP. Let’s add the corresponding option (-z) to our chain of shorthand command line options and see how this affects the file size:
This time we specified a pattern that matches only the files named a to f, preventing the tar command from including the all_files.tar file inside the new all_files.tar.gz file! Strictly speaking, such a pattern is expanded by the shell as a glob rather than a regular expression, though it borrows the bracket syntax regular expressions use.
See How Do You Actually Use Regex? and Modify Text Using Regular Expressions using sed if you would like to learn more about regular expressions.
We also included the -z option, which uses GZIP compression to compress the resulting .tar data as it is written. It is great to see that we end up with a 186 byte file, which tells us that, in this case, the tar header/overhead of about 10 KB can be compressed very well.
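Put together, the compressed variant might look like the sketch below. The pattern [a-f] and the file contents are assumptions based on the description above, and the exact byte counts will differ between systems:

```shell
cd "$(mktemp -d)"
touch a b c d e f
echo 1234567890 > a; echo 9876543210 > e; echo 1111111111111111 > b
tar -hcf all_files.tar *

# [a-f] matches only the one-letter names a to f, so all_files.tar is left out.
tar -hczf all_files.tar.gz [a-f]

wc -c all_files.tar all_files.tar.gz   # compare the two archive sizes
tar -tzf all_files.tar.gz              # list what ended up in the archive
```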
The total size of the archive is 7.44 times larger than the total file size, but this matters little: our contrived example is not representative of compressing large files, where gains instead of losses are almost always seen unless the data was pre-compressed or is in a format that cannot easily be condensed by any of the algorithms. Still, one algorithm (like GZIP) may be better than another (like BZIP2), and vice versa, for different data sets.
Gaining More Bytes Using High Level Compression
Can we make the file even smaller? Yes. We can set the maximum compression level of GZIP by using the -I option to tar, which lets us specify a compression program to use (with thanks to Stack Overflow user ideasman42):
Here we specified -I 'gzip -9' as the compression program to use, and we dropped the -z option (as we are now specifying a custom program instead of using the built-in tar GZIP configuration). The result is that we save 12 bytes thanks to a better (but generally slower) compression attempt (at level -9) by GZIP.
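A sketch of the two variants side by side, assuming GNU tar (other tar implementations may give -I a different meaning). The output name all_files_9.tar.gz is made up so that both archives can be compared:

```shell
cd "$(mktemp -d)"
touch a b c d e f
echo 1234567890 > a; echo 9876543210 > e; echo 1111111111111111 > b

tar -hczf all_files.tar.gz [a-f]                  # built-in gzip, default level
tar -I 'gzip -9' -hcf all_files_9.tar.gz [a-f]    # gzip at maximum level via -I

wc -c all_files.tar.gz all_files_9.tar.gz
```

Note the plain single quotes around gzip -9: they make the shell pass the program and its flag to tar as one argument.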
Generally speaking, the faster the compression (lower compression level, i.e. -1), the larger the file. And the slower the compression (higher compression level, i.e. -9), the smaller the file. You can set your own preference by varying the compression level from -1 (fast) to -9 (slow).
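To get a feel for the trade-off, we can loop over a few levels. This sketch assumes GNU tar and uses a larger, made-up sample file (numbers.txt, generated with seq), since differences are hard to see on a handful of bytes:

```shell
cd "$(mktemp -d)"
seq 1 20000 > numbers.txt      # assumed sample data, not from the article

# Create one archive per compression level and compare the sizes.
for level in 1 6 9; do
    tar -I "gzip -$level" -hcf "numbers_$level.tar.gz" numbers.txt
done
wc -c numbers_*.tar.gz
```

Prefixing the tar command with time would show the other half of the trade-off, the time each level takes.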
Other Compression Programs
There are two other common compression algorithms which one may explore and test (different algorithms give different sizing outcomes, and may have additional compression options): bzip2, which can be used by specifying the -j option to tar, and XZ, which can be used by specifying the -J option.
Alternatively, you can use the -I option to set maximum compression options for bzip2 (-9):
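A sketch of both options, with assumed file contents. Each tar command runs only if the matching compression program is installed, since tar relies on it:

```shell
cd "$(mktemp -d)"
touch a b c d e f
echo 1234567890 > a; echo 9876543210 > e; echo 1111111111111111 > b

command -v bzip2 >/dev/null && tar -hcjf all_files.tar.bz2 [a-f]   # -j: bzip2
command -v xz    >/dev/null && tar -hcJf all_files.tar.xz  [a-f]   # -J: xz
ls -l
```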
And -9e for xz:
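Both commands might look like the sketch below, assuming GNU tar and assumed file contents. The -e flag asks xz for its slower "extreme" variant of the chosen level:

```shell
cd "$(mktemp -d)"
touch a b c d e f
echo 1234567890 > a; echo 9876543210 > e; echo 1111111111111111 > b

# Run each variant only if the compression program is available.
command -v bzip2 >/dev/null && tar -I 'bzip2 -9' -hcf all_files.tar.bz2 [a-f]
command -v xz    >/dev/null && tar -I 'xz -9e'   -hcf all_files.tar.xz  [a-f]
ls -l
```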
As you can see, the results are worse in this case than with the somewhat standard GZIP algorithm. Still, the bzip2 and xz algorithms may show improvements with other data sets.
Decompressing a File
Decompressing a file is super easy, whatever method was originally used to compress it, provided that the matching compression program is present on your computer. For example, if the original compression algorithm was bzip2 (indicated by a .bz2 extension to the tar filename), then you will want to have run sudo apt install bzip2 (or sudo yum install bzip2) on the target computer which is to decompress the file.
We simply specify -x to extract (expand and decompress) our all_files.tar.gz file, and indicate what the filename is by again using the -f shorthand option as before.
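A sketch of the full round trip, assuming GNU tar (which detects the compression type on its own when reading an archive). The restored directory and the -C option, which tells tar where to extract, are additions made for this example so the original files are not overwritten:

```shell
cd "$(mktemp -d)"
touch a b c d e f
echo 1234567890 > a
tar -hczf all_files.tar.gz [a-f]

tar -tf all_files.tar.gz               # list the contents without extracting
mkdir restored
tar -xf all_files.tar.gz -C restored   # extract into restored/
ls -l restored
```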
Compressing files can help you save a lot of room on your storage devices, and knowing how to use tar in combination with available compression options will help you to do so. Once the archive needs to be extracted again, it is easy to do so provided the correct decompression software is available on the computer used to decompress or extract the data from your archive. Enjoy!