Everyone downloads files, whether it's an ISO image of the latest Linux distribution, TuxRadar podcast or a PDF tutorial.

But despite this age of browser security, anti-malware software and sophisticated intrusion detection it's not always possible to ensure that files haven't been tampered with in transit (or even on the server itself).

In this tutorial we'll show you how to ensure your files have downloaded correctly and securely.

An MD5 checksum is a 128-bit value computed from a file or string, usually written as 32 hexadecimal digits (the numerals 0-9 and the letters a-f). This is referred to as a hash value, and for the same input it comes out exactly the same on any PC, however many times it is generated.

However, if you make even the slightest change to the source file or string, the hash value generated is completely different. This means that if you distribute a file it will generate the same checksum on any PC as long as it remains unaltered.

For example, if you type into a terminal:

echo -n 'hello world' | md5sum -

you'll get the following output:

5eb63bbbe01eeed093cb22bb8f5acdc3 -

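This output will be produced in a terminal window on any PC on which you care to run the first line of code. The command itself pipes the text into the md5sum command, which then generates a hash value for it. The -n stops echo adding a trailing newline (which would otherwise be hashed too), and the '-' tells md5sum to read from standard input. However, if we change the value in any way, like below: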

echo -n 'Hello world' | md5sum -

we get a different result:

3e25960a79dbc69b674cd4ec67a72c62 -
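If you would rather compare the two results in a script than by eye, something like the following sketch would do. The variable names are our own, and it assumes GNU md5sum and cut are installed; printf '%s' is used here as a portable stand-in for echo -n.

```shell
# Keep only the hash, which is the first field of md5sum's output.
lower=$(printf '%s' 'hello world' | md5sum | cut -d' ' -f1)
upper=$(printf '%s' 'Hello world' | md5sum | cut -d' ' -f1)

echo "hello world: $lower"
echo "Hello world: $upper"

if [ "$lower" != "$upper" ]; then
    echo "One changed letter, two unrelated-looking hashes"
fi

# Without -n, echo appends a newline, and that extra byte changes the hash too.
with_newline=$(echo 'hello world' | md5sum | cut -d' ' -f1)
echo "with newline: $with_newline"
```

The last two lines show why the -n matters: the newline is part of the input as far as md5sum is concerned.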

But why isn't MD5 used for everything? Back in the 90s researchers found weaknesses in the algorithm, and in 2004 they demonstrated practical collisions: two different inputs that produce the same hash value. An accidental collision remains extremely rare, but because an attacker can construct colliding files deliberately, MD5 has been succeeded by SHA-2 for security-critical applications.
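For those security-critical cases, GNU Core Utilities also ships sha256sum, a SHA-2 tool that is used in the same way as md5sum. A minimal sketch, assuming sha256sum is installed (the variable name is ours):

```shell
# sha256sum takes the same arguments as md5sum but prints a 256-bit hash,
# written as 64 hexadecimal digits instead of 32.
sha=$(printf '%s' 'hello world' | sha256sum | cut -d' ' -f1)
echo "$sha"
echo "${#sha} hex digits"
```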

Despite this, MD5 is up to the task of simply checking that a file has downloaded correctly. If any part of the file is missing or corrupted, we will see that the value is different from what we expect, and we can react accordingly.
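As a rough sketch of that check, the snippet below compares a file's hash against the value a publisher might list on a download page. The file name download.iso and the scratch directory are made up for illustration, and the expected value is simply the 'hello world' hash from earlier, since that is what we write into the stand-in file.

```shell
# Stand-in for a downloaded file, created in a scratch directory.
workdir=$(mktemp -d)
printf '%s' 'hello world' > "$workdir/download.iso"

# The hash the publisher lists (hypothetical; matches the stand-in content).
expected=5eb63bbbe01eeed093cb22bb8f5acdc3

# The hash of the file we actually have.
actual=$(md5sum "$workdir/download.iso" | cut -d' ' -f1)

if [ "$actual" = "$expected" ]; then
    result="download OK"
else
    result="checksum mismatch, download again"
fi
echo "$result"
```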

Md5sum is part of the GNU Core Utilities package, so it should be installed by default on almost every Linux distribution.

Hashing files

Md5sum is our MD5 hash value generator, but it isn't just restricted to strings! We can generate a checksum for a file by typing:

md5sum file.txt

Assuming you have a file named file.txt in the current directory, md5sum will generate an MD5 hash value for it. If you change the filename and run the MD5 checksum again you will notice something odd: the MD5 is exactly the same!

On the other hand, if you now change the content within the text file and run the command you will see the hash value change entirely. The reason for both of these results is that the algorithm evaluates content rather than the filename, which means you can move and rename the file as much as you like while still ensuring that the content within the file is unaltered from its original form.
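You can watch both effects in one go with a short experiment like this one. It is only a sketch: it works in a scratch directory created by mktemp, and the file contents are invented.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'some text\n' > file.txt
before=$(md5sum file.txt | cut -d' ' -f1)

# Renaming leaves the content, and so the hash, untouched.
mv file.txt renamed.txt
after_rename=$(md5sum renamed.txt | cut -d' ' -f1)

# Appending a single line changes the hash completely.
printf 'one more line\n' >> renamed.txt
after_edit=$(md5sum renamed.txt | cut -d' ' -f1)

echo "original:     $before"
echo "after rename: $after_rename"
echo "after edit:   $after_edit"
```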

We can also verify that two seemingly duplicate files are the same. Traditionally you would use a tool such as diff, which states each difference between the two files; md5sum instead prints one MD5 hash per file for us to compare manually.

md5sum doc1.odt doc2.odt

If there is even a subtle change between the two files then the hash values will be totally different and obvious even at a cursory glance. Notice also how we can compare more than just plain text files!
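To let the shell do the comparing for us, a sketch along these lines would work. The two files here are plain-text stand-ins that merely borrow the .odt names, created in a scratch directory.

```shell
workdir=$(mktemp -d)
printf 'same content\n' > "$workdir/doc1.odt"
cp "$workdir/doc1.odt" "$workdir/doc2.odt"

# Hash each file and keep only the hash field.
hash1=$(md5sum "$workdir/doc1.odt" | cut -d' ' -f1)
hash2=$(md5sum "$workdir/doc2.odt" | cut -d' ' -f1)

if [ "$hash1" = "$hash2" ]; then
    verdict="identical"
else
    verdict="different"
fi
echo "doc1.odt and doc2.odt are $verdict"
```

If both files are on the same machine, cmp -s doc1.odt doc2.odt gives the same yes-or-no answer through its exit status without hashing anything; the hashes earn their keep when the two copies live on different machines.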

Now that we've generated some MD5 checksums for strings and files we need a way to store hash values so that we can automate checks and share the hashes with other people. This way people can verify that the file they have received is identical to the one you hashed.

To output hashes to a text file, we type the following into a terminal:

md5sum a.txt b.txt > md5sums.md5

This line generates two MD5 hash values, one for a.txt and another for b.txt. It then redirects these values (with filenames) to a plain text file named md5sums.md5. To then validate that a.txt and b.txt haven't changed since you produced this checksum, we type the following into a terminal:

md5sum -c md5sums.md5

You should receive two lines of output, each with a filename followed by 'OK'. Note however that you can no longer change the names of the files you have generated hashes for, as md5sum relies on the names stored in the checksum file to find each file and verify its integrity (if you do rename one, md5sum will complain 'No such file or directory' and report that file as 'FAILED open or read').
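Here is a sketch of the whole round trip, including what happens when a file changes after it was hashed. It runs in a scratch directory with invented file contents, and captures md5sum's exit status so that a script can act on the result.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'first file\n'  > a.txt
printf 'second file\n' > b.txt
md5sum a.txt b.txt > md5sums.md5

# Unchanged files: every line reports OK and the exit status is 0.
md5sum -c md5sums.md5
first_status=$?

# Alter one file, then check again. b.txt is now reported as FAILED
# and md5sum exits with a non-zero status.
printf 'tampered\n' >> b.txt
md5sum -c md5sums.md5
second_status=$?

echo "before change: $first_status, after change: $second_status"
```

In GNU md5sum, adding --quiet to the -c check prints only the files that fail, and --status prints nothing at all and leaves just the exit status, which is handy in scripts.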

This is also the case if you want to verify files in an entire directory. To do this, type the following line into a terminal:

find dir_folder -type f -print0 | xargs -0 md5sum >> checksum.md5

This recursively hashes all the files in the folder named dir_folder and appends the output in plain text to checksum.md5. Because >> adds to the end of an existing file, delete any old checksum.md5 before re-running the command. You would then run md5sum -c checksum.md5, as in the previous snippet, to verify that each file has remained unaltered since you last hashed it.

Because find walks the entire directory tree, any files in subfolders will also be hashed, and the -print0 and -0 options (available so long as you have a recent copy of GNU findutils, which provides find and xargs) keep filenames containing spaces intact. This line isn't particularly easy to remember, so avid coders could write two shell scripts: one that changes into each directory and (using a much simpler command) hashes each file, and another that does much the same but verifies the files instead.
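As one possible take on that idea, the sketch below wraps the find command in a pair of shell functions rather than changing into each directory in turn. The function names hash_tree and verify_tree are our own invention, as are the example files; it assumes GNU find, xargs and md5sum.

```shell
# hash_tree DIR OUTFILE: record a checksum for every file below DIR.
hash_tree() {
    # '>' overwrites OUTFILE, so re-running does not pile up duplicate entries.
    find "$1" -type f -print0 | xargs -0 md5sum > "$2"
}

# verify_tree CHECKSUMFILE: re-check every listed file, printing only failures.
verify_tree() {
    md5sum -c --quiet "$1"
}

# Example run in a scratch directory with made-up file names.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p dir_folder/sub
printf 'top\n'    > dir_folder/top.txt
printf 'nested\n' > dir_folder/sub/nested.txt

hash_tree dir_folder checksum.md5
verify_tree checksum.md5 && tree_state="unchanged"
echo "$tree_state"
```

The paths stored in checksum.md5 are relative, so verify_tree has to be run from the same directory as hash_tree was, and the checksum file should live outside the folder being hashed so that it does not end up in its own list.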

Another solution would be to install md5deep from your package manager and type the following into a terminal:

md5deep -rl dir_folder > checksum.md5

This tool does all the recursive action for you: -r works through the entire directory tree for every file and -l records relative file paths, and the redirect then writes all the MD5 hash values and file locations into checksum.md5. Unlike md5sum, md5deep is not bundled with GNU Core Utils so this is not a standard fix. However, it's much easier to remember!

Sharing checksums

How you share your checksums is wholly dependent on how you intend to use them. If you're hosting a file on a dedicated server or free host and want to allow users to ensure their file has downloaded completely and that it is from the original source, you can bundle the checksum in the same directory.

If anyone tries to offer the file in an altered form on another server or the user doesn't download the file exactly as it was uploaded then the checksums won't match. The user can then either try downloading it again or from another mirror.

Email integrity

Another use would be when you are sending an email. You could produce a checksum for the message text and send it separately to the recipient. If the email has been tampered with or you have managed to send the wrong email by accident (which is easily done) then the recipient will know and can inform you accordingly.
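A sketch of how that exchange might look is shown below. The file names message.txt and received.txt are hypothetical stand-ins for the text as sent and as received, and the message itself is invented.

```shell
workdir=$(mktemp -d)

# Sender: hash the exact text of the message body.
printf 'Meet at 10am on Friday.\n' > "$workdir/message.txt"
sent_hash=$(md5sum < "$workdir/message.txt" | cut -d' ' -f1)

# Recipient: save the text that arrived and hash it the same way.
printf 'Meet at 11am on Friday.\n' > "$workdir/received.txt"
received_hash=$(md5sum < "$workdir/received.txt" | cut -d' ' -f1)

if [ "$sent_hash" = "$received_hash" ]; then
    status="message matches"
else
    status="message differs from what was sent"
fi
echo "$status"
```

Bear in mind that any change at all alters the hash, including different line endings or rewrapped lines, so both sides need to hash exactly the same bytes for the comparison to mean anything.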

The applications and possibilities for MD5 hashes are endless, so experiment at will and make sure you drop us an email to say how you got on! Now that you know how to generate, use and share MD5 hashes you can check that all your files and downloads are unaltered and as their original author intended.

MD5 checksums can be applied to verify files and strings in any situation, so with this flexible tool you need never wonder about the integrity of your source again.


First published in Linux Format Issue 123
