Data Integrity: MD5/SHA1 are Your Best Friends!

Yesterday, I ran into a very strange situation that I would like to share, because it shows the importance of “integrity” in information security. Wikipedia defines data integrity as follows:

Data Integrity in its broadest meaning refers to the trustworthiness of system resources over their entire life cycle.

The “entire life cycle” is very important in this case. I had to upgrade the firmware of an appliance manufactured by “A”. I visited (over HTTPS) the support website of “A”, went to the download section and grabbed the files needed to perform the maintenance. The website provided MD5 hashes for all the files. Good practice! Once the files were transferred to my laptop, md5sum reported the same hashes. My files were ready!

Just a small reminder for those who don’t know what a hashing algorithm is. From a variable amount of data, a hashing algorithm computes a fixed-size message digest. Well-known algorithms are MD5, SHA1 and SHA256 (HMAC, often listed alongside them, is a keyed construction built on top of a hash, not a hash algorithm itself). In practice, the generated message digest identifies the original data: two different files are extremely unlikely to produce the same digest by accident. For example, almost all operating systems have tools to compute the MD5 or SHA1 digest of files:

  $ md5sum /tmp/file.txt
  451024bdf01d5d4f64567bea70c402be  /tmp/file.txt
  $ sha1sum /tmp/file.txt
  93c6c7c22e0846ca1944f76ceb6981a2f49ce70e  /tmp/file.txt

This is a common way to check the integrity of files distributed online. Hashes are published on the original website. You perform the same operation on your local files; if the message digests match, you can treat the files as identical!
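
The same check can be scripted. Here is a minimal Python sketch, assuming the vendor publishes a hex digest next to each file; the helper names `file_digest` and `matches_published_hash` are only illustrative, not part of any vendor tooling:

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks.

    Reading in chunks keeps memory usage flat, even for a large firmware image.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex, algorithm="md5"):
    """Compare a local file's digest with the value published on the website."""
    # Published hashes are sometimes upper case or padded with whitespace.
    return file_digest(path, algorithm) == published_hex.strip().lower()
```

Calling `matches_published_hash("/tmp/file.txt", "451024bdf01d5d4f64567bea70c402be")` would then play the role of the manual md5sum comparison shown above.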

Once at the customer premises, another good security practice: I was not authorized to connect my laptop to their management network. I simply copied the files to a clean (read: safe, scanned) USB stick to transfer them to the management workstation. Finally, I uploaded the files to the appliance and launched the upgrade procedure. After many coffees, the device was still decompressing the firmware (a 670MB archive). Strange. I decided to investigate…

I checked the USB stick: the firmware file looked OK, I could read it, and even the file size was the same as the original. I computed the MD5 hash of the file directly from the USB stick and… it was not the same! The file had been corrupted during the transfer from my laptop to the USB stick!? No error message was displayed during the copy operation, the stick was properly unmounted, and no USB/SCSI errors were reported by my laptop’s kernel. I’m still wondering what happened!

Fortunately, the second attempt to upgrade the appliance was successful. What are the lessons learned from this story?

  • Integrity is a key element in information security (That’s the “I” in the CIA triad)
  • MD5/SHA1 hashes are a common way to verify the integrity of files downloaded from public resources. They must be checked not only when receiving the data from the source but throughout the complete data life cycle: transfer, storage and retrieval. (This is what I omitted to do in this story; shame on me!)
  • Data integrity can be compromised by multiple factors:
    • Security threats (ex: a virus)
    • Human errors
    • Physical factors (ex: a bad sector on a disk)
    • Software bugs
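
Re-checking the hash after every hop is easy to automate. The sketch below is only one possible approach: the function names are made up for illustration, and SHA256 is my own choice here (any digest works for comparing two local copies). It copies a file, asks the OS to flush it to the device, then hashes the copy again:

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy a file, then re-hash the copy and compare it with the source."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    # Ask the OS to push the written data to the device before reading it back.
    with open(destination, "rb+") as handle:
        os.fsync(handle.fileno())
    actual = sha256_of(destination)
    if actual != expected:
        raise IOError(f"digest mismatch after copy: {expected} != {actual}")
    return actual
```

One caveat: the read-back may still be served from the operating system's cache rather than from the stick itself, so for removable media the most convincing check is to unmount, plug the stick in again and compute the hash once more.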

If I failed (and we learn from our mistakes) to check the integrity of the files from A to Z, the vendor “A” also failed somewhere:

  • The process that decompresses the firmware image did not report a problem with the file and crashed silently, leaving the web console with a time counter running.
  • Some vendors still fail to implement integrity checks on the firmware they have to process. Distributed files are simply not signed, which means they can be altered and injected into the device (MitM attack). Even without signatures, solutions exist to validate a file from a consistency point of view (using CRC, or “Cyclic Redundancy Checks”).
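
To give an idea of what such a consistency check looks like, here is a small sketch based on the CRC-32 from Python's standard `zlib` module. The function name is hypothetical and this is not how vendor “A” (or any specific vendor) does it. Keep in mind that a CRC only catches accidental corruption; it offers no protection against a deliberately altered file, which is what signatures are for.

```python
import zlib


def crc32_of(path, chunk_size=1024 * 1024):
    """Return the CRC-32 of a file as an unsigned 32-bit integer."""
    checksum = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            # Passing the running value lets zlib continue across chunks.
            checksum = zlib.crc32(chunk, checksum)
    return checksum & 0xFFFFFFFF
```

A device could store the expected CRC next to the image and refuse to start decompressing when the value computed on the uploaded file differs, instead of failing silently.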

Keep this in mind and stay safe!

Note: MD5 has been considered broken for a while now. It has been proven vulnerable to collision attacks. But it is still widely used to check the integrity of downloaded files.





  1. “Note: For a while, MD5 is considered as broken”: indeed, and SHA1 is not in a very good position either.
    So, the rule is:
    – if you want to check against *accidental* corruption, MD5 is ok
    – if you want to check against *malicious* corruption, SHA1 is the bare minimum, SHA256 (cf. sha256sum) is better, or if it is not available, use SHA1 and MD5 jointly.
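
    As an illustration of the “jointly” option, here is a minimal Python sketch that computes several digests in a single pass over the file. The function name `multi_digest` is made up for this example:

```python
import hashlib


def multi_digest(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1024 * 1024):
    """Return a dict mapping each algorithm name to the file's hex digest.

    The file is read only once; every chunk is fed to all hash objects.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

    A file would then be accepted only if every digest matches the corresponding published value.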

  2. Do you use ECC memory? In any case, it may be worth flushing the disk buffers before calculating the checksum, to force the file to be read from disk. There was a nice blog post about a file being corrupted in RAM; unfortunately I can’t find it now.

    The good news for Linux users is that btrfs will have checksum support.

  3. That’s why I like ZFS so much. It uses checksums to protect against “silent” data corruption like what you experienced, and it has saved me in production many times (I’m looking at you, Amazon). It’s disappointing other filesystems have not followed suit.
