Silent data loss exposed


I was moving a large number of image files around and decided to compare checksums after putting them in their new home.

Ouf of several thousand files, about 80GB of data, I found that seventeen of them had a checksum mismatch.

Running md5sum manually on each of those was showing a correct checksum, well, up until the sixth file and then I found this:

$ md5sum DSC_2624.NEF
94fc8d3cdea3b0f3479fa255f7634b5b  DSC_2624.NEF
$ md5sum DSC_2624.NEF
25cf4469f44ae5e5d6a13c8e2fb220bf  DSC_2624.NEF
$ md5sum DSC_2624.NEF
03a68230b2c6d29a9888d2358ed8e225  DSC_2624.NEF

Yes, each time I run md5sum on the same file it gives a different result. Out of the seventeen files, I found one other displaying the same problem and the others gave correct checksums when I ran md5sum manually. Definitely not a healthy disk, or is it?

This is the reason why checksumming filesystems like Btrfs are so important.

There are no errors or warnings in the logs on the system with this disk. Silent data loss at its best.

Is the disk to blame though?

It may be tempting to think this is a disk fault, most people have seen faulty disks at some time or another. In the old days you could often hear them too. There is another possible explanation though: memory corruption. The data read from disk is normally cached in RAM and if the RAM is corrupt, the cache would return bad data.

I dropped the read cache:

# echo 3 > /proc/sys/vm/drop_caches 

and tried md5sum again and observed the file checksum is now correct.

It would appear the md5sum command had been operating on data in the page cache and the root cause of the problem is memory corruption. Time to run a memory test and then replace the RAM in the machine.