Is Your Data Safe in Hadoop?
Hadoop’s HDFS can lose data!
What? You didn’t know?
We didn’t know for sure either, although we’d heard stories – so it didn’t come as a huge shock. Our attention was drawn to this phenomenon by MapR in a recent briefing. Naturally, MapR draws attention to this because it doesn’t use HDFS. It had the foresight to employ its own file system, MapR-FS, rather than HDFS in its implementation of Hadoop.
There’s a study that was done by the Department of Computer Science, North Carolina State University, by Peipei Wang, Daniel J. Dean and Xiaohui Gu entitled Understanding Real World Data Corruptions in Cloud Systems (click on the link to download – you’ll go right to it.) What this team of researchers describe in their paper is a study of 138 real-world data corruption incidents reported in the Hadoop bug repositories of a live, cloud-based Hadoop system.
If you want the nitty gritty, read the paper. We recommend it. Even if you know Hadoop well, you may find it educational. Perhaps the most interesting aspect of Hadoop data corruption is illustrated by the pie chart below.
What the study is telling here is that a mere 25% of the data corruption incidents were detected and reported correctly (the correct corruption detection piece of the pie chart). 12% of the incidents reported were false positives (the misreported corruption slice). The 21% imprecise corruption detection refer to situations where the error report does not clearly identify the One example: a “ChecksumException” could be generated, indicating data corruption but not reporting whether the corruption is in the block or the metadata file.
Finally, there is silent data corruption – corruption that goes unreported but comes to light later. HDFS provides two main methods for detecting data corruption: checksum verification and block scanning. Checksum verification works by calculating checksums and updating a checksum value in the metadata file when a data block is updated. If data corruption affects stored data and an operation is performed on it, checksum verification spots it at once and reports it. The other method is to periodically run a block scanner, which will detect data corruption without operating on it. This unearths silent data corruption.
The causes of data corruption that the report discovered were: Improper runtime checking, Race conditions, Inconsistent state, Improper network failure handling, Improper node crash handling, Incorrect name, Lib/command errors, Compression-related errors and Incorrect data movement.
If HDFS data corruption comes as a bit of surprise to you, you’re not alone. However, if you consider Hadoop’s origins, perhaps it shouldn’t be so surprising. If you have a vast data lake full of compressed website information and the only transactions you run against it are searches, then if there’s a smidgeon of data corruption in the mix, it’s unlikely to affect the results. The same might be true for some analytics algorithms running on vast collections of data.
Nevertheless, I cannot help but feel uneasy about it. Call me old-fashioned, but I’m a fan of data integrity.