Data Corruption - Countermeasures

Countermeasures

When data corruption behaves as a Poisson process, where each bit of data has an independently low probability of being changed, data corruption can generally be detected by the use of checksums, and can often be corrected by the use of error correcting codes.

If an uncorrectable data corruption is detected, procedures such as automatic retransmission or restoration from backups can be applied. Certain levels of RAID disk arrays have the ability to store and evaluate parity bits for data across a set of hard disks and can reconstruct corrupted data upon the failure of a single or multiple disks, depending on the level of RAID implemented.

Today, many errors are detected and corrected by the disk drive using the ECC/CRC codes which are stored on disk for each sector. If the disk drive detects multiple read errors on a sector it may make a copy of the failing sector on another part of the disk- remapping the failed sector of the disk to a spare sector without the involvement of the operating system (though this may be delayed until the next write to the sector).

This "silent correction" can lead to other problems if disk storage is not managed well, as the disk drive will continue to remap sectors until it runs out of spares, at which time the temporary correctable errors can turn into permanent ones as the disk drive deteriorates. S.M.A.R.T. provides a standardized way of monitoring the health of a disk drive, and there are tools available for most operating systems to automatically check the disk drive for impending failures by watching for deteriorating SMART parameters.

"Data scrubbing" is another method to reduce the likelihood of data corruption, as disk errors are caught and recovered from, before multiple errors accumulate and overwhelm the number of parity bits. Instead of parity being checked on each read, the parity is checked during a regular scan of the disk, often done as a low priority background process. Note that the "data scrubbing" operation activates a parity check. If a user simply runs a normal program that reads data from the disk, then the parity would not be checked unless parity-check-on-read was both supported and enabled on the disk subsystem.

If appropriate mechanisms are employed to detect and remedy data corruption, data integrity can be maintained. This is particularly important in commercial applications (e.g. banking), where an undetected error could either corrupt a database index or change data to drastically affect an account balance, and in the use of encrypted or compressed data, where a small error can make an extensive dataset unusable. It is worth noting that while the study by CERN has been often referenced as showing large levels of data corruption, the disk subsystem which was the subject of the paper was set up with RAID5 and a single parity bit (hence could not recover from a single "silent" error), did not use parity-check-on-read (and hence could not detect "silent errors" through parity checking of the RAID stripe), and did not use data scrubbing. The disk storage was also subject to a microcode software bug which caused higher levels of errors than normal.

Read more about this topic:  Data Corruption