Data Corruption - Storage

Storage

Data loss during storage has two broad causes: hardware and software failure. Background radiation, head crashes, and aging or wear of the storage device fall into the former category, while software failure typically occurs due to bugs in the code.

Error detection and correction may occur in the hardware, in the disk subsystem or adapter, or in software that implements error checking and correction (e.g., RAID software such as mdadm for Linux).

Today, most errors are detected and corrected by the disk drive itself, sometimes by remapping a failed sector to a spare sector without the involvement of the operating system. This "silent correction" can lead to other problems if disk storage is not managed well: the drive will continue to remap sectors until it runs out of spares, at which point temporary, correctable errors can turn into permanent ones as the drive deteriorates. S.M.A.R.T. provides a standardized way of monitoring the health of a disk drive, and tools are available for most operating systems to check a drive automatically for impending failure by watching for deteriorating S.M.A.R.T. parameters.
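As a rough illustration of "watching for deteriorating parameters", the sketch below compares two snapshots of attribute readings and flags counters that have grown. It assumes the column layout printed by `smartctl -A` from smartmontools; the sample values, the function names and the choice of watched attributes are invented for the example, and a real monitoring tool would apply more rules than this.

```python
# Sample text in the column layout of `smartctl -A`; the numbers are made up.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       24
  9 Power_On_Hours          0x0032   071   071   000    Old_age   Always       -       25731
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

def parse_attributes(text):
    """Return {attribute_name: (normalized_value, threshold, raw_value)}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs[fields[1]] = (int(fields[3]), int(fields[5]), int(fields[9]))
        except ValueError:
            continue  # skip rows whose raw value is not a plain integer
    return attrs

# Hypothetical watch list: counters related to sector remapping.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

def deteriorating(previous, current, watched=WATCHED):
    """List watched attributes whose raw count grew or whose value hit its threshold."""
    flagged = []
    for name in watched:
        if name not in previous or name not in current:
            continue
        value, thresh, raw = current[name]
        grew = raw > previous[name][2]
        below = thresh > 0 and value <= thresh
        if grew or below:
            flagged.append(name)
    return flagged

current = parse_attributes(SAMPLE)
earlier = dict(current)
earlier["Reallocated_Sector_Ct"] = (100, 10, 3)  # older snapshot: fewer remaps
print(deteriorating(earlier, current))
```

A growing reallocated-sector count is the situation described above: the drive is still hiding errors by remapping, but its pool of spare sectors is being used up.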

In a RAID setup with single redundancy (e.g. RAID 1 mirroring or RAID 5 single parity), a single detected error can be corrected by the RAID controller or software. If the error is "undetected", however, the RAID algorithm for a single parity drive cannot determine which drive is at fault; it can only determine that the data does not match its parity/checksum, and will therefore simply recompute the parity/checksum information from the data as read.
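The difference between the two cases can be seen in a small XOR-parity sketch. The block contents and the helper name `xor_blocks` are made up for illustration; real arrays work on much larger strips, but the arithmetic is the same.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data drives
parity = xor_blocks(data)            # one parity drive

# Detected error: drive 1 reports a failure, so its block is rebuilt
# from the surviving data blocks plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])

# Undetected error: drive 1 silently returns wrong data.
corrupted = [data[0], b"BXBB", data[2]]
mismatch = xor_blocks(corrupted + [parity])   # non-zero: stripe is inconsistent

# The same mismatch would appear if drive 0 had been the faulty one,
# so the mismatch alone cannot identify the bad drive.
alt_drive0 = bytes(a ^ m for a, m in zip(data[0], mismatch))
same = xor_blocks([alt_drive0, data[1], data[2], parity]) == mismatch

print(rebuilt == data[1], any(mismatch), same)
```

With only one parity block there is one equation per byte position, which is enough to solve for a block known to be missing but not to find out which block is wrong.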

For RAID levels with two parity blocks per stripe (e.g. RAID 6), a single "undetected" error, where a drive provides incorrect data without flagging a fault, can be corrected. This relies on the RAID system checking the parity, a feature called parity check on read (DDN disk subsystems) or pre-read redundancy check (PRRC on LSI/Engenio disk subsystems). The feature may need to be enabled even if it is available on the disk subsystem. Depending on the disk subsystem, the check may not be done on normal reads, because it involves reading all of the data from all disks in the RAID stripe. This is a performance overhead if the requesting program only wanted data residing on a fraction of the stripe: with parity check on read, every disk in the stripe is read, which increases the I/O load on the disk subsystem and may add the cost of calculating and checking the parity.
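A minimal sketch of why a second parity block makes a silent error locatable, assuming a RAID 6 style P+Q scheme over GF(2^8) with field polynomial 0x11d and generator 2. The function names are hypothetical, and the sketch only handles a single corrupted data block; it does not cover a corrupted P or Q block, or any particular vendor's implementation.

```python
# Exponent/log tables for GF(2^8), polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d).
EXP = [0] * 255
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[(LOG[a] + LOG[b]) % 255]

def gf_div(a, b):
    # Caller guarantees b != 0.
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]

def compute_pq(data):
    """Return (P, Q): P is plain XOR, Q weights drive i by g**i."""
    n = len(data[0])
    p, q = bytearray(n), bytearray(n)
    for i, block in enumerate(data):
        for k, byte in enumerate(block):
            p[k] ^= byte
            q[k] ^= gf_mul(EXP[i], byte)
    return bytes(p), bytes(q)

def locate_and_fix(data, p, q):
    """Repair one silently corrupted data block in place; return its index or None."""
    p_now, q_now = compute_pq(data)
    sp = bytes(a ^ b for a, b in zip(p, p_now))   # error pattern e
    sq = bytes(a ^ b for a, b in zip(q, q_now))   # g**j * e for bad drive j
    if not any(sp) and not any(sq):
        return None
    drives = set()
    for a, b in zip(sp, sq):
        if a == 0 and b == 0:
            continue
        if a == 0 or b == 0:
            raise ValueError("not a single data-block error")
        drives.add(LOG[gf_div(b, a)])             # sq / sp = g**j, so log gives j
    if len(drives) != 1 or max(drives) >= len(data):
        raise ValueError("not a single data-block error")
    j = drives.pop()
    data[j] = bytes(d ^ e for d, e in zip(data[j], sp))
    return j

stripe = [b"alpha---", b"bravo---", b"charlie-", b"delta---"]
p, q = compute_pq(stripe)
damaged = list(stripe)
damaged[2] = b"chXrlie-"            # drive 2 returns wrong data, no fault flagged
bad = locate_and_fix(damaged, p, q)
print(bad, damaged == stripe)
```

Note that `locate_and_fix` has to read every block of the stripe, which is the read overhead the paragraph above describes.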

"Data scrubbing" is another method of reducing the likelihood of data corruption: disk errors are caught and recovered from before multiple errors accumulate and exceed what the available parity can correct. Instead of checking parity on each read, the parity is checked during a regular scan of the disk, often run as a low-priority background process. Note that it is the scrubbing operation that triggers the parity check. If a user simply runs a normal program that reads data from the disk, the parity is not checked unless parity check on read is both supported and enabled on the disk subsystem.
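The idea of a scrub pass can be sketched as a loop that visits every stripe and recomputes its parity, regardless of whether any program has asked for that data. The in-memory `array` layout and the function names below are assumptions made for the example, not how any particular RAID implementation stores its stripes.

```python
def parity_of(blocks):
    """XOR parity of a list of equal-length byte blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for k, byte in enumerate(block):
            out[k] ^= byte
    return bytes(out)

def scrub(stripes):
    """Yield the index of every stripe whose stored parity no longer matches its data."""
    for index, (data_blocks, stored_parity) in enumerate(stripes):
        if parity_of(data_blocks) != stored_parity:
            yield index

# Build a toy array of four stripes, each with three data blocks plus parity.
array = []
for n in range(4):
    blocks = [bytes([n, d] * 4) for d in range(3)]
    array.append((blocks, parity_of(blocks)))

# Simulate silent corruption of one block in stripe 2.
blocks, parity = array[2]
array[2] = ([blocks[0], b"\xff" * 8, blocks[2]], parity)

inconsistent = list(scrub(array))
print(inconsistent)
```

A scrub like this only reports the inconsistent stripe; whether it can also be repaired depends on the redundancy available, as discussed for single and dual parity above.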


There are two types of data loss:

  • Undetected: also known as "silent corruption". These problems have been attributed to errors during the write process to disk. They are the most dangerous errors, as there is no indication that the data is incorrect.
  • Detected: these errors are most often caused by disk drive problems. Errors may be either permanent or temporary; a temporary error can be overcome when the hardware repeats the operation. Errors are normally detected by the hardware, either by the disk drive checking the data read from the disk against the ECC or CRC code stored alongside the data, or, in a RAID array, by comparing the contents of the RAID strips with the ECC checksum or parity of the RAID stripe.
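How a stored check code turns corruption into a detected error can be sketched with a CRC-32 appended to each sector. This is a simplified stand-in: `write_sector` and `read_sector` are hypothetical helpers, and a real drive uses ECC that can correct some errors as well as detect them, which a plain CRC cannot.

```python
import zlib

def write_sector(payload):
    """Return the bytes as stored: payload followed by its 4-byte CRC-32."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def read_sector(stored):
    """Return the payload, or raise if it no longer matches the stored CRC."""
    payload, saved = stored[:-4], int.from_bytes(stored[-4:], "big")
    if zlib.crc32(payload) != saved:
        raise IOError("checksum mismatch: sector is corrupt")
    return payload

stored = write_sector(b"some user data")

# Flip a single bit in the stored payload to simulate media damage.
damaged = bytes([stored[0] ^ 0x01]) + stored[1:]

try:
    read_sector(damaged)
    detected = False
except IOError:
    detected = True

print(read_sector(stored), detected)
```

An undetected error, by contrast, is one where the data and its check code still agree, for example because the wrong data was written together with a matching code.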
