Implementations
Seymour Cray famously said "parity is for farmers" when asked why he left this out of the CDC 6600. He included parity in the CDC 7600, which caused pundits to remark that "apparently a lot of farmers buy computers" (see parity bit#History). The original IBM PC and all PCs until the early 1990s used parity checking. Later ones mostly did not. Wider memory buses make parity and especially ECC more affordable. Many current microprocessor memory controllers, including almost all AMD 64-bit offerings, support ECC, but many motherboards and in particular those using low-end chipsets do not.
In a few cases, systems with a non-ECC memory controller can still gain most of the benefits of ECC memory by using EOS memory modules.
An ECC-capable memory controller as used in many modern PCs (mostly medium- to high-end workstation and server-class) can detect and correct errors of a single bit per 64-bit "word" (the unit of bus transfer), and detect (but not correct) errors of two bits per 64-bit word. The BIOS in some computers, when matched with operating systems such as (only some versions of) Linux, MacOS, and Windows, allow counting of detected and corrected memory errors, in part to help identify failing memory modules before the problem becomes catastrophic.
Error detection and correction depends on an expectation of the kinds of errors that occur. Implicitly, we have assumed that the failure of each bit in a word of memory is independent and hence that two simultaneous errors are improbable. This used to be the case when memory chips were one bit wide (typical in the first half of the 1980s). Now many bits are in the same chip. This weakness does not seem to be widely addressed; one exception is Chipkill.
DRAM memory may provide increased protection against soft errors by relying on error correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for high fault-tolerant applications, such as servers, as well as deep-space applications due to increased radiation. Some systems also 'scrub' the memory – periodically reading all addresses and writing back corrected versions if neccessary to remove soft errors.
Interleaving allows for distribution of the effect of a single cosmic ray, potentially upsetting multiple physically neighboring bits across multiple words by associating neighboring bits to different words. As long as a single event upset (SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error correcting code), and an effectively error-free memory system may be maintained.
Error-correcting memory controllers traditionally use Hamming codes, although some use triple modular redundancy. The latter is preferred because its hardware is faster than Hamming error correction hardware. Space satellite systems often use TMR, although satellite RAM usually uses Hamming error correction.
Read more about this topic: ECC Memory