ECC Memory - Problem Background

Problem Background

Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off ("soft") errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read/write them.

There was some concern that as DRAM density increases further, and thus the components on chips get smaller, while at the same time operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently—since lower-energy particles will be able to change a memory cell's state. On the other hand, smaller cells make smaller targets, and moves to technologies such as SOI may make individual cells less susceptible and so counteract, or even reverse, this trend. Recent studies show that single event upsets due to cosmic radiation have been dropping dramatically with process geometry and previous concerns over increasing bit cell error rates are unfounded.

The Cassini–Huygens spacecraft, launched in 1997, contains two identical flight recorders, each of which contains 2.5 gigabits of memory in the form of arrays of commercial DRAM chips. Its engineering telemetry sends the number of (correctable) single-bit-per-word errors and (uncorrectable) double-bit-per-word errors. In the vicinity of Earth, and when the sun is "quiet", it reported a nearly constant single-bit error rate of about 280 errors per day. The maximum hourly error report from Cassini–Huygens in the first month in space was 3072 single-bit errors per day during a weak solar flare. If the flight recorders had been designed with EDAC words assembled from widely-separated bits, the number of (uncorrectable) multiple-bit errors should average less than one per year.

Work published between 2007 and 2009 showed widely varying error rates with over 7 orders of magnitude difference, ranging from 10−10–10−17 error/bit·h, roughly one bit error, per hour, per gigabyte of memory to one bit error, per millennium, per gigabyte of memory. A very large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance’09 conference. The actual error rate found was several orders of magnitude higher than previous small-scale or laboratory studies, with 25,000 to 70,000 errors per billion device hours per megabit (about 3–10 × 10−9 error/bit·h), and more than 8% of DIMM memory modules affected by errors per year.

Read more about this topic:  ECC Memory

Famous quotes containing the words problem and/or background:

    Congress seems drugged and inert most of the time. ...Its idea of meeting a problem is to hold hearings or, in extreme cases, to appoint a commission.
    Shirley Chisholm (b. 1924)

    They were more than hostile. In the first place, I was a south Georgian and I was looked upon as a fiscal conservative, and the Atlanta newspapers quite erroneously, because they didn’t know anything about me or my background here in Plains, decided that I was also a racial conservative.
    Jimmy Carter (James Earl Carter, Jr.)