Reliability, Availability and Serviceability (computer Hardware) - RAS Features

Example hardware features for improving RAS include the following, listed by subsystem:

  • Processor:
    • Processor instruction error detection (e.g. residue checking of results) with instruction retry, e.g. alternative processor recovery in IBM mainframes or "Instruction replay technology" in Itanium systems.
    • Processors running in lock-step to perform master-checker or voting schemes.
    • Machine check architecture to report errors to the OS.
    • Extended precision, including 128-bit, floating point to minimize roundoff error in intermediate calculations, and hardware decimal floating point to avoid decimal-binary conversion errors.
  • Memory:
    • Parity or ECC (including single device correction) protection of memory components (cache and system memory) and of the memory bus; bad cache line disabling; memory scrubbing; memory sparing; bad page offlining; redundant bit steering; redundant array of independent memory (RAIM).
  • I/O:
    • Cyclic redundancy checks (CRC) for data transmission with retry and for data storage, e.g. PCIe Advanced Error Reporting; redundant I/O paths.
  • Storage:
    • RAID configurations for magnetic disk storage.
    • Journaling file systems for file repair after crashes.
    • Checksums on both data and metadata, and background scrubbing.
  • Power/cooling:
    • Duplication of components to avoid single points of failure (for example, power supplies).
    • Over-designing the system for the specified operating ranges of clock frequency, temperature, voltage, vibration.
    • Temperature sensors to throttle operating frequency when temperature goes out of specification.
    • Surge protectors, uninterruptible power supplies, auxiliary power.
  • System:
    • Hot swapping of components.
    • Predictive failure analysis to predict which intermittent correctable errors will eventually lead to hard non-correctable errors.
    • Partitioning/domaining of computer components to allow one large system to act as several smaller systems.
    • Virtual machines to decrease the severity of operating system software faults.
    • Computer clustering capability with failover capability, for complete redundancy of hardware and software.
    • Dynamic software updating to avoid the need to reboot the system for a kernel software update, for example Ksplice under Linux.
    • Independent service processor for serviceability: remote monitoring, alerting and control.
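
Several of the detection mechanisms listed above (residue checking with instruction retry, CRC with retransmission) share a detect-then-retry pattern. The following is a minimal software sketch of that pattern, not a description of how any particular processor or bus implements it. All function names are hypothetical, the modulus 3 is an assumed choice for illustration, and the fault injector exists only to exercise the retry path.

```python
import zlib


def residue_checked_multiply(a, b, multiply=lambda x, y: x * y,
                             modulus=3, max_retries=2):
    """Multiply a by b and verify the product with a residue check.

    The residue of a correct product equals the product of the operand
    residues (mod `modulus`). A mismatch signals an error, and the
    operation is retried. Errors that are multiples of `modulus` go
    undetected, which is a known limit of residue codes.
    """
    expected = ((a % modulus) * (b % modulus)) % modulus
    for _ in range(max_retries + 1):
        result = multiply(a, b)
        if result % modulus == expected:
            return result
    raise RuntimeError("residue check failed after retries")


def make_flaky_multiplier(faults):
    """Hypothetical fault injector: the first `faults` calls are off by one."""
    remaining = [faults]

    def multiply(x, y):
        if remaining[0] > 0:
            remaining[0] -= 1
            return x * y + 1  # simulated transient error
        return x * y

    return multiply


def frame_with_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (big-endian) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes) -> bytes:
    """Return the payload if its CRC-32 matches, otherwise raise ValueError.

    A real link layer would respond to the mismatch by requesting
    retransmission; here the caller is expected to do that.
    """
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != received:
        raise ValueError("CRC mismatch")
    return payload
```

For example, `residue_checked_multiply(6, 7, multiply=make_flaky_multiplier(1))` sees one bad product, rejects it because its residue does not match, and returns the correct product on the retry. Flipping any single bit of a frame produced by `frame_with_crc` causes `check_frame` to raise.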

Fault-tolerant designs extended the idea by making RAS the defining feature of computers built for applications such as stock exchanges or air traffic control, where system crashes would be catastrophic. Fault-tolerant computers (e.g., those from Tandem Computers and Stratus Technologies), which tend to have duplicate components running in lock-step for reliability, have become less popular because of their high cost. High-availability systems, which use distributed computing techniques such as computer clusters, are often used as cheaper alternatives.
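
The idea of duplicate components running in lock-step can be illustrated in software with a majority vote over replica outputs. This is only a sketch under stated assumptions: the replicas are modeled as plain Python callables, the names `vote` and `run_lockstep` are hypothetical, and real lock-step hardware compares results cycle by cycle rather than per function call.

```python
from collections import Counter


def vote(outputs):
    """Return the value produced by a strict majority of replicas.

    Raises RuntimeError when no value has a strict majority. With two
    replicas this behaves like a master-checker pair: any disagreement
    is detected but cannot be corrected.
    Outputs must be hashable for Counter to tally them.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replicas")
    return value


def run_lockstep(replicas, *args):
    """Run every replica on the same inputs and vote on the results."""
    return vote([replica(*args) for replica in replicas])
```

With three replicas, a single faulty one is outvoted, so `run_lockstep([good, good, faulty], x)` returns the value of the two agreeing replicas. With only two replicas, a disagreement raises instead, since detection is possible but correction is not.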
