Virtual Synchrony - Data Replication and Fault Tolerance

Data Replication and Fault Tolerance

The basic goal for all of these protocols is to replicate data in a distributed system in a manner that makes the replicated entity indistinguishable from a non-replicated object implementing the same interface. For example, if we imagine a simple variable, x, that can be read or written to, a replicated version might consist of some set of replicas x0, x1, ... xn and an associated protocol, such that reads and writes to the replicates are performed in a way that looks indistinguishable from reads and writes to the original variable. The challenge is to deal with cases in which multiple updates are initiated concurrently (sometimes called an edit conflict), or where a failure disrupts an update while it is still in progress. When we create a process group, the idea is that each of its members will hold a replica. Updates are delivered to the group members through an event notification interface implemented in a way that eliminates these kinds of problems.

The central difference between the three models is that virtual synchrony assumes that the variable is replicated in memory by a set of processes executing on some collection of machines in a network. Transactional one-copy serializability assumes that the data resides in a collection of transactional databases (on disk), and implements the full transactional ACID properties, with the usual begin/commit/abort interface. State-machine consensus lies somewhere in the middle: the variables are assumed to be persistent (for example they might be stored in files), but are not assumed to have full ACID properties, and access is not assumed to go through a transactional begin/commit/abort interface.

None of the three models is particularly difficult to support in a system where the set of participating processes is stable, and where messages are delivered reliably. However, in real networks, computers can crash or become disconnected and messages can be lost. The need to preserve the properties of the model while masking failures and maintaining high performance is what makes the data replication problem so difficult.

All three models assume that machines may fail by crashing: a computer halts, or some process on it halts, and other processes sense the failure by timeout. Timeout, of course, is a potentially inaccurate way to sense failures (timeouts always discover true crashes, but sometimes a timeout will trigger for some other reason, such as a transient connectivity problem.) A platform implementing any of these models must provide the programmer with a set of system calls that allows him or her to write code that will continue to respect the model even if these kinds of problems occur. In effect, the platform hides this difficult fault-tolerance problem from the programmer.

None of the three models can handle more complex failures, such as machines that are taken over by a virus, or a network that sometimes modifies the messages transmitted. The so-called Byzantine agreement model goes beyond the data replication schemes discussed here by also solving such issues, but does so at a price: Byzantine replication protocols typically require larger numbers of servers, and can be much slower.

Read more about this topic:  Virtual Synchrony

Famous quotes containing the words data, fault and/or tolerance:

    To write it, it took three months; to conceive it three minutes; to collect the data in it—all my life.
    F. Scott Fitzgerald (1896–1940)

    It is almost always a fault of one who loves not to realize when he ceases to be loved.
    François, Duc De La Rochefoucauld (1613–1680)

    Persecution was at least a sign of personal interest. Tolerance is composed of nine parts of apathy to one of brotherly love.
    Frank Moore Colby (1865–1925)