Virtual Synchrony - Failure Semantics

Failure Semantics

We mentioned that there is a sense in which virtual synchrony is a weaker model than transactional one-copy serializability or state machine consensus in the style of Paxos. Partly this relates to ordering: virtual synchrony often weakens the message delivery ordering to gain performance. As mentioned above, doing so can sometimes increase robustness too. If different copies sometimes process events in different orders (doing so only when this won't have any impact on the ultimate state of the object), the copies may still be somewhat more robust against messages that cause exceptions. After all, many bugs are exquisitely sensitive to the exact sequence of events that a process experiences, so processes that see the same things but in different orders can often survive problems that would be fatal in some specific ordering.

But the other sense in which virtual synchrony is a weaker model relates to exactly what happens when some process crashes. Suppose that process p sends a multicast to a group G, and then p and some member of the group, say q, both crash. No process that remains operational has a copy of the multicast. What should the platform do?

In virtual synchrony, the group continues executing as if no message was ever sent. After all, there is no evidence to the contrary. P and q have both crashed, so they won't behave in a manner inconsistent with the model. Yet it is possible that q received p's message and delivered it to the application right before the crash. So there is a case in which virtual synchrony seems to lie: it behaves as if no message was sent, and yet the crashed processes might actually have exchanged a message.

This never happens in Paxos or transactional systems, which makes them a good match for updating database files on a disk. In both systems, if q later recovers and rejoins the group, any data it collected prior to crashing will still be valid, except to the extent that it missed updates delivered to the other group members while it was down. The cost of this guarantee is, however, quite high. Asynchronous Paxos, and transactional systems, impose a long delay before any process can deliver a message. First, these platforms make sure that the message reaches all of its destinations, asking them to delay the incoming message before delivering it. Only after the first step is completed are recipients told that it is safe to deliver the message to the application. (In one variant on these models, the platform only makes sure that a quorum (a majority) receive the message, but the delay is comparable).

The delay associated with this extra round of communication can have a big impact on performance.

Experience with virtual synchrony shows that for most applications, the weak but fast form of delivery is just fine. For rare cases where stronger guarantees are needed, the application programmer can request that a slower delivery be performed, paying an infrequent higher price, but only when necessary. The resulting performance will be much higher than if the slower, more conservative delivery property was used for every message.

Read more about this topic: Virtual Synchrony

Famous quotes containing the word failure:

“The book borrower of real stature whom we envisage here proves himself to be an inveterate collector of books not so much by the fervor with which he guards his borrowed treasures and by the deaf ear which he turns to all reminders from the everyday world of legality as by his failure to read these books.”
—Walter Benjamin (1892–1940)