In previous posts, we looked into leaderless replication and quorums. In this post, we continue exploring them.
Quorums are not as fault-tolerant as they could be. A network interruption can easily cut off a client from a large number of database nodes.
Although those nodes are alive, and other clients may be able to connect to them, they might as well be dead to a client that is cut off from them. In this situation, it’s likely that fewer than w or r reachable nodes remain, so the client can no longer reach a quorum.
In a large cluster (with significantly more than n nodes) it’s likely that the client can connect to some database nodes during the network interruption, just not to the nodes that it needs to assemble a quorum for a particular value. In that case, database designers face a trade-off:
- Is it better to return errors to all requests for which we cannot reach a quorum of w or r nodes?
- Or should we accept writes anyway, and write them to some nodes that are reachable but aren’t among the n nodes on which the value usually lives?
Sloppy Quorum – writes and reads still require w and r successful responses, but those may include nodes that are not among the designated n “home” nodes for a value.
Hinted Handoff – Once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are forwarded to the appropriate “home” nodes.
Sloppy quorums only assure that writes succeed as long as there are w nodes available somewhere; reads may still return stale values until hinted handoff completes.
Sloppy quorums are optional in Dynamo-style implementations; in Riak they are enabled by default.
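To make the idea concrete, here is a minimal, in-memory Python sketch of a sloppy quorum with hinted handoff. It assumes a toy cluster where each key has n home nodes and a write needs w acknowledgments; the Node/Cluster classes, the hash-based home-node selection, and the hint format are illustrative assumptions, not any particular database’s implementation.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)    # key -> latest value on this node
    hints: list = field(default_factory=list)   # (home_node_name, key, value) accepted on another node's behalf

class Cluster:
    def __init__(self, nodes, n=3, w=2):
        self.nodes, self.n, self.w = nodes, n, w

    def home_nodes(self, key):
        """The n designated 'home' replicas for a key (simple hash-ring walk)."""
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.n)]

    def write(self, key, value):
        homes = self.home_nodes(key)
        acks = 0
        for node in homes:
            if node.up:
                node.data[key] = value
                acks += 1
        # Sloppy part: if the home nodes cannot provide w acks, accept the write on
        # any other reachable node and leave a hint about who should really own it
        # (toy simplification: hints are recorded against only the first home node).
        if acks < self.w:
            for node in self.nodes:
                if acks >= self.w:
                    break
                if node.up and node not in homes:
                    node.hints.append((homes[0].name, key, value))
                    acks += 1
        return acks >= self.w   # True if the (sloppy) quorum was met

    def hinted_handoff(self):
        """Once home nodes are reachable again, forward the writes they missed."""
        by_name = {node.name: node for node in self.nodes}
        for node in self.nodes:
            still_pending = []
            for home_name, key, value in node.hints:
                home = by_name[home_name]
                if home.up:
                    home.data[key] = value
                else:
                    still_pending.append((home_name, key, value))
            node.hints = still_pending

# Usage: with two of the three home nodes down, the write still reaches a sloppy quorum.
cluster = Cluster([Node("a"), Node("b"), Node("c"), Node("d"), Node("e")], n=3, w=2)
homes = cluster.home_nodes("cart")
homes[0].up = homes[1].up = False
assert cluster.write("cart", ["milk"])      # one home node + one stand-in node with a hint
homes[0].up = homes[1].up = True
cluster.hinted_handoff()                    # the stand-in forwards the hinted write home
```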
Multi-datacenter operation
Leaderless replication is also suitable for multi-datacenter operation, since it is designed to tolerate conflicting concurrent writes, network interruptions, and latency spikes.
Cassandra and Voldemort – the number of replicas n includes nodes in all datacenters, and in the configuration you can specify how many of the n replicas you want in each datacenter. Each write from a client is sent to all replicas regardless of datacenter, but the client usually waits for acknowledgment only from a quorum of nodes within its local datacenter, so that it is unaffected by delays and interruptions on the cross-datacenter link. The higher-latency writes to the other datacenters are often configured to happen asynchronously, although there is some flexibility in the configuration.
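As a rough illustration of the Cassandra side of this, the sketch below uses the DataStax Python driver: NetworkTopologyStrategy sets how many replicas live in each datacenter, and a write at LOCAL_QUORUM waits only for a quorum within the client’s own datacenter while the remote datacenter catches up asynchronously. The keyspace, table, datacenter names, and contact points are made-up placeholders (the datacenter names must match whatever the cluster’s snitch reports).

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(contact_points=["10.0.0.1"])   # placeholder address
session = cluster.connect()

# Three replicas in each of two datacenters (n = 6 in total).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}
""")
session.execute(
    "CREATE TABLE IF NOT EXISTS shop.carts (user_id text PRIMARY KEY, items text)"
)

# The write is sent to all replicas, but the coordinator acknowledges as soon as
# a quorum of replicas in the local datacenter has responded.
stmt = SimpleStatement(
    "INSERT INTO shop.carts (user_id, items) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(stmt, ("alice", "book,pen"))
```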
Detecting Concurrent Writes
Dynamo-style databases allow several clients to concurrently write to the same key, which means that conflicts will occur even if strict quorums are used. Conflicts can also arise from read repair and hinted handoff.
Application developers have to make sure these conflicts are resolved, as the database does not do it automatically.
- Last write wins (discarding concurrent writes) – One approach for achieving eventual convergence is to declare that each replica need only store the most “recent” value and allow “older” values to be overwritten and discarded. For example, we can attach a timestamp to each write, pick the biggest timestamp as the most “recent,” and discard any writes with an earlier timestamp. This conflict resolution algorithm is called last write wins (LWW). LWW achieves the goal of eventual convergence, but at the cost of durability: if there are several concurrent writes to the same key, even if they were all reported as successful to the client (because they were written to w replicas), only one of the writes will survive and the others will be silently discarded. (A minimal sketch follows this list.)
- The “happens-before” relationship and concurrency – whenever you have two operations A and B, there are three possibilities: either A happened before B, or B happened before A, or A and B are concurrent. What we need is an algorithm to tell us whether two operations are concurrent or not. If one operation happened before another, the later operation should overwrite the earlier operation, but if the operations are concurrent, we have a conflict that needs to be resolved.
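As promised above, here is a minimal sketch of last write wins: given the versions of a key seen on different replicas, only the value with the largest timestamp survives. The (timestamp, value) tuple format and the function name are illustrative assumptions.

```python
def lww_resolve(versions):
    """versions: iterable of (timestamp, value) pairs for one key from different replicas."""
    return max(versions, key=lambda tv: tv[0])[1]

# Two clients wrote concurrently; both writes were reported as successful,
# but after LWW resolution only one survives and the other is silently dropped.
print(lww_resolve([(1700000001.2, "cart = [book]"),
                   (1700000003.7, "cart = [pen]")]))   # -> "cart = [pen]"
```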
The algorithm works as follows (a minimal sketch follows the list):
- The server maintains a version number for every key, increments the version number every time that key is written, and stores the new version number along with the value written.
- When a client reads a key, the server returns all values that have not been overwritten, as well as the latest version number. A client must read a key before writing.
- When a client writes a key, it must include the version number from the prior read, and it must merge together all values that it received in the prior read. (The response from a write request can be like a read, returning all current values, which allows us to chain several writes like in the shopping cart example.)
- When the server receives a write with a particular version number, it can overwrite all values with that version number or below (since it knows that they have been merged into the new value), but it must keep all values with a higher version number (because those values are concurrent with the incoming write).
- If you make a write without including a version number, it is concurrent with all other writes, so it will not overwrite anything—it will just be returned as one of the values on subsequent reads.
- If several operations happen concurrently, clients have to clean up afterward by merging the concurrently written values, also called sibling values. Merging sibling values is essentially the same problem as conflict resolution in multi-leader replication. A simple approach is to just pick one of the values based on a version number or timestamp (last write wins), but that implies losing data. Another option is to take a union of the values, but that is also error-prone. We will discuss approaches for handling these sibling values in a later post.
- Version Vectors – When there are multiple replicas, we need to use a version number per replica as well as per key. Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas. This information indicates which values to overwrite and which values to keep as siblings. The collection of version numbers from all the replicas is called a version vector.
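To tie the algorithm above together, here is a toy, single-node Python sketch: the server keeps one version number per key, reads return all sibling values plus the latest version, and a write overwrites only the values at or below the version the client read, keeping concurrent values as siblings. The class and method names are made up for illustration; with multiple replicas you would replace the single counter with a version vector (one counter per replica), as described in the last bullet.

```python
class VersionedStore:
    def __init__(self):
        self.version = {}    # key -> latest version number handed out
        self.values = {}     # key -> {version_number: value} still "live" (siblings)

    def read(self, key):
        """Return (latest_version, all sibling values) for a key."""
        return self.version.get(key, 0), list(self.values.get(key, {}).values())

    def write(self, key, value, read_version=0):
        """read_version is the version the client saw on its prior read
        (0 means the client did not read first, so it overwrites nothing)."""
        new_version = self.version.get(key, 0) + 1
        self.version[key] = new_version
        siblings = self.values.setdefault(key, {})
        # Overwrite everything the client has already merged into `value` ...
        for v in list(siblings):
            if v <= read_version:
                del siblings[v]
        # ... but keep anything written concurrently (version > read_version).
        siblings[new_version] = value
        return new_version, list(siblings.values())

# Shopping-cart style usage: two clients write concurrently, so both values are
# kept as siblings; a client must merge them (here, a set union) and write back.
store = VersionedStore()
v1, _ = store.write("cart", {"milk"})                  # client 1, no prior read
v2, _ = store.write("cart", {"eggs"})                  # client 2, also no prior read
_, siblings = store.read("cart")                       # -> [{"milk"}, {"eggs"}]
merged = set().union(*siblings)
store.write("cart", merged, read_version=max(v1, v2))  # siblings collapse to {"milk", "eggs"}
print(store.read("cart"))
```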
Thanks for stopping by! I hope this gives you a brief overview of sloppy quorums, hinted handoff, multi-datacenter operation, and detecting concurrent writes with leaderless replication. Eager to hear your thoughts; please leave a comment below and we can discuss.