In previous posts we have looked at leader-based replication. There is another way of handling replication: leaderless replication. It didn’t gain much popularity until Amazon used it for its in-house Dynamo system.
As the name suggests, there is no leader that accepts writes; instead, clients can send write requests to multiple replicas. In some leaderless implementations, the client directly sends its writes to several replicas, while in others, a coordinator node does this on behalf of the client.
Not only writes: read requests are also sent to several replicas in parallel. The client may get different responses from different nodes, i.e., the up-to-date value from one node and a stale value from another. Version numbers are used to determine which value is newer.
Let’s say we have three nodes and one of them goes down. Meanwhile a write request arrives, and it gets written to the two healthy replicas. Later the failed replica comes back online. How do we ensure that it catches up on the writes it missed?
Two mechanisms are often used in Dynamo-style datastores:
- Read repair – When a client reads from several nodes in parallel, it can detect any stale responses and write the newer value back to the replicas that returned stale data (see the sketch after this list). This approach works well for values that are frequently read.
- Anti-entropy process – In addition, some datastores have a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another. Unlike the replication log in leader-based replication, this anti-entropy process does not copy writes in any particular order, and there may be a significant delay before data is copied.
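To make read repair concrete, here is a minimal sketch in Python. The replica objects and their get(key) / put(key, value, version) methods are assumptions for illustration, not the API of any particular datastore; each replica is assumed to store a (value, version) pair per key.

```python
def read_with_repair(replicas, key):
    # Query every replica (a real client would do this in parallel).
    responses = [(node, node.get(key)) for node in replicas]

    # The response with the highest version number is the newest value.
    _, (latest_value, latest_version) = max(responses, key=lambda r: r[1][1])

    # Read repair: write the newest value back to any replica that is stale.
    for node, (_, version) in responses:
        if version < latest_version:
            node.put(key, latest_value, latest_version)

    return latest_value
```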
Quorums for reading and writing
If there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we’re reading from must be up to date.
In Dynamo-style databases, the parameters n, w, and r are typically configurable. A common choice is to make n an odd number (typically 3 or 5) and to set w = r = (n + 1) / 2 (rounded up).
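As a quick illustration, this parameter choice always satisfies the quorum condition:

```python
# Typical Dynamo-style choice: n odd, w = r = (n + 1) // 2 (rounded up).
for n in (3, 5):
    w = r = (n + 1) // 2
    assert w + r > n  # the quorum condition holds
    print(f"n={n}: w=r={w}, writes tolerate {n - w} down node(s), reads tolerate {n - r}")
```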
The quorum condition, w + r > n, allows the system to tolerate unavailable nodes as follows:
- If w < n, we can still process writes if a node is unavailable.
- If r < n, we can still process reads if a node is unavailable.
- With n = 3, w = 2, r = 2 we can tolerate one unavailable node.
- With n = 5, w = 3, r = 3 we can tolerate two unavailable nodes.

Normally, reads and writes are always sent to all n replicas in parallel. The parameters w and r determine how many nodes we wait for, i.e., how many of the n nodes need to report success before we consider the read or write to be successful (see the sketch below).
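Here is a minimal sketch of how such a quorum write might look, assuming a hypothetical node.put(key, value, version) call per replica; the thread pool simply stands in for sending the requests in parallel.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def quorum_write(replicas, key, value, version, w):
    # Send the write to all n replicas in parallel...
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(node.put, key, value, version) for node in replicas]

    # ...but only wait for w of them to acknowledge it.
    acks = 0
    for future in as_completed(futures):
        try:
            future.result()
            acks += 1
        except Exception:
            continue  # an unavailable replica simply doesn't count toward w
        if acks >= w:
            pool.shutdown(wait=False)  # don't block on slow or failed replicas
            return True                # quorum reached: the write is successful

    pool.shutdown(wait=False)
    return False  # fewer than w acknowledgements: the write is not successful
```

A quorum read would look symmetric: query all n replicas, wait for r responses, and take the value with the highest version number.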
Limitations –
With a smaller w and r you are more likely to read stale values, because it’s more likely that your read didn’t include the node with the latest value.
However, even with w + r > n, there are likely to be edge cases where stale values are returned. These depend on the implementation, but possible scenarios include:
- If a sloppy quorum is used, the w writes may end up on different nodes than the r reads, so there is no longer a guaranteed overlap between the r nodes and the w nodes.
- If two writes occur concurrently, it is not clear which one happened first. In this case, the only safe solution is to merge the concurrent writes. If a winner is picked based on a timestamp (last write wins), writes can be lost due to clock skew (see the sketch after this list).
- If a write happens concurrently with a read, the write may be reflected on only some of the replicas. In this case, it’s undetermined whether the read returns the old or the new value.
- If a write succeeded on some replicas but failed on others, and overall succeeded on fewer than w replicas, it is not rolled back on the replicas where it succeeded. This means that if a write was reported as failed, subsequent reads may or may not return the value from that write.
- If a node carrying a new value fails, and its data is restored from a replica carrying an old value, the number of replicas storing the new value may fall below w, breaking the quorum condition.
- Even if everything is working correctly, there are edge cases in which you can get unlucky with the timing.
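As a small illustration of the last-write-wins pitfall above, here is a hypothetical merge function that picks a winner by timestamp, where the timestamps are assumed to come from each writer’s local clock.

```python
def last_write_wins(candidates):
    # candidates: (timestamp, value) pairs from concurrent writes
    return max(candidates, key=lambda c: c[0])[1]

# Writer A's clock runs a few seconds fast; writer B actually wrote later.
write_a = (1_700_000_005.0, "value from A")  # skewed timestamp
write_b = (1_700_000_002.0, "value from B")  # the later write in real time
print(last_write_wins([write_a, write_b]))   # "value from A" wins; B's write is silently lost
```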
Thanks for stopping by! I hope this gives you a brief overview of leaderless replication. Eager to hear your thoughts; please leave comments below and we can discuss.