Weak consistency: case study of Amazon's Dynamo key-value store.
* Some distributed systems prioritize being available at all times over always returning consistent results. That is, consistency can be traded off for availability. Amazon's Dynamo key-value store is one example of such a system.
* Dynamo provides only eventual consistency, which is a weak guarantee. Dynamo may return multiple versions of a value for a get request on a key, and the application must be able to decide which one to use. It is suitable for applications that can tolerate occasional conflicting or stale results on get requests.
* High-level overview of Dynamo: Dynamo partitions the keys over the set of nodes, so that every key is stored at a subset of N of the total nodes. During a put, the key is written to some W of the N nodes. During a get, the key is read back from some R of the N nodes. Dynamo chooses R, W, N such that R+W > N, so that every read quorum overlaps with every write quorum and the latest value can be returned.
* How are keys assigned to nodes? Using the idea of consistent hashing. Every key is hashed to a number in a circular range [0, K-1]. Similarly, every node/replica is also assigned an ID in the same space. A key is stored at the first N nodes that succeed the hash of the key on the circular ring. This list of N nodes is called the preference list of the key, and the first node on the list is the coordinator for the key. (Minor point: every node is assigned not one position on the ring but multiple virtual nodes, for better load balancing and other benefits.)
* How does a client application know the preference list for a key? It can either link its code to the Dynamo library, which maintains an updated list of nodes, or it can send its request to a load balancer which will redirect the request to the coordinator of the key.
* Once the preference list for a key is found, the put operation tries to write the value at some W of the N nodes, and the get operation contacts some R of the N nodes.
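The preference-list lookup via consistent hashing can be sketched as follows. This is a minimal illustration, not Dynamo's actual code: ring positions come from MD5 hashes, each physical node gets a few virtual nodes, and the node names and parameters are made up.

```python
import hashlib
from bisect import bisect_right

RING_BITS = 32  # ring positions live in [0, 2^32 - 1]
VNODES = 4      # virtual nodes per physical node (illustrative)

def ring_hash(s):
    """Hash a string to a position on the circular ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (1 << RING_BITS)

def build_ring(nodes):
    """Sorted list of (position, node) pairs, one entry per virtual node."""
    ring = [(ring_hash(f"{n}#{v}"), n) for n in nodes for v in range(VNODES)]
    return sorted(ring)

def preference_list(ring, key, n):
    """First n distinct nodes whose ring position succeeds hash(key)."""
    start = bisect_right(ring, (ring_hash(key),))
    prefs = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]  # walk clockwise, wrapping
        if node not in prefs:
            prefs.append(node)
        if len(prefs) == n:
            break
    return prefs

ring = build_ring(["A", "B", "C", "D"])
prefs = preference_list(ring, "cart:alice", 3)  # prefs[0] is the coordinator
```

Because each node appears at multiple ring positions, adding or removing a node only moves a small fraction of the keys, and load spreads more evenly.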
However, put is asynchronous. That is, the system does not wait for confirmation from all W nodes before replying to the client. Under network failures or outages, the update may not reach all W nodes, and a get operation can return multiple versions of the key from different nodes. Why this design? Because one of the goals of the system is to be "always writeable": the system should never turn down a write request from a client. (Recall that a system with strong consistency, like Raft, can turn down client requests in case of failures.)
* Since multiple versions of a key-value pair can exist, Dynamo uses vector clocks to version the key-value pairs. A vector clock is a set of (node, counter) pairs, where the counter is incremented locally at each node. This vector clock (called the context in the paper) is returned with every get, and the client sends it along with its next put request.
* For example, suppose a certain key is first put at a replica A. It gets the version [(A,1)]. Another put at A updates the version to [(A,2)]. Now a client reads this key and updates it, and the write is handled by B since A is down. Its version becomes [(A,2),(B,1)]. Now suppose a node C has not been in sync with B about this latest put. If a client reads from C, it will get the older version [(A,2)]. If the client does a write based on this old version and the put is handled by C, the value will be stored at C with the version [(A,2),(C,1)]. Note that this version of the key-value pair supersedes the versions [(A,1)] and [(A,2)], but conflicts with the version [(A,2),(B,1)]. Thus, clients could get both conflicting versions when they do a get on this key at a later time, and the client application code must reconcile the two versions using application-specific knowledge.
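The version evolution in the example above can be traced with a small sketch. This is an illustration, not Dynamo's code: clocks are plain dicts mapping node name to counter, and the node names are the ones from the example.

```python
def put(clock, coordinator):
    """Return the new version after a write handled by `coordinator`:
    copy the clock and bump the coordinator's local counter."""
    new = dict(clock)
    new[coordinator] = new.get(coordinator, 0) + 1
    return new

v1 = put({}, "A")   # first put at A       -> {"A": 1}
v2 = put(v1, "A")   # second put at A      -> {"A": 2}
v3 = put(v2, "B")   # A down, B handles it -> {"A": 2, "B": 1}
v4 = put(v2, "C")   # C missed B's update  -> {"A": 2, "C": 1}
# v3 and v4 both descend from v2, but neither descends from the other:
# they are conflicting versions that a later get may return together.
```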
* Vector clock version v1 precedes vector clock version v2 if every counter in v1 is less than or equal to the corresponding counter in v2 (and the two are not identical). For instance, [(A,2)] precedes [(A,2),(B,1)]. When one version clearly precedes the other, Dynamo handles it by discarding the older version. However, when the vector clock versions conflict, Dynamo returns all such key-value pairs to the client.
* How do applications reconcile the two versions? It depends on the application semantics. The paper gives the example of a shopping cart application: if you have multiple versions of a shopping cart, you simply merge all the items in them. This could result in some deleted items resurfacing (e.g., if the update at B deleted an item but the version at C still has the item, the merged result will have the item), but no added items will be lost. This is acceptable for this application, but may not be suitable for all applications.
* After reading both conflicting versions, the client can put a merged value back into the system (say via node A), and this merged key-value pair will get the version [(A,3),(B,1),(C,1)].
* What if some of the N nodes in the preference list are not available? To ensure that the key-value pair is replicated enough times, the system chooses some N nodes even if they are further down the list. This is called a "sloppy" quorum, since the set of N nodes can keep changing based on failures. If a node not in the preference list of a key is forced to handle a write in this manner, it will later try to hand off the update to the nodes in the original preference list (the paper calls this hinted handoff), which is what makes the consistency "eventual".
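The precedence test and the shopping-cart style of reconciliation can be sketched together. This is illustrative only: clocks are dicts, carts are sets, and the item names are made up; it follows the conflicting versions from the earlier example.

```python
def precedes(v1, v2):
    """True if v1 <= v2 componentwise and v1 != v2, i.e. v2 supersedes v1.
    If neither precedes the other, the versions conflict."""
    return v1 != v2 and all(v2.get(node, 0) >= c for node, c in v1.items())

def reconcile_carts(versions):
    """Application-level merge: union the items of all conflicting cart
    versions. Deleted items may resurface, but no added item is lost."""
    merged = set()
    for _clock, cart in versions:
        merged |= cart
    return merged

vA = ({"A": 2}, {"book", "pen"})
vB = ({"A": 2, "B": 1}, {"book"})                # B's update deleted "pen"
vC = ({"A": 2, "C": 1}, {"book", "pen", "mug"})  # C never saw the delete

assert precedes(vA[0], vB[0])        # [(A,2)] is superseded, discarded
assert not precedes(vB[0], vC[0])    # neither direction holds:
assert not precedes(vC[0], vB[0])    # vB and vC conflict
cart = reconcile_carts([vB, vC])     # the deleted "pen" resurfaces
# Writing the merged cart back via A yields version {"A": 3, "B": 1, "C": 1}.
```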