Strong consistency: Raft consensus algorithm.

* How do you build a system that guarantees strong consistency? A building block is a consensus protocol that lets all nodes reach agreement on some event. Paxos is a classic consensus protocol on which distributed systems with strong consistency are built. Raft is a newer consensus protocol designed to be easier to understand.

* What do consensus algorithms do? Paxos lets all nodes agree on a single decision, while Raft is designed to let nodes agree on a log of multiple entries. Raft runs over all replicas in a distributed system and provides a consistent replicated log across all of them. That is, the protocol builds and maintains a log of all inputs to the replicated state machine, and makes sure that the inputs are stored in the same order and in a consistent manner across all replicas. This replicated log service can be used to build a variety of distributed systems.

* High-level overview of Raft: one of the replicas is elected as the leader. The leader receives inputs from clients and propagates them to all other replicas (followers). All nodes write all inputs to a log. Raft ensures that the logs of all nodes are consistent, i.e., the input stored at a particular log index is the same across all replicas. When a leader fails, the nodes elect another leader, which takes over the process of building the log.

* How does Raft provide consistency in spite of failures and network partitions? Raft replicates every entry at a majority of the nodes in the network and returns a response to a client only after the entry has been stored at a majority. A Raft replica does not respond to a client unless it can talk to a majority, and it remains unavailable if a majority of nodes have failed or if it is partitioned off with a minority. That is, a Raft system with 2f+1 replicas can tolerate up to f failures. Since any response to a client comes from a majority, and any two majorities intersect, clients will always see consistent results (i.e., it will never be the case that a client stores a value at one subset of nodes and reads it from a disjoint subset). Such systems are called quorum-based systems.

* We now go into the following details of Raft: how a leader is elected, how the leader propagates the log to other nodes, and what happens when the leader changes.

* Note: the Raft protocol is described in terms of RPCs between nodes. An RPC is a convenient abstraction for communication over the network.

* Time in Raft is divided into intervals called terms. A term ends when the leader fails and the other nodes notice that its heartbeats have stopped. Each new term starts with the election of a leader: one or more followers promote themselves to candidate status and request votes to become leader. At most one candidate can gather a majority of votes and become the leader of that term. If there are multiple candidates, the votes may be split so that no one gets a majority; in that case, candidates wait for a random duration and retry. Note that some nodes may miss many terms altogether while they are down, and catch up on the current term number when they come back up.

* Can two leaders be elected for the same term? No, because getting elected as leader of a term requires a majority vote, and each node votes at most once per term. Term numbers increase monotonically across successive leaders, so the term acts as a "logical clock" in the system. (A sketch of the voting rule appears below.)
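The following is a minimal Go sketch of the election mechanics just described: each node votes at most once per term, a candidate needs a majority of a 2f+1 cluster, and a randomized election timeout breaks repeated split votes. It is only an illustration under simplified assumptions; the names loosely follow the Raft paper's RequestVote RPC, and the "up to date log" restriction on voting, discussed later, is omitted here.

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // RequestVoteArgs is what a candidate sends when asking for a vote.
    // (The "up to date log" fields are omitted in this sketch.)
    type RequestVoteArgs struct {
        Term        int // candidate's term
        CandidateID int
    }

    type RequestVoteReply struct {
        Term        int // receiver's current term, so a stale candidate can catch up
        VoteGranted bool
    }

    // node holds the election state of one replica.
    type node struct {
        currentTerm int
        votedFor    int // -1 means no vote cast yet in currentTerm
    }

    // handleRequestVote grants at most one vote per term. This is why two
    // leaders can never be elected for the same term: each needs a majority,
    // and no node votes twice in a term.
    func (n *node) handleRequestVote(args RequestVoteArgs) RequestVoteReply {
        if args.Term < n.currentTerm {
            return RequestVoteReply{Term: n.currentTerm, VoteGranted: false}
        }
        if args.Term > n.currentTerm { // newer term: adopt it and forget any old vote
            n.currentTerm = args.Term
            n.votedFor = -1
        }
        if n.votedFor == -1 || n.votedFor == args.CandidateID {
            n.votedFor = args.CandidateID
            return RequestVoteReply{Term: n.currentTerm, VoteGranted: true}
        }
        return RequestVoteReply{Term: n.currentTerm, VoteGranted: false}
    }

    // isMajority reports whether votes form a majority of a cluster of the
    // given size; a cluster of 2f+1 replicas tolerates f failures.
    func isMajority(votes, clusterSize int) bool {
        return votes > clusterSize/2
    }

    // randomizedElectionTimeout spreads candidates out in time so that
    // repeated split votes are unlikely.
    func randomizedElectionTimeout() time.Duration {
        return time.Duration(150+rand.Intn(150)) * time.Millisecond
    }

    func main() {
        n := &node{votedFor: -1}
        fmt.Println(n.handleRequestVote(RequestVoteArgs{Term: 1, CandidateID: 2})) // granted
        fmt.Println(n.handleRequestVote(RequestVoteArgs{Term: 1, CandidateID: 3})) // denied: already voted in term 1
        fmt.Println(isMajority(3, 5))                                              // true: 3 of 5 is a quorum
        fmt.Println(randomizedElectionTimeout())
    }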
* Once a leader is elected, it begins accepting client requests and starts replicating them at other nodes. Every log entry consists of an index, a term, and the change to the state machine that it proposes (Fig 6). Once an entry has been replicated at a majority of nodes, it is considered committed, and all replicas can apply it to their state machines. At this point, the leader returns a response to the client.

* The leader ensures that all logs agree on the entries at all index values. The leader only appends entries to its log. With every append request, the leader also sends the index and term of the previous log entry, and a follower appends the new entry only if its own entry at that previous index has the matching term.

* During regular operation, the logs of all nodes stay in sync. However, during leader changes and failures, logs may diverge. For example, suppose node N1 is the old leader, and entries 1--4 have been committed because they are stored at a majority of nodes. However, N1 crashes before replicating log entries 5, 6, 7 to all nodes:

  N1: 1 2 3 4 5 6 7
  N2: 1 2 3 4
  N3: 1 2 3 4
  N4: 1 2 3
  N5: 1 2 3 4

  Now, suppose N5 becomes the leader, and N1 also comes back up as a follower later on. N5, as the new leader, starts accepting client requests and updates its log as follows ("p" and "q" are new entries written in the fifth and sixth slots of the log; we will use different sets of numbers/letters to distinguish entries of different terms):

  N1: 1 2 3 4 5 6 7
  N2: 1 2 3 4
  N3: 1 2 3 4
  N4: 1 2 3
  N5: 1 2 3 4 p q

  Note that some followers (N1) have more entries than the leader (N5), while some followers (N4) are lagging behind. What should N5 do now?

* When N5 tries to replicate the entry "p" at index 5, it will be inconsistent with some follower's entry at index 5 (e.g., N1's). How is this inconsistency detected? Recall that along with the request to append an entry, the leader also sends the index and term of the previous entry in its log. If these don't match the follower's entry at that index, the follower rejects the append request. The leader then goes one step back and tries to append starting from the previous index. In this manner, the leader walks back to the point where its log matches the follower's, and starts adding entries from there on. (A sketch of this consistency check appears below.)

* In this manner, a new leader rolls back the tail of any follower's log that differs from its own, to ensure that all logs are consistent with its own. It also helps lagging followers catch up. So, eventually N5 will ensure that all logs look as follows:

  N1: 1 2 3 4 p q
  N2: 1 2 3 4 p q
  N3: 1 2 3 4 p q
  N4: 1 2 3 4 p q
  N5: 1 2 3 4 p q

* What about the entries 5, 6, 7 received by N1? Note that they were never committed, so no reply was sent to the client. The client will realize that the leader failed before replying, and will retry with the new leader. The price of consistency is that the system will sometimes be unavailable, and the client needs a way to retry.
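The following is a minimal Go sketch of the follower-side consistency check and tail rollback described above. The structs loosely follow the AppendEntries RPC from the Raft paper, but this is an illustration under simplified assumptions (0-based indices, no persistence, no commit-index bookkeeping), not a complete implementation.

    package main

    import "fmt"

    // LogEntry is one slot in the replicated log.
    type LogEntry struct {
        Term    int
        Command string // the proposed change to the state machine
    }

    // AppendEntriesArgs loosely follows the AppendEntries RPC: the leader
    // sends the index and term of the entry preceding the new ones.
    type AppendEntriesArgs struct {
        Term         int
        PrevLogIndex int // index of the entry immediately before Entries (-1 for "start of log")
        PrevLogTerm  int // term of that entry
        Entries      []LogEntry
    }

    // appendEntries is the follower-side consistency check: accept only if
    // our entry at PrevLogIndex has term PrevLogTerm; on success, roll back
    // any conflicting tail and adopt the leader's entries. Indices are 0-based.
    func appendEntries(log []LogEntry, args AppendEntriesArgs) ([]LogEntry, bool) {
        if args.PrevLogIndex >= 0 {
            if args.PrevLogIndex >= len(log) || log[args.PrevLogIndex].Term != args.PrevLogTerm {
                return log, false // mismatch: the leader will retry with a smaller PrevLogIndex
            }
        }
        // Delete everything after the match point and append the leader's entries.
        return append(log[:args.PrevLogIndex+1], args.Entries...), true
    }

    func main() {
        // A follower whose log diverged at index 2 (an uncommitted entry from an old term).
        follower := []LogEntry{{Term: 1, Command: "a"}, {Term: 1, Command: "b"}, {Term: 2, Command: "stale"}}

        // The new leader (term 3) first tries to append after index 2, where its own entry has term 3: rejected.
        _, ok := appendEntries(follower, AppendEntriesArgs{Term: 3, PrevLogIndex: 2, PrevLogTerm: 3,
            Entries: []LogEntry{{Term: 3, Command: "q"}}})
        fmt.Println("first attempt accepted:", ok)

        // The leader backs up to index 1 (term 1), which matches: the stale
        // entry is rolled back and p, q are appended.
        follower, ok = appendEntries(follower, AppendEntriesArgs{Term: 3, PrevLogIndex: 1, PrevLogTerm: 1,
            Entries: []LogEntry{{Term: 3, Command: "p"}, {Term: 3, Command: "q"}}})
        fmt.Println("second attempt accepted:", ok, "log:", follower)
    }

In the second call, the stale entry from the old term is rolled back and the leader's entries are appended, just as in the N1--N5 example above.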
* Is this rollback dangerous? Can a leader roll back entries that have already been committed? In the previous example, what if N4 had become the leader? N4 does not have entry 4, which is committed. So, if it rolled back this entry, clients would see inconsistent results. This cannot happen in Raft: during leader elections, a node votes for a candidate only if the candidate's log is at least as "up to date" as its own, which guarantees that the elected leader has all the committed entries. So N2, N3, and N5 would never have voted for N4 as leader.

* The precise definition of being "up to date", and thus of qualifying for election as leader, is the following: a node grants its vote only if the candidate's last log entry has a later term than the node's own last entry, or if both last entries have the same term and the candidate's log is at least as long as the node's. Note that it is not enough to simply have a longer log across all terms, as a longer log could consist of useless uncommitted entries from older terms.

* To understand the definition of an "up to date" log better, let's consider the following question: what if the node with the longest log (across all terms, not just the latest) were elected as leader? Does the node with the longest log always have the most up-to-date log? Not necessarily. Suppose node N5, as leader, adds a large number of entries (r, s, t) to its log and crashes before committing any of them. While it is down, N1 becomes the leader and commits a small number of new entries (x, y) in a new term. Now N1 crashes and it is election time; N5 has come back up just in time for the election. The logs look as follows:

  N1: 1 2 3 4 p q x y
  N2: 1 2 3 4 p q x y
  N3: 1 2 3 4 p q x y
  N4: 1 2 3 4 p q x y
  N5: 1 2 3 4 p q r s t

  Clearly N5 has the longest log, but it doesn't have the committed entries from the latest term (x, y). So one of N2, N3, N4 should become the next leader, so that committed entries are not rolled back. Because of the "up to date" condition (the last log entry of N5 is from an older term), none of the other nodes will vote for N5 and it cannot become the leader.

* With this leader restriction, entries committed in previous terms will always persist. What about entries from old terms that were replicated at only a minority (and hence not committed)? The new leader does not explicitly worry about entries from old terms. It simply adds new entries to its log and replicates them; as part of this replication and of bringing all logs into sync, entries from previous terms may also be propagated to followers' logs and become committed automatically. However, a new leader does nothing explicit about old uncommitted entries, because the extent of their replication, and whether they will eventually be committed, is uncertain. For example, consider the following scenario, with N1 as leader:

  N1: 1 2 3 4 5
  N2: 1 2 3 4 5
  N3: 1 2 3 4
  N4: 1 2 3 4
  N5: 1 2 3 4

  N1 committed entries 1--4, but entry 5 was replicated at only a minority before N1 crashed. Now, there are two possibilities in the next election: either N2 becomes a candidate, or one of N3/N4/N5 does. If N2 becomes a candidate and wins the election, it may add a new entry p to the log, and while propagating that entry, the old entry 5 will also be replicated (recall: the leader ensures all followers' logs are consistent with its own). However, N2 is not guaranteed to become the leader. If N5 becomes a candidate before N2, it can still win the election (N2 has a more up-to-date log than N5, so N2 won't vote for N5, but N3 and N4 will). Now, N5 as leader will start adding log entries and will overwrite entry 5 at N2 (and at N1 when it comes back up). That is, the fate of uncommitted entries from old terms depends on who gets elected as leader, and they may get overwritten.

* Consensus protocols like Raft can be used to build a variety of distributed systems. For example, how does one build a distributed key-value store over Raft? All put requests from clients are stored in the Raft log and propagated to all nodes. Once an entry has been stored at a majority of nodes and committed by Raft, each replica applies it to its local key-value store. A get request can be served by the leader from its local state machine. However, what if the leader has been replaced but does not know it yet? To reply to a get request, a leader must first make sure it is still the leader (say, by pushing the get as an entry into the followers' logs, or otherwise confirming it can still reach a majority). A sketch of this design appears below.
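To make this concrete, the following is a minimal Go sketch of a key-value store layered on a replicated log. The raft interface here (Propose, ConfirmLeadership, Committed) is hypothetical shorthand for what a consensus library would expose, not the API of any particular implementation, and the single-node fakeRaft exists only so the sketch runs; a real cluster would replicate each entry to a majority before committing it.

    package main

    import (
        "errors"
        "fmt"
    )

    // Command is what gets stored in the replicated log.
    type Command struct {
        Key, Value string
    }

    // raft is the interface this sketch assumes the consensus layer provides;
    // the method names are hypothetical, not from a specific library.
    type raft interface {
        Propose(Command) error     // append to the log; return once committed at a majority
        ConfirmLeadership() bool   // e.g. check that a majority still accepts this leader
        Committed() <-chan Command // committed entries, delivered in log order
    }

    // kvStore is the replicated state machine layered on top of the log.
    type kvStore struct {
        raft raft
        data map[string]string
    }

    // applyLoop runs on every replica: committed entries are applied to the
    // local key-value map in log order.
    func (s *kvStore) applyLoop() {
        for cmd := range s.raft.Committed() {
            s.data[cmd.Key] = cmd.Value
        }
    }

    // Put is routed through the log and returns only after the entry commits.
    func (s *kvStore) Put(key, value string) error {
        return s.raft.Propose(Command{Key: key, Value: value})
    }

    // Get is served from local state, but only after confirming leadership,
    // so a leader that has silently been replaced cannot serve stale reads.
    func (s *kvStore) Get(key string) (string, error) {
        if !s.raft.ConfirmLeadership() {
            return "", errors.New("not the leader; retry at the current leader")
        }
        return s.data[key], nil
    }

    // fakeRaft is a single-node stand-in so the sketch runs: it "commits"
    // immediately. A real cluster would replicate to a majority first.
    type fakeRaft struct{ ch chan Command }

    func (f *fakeRaft) Propose(c Command) error   { f.ch <- c; return nil }
    func (f *fakeRaft) ConfirmLeadership() bool   { return true }
    func (f *fakeRaft) Committed() <-chan Command { return f.ch }

    func main() {
        f := &fakeRaft{ch: make(chan Command, 16)}
        store := &kvStore{raft: f, data: map[string]string{}}

        store.Put("x", "42")
        close(f.ch)       // no more entries in this toy run
        store.applyLoop() // normally a background goroutine; here we just drain
        v, _ := store.Get("x")
        fmt.Println("x =", v)
    }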