Distributed transactions.
* Consider a distributed key-value store that partitions its keys across multiple nodes, like Dynamo. What if you wanted to modify multiple keys simultaneously? For example, consider a banking application that wishes to transfer money from account A to account B. The operations of deducting from A and adding to B must happen in an atomic transaction, that is, either both operations happen or neither does. If both A and B are stored in a single system, we could use something like a log-based filesystem. The application starts a transaction and writes the changes to a log before modifying the data of A and B. Once the log has been committed to disk, the application updates both A and B. This way, even if the system crashes after doing only one of the operations, it can recover the log from disk and complete the transaction.
* Now, what if the accounts A and B are stored on different nodes of a distributed system? We need a "distributed transaction" to ensure that the changes to both A and B happen together, or neither happens. We don't want the case where the change happens on one node, but the other node crashes before making its change. A protocol called two-phase commit (2PC), run between the nodes, can ensure atomicity of a distributed transaction.
* 2PC overview: assume a node C coordinates the transaction between nodes A and B. Phase 1: C sends prepare messages to A and B. A and B each reply Yes or No. Phase 2: If both A and B (in general, all nodes involved in the transaction) reply Yes, then C sends a message to A and B asking them to commit. If any node said No, C sends a message to abort. This way, a transaction commits, and changes are made to the distributed data store, only if all parties are willing to commit.
* What about failures? What if one of A, B, C fails in the middle of the protocol and forgets what it has said? For example, what if B said Yes in phase 1 but doesn't agree to commit in phase 2?
* Let us first consider coordinator failures.
  - If A/B are waiting for the prepare message in phase 1 and the coordinator crashes, A/B detect the failure via a timeout and decide to abort. If C comes back up later and sends prepare, they can still vote No, so no harm is done.
  - If A voted No in phase 1 and doesn't get the phase 2 message from the coordinator, no harm is done and it is safe to abort.
  - If A voted Yes in phase 1 but doesn't hear from the coordinator, then A must block, waiting for the coordinator to send an abort/commit message. Until then, it can neither commit nor abort the transaction.
  - If the coordinator fails during phase 1, the transaction will abort, so no harm is done. If the coordinator fails in the middle of sending commit/abort messages in phase 2, it must remember whether it decided to commit or abort, so that it can restart and finish sending the decision to the remaining nodes. That is, the coordinator must store "commit" on disk before it starts sending commit messages. Otherwise, nodes that voted Yes would block forever waiting for the coordinator's message.
* Now, let us consider node failures.
  - If A fails in phase 1 before replying Yes/No, then C can abort the transaction. C aborts if any node times out in phase 1.
  - If a node voted No in phase 1 and then failed, no harm is done, since the transaction is going to be aborted anyway. The same holds if the coordinator times out in phase 2 on a node that voted No.
  - If a node voted Yes in phase 1 and then failed, it must remember that it voted Yes. So, a node must write "Yes" to disk before it replies Yes in phase 1, so that after a restart it remembers that it may have to commit the transaction. When it restarts, it must wait for the coordinator's message, or reach out to the coordinator to find out whether the decision was commit or abort.
  - If the coordinator decides to commit but cannot convey the commit message to some node (due to node failure), it must block and retry until it has successfully conveyed the commit message to all nodes.
    The coordinator must ensure that every node that voted Yes in phase 1 eventually learns the decision to commit or abort.
* Disadvantages of 2PC: there are several scenarios in which nodes block. For example, if the coordinator crashes after collecting all the votes, nodes that voted Yes will block waiting to hear the commit/abort decision. Several improvements to 2PC have been proposed.
* Note that 2PC solves a different problem from Raft. 2PC ensures that multiple nodes doing *different* things all agree to do their different operations in unison, while Raft ensures that multiple nodes doing the *same* thing do their operations in unison. In 2PC, each of the individual nodes may not be fault tolerant. If fault tolerance is required along with distributed transactions, each node participating in 2PC can replicate its updates using a Raft-replicated state machine.
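
The two phases and the logging rules described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: method calls stand in for network messages, a TimeoutError stands in for a node that does not respond, and an in-memory list stands in for each node's disk. The names Participant, Coordinator, prepare, finish and deliver are made up for this sketch and do not come from any real library.

```python
class Participant:
    """One node (such as A or B) that holds part of the data."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.reachable = True   # set to False to simulate a crashed node
        self.log = []           # stand-in for this node's disk
        self.pending = {}       # txid -> change promised in phase 1

    def prepare(self, txid, delta):
        """Phase 1: vote Yes (True) or No (False) on applying `delta`."""
        if not self.reachable:
            raise TimeoutError(self.name)
        if self.balance + delta < 0:
            self.log.append((txid, "no"))
            return False
        # Record the Yes vote *before* replying, so it survives a crash.
        self.log.append((txid, "yes", delta))
        self.pending[txid] = delta
        return True

    def finish(self, txid, decision):
        """Phase 2: apply or discard the pending change. Safe to call twice."""
        if not self.reachable:
            raise TimeoutError(self.name)
        delta = self.pending.pop(txid, None)
        if decision == "commit" and delta is not None:
            self.balance += delta
        self.log.append((txid, decision))


class Coordinator:
    """The node C that drives the transaction."""

    def __init__(self):
        self.log = []        # stand-in for the coordinator's disk
        self.unreached = []  # nodes that still need to hear the last decision

    def run(self, txid, work):
        """`work` is a list of (participant, delta) pairs."""
        votes = []
        for node, delta in work:
            try:
                votes.append(node.prepare(txid, delta))
            except TimeoutError:
                votes.append(False)  # a timeout in phase 1 counts as No
        decision = "commit" if all(votes) else "abort"
        # Record the decision *before* sending it to anyone.
        self.log.append((txid, decision))
        self.unreached = self.deliver(txid, decision, [n for n, _ in work])
        return decision

    def deliver(self, txid, decision, nodes):
        """Phase 2: send the decision; return the nodes that must be retried."""
        missed = []
        for node in nodes:
            try:
                node.finish(txid, decision)
            except TimeoutError:
                missed.append(node)
        return missed


if __name__ == "__main__":
    a, b = Participant("A", 100), Participant("B", 50)
    c = Coordinator()
    print(c.run("t1", [(a, -30), (b, +30)]), a.balance, b.balance)
```

In a real system the coordinator would keep calling deliver on the nodes in unreached until the list is empty, which is exactly the blocking-and-retrying behaviour described above, and a restarted participant would rebuild pending from its log rather than from memory.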