# Distributed agreement in practice

## Distributed commit

We will discuss the concept of distributed commit, focusing specifically on **atomic commitment**. In a distributed system, the challenge arises when we want to commit operations to databases that are partitioned across multiple nodes. Atomic commit extends the **atomicity** [ACID](../../../BSc(italian)/Basi%20di%20Dati/src/05.Transazione.md###ACID) property to distributed transactions: the transaction is either completely successful or completely unsuccessful, with no intermediate state possible.

| Consensus | Atomic commit |
|:------------------ |:----------------- |
| One or more nodes propose a value | Every node votes to commit or abort |
| Nodes agree on one of the proposed values | Commit if and only if all nodes vote to commit, abort otherwise |
| Tolerates failures, as long as a majority of nodes is available | Any crash leads to an abort |

- Termination:
    - If there are no faults, all processes eventually decide (weak)
    - All non-faulty processes eventually decide (strong)

There are two different commit protocols:

- Two-phase commit (**2PC**): sacrifices liveness (blocking protocol)
- Three-phase commit (**3PC**): more robust, but more expensive, so not widely used in practice

General result (FLP theorem): in an asynchronous system, no consensus protocol can guarantee both safety and liveness in the presence of even a single faulty process.

### 2PC

2PC is a blocking protocol that satisfies the **weak** termination condition, allowing agreement to be reached in fewer than $f+1$ rounds.

![](images/dbc7dbb8deee89697465e5230a53267b.png)

#### 2PC failure scenarios

**1. Participant failure:**

- **Before voting:**
    - If a participant fails before casting its vote, the coordinator can assume an **abort** vote after a timeout.
- **After voting (before decision):**
    - If a participant fails after voting to commit, the coordinator cannot proceed to a global commit without confirmation from all participants.

**2. Coordinator failure:**

- **Before the vote request:**
    - Participants waiting for a vote request (in the `INIT` state) can safely abort if the coordinator fails.
- **After the vote request (before decision):**
    - Participants in the `READY` state, having voted but awaiting the global decision, cannot decide unilaterally. They must wait for the coordinator's recovery or obtain the decision from other participants.

**3. Inability to reach consensus:**

- **A coordinator failure before the decision creates an indeterminate state:**
    - If the coordinator fails before sending a commit or abort message and any participant is in the `READY` state, that participant cannot safely exit the protocol.
    - Nothing can be decided until the coordinator recovers!
    - 2PC is therefore vulnerable to a single-node failure (the coordinator).
    - Participants in the `READY` state are left in **limbo**, unable to commit or abort, demonstrating the inability of 2PC to reach consensus under certain failures.
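To make the coordinator's decision rule concrete, here is a minimal Python sketch. The participant stubs, names, and message strings are illustrative assumptions, not a real protocol implementation; the point is only that the coordinator commits if and only if every participant votes to commit, and treats a missing vote (timeout) as an abort.

```python
from typing import Callable, Dict

# Hypothetical participant stub: calling it returns its vote ("COMMIT" or "ABORT"),
# or raises TimeoutError if no answer arrives in time.
Participant = Callable[[], str]

def two_phase_commit(participants: Dict[str, Participant]) -> str:
    """Coordinator side of 2PC: phase 1 collects votes, phase 2 decides."""
    votes = {}
    # Phase 1: send the vote request and collect votes.
    for name, ask_vote in participants.items():
        try:
            votes[name] = ask_vote()
        except TimeoutError:
            votes[name] = "ABORT"   # no vote within the timeout -> assume abort
    # Phase 2: global commit if and only if *every* participant voted COMMIT.
    # (In a real system the decision is logged and broadcast to all participants.)
    if all(v == "COMMIT" for v in votes.values()):
        return "GLOBAL_COMMIT"
    return "GLOBAL_ABORT"

def crashed_participant() -> str:
    raise TimeoutError("participant did not answer")

if __name__ == "__main__":
    print(two_phase_commit({"db1": lambda: "COMMIT", "db2": lambda: "COMMIT"}))     # GLOBAL_COMMIT
    print(two_phase_commit({"db1": lambda: "COMMIT", "db2": crashed_participant}))  # GLOBAL_ABORT
```

Note that this sketch only captures the voting rule; the blocking problem described above arises on the participant side, which, after voting to commit, must wait for the coordinator's decision.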
### 3PC

3PC is designed to overcome the blocking nature of 2PC by introducing an additional phase, which allows the protocol to avoid uncertainty even in the case of a coordinator failure.

- **Phase 1 (prepare):**
    - The coordinator asks participants whether they can commit. Participants respond with their agreement or disagreement.
- **Phase 2 (pre-commit):**
    - If all participants agree, the coordinator sends a pre-commit message. Participants then prepare to commit but do not commit yet.
- **Phase 3 (global commit):**
    - Finally, the coordinator sends a global-commit message to finalize the transaction.

![](images/429deefa738e7e7ec05e07166bb5b1b2.png)

3PC reduces the risk of blocking by ensuring that participants can make a safe decision even if the coordinator fails, provided they have reached the pre-commit phase. 3PC is a non-blocking protocol that satisfies the strong termination condition but **may** require a **large number of rounds** to terminate. The good news is that, with no failures, only 3 rounds are required. **Timeouts** are crucial in 3PC to ensure progress in the presence of failures: participants proceed to the next phase or abort based on timeouts.

#### 3PC failure scenarios

- **Coordinator failure:**
    - If the coordinator fails before sending the pre-commit message, participants who have not received this message can safely abort.
    - If the coordinator fails after sending the pre-commit message but before the do-commit message, participants enter an uncertain phase but will decide to commit once a timeout occurs, since they know that all participants were ready to commit.
- **Participant failure:**
    - Participant failures are handled similarly to 2PC, but the additional pre-commit phase provides a buffer that can prevent the system from entering a blocked state.

**Termination protocols in 3PC:**

- **Coordinator termination:**
    - If the coordinator recovers from a failure, it runs a termination protocol to determine the state of the participants and decide on the next step.
- **Participant termination:**
    - Participants also follow a termination protocol when they recover from a failure, which involves communicating with the other participants to determine the global state.

### CAP theorem

Any distributed system in which nodes share some replicated data can provide at most two of these three desirable properties:

- **C**: consistency, equivalent to having a single up-to-date copy of the data
- **A**: high availability of the data for updates (liveness)
- **P**: tolerance to network partitions

In the presence of network partitions, one cannot have perfect availability and consistency.

![](images/33837838e272ee227b398e9a48f9b9f8.png){width=50%}

For the *hodlers* out there: the blockchain **trilemma** is a concept derived from the CAP theorem.

## Replicated state machine

A general-purpose consensus algorithm enables multiple machines to function as a unified group. These machines work on the same state and provide a continuous service, even if some of them fail. From the viewpoint of clients, this group of machines appears as a single fault-tolerant machine.

![](images/eaf5c71069e97a2ac574ed1544b46739.png)

The idea is that the client connects to a leader, which writes the operations onto a log. This log contains the sequence of operations that must be propagated to the other nodes in the system through a consensus protocol. The ultimate objective is to maintain a **coherent view of the system** regardless of failures and network issues.
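As a toy illustration of this idea (independent of any particular consensus protocol), the Python sketch below applies the same ordered log of operations to a key-value store on two replicas. Because the operations are deterministic and applied in the same order, the replicas converge to the same state; the job of Paxos or Raft, discussed next, is precisely to make every replica see that same ordered log. The `Replica` class and the operation encoding are assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

# A log entry is an operation on a key-value store: ("set", key, value) or ("del", key, None).
LogEntry = Tuple[str, str, Optional[str]]

class Replica:
    """A replica that deterministically applies committed log entries, in order."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self.last_applied = 0          # index of the first log entry not yet applied

    def apply(self, log: List[LogEntry]) -> None:
        # Apply only the suffix of the log this replica has not yet seen.
        for op, key, value in log[self.last_applied:]:
            if op == "set":
                self.state[key] = value
            elif op == "del":
                self.state.pop(key, None)
        self.last_applied = len(log)

if __name__ == "__main__":
    # The consensus protocol's job is to make every replica see this same ordered log.
    log: List[LogEntry] = [("set", "x", "1"), ("set", "y", "2"), ("del", "x", None)]
    r1, r2 = Replica(), Replica()
    r1.apply(log)
    r2.apply(log)
    assert r1.state == r2.state == {"y": "2"}
    print(r1.state)
```

The hard part, of course, is agreeing on that single log in the first place; that is what the consensus algorithms below provide.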
### Paxos

Paxos, proposed in 1989 and published in 1998, has been the reference algorithm for consensus for about 30 years. However, it has a few problems:

- It allows agreement on a single decision, not on a sequence of requests. This issue is solved by multi-Paxos.
- Paxos is difficult to understand, making it challenging to use in practice.
- There is no reference implementation of Paxos, and there is often a lack of agreement on the details of its implementation.

### Raft

Raft is a consensus algorithm designed for managing a **replicated log**. It is used to ensure that multiple servers agree on a shared state even in the face of failures.

- **Server states:**
    - Servers can be in one of three states: leader, follower, or candidate.
    - The leader handles all client interactions and logs them.
    - The log is then replicated, and passive followers respond to the leader's requests.
    - Candidates are used to elect a new leader if the current leader fails.
- **Leader election:**
    - In the event of a leader crash, a leader election process is initiated.
    - Raft uses randomized timeouts to prevent multiple parallel elections from occurring simultaneously.
    - If a follower receives no communication from the leader within a certain timeframe, it becomes a candidate and starts a new election.
- **Log replication:**
    - Followers append entries to their logs only if they match the leader's log up to the newest entry.

![](images/fa7b109317483697b1913e52886bfb5a.png)

![](images/78af81eae8a22f99479649ccd6625a5f.png)

#### Log matching consistency

**Log matching consistency** is a property that guarantees the consistency of log entries across different servers. To achieve it, Raft divides time into **terms** of arbitrary length:

- Terms are numbered with consecutive integers
- Each server maintains a current term value
- The current term is exchanged in every communication
- Terms identify obsolete information

**This property ensures that if log entries on different servers have the same `<index, term>`, they store the same command.**

Furthermore, Raft guarantees **leader completeness**: once a log entry is committed, all future leaders must store that entry. **Servers with incomplete logs cannot be elected as leaders.**

As for communication with clients, clients are guaranteed to always interact with the leader: when a client starts, it connects to a random server, which tells it which server is the leader for the current term.

## Blockchains

Blockchains can be seen as **replicated state machines**, where the state is stored in the replicated ledger, which acts as a **log** and keeps a record of all operations or transactions. Blockchains, however, must be modeled in a "byzantine environment": a misbehaving user (a byzantine failure) may attempt to **double spend** their money (or perform other actions inconsistent with the state) by creating inconsistent copies of the log. Overall, the choice of approach depends on the trade-offs between search expressivity, performance, and network fragility.
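To connect this back to the replicated-state-machine view, here is a minimal Python sketch in which the ledger is an ordered log of transfers replayed deterministically; the account names and the simple balance rule are purely illustrative assumptions. Once consensus imposes a single total order on the log, only the first of two conflicting spends can succeed, and every replica rejects the second in the same way.

```python
from typing import Dict, List, Tuple

# A transaction transfers an amount between accounts: (sender, receiver, amount).
Tx = Tuple[str, str, int]

def apply_ledger(log: List[Tx], initial: Dict[str, int]) -> Dict[str, int]:
    """Replay the ordered ledger; reject any transfer that would overdraw an account."""
    balances = dict(initial)
    for sender, receiver, amount in log:
        if balances.get(sender, 0) < amount:
            print(f"rejected (possible double spend): {sender} -> {receiver}: {amount}")
            continue
        balances[sender] -= amount
        balances[receiver] = balances.get(receiver, 0) + amount
    return balances

if __name__ == "__main__":
    # "alice" holds 10 units and tries to spend them twice; once consensus fixes
    # the order of the log, only the first spend succeeds on every replica.
    log: List[Tx] = [("alice", "bob", 10), ("alice", "carol", 10)]
    print(apply_ledger(log, {"alice": 10}))   # {'alice': 0, 'bob': 10}
```

Unlike the crash-failure setting assumed by Raft, the consensus step here must also tolerate byzantine participants who actively try to produce conflicting logs.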