# Replication and Consistency

**Replication** is useful for:

- more **performance**, by allowing workload sharing and **reducing latency** for individual requests; it can also enhance **availability** by replicating data closer to users
- more **fault tolerance**, through **redundancy**

However, one of the main challenges in replication is ensuring **consistency** across replicas: changes made to one replica need to be propagated to all the others, which can result in conflicts. The objective is to maintain consistency while **minimizing communication overhead**. Ideally we would like the illusion of a single copy, but in practice this is not achievable, so we rely on a **consistency model**: a **contract** between the processes and the data store.

Consistency models can be classified based on the promises made by the contract/protocol:

- **guarantees on content**: the maximum difference between the versions held by different replicas
- **guarantees on staleness**: timing constraints on the propagation of updates to all replicas
- **guarantees on the order of updates**: constraints on the possible behaviors in case of conflicts

**Consistency protocols** are responsible for implementing **consistency models**. These protocols are designed with various strategies in mind, to handle different assumptions.

**Single Leader Protocols:**

- One replica is designated as the leader.
- Clients send write requests to the leader.
- Followers are updated synchronously, asynchronously, or semi-synchronously:
    - synchronous: the write operation completes only after the leader has received a confirmation from all followers; safer, but can lead to high latency
    - asynchronous: the leader stores the operations and all follower updates happen asynchronously
    - a hybrid (semi-synchronous) solution is to consider an operation completed after confirmation from at least $k$ replicas (see the sketch after this list)
- Single leader protocols are widely adopted in distributed databases like PostgreSQL, MySQL, etc.
- No write-write conflicts; read-write conflicts are still possible.
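As a concrete illustration of the semi-synchronous variant, here is a minimal Python sketch; the follower callables stand in for RPC stubs, and all names are illustrative assumptions, not any specific system's API:

```python
import concurrent.futures

def leader_write(log, followers, op, k):
    """Semi-synchronous single-leader write: the leader appends the
    operation to its own log, forwards it to the followers in parallel,
    and acknowledges the client after at least k confirmations.
    `followers` are callables standing in for RPC stubs (hypothetical)."""
    log.append(op)  # the leader alone decides the order of writes
    acks = 0
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(f, op) for f in followers]
        for fut in concurrent.futures.as_completed(futures):
            if fut.result():      # follower confirmed the update
                acks += 1
                if acks >= k:     # k = len(followers) -> fully synchronous;
                    return True   # smaller k trades safety for latency
    return False  # too few confirmations: caller should retry or abort
```

With `k = len(followers)` this degenerates to the fully synchronous variant; asynchronous replication corresponds to acknowledging the client right after the local append.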
**Multi Leader Protocols:**

- Writes are carried out at different replicas concurrently.
- No single entity decides the order of writes, leading to potential write-write conflicts.
- Often adopted in **geo-replicated** settings, where contacting a leader that is not physically co-located can introduce prohibitive costs.
- Multi leader protocols are more complex, but can handle conflicts in many scenarios.
- Remember that each client always contacts a single node!

**Leaderless Protocols:**

- Clients contact multiple replicas to perform writes/reads.
- **Quorum-based** protocols are used to avoid conflicts, similar to a voting system where the majority of replicas is needed to complete a write.
- Leaderless replication is used in modern key-value/columnar stores like Amazon Dynamo, Riak, Cassandra.

## Data-centric consistency models

It is quite difficult to give a precise definition of **consistency** in the context of data-centric models. Here we consider it as a contract dictating the guarantees on content, staleness, and order of updates.

| Consistency | Description |
| :---: | :---: |
| Strict | Any read on data item $x$ returns the value of the most recent write on $x$. |
| Linearizable | All processes must see all shared accesses in the same order. Operations behave as if they took place at some point in (wall-clock) time. |
| Sequential | All processes see all shared accesses in the same order. Accesses are not ordered in time. |
| Causal | All processes see causally-related shared accesses in the same order. |
| FIFO | All processes see writes from each other in the order in which they were issued. Writes from different processes may not always be seen in that order. |
| Eventual | Updates are guaranteed to eventually propagate to all replicas, assuming no new updates are made to the given data item. |

From the CAP theorem, in the presence of network partitions it is not possible to simultaneously achieve both consistency and availability. Consistency refers to all replicas in a system having the same data at any given time, while availability refers to a system's ability to continue operating despite failures. Strong models such as linearizability offer strong guarantees but may incur higher latency; weaker models like eventual consistency are less costly.

### Strict consistency

> "Any read on data item $x$ returns the value of the most recent write on $x$"

All writes are instantly visible and a global order is maintained. However, determining the "most recent" write is only possible within a single-processor machine; in a DS (without a global time) it is ambiguous.

### Linear consistency (linearizability)

**Linearizability** is a strong form of consistency: it ensures that all operations appear to occur instantaneously and exactly once at some point between their invocation and their response. It is a **real-time** guarantee.

> "The system is **sequentially consistent** and, in addition, if $ts_{OP_1}(x) < ts_{OP_2}(y)$ then operation $OP_1(x)$ precedes $OP_2(y)$ in the operation sequence"

If one operation appears to happen before another from any global perspective, then every process in the system must agree on this order. Linearizability is useful in scenarios where the application logic requires a certain ordering between operations to be enforced; all writes become visible (as if they were executed) at some instant in time, maintaining a global order.

The difference between **strict consistency** and **linearizability** is subtle and mostly **theoretical** because, in practice, strict consistency is generally not achievable. Linearizability is often the strongest **practical** consistency model implemented in distributed systems.

### Sequential consistency

Processes agree on a sequence of operations, regardless of real-world clock time. This agreed-upon sequence preserves the semantics of the writes and reads. Although the DS itself may not have a real clock, we can imagine ourselves outside the system, observing the real order of operations; let's assume that the x-axis of a schedule represents real time.

> "The result is the same as if the operations by all processes were executed in some sequential order, and the operations by each process appear in this sequence in the order specified by its program"

A schedule can be sequentially consistent and yet, at a given point in time, $B$ may think $x$ is already $1$ while $C$ thinks $x$ is still $0$. If they communicate (through a different channel), they break the **illusion of a single copy**.

In practice:

- Use a single coordinator (single-leader replication):
    - Sequential consistency **limits availability**, since it is necessary to contact the leader (which might be far away from the client), which must synchronously propagate the update to the replicas to achieve fault tolerance.
    - **No tolerance for network partitions**: in case of a network partition, clients are blocked or the leader is blocked trying to contact its followers.
- **Distributed agreement**: use leaderless, quorum-based protocols, where each operation requires a **quorum** (*quò·rum*: the number or percentage of votes required for an election or resolution to be valid) of servers agreeing on the version number of a resource. With $N_R$ ($N_W$) the number of replicas that clients contact to read (write) and $N$ the total number of replicas:
    - $N_R + N_W > N$ ensures that the sets of replicas involved in read and write operations overlap: at least one replica is common between the read and write sets, which helps **to avoid read-write conflicts**.
    - $N_W > \frac{N}{2}$ ensures that more than half of the replicas must agree for a write operation to be committed, which helps **to avoid write-write conflicts** (a small sanity check of these conditions is sketched below).

In practice, the quickest way to determine whether a sequence of operations is sequentially consistent is to identify a valid **interleaving** of operations. An interleaving is valid only if all processes observe the same sequence of operations. When dealing with multiple variables, you need to consider each one individually, as various cases and patterns can emerge (a brute-force checker is sketched below).
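The two quorum conditions above can be sanity-checked with a tiny helper; a minimal sketch that only validates the parameters:

```python
def valid_quorum(n, n_r, n_w):
    """Check the two quorum conditions for N replicas,
    read quorum N_R and write quorum N_W."""
    overlap = n_r + n_w > n   # read and write sets intersect:
                              # every read sees the latest write
    majority = n_w > n / 2    # any two write quorums intersect:
                              # no two conflicting writes can both commit
    return overlap and majority

# With N = 5 replicas: N_R = 2, N_W = 4 is valid, while
# N_R = 2, N_W = 3 lets a read set miss the latest write entirely.
assert valid_quorum(5, 2, 4)
assert not valid_quorum(5, 2, 3)
```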
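For the interleaving search itself, a brute-force checker can be written directly from the definition. This is a sketch under stated assumptions: each history is a per-process list of `(kind, variable, value)` operations, all variables start at 0, and interleavings are enumerated exhaustively, so it is only practical for small exercise-sized schedules:

```python
from itertools import permutations

def sequentially_consistent(histories, initial=0):
    """`histories` maps each process to its operations in program order,
    e.g. {"A": [("W", "x", 1)], "B": [("R", "x", 1)]}. Returns True iff
    some interleaving preserves every program order and makes each read
    return the value of the most recent preceding write (or `initial`)."""
    ops = [(p, i) for p, h in histories.items() for i in range(len(h))]
    for perm in permutations(ops):
        # discard interleavings that violate some process's program order
        if any(perm.index((p, i)) > perm.index((p, i + 1))
               for p, h in histories.items() for i in range(len(h) - 1)):
            continue
        state, legal = {}, True
        for p, i in perm:
            kind, var, val = histories[p][i]
            if kind == "W":
                state[var] = val
            elif state.get(var, initial) != val:  # read of a stale value
                legal = False
                break
        if legal:
            return True
    return False

# The schedule from the discussion above IS sequentially consistent:
# C's read of 0 is simply ordered before A's write in the interleaving.
history = {"A": [("W", "x", 1)],
           "B": [("R", "x", 1)],
           "C": [("R", "x", 0)]}
assert sequentially_consistent(history)

# But once B has seen 1, it can never see 0 again.
history["B"].append(("R", "x", 0))
assert not sequentially_consistent(history)
```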
### Causal consistency

Causal consistency is a weaker form of consistency compared to linearizability or sequential consistency, but it provides a **balance between availability and consistency**.

> "Writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in any order at different machines."

Causal consistency weakens sequential consistency based on Lamport's notion of happened-before: it enforces not a total order but a **partial order**. This means that only causally related operations need to be ordered with respect to each other, while concurrent operations can appear in any order.

Causal consistency is favored in DSs because it can be guaranteed with smaller overhead and is easier to implement compared to stronger consistency models. Remember that causal order is **transitive**: this is important for understanding how causal relationships are established across different operations.

In practice, in the exercises, for each pair of "write-write" and "read-write" operations in each process, you derive a constraint which has to be respected by the reads of all processes.

#### FIFO consistency

> "Writes done by a single process are seen by all others in the order in which they were issued; writes from different processes may be seen in any order at different machines"

A very simple consistency model where causality across processes is dropped. It is implied by causal consistency. In practice, in the exercises, for each pair of "write-write" operations in each process you derive a constraint which has to be respected by the reads of all processes.

#### Eventual consistency

There are scenarios where simultaneous updates are unlikely and where read operations are more prevalent: examples include web caches, DNS, and geo-distributed data stores like those behind Facebook/Instagram. In these types of systems, eventual consistency is often deemed satisfactory, as it guarantees that updates will **eventually** propagate to all replicas:

- Very easy to implement
- Very few conflicts in practice
- Today's networks offer fast propagation of updates

It is widely used in practice, in scenarios where the order of messages is not important. Note that eventual consistency does not imply any FIFO/causal/sequential consistency.

## Client-centric consistency models

Client-centric consistency models take the point of view of a single client who is reading/writing from multiple locations.

| Concept | Definition | Example |
|------------------|-------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|
| Monotonic Reads | Subsequent reads by a process will always see the same or newer values. | Reading a forum thread will always show the latest replies, never reverting to older posts. |
| Monotonic Writes | Writes by a process are completed in order. | Comments on a blog are published in the order they're written by a user. |
| Read Your Writes | A process will see its own writes in successive reads. | After posting a tweet, you'll see your tweet when you refresh the page. |
| Writes Follow Reads | Writes by a process reflect the latest read value. | Replying to a message only after you've seen the most recent messages in the conversation. |

### Client-centric consistency implementations

In this scenario, each operation has a unique identifier (replica ID + a sequence number) and two sets are assigned to each client:

- the **read-set**: the identifiers of the writes relevant for the read operations executed by the client
- the **write-set**: the identifiers of the writes performed by the client

These sets can be represented using **vector clocks**, which keep track of the latest read/write identifiers from each replica (sketched below).
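A minimal sketch of this bookkeeping, assuming hypothetical `Client` and helper names rather than any specific system's API: each set maps a replica ID to the highest sequence number seen, and the session guarantees reduce to dominance checks between vector clocks.

```python
def merge(clock_a, clock_b):
    """Component-wise maximum of two vector clocks."""
    return {r: max(clock_a.get(r, 0), clock_b.get(r, 0))
            for r in clock_a.keys() | clock_b.keys()}

def dominates(clock_a, clock_b):
    """True if clock_a has seen at least everything in clock_b."""
    return all(clock_a.get(r, 0) >= seq for r, seq in clock_b.items())

class Client:
    def __init__(self):
        self.read_set = {}   # identifiers of writes my reads depended on
        self.write_set = {}  # identifiers of writes I performed myself

    def on_write(self, replica_id, seq):
        self.write_set = merge(self.write_set, {replica_id: seq})

    def on_read(self, replica_clock):
        # Monotonic reads / read-your-writes: the contacted replica
        # must already contain everything in both client sets.
        assert dominates(replica_clock, merge(self.read_set, self.write_set))
        self.read_set = merge(self.read_set, replica_clock)
```

If the contacted replica's clock does not dominate the client's sets, the replica must first fetch the missing writes, or the client is redirected to a sufficiently up-to-date replica.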
## Design Strategies

### Replica placement

|Type|Description|
|---|---|
|**Permanent Replicas**|Statically configured; used in systems like DNS and CDNs.|
|**Server-Initiated**|Created dynamically to cope with access load, moving data closer to clients.|
|**Client-Initiated**|Rely on client caches, which can be shared among clients for enhanced performance.|

### Update Propagation

#### What to Propagate

|Method|Description|
|---|---|
|**Notification Only**|The update is performed and only a notification is propagated; used with invalidation protocols. Best if `#reads << #writes`.|
|**Transfer Modified Data**|Transfer the modified data to all copies; efficient when `#reads >> #writes`.|
|**Enable Update Operation**|Propagate the information needed to perform the update operation at the other copies (active replication).|

#### How to Propagate

|Approach|Description|
|---|---|
|**Push-Based**|Updates are propagated to all replicas, regardless of need; preserves high consistency.|
|**Pull-Based**|Updates are fetched on demand; convenient if `#reads << #writes`; commonly used to manage client caches.|
|**Leases**|Used to switch dynamically between push and pull approaches.|

| |State of server|Messages sent|Response time at client|
|---|---|---|---|
|**Push-based**|List of client replicas and caches|Update (and possibly fetch update later)|Immediate (or fetch-update time)|
|**Pull-based**|None|Poll and update|Fetch-update time|

#### Propagation Strategies

|Protocol|Description|
|---|---|
|**Leader-Based**|Synchronous, asynchronous, or semi-synchronous propagation.|
|**Leaderless**|Includes read repair and anti-entropy processes.|

## Case studies

### Case study: Spanner

[Spanner](https://cloud.google.com/spanner/docs/true-time-external-consistency) is a globally-distributed database developed by Google. It is specifically designed to handle very large databases, utilizing a partitioned approach where each partition is replicated.

- **Design**: Spanner is designed for very large databases with many partitions, each of which is replicated.
- **Techniques**: It uses standard techniques:
    - **single-leader replication** with **Paxos** for fault-tolerant agreement between leader and followers
    - **2PC** for **atomic commits**
    - **timestamp protocols** for concurrency control
- **TrueTime**: Spanner's novelty lies in TrueTime, which uses very precise clocks (atomic clocks + GPS) to provide an API that returns an uncertainty interval. The "real" time is guaranteed to be within this interval, which is crucial for deciding when to commit read-write transactions.
- **Transactions**: Read-write transactions use TrueTime to decide when to commit, waiting until the uncertainty interval has certainly passed. This ensures transactions are ordered based on time, achieving **linearizability**. Read-only transactions also acquire a timestamp through TrueTime, allowing them to read the latest value at that time without locking, optimizing frequent read-only operations (the commit-wait rule is sketched after this list).
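A toy sketch of the commit-wait rule built on a simulated TrueTime API; the `epsilon` bound and function names here are assumptions for illustration, while real uncertainty bounds come from Spanner's GPS/atomic clock infrastructure:

```python
import time

def truetime_now(epsilon=0.007):
    """Simulated TrueTime: returns an interval (earliest, latest)
    guaranteed to contain real time; epsilon is the clock uncertainty
    (an arbitrary value here, around 7 ms for the sake of the example)."""
    t = time.time()
    return (t - epsilon, t + epsilon)

def commit_wait():
    """Commit-wait: pick the commit timestamp s = TT.now().latest, then
    wait until TT.now().earliest > s. After the wait, real time has
    certainly passed s, so timestamp order matches real-time order and
    read-write transactions are linearizable."""
    _, s = truetime_now()          # commit timestamp
    while truetime_now()[0] <= s:  # wait out the uncertainty interval
        time.sleep(0.001)
    return s
```

The smaller the uncertainty, the shorter the wait, which is why Spanner invests in very precise clocks.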
### Case study: Calvin

Calvin is designed for the same settings as Spanner. It adopts a sequencing layer to order all incoming requests, both reads and writes.

- **Guarantees**: Calvin provides **linearizability** through the sequencing layer that orders all incoming requests.
- **Operation**: It uses a replicated log, implemented using Paxos, requiring operations to be deterministic and executed in the same order everywhere.
- **Advantages**: The system reduces lock contention by achieving agreement on the order of execution before acquiring locks, and it eliminates the need for 2PC because transactions are deterministic (they either succeed or fail in all replicas).

### Case study: VoltDB

- **Developer Role**: Developers specify how to partition database tables and transactions; this allows the database to organize data efficiently based on query hints.
- **Execution**: Single-partition transactions can execute sequentially on their partition without coordinating with other partitions (see the sketch after this list).
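A toy illustration of the single-partition idea, with hypothetical names rather than VoltDB's actual API: each partition owns a disjoint slice of the keys and applies its transactions serially on a single thread, so single-partition transactions need neither locks nor cross-partition coordination.

```python
import queue
import threading

class Partition:
    """One partition: a single thread applies transactions serially,
    so single-partition transactions need no locks at all."""
    def __init__(self):
        self.store = {}
        self.inbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            txn, done = self.inbox.get()
            txn(self.store)  # deterministic procedure over local data only
            done.set()

    def execute(self, txn):
        done = threading.Event()
        self.inbox.put((txn, done))
        done.wait()

partitions = [Partition() for _ in range(4)]

def route(key, txn):
    """Hash-partitioning: a transaction touching a single key runs
    entirely on the one partition that owns that key."""
    partitions[hash(key) % len(partitions)].execute(txn)

route("user:42", lambda store: store.update({"user:42": "hello"}))
route("user:42", lambda store: print(store["user:42"]))  # prints "hello"
```

Transactions that span several partitions would instead require coordination across the partition threads, which is exactly the case VoltDB's developer-specified partitioning tries to keep rare.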