Introduction:

In a database with multiple nodes, inconsistencies occur because write requests arrive at different times on different nodes. This issue persists regardless of the replication method used (single-leader, multi-leader, or leaderless replication). Eventually consistent databases do not guarantee when replicas will converge to the same value, only that they will “eventually” do so.

Linearizability, a strong consistency model, addresses this by making the system appear as if there is only one copy of the data. This ensures that all operations are atomic and every client sees the same view of the data, eliminating concerns about replication lag (the time it takes for changes made on one replica to be reflected on other replicas is effectively hidden from the client). The exact definition of linearizability is subtle, but it ensures that applications do not need to worry about multiple replicas.

Linearizable storage: all writes are applied in a single, well-defined order. When we read from linearizable storage, our reads never go back in time: a read always returns the most recent value, never a stale one.

In other words, linearizability is a recency guarantee.
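
As a rough mental model (not from the book), a linearizable register behaves like a single value guarded by a lock: every operation takes effect atomically at one point, and a read can never return anything older than the latest completed write. A minimal single-process sketch in Python (all names are hypothetical):

```python
import threading

class LinearizableRegister:
    """Single-copy register: every read sees the latest completed write."""

    def __init__(self, initial=None):
        self._value = initial
        self._lock = threading.Lock()  # serializes operations into one total order

    def write(self, value):
        with self._lock:
            self._value = value        # takes effect atomically at one point in time

    def read(self):
        with self._lock:
            return self._value         # never older than the last completed write
```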

Example (Violation of Linearizability):

  • Alice and Bob check the 2014 FIFA World Cup final score on their phones.
  • Alice refreshes first, sees the final score, and tells Bob.
  • Bob refreshes after Alice, but his phone shows the game is still ongoing because his request was served by a lagging database replica.
  • Bob expects his result to be at least as recent as Alice’s since he refreshed after her.
  • Bob getting a stale result violates linearizability: once Alice has read the final score, any read that starts after hers must return a value at least that recent.


When is linearizability useful?

1. Locking and Leader election:
  • Ensures only one node can acquire a lock and become the leader in distributed systems.
  • Single-leader replication requires a linearizable lock to prevent split-brain scenarios.
  • No matter how this lock is implemented, it must be linearizable: all nodes must agree which node owns the lock; otherwise it is useless.
  • Used by coordination services like Apache ZooKeeper and etcd to implement fault-tolerant leader election (a minimal compare-and-set sketch follows this list).
2. Constraints and Uniqueness Guarantees:
  • Enforces uniqueness constraints such as unique usernames or email addresses in databases.
  • Prevents concurrent transactions from creating duplicate records or overbooking resources.
  • Example: Ensuring a bank account balance never goes negative or preventing double booking of a seat.
3. Cross-channel Timing Dependencies:
  • Avoids race conditions between different communication channels in a system.
    • Example: a web server stores an uploaded photo in a storage service and then puts a resize request on a message queue for a background re-sizer. If the storage service is not linearizable, the queue might deliver the resize request faster than the storage service can replicate the image, so the re-sizer fetches an outdated or missing image, leading to inconsistencies between the full-size and resized images in storage.
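
Both lock acquisition and uniqueness constraints boil down to a linearizable compare-and-set. Below is a minimal sketch, assuming a toy single-copy in-memory store; the helper names (try_acquire_lock, claim_username) are hypothetical and not the API of any real coordination service:

```python
import threading

class LinearizableKV:
    """Toy single-copy key-value store with an atomic compare-and-set."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def compare_and_set(self, key, expected, new):
        with self._lock:
            if self._data.get(key) == expected:
                self._data[key] = new
                return True
            return False

store = LinearizableKV()

def try_acquire_lock(node_id):
    # Leader election: only one node can move the lock key from None to its id.
    return store.compare_and_set("leader-lock", None, node_id)

def claim_username(username, user_id):
    # Uniqueness constraint: only one user can claim a given username.
    return store.compare_and_set("username:" + username, None, user_id)

assert try_acquire_lock("node-1")        # first node wins the lock
assert not try_acquire_lock("node-2")    # lock is already held
assert claim_username("alice", 42)       # username is free
assert not claim_username("alice", 43)   # duplicate registration rejected
```

Because the store behaves as a single copy, all nodes agree on who holds the lock and which usernames are taken; without linearizability, two nodes could both believe their compare-and-set succeeded.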

Implementing Linearizable Systems:

Could we get away with a single copy of the data? No: a single copy is not fault-tolerant, since a crash of that node means unavailability or data loss. The most common approach to making a system fault-tolerant is to use replication.

Single-Leader Replication (Potentially Linearizable):

  • Leader and Followers: The leader handles all writes and maintains the primary data copy, while followers keep backup copies.
  • Linearizability: achievable if reads are served from the leader or from synchronously updated followers, not from followers that may still be lagging behind (see the sketch after this list).
  • Challenges:
    • Correctly identifying the current leader is crucial. A node that mistakenly believes it is the leader can violate linearizability (the split-brain problem).
    • Asynchronous replication and failovers can lead to loss of committed writes, affecting both durability and linearizability.
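
A toy illustration of why reads must be routed to the leader (or to a fully caught-up follower); the class and method names are hypothetical, and replication is modeled as an explicit replicate() step:

```python
class SingleLeaderStore:
    """Toy single-leader setup: the leader holds the primary copy, followers lag."""

    def __init__(self, n_followers=2):
        self.leader = {}
        self.followers = [{} for _ in range(n_followers)]

    def write(self, key, value):
        self.leader[key] = value               # all writes go through the leader

    def replicate(self):
        for f in self.followers:
            f.update(self.leader)              # followers catch up asynchronously

    def linearizable_read(self, key):
        return self.leader.get(key)            # leader always reflects the latest write

    def possibly_stale_read(self, key, follower=0):
        return self.followers[follower].get(key)  # may lag behind the leader

store = SingleLeaderStore()
store.write("x", 1)
print(store.linearizable_read("x"))    # 1
print(store.possibly_stale_read("x"))  # None: the follower has not caught up yet
```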

Consensus Algorithms (Linearizable):

  • Similarity to Single-Leader Replication: Consensus algorithms share some characteristics with single-leader replication.
  • Features:
    • Prevent split-brain scenarios and stale replicas.
    • Ensure safe linearizable storage.
    • Examples: ZooKeeper and etcd implement consensus algorithms to maintain linearizability.

Multi-Leader Replication (Not Linearizable):

  • Concurrent Writes: Multiple nodes process writes simultaneously and replicate them asynchronously.
  • Conflicts: Concurrent writes can lead to conflicts that require resolution.
  • Impact: The lack of a single copy of data generally results in non-linearizable behavior.

Leaderless Replication (Probably Not Linearizable):

  • Quorum Reads and Writes: quorum reads and writes (w + r > n) are sometimes claimed to provide strong consistency, but this is not always true.
  • Challenges:
    • Last Write Wins: Methods relying on time-of-day clocks, like in Cassandra, are prone to inconsistencies due to clock skew.
    • Sloppy Quorums: Practices like sloppy quorums and hinted handoff compromise any chance of achieving linearizability.

Figure 9-6 Explanation

  • Initial Setup:
    • Initial value of x is 0.
    • A writer updates x to 1, sending the write to all three replicas (n = 3, w = 3).
    • Two readers, A and B, read from a quorum of two nodes (r = 2).
  • Concurrent Operations:
    • Reader A: Reads from a quorum of two nodes and sees the new value 1 on one of the nodes.
    • Reader B: Reads from a different quorum of two nodes and gets back the old value 0 from both.
  • Quorum Condition: w + r > n is satisfied, yet the execution is not linearizable (a toy simulation follows this explanation):
    • Reader B’s request begins after Reader A’s request completes but returns the old value while Reader A returns the new value.
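
A toy simulation of this scenario (hypothetical code, not from the book): with n = 3, w = 3, r = 2, a write that has so far reached only one replica lets reader A see the new value while reader B, which starts later, still sees the old one:

```python
# Three replicas, all initially holding x = 0.
replicas = [{"x": 0}, {"x": 0}, {"x": 0}]

def quorum_read(replica_indexes, key):
    """Read from a quorum of replicas and return the newest value seen."""
    return max(replicas[i][key] for i in replica_indexes)

# The writer sends x = 1 to all three replicas (w = 3), but so far the write
# has only arrived at replica 2; the other two updates are still in flight.
replicas[2]["x"] = 1

# Reader A's quorum (r = 2) happens to include the updated replica: it sees 1.
a = quorum_read([1, 2], "x")

# Reader B starts *after* A has finished, but its quorum misses the update: it sees 0.
b = quorum_read([0, 1], "x")

print(a, b)  # 1 0 -> B reads an older value than A, so the execution is not linearizable
```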

Note

Linearizability in Dynamo-Style Quorums

Making Quorums Linearizable:

  • Read Repair: a reader must perform read repair synchronously before returning results to the application (a minimal sketch follows this list).
  • Writer Behavior: a writer must read the latest state of a quorum of nodes before sending its writes.
  • Performance Trade-Off:
    • Synchronous read repair incurs a performance penalty.
    • Riak does not perform synchronous read repair, to avoid the performance loss.
    • Cassandra does perform synchronous read repair, but still loses linearizability if there are multiple concurrent writes to the same key, due to last-write-wins conflict resolution.
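
A minimal sketch of synchronous read repair on top of toy replicas; names are hypothetical, and versioning is simplified to a (version, value) pair where a higher version wins. The key point is that stale replicas in the read quorum are updated before the result is returned to the application:

```python
# Each replica stores a (version, value) pair; the higher version is newer.
replicas = [(0, "old"), (0, "old"), (1, "new")]

def quorum_read_with_repair(replica_indexes):
    # Collect responses from the read quorum.
    responses = {i: replicas[i] for i in replica_indexes}
    newest = max(responses.values())            # (version, value) with the highest version
    # Synchronous read repair: fix stale replicas *before* returning to the application.
    for i, response in responses.items():
        if response < newest:
            replicas[i] = newest
    return newest[1]

print(quorum_read_with_repair([1, 2]))  # "new", and replica 1 gets repaired on the way
print(replicas)                         # [(0, 'old'), (1, 'new'), (1, 'new')]
```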

The Cost of Linearizability

Assumption: the network within each datacenter is working and clients can reach the datacenters, but the datacenters cannot connect to each other.

Multi-Leader Database: each datacenter can continue operating normally. Writes from one datacenter are asynchronously replicated to the others, so during the outage they are simply queued up and exchanged when network connectivity is restored.

Single Leader Database:

  • No writes and no linearizable reads for clients connected to follower databases that cannot contact the leader.
  • What if the application needs linearizable reads and writes?
    • Unavailability: The application becomes unavailable in datacenters that cannot contact the leader in another datacenter. This is because linearizable reads and writes require access to the leader to ensure the most recent write is read.
    • Non-Linearizable Reads: Reads performed by clients connected to follower databases will not be linearizable. These reads might return stale data since the followers may not have the most recent updates from the leader.

CAP Theorem:

This issue is not just a consequence of single-leader and multi-leader replication: any linearizable database has this problem, no matter how it is implemented. The issue also isn’t specific to multi-datacenter deployments, but can occur on any unreliable network, even within one datacenter.

The trade-off is as follows:

  • If your application requires linearizability, and some replicas are disconnected from the other replicas due to a network problem, then some replicas cannot process requests while they are disconnected: they must either wait until the network problem is fixed, or return an error (either way, they become unavailable).
  • If your application does not require linearizability, then it can be written in a way that each replica can process requests independently, even if it is disconnected from other replicas (e.g., multi-leader). In this case, the application can remain available in the face of a network problem, but its behavior is not linearizable.

Warning

Network Partition: A network partition occurs when a distributed system splits into two or more groups of nodes that cannot communicate with each other due to network failures. This results in isolated segments where nodes within a segment can communicate with each other but not with nodes in other segments.

The Unhelpful CAP Theorem

CAP is sometimes presented as Consistency, Availability, Partition tolerance: pick 2 out of 3. Unfortunately, putting it this way is misleading because network partitions are a kind of fault, so they aren’t something about which you have a choice: they will happen whether you like it or not. At times when the network is working correctly, a system can provide both consistency (linearizability) and total availability. When a network fault occurs, you have to choose between either linearizability or total availability.

Thus, a better way of phrasing CAP would be in the presence of a partition, a distributed system can provide either consistency or availability, but not both. A more reliable network needs to make this choice less often, but at some point the choice is inevitable.

  • The CAP theorem as formally defined is of very narrow scope: it only considers one consistency model (namely linearizability) and one kind of fault (network partitions, or nodes that are alive but disconnected from each other).
  • It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP has been historically influential, it has little practical value for designing systems.

Linearizability and Network Delays:

Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable in practice.

  • Example with Multi-Core CPUs:
    • Modern multi-core CPUs are not linearizable.
    • One CPU core writing to a memory address and another core reading it shortly afterward may not guarantee the updated value due to caching mechanisms.
  • Memory Caches and Store Buffers:
    • Each CPU core has its own memory cache and store buffer.
    • Memory access typically goes to the cache first, with changes asynchronously written to main memory.
    • Multiple copies of data exist (in cache and main memory), leading to asynchronous updates and loss of linearizability.

Why this trade-off? It makes no sense to use the CAP theorem to justify the multi-core memory consistency model: within one computer we usually assume reliable communication, and we don’t expect one CPU core to be able to continue operating normally if it is disconnected from the rest of the computer. The reason for dropping linearizability is performance, not fault tolerance. The same is true of many distributed databases that choose not to provide linearizable guarantees: they do so primarily to increase performance, not so much for fault tolerance. Linearizability is inherently slow and this is true all the time, not only during a network fault.

Question: Can’t we find a more efficient implementation of linearizable storage? Answer: it seems the answer is no; the cost is inherent to the recency guarantee, not an artifact of particular implementations.


Order in the Chaos: The Wacky World of Linearizability, Leaders, and Timestamps

A linearizable register behaves as if there is only a single copy of the data, and that every operation appears to take effect atomically at one point in time. This definition implies that operations are executed in some well-defined order. It turns out that there are deep connections between ordering, linearizability, and consensus!

Ordering and Causality : Why You Can’t Answer a Question Before It’s Asked (And Other Time-Travel Taboos)

Ordering helps preserve causality.

What is causality?

Causality in distributed systems refers to the principle that cause precedes effect, meaning that events should occur in a logical order where a cause (e.g., a question) must be observed before its effect (e.g., an answer). This concept is crucial in ensuring consistency across systems, particularly in scenarios involving replication, transactions, and concurrent operations.

  • Causality violations can lead to confusing outcomes, such as observing an answer before its corresponding question or an update to a row before the row is created. These violations can occur due to network delays or concurrent operations, where the order of events may not be preserved across different nodes or transactions.

  • In snapshot isolation, a consistent snapshot means one that respects causality, where the effects of operations are only visible if their causes are also present. Similarly, write skews in transactions can result from overlooked causal dependencies, leading to decisions based on outdated or incomplete information.

Causality tracking is essential for maintaining the integrity of operations in distributed systems, ensuring that the logical order of events is preserved, thereby avoiding anomalies such as read or write skews. Overall, understanding and enforcing causality is key to achieving consistency and correctness in distributed systems.

1. Partial Order:

Partial Order means that not all events are comparable with each other. Some events can be arranged in a sequence because one event causes or influences another, but other events might be independent or concurrent, meaning they don’t have a defined order relative to each other. Example: Suppose we have three events:

  • A: Boiling water
  • B: Pouring the water into a cup
  • C: Taking a mug from the cupboard

Here, “Boiling water” (A) must happen before “Pouring the water” (B). So, A and B are causally related and can be ordered: A → B.

However, “Taking a mug” (C) can happen independently of “Boiling water” (A) or “Pouring the water” (B). So, we can’t definitively say whether C happens before or after A or B. These events (A and C, B and C) are incomparable in time and thus are not ordered.

The result is a partial order: some events are ordered (A before B), and some are not (A and C, B and C).
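
One common way to capture this partial order in code is with version vectors / vector clocks. A minimal sketch, assuming each event carries a per-node counter vector (the helper names are hypothetical):

```python
def happens_before(v1, v2):
    """True if the event with vector v1 causally precedes the event with vector v2."""
    nodes = set(v1) | set(v2)
    return all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes) and v1 != v2

def compare(v1, v2):
    if happens_before(v1, v2):
        return "v1 -> v2"
    if happens_before(v2, v1):
        return "v2 -> v1"
    return "concurrent (incomparable)"

boil = {"kitchen": 1}     # A: boiling water
pour = {"kitchen": 2}     # B: pouring the water (causally after A)
mug  = {"cupboard": 1}    # C: taking a mug, on an independent "node"

print(compare(boil, pour))  # v1 -> v2: A happened before B
print(compare(boil, mug))   # concurrent (incomparable): no defined order between A and C
```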

2. Total Order:

Total order means that all events can be arranged in a single, linear sequence where every event is comparable to every other event. In other words, you can always say which event happened first, second, third, and so on. Example: If you had a timeline of events where every action follows another, like:

  • A: Turn on the stove
  • B: Boil water
  • C: Pour the water into a cup

In a total order, you can definitively say A → B → C, with no ambiguity about the sequence.

Figure: causal (partial) order vs. total order.

  1. Total Order vs. Partial Order in Consistency Models:
    1. Linearizability: Ensures a total order of operations, where all actions appear as if they occur sequentially on a single copy of data. In a linearizable system, you can always determine the exact order of operations, with no concurrency allowed.
    2. Causality: Defines a partial order of operations based on causal relationships. Some operations are ordered if they are causally related, but others may be concurrent and thus incomparable, meaning their order can’t be definitively determined.
  2. Concurrency and Linearizability:
    1. In a linearizable datastore, there is no concurrency. All operations are totally ordered along a single timeline, with no branching or merging of operations.
    2. Concurrency would create branches in the timeline, where operations on different branches are incomparable.
  3. Distributed Version Control Systems (e.g., Git):
    1. Version histories in systems like Git reflect a graph of causal dependencies.
    2. Typically, commits happen in a straight line (sequentially), but branches occur when multiple people work concurrently.
    3. Merges represent the combination of concurrently created commits, illustrating the concept of partial ordering and causal dependencies in a version control context.

Linearizability is stronger than causal consistency!

Linearizability implies causality! Systems that offer linearizability are easy to reason about, but they are inherently slow, which is why many distributed systems have abandoned it. The good news is that a middle ground is possible: linearizability is not the only way of preserving causality; there are other ways too (no need to go into the details here).

Causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures.

How do we order our writes?

  1. Single Leader Replication: Replication Log
  2. Multi-Leader Replication:
    1. Version Vector
    2. Lamport Clocks (a minimal sketch follows)
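
A minimal Lamport clock sketch (hypothetical class; each node increments its counter on every local event and takes the maximum of the counters when it receives a message). Breaking ties by node id turns the resulting (counter, node_id) timestamps into a total order that is consistent with causality:

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        self.counter += 1
        return (self.counter, self.node_id)    # timestamp: (counter, node id)

    def send(self):
        return self.local_event()              # attach a timestamp to the outgoing message

    def receive(self, msg_timestamp):
        # Take the max of our counter and the sender's counter, then increment.
        self.counter = max(self.counter, msg_timestamp[0]) + 1
        return (self.counter, self.node_id)

a, b = LamportClock("A"), LamportClock("B")
t1 = a.send()           # (1, 'A')
t2 = b.receive(t1)      # (2, 'B'): causally after t1, and ordered after it
t3 = b.local_event()    # (3, 'B')
print(sorted([t3, t1, t2]))  # total order: [(1, 'A'), (2, 'B'), (3, 'B')]
```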