Consistency Guarantees in Distributed Systems Explained Simply

Kousik Nath
34 min readMay 16, 2021

--

Unrelated and Unusual Note: I am devastated seeing India’s helpless COVID-19 second wave situation due to severe Oxygen, medical equipment supply shortage and broken supply chain. I have been in a dilemma whether to publish any article at all since so many people are posting all around to seek help, the overall mood is a bit gloomy. However, it’s part of life and learning should never stop as long as we are healthy. So, here I am with the hope and prayer to see India coming out stronger.
Let’s pray for India together and continue our learning!

Introduction

When we discuss system design whether for a real or an imaginary system, most of us quite frequently use buzz words like strong consistency, eventual consistency etc. But do we really know what these terms exactly mean? What low level guarantees does strong consistency provide that eventual consistency does not provide? How do we reason about using such concepts in our design? There are lot of confusion in how different sources define consistency guarantees, language used is also very confusing, many articles even contradict with each other.

In this article, we’ll dive deep into the details of few important consistency guarantees in easy to understand words.

What and Why Consistency Guarantees

Imagine there is only a single thread accessing an object in your system. There is no interference from any other thread. How easy is it to define the correctness of the operations that happened on that object? Since operations are happening one by one by in a single thread, each operation completes only after the previous operation finishes. Hence, if we take inputs one by one, run the program to manipulate the object and manually verify the output, as long as the output is consistent as per the input provided, the operations are valid. But how would you define correctness in a multi threaded, multi processor environment where multiple threads / processes may access the same memory location together?

As developers, we need certain guarantees from the system regarding how different operations would be ordered, how updates would be visible and essentially what would be the impact on performance and correctness of the system in the presence of multiple threads of execution or processes. Consistency model defines that abstraction. It’s not a fancy new term in distributed systems, in fact, the concept already exists in the world of computer memory. Distributed systems like new age databases, cache, file system etc extend the same concept in their own way.

Distributed systems are all about choosing the right trade-offs, similarly, we’ll shortly see, consistency models are trade-offs between concurrency of operations vs ordering of operations or in other words, between performance and correctness of the operations.

The History

Historically, database systems and distributed systems are actually two different world. While distributed systems world concerns about highly concurrent machines interacting among each other (across geo locations), traditional databases research used to be more focused on non-distributed systems. But as the world became increasingly more online, massive amount of data started pouring into companies resulting in serious thoughts in distributed (cross geography) databases — NoSQL and then NewSQL etc. Essentially, a distributed database is a database built on the principle of distributed systems, hence it had to borrow the notion of consistency from the world of distributed systems.

Consistency is confusing, it’s an over loaded term in computing. It has different meaning in different contexts:

Consistency in Database Systems: Database consistency ensures the data base entities are in healthy state i.e; they follow all the application integrity constraints. Ex: any operation must make sure maximum data type length, unique key, primary key, foreign key constraints etc are absolutely maintained. It does not care about the operation operating on the given data, rather the relevance of the outcome of the operation is more important.

Consistency in Distributed Systems: Consistency in distributed systems means every node / replica has the same view of data at a given point in time irrespective of whichever client has updated the data. When you make a request to any node, you receive the exact same response (even if it’s an error) so that from the outside, it looks like there is a single node performing all the operations.

Note: In this article, we may use the terms distributed system, distributed database, data store or data storage interchangeably across different examples.

The Challenge with consistency in Distributed Systems

For serving massive online traffic in the modern internet, it’s very common to have an infrastructure set up with multiple replicas (or partitions). Whether it’s your Facebook activity, GMail data or Amazon’s order history — everything is replicated across datacenters, availability zones and possibly across countries to ensure data is not lost and the systems are always highly available in case one of the replica crashes. This poses a challenge — how to have consistent data across replicas? Without consistency, a mail that you have sent recently through GMail might disappear, an item deleted from Amazon cart might reappear. Or even worse, a financial transaction might be lost causing thousands of $$$ loss. While losing cart items is okay at times but losing $$$ is a big NO NO!

Hence different systems have different consistency requirements and systems should offer flexibility in how to configure consistency for different kind of operations.

Consistency Guarantees

Before we dive deep into discussing consistency guarantees, let’s look at few important definitions we are going to encounter frequently.

Program Order
A program is typically a sequence of instructions. When the program runs as a process, the instructions are run in the same order and this is called program order.

Unit of execution
An actor which operates on a memory location / object / variable is called an unit of execution in our article. A process in a computer system, a thread in a multi threaded environment or a client making a request (REST / GRpc call whatever) to an external (distributed or non-distributed) system is an unit of execution. Note that any such client request boils down to being a thread or process in the machine which processes the request since essentially a distributed system is nothing but a collection of machines capable of running multiple threads and processes concurrently or in parallel. We’ll use this term heavily in our article.

History
A sequence of operations executed ( possibly by multiple different units of execution) on an object / memory location is called history.

Perspective
There are majorly two different perspectives based on how the consistency guarantee synchronizes data:
Data-centric:
The system is aware of multiple units of execution acting on given data and it synchronizes data access operations from all of them to guarantee correct result.

Client-centric: The system only cares about data access operations from the concerned process operating on the given data and it assumes that other unit of executions either don’t exist or don’t have any significant impact.

The following shows few important consistency guarantees which we’ll discuss in this article.

Figure 1: Consistency guarantees scale

Broadly, there are two kinds of consistency: Weak Consistency and Strong Consistency.

Weak Consistency

I believe most of us are familiar with NoSQL data stores like MongoDB, Amazon Dynamo DB, Cassandra etc. If not, at least we have heard about it. These systems are usually known for built in high availability and performance. In the presence of partition and network issues, they embrace weakness in consistency to support such behaviour. As you can see in Figure 1, weaker consistency means higher availability, performance and throughput although more anomalous data.

Let’s look at few types of weak consistency guarantees below.

Eventual Consistency

Eventual consistency is the weakest and probably the most popular consistency level, thanks to the NoSQL movement and all content creators who have PhD in eventual consistency :P.

In a distributed system, replicas eventually converge to the same state. Given no write operation is in progress for a given data item, eventual consistency guarantees that all replicas start serving Read requests with the last updated value.

Figure 2: Eventual Consistency with replication. Courtesy: Google Cloud

The above example uses a replication scenario to explain eventual consistency:

  • All the replicas i.e; Node B and Node C are highly available for reads.
  • A write on Node A for X originates in Data Center 1. Since Node B
    co-exists with Node A in the same data center, the replication is fast there possibly apparently instantaneous.
  • However, it takes some time for cross data center replication i.e; from
    Node A in Data Center 1 to Node C in Data Center 2.
  • While the cross data center replication is going on, if any Read request occurs in Node C for X, it returns old X whereas the same request could have received the latest value if it had landed on Node B meaning the overall system is eventually consistent.

Properties:

  • There is absolutely no ordering guarantee on Reads and Writes. Arbitrary order applicable.
  • Any unit of execution which writes some value to an object, upon reading the same object back from another replica, the update could be invisible.
  • Reading the same data from different nodes simultaneously may return stale data.

Visualization:

Figure 3: Musical node illustration of eventual consistency. Courtesy: Microsoft

Metrics:

  • Data consistency: Lowest
  • App availability: High to Highest
  • Latency: Low
  • Throughput: Highest
  • Perspective: Client-centric

Real Life Examples:

  • The best and well known example is Internet Domain Name System (DNS) that successfully caters to billions of requests daily. DNS is a hierarchical, highly available system and it takes some time to propagate the update for a given entry across DNS servers and clients.
  • Review or ratings of products in Amazon.
  • Count of likes in Facebook.
  • Views on YouTube videos.
  • Stream of comments on Facebook live videos.
  • Ticket price shown on the front page in an airline website.
  • Fetching how many Facebook friends / WhatsApp contacts are online.

Consistent Prefix Read

Consistent Prefix Read guarantees no replica reads data out of order even though the data read is stale at a given point in time. If a data x has gone through 3 versions say A, B, and C respectively with C being the most updated version, every replica receives these updates in the same order. At some point in time t, when you query a replica for x:

  • It serves version A if it has replicated version A only till time t. Any future query won’t read x with version less than A.
  • It serves version B if it has replicated version A, B in order till time t. Any future query won’t read x with version less than B.
  • It serves version C if it has replicated version A, B, C in order till time t. Any future query won’t read x with version less than C.
  • and so on…
Figure 4: Consistent Prefix Read

Here,

  • Replica 1, 2, and 3 replicates different versions of x in the order in which it’s written to the Primary node.
  • However, there could be huge gap between any two such replications. Example: Replica 3 receives x = 1 at time t3 and x = 2 at time
    t7 > t3.
  • At time t5, if you query all these nodes for x, the Primary returns
    x = 3, Replica 1 returns x = 2, Replica 2 returns x = 3 and Replica 3 returns x = 1. Hence different nodes return different versions of the same data at the same point in time, still they guarantee Consistent Prefix Read here.

Properties

  • Dirty Read / Stale Read is possible.
  • Global ordering guarantee applies for a given piece of data across replicas. Hence any unit of execution reads the operations in the same order.
  • No bound on staleness. A replica can replicate the latest version of a data in 2 ms whereas another can do in 100 ms or 200 ms or any arbitrary time. This makes Consistent Prefix Read a weaker consistency guarantee.
  • At time t1, if a replica serves version A of a data, at time t2 > t1, it’ll either serve version A or higher if a newer version of the data gets replicated but never any lesser one.

Visualization:

Figure 5: Musical node illustration of Consistent Prefix Read. Courtesy: Microsoft

Metrics:

  • Data consistency: Low
  • App availability: High
  • Latency: Low
  • Throughput: Moderate
  • Perspective: Data-centric

Real Life Examples:

  • Sports apps which track score of soccer, cricket etc.
  • Social media timelines (sorted by recency).

Session Guarantees

Session: It’s an abstract concept to correlate multiple Read and Write operations together as a group. Example: when you login to Amazon for shopping, a session is created internally which keeps track of your activities and browsing history, cart updates for that session. Sessions can be identified with a unique id called session id. The life time of a session could be few seconds to days or more depending on the business use case. When a group of Read / Write operations are to be performed, we can bind all these operations together within a session, the client can keep on passing the corresponding session id with all requests to help correlate them. We can provide a bunch of consistency guarantees on all the operations in a session. These guarantees ensure that the unit of execution does not at least see any anomaly in Read and Write operations during the active session. Across sessions, the guarantees may not apply. Hence, these guarantees come under weak consistency.

Session guarantees are particularly useful when strong consistency is required for the concerned unit of execution in a session but not globally.

Let’s look at four important session guarantees below:

Read-Your-Own-Write (RYOW) / Read-My-Write
In a distributed system, depending on the architecture, either any replica can accept Read or Write requests or a particular designated node (called leader) accepts writes and all other nodes (called followers) accept Read requests. In reality, there is always some replication lag especially for cross datacenter traffic and it causes the follower nodes or Read serving nodes to return stale data in response to Read requests even in the same session. This is a very common case in many modern NoSQL based systems. If this happens, we say, Read-Your-Own-Write / Read-My-Write consistency is violated.

Consider the following example.

Figure 6: Read Your Own Write Violation
  • An unit of execution P1 writes x = 7 and some replica R1 handles the request.
  • Then it updates x to 5 and another replica R2 handles the request.
  • Next time, when P1 reads x, it expects to see 5, rather it observes the older value 7 probably because the replica R1 has handled the Read request and it has not yet seen the latest update handled by R2.
  • The same concept can be extended in computer memory system also by replacing replica with processors.

This is a clear violation of RYOW consistency since there is only a single unit of execution performing the Write and Read operations yet it’s unable to see its own update.

Q. How could we make the above example guarantee RYOW consistency?
A.
There could be couple of ways:

  • Make a single replica say R1 in our example handle all requests from the same unit of execution. This makes RYOW sticky in nature. In a leader-less distributed system, a specific node can be designated to serve all Read and Write requests for a particular data. A separate configuration store can maintain the mapping between the data and chosen server.
  • Or, if multiple replicas have to be used, ensure Write operation by any processor is synchronized to all other processors. In case of a leader-follower based distributed system, a Write on the leader succeeds only when it’s synchronously replicated to all the followers. Or, without using synchronous replication, replicas can be configured to load data lazily
    (on-demand) from the leader if they don’t find the required result.

Properties

  • If a Read R follows a Write W by the same unit of execution, R should see W.
  • As long as all the requests in a session are served by a single processor / replica or a set of suitable processors / replicas (sticky requests) who know about all updates on the concerned variable / object, the unit of execution should be able to see all updates performed by itself.
  • In case of a distributed system, sticky request also means — if a network partition occurs, the client must connect to the same server with whom it was talking to before the partition.
  • Other units of execution may not see (no guarantee) the update done by the current unit of execution.
  • Since RYOW is only about consistency in a session, it’s overall a weaker consistency guarantee.

Real Life Example

  • If you delete a mail from Gmail and you get the confirmation, upon refreshing the page, you should not see the mail again in RYOW consistency.

Monotonic Read
Monotonic Read guarantees — if a Read R0 observes Write W for a piece of data, further Read R1 by the same unit of execution for the same data in the same session should at least observe W or more recent value.

In case of a distributes system having multiple replicas, the client can reach out to any replica as long as they have enough updates at least till W for the concerned data.

Let’s look at the following example:

Figure 7: Monotonic Read

Here,

  • Initially all the replicas contain the same x = 5.
  • At time t1, the client P writes x = 12 and let’s assume it’s immediately replicated to R1.
  • As time t2 > t1, P makes a Read request for x to R1 and as R1 is already aware of the update, it returns the latest value 12 to P. This is Monotonic Read since from time t1, R1 always returns either 12 or any later value.
  • At time t3 > t2 > t1, replica R3 is still unaware of the update due to replication lag, hence it still holds the older value x = 5. When the same process P requests R3 for the latest x, it returns 5 which confuses P since P has already seen the latest x = 12 in the earlier call. This Read is non-monotonic read.

Properties

  • It’s not a global consistency guarantee, it’s local to a session, hence updates performed by the current unit of execution might be invisible to others.
  • If we assume, with every write, the version of an object increases, each subsequent Read should see the same or monotonically increasing version of the object.
  • Monotonic Read is also sticky in nature like RYOW. As long as the same processor / server or same set of designated processors / servers handle the requests for the same unit of execution in the same session, Monotonic Read should be guaranteed.
  • After network partition, in a distributed database, the client should connect back to the same server / same set of servers.

Real Life Example

  • You open Gmail. The mail client makes a request to fetch all the latest aggregated mails. Now you click on the top mail. The mail client makes another request to fetch the content of the mail. What if the second request lands on a server which is unaware of the latest update and fetches nothing? This is possible in a Non-Monotonic Read scenario. The second request should be routed to such a server which has seen the latest data that’s already reflecting in the first request.

Write Follows Read (WFR)
Let’s say there is a server S1 where Write W1 precedes Read R1 i.e; R1 has seen the effect of W1. R1 precedes another Write W2 and W2 depends on R1. In such a case, WFR provides 2 guarantees:

Ordering guarantee: W2 should follow all those Writes which caused R1 (all relevant Writes to R1). In our case, W1 is the only relevant Write for R1. Hence, the Write order should place W2 after W1 in S1. This guarantee applies within the session. Eventually, all other replicas should see the same order of Write operations, hence this order essentially acts as a global order.

Write propagation guarantee: If R1 occurs in server S1 at time t1, then for any server S2, if S2 has seen W2 at time t2 > t1, S2 must have seen all relevant Writes to R1 i.e; in our example, W1 prior to W2. It means all other replicas in the system should apply a Write after they have seen all the previous Writes on which it depends. This guarantee applies outside of session. Note: propagating the changes may take some time depending on implementation, WFR does not require any real time or instantaneous propagation of changes.

Consider the following example:

Figure 8: Write Follows Read
  • For simplicity purpose, let’s say the replica R1 is serving all Read and Write requests. x is already set to 5 by some earlier Write. At time t1, session 1 reads x = 5 and writes y = 10 to R1 at time t2. Session 1 is aware of the latest state of x and y.
  • Time t2 on-wards, replica R1 serves x = 5 and y = 10 to all requests. And as per the definition, this behaviour maps to ordering guarantee of WFR since y = 10 takes place at t2 after observing x = 5 at t1.
  • R1 instantaneously replicates y = 10 at time t2 to replica R2.
  • When session 2 makes a Read request to R2 at time t3, it observes older x = 0 but latest y = 10 which does not follow the propagation guarantee in WFR. Hence it does not maintain WFR consistency.
  • Due to some issue or replication lag, R1 replicates x = 5 to R2 at time t5.
  • From t5 on-wards, R2 also has latest x = 5 and y = 10. Hence any Read request served by R2 observes both the latest values and it appears as if these requests obey WFR consistency.

Properties

  • Ordering first applies within the involved session.
  • Outside of the client’s session, propagation of writes in order of their occurrence should be done to guarantee WFR. Note that, eventually, changes outside of session will be visible to all other units of execution as well due to the propagation guarantee.
  • The propagation can be lazy or non-real time thus making the guarantee weak.
  • This consistency is also called session causality.

Real Life Example

  • Consider replying to a tweet. You can only do that when the tweet is already written to the system and is visible to you. Both reading and replying could be done in the same session.

Monotonic Write (MW)

A write operation by a process on a data item x is completed before any successive write operation on x by the same process.

Similar to Write-Follows-Read, Monotonic Write can also be broken into two different guarantees:

  • Ordering Guarantee: An unit of execution should see its own successive updates on a particular variable / object in the order of their occurrence. This guarantee applies within the session.
  • Propagation Guarantee: Eventually, all other replicas should see the writes on the object in the same order. This applies outside of the session.

Consider the example below:

Figure 9: Monotonic Write
  • At time t1, x = 5 is updated in replica R1 and at time t2, this update is replicated to replica R2.
  • At time t2 itself, y = 10 is updated in R1 however it never gets replicated for some reason.
  • At time t3 > t2, the client session updates x = 12 in R2. Since R2 is now aware of the latest update on x due to earlier replication, this current update operation on x obeys MW consistency.
  • Similarly, at time t4, the client session updates y = 16 in R2. Since the latest y = 10 in R1 is not replicated to R2 yet, current update operation in R2 operates on older y = 0. Hence this operation does not obey MW consistency.

Properties

  • It’s a weak consistency guarantee given the guarantee concerns about an unit of execution within a session only.
  • If a Write W1 happens before another Write W2 in a session, still the unit of execution is unable to see W1 while executing W2, the session is said to be out of order.
  • While an unit of execution is under MW guarantee within a session, other units of execution might not see the same updates on the same object at that point in time. MW does not give any guarantee on propagation time.
  • Similarly, operations on the same object by other units of executions are not also guaranteed to show up during the current unit’s session.

Real Life Example

  • Consider editing an Wikipedia article. The system should guarantee that version n + 1 always replaces version n for updates performed by the same client, not the other way around. For other clients, these group of updates could be propagated at a later point in time in order. This can be guaranteed by Monotonic Write within a session.

Session Consistency Visualization:

Figure 10: Musical node representation of Monotonic Write. Courtesy: Microsoft

Session Consistency Metrics:

  • Data consistency: Moderate
  • App availability: High
  • Latency: Moderate
  • Throughput: Moderate
  • Perspective: Client-centric

Session Consistency Real Life Examples:

  • Shopping cart. If you add some item to cart in amazon.in, those items won’t be visible in amazon.co.uk as that’s another session.
  • Updating profile picture on social media like Facebook, Twitter. You can see your own updates but there is no guarantee others see it during initial few seconds at least.

Causal Consistency

Causal consistency enforces ordering of only related writes across units of executions.
What does related Write mean?
Say, if an unit of execution reads a variable x and depending on its value, updates another variable y, we say, Write of y happens after (causally dependent on) Read of x. Causal consistency guarantees that all units of execution observes new value of y only after observing the related value of x (dependency).

Let’s consider the following examples:

Example 1

Figure 11: Not a Causal Consistency

Here,

  • P1 writes x = 5 to the system.
  • P2 reads x = 5 and then writes y = 10, let’s assume that there is some program logic which determines the value of y based on the value of x. This makes the Write of y by P2 causally dependent on the earlier Write of x by P1 i.e; they are related. After the successful Write of y by P2, if any unit of execution wants to read both x and y, causal consistency expects them to observe x first then y even with stale value. Here an important thing is: in which order the variables are observed is more important than the actual value observed while reading.
    Q. What if an unit of execution, P5 only reads x or y, not both?
    A.
    That also qualifies for causal consistency since only a single value is read.
  • P3 observes the latest x = 5 first, then an older y = 0, then the latest
    y = 10. Fetching older y even though it’s already updated in the system creates an impression that update on x happened before update of y. Since y is causally dependent on x, this observation is completely fine and P3’s operations are causally consistent.
  • P4 first sees y = 10 and then older x = 0 instead of the latest x = 5. It gives an impression to P4 that y is written to the system before x which is actually incorrect. Hence it violates causal consistency guarantee. If P4 would have observed x = 0 before y = 10, it would have been causally consistent too, x = 0 is allowed instead of x = 5 since it might happen the processor / node executing P4 has not yet seen the other processor / node writing x = 5 which has executed P2.

Example 2

The following example is causally consistent:

Figure 12: Causal Consistency

Here, unlike the previous example, P2 does not read x first, it simply writes
y = 10 without depending on x. Essentially y is not causally related to x. Hence other units of execution can read x and y in any order and all such cases are causally consistent.

Example 3

Figure 13: Causal Consistency with interleaving operations
  • P2 reads x = 5 and writes y = 10, z = 15 depending on that. Both y and z are causally related (dependent) on x.
  • P3 reads the latest x = 5, then older y = 0, then the latest y = 15. Observing the update sequence of y, P3 knows y happened before x and this is true indeed.
  • Similarly, P4 reads the latest x = 5, then older z = 0, then the latest
    y = 15 interferes, then z = 15. Hence the x, the cause of z is visible to P4 in the expected order: x happened before z. Note that the interference of y = 15 does not make any difference since it’s the latest value only. P4 has no way to identify whether x happened before y or y happened before x. Hence, the sequence is causally consistent.
  • Note that the sequence of operations observed by different processes / servers don’t need to be same to qualify for causal consistency. P3 and P4 observe two different sequence, yet they are causally consistent.

Properties

  • Only related writes are ordered in the order of their occurrence across units of execution. Unrelated writes can be placed in any order. Hence, there is no notion of global ordering in a causally consistent system.
  • No real time constraints imposed.
  • As mentioned, order in which variables are observed is more important than the real value observed at the time of operation.
  • Different units of operation might observe different causally consistent sequences at the same time.
  • Causal order is transitive: A happens before B, B happens before C means A happens before C.

Metrics:

  • Data consistency: Moderate
  • App availability: High
  • Latency: Moderate
  • Throughput: Moderate
  • Perspective: Data-centric

Real Life Examples:

  • You post an important status on Facebook asking for some help. After sometime, you realize there is some mistake in the information provided, you go ahead and update the status. Now your online friends should get the update as soon as possible. They can receive the update at different time depending on how their feeds are formed. If eventual consistency is used, some of your friends may still see the older status with wrong info even after long time. But since the event of updating the status causes feed change of online friends, it can be considered as causal consistency.

Bounded Staleness Consistency

Bounded staleness is very near to strong consistency and an extension to consistent prefix reads guarantee but with the flexibility to configure the staleness threshold of the data. Basically users can define how much stale is stale for their use case.

User can configure the staleness threshold in couple of ways:

  • Time: The Reads on a data item can be configured to be stale (lag behind the Writes) by maximum specified time. Example: If the configured stateless time is 5 seconds and current time is t = 11:00 AM, updates on the item done before (t — x) = 10:55 AM are to be considered stale. Updates performed within the last 5 seconds window is allowed.
  • No of versions / update operations: For a given item, the reads might lag behind writes by maximum k updates or versions.
  • Out of these two conditions, whatever is smaller / reached earlier gets applicable.

Properties:

  • Like Consistent Prefix Read, global ordering guarantee is there but coupled with configurable threshold as mentioned. So, Reads are consistent beyond the threshold.
  • If you identify the threshold suitable for your use case, the performance could be better than strong consistency.
  • Far more expensive than session, consistent prefix, eventual consistency etc.

Visualization:

Figure 14: Musical node representation of Bounded Staleness Consistency. Courtesy: Microsoft

Metrics:

  • Data consistency: High
  • App availability: Low
  • Latency: High
  • Throughput: Low
  • Perspective: Data-centric

Real Life Example:

  • Stock ticker applications.
  • Weather tracking apps.
  • Mostly, any status tracking apps should be bounded in staleness.
  • May be, online gaming apps.

Strong Consistency

Conceptually, strong consistency is exact opposite to eventual consistency where all the replicas read the same value for a given data item at the same point in time. Certainly, ensuring strong consistency across data center even across multiple nodes in a single data center is expensive.

Figure 15: Strong Consistency with replication. Courtesy: Google Cloud

The above figure shows how we intuitively think of strong consistency in our mind when we talk about it. This is just a conceptual illustration, not necessarily a real life implementation.

  • When Node A receives an update for item X, it locks the item across replicas immediately to propagate the update at once.
  • While the item is locked, Read operation is blocked on it at all replicas to prevent inconsistent or dirty reads.
  • Locks are released once the item is updated across replicas.

Let’s look into different variant of Strong Consistency below:

Sequential Consistency

In a single threaded environment, a thread can invoke an object multiple times, but the current invocation happens if and only if the previous invocation completes. Hence, in such a scenario, we have a chain of Invocation (I), Response (R) messages and note carefully, each invocation I is followed by the corresponding response R. This is a sequential history and it’s easy to explain that the operations are sequentially consistent since no other thread interferes the executing thread.

I = Invocation, R = ResponseSequential History : I1 R1 I2 R2 I3 R3 ... In Rn

Q. How do we extend the concept to multi threaded environment? How do we order operations in such a case?
A.
Sequential consistency ensures that each thread individually executes operations in the program order. In the following example, thread A always invokes A1 operation, gets response, then execute A2 operation and gets response, hence operations order is: IA1 -> RA1 -> IA2 -> RA2.

Similarly, thread B executes operations in IB1 -> RB1 -> IB2 -> RB2 order.

I = Invocation, R = Responsethread A:         IA1----------RA1               IA2-----------RA2
thread B: | IB1---|---RB1 IB2----|----RB2 |
| | | | | | | |
| | | | | | | |
real-time order: IA1 IB1 RA1 RB1 IB2 IA2 RB2 RA2
-------------------------------------------> time

But there is no ordering guarantee among multiple threads, there could be interleaved operations across threads but threads individually obey program order. Hence, any of the following history is legal and valid in sequential consistency:

1. IA1 RA1 IB1 RB1 IB2 RB2 IA2 RA2
2. IB1 RB1 IA1 RA1 IB2 RB2 IA2 RA2
3. IB1 RB1 IA1 RA1 IA2 RA2 IB2 RB2
4. IA1 RA1 IB1 RB1 IA2 RA2 IB2 RB2
5. IA1 RA1 IA2 RA2 IB1 RB1 IB2 RB2
6. IB1 RB1 IB2 RB2 IA1 RA1 IA2 RA2

The above example is purely from Write perspective. What about Read operations?

Consider the following examples:

Example 1

Figure 16: Sequential Consistency

Here,

  • P1 writes x = 5, P2 writes y = 10 . They don’t see each other’s update. Hence these two operations are sequentially consistent.
  • P3 reads the latest x = 5, then older y = 0, then newer y = 10. Since, P3 has observed the update sequence of y after fetching x, it gets an impression that x is written to the system before y.
  • Now P4 straight forward observes the latest y = 10 and x = 5. Note that these are independent Read operations by P4. No update sequence for any of the variables has been observed by P4. Hence P4 has no way to identify if x is written before y or vice versa. Any sequence is fine for it.
    Note: The update sequence is the key differentiation here.
  • Thus both P3 and P4 agrees to the same sequence: x is written before y in the system. Hence the system has sequential consistency guarantee.
  • If both threads would have observed only the latest value of x and y in any order, they would have no way to identify whether x is written before y or vice versa. Hence that sequence would be sequentially consistent too.

Example 2

Figure 17: Not Sequentially Consistent
  • P3 again observes the update sequence of y from 0 to 10 after fetching x. Hence P3 thinks x is written before y in the system.
  • However, P4 this time observes older x = 0 after fetching latest y = 10. P4 thinks y is written before x in the system. If P4 fetches x again, it could possibly observe the latest x = 5.
    Note: Again the possible update sequence is the key differentiation here.
  • Thus P3 and P4 observes completely opposite sequence of operation and they are not sequentially consistent.

What are the observations from both these examples?
The update sequence. If any thread observes an older value of a variable, later observes a newer value for the same variable, that sequence defines order of operations on the variable with respect to the other variables in the system for that particular thread. For the same set of variables, as long as all the units of execution observe the sequence of operations in the same order, the operations are sequentially consistent.

Note: Sequential consistency does not enforce Read operations to happen exactly in the order of Write operations for a given set of variables. Read could be performed in any order as long as same or equivalent update sequence of concerned variables is observed by all units of execution. Essentially, it’s all client view that matters.

Consider the following example to understand it better:
Example 3

Figure 18: Sequential Consistency
  • P3 reads the latest y = 10 , then reads the older x = 0. But P3 does not know yet that x = 0 is actually a stale one. Hence P3 does not identify whether x is inserted before y in the system or vice versa. Any ordering is okay.
  • P4 observes the latest y = 10 then the latest x = 5. Both are latest values meaning P4 would not know whether x is inserted before y in the system or vice versa. Hence, P4 is fine with any order.
  • Actually, in reality, P1 writes x = 5 before P2 writes y = 10. Since both the threads P3 and P4 do not know which operation has occurred before whom , they can assume y is inserted before x. Thus they both agree on the same view meaning the Read operations are sequentially consistent.

Properties

  • Guarantees some valid, legal serial order of execution of Write operations.
  • Guaranteed Write operation execution in program order for individual unit of execution.
  • Has the freedom to shuffle Write operations ((I, R) pairs) across different unit of execution for efficiency purpose.
  • While reading, only the update sequence of variables observed matters. The Read order could be different than actual Write order as long as the same update sequence is seen across all units of execution regardless of the actual value observed for any particular variable.
  • If any unit of execution observes the latest value for a set of variables without observing any earlier value, it can’t really identify which variable is written before whom. Hence, any update sequence is for those variables is fine for it. Look at P4 in Example 1.

Linearizability

Linearizability is an extension to sequential consistency guarantee but stricter and often referred to as strong consistency. Any sequential history can be treated as a linearizable history if operation across units of execution maintain real time order i.e; writes (including unrelated writes) made by different units of execution are ordered in accordance with real time and they tend to be instantaneous (a valid response follows each request).

Consider the same example as in sequential consistency:

I = Invocation, R = Responsethread A:         IA1----------RA1               IA2-----------RA2
thread B: | IB1---|---RB1 IB2----|----RB2 |
| | | | | | | |
| | | | | | | |
real-time order: IA1 IB1 RA1 RB1 IB2 IA2 RB2 RA2
-------------------------------------------> time

Here, the following histories are linearizable:

1. IA1 RA1 IB1 RB1 IB2 RB2 IA2 RA2
2. IB1 RB1 IA1 RA1 IB2 RB2 IA2 RA2
3. IB1 RB1 IA1 RA1 IA2 RA2 IB2 RB2
4. IA1 RA1 IB1 RB1 IA2 RA2 IB2 RB2

Note that (IA1, RA1), (IB1, RB1) overlap meaning we actually don’t know their exact real time order. Linearizability acknowledges such overlaps: there is always a gap between making a request and getting response for that request — especially in distributed systems where probably the Write is replicated across geography during this time period. Hence linearizability allows such operations in any order i.e;
(IA1, RA1) can be ordered before (IB1, RB1) and vice versa. Similar logic applies for (IA2, RA2), (IB2, RB2). The group of operations [(IA1, RA1), (IB1, RB1)] fully precedes the group [(IA2, RA2), (IB2, RB2)]in terms of real time ordering. Hence any history where (IA2, RA2) or (IB2, RB2) comes before (IA1, RA1) or (IB1, RB1) is not linearizable by definition. The following history is not a valid linearizable history:

5. IA1 RA1 IA2 RA2 IB1 RB1 IB2 RB2

So, in short, if there is a variable V, a Write operation W(V = 3) happens at time t2, if you fetch the variable’s value at time t3 where t3 > t2, you must see the latest value of the variable i.e; V = 3 irrespective of whichever unit of execution actually has written the value.

...W(V = 2)..W(V = 3) ....... R(returns 3 not 2)
------t1--------t2-----------------t3-----> time
Condition: t1 < t2 < t3

The above talks about Write operations only. What are the constraints for Read operation to guarantee linearizability?
The Read operations must also follow real time order i.e; any unit of execution must see the most recent value of a variable in real time irrespective of whichever unit of execution has initiated the Write request for the concerned variable. But yes, if a Read operation by an unit of execution T1 for a variable x overlaps with a Write operation on the same variable x by another unit of execution T2, real time constraint could be discounted and older value of the variable could be fetched.

Consider the examples below:

Example 1

Figure 19: Linearizability
  • As usual P1 and P2 independently execute single operation on different variables x and y respectively, so they are linearizable.
  • P3 reads the most recent x = 5 and y = 10 in two independent operations. Note that P3 starts reading after both P1 and P2 finishes. Hence observing the latest value for x and y maintains linearizability consistency.
  • Similarly, P4 reads the latest y = 10 and x = 5 in two independent operations. P4 starts slightly after P3 meaning P4 starts after P1 and P2 finishes. Hence it observes the latest x and y.
  • Since none of P3 or P4 observes any older value, by definition, this is a happy case of linearizability.

Example 2

Figure 20: Linearizability with overlapping operations
  • While P1’s Write operation on x is going on, P3 initiates a Read operation on x. P3 fetches older x = 0. Since these operations are overlapping, no real time constraint on ordering is imposed and hence these operations are linearizable.
  • P4 initiates reading of y and x long after they have been written to by P2 and P1 respectively. Hence P4 gets the most recent values and it’s linearizable consistency.

Properties

  • All units of execution obey program order and Write operations (including unrelated writes) are ordered by real time.
  • Overlapped Write operations could be ordered in any way.
  • Read operations must maintain real time constraint too. Only in case, a Read overlaps with a Write for the same variable, Read can return stale value.
  • In a distributed system, if operation on each object is linearizable, then all operations in the system are linearizable.

Strict Consistency

Like linearizability, strict consistency is an extension to sequential consistency too but it is strictly coupled with real time ordering. We saw above that linearizability takes overlapped operations in a relaxed way, but strict consistency does not give that freedom. Overlapped operations also need to be ordered in strict real time order. This is the only difference between strict consistency and linearizability.

Let’s consider the following examples:

Example 1

Figure 21: Not Strict Consistency

Here,

  • P1 writes: x = 5, P3 reads x at the same time, both operations overlap but P3’s Read operation finishes a little while after P1’s Write operation completes. Strict consistency guarantee expects P3 to see P1’s Write i.e;
    x = 5 whereas for linearizability, reading x = 0 is okay since they overlap. Hence, the above sequence of operation is linearizable but not strictly consistent.

Example 2

Figure 22: Strict Consistency
  • P3 and P4 both initiates their Read operations after P1 and P2 completes in terms of real time ordering.
  • P3 reads the latest x = 5 first then the latest y = 10.
  • Similarly, P4 also observes the latest values of y and x respectively. Hence all these operations are strictly consistent.

Example 3

Figure 23: Not Strictly Consistent
  • Here, P3 starts after both P1 and P2 finishes completely. Strict consistency requires any unit of execution to see the latest values if they are interested in reading the variables.
  • P3 observes the latest x = 5, then an older y = 0. This older value violates strict consistency since they system already has an updated value of y written by P2 a while ago.
  • Similarly, P4 also observes older x = 0 way long after P1 wrote x = 5. Hence it’s also a violation of strict consistency.

Properties

  • Stricter than Linearizability.
  • Maintains strict real time ordering for all Write operation including overlapped Write operations performed by any client / thread.
  • Read operations across clients also observe the most recent value of variables in real time order.
  • Strong real time adherence makes it the strongest level of consistency guarantee and practically impossible to implement in real life distributed systems.

Strong Consistency Visualization:

Figure 24: Musical node representation of Strict Consistency. Courtesy: Microsoft

Strong Consistency Metrics:

  • Data consistency: Highest
  • App availability: Lowest
  • Latency: High / Very High
  • Throughput: Lowest
  • Perspective: Data-centric

Strong Consistency Real Life Examples:

  • Financial systems executing order payments flow or billing process.
  • E-Commerce Flash sale apps (inventory related apps).
  • Ticket booking flow while confirming the ticket.
  • Meeting scheduling kind of apps.

Consistency in Real Life

We are now aware of the trade-offs offered by different consistency models. Building real life data systems like data storage systems is hard and they have to address varying use cases originating from different customers and no one Read or Write consistency guarantee can fit all the scenarios. Hence they usually offer different APIs for different use cases. Generally, in a distributed storage, operations concerning small set of entities like get(), set() etc can be strongly consistent. But again, there is no hard and fast rule regarding which APIs or what kind of entities (record, collection, whole storage) come under which bucket.

However, there are couple of factors to consider while choosing a consistency model in real life:

  • How easy is the model to understand and develop for developers.
  • Can your system handle weak / eventual consistency to certain level?
  • Does adopting a different consistency level add significant storage and network cost?
  • Does changing the consistency model affect the system performance? What is the measure?

Check out the following if interested to know how different popular data storage solutions offer their consistency guarantees:

Strong Consistency in Amazon S3

As a real life implementation of strong consistency, let’s look at how Amazon S3, the most popular object (most popular for files) storage system has recently incorporated strong consistency. There is not much details available, however, I will keep it short. More details can be found here.

Figure 25: Bird’s Eye View of Amazon S3’s latest architecture

S3 was originally developed as an eventually consistent, reliable, highly available and high performing object storage system. But over the years, customers started using it in different use cases which require strong consistency guarantee. Hence in December 2020, S3 rolled out strong consistency without any visible impact on original guarantees. At least this is what Amazon and its current CTO Werner Vogels claim.

Let’s look into few important components:

Persistence: It’s the per-object metadata caching layer.

  • If an object is written, deleted or moved to another bucket, the information is tracked here.
  • Highly resilient.
  • The system consists of many nodes.
  • Eventually consistent. Hence, in the presence of network partition, Write and Read requests for the same object can be served by partitioned nodes with different view of the object.

Disk: Nodes actually storing the objects on disk.

The inconsistency issues actually came from the metadata cache layer described above due to its inherent eventual consistency support.

In order to make S3 strongly consistent yet ensuring the same level of performance and availability, Amazon implemented strongly consistent metadata cache layer by the following techniques:

  • CPU cache coherence: CPU Cache Coherence is implemented to make sure the CPU cores in a cache node are synchronized and don’t see stale data since S3 is highly concurrent. Over the top, replication logic is implemented which replicates object updates in-order from one node to another and publishes these updates through internal notification mechanism.
  • Witness: Just in case a cache node has stale data, how to identify that? Witness helps here to keep the cache consistent with the storage layer. It’s a highly available sub-system that is notified of every change per-object and maintains a little bit of state. Every time a Read is performed from the metadata cache, instead of checking with the storage layer, the cache actually checks with witness if it has the latest data for the object. In case the data is stale, the cache is updated with the latest data from disk storage before serving to the client request. Witness nodes store all data in memory, hence it supports high throughput and easy to scale up when a node dies.

This is how S3 overall achieved strong consistency without losing availability and performance guarantees.

Conclusion

If you have come this far, congratulations!!! You have conquered a tough topic in distributed systems. It’s funny to think that when we talk about strong consistency, we most probably don’t even know whether we are referring to linearizability or a little less stricter bounded staleness. As a distributed system architect, it’s very important to know which use cases would require what kind of consistency guarantees, otherwise incorrect trade-offs would become costly in near future.

It’s difficult to write articles on such topic. However, I hope all the examples, explanation and real life examples mentioned here per consistency guarantee make sense to you and help you understand them without much complexity.

Please provide multiple claps if you have liked the article and share on social media to make it reach bigger audience.

Feedback is appreciated.

Reference

  1. https://fauna.com/blog/demystifying-database-systems-introduction-to-consistency-levels
  2. https://www.cs.cornell.edu/courses/cs734/2000FA/cached%20papers/SessionGuaranteesPDIS_1.html
  3. https://stackoverflow.com/questions/9762101/what-is-linearizability
  4. http://csis.pace.edu/~marchese/CS865/Lectures/Chap7/Chapter7fin.htm
  5. https://dba.stackexchange.com/questions/279413/consistent-prefix-reads
  6. https://blog.jeremylikness.com/blog/2018-03-23_getting-behind-the-9ball-cosmosdb-consistency-levels/
  7. https://lennilobel.wordpress.com/2019/02/17/understanding-consistency-levels-in-azure-cosmos-db/
  8. https://cloud.google.com/datastore/docs/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore
  9. https://jisajournal.springeropen.com/articles/10.1186/s13174-020-0122-y

--

--

Kousik Nath

Deep discussions on problem solving, distributed systems, computing concepts, real life systems designing. Developer @Uber. https://in.linkedin.com/in/kousikn