You know the saying: “You can have it done fast, cheap or right; choose two”. This is something no boss or customer wants to hear, but it is also very true. It is like a law of nature. You cannot talk your way around it. You cannot outsmart it.
My opinion is that for someone to fully understand this saying, she has to examine all the cases. Fortunately, there are only three cases:
Case 1: If you do it fast and cheap, prove that it will not be right.
Case 2: If you do it fast and right, prove that it will not be cheap.
Case 3: If you do it cheap and right, prove that it will not be fast.
By proving these three cases, the saying is proven in its entirety. Ok, let’s do that. (I beg your pardon?… Oh, the CAP theorem. Yes, we might get there, eventually. All right then). After this brief interruption, let’s prove all three cases of the saying.
Case 1: You do something in a manner that is fast and cheap. Fast means that you do not spend a lot of time in the design, making and testing. Thus, for the product or service to come out right, you need to have a lot of resources and employees. But you also do it in manner that is cheap, which means that you do not allocate a lot of resources and employees, because these cost. So it is impossible for the product or service to come out in a rightful manner.
Case 2: You do something in a manner that is fast and right. Since it is done fast, you have to allocate a lot of resources and employees, in order do it right. Thus, it cannot be done cheaply, because resources and employees cost. Let me give you an example. I do not like watching cars racing, but there is a lesson to be learned here. When a Formula 1 car is stopping to change tires, this is done fast and right. Have you seen how it is done? A lot of technicians with special equipment and tools rush to the car. They lift it with a special mechanism and there are dedicated technicians allocated to each and every tire. Now this could have been done more cheaply, if there was only one technician, but this would not have been fast. Or it would have been half-done.
Case 3: You do something in a manner that is cheap and right. That may mean that one employee will have to do the whole thing. And she will have to take the time to see that it is done right. So it cannot be fast.
So we have proven that the saying is correct. I love these thought experiments and proofs, don’t you?
The CAP theorem reminds me a lot about the saying. The CAP theorem sort of states that you can have a system that is Consistent, Available or Partition-tolerant, choose two. And this is difficult to understand at first. But if we examine and prove all the cases, one by one, the CAP theorem will have been reduced to an easy concept to grasp. Of course, it is obvious that we also have three cases to study here:
Case 1: If a system is consistent and available, it cannot be partition-tolerant.
Case 2: If a system is consistent and partition-tolerant, it cannot be available.
Case 3: If a system is available and partition-tolerant, it cannot be consistent.
Let’s prove these three cases by studying the simplest example there is, without any loss in generalization. Let us assume that we have two different computers on the same LAN. Each of these two computers has a local database with exactly the same schema and no data (or exactly the same data) at first. We also have a separate hardware load balancer on the LAN that makes these two computers appear as one to client applications. The client applications do not know about the two computers; they only know and contact the hardware load balancer. The hardware load balancer sends client requests to whichever of the two computers happens to be available, without any notion of history, previous requests, state etc. For example, a client application could access one server at first, then for a second request could access the other server. The hardware load balancer does not care. It will send the request to whichever server happens to be available, in a manner that will make both servers process roughly the same amount of requests, and that’s all. Each client application writes to the database and reads from the database of the server that it contacted.
So this is our system: two servers (each with its local database) and a hardware load balancer. For our system, three things are desirable: consistency, availability and partition-tolerance. Let’s define these attributes according to our system.
Consistency: Our system is considered to be consistent, when the answer that a client gets is the same, no matter which server the client contacted. That means that the local databases in both servers are synchronized (that is, contain exactly the same data). If the databases are not synchronized, the system must not answer a client’s query, because the answer will not be consistent; that means that the answer will depend on the particular server that the client request was sent to by the hardware load balancer. If the system answers a query without the databases being synchronized, the system will behave inconsistently. So, for the system to be consistent, the system has to answer queries when the databases are synchronized, and also, the system has to decline to answer queries, when the databases are not synchronized. To decline to answer queries when the databases are not synchronized is not bad for a consistency point of view. It is only bad from an availability point of view.
Availability: Our system is considered to be available, if a query gets answered, even if only one of the servers is down or the connection between the two servers is down. We do not examine the case in which both servers are down. If both servers are down, then, of course, the system will not be available, but this is not what availability means in our case. In our case, we only consider the case of at least one server being operational. You might ask why this is so. Well, when you want to augment the availability of our system, you just add more servers and more redundant connections (and more resilient load balancers). But even if nothing works and all servers are down, this does not mean that the system was not built with availability in mind. Let me give you an extreme example: If earth, or the solar system, or the universe is destroyed, our system will not work. This does not mean that our system had no availability built in. So we study availability only as the responsiveness of our system even if part of it is responsive. In our case, if only one server is up, the system should respond, otherwise we consider it to lack availability. And if one or both servers are up but they have no connectivity between them (but the client can access one or both of them), the system should respond to the client, otherwise we again consider it to lack availability. If our system declines to answer a query because it is not sure whether the two databases are synchronized, means that our system lacks availability. If our system answers a query with data that depend on the server that was used, this means that the databases were not synchronized and that our system is inconsistent, but, in this case, our system is available, nonetheless.
Partition-tolerance: This may be a new concept to some people, so it certainly needs to be defined. The way we built our system, it functions as a whole. But if, for some reason, the two servers cannot communicate with each other, the system is broken in two parts; it becomes partitioned. So, if the servers cannot communicate with each other, the system is divided in two independent, autonomous partitions (each server constitutes one these partitions). Now this partitioning can happen for a lot of reasons. The LAN connection between the servers might malfunction, there might be a network configuration or permissions problem, the software that handles the communication might exhibit a bug, etc. It does not matter what the problem is, just the result: the servers cannot communicate with each other, thus each server is unaware of the changes in the other server, so the databases cannot be synchronized. In such a case, the system can either respond and thus be inconsistent, or not respond and thus be unavailable. Also, it is important to understand that partition-tolerance is a good thing. We want (if possible) a system to be partition-tolerant, as we want it to be consistent and available. Partition-tolerance means that the system continues to operate and function correctly even when it has become partitioned, i.e. the servers cannot communicate with each other. Partition-tolerance is another phrase for “we can tolerate the servers being unable to communicate with each other; no problem”.
Now that the desired system attributes have been defined, we can analyze the three cases:
Case 1: The system is consistent and available. This means that every request is answered and it is answered with consistency. Since every request gets a consistent answer, the servers communicate with each other, in order to guarantee consistent results. Thus, the servers have their databases synchronized. Since they also answer and do not decline (the system is available), this means that a partition is not tolerated. The communication between the servers cannot be cutoff. If the servers would stop to communicate between them, then their databases would not be synchronized. In that case, the system would either have to stop responding to clients (be unavailable) or respond with unsynchronized data (be inconsistent). Thus, for the system to be both consistent and available, the servers have to be able to communicate with each other and thus, a partition of the system in two is not tolerated.
Case 2: The system is consistent and partition-tolerant. The system is consistent, which means the servers communicate with each other and keep their databases in sync, or the servers cannot communicate and the system does not respond in such an event. Since the system is partition-tolerant, we can have the servers stop communication between them. Since the servers cannot communicate with each other, they cannot synchronize their data. And since we have consistency, this means that the system will not respond to queries. The system cannot respond because the data is not synchronized and any response would be inconsistent. Since the system cannot respond to queries, the system is unavailable.
Case 3: The system is available and partition-tolerant. Since the system is partition-tolerant, the servers do not have to be able to communicate with each other, thus we can assume that their databases are not synchronized. Since the system is available, the system responds to queries. Thus, the system’s responses will be inconsistent, since each server will provide its own version of the “truth”, i.e. the data.
Since the three cases are now proven, the CAP theorem is proven in its entirety.
I wrote this blog post for three reasons. First of all, when I first learnt about the CAP theorem, it reminded me of the saying I mentioned in the beginning of the post. So I wanted to bring the saying and the CAP theorem “together” in the same blog post. Second, I wanted to explain the CAP theorem through a simple system, where I would define and explain the three desirable attributes of the system. Third, I wanted to analyze the three cases of the CAP theorem separately. I believe that the CAP theorem is hard for some people to understand, because the three attributes are not defined and explained thoroughly and the three cases are not analyzed separately.
I would also to address how the saying and the CAP theorem are used “in the wild”. From what I have seen, there are rather unfair approaches to the saying, whereas this is not the case with the CAP theorem.
Bosses and customers in general do not want to come to terms with the saying. They do not like what the saying denotes, perhaps because it is against their own selfish interests. What bosses and customers want is fast, cheap and right. All three. But this is impossible. Bosses and customers do not want to be told that something is impossible. What they really want is for something to be done fast and cheap. They also want the extra costs associated with this approach to burden the employee (if we talk about bosses) or the provider of the product or service (if we talk about customers). So this is their way of having something done fast, cheap and right. In my opinion, this way is not only unfair, it is also dumb and inconsiderate. And I am not really sure which of these three (unfair, dumb and inconsiderate) is worse.
The CAP theorem on the other hand is used in a fair manner. Actually, there is no way to misuse it. Obviously, there can be no loopholes for unfairness and inconsideration as far as the CAP theorem is concerned. So, in the “wild”, there are cases where a single database server is used and thus it contains a single version of the “truth”. But in cases where it is needed to scale out rather than up (actually, you cannot scale up indefinitely), then the CAP theorem provides a much needed guideline. A usual scenario is that of eventual consistency. Eventual consistency is all the rage, today. (You can sing the last sentence in the tune of the last sentence in Jamiroquai’s “Virtual Insanity” song: “Virtual insanity is what we’re living in”.)
What eventual consistency means is that the system is built in a way that consistency is not upheld. So, a system with eventual consistency is not consistent. Only systems in which there is immediate consistency can be described as consistent. From the CAP theorem we understand that a system with eventual consistency can be available and partition-tolerant. The term “eventual consistency” describes how the system deals with its inherent inconsistency. It is built in a way that it constantly tries to synchronize the data in all its partitions. So, all changes “eventually” reach all partitions. This is a particularly good model for a lot of applications, those that do not require perfect consistency: applications such as business intelligence, log analysis and social data analytics. Such systems (systems that do not require consistency) can scale easily, because they are available and partition-tolerant.
I would also like to express my most sincere admiration to the people who came up with the CAP theorem. They are extremely perceptive and insightful. The CAP theorem brings the matters of consistency and availability under a whole new light. Suddenly, with the CAP theorem, these matters have become crystal clear. Who knows what triggered this brilliant thinking. Let me speculate. They way I imagine it, there might have been some “wise guys” claiming their systems were both consistent and available. Although they might have been correct, they might have also thought that they had achieved perfection. So the CAP theorem discoverers set the record straight by denoting that the Achilles heel of such systems is that they are not partition-tolerant. Of course, this is pure speculation from my part. Things may have played out differently, to inspire these awesome people.
Although the saying and the CAP theorem have no conceptual correspondence with each other, in my mind they share something in common, something that I find very important. When I first found out about the saying, I thought that it would be cool if for every skill that I acquired, I would learn to do it fast-and-cheap, fast-and-right and cheap-and-right. This way I could provide the full spectrum of services when needed. I think that you truly know how to do something when you are able to do it with all three ways. The same goes for system design. I believe that you truly know how to design systems if you can design them as consistent-and-available, consistent-and-partition-tolerant and available-and-partition-tolerant, depending, of course, on the particular needs they will fulfill.
So there you have it: the CAP theorem analyzed and explained in simple terms. And I did it fast, cheap and right. All three. Oh, wait! This cannot be done! Let me go once more through my notes…