What Is the C.A.P Theorem and Why You Need It To Pick the Right NoSQL Database?
The good old “you can’t have your cake and eat it too” theorem
In the NoSQL world, there are three characteristics you need to look for in any database. Not because you need to make sure that you have all three, but rather because you need to understand that you can have at most two.
That’s right, the C.A.P theorem states that out of these three very sought-after properties of a NoSQL database, you can only have two, at most. And that’s why you need to know about it — because you need to know what you’ll be sacrificing when choosing between two very similar alternatives.
So let’s quickly cover the theorem, and then some examples for each category so you know where to go from here.
The 3 Magic Properties of All NoSQL DBs
C.A.P stands for Consistency, Availability, and Partition tolerance.
But let’s dive a bit deeper.
Consistency is all about data consistency, or in other words, making sure that within a distributed environment, every node of the database has exactly the same information at any given time. Imagine having two nodes with purchase orders from your ecommerce site. If there is no data consistency between them and they’re acting as a single cluster, the moment your client app queries the outdated node, it might show you missing transactions. And if the code is not only showing that data but also making calculations based on it, the results might be disastrous.
So consistency is definitely an important characteristic of any distributed NoSQL database. However, not all of them can provide it. So what do they do instead? They go for something called “eventual consistency,” meaning that while at one point the cluster may not be consistent, it will eventually be so. This limits the types of problems I mentioned before to a temporary window: until the nodes converge, a read can still return stale data.
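To make the idea of eventual consistency more concrete, here is a minimal sketch in Python. It is a toy model, not how any real database replicates data: the `Replica` and `ToyCluster` classes, the node names, and the order totals are all made up for illustration. It shows a read from the lagging node missing transactions until replication catches up.

```python
class Replica:
    """A toy in-memory node holding purchase orders keyed by order id."""

    def __init__(self, name):
        self.name = name
        self.orders = {}

    def read_all(self):
        # Return a copy so callers cannot mutate the node's state.
        return dict(self.orders)


class ToyCluster:
    """Two nodes with delayed replication (hypothetical, for illustration)."""

    def __init__(self):
        self.primary = Replica("node-a")
        self.secondary = Replica("node-b")
        self.pending = []  # writes not yet copied to the secondary

    def write(self, order_id, total):
        # The write lands on one node first; replication happens later.
        self.primary.orders[order_id] = total
        self.pending.append((order_id, total))

    def replicate(self):
        # Apply every pending write to the lagging node.
        for order_id, total in self.pending:
            self.secondary.orders[order_id] = total
        self.pending.clear()


cluster = ToyCluster()
cluster.write("order-1", 40)
cluster.write("order-2", 60)

# Reading from the outdated node before replication: orders are missing.
stale_total = sum(cluster.secondary.read_all().values())

cluster.replicate()

# After replication the nodes have converged.
fresh_total = sum(cluster.secondary.read_all().values())

print(stale_total, fresh_total)  # 0 100
```

Any calculation made between the write and the call to `replicate()` would be based on incomplete data, which is exactly the window that “eventual” refers to.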
Availability stands for “high availability,” or in other words, the ability of the database to always be available, no matter what happens. This is not the same as “fault tolerance,” however, and the two should not be confused. A highly available database is usually one that has replicas in multiple geographical zones, so that if there is a big network outage, it’ll still be accessible through one of its other replicas. For example, a system that’s only installed and working on one of our servers can’t be highly available because the moment that server fails, we’ll lose our database.
Partitioning stands for “partition tolerance,” or in other words, the ability to keep working when links within the cluster the database is distributed in are broken. Think about a graph representing your database cluster. You have multiple nodes sharing data and working wonderfully, and suddenly there is a problem and a section of that cluster can no longer be reached. If the database is “partition tolerant,” it’ll still work despite the sudden lack of some of its nodes.
Now, with that in mind, the theorem basically states that you can’t have all three options in a single system. You get to pick two, meaning, you can have:
- AP systems. Highly available and partitioning tolerant databases that will always respond to your requests, no matter what happens. They do this, but they can’t ensure that if there is a partial network outage all remaining nodes will have the latest version of the data. In other words, if you query the wrong node, you’ll be using old data.
- CP systems. Consistent and partitioning tolerant databases that will care for the data no matter what. Every node will always stay consistent. But they do so at the expense of availability. Every time a partition occurs, the nodes that can no longer guarantee up-to-date data need to stop serving requests, which means that in the worst case the whole system becomes unavailable.
- AC systems. Highly available and consistent databases will always respond to all your requests from any node with the latest version of the data. Unless, of course, a partition happens in the cluster, then the whole thing goes down the drain because you can’t ensure consistency in that scenario.
Visual representation of the CAP diagram
Now, there is a caveat with the AC systems: in distributed environments partitioning is almost inevitable, so while you might want to try and have an AC system, chances are you will never be able to achieve one. This is usually the problem all SQL-based databases have: even if they can be clustered and the information can be replicated or shared across nodes, a partition will cause problems.
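The difference between the AP and CP choices described above can be sketched in a few lines of Python. This is a deliberately simplified model under my own assumptions: the `Node` class, the `Unavailable` exception, and the `simulate` helper are hypothetical and do not correspond to any real database API. One node gets cut off from its peer, a write reaches only the connected side, and then we read from the isolated node.

```python
class Unavailable(Exception):
    """Raised when a node refuses to answer rather than risk stale data."""


class Node:
    def __init__(self, mode):
        self.mode = mode          # "AP" or "CP", the labels used in this article
        self.value = "v1"
        self.partitioned = False  # True when cut off from the rest of the cluster

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: the node cannot confirm it has the latest data, so it refuses.
            raise Unavailable("cannot reach peers")
        # AP: the node always answers, even though the value may be outdated.
        return self.value


def simulate(mode):
    """Partition one node, update the other, then read from the isolated one."""
    isolated = Node(mode)
    connected = Node(mode)
    isolated.partitioned = True
    connected.value = "v2"  # the write only reaches the connected side
    try:
        return isolated.read()
    except Unavailable:
        return "unavailable"


ap_result = simulate("AP")
cp_result = simulate("CP")
print(ap_result, cp_result)  # v1 unavailable
```

The AP node answers with the old value, while the CP node gives up availability instead of returning something it can’t vouch for.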
Let’s now take a quick look at examples of interesting AP and CP systems you might want to use.
Highly Available and Partition Tolerant systems
As mentioned before, AP systems will always (there is no 100% availability in IT, but you get the point) be there for you no matter what. The caveat with them is that the data may not always be updated everywhere.
Examples of these systems are:
- Cassandra DB. This is a distributed NoSQL database that can handle huge amounts of information. While it can be configured to be data-consistent, in doing so you’ll lose availability, thus shifting it from an AP system into a CP one. By default, its data consistency level puts it inside this category. Use it, like many other systems in this category, if you don’t mind reading stale data in some situations.
- CouchDB. This is a JSON-based database, meaning that the information is stored in JSON records. It provides nice integration with many modern web-based technologies and it even allows for JavaScript-based transformations. Couch’s consistency model is that of “eventual consistency,” which it achieves through incremental replication, meaning that document changes are replicated periodically from server to server. Eventually, the whole cluster is consistent.
- MemoryDB. Recently AWS announced the launch of a Redis-based managed data store. It also has an eventual consistency model, replicating its transaction log across multiple availability zones. This is a great alternative if you’re looking for a reliable key-value store.
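The idea of configuring a database like Cassandra to trade availability for consistency is often explained with the quorum overlap rule: if a write must be acknowledged by W replicas and a read must consult R replicas out of N, then a read is guaranteed to touch at least one replica holding the latest write whenever R + W > N. Here is a small sketch of that rule. The function name and parameters are my own and not part of any driver or API.

```python
def reads_see_latest_write(replicas, write_acks, read_acks):
    """Return True when every read set must overlap every write set.

    replicas   -- number of copies of each piece of data (N)
    write_acks -- replicas that must acknowledge a write (W)
    read_acks  -- replicas consulted on a read (R)
    """
    if not (1 <= write_acks <= replicas and 1 <= read_acks <= replicas):
        raise ValueError("acks must be between 1 and the number of replicas")
    return read_acks + write_acks > replicas


# One replica for reads and one for writes: fast and available, may be stale.
one_one = reads_see_latest_write(replicas=3, write_acks=1, read_acks=1)

# A majority for both reads and writes: consistent, but needs 2 of 3 nodes up.
quorum = reads_see_latest_write(replicas=3, write_acks=2, read_acks=2)

print(one_one, quorum)  # False True
```

The second configuration is the consistent one, but it also means a request fails whenever fewer than two of the three replicas can be reached, which is the loss of availability mentioned above.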
AP systems are strong and reliable, and they’re great options as long as you can deal with some stale data from time to time.
Consistent and Partition Tolerant databases
Now let’s talk about the other side of the coin (because remember, we’re leaving out AC systems here): CP systems.
These systems are not highly available (not by default at least), which leaves out any cloud-based version of them.
Examples of CP systems are:
- HBase. An open source NoSQL database modeled after Google’s Bigtable paper. It provides real-time access to big data (we’re talking billions of rows and records, as an example). By default, HBase is considered CP because it provides a very narrow and forced consistency model. Every write operation for a given region is routed through a single “RegionServer,” the only one serving that region at the time. If that server fails, the data it was serving becomes unreachable until it comes back online. That also removes the “A” for Availability. Like all other alternatives listed here, it can be configured to become highly available, at the expense of data consistency of course.
- MongoDB. Probably one of the most common NoSQL databases out there, MongoDB is a document-based database much like CouchDB (which also uses a JSON format for its records). However, unlike Couch, Mongo is considered CP by default, meaning that it prefers to stay data-consistent in a distributed environment, because all read and write operations are by default served by the primary node of a replica set. The set keeps working as long as a majority of its nodes are connected to each other. If that is not the case, no new primary can be elected and the set stops accepting writes.
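The election rule for a replica set comes down to simple arithmetic: a primary can only be elected when a strict majority (more than half) of the voting members can reach each other. The sketch below shows that check. The function name is hypothetical and this is not MongoDB’s actual election code, just the majority condition on its own.

```python
def can_elect_primary(reachable_members, total_members):
    """Return True when the reachable members form a strict majority."""
    if total_members < 1 or not 0 <= reachable_members <= total_members:
        raise ValueError("reachable_members must be between 0 and total_members")
    # Strict majority: more than half, so 2 of 4 is not enough.
    return reachable_members * 2 > total_members


three_of_five = can_elect_primary(3, 5)
two_of_four = can_elect_primary(2, 4)
one_of_three = can_elect_primary(1, 3)

print(three_of_five, two_of_four, one_of_three)  # True False False
```

Note how a set with an even number of members split exactly in half can’t elect a primary on either side, which is one reason odd-sized replica sets are the usual recommendation.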
CP systems are usually what you want to have when data is of the utmost importance to you and you always have to rely on it being updated. Of course, this leaves the door open to catastrophic network problems that completely deny access to your database. That being said, you have to consider the following: given your current infrastructure, how likely is such a scenario, and what would it cost to come up with alternatives to solve it quickly?
Conclusion
There is a lot of discussion around CAP capabilities and distributed NoSQL databases. In fact, some developers even go as far as to say that we should stop talking about CAP given how limited its scope is (reading between the lines, you can see that it only cares about what happens if there is a partition in the cluster, but it doesn’t say anything about latency, server problems, and other real-world problems).
So take this with a grain of salt and not as the ultimate bible that some people consider it to be. However, I do believe that understanding what the CAP theorem represents and how each property interacts with the other is crucial to understand how big-data distributed architectures work. So there is that.
What about you? What do you think about this topic? Are you a follower of the CAP? Leave your thoughts in the comments, and let’s discuss!