What is the CAP theorem?

Distributed system design has come a long way since the 1960s, and it shapes the way we build software today.

Let's give a shout-out to Eric Brewer, who first stated this theorem about distributed systems.

Introduction

According to Wikipedia, CAP stands for:

Consistency: every read receives the most recent write or an error.

Availability: every request receives a (non-error) response, without the guarantee that it contains the most recent write.

Partition tolerance: the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

In short

In other words,

  • if a database is consistent, it ensures that the data is the same throughout the cluster, which means you can retrieve the same data from any node in the cluster.
  • if a database is available, it ensures that the cluster can still process incoming requests even when one or more nodes are down.
  • if a database is partition tolerant, it ensures that the cluster can still process incoming requests even when communication between nodes is broken.

Deep dive

Moreover, the CAP theorem states that a distributed system can achieve only two of the three criteria above. That means there are CA, CP, and AP systems, but there is no CAP system (a database that satisfies all three criteria).
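
To make the trade-off concrete, here is a minimal sketch of a two-node cluster whose link can break. It is a toy model, not how any real database is implemented: the names `Replica`, `TwoNodeCluster`, and the `mode` flag are made up for illustration. In "CP" mode the cluster refuses a write it cannot replicate, and in "AP" mode it accepts the write and lets the other node go stale.

```python
class Replica:
    """One node holding a single value (hypothetical toy model)."""

    def __init__(self, name):
        self.name = name
        self.value = "v0"


class TwoNodeCluster:
    """Two replicas joined by a link that can be cut (assumption: one value only)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True when a and b cannot exchange messages

    def write(self, node, value):
        other = self.b if node is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Cannot replicate, so refuse rather than let the replicas diverge.
                raise RuntimeError("unavailable: cannot reach the other replica")
            # AP: accept the write locally; the other replica is now stale.
            node.value = value
            return
        # No partition: both replicas receive the write.
        node.value = value
        other.value = value

    def read(self, node):
        return node.value


# CP cluster during a partition: the write is rejected, data stays consistent.
cp = TwoNodeCluster("CP")
cp.partitioned = True
try:
    cp.write(cp.a, "v1")
    cp_result = "accepted"
except RuntimeError:
    cp_result = "rejected"

# AP cluster during a partition: the write succeeds, but the nodes disagree.
ap = TwoNodeCluster("AP")
ap.partitioned = True
ap.write(ap.a, "v1")
ap_reads = (ap.read(ap.a), ap.read(ap.b))
```

With the link cut, the CP cluster gives up availability (the write fails) and the AP cluster gives up consistency (node a and node b return different values).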

CA: consistency - availability system

  • A system that ensures both consistency and high availability but gives up partition tolerance.
  • You can think of a PostgreSQL cluster in which each node contains a full copy of the data. It can therefore achieve consistency across the cluster (full data on each node) and high availability (the cluster can still serve requests even though one or more nodes are down), as long as the nodes can keep communicating with each other.

CP: consistency - partition tolerance system

  • A system that ensures both consistency and partition tolerance but gives up high availability.
  • A NoSQL database is a good fit for this type of distributed system. Take MongoDB, a schema-less NoSQL database. MongoDB can be scaled out to multiple nodes, but only one node is considered the primary, and only the primary can receive write requests. It then forwards those writes across the cluster so that every other node picks up the new changes. Because all writes go through a single primary, the data stays consistent even when some nodes lose contact with the rest (this is the partition tolerance attribute). The price is availability: if the primary fails or becomes disconnected, a new primary has to be elected from the secondary nodes, and the cluster cannot accept writes during the time it takes for a secondary to become the new primary.
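
The primary-only write path and the election gap can be sketched as follows. This is a simplified toy, not MongoDB's actual election protocol or driver API: `ToyReplicaSet`, `NotPrimaryError`, and the rule of picking the first reachable node by name are assumptions made for illustration.

```python
class NotPrimaryError(Exception):
    """Raised when no primary is available to serve a write."""


class ToyReplicaSet:
    """Hypothetical single-primary cluster; every node keeps a full copy."""

    def __init__(self, names):
        self.data = {name: {} for name in names}  # per-node copy of the data
        self.up = set(names)                      # nodes currently reachable
        self.primary = names[0]

    def write(self, node, key, value):
        if self.primary is None or node != self.primary:
            raise NotPrimaryError(f"{node} cannot accept writes")
        # The primary applies the write and replicates it to reachable nodes.
        for name in self.up:
            self.data[name][key] = value

    def lose_primary(self):
        # The primary crashes or is cut off; no node accepts writes for now.
        self.up.discard(self.primary)
        self.primary = None

    def elect(self):
        # A strict majority of all members must be reachable to elect.
        if len(self.up) * 2 <= len(self.data):
            raise NotPrimaryError("no majority, cannot elect a primary")
        self.primary = sorted(self.up)[0]  # toy rule: first name wins
        return self.primary


rs = ToyReplicaSet(["n1", "n2", "n3"])
rs.write("n1", "x", 1)

rs.lose_primary()
try:
    rs.write("n2", "x", 2)  # no primary yet, so the write is refused
    during_election = "accepted"
except NotPrimaryError:
    during_election = "rejected"

new_primary = rs.elect()
rs.write(new_primary, "x", 2)
```

Between `lose_primary()` and `elect()` every write is rejected. That window is the availability the CP design trades away to keep a single, consistent write path.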

AP: availability - partition tolerance system

  • A system that ensures both high availability and partition tolerance but gives up consistency.
  • Apache Cassandra should be the first thing that comes to mind. This database scales out to multiple instances (high availability) and keeps serving requests when nodes cannot reach each other (partition tolerance). It follows a masterless architecture, which means there is no such thing as a primary or secondary node. Every node can receive write requests, so there is a chance that two reads of the same data return different results until the nodes have synchronized.
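
The stale-read behavior of a masterless design can be sketched like this. It is a toy model under stated assumptions, not Cassandra's real replication or API: `ToyMasterlessCluster`, the integer logical clock, and the `heal` step that keeps the newest timestamp per key are all hypothetical simplifications.

```python
class ToyMasterlessCluster:
    """Hypothetical cluster where any node accepts writes."""

    def __init__(self, names):
        # Each node stores key -> (timestamp, value).
        self.nodes = {name: {} for name in names}
        self.partitioned = False
        self.clock = 0  # logical clock standing in for write timestamps

    def write(self, node, key, value):
        self.clock += 1
        record = (self.clock, value)
        # During a partition only the receiving node sees the write.
        targets = [node] if self.partitioned else list(self.nodes)
        for name in targets:
            self.nodes[name][key] = record

    def read(self, node, key):
        record = self.nodes[node].get(key)
        return None if record is None else record[1]

    def heal(self):
        # Reconcile: for every key, keep the record with the newest timestamp.
        self.partitioned = False
        merged = {}
        for store in self.nodes.values():
            for key, record in store.items():
                if key not in merged or record[0] > merged[key][0]:
                    merged[key] = record
        for name in self.nodes:
            self.nodes[name] = dict(merged)


c = ToyMasterlessCluster(["a", "b"])
c.write("a", "color", "red")      # replicated to both nodes

c.partitioned = True
c.write("b", "color", "blue")     # accepted by b only
stale = c.read("a", "color")      # a still holds the old value
fresh = c.read("b", "color")

c.heal()
after = (c.read("a", "color"), c.read("b", "color"))
```

Both nodes stay writable during the partition, so no request is refused, but a reader on node a sees the old value until the nodes reconcile.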

Final thoughts

Every scheme has its pros and cons and works best in a specific situation. It is worth fully understanding the needs of your business so that you can take full advantage of your system.