[Database] ACID and CAP
Data Integrity and Consistency
In a database, data integrity refers to whether the data is correct and free of defects. Synonyms include correctness, completeness, reliability, and validation. Integrity mainly concerns whether data within a single database or table is correct. An integrity violation means incorrect (invalid) data is stored. Examples of integrity violations include storing string data in numeric fields, storing duplicates where duplicates are not allowed, and storing null values where nulls are not permitted.
To maintain integrity, validation processes should be performed when storing and modifying data so that correct data is kept. Generally, databases provide four integrity constraints to maintain integrity.
- Domain constraint: Ensures only data matching the data types supported by the database can be stored.
- Entity constraint: Prevents storing null values for data that cannot be null by using primary keys.
- Referential integrity constraint: Uses foreign keys so that referencing tables reference the primary key of the referenced table to ensure correct data.
- Unique constraint: Uses primary or unique keys to prevent duplicate storage of data.
In a database, data consistency refers to whether the data agrees with each other. A synonym is coherence. Consistency mainly concerns whether relationships between multiple databases or multiple tables are logically non-contradictory. A consistency violation means data does not match across locations. Examples of consistency violations include cases where the same data is distributed across different tables and only one table has the change applied while the other remains unchanged, or where identical tables in different databases are not synchronized so one side has the latest data and the other has older data.
To maintain consistency, foreign key constraints provided by the database are used to set relationships between tables so that the same data can be stored in both tables. In distributed systems, to maintain consistency among multiple databases, more complex processes such as replication, distributed transaction handling, and compensating transactions are used to maintain eventual consistency.
Transaction ACID Properties
ACID refers to the four properties a transaction must have to guarantee safe execution of database transactions.
- Atomicity: The operations of a transaction must all succeed or all fail. If a transaction’s operations are being performed and one operation fails, all previously successful operations must be rolled back. Partial success or partial failure is not allowed.
- Consistency: After a transaction succeeds, the database must remain in a valid state. A valid state means a state that does not violate database-defined rules (data types, nullability, other constraints, etc.). If the database was in a valid state before the transaction, it must be in another valid state after the transaction completes. In other words, the database must consistently maintain a valid state before and after transaction execution. Transactions that would violate database consistency must always fail.
- Isolation: A transaction must not be aware of the progress of other transactions. One transaction’s operations should not interfere with another’s. Isolation levels affect database consistency and performance. Higher isolation increases consistency but lowers performance.
- Durability: The results of a successfully completed transaction must be preserved permanently and not lost even in the event of system failures.
CAP Theorem
The CAP theorem states that a distributed system that stores data across multiple nodes is guaranteed to provide at most two of the following three properties at any given time. The CAP theorem concerns distributed systems where the database is spread across a network, so it includes the concept of network partition. The CAP theorem is a theory related to databases, and the distributed system here refers to a distributed data store that partitions or replicates data across multiple database nodes. In this case, different nodes can communicate with each other over a network.
A network partition means communication between different nodes in a distributed system is interrupted. Causes of network partitions can include hardware or software failures of nodes, network environment issues, and so on. The CAP theorem addresses which property to choose between consistency and availability under the assumption that network partitions can occur.
The three properties discussed in the CAP theorem are:
- Consistency: Every read receives the most recent write (the latest committed write) or fails. A highly consistent distributed system has all nodes holding the same latest data at a given point in time. Conversely, a low-consistency system may have some nodes with older data and some with the latest data. If a network partition occurs in a system maintaining consistency, consistency may be broken.
- Availability: Every request received by a non-failing node must result in a response. Availability means the ability to always respond to requests. The node’s response does not have to be the latest data. If a network partition occurs in a system maintaining availability, availability may be broken.
- Partition tolerance: The system continues to operate and handle requests even if network partitions occur. The distributed system as a whole must continue to function despite intermittent, temporary, or permanent failures of some nodes.
Network partitions are inevitable in distributed systems, so system design must accommodate them while meeting requirements. According to the CAP theorem, a distributed system is guaranteed to satisfy at most two of the three properties at any given time. If no network partition occurs, a system can meet both consistency and availability, but when a network partition occurs, one must choose between consistency and availability.
In distributed systems, when a write occurs on one node, the data synchronization process to other nodes takes place. The method of synchronization (synchronous or asynchronous) affects levels of consistency and availability.
In systems where consistency is more important than availability, if a network partition prevents guaranteeing that data is up to date, requests should be treated as errors or timed out rather than returning a response successfully. Consistency is ensured by sacrificing availability. In replication setups, the distributed system behavior to maintain consistency according to operations is as follows. A replication setup distinguishes database instances into a master node and slave nodes (replicas) to separate read and write operations. The master node handles writes, and slave nodes handle reads. Usually, one master node and one or more slave nodes are used.
- Write operations: If a write request arrives at the master node and the master cannot communicate with all slave nodes due to a network partition, the write must fail. The master either immediately returns a failure to the client or blocks until synchronization completes so the client can time out.
- Read operations: If a read request arrives at one of the slave nodes and the master’s latest data has not been synchronized to that slave due to a network partition, the read must fail. The slave either immediately returns a failure to the client or blocks until synchronization completes so the client can time out.
In systems where availability is more important than consistency, requests must be processed successfully even if a network partition prevents guaranteeing that the data is up to date. Availability is ensured by sacrificing consistency. In replication setups, the system behavior to maintain availability according to operations is as follows.
- Write operations: If a write request arrives at the master node, the write must be processed successfully even if the master cannot communicate with all slave nodes due to a network partition. The master completes its own write and returns success to the client regardless of whether other slaves have synchronized the data.
- Read operations: If a read request arrives at one of the slave nodes, the read must be processed successfully even if the master’s latest data has not been synchronized to that slave due to a network partition. The slave does not block waiting for synchronization and immediately returns older data to the client.
The notion of consistency in the CAP theorem differs from the consistency in ACID. ACID consistency refers to maintaining a valid state within a single database, while CAP consistency refers to the level of synchronization among distributed nodes in a distributed system.
The notion of availability in the CAP theorem differs from the availability concept in software architecture. In software architecture, availability refers to how long the system operates correctly without problems, whereas in the CAP theorem, availability refers to whether requests are handled normally during a network partition, regardless of whether the data is the latest.
Comments