Let’s start with a summary of what eventual consistency is and what its limitations are. Eventually consistent systems in the style of Amazon Dynamo are designed to be fast and highly available during failures, even when parts of the system fail. We’re guaranteed that all copies of data will converge to the same state at some future point. However, this puts the burden of handling temporary inconsistencies on the application. Here are some examples of guarantees that eventual consistency can’t provide:
We’re given these limitations of eventual consistency. How can we design a system that does enforce uniqueness constraints and maintain high availability?
To characterize coordination avoidance, we need a system model. We’re aiming for a system with these key properties:
We’re trying to build a system where data is replicated across many servers, transactions can happen on any server without talking to others, and yet we can guarantee that our data remains consistent and that all servers eventually agree.
Our database as a collection of data items1. Each has multiple versions. Application clients submit requests to the database in the form of transactions, or groups of operations2 that should be executed together. Each transaction operates on a logical replica, or set of versions of the items mentioned in the transaction. Transactions operate over “snapshots” of database state. Upon commit3, the replica state is merged into the set of versions on at least one server. We assume this merge operator is commutative, associative, and idempotent. For example, if server
$R_x = \{v\}$
and
$R_y = \{w\}$
then
$R_x \sqcup R_y = \{v, w\}$
To determine whether a database state is valid according to application correctness criteria, we use invariants. You need usernames to be unique. Other kinds of constraints are very similar: An account balance never goes negative, a meeting room doesn’t have overlapping bookings. A transaction can commit, or abort4 if committing the transaction would violate a declared invariant over the replica state of its set of transactions $T$. A transaction can only abort if it explicitly chooses to abort itself or if committing would violate invariants over the transaction’s replica state. A system is convergent iff, in the absence of new writes, the servers eventually contain the same version for any item they both store. We apply the merge operator to produce a convergent state. A system provides coordination-free execution for $T$ iff the progress of executing each $t \in T$ is only dependent on the versions of the items $t$ reads5. That is, in a coordination-free execution, each transaction’s progress towards commit/abort is independent of other operations6 being performed on behalf of other transactions.
If $\mathcal I$-confluence holds, there exists a correct, coordination-free execution strategy for the transactions. Two servers have independently made changes that are individually correct7. Can we always merge those changes and still have a correct state? A set of transactions $T$ is $\mathcal I$-confluent with respect to invariant $\mathcal I$ if, for all $\mathcal I-T$-reachable states $D_i, D_j$ with a common ancestor state, $D_i \sqcup D_j$ is $\mathcal I$-valid. Under $\mathcal I$-confluence, the states produced by these sequences1 must be valid under merge.
Figure -1
graph TD
subgraph PRECONDITION
Ds("Ds<br><font color='blue'>Initial state: empty meeting room schedule</font>") --> Di1("Di1")
Ds --> Dj1("Dj1")
Di1 --> Din("Din")
Dj1 --> Djm("Djm")
Din --> Di1
Djm --> Dj1
style Ds fill:#f9f,stroke:#333,stroke-width:2px
style Di1 fill:#ccf,stroke:#333,stroke-width:2px
style Dj1 fill:#ccf,stroke:#333,stroke-width:2px
style Din fill:#f9f,stroke:#333,stroke-width:2px
style Djm fill:#f9f,stroke:#333,stroke-width:2px
Ds -->|"I(Ds) = True"| Di1
Ds --> |"I(Ds) = True"| Dj1
Di1 --> |"I(Di1) = True"| Din
Dj1 --> |"I(Dj1) = True"| Djm
Di1 -.-> |"-ti2<br><font color='blue'>T1: Alice books Room A, 10:00-11:00</font>"| Din
Dj1 -.-> |"-tj2<br><font color='blue'>T2: Bob books Room A, 11:00-12:00</font>"| Djm
Din -.-> |"-tin"| Di1
Djm -.-> |"-tjm"| Dj1
Ds -.-> |"-ti1"| Di1
Ds -.-> |"-tj1"| Dj1
note_local[/"<font color='green'>Transactions complete locally</font>"/]
Di1 & Dj1 --> note_local
end
subgraph IMPLICATION
Din_Djm("Din ⊔ Djm<br><font color='green'>Merged state: both bookings are present, no conflict</font>")
Din_Djm --> |"I(Din ⊔ Djm) = True"| Din
Din_Djm --> |"I(Din ⊔ Djm) = True"| Djm
style Din_Djm fill:#ccf,stroke:#333,stroke-width:2px
Din -.-> Din_Djm
Djm -.-> Din_Djm
end
PRECONDITION -.-> |"valid divergence from initial state"| IMPLICATION
IMPLICATION -.-> |"merge must be valid"| PRECONDITION
Theorem 1: A globally $\mathcal I$-valid system can execute a set of transactions $T$ with coordination-freedom, transactional availability, convergence if and only if $T$ is $\mathcal I$-confluent with respect to $\mathcal I$.
The theorem establishes $\mathcal I$-confluence as necessary and sufficient for coordination-free execution. If $\mathcal I$-confluence holds, there exists a correct, coordination-free execution strategy for the transactions; if not, no possible implementation can guarantee these properties for the provided invariants and transactions.
$\rightarrow$ If $\mathcal I$-confluence holds, each server can independently check if a transaction violates the invariant based on its local replica. There exists a coordination-free execution strategy for the transactions.
$\leftarrow$ If we have coordination-freedom, $\mathcal I$-confluence must hold. The forwards direction uses a partitioning argument to derive a contradiction. $\bot$ To prevent invalid states, at least one of the transaction sequences will have to forgo coordination-freedom.
Writes are performed in the same, well-defined order8. The merge procedures9 are deterministic so that servers resolve the same conflicts in the same manner. The Bayou system uses a primary commit scheme. One server designated as the primary takes responsibility for committing updates. The primary is responsible for deciding the final order of committed operations. Truncating the logs guarantees that they can catch up with the latest state. As a multi-tenant database, Manhattan needs to provide high quality of service to each customer without overwhelming the log.
Example 1: Handling a money transfer
sequenceDiagram
participant Client
participant RequestLog as Request Log (partition)
participant DebitStream as Debit Instruction Stream (partition)
participant CreditStream as Credit Instruction stream (partition)
Client->>RequestLog: Submit request (request_id, from, to, amount)
Note over RequestLog: Validate & persist request
RequestLog-->>Client: Request acknowledged (request_id)
RequestLog->>DebitStream: Debit instruction (request_id, from_account, amount)
Note over DebitStream: Process debit
DebitStream--xRequestLog: Debit result (success/failure)
RequestLog->>CreditStream: Credit instruction (request_id, to_account, amount)
Note over CreditStream: Process credit
CreditStream--xRequestLog: Credit result (success/failure)
Note over RequestLog: Aggregate results, update request status
RequestLog-->>Client: Final request status (success/failure/partial)
Apply every request exactly once to both the payer and payee accounts. We can consider more complex invariants, such as foreign key constraints. Insertions are $\mathcal I$-confluent, while deletions are more challenging10.