Hey, I just met you
The network's laggy
But here's my data
So store it maybe
— Kyle Kingsbury, Carly Rae Jepsen and the Perils of Network Partitions (2013)
A recurring theme is how systems handle things going wrong: replica failover, replication lag, and concurrency control. Working with distributed systems is fundamentally different from writing software on a single computer, so we'll now turn our pessimism up to the maximum. Experienced operators assume that anything that can go wrong will go wrong, and dealing with failures explodes the possible state space that needs testing and validation. On a single node, failures tend to be total and obvious — nobody expects a database to stay fast once its disk has slowed to a crawl — but in a distributed system, semi-failures (nodes that are slow, flaky, or only partly reachable) are common.

The core argument used to sell these systems goes like this: if a node fails with probability $P$, and the software replicates data $N$ times so that it can survive $N - 1$ failures, then the probability of losing a particular piece of data is $P^N$. So for any tolerable loss probability $R$ and any single-node failure probability $P$, you can pick a replication factor $N$ such that

\[P^N < R\]

This is hogwash. The reasoning assumes failures are independent, but failures are correlated: identical hardware runs on the same power supply; the same bug exists on all the machines; one client sending corrupt requests can sequentially bring down every server. When you push the bad config, it's game over no matter how many machines you push it to. There's really only one thing to talk about with respect to reliability: continuous hours of successful production operation. I call this empirical reliability[^1].
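To see how correlation wrecks the exponent, here is a toy model of my own — the constant $c$ is an assumed probability that one common-mode event (a bad config push, a shared bug) wipes out every replica at once:

```python
# Toy model of replicated data loss, independent vs. correlated failures.
# p: probability a single node loses the data; n: replication factor;
# c: probability of a common-mode event destroying all replicas at once.
# All numbers are illustrative, not measured.

def loss_probability(p: float, n: int, c: float = 0.0) -> float:
    """P(loss) = c + (1 - c) * p**n under this toy common-mode model."""
    return c + (1 - c) * p ** n

for n in (1, 2, 3, 5):
    independent = loss_probability(0.01, n)          # the P^N sales pitch
    correlated = loss_probability(0.01, n, c=1e-4)   # with common-mode risk
    print(f"n={n}: independent={independent:.1e} correlated={correlated:.1e}")
```

Under independence, five replicas promise a $10^{-10}$ loss probability; with even a $10^{-4}$ chance of a correlated wipeout, no value of $N$ gets you below $10^{-4}$.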
A common term for this style of distributed-system architecture is service-oriented architecture (SOA). While SOA may conjure memories of WS-* standards and the nebulous Common Object Request Broker Architecture (CORBA), the core idea is solid: loosely coupled services, each serving a small function and working independently from one another. For each service identified, you must weigh its availability target, its consistency needs, and its dominant read/write pattern — the priorities annotated on each service in Figure 1 below.
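One way to keep those trade-offs from staying implicit is to write them down per service. A minimal sketch — the `ServiceSpec` type, field names, and numeric targets are hypothetical, chosen to mirror the annotations in Figure 1:

```python
# Hypothetical per-service requirement records mirroring Figure 1's
# priority annotations. Values are assumptions for illustration.

from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceSpec:
    name: str
    availability_target: float  # fraction of successful requests
    consistency: str            # "strong" or "eventual"
    workload: str               # dominant access pattern

SPECS = [
    ServiceSpec("api-tier",       0.9999, "eventual", "10K RPS, latency-sensitive"),
    ServiceSpec("queue-layer",    0.999,  "strong",   "FIFO job buffer"),
    ServiceSpec("identity-auth",  0.9999, "eventual", "read on every request"),
    ServiceSpec("usage-tracking", 0.99,   "eventual", "write-heavy, read-light"),
]

for spec in SPECS:
    print(f"{spec.name:>14}: {spec.availability_target:.2%} "
          f"{spec.consistency:>8} ({spec.workload})")
```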
To illustrate these principles, consider a service that resizes images. Although it seems trivial, operating this at scale disabuses engineers of any optimism; the inherent complexity of distributed systems is ineluctable.
Requirements: 100K users; 10K requests per second (RPS) per region; 99.99% availability for the customer-facing API tier.
Assume an average image of 256KB and that one CPU core can perform 10 conversions per second[^2]. Supporting 10K RPS then requires 1K cores, or ~32 servers using 32-core machines. Adding 20% headroom for surges, we require a fleet of $\geq 40$ dedicated processing servers. This scale confirms that a simple single-server monolith is impossible; we need a distributed stack.
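The same back-of-envelope arithmetic, spelled out (the inputs mirror the assumptions above):

```python
# Back-of-envelope sizing for the processing fleet.
import math

rps = 10_000                 # peak image conversions requested per second
conversions_per_core = 10    # assumed throughput of one CPU core
cores_per_server = 32
headroom = 0.20              # surge buffer

cores_needed = rps / conversions_per_core              # 1,000 cores
servers = math.ceil(cores_needed / cores_per_server)   # 32 servers
fleet = servers * (1 + headroom)                       # 38.4, round up

print(f"{cores_needed:.0f} cores -> {servers} servers -> fleet >= {math.ceil(fleet)}")
# 1000 cores -> 32 servers -> fleet >= 39, which we round up to 40
```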
Figure 1 breaks the system down into a few coarse services:
```mermaid
graph TB
subgraph TOP_ROW [ ]
direction LR
style TOP_ROW fill:none,stroke:none
subgraph QL ["Queue Layer<br/>Consistency Priority"]
direction TB
Q1[Queue Node 1<br/>Leader]
Q2[Queue Node 2<br/>Follower]
Q3[Queue Node 3<br/>Follower]
PAXOS[Paxos consensus<br/>Auto failover]
end
subgraph API_TIER ["Customer-facing API Tier<br/>99.99% availability"]
direction TB
LB[Load balancer<br/>10K RPS total]
API1[API server 1<br/>5K RPS]
API2[API server 2<br/>5K RPS]
API3[API server 3+<br/>Buffer capacity]
end
end
subgraph MID_ROW [ ]
direction LR
style MID_ROW fill:none,stroke:none
subgraph AUTH ["Identity & Auth<br/>Availability Priority"]
direction TB
EDGE[Edge cache<br/>Identity data]
MGMT[Management plane<br/>Customer management - rare]
DATA[Data plane<br/>Auth - every request]
end
subgraph USAGE ["Usage Tracking<br/>Read-light"]
direction TB
AGG1[Local aggregator 1]
AGG2[Local aggregator 2]
AGG3[Local aggregator 3]
BATCH[Batch store<br/>Eventually consistent]
end
end
subgraph PROC ["Processing & Storage"]
direction TB
MR[MapReduce<br/>Batch processing]
BILL[Billing aggregation]
COLD[Cold storage<br/>Compliance records]
end
LB -->|Distribute| API1
LB -->|Distribute| API2
LB -->|Distribute| API3
API1 -->|Buffer for scale| Q1
API2 -->|Buffer for scale| Q2
API3 --> Q3
PAXOS -.->|Leader election| Q1
PAXOS -.->|Failover| Q2
PAXOS -.->|Failover| Q3
API1 -->|Every request| EDGE
API2 -->|Every request| EDGE
API3 -->|Every request| EDGE
EDGE -.->|Cached from| DATA
MGMT -.->|Rare updates| DATA
API1 -->|Local logs| AGG1
API2 -->|Local logs| AGG2
API3 -->|Local logs| AGG3
AGG1 -->|Batched| BATCH
AGG2 -->|Batched| BATCH
AGG3 -->|Batched| BATCH
BATCH -->|Process| MR
MR -->|Aggregate| BILL
MR -->|Archive| COLD
TOP_ROW ~~~ MID_ROW
MID_ROW ~~~ PROC
style API1 fill:#90EE90
style API2 fill:#90EE90
style API3 fill:#90EE90
style Q1 fill:#FFB6C1
style EDGE fill:#87CEEB
style BATCH fill:#DDA0DD
style PAXOS fill:#FFD700
```
Figure 1: The image-resizing system broken down into coarse services.
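To make the request path through Figure 1 concrete, here is a minimal sketch. Every name in it (`identity_cache`, `resize_queue`, `usage_log`, `handle_resize`) is a stand-in of my own: the real system would use an RPC client, a Paxos-replicated queue, and batched log shipping rather than in-process objects.

```python
# Sketch of one request's path: auth via edge cache, buffer the job in
# the queue layer, record usage in a local eventually-consistent log.

import queue
import time
import uuid

identity_cache = {"key-123": {"customer": "acme"}}  # edge-cached auth data
resize_queue: "queue.Queue[dict]" = queue.Queue()   # stands in for the queue layer
usage_log: list = []                                # local aggregator buffer

def handle_resize(api_key: str, image_ref: str, width: int, height: int) -> str:
    # Auth runs on every request, served from the edge cache (data plane).
    identity = identity_cache.get(api_key)
    if identity is None:
        raise PermissionError("unknown API key")

    # Buffer the job instead of resizing inline: surges pile up in the
    # queue rather than overloading the processing fleet.
    job_id = str(uuid.uuid4())
    resize_queue.put({"job": job_id, "image": image_ref,
                      "size": (width, height), "customer": identity["customer"]})

    # Usage tracking is write-heavy and read-light, so a local, batched,
    # eventually consistent log is good enough.
    usage_log.append({"job": job_id, "ts": time.time()})
    return job_id

print(handle_resize("key-123", "s3://bucket/cat.png", 640, 480))
```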
Distributed systems are the default approach to scaling, but no one benefits from irrational exuberance. It takes about a decade for a file system to mature. Good software can't run reliably with bad operations. Check invariants[^10]. If you use a NoSQL vendor, ask for the observed real-life distributions of MTTF and MTTR.
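Those two numbers matter because together they pin down steady-state availability:

\[A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}\]

For example, a node that fails about once a month (MTTF ≈ 720 hours) and takes an hour to restore gives $A \approx 720/721 \approx 99.86\%$ — nowhere near four nines on its own.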
[^1]: As opposed to theoretical reliability.
[^2]: Requires precise CPU/size tracking.
[^3]: The front door.
[^4]: Plus headroom.
[^5]: FIFO ordering.
[^6]: Creating accounts.
[^7]: A new user might have to wait a few seconds before their credentials work.
[^8]: A Dynamo-style key-value store.
[^9]: Tape or low-cost disk.
[^10]: "Are all messages sent actually received?"