The distributed systems I advocate are shared-nothing systems: collections of machines connected by a network. The network is the sole communication channel; I assume each machine has its own disk, and one machine can't access another's resources except by making requests over the network. This shared-nothing architecture is dominant for internet services because it's cost-effective on commoditized cloud infrastructure and can achieve high reliability through redundancy.

However, this reliance on the network introduces unreliability. The internet is an asynchronous packet network: a node can send a message to another node, but the network provides no guarantees about when, or whether, it will be delivered. Many things can go wrong when you send a request and expect a response:

- Your request packet may be lost in transit.
- Your request may be waiting in a queue and will be delivered later.
- The remote node may have crashed, been powered down, or become otherwise unavailable.
- The remote node may have paused (perhaps for a long garbage collection) but will resume responding later.
- The remote node may have processed your request, but the response packet was lost on its way back.
- The remote node may have processed the request, but the response has been delayed in transit.
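From the sender's point of view, all of these failures are indistinguishable: the only evidence is that no response has arrived, and the usual (imperfect) way to detect them is a timeout. As a minimal sketch in Go, assuming a hypothetical unreachable address, the following shows that a timed-out request tells you nothing about which failure occurred, or whether the remote node processed the request at all.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// fetchWithTimeout sends one request and waits at most `timeout` for a reply.
func fetchWithTimeout(url string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		if ctx.Err() == context.DeadlineExceeded {
			// Lost request? Crashed node? Paused node? Lost or delayed
			// response? From here they all look exactly the same.
			return fmt.Errorf("no response within %v (request may or may not have been processed)", timeout)
		}
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	// Hypothetical address: any unreachable or overloaded node behaves the same way.
	if err := fetchWithTimeout("http://10.255.255.1/health", 2*time.Second); err != nil {
		fmt.Println("request failed:", err)
	}
}
```

Whether to retry after such a timeout is an application-level decision: if the remote node did process the request, retrying may perform the same action twice.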
Network unreliability is so fundamental that "the network is reliable" is the first of Peter Deutsch's "Eight Fallacies of Distributed Computing." This is unsurprising: reliance on a shared network channel is what defines a distributed program. Distributed computing revolves around what is and isn't possible to compute under varying network conditions.
For instance, the FLP impossibility result demonstrates that consensus cannot be guaranteed in an asynchronous network when even one process may fail. This means that basic operations, such as modifying the set of machines in a cluster, aren't guaranteed to complete in the presence of network asynchrony and individual server failures. The implications aren't simply academic: these impossibility results have motivated a proliferation of system designs. Under a more reliable network model, one that guarantees timely message delivery, FLP no longer holds; by making stronger guarantees about network behavior, we can circumvent the programmability implications of these impossibility proofs.
The degree of network reliability determines the kinds of operations that systems can reliably perform without waiting. Some claim that networks are reliable in practice and that we're too concerned with designing for theoretical failure modes; if they're right, extensive failure-mode design is unnecessary. Others see it differently. James Hamilton of AWS summarizes, "Network partitions should be rare but net gear continues to cause more issues than it should." So who's right?
A key challenge is the scarcity of evidence. We have few normalized bases for comparing network and application reliability, and even less data. We can track packet loss, but the end-to-end effect on applications is far more elusive. What evidence there is proves difficult to generalize: it's deployment-specific and tied to particular vendors and topologies. Few organizations share specifics about their network's behavior. Finally, distributed systems are designed to resist failure, which means that noticeable outages often depend on complex interactions of failure modes. Many applications silently degrade when the network fails, and the resulting problems may not be understood for some time.
Much of what we believe about real-world distributed system failures is founded on guesswork. Sysadmins swap stories over beer, but detailed network availability surveys are few and far between. Here I bring a few of these stories together as a step toward an open discussion of real-world partition behavior. The anecdotal evidence suggests that network problems are surprisingly common, even in controlled environments like a company-operated data center. One study in a midsize data center found roughly 12 network faults per month, of which half disconnected a single machine and the other half disconnected an entire rack. Another study measured the failure rates of components such as load balancers and found that adding redundant network gear doesn't reduce faults as much as hoped, because it doesn't guard against human error, a major cause of outages.
Public cloud services are notorious for transient network glitches, often affecting a particular availability zone. Failures frequently occur between specific subsections of a data center, revealing planes of cleavage in the underlying hardware topology. Something as routine as a software upgrade for a switch can trigger a network topology reconfiguration. In one incident, Cloudflare deployed a new firewall rule in response to a DDoS (distributed denial-of-service) attack; its Juniper routers encountered the rule and proceeded to consume all their RAM until they crashed. Recovery was arduous: dispatching on-site engineers to reboot routers takes time, and even though some data centers came back online initially, they fell over again because all the traffic in Cloudflare's network hit them at once. Recovery was complete after an hour of unavailability.

WAN (wide area network) failures are also common. Researchers analyzed five years of operation of the CENIC (Corporation for Education Network Initiatives in California) WAN and found more than 500 isolating network partitions. Median durations ranged from 2.7 minutes for software-related failures to 32 minutes for hardware-related failures, with a 95th percentile of 3.7 days for hardware failures.

There have also been several global Internet outages caused by BGP misconfiguration. In 2008, Pakistan Telecom, responding to a government edict to block YouTube.com, incorrectly advertised its block to external providers, hijacking traffic destined for the site and rendering it unreachable. Similar incidents occurred in 1997, 2005, 2006, and 2010. Even sharks bite undersea fiber-optic cables; Google wrapped its trans-Pacific cables in Kevlar to guard against shark bites.
If the error handling of network faults isn't defined and tested, arbitrarily bad things could happen: the cluster could become permanently unable to serve requests, or it could even delete all of your data.
This was one of the worst outages in the history of GitHub, and it’s not at all acceptable to us. I’m very sorry that it happened and our entire team is working hard to prevent similar problems in the future.
– Mark Imbriaco, “Downtime Last Saturday”
There's no way around it: the network may fail. Handling these faults doesn't necessarily mean tolerating them; if your network is normally reliable, a valid approach may simply be to show an error message to users while the network is experiencing problems. However, you do need to know how your software reacts to network problems and ensure that the system can recover from them. It makes sense to deliberately trigger network problems and test the system's response (this is the idea behind Chaos Monkey).
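As a small-scale sketch of that kind of deliberate fault injection (Chaos Monkey itself terminates whole instances; this hypothetical ChaosTransport only drops individual HTTP requests), a wrapper around Go's http.RoundTripper can fail a configurable fraction of requests, so a test suite can verify that the application surfaces a clear error, or retries safely, instead of hanging or losing data.

```go
package faultinject

import (
	"errors"
	"math/rand"
	"net/http"
)

// ErrInjected marks failures introduced by the chaos transport, so tests
// can tell injected faults apart from real ones.
var ErrInjected = errors.New("injected network fault")

// ChaosTransport wraps a real http.RoundTripper and fails a configurable
// fraction of requests before they ever reach the network, simulating lost
// requests or unreachable nodes.
type ChaosTransport struct {
	Base     http.RoundTripper // transport to delegate to; nil means http.DefaultTransport
	FailRate float64           // e.g. 0.3 fails roughly 30% of requests
	Rand     *rand.Rand        // seeded source, so failures are reproducible in tests
}

func (c *ChaosTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if c.Rand.Float64() < c.FailRate {
		// The caller sees an ordinary transport error, just as it would
		// during a real network problem, and must handle it gracefully.
		return nil, ErrInjected
	}
	base := c.Base
	if base == nil {
		base = http.DefaultTransport
	}
	return base.RoundTrip(req)
}
```

A test would build an http.Client with this transport, for example &http.Client{Transport: &ChaosTransport{FailRate: 0.3, Rand: rand.New(rand.NewSource(1))}}, run the code under test against it, and assert that every injected failure produces a clean error or a safe retry rather than a hang or silent data loss.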