The Eight Fallacies of Distributed Computing
Explore the classic assumptions developers make about networked systems that lead to failures. Learn how to avoid these pitfalls in distributed architecture.
Peter Deutsch and others at Sun Microsystems formulated the fallacies in the 1990s: colleagues drafted the first four, Deutsch published seven in 1994, and James Gosling added the eighth in 1997. The gist: almost every assumption developers make about networked applications turns out to be wrong. Three decades later, these fallacies still cause production outages.
The Eight Fallacies
Fallacy 1: The Network is Reliable
The first and most dangerous fallacy. Production networks fail constantly: cables get cut, switches reboot, routers get misconfigured. In any sufficiently large system, you should expect network failures daily.
graph TD
A[Client Request] --> B[Network]
B -->|Packet Loss| C[Request Timeout]
B -->|Latency Spike| D[Request Slow]
B -->|Partition| E[Request Failed]
C --> F[Retry Logic]
D --> F
E --> G[Fallback Behavior]
F --> G
The fix: implement retries with exponential backoff, circuit breakers, and fallback behavior. When designing a distributed system, assume the network will fail and design accordingly. See the CAP theorem for what happens during network partitions.
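A minimal sketch of what that retry logic can look like; the helper names and parameters here are illustrative, and a real system would combine this with a circuit breaker and fallback:

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to a fallback layer
            # Full jitter: sleep a random amount up to the capped exponential delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))

# Usage: a flaky call that fails twice, then succeeds on the third attempt
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient network failure")
    return "ok"

print(call_with_retries(flaky))  # "ok" after two retried failures
```

The jitter matters: without it, clients that failed together retry together and hammer the recovering service in synchronized waves.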
Fallacy 2: Latency is Zero
Light in fiber travels at roughly two-thirds of its speed in a vacuum, so a round trip across the US runs 60-70ms on real networks. A round trip between New York and Tokyo takes at least 100-150ms, and often more in practice. Developers who treat network calls like function calls soon discover their “simple” distributed system responds in seconds, not milliseconds.
// Naive approach: sequential network calls
async function getUserDashboardSequential(userId) {
  const user = await fetchUser(userId);                   // 50ms
  const posts = await fetchPosts(userId);                 // 80ms
  const friends = await fetchFriends(userId);             // 60ms
  const notifications = await fetchNotifications(userId); // 40ms
  return { user, posts, friends, notifications };         // Total: 230ms sequential
}

// Better: independent calls run in parallel
async function getUserDashboardParallel(userId) {
  const [user, posts, friends, notifications] = await Promise.all([
    fetchUser(userId),
    fetchPosts(userId),
    fetchFriends(userId),
    fetchNotifications(userId),
  ]);
  return { user, posts, friends, notifications }; // Total: ~80ms (slowest call)
}
Even without partitions, latency and consistency trade off against each other. See the PACELC theorem. Synchronous replication for strong consistency adds latency. Async replication is faster but introduces potential inconsistency.
Fallacy 3: Bandwidth is Infinite
Early network proponents assumed bandwidth was effectively unlimited. You hit the wall fast when replicating large datasets or streaming video across data centers.
Real bandwidth limits hit in ways that surprise teams:
- Replicating a 500GB database across regions at 100Mbps takes over 11 hours
- Streaming 4K video to 10,000 concurrent viewers needs ~300Gbps
- A single gigabit network link tops out at 125MB/s in theory, and less in practice after protocol overhead
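These numbers fall out of simple arithmetic that is worth keeping handy; a small helper (the efficiency factor is an illustrative stand-in for protocol overhead and contention):

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours to move data_gb gigabytes over a link_mbps link.
    efficiency < 1.0 accounts for protocol overhead and shared links."""
    gigabits = data_gb * 8
    seconds = gigabits * 1000 / (link_mbps * efficiency)
    return seconds / 3600

# 500 GB over a 100 Mbps link, assuming the link is entirely ours
print(round(transfer_hours(500, 100), 1))                  # 11.1 hours
# The same transfer at 70% effective throughput
print(round(transfer_hours(500, 100, efficiency=0.7), 1))  # 15.9 hours
```

Running this before committing to a replication design is cheaper than discovering mid-migration that the window is three times longer than planned.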
The geo-distribution post covers how to architect when data must span regions—and the latency and bandwidth costs that come with it.
Fallacy 4: The Network is Secure
Internal networks get compromised. Services get repurposed. Attackers find ways in. Zero trust architecture exists because assuming internal networks are safe has killed systems. Every service should authenticate every request, even from within the same data center.
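One concrete form of “authenticate every request” is request signing. A minimal HMAC sketch using only the standard library; the header names and shared-key setup are illustrative, and this complements rather than replaces mTLS:

```python
import hashlib
import hmac
import time

SHARED_KEY = b"rotate-me-regularly"  # in practice: per-service keys from a secret store

def sign_request(body: bytes, key: bytes = SHARED_KEY) -> dict:
    """Attach a timestamp and HMAC so the receiver can authenticate the caller."""
    ts = str(int(time.time()))
    mac = hmac.new(key, ts.encode() + b"." + body, hashlib.sha256).hexdigest()
    return {"X-Timestamp": ts, "X-Signature": mac}

def verify_request(body: bytes, headers: dict, key: bytes = SHARED_KEY,
                   max_skew: int = 300) -> bool:
    """Reject stale or tampered requests, even ones from 'inside' the network."""
    if abs(time.time() - int(headers["X-Timestamp"])) > max_skew:
        return False  # stale: limits the replay window
    expected = hmac.new(key, headers["X-Timestamp"].encode() + b"." + body,
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, headers["X-Signature"])

headers = sign_request(b'{"amount": 10}')
print(verify_request(b'{"amount": 10}', headers))   # True
print(verify_request(b'{"amount": 999}', headers))  # False: tampered body
```

Note the constant-time comparison via `hmac.compare_digest`; a naive `==` leaks timing information to an attacker probing signatures.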
The mTLS post covers mutual TLS for service-to-service authentication. The secrets management post discusses how to handle credentials in a distributed environment.
Fallacy 5: Topology Doesn’t Change
Static network diagrams go stale the moment you draw them. Services get added, removed, migrated, and scaled. Containers move between hosts. VMs get provisioned and deprovisioned. Your “stable” service discovery configuration is obsolete within weeks.
graph LR
A[Service A] -->|Initially| B[Service B]
A -->|After migration| C[Service B in DC2]
A -->|With service mesh| D[Service B via mesh]
Dynamic service discovery (covered in DNS Service Discovery and Service Registry) handles topology changes. But you still need to design for instances appearing and disappearing without warning.
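The core mechanism behind that discovery is small enough to sketch: instances heartbeat, and anything that stops heartbeating expires. This toy registry (service names, addresses, and the TTL are all illustrative) shows why short TTLs keep topology changes from serving stale endpoints:

```python
import time

class ServiceRegistry:
    """Minimal illustrative registry: instances heartbeat periodically and
    expire after ttl_seconds without one, so topology changes self-heal."""
    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        self._instances = {}  # service -> {addr: last_heartbeat}

    def heartbeat(self, service: str, addr: str) -> None:
        self._instances.setdefault(service, {})[addr] = time.monotonic()

    def resolve(self, service: str) -> list:
        now = time.monotonic()
        live = {a: t for a, t in self._instances.get(service, {}).items()
                if now - t < self.ttl}
        self._instances[service] = live  # prune expired instances
        return sorted(live)

registry = ServiceRegistry(ttl_seconds=0.1)
registry.heartbeat("billing", "10.0.0.5:8080")
print(registry.resolve("billing"))   # ['10.0.0.5:8080']
time.sleep(0.2)                      # the instance stops heartbeating...
print(registry.resolve("billing"))   # [] -- expired, not a stale endpoint
```

Real registries (Consul, etcd, Kubernetes endpoints) add replication and health checks, but the TTL-expiry contract is the same.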
Fallacy 6: There is One Administrator
In a distributed system, multiple teams manage different parts of the infrastructure. The network team controls routers. The database team manages replicas. The platform team handles Kubernetes. The security team enforces firewall rules. Changes made by any one can break things for all.
Coordination overhead grows fast with system size. A simple change like adding a new firewall port requires tickets, approvals, and maintenance windows. Distributed systems require strong processes alongside strong architecture.
The service boundaries post discusses how to structure teams and services to minimize coordination overhead while maintaining proper separation.
DevOps and SRE Organizational Patterns
Fallacy 6 manifests most painfully in multi-team environments. When hundreds of services are managed by dozens of teams, coordination becomes the bottleneck. These patterns help manage the complexity.
Team Topologies for Distributed Systems
The Team Topologies book by Matthew Skelton and Manuel Pais identifies four fundamental team patterns that map well to distributed systems:
| Team Type | Responsibility | Interaction Pattern |
|---|---|---|
| Stream-aligned | Deliver user-facing features | Self-sufficient, flow work |
| Platform | Provide internal platforms/tools | Enable stream teams |
| Enabling | Help teams overcome obstacles | Temporary, embed with teams |
| Complicated Subsystem | Own complex domain expertise | Consulted by stream teams |
In distributed systems, platform teams become critical. They own the shared infrastructure that stream-aligned teams consume: service mesh, observability stack, database platforms, networking policies.
SRE Practices That Counteract Fallacy 6
SRE (Site Reliability Engineering) was invented at Google to bridge the gap between engineering and operations in large-scale distributed systems:
Error Budgets: Rather than mandating that every change goes through a lengthy approval process, error budgets allow teams to move fast as long as the system remains reliable. If your SLA is 99.9% uptime, you have a 0.1% downtime budget to spend on risk. When the budget is spent, slow down. When it is healthy, move fast.
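The budget arithmetic is simple enough to sketch; the targets below are illustrative, not prescriptions:

```python
def downtime_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime over the window for a given availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)

print(round(downtime_budget_minutes(99.9), 1))   # 43.2 min over 30 days
print(round(downtime_budget_minutes(99.95), 1))  # 21.6 min over 30 days

def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched)."""
    budget = downtime_budget_minutes(slo_percent, window_days)
    return max(0.0, 1 - downtime_minutes / budget)

# 10.8 minutes of downtime so far against a 99.9% SLO: 75% of budget left
print(round(budget_remaining(99.9, 10.8), 2))  # 0.75
```

Tying deployment policy to `budget_remaining` is what turns reliability from an argument into a number.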
Toil Reduction: SRE defines toil as manual, repetitive, automatable work. Teams drowning in toil cannot move fast. SRE practices push organizations to automate operational work so engineers can focus on improving the system rather than running it.
SLOs and SLIs: Service Level Objectives give teams shared definitions of reliability. When everyone agrees that “reliable” means 99.95% availability with p99 latency under 200ms, cross-team discussions become easier. SLOs become the common language for reliability decisions.
Blameless Postmortems: When failures happen in complex distributed systems, the answer is almost never “one person made a mistake.” The real answer involves multiple contributing factors across team boundaries. Blameless postmortems focus on systemic causes rather than individual fault, which encourages reporting of near-misses and accelerates learning.
# Example: SRE error budget policy
# .sre/config.yaml
error_budget_policy:
  availability_target: 99.95   # ~4.4 hours downtime per year
  slo_window: 30d
  warning_threshold: 50%       # remaining budget
  critical_threshold: 25%      # remaining budget
  # Actions tied to budget consumption
  when_below_50%:
    - Increase testing rigor
    - Require additional review for risky changes
  when_below_25%:
    - Freeze non-critical deployments
    - Root cause analysis required
    - Executive notification
  when_depleted:
    - Emergency retrospective
    - Post-mortem mandatory
    - Deployment freeze until resolved
Cross-Team Coordination Mechanisms
As distributed systems grow, you need formal mechanisms to coordinate changes across team boundaries:
Architecture Decision Records (ADRs): When a decision affects multiple teams, an ADR documents the context, decision, and consequences. This creates a searchable record of why the system is built the way it is. New team members can read ADRs to understand architectural decisions made before they joined.
Service Level Agreements (SLAs) Between Teams: Downstream teams depend on upstream services. Formalizing those dependencies as SLAs—with explicit latency, availability, and error rate commitments—makes it clear what each team owes to others.
Design Reviews for Cross-Cutting Changes: Changes to shared infrastructure, API contracts, or data schemas should go through a design review process that includes affected teams. This catches breaking changes before they happen.
Dark Launches and Feature Flags: Rather than coordinating deployments across teams, feature flags let teams deploy code independently. A dark launch exposes new functionality to a small subset of traffic while the team monitors for issues. This decouples deployment from release.
# Feature flag example for cross-team coordination
# When Team A wants to change an API that Team B depends on,
# Team A can:
# 1. Deploy the change behind a feature flag
# 2. Test with their own traffic
# 3. Gradually increase traffic while monitoring
# 4. Roll back instantly if issues arise
# No coordination with Team B required until the flag is fully enabled
from time import sleep

def gradual_rollout(flag_name: str, initial_percentage: float = 1.0,
                    increment: float = 10.0, interval_seconds: float = 300):
    """
    Gradually increase traffic to a new feature.
    Returns True when the flag reaches 100%, False if it is disabled or rolled back.
    (is_flag_disabled, get_flag_metrics, disable_flag, and set_flag_percentage
    are assumed to be provided by your feature flag service.)
    """
    current_percentage = initial_percentage
    while current_percentage < 100.0:
        if is_flag_disabled(flag_name):
            print(f"Flag {flag_name} disabled - rolling back")
            return False
        metrics = get_flag_metrics(flag_name)
        if metrics.error_rate > 0.01:  # 1% error threshold
            print(f"Error rate {metrics.error_rate} exceeds threshold - rolling back")
            disable_flag(flag_name)
            return False
        current_percentage = min(current_percentage + increment, 100.0)
        set_flag_percentage(flag_name, current_percentage)
        print(f"Flag {flag_name} at {current_percentage}%")
        sleep(interval_seconds)
    return True
Practical Guidelines for Reducing Coordination Overhead
- Embrace self-service: Build platforms and tooling that let teams operate independently. If deploying a new service requires filing a ticket with the platform team, you have a bottleneck.
- Make defaults safe: Design systems so the safe choice is the default. If encryption should be on by default, make it impossible to turn off. This reduces the number of decisions teams must make and the mistakes they can make.
- Invest in observability: When something breaks in a distributed system, the first question is always “which service?” Good observability—logs, metrics, traces—lets teams diagnose issues without coordinating with other teams.
- Use service meshes for network policies: Rather than coordinating firewall rules across teams, a service mesh like Istio or Linkerd lets you express network policy as code. Teams define what their service needs to access; the mesh enforces it.
- Create shared libraries for common patterns: When multiple teams implement the same pattern (retries, timeouts, circuit breakers), a shared library means they all benefit from improvements and bug fixes without coordination.
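As an example of such a shared pattern, a circuit breaker fits in a few dozen lines. This is a sketch with illustrative thresholds, not a production implementation (real ones, like those in resilience libraries, also track half-open probes and per-endpoint state):

```python
import time

class CircuitBreaker:
    """Illustrative circuit breaker: open after `threshold` consecutive
    failures, reject calls while open, allow one probe after `reset_after`."""
    def __init__(self, threshold: int = 3, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result

def always_down():
    raise ConnectionError("service unreachable")

breaker = CircuitBreaker(threshold=2, reset_after=60.0)
for _ in range(2):
    try:
        breaker.call(always_down)
    except ConnectionError:
        pass
# The breaker is now open: further calls fail fast instead of piling up timeouts
```

Failing fast is the point: an open breaker turns a slow, resource-consuming timeout into an instant, cheap error that fallback logic can handle.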
Fallacy 7: Transport Cost is Zero
This extends “latency is zero” but focuses on serialization and deserialization. Converting an object to JSON, sending it across the network, and converting it back takes CPU time and memory. For high-throughput systems, serialization alone can dominate processing time.
// Efficient binary protocol vs JSON
const userJson = JSON.stringify(user); // CPU: ~500μs for complex object
const userProtobuf = protobuf.encode(user); // CPU: ~50μs, 60% smaller
// Impact at scale: 10,000 requests/second
// JSON: 5 seconds CPU just for serialization
// Protobuf: 0.5 seconds CPU
Protocol choices matter. gRPC with Protocol Buffers outperforms REST with JSON for internal service communication. The API contracts post covers designing service interfaces that minimize overhead.
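Serialization cost is easy to measure before it bites. A quick profile using only the standard library; the payload shape is a made-up stand-in, and absolute numbers will vary by machine:

```python
import json
import timeit

# A moderately nested payload, stand-in for a real API response object
payload = {"user": {"id": 42, "name": "ada", "tags": ["a", "b"] * 50},
           "posts": [{"id": i, "body": "x" * 200} for i in range(50)]}

n = 1000
json_seconds = timeit.timeit(lambda: json.dumps(payload), number=n)
per_call_us = json_seconds / n * 1e6
print(f"JSON encode: {per_call_us:.0f} us/call")

# At 10,000 requests/second, serialization alone consumes this many
# CPU-seconds every wall-clock second (i.e. this many saturated cores):
print(f"Cores spent on serialization at 10k rps: {per_call_us * 10_000 / 1e6:.2f}")
```

Run the same harness against your real payloads and candidate encoders; a ten-minute benchmark beats discovering the cost in a production flame graph.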
Fallacy 8: The Network is Homogeneous
Assuming all network paths are equal leads to subtle performance problems. A call from service A to service B in the same data center might take 1ms. The same call routed through a load balancer might take 5ms. A call that crosses a WAN might take 50ms. A call that goes through a poorly configured firewall might time out.
Network paths are never uniform. Quality of service policies, routing anomalies, and congestion all affect performance unpredictably. The load balancing and load balancing algorithms posts cover how to route traffic intelligently despite network heterogeneity.
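One way to route intelligently is latency-aware endpoint selection rather than blind round-robin. A sketch using an exponentially weighted moving average (EWMA); the endpoint names and alpha are hypothetical:

```python
import random

class LatencyAwareBalancer:
    """Route to the endpoint with the lowest exponentially weighted
    moving average (EWMA) of observed latency."""
    def __init__(self, endpoints, alpha: float = 0.3):
        self.alpha = alpha
        self.ewma = {e: None for e in endpoints}

    def record(self, endpoint: str, latency_ms: float) -> None:
        prev = self.ewma[endpoint]
        self.ewma[endpoint] = latency_ms if prev is None else (
            self.alpha * latency_ms + (1 - self.alpha) * prev)

    def pick(self) -> str:
        # Endpoints with no observations yet get explored first
        unseen = [e for e, v in self.ewma.items() if v is None]
        if unseen:
            return random.choice(unseen)
        return min(self.ewma, key=self.ewma.get)

lb = LatencyAwareBalancer(["b.dc1:443", "b.dc2:443"])
lb.record("b.dc1:443", 1.2)    # same-datacenter path
lb.record("b.dc2:443", 48.0)   # cross-WAN path
print(lb.pick())  # b.dc1:443 -- prefers the ~1ms path over the ~50ms one
```

The EWMA adapts as conditions change: if the same-DC path degrades, its average rises and traffic shifts away without any configuration change.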
Real-World Consequences
These fallacies compound in production. A team assumes latency is negligible, so they make synchronous calls between services. Then a network partition hits. Because latency is actually non-zero, timeouts trigger slowly. Retries overwhelm the system. The cascade spreads.
The availability patterns and resilience patterns posts cover how to build systems that survive when assumptions fail—because they always do.
Real Incident Case Studies
Understanding fallacies through real-world failures helps cement why these assumptions are dangerous.
AWS US-EAST-1 Outage (December 2021) - Fallacy 1: Network is Reliable
What happened: An automated capacity-scaling activity triggered unexpected behavior that congested AWS's internal network in US-EAST-1. The outage lasted over seven hours and affected thousands of services, including major platforms like Netflix and Disney+ as well as Amazon's own retail operations.
Root fallacy violations:
- Network is Reliable: Congestion on the internal network took down control-plane services that countless workloads silently depended on
- Latency is Zero: Synchronous cross-service dependencies meant one failing component brought down entire systems
- One Administrator: Recovery required hours of coordination across many internal teams
Key lessons:
- Assume any single component can fail catastrophically
- Design for partial availability - degrade gracefully
- Have runbooks for multi-step recovery procedures
Cloudflare Route Leak (June 2019) - Fallacies 1 & 8: Network is Reliable, Network is Homogeneous
What happened: A BGP route leak from a small ISP, accepted and re-advertised by a major transit provider, redirected traffic destined for Cloudflare and caused it to drop roughly 15% of its global traffic for nearly two hours. Millions of websites and services behind Cloudflare were unreachable.
Root fallacy violations:
- Network is Reliable: A routing error made by one provider propagated across the global routing table within minutes
- Network is Homogeneous: The assumption that all network paths behave consistently proved false
Key lessons:
- BGP routing errors can propagate globally in minutes
- Have multiple DNS providers for critical services
- Monitor for unexpected routing changes
GitHub Outage (October 2018) - Fallacy 2: Latency is Zero
What happened: During routine maintenance, a brief network partition between GitHub's primary data centers triggered an automated MySQL failover, leaving databases on both coasts with writes the other side lacked. Reconciling the data and restoring full service took roughly 24 hours of degraded performance for millions of developers.
Root fallacy violations:
- Latency is Zero: Applications were not designed for the cross-country replication latency the failed-over topology imposed
- Topology Doesn't Change: Automated failover reshaped the database topology faster than dependent systems could adapt
Key lessons:
- Always analyze latency impact of maintenance operations
- Have rollback procedures that account for replication lag
- Monitor database connection pool exhaustion
The Compounding Effect: How Fallacies Cascade
flowchart TD
A[Assume latency is zero] --> B[Make synchronous calls between services]
B --> C[Network partition occurs]
C --> D[Timeouts trigger slowly due to no timeout budgets]
D --> E[Retries overwhelm healthy services]
E --> F[Cascade failure spreads across system]
G[Assume network is reliable] --> H[No circuit breakers configured]
H --> C
I[Assume topology doesn't change] --> J[Hardcoded IP addresses used]
J --> K[Service migration causes connection failures]
K --> C
The compounding pattern: Violating one fallacy often amplifies violations of others. A network partition (fallacy 1) becomes catastrophic when combined with no timeout budgets (fallacy 2) and no circuit breakers (fallacy 1 again).
Designing for Reality
Use these fallacies as a checklist when designing distributed systems:
graph TD
A[Designing Distributed System] --> B{Ask These Questions}
B --> C[What happens when network calls fail?]
B --> D[What is actual latency between services?]
B --> E[Are we exceeding bandwidth budgets?]
B --> F[Do we authenticate internal calls?]
B --> G[How do topology changes affect us?]
B --> H[Who coordinates cross-team changes?]
B --> I[Is serialization cost included in latency budgets?]
B --> J[Are all network paths equally reliable?]
If you cannot answer these questions confidently, your system will fail in production. Not might—will.
When to Use / When Not to Use
| Scenario | Recommendation |
|---|---|
| Greenfield distributed system | Design explicitly for all eight fallacies |
| Adding distribution to monolith | Start with async communication |
| Multi-region deployment | Assume every fallacy applies, amplified |
| Internal microservices | Treat internal traffic as untrusted |
| High-throughput systems | Profile serialization cost early |
Production Failure Scenarios
| Failure Scenario | Root Fallacy | Mitigation |
|---|---|---|
| Cascading timeout during network blip | Latency is zero, Network is reliable | Circuit breakers, bulkheads, timeouts with budgets |
| Database replication falls behind | Bandwidth is infinite | Monitor replication lag, compress streams, batch |
| Service discovery returns stale endpoints | Topology doesn’t change | Use dynamic discovery with short TTLs |
| Lateral movement after internal breach | Network is secure | Zero trust, mTLS, every call authenticated |
| Single team bottleneck on deployment | One administrator | Decentralize ownership, self-service platforms |
Common Pitfalls / Anti-Patterns
Pitfall 1: Treating Network Calls Like Function Calls
Problem: Calling a remote service as if it were a local function, without timeouts, retries, or error handling.
Solution: Every network call must have: timeout, retry logic (with backoff), fallback behavior, and proper error handling.
Pitfall 2: Assuming Homogeneous Network Conditions
Problem: Treating all environments (dev, staging, production) as having similar network characteristics.
Solution: Test against production-like network conditions. Use chaos engineering to inject latency and partition services.
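Chaos-style injection can start as small as a wrapper around your client calls. A sketch, not a real chaos tool; the rates and latency ranges are illustrative, and a fixed seed keeps test runs reproducible:

```python
import random
import time

def chaotic(fn, latency_ms=(0, 250), failure_rate=0.1, rng=random.Random(7)):
    """Wrap a call with injected latency and random failures for testing
    how callers behave under realistic network conditions."""
    def wrapped(*args, **kwargs):
        time.sleep(rng.uniform(*latency_ms) / 1000)  # inject latency
        if rng.random() < failure_rate:
            raise ConnectionError("chaos: injected failure")
        return fn(*args, **kwargs)
    return wrapped

# Exercise a caller against a flaky dependency
fetch = chaotic(lambda: "response", latency_ms=(0, 5), failure_rate=0.2)
results = []
for _ in range(20):
    try:
        results.append(fetch())
    except ConnectionError:
        results.append("failed")
print(results.count("failed"), "injected failures out of 20 calls")
```

If your retry, timeout, and fallback logic cannot survive this wrapper in a test suite, it will not survive a real partition in production.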
Pitfall 3: Ignoring Serialization Overhead
Problem: Choosing JSON because it is readable, ignoring CPU and bandwidth costs at scale.
Solution: Profile serialization cost. For high-throughput internal services, consider Protocol Buffers, Avro, or MessagePack.
Self-Assessment Quiz: Is Your Architecture Violating These Fallacies?
Test your system design against each fallacy:
Quiz: Fallacy 1 - Network is Reliable
- Do you have retry logic with exponential backoff for all network calls?
- Are circuit breakers configured for all downstream services?
- Do you have fallback behavior when services are unavailable?
- Have you tested what happens when a network partition occurs?
Quiz: Fallacy 2 - Latency is Zero
- Have you measured actual latency between all service pairs?
- Are timeouts set based on measured P99 latency, not arbitrary values?
- Do you use parallel calls where possible instead of sequential chains?
- Is latency included in your SLA calculations?
Quiz: Fallacy 3 - Bandwidth is Infinite
- Have you calculated bandwidth requirements for data replication?
- Do you compress data streams between datacenters?
- Are large data transfers batched or streamed rather than bulk-loaded?
- Do you monitor bandwidth utilization and alert on limits?
Quiz: Fallacy 4 - Network is Secure
- Do you use mTLS for all internal service-to-service communication?
- Is every API endpoint authenticated, even internal ones?
- Do you rotate secrets and certificates automatically?
- Do you log and audit all cross-service communication?
Quiz: Fallacy 5 - Topology Doesn’t Change
- Do you use dynamic service discovery instead of static IPs?
- Are services designed to handle instances appearing/disappearing?
- Do you use Kubernetes or similar for dynamic load balancing?
- Are DNS TTLs short enough for rapid changes?
Quiz: Fallacy 6 - There is One Administrator
- Do you have clear ownership boundaries for each service?
- Are cross-team changes coordinated through ADRs (Architecture Decision Records)?
- Do you have self-service deployment pipelines?
- Is there a clear escalation path for incidents?
Quiz: Fallacy 7 - Transport Cost is Zero
- Have you profiled serialization costs at your expected throughput?
- Are you using efficient serialization (Protobuf, Avro) for internal communication?
- Do you stream large datasets rather than loading them all at once?
- Is serialization cost included in latency budgets?
Quiz: Fallacy 8 - Network is Homogeneous
- Do you use latency-aware routing (not just round-robin)?
- Have you tested behavior across different network paths?
- Do you monitor per-path latency, not just aggregate?
- Do you have quality-of-service policies for different traffic types?
Scoring
Count your unchecked boxes:
- 0-4 unchecked: Your architecture is resilient - well done!
- 5-10 unchecked: Some vulnerabilities exist - review weak areas
- 11+ unchecked: High risk - prioritize fixing these issues before production
Quick Recap
- The eight fallacies describe assumptions that seem reasonable but are always wrong in production.
- Assume the network is unreliable, latency is non-zero, and bandwidth is finite.
- Treat internal traffic as untrusted—security boundaries exist at every network hop.
- Design for topology changes. Use dynamic service discovery.
- Account for transport cost (serialization) in performance budgets.
Copy/Paste Checklist
- [ ] Every network call has timeout and retry logic
- [ ] Circuit breakers protect against cascade failures
- [ ] Fallback behavior defined for when services are unavailable
- [ ] Authentication required for all service-to-service calls
- [ ] Dynamic service discovery instead of static configuration
- [ ] Serialization cost included in latency budgets
- [ ] Network topology changes handled gracefully
- [ ] Monitoring for latency, error rates, and bandwidth utilization
For deeper exploration of failure handling, see Chaos Engineering and Circuit Breaker Pattern.