System Design: Netflix Architecture for Global Streaming
Deep dive into Netflix architecture. Learn about content delivery, CDN design, microservices, recommendation systems, and streaming protocols.
Netflix serves over 250 million subscribers across 190 countries, streaming billions of hours of content monthly. The technical challenges span content delivery, real-time encoding, recommendation systems, and building a resilient microservices platform.
This case study examines Netflix’s architecture and how to design a streaming platform at scale.
Requirements Analysis
Functional Requirements
Users need to:
- Browse and search a catalog of movies and TV shows
- Stream video content on various devices
- Create and manage profiles
- Continue watching across devices
- Rate and review content
Non-Functional Requirements
The platform must:
- Stream in 4K HDR with surround sound
- Start playback in under 2 seconds
- Handle 15+ million concurrent streams
- Maintain 99.99% availability
- Work on 1000+ device types
Capacity Estimation
| Metric | Value |
|---|---|
| Subscribers | 250 million |
| Peak concurrent streams | 15+ million |
| Content library | 50,000+ titles |
| Average stream bitrate | 8 Mbps |
| Peak bandwidth | 120 Tbps |
| CDN edge locations | 100+ |
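These figures are consistent with quick back-of-the-envelope math (the helper below is illustrative, using the table's numbers):

```python
def peak_bandwidth_tbps(concurrent_streams: int, avg_bitrate_mbps: float) -> float:
    """Total egress needed if every concurrent stream pulls the average bitrate."""
    total_mbps = concurrent_streams * avg_bitrate_mbps
    return total_mbps / 1_000_000  # 1 Tbps = 1,000,000 Mbps

# 15 million concurrent streams at 8 Mbps each
print(peak_bandwidth_tbps(15_000_000, 8))  # → 120.0 (Tbps)
```

That 120 Tbps figure is exactly why serving from a handful of central data centers is infeasible and content must be pushed to the edge.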
Content Delivery Architecture
Content delivery sits at the core of Netflix’s engineering: the company built its own CDN, called Open Connect.
Open Connect CDN
graph TB
A[Netflix Origin] --> B[ISP Interconnects]
B --> C[Open Connect Appliances]
C --> D[Residential Routers]
D --> E[Devices]
subgraph "ISP Network"
B
C
end
subgraph "Customer Premise"
D
E
end
Open Connect appliances are custom-built servers deployed at ISP data centers worldwide. They cache popular content close to users.
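A simplified sketch of the fill decision an appliance might make: greedily cache the most popular titles until disk capacity runs out. The titles, popularity scores, and sizes below are invented inputs, not real catalog data:

```python
from typing import Dict, List

def plan_cache_fill(
    titles: Dict[str, Dict[str, float]],  # title -> {"popularity": ..., "size_gb": ...}
    capacity_gb: float
) -> List[str]:
    """Greedily select the most popular titles that still fit on the appliance."""
    ranked = sorted(titles, key=lambda t: titles[t]["popularity"], reverse=True)
    plan, used = [], 0.0
    for title in ranked:
        size = titles[title]["size_gb"]
        if used + size <= capacity_gb:
            plan.append(title)
            used += size
    return plan

catalog = {
    "stranger-things": {"popularity": 0.95, "size_gb": 40},
    "obscure-doc": {"popularity": 0.02, "size_gb": 10},
    "new-release": {"popularity": 0.80, "size_gb": 35},
}
print(plan_cache_fill(catalog, capacity_gb=60))  # → ['stranger-things', 'obscure-doc']
```

Real appliances refresh their contents during off-peak "fill windows" using predicted regional popularity, but the core trade-off (popularity vs disk budget) is the same.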
Request Flow
sequenceDiagram
participant D as Device
participant OCA as Open Connect Appliance
participant CS as Control Plane
participant LS as License Server
D->>CS: GET /manifest.m3u8
CS->>D: Return manifest with URL to nearest OCA
D->>OCA: GET /video/segment1.ts
OCA-->>D: Video segment
D->>LS: GET /license for DRM
LS-->>D: License key
D->>OCA: GET /video/segment2.ts
OCA-->>D: Video segment
Video Encoding Pipeline
graph LR
A[Source Video] --> B[Transcode Cluster]
B --> C{HD or 4K?}
C -->|4K| D[4K Encode Farm]
C -->|HD| E[HD Encode Farm]
D --> F[Output Profiles]
E --> F
F --> G[CDN Origins]
G --> H[Edge Cache]
Netflix encodes each title in multiple resolutions and bitrates for adaptive streaming:
| Profile | Resolution | Bitrate | Codec |
|---|---|---|---|
| 4K HDR | 3840x2160 | 16 Mbps | H.265/VP9 |
| 1080p | 1920x1080 | 5 Mbps | H.264/H.265 |
| 720p | 1280x720 | 2.5 Mbps | H.264 |
| 480p | 854x480 | 1 Mbps | H.264 |
| Audio | Stereo/5.1/Atmos | 192-768 kbps | AAC |
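Given this ladder, a client can pick the highest profile whose bitrate fits within a safety margin of measured throughput. A sketch (the 80% margin is a common rule of thumb, not a published Netflix figure):

```python
# Video profiles from the encoding ladder above, highest first (bitrates in Mbps)
LADDER = [("4K HDR", 16), ("1080p", 5), ("720p", 2.5), ("480p", 1)]

def pick_profile(throughput_mbps: float, safety: float = 0.8) -> str:
    """Return the highest profile whose bitrate fits within the safety margin."""
    budget = throughput_mbps * safety
    for name, bitrate in LADDER:
        if bitrate <= budget:
            return name
    return LADDER[-1][0]  # floor at the lowest profile

print(pick_profile(25))   # 25 * 0.8 = 20 Mbps budget → '4K HDR'
print(pick_profile(7))    # 5.6 Mbps budget → '1080p'
print(pick_profile(0.5))  # 0.4 Mbps budget → floor at '480p'
```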
Microservices Architecture
Netflix decomposed their monolith into hundreds of microservices.
Service Decomposition
graph TB
subgraph "Edge Layer"
GW[API Gateway]
GS[Gateway Service]
end
subgraph "Backend"
M[Metadata Service]
R[Recommendation Engine]
P[Playback Service]
S[Search Service]
A[Auth Service]
end
subgraph "Data"
EV[EVCache]
DB[(Cassandra)]
ES[Elasticsearch]
end
GW --> GS
GS --> M
GS --> R
GS --> P
GS --> S
GS --> A
M --> DB
M --> EV
R --> EV
S --> ES
Key Microservices
| Service | Responsibility | Data Store |
|---|---|---|
| API Gateway | Request routing, aggregation | None |
| Metadata | Titles, episodes, images | Cassandra |
| Playback | Streaming session management | EVCache |
| Recommendations | Personalized suggestions | Elasticsearch |
| Search | Full-text search | Elasticsearch |
| User Profile | Account, profiles, settings | Cassandra |
| Billing | Subscriptions, payments | PostgreSQL |
API Gateway
public class ApiGatewayApplication {
    public static void main(String[] args) {
        // Zuul-style gateway setup (illustrative pseudocode)
        addRequestFilters();   // auth, rate limiting, request decoration
        addResponseFilters();  // response aggregation, metrics

        // Route definitions: path prefix -> backend service
        configureRoutes(new ZuulRouteBuilder()
            .route("/api/v1/metadata/**", "metadata-service")
            .route("/api/v1/playback/**", "playback-service")
            .route("/api/v1/recommendations/**", "recommendation-service")
            .route("/api/v1/search/**", "search-service")
        );
    }
}
Recommendation System
Netflix’s recommendation engine drives 80% of content consumption.
Recommendation Pipeline
graph LR
A[User Events] --> B[Event Pipeline]
B --> C[Feature Store]
C --> D{Ranking Models}
D --> E[Personalized Ranking]
D --> F[Similar Titles]
D --> G[Top Picks]
E --> H[API Response]
F --> H
G --> H
Ranking Model Features
from typing import Any, Dict

class RankingFeatures:
    def __init__(self, user_id: int, title_id: int):
        self.user_features = self._get_user_features(user_id)
        self.title_features = self._get_title_features(title_id)
        self.context_features = self._get_context_features()

    def compute_features(self) -> Dict[str, Any]:
        # Mixed numeric and categorical features; encoding happens downstream
        return {
            # User attributes
            "user_age_days": self.user_features.age_days,
            "user_avg_watch_time": self.user_features.avg_watch_time,
            "user_rating_avg": self.user_features.avg_rating,
            "user_genre_preferences": self.user_features.genre_scores,
            # Title attributes
            "title_popularity_score": self.title_features.popularity,
            "title_recency": self.title_features.release_days_ago,
            "title_rating": self.title_features.avg_rating,
            "title_match_score": self._genre_match(),
            # Context
            "time_of_day": self.context_features.hour,
            "day_of_week": self.context_features.day,
            "device_type": self.context_features.device
        }
Ranking Service
class RecommendationService:
def __init__(self, model: RankingModel, cache: RedisCache):
self.model = model
self.cache = cache
async def get_ranked_list(
self,
user_id: int,
row_count: int = 20,
evidence_count: int = 5
) -> List[RankedTitle]:
# Check cache
cache_key = f"recs:{user_id}:{row_count}"
cached = await self.cache.get(cache_key)
if cached:
return self._deserialize(cached)
# Get candidate titles
candidates = await self._get_candidates(user_id, 500)
# Compute features for each
ranked = []
for title in candidates:
features = self._compute_features(user_id, title)
score = await self.model.predict(features)
ranked.append((score, title))
# Sort and return top N
ranked.sort(key=lambda x: x[0], reverse=True)
results = [title for _, title in ranked[:row_count]]
# Cache for 5 minutes
await self.cache.setex(cache_key, 300, self._serialize(results))
return results
Streaming Protocol
Netflix uses adaptive bitrate streaming for optimal viewing experience.
HLS/DASH Manifest
sequenceDiagram
participant D as Device
participant CDN as CDN
participant L as License Server
Note over D: Initial playback request
D->>CDN: GET /title/1234/manifest.m3u8
CDN-->>D: M3U8 with quality levels
Note over D: Parse quality levels
D->>CDN: GET /title/1234/video_4k.m3u8
CDN-->>D: Segment list
D->>CDN: GET /title/1234/video_4k/segment1.ts
CDN-->>D: Video segment
D->>L: GET /widevine/license
L-->>D: DRM license
Note over D: Decode and display
D->>CDN: GET /title/1234/video_4k/segment2.ts
CDN-->>D: Video segment
Adaptive Bitrate Logic
class AdaptiveBitrateController:
    def __init__(self, bandwidth_calculator: BandwidthCalculator):
        self.bandwidth_calculator = bandwidth_calculator
        self.quality_levels = ["4k", "1080p", "720p", "480p", "360p"]  # highest first
        self.current_quality = "720p"  # conservative starting level

    def select_quality(self, buffer_level: float, throughput: float) -> str:
        # Rules-based adaptation
        if buffer_level < 10:  # Buffer running low
            self.current_quality = self._downgrade()
        elif buffer_level > 60:  # Buffer healthy
            self.current_quality = self._upgrade(throughput)
        return self.current_quality

    def _downgrade(self) -> str:
        # Step one level down the ladder, floor at the lowest quality
        current_idx = self.quality_levels.index(self.current_quality)
        if current_idx < len(self.quality_levels) - 1:
            return self.quality_levels[current_idx + 1]
        return self.current_quality

    def _upgrade(self, throughput: float) -> str:
        # Choose the highest quality that fits within 80% of measured throughput
        for quality in self.quality_levels:
            if self._required_throughput(quality) < throughput * 0.8:
                return quality
        return self.quality_levels[-1]
Data Storage
Cassandra for Metadata
CREATE TABLE titles (
title_id UUID PRIMARY KEY,
title_type TEXT, -- 'movie' or 'show'
title_name TEXT,
synopsis TEXT,
release_year INT,
duration_secs INT,
rating TEXT,
genres LIST<TEXT>,
-- Denormalized for query performance
genres_sorted SET<TEXT>,
created_at TIMESTAMP,
updated_at TIMESTAMP
);
CREATE TABLE episodes (
show_id UUID,
season_num INT,
episode_num INT,
episode_id UUID,
title_name TEXT,
duration_secs INT,
synopsis TEXT,
PRIMARY KEY ((show_id), season_num, episode_num)
);
CREATE TABLE title_by_genre (
    genre TEXT,
    release_year INT,
    popularity_score DOUBLE,
    title_id UUID,
    -- title_id in the clustering key keeps rows unique within a genre
    PRIMARY KEY ((genre), popularity_score, release_year, title_id)
) WITH CLUSTERING ORDER BY (popularity_score DESC, release_year DESC, title_id ASC);
EVCache for Playback State
# Playback state cached in EVCache (PlaybackState assumed to be a dataclass)
import asyncio
import json
from dataclasses import asdict

PLAYBACK_STATE_TTL = 7200  # 2 hours

async def get_playback_state(user_id: int, title_id: int) -> PlaybackState:
    key = f"playback:{user_id}:{title_id}"
    cached = await evcache.get(key)
    if cached:
        return PlaybackState(**json.loads(cached))
    # Fetch from database on cache miss
    state = await db.fetch_playback_state(user_id, title_id)
    if state:
        await evcache.setex(key, PLAYBACK_STATE_TTL, json.dumps(asdict(state)))
    return state

async def save_playback_position(user_id: int, title_id: int, position: int):
    # Write to EVCache immediately so reads stay fast
    key = f"playback:{user_id}:{title_id}"
    state = PlaybackState(user_id=user_id, title_id=title_id, position=position)
    await evcache.setex(key, PLAYBACK_STATE_TTL, json.dumps(asdict(state)))
    # Persist to Cassandra asynchronously (write-behind)
    asyncio.create_task(
        db.save_playback_state(user_id, title_id, position)
    )
Global Architecture
Multi-Region Setup
graph TB
subgraph "US-East (Primary)"
ZU[Zuul Gateway]
SVCS_E[Services]
DB_E[(Cassandra)]
end
subgraph "EU-West (Replica)"
ZU2[Zuul Gateway]
SVCS_W[Services]
DB_W[(Cassandra)]
end
subgraph "Asia-Pacific"
ZU3[Zuul Gateway]
SVCS_A[Services]
DB_A[(Cassandra)]
end
SVCS_E <--> DB_E
SVCS_W <--> DB_W
SVCS_A <--> DB_A
ZU --> SVCS_E
ZU2 --> SVCS_W
ZU3 --> SVCS_A
Traffic Routing
Netflix uses latency-based routing to direct users to the nearest region:
class LatencyRouter:
    def route_request(self, user_id: int, service: str) -> str:
        # Get user's last known region
        user_region = self._get_user_region(user_id)
        # Prefer the home region while it is healthy
        if self._is_region_healthy(user_region):
            return f"{service}.{user_region}.netflix.com"
        # Fallback: route to the region with the lowest measured latency
        latencies = self._measure_all_regions(service)
        best_region = min(latencies, key=latencies.get)
        return f"{service}.{best_region}.netflix.com"
def _measure_all_regions(self, service: str) -> Dict[str, float]:
return {
"us-east-1": self._ping(f"{service}.us-east-1.netflix.com"),
"eu-west-1": self._ping(f"{service}.eu-west-1.netflix.com"),
"ap-northeast-1": self._ping(f"{service}.ap-northeast-1.netflix.com")
}
Resilience Patterns
Circuit Breaker
class CircuitBreaker:
def __init__(self, failure_threshold: int = 5, timeout: int = 60):
self.failure_threshold = failure_threshold
self.timeout = timeout
self.failure_count = 0
self.last_failure_time = None
self.state = "closed"
async def call(self, func: Callable, *args, **kwargs):
if self.state == "open":
if time.time() - self.last_failure_time > self.timeout:
self.state = "half-open"
else:
raise CircuitOpenException()
try:
result = await func(*args, **kwargs)
self._on_success()
return result
        except Exception:
            self._on_failure()
            raise
def _on_success(self):
self.failure_count = 0
self.state = "closed"
def _on_failure(self):
self.failure_count += 1
self.last_failure_time = time.time()
if self.failure_count >= self.failure_threshold:
self.state = "open"
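The breaker above can be exercised with a compact synchronous variant (condensed for demonstration; the service name in the error is made up):

```python
import time

class SimpleBreaker:
    """Synchronous circuit breaker for demonstrating state transitions."""
    def __init__(self, failure_threshold=3, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = "closed"

    def call(self, func):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "half-open"  # allow one probe request through
            else:
                raise RuntimeError("circuit open")
        try:
            result = func()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise
        self.failure_count = 0
        self.state = "closed"
        return result

breaker = SimpleBreaker(failure_threshold=3)

def flaky():
    raise ConnectionError("metadata-service unreachable")

for _ in range(3):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

print(breaker.state)  # → open (further calls fail fast without hitting the service)
```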
Bulkheads
Isolate service dependencies to prevent cascading failures:
class BulkheadExecutor:
def __init__(self, max_concurrent: int = 100):
self.semaphore = asyncio.Semaphore(max_concurrent)
async def execute(self, func: Callable, *args, **kwargs):
async with self.semaphore:
return await func(*args, **kwargs)
# Different bulkheads for different service calls
metadata_bulkhead = BulkheadExecutor(max_concurrent=50)
recommendation_bulkhead = BulkheadExecutor(max_concurrent=20)
playback_bulkhead = BulkheadExecutor(max_concurrent=30)
API Design
Streaming Endpoints
| Endpoint | Description |
|---|---|
| GET /api/v1/browse | Get personalized rows |
| GET /api/v1/titles/{id}/metadata | Get title details |
| GET /api/v1/titles/{id}/similars | Similar titles |
| GET /api/v1/playback/session | Initialize playback |
| POST /api/v1/playback/position | Update position |
| GET /api/v1/search?q={query} | Search titles |
Response Example
{
"data": {
"id": "81239481",
"title": "Stranger Things",
"type": "show",
"poster_url": "https://cdn.netflix.com/poster.jpg",
"backdrop_url": "https://cdn.netflix.com/backdrop.jpg",
"rating": "TV-14",
"year": 2024,
"duration": "4 seasons",
"synopsis": "When a young boy vanishes...",
"genres": ["Drama", "Horror", "Sci-Fi"],
"seasons": [
{
"season_num": 1,
"episodes": [
{
"num": 1,
"title": "The Vanishing",
"duration_secs": 3600,
"thumbnail": "https://cdn.netflix.com/s1e1.jpg"
}
]
}
]
},
"meta": {
"request_id": "abc123",
"version": "2.1"
}
}
Conclusion
Netflix’s architecture demonstrates how to build a globally distributed streaming platform:
- Open Connect CDN places content at ISP facilities worldwide
- Adaptive bitrate streaming optimizes quality for each connection
- Microservices enable independent scaling and deployment
- Recommendation algorithms drive content discovery
- Multi-region deployment ensures global availability
DRM and Entitlement Systems
Digital Rights Management (DRM) protects content from unauthorized copying. Netflix uses multiple DRM schemes for different platforms.
DRM Architecture
graph TB
D[Device] -->|1. Initialize| L[License Server]
L -->|2. Device ID + Content Key Request| E[Entitlement Service]
E -->|3. Check subscription| DB[(User DB)]
DB -->|4. Entitlement OK| E
E -->|5. Grant license| L
L -->|6. Encrypted License| D
D -->|7. Decrypt with device key| K[Content Key]
K -->|8. Play| V[Video Decrypt]
Multi-DRM Support
class DRMManager:
"""Handle multiple DRM schemes per platform"""
DRM_SCHEMES = {
"widevine": "com.widevine.alpha", # Android, Chrome, many devices
"playready": "com.microsoft.playready", # Windows, Xbox
"fairplay": "com.apple.fairplay", # iOS, Safari
"clearkey": "org.w3.clearkey" # Web fallback
}
def get_supported_drm(self, device_type: str) -> List[str]:
"""Return DRM schemes supported by device"""
capabilities = {
"android": ["widevine"],
"ios": ["fairplay"],
"web_safari": ["fairplay", "clearkey"],
"web_chrome": ["widevine", "clearkey"],
"windows": ["playready", "widevine", "clearkey"],
"smarttv": ["widevine", "playready"]
}
return capabilities.get(device_type, ["widevine"])
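With those capabilities, license negotiation reduces to intersecting what the device supports with a server-side preference order (a sketch; the priority order here is an assumption for illustration):

```python
from typing import List, Optional

# Server-side preference, most preferred first (illustrative ordering)
SERVER_PRIORITY = ["playready", "widevine", "fairplay", "clearkey"]

def negotiate_drm(device_supported: List[str]) -> Optional[str]:
    """Pick the first server-preferred scheme the device also supports."""
    supported = set(device_supported)
    for scheme in SERVER_PRIORITY:
        if scheme in supported:
            return scheme
    return None  # no common scheme; playback cannot start

print(negotiate_drm(["widevine", "clearkey"]))  # → widevine
print(negotiate_drm(["fairplay"]))              # → fairplay
print(negotiate_drm([]))                        # → None
```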
Entitlement Checking
class EntitlementService:
"""Verify user can access specific content"""
async def check_entitlement(
self,
user_id: int,
title_id: str,
device_type: str
) -> EntitlementResult:
# Get user's subscription tier
subscription = await self.user_service.get_subscription(user_id)
# Get title's required tier
title = await self.metadata_service.get_title(title_id)
if subscription.tier < title.required_tier:
return EntitlementResult(
allowed=False,
reason="subscription_tier_too_low",
upgrade_to=title.required_tier
)
# Check concurrent stream limit
active_streams = await self.playback_service.count_active_streams(user_id)
if active_streams >= subscription.max_streams:
return EntitlementResult(
allowed=False,
reason="max_streams_exceeded",
active_streams=active_streams
)
return EntitlementResult(allowed=True)
Multi-Device Session Management
Users watch Netflix on multiple devices. Sessions must sync playback position and handle concurrent playback limits.
Session State
class PlaybackSession:
"""Represent an active playback session"""
def __init__(
self,
session_id: str,
user_id: int,
title_id: str,
device_type: str,
position_seconds: int,
quality: str
):
self.session_id = session_id
self.user_id = user_id
self.title_id = title_id
self.device_type = device_type
self.position_seconds = position_seconds
self.quality = quality
self.started_at = datetime.utcnow()
self.last_heartbeat = datetime.utcnow()
    async def update_position(self, position_seconds: int):
        """Update playback position"""
        self.position_seconds = position_seconds
        self.last_heartbeat = datetime.utcnow()
        # Persist via module-level clients (playback_store, cache),
        # mirroring the evcache/db pattern used earlier
        await playback_store.save_position(
            self.session_id,
            position_seconds
        )
        # Invalidate the cached resume position for this title
        await cache.delete(f"position:{self.user_id}:{self.title_id}")
Continue Watching Sync
class ContinueWatchingService:
"""Sync playback position across devices"""
async def get_resume_position(
self,
user_id: int,
title_id: str,
requesting_device: str
) -> ResumePosition:
# Check if another device has more recent position
active_sessions = await self.session_manager.get_active_sessions(
user_id,
exclude_device=requesting_device
)
# Find session with this title
for session in active_sessions:
if session.title_id == title_id:
return ResumePosition(
position_seconds=session.position_seconds,
device=session.device_type,
updated_at=session.last_heartbeat
)
# Fallback to database
return await self.playback_store.get_position(user_id, title_id)
    async def handle_playback_start(
        self,
        user_id: int,
        title_id: str,
        device_type: str
    ) -> PlaybackSession:
        # Create new session
        session = PlaybackSession(
            session_id=str(uuid4()),
user_id=user_id,
title_id=title_id,
device_type=device_type,
position_seconds=0,
quality="auto"
)
# Register session
await self.session_manager.register(session)
# Enforce concurrent stream limit
await self._enforce_stream_limit(user_id)
return session
async def _enforce_stream_limit(self, user_id: int):
"""Ensure user hasn't exceeded stream limit"""
subscription = await self.user_service.get_subscription(user_id)
active = await self.session_manager.count_active(user_id)
if active > subscription.max_streams:
# Force oldest session to stop
oldest = await self.session_manager.get_oldest_session(user_id)
await self.session_manager.terminate(oldest.session_id)
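The eviction policy can be exercised with an in-memory registry (a self-contained toy; real sessions would live in a distributed store like EVCache):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Session:
    session_id: str
    started_at: float  # epoch seconds

class SessionRegistry:
    def __init__(self, max_streams: int):
        self.max_streams = max_streams
        self.sessions: List[Session] = []

    def start(self, session: Session) -> None:
        self.sessions.append(session)
        # Enforce the limit by terminating the oldest session(s)
        while len(self.sessions) > self.max_streams:
            oldest = min(self.sessions, key=lambda s: s.started_at)
            self.sessions.remove(oldest)

registry = SessionRegistry(max_streams=2)
registry.start(Session("tv", started_at=100))
registry.start(Session("phone", started_at=200))
registry.start(Session("laptop", started_at=300))  # evicts "tv"

print([s.session_id for s in registry.sessions])  # → ['phone', 'laptop']
```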
Adaptive Bitrate Streaming Deep Dive
ABR Algorithm Details
from typing import List

class AdaptiveBitrateController:
    """Buffer- and throughput-aware bitrate selection (simplified)"""
    def __init__(self):
        # Quality levels (lowest to highest), bandwidth bounds in Mbps
        self.levels = [
            {"name": "360p", "min_bandwidth": 0.7, "max_bandwidth": 1.5},
            {"name": "480p", "min_bandwidth": 1.5, "max_bandwidth": 3},
            {"name": "720p", "min_bandwidth": 3, "max_bandwidth": 5},
            {"name": "1080p", "min_bandwidth": 5, "max_bandwidth": 10},
            {"name": "4K", "min_bandwidth": 10, "max_bandwidth": float('inf')}
        ]
        # State
        self.current_level = 2  # start at 720p, a conservative midpoint
        self.buffer_levels = []       # ring buffer of recent buffer levels
        self.throughput_samples = []  # ring buffer of recent throughput samples

    def calculate_throughput(self, segments: List[Segment]) -> float:
        """Estimate throughput (Mbps) from recent segment downloads"""
        if not segments:
            return 0.0
        # Weight recent samples higher (oldest to newest)
        recent = segments[-4:]
        weights = [0.1, 0.15, 0.25, 0.5][-len(recent):]
        weighted_sum = sum(
            (s.size_mb * 8 / s.download_time) * w  # Mbps for this segment
            for s, w in zip(recent, weights)
        )
        return weighted_sum / sum(weights)

    def select_quality(self, throughput: float, buffer_level: float) -> str:
        """Select optimal quality based on buffer health and throughput"""
        state = self._classify_state(buffer_level)
        if state == "buffer_depleted":
            # Buffer running low - step down to protect against a stall
            return self._select_lower_quality(throughput)
        if state == "buffer_full":
            # Plenty of buffer - step up if throughput allows
            return self._select_quality_for_throughput(throughput)
        # Steady state - hold the current level to avoid flutter
        return self.levels[self.current_level]["name"]

    def _classify_state(self, buffer_level: float) -> str:
        if buffer_level < 10:
            return "buffer_depleted"
        elif buffer_level > 60:
            return "buffer_full"  # enough runway to experiment upward
        else:
            return "steady"
CDN Cache Invalidation
graph TB
A[Content Update] --> B{New encode or metadata change?}
B -->|Metadata| C[Update metadata service]
B -->|New encode| D[Push to origin]
D --> E[Invalidate edge caches]
E --> F[CDN propagates in < 30s]
subgraph "Cache Invalidation Strategy"
C
E
end
class CDNInvalidationService:
"""Handle content cache invalidation across CDN"""
async def invalidate_title(self, title_id: str, reason: str):
"""Invalidate all cached data for a title"""
# Invalidate manifest files
await self.cdn.invalidate(f"/title/{title_id}/*.m3u8")
# Invalidate metadata
await self.cdn.invalidate(f"/api/metadata/{title_id}")
# Invalidate thumbnails
await self.cdn.invalidate(f"/images/{title_id}/*")
# Log invalidation for audit
await self.audit_log.record({
"event": "cache_invalidation",
"title_id": title_id,
"reason": reason,
"timestamp": datetime.utcnow()
})
Production Failure Scenarios
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| CDN origin failure | Video segments unavailable | Multi-CDN; fallback to direct streaming |
| License server down | No new streams can start | Cache licenses; graceful degradation |
| Recommendation service slow | Homepage takes longer to load | Cache recommendations; show stale content |
| Encoding pipeline backlog | New content delayed | Priority encoding; capacity headroom |
| User exceeds stream limit | New streams rejected | Clear error message; upgrade prompt |
Observability Checklist
Metrics to Capture
- stream_startup_time_seconds (histogram) - Time to first frame
- bitrate_selected (histogram) - Quality distribution
- rebuffer_ratio (gauge) - Time spent rebuffering vs playing
- cdn_cache_hit_ratio (gauge) - Cache efficiency
- license_request_latency_ms (histogram) - DRM overhead
- concurrent_streams (gauge) - Active stream count
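The first two quality metrics are derived from playback events; a sketch of how a client might compute them:

```python
def startup_time_seconds(play_clicked_ts: float, first_frame_ts: float) -> float:
    """Time from the user pressing play to the first rendered frame."""
    return first_frame_ts - play_clicked_ts

def rebuffer_ratio(rebuffer_secs: float, playing_secs: float) -> float:
    """Fraction of the session spent stalled instead of playing."""
    total = rebuffer_secs + playing_secs
    return rebuffer_secs / total if total else 0.0

print(startup_time_seconds(10.0, 11.4))  # ≈ 1.4s, within the 2-second target
print(rebuffer_ratio(30, 570))           # → 0.05, right at the 5% alert threshold
```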
Alerts to Configure
| Alert | Threshold | Severity |
|---|---|---|
| Startup time P99 > 3s | 3000ms | Warning |
| Rebuffer ratio > 5% | 5% | Warning |
| CDN cache hit < 90% | 90% | Warning |
| License latency P99 > 200ms | 200ms | Critical |
| Active streams < expected | < 50% baseline | Warning |
Security Checklist
- DRM encryption for all premium content
- Device attestation before issuing licenses
- HDCP enforcement for high-definition outputs
- Screen capture detection and blocking
- Concurrent stream enforcement
- Geographic restrictions per title
- Secure token exchange for session management
- Content signing to prevent tampering
Common Pitfalls / Anti-Patterns
Pitfall 1: Aggressive Quality Switching
Problem: Switching quality too frequently creates a “flutter” effect that’s visually jarring.
Solution: Implement quality stability windows. Once you switch up or down, stay at that level for at least 30 seconds.
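One way to implement the stability window (a sketch; the 30-second hold comes from the guidance above, and the injected clock just makes the example deterministic):

```python
import time

class StableQualitySwitcher:
    """Refuse quality switches until a hold window has elapsed."""

    def __init__(self, hold_seconds: float = 30.0, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self.clock = clock
        self.current = "720p"
        self.last_switch = clock()

    def request_switch(self, target: str) -> str:
        now = self.clock()
        if target != self.current and now - self.last_switch >= self.hold_seconds:
            self.current = target
            self.last_switch = now
        return self.current

# Simulated clock so the example is deterministic
t = [0.0]
switcher = StableQualitySwitcher(hold_seconds=30, clock=lambda: t[0])

t[0] = 5.0
print(switcher.request_switch("1080p"))  # → 720p (still inside the hold window)
t[0] = 35.0
print(switcher.request_switch("1080p"))  # → 1080p (window elapsed, switch allowed)
```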
Pitfall 2: Ignoring Network Variability
Problem: Using average throughput misses spikes and drops.
Solution: Use weighted average that emphasizes recent samples. Build in safety margins (use 70-80% of measured throughput for decisions).
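A sketch of that recency-weighted estimate with a safety margin applied (the weights mirror the ones used earlier in this article; the 75% margin is illustrative):

```python
from typing import List

def estimate_throughput(samples_mbps: List[float], safety: float = 0.75) -> float:
    """Recency-weighted throughput estimate, discounted by a safety margin.

    The newest sample gets the largest weight so a sudden drop is
    reflected quickly, while a single spike is dampened.
    """
    if not samples_mbps:
        return 0.0
    weights = [0.1, 0.15, 0.25, 0.5][-len(samples_mbps):]
    recent = samples_mbps[-len(weights):]
    weighted = sum(s * w for s, w in zip(recent, weights)) / sum(weights)
    return weighted * safety

# A drop in the newest sample pulls the estimate down sharply:
# weighted average of [20, 20, 20, 4] is 12 Mbps, ≈ 9 Mbps after the margin
print(estimate_throughput([20, 20, 20, 4]))
```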
Pitfall 3: Not Testing on Real Networks
Problem: Lab testing does not capture real-world variability (WiFi interference, cellular handoffs).
Solution: A/B test ABR algorithms on real users. Monitor quality distributions and rebuffer ratios in production.
Interview Q&A
Q: Why does Netflix build its own CDN (Open Connect) instead of using existing CDNs?
A: Netflix streams billions of hours monthly. At that scale, commercial CDN costs become prohibitive. Open Connect appliances are custom-built for Netflix’s workload (video streaming, not general web content). They are deployed at ISP facilities worldwide, placing content close to users while reducing Netflix’s backbone costs. The economics only work at Netflix’s scale.
Q: How does adaptive bitrate streaming work?
A: Netflix encodes each title in multiple quality levels (4K, 1080p, 720p, etc.). The device downloads an HLS/DASH manifest listing available quality levels. The client measures download speed and buffer fullness. If buffer runs low, it switches to lower quality. If buffer is healthy and bandwidth is high, it switches up. This happens every few seconds during playback.
Q: How does Netflix enforce concurrent stream limits?
A: The entitlement service tracks active playback sessions per user. When a user starts playback, it counts existing sessions. If the count exceeds the subscription tier limit (e.g., 4 streams for premium), the oldest session is terminated. The device receives an error with an upgrade prompt.
Q: What is the difference between Widevine, PlayReady, and FairPlay?
A: These are DRM schemes for different platforms. Widevine (Google) runs on Android, Chrome, most smart TVs. PlayReady (Microsoft) runs on Windows, Xbox, some smart TVs. FairPlay (Apple) runs on iOS, Safari, Apple TV. Netflix negotiates with the device to select the highest security level the device supports. Content is encrypted once and licenses are delivered via the device’s preferred DRM.
Scenario Drills
Scenario 1: CDN Origin Server Failure
Situation: The CDN origin serving video segments goes down during peak viewing hours.
Analysis:
- Devices requesting segments get errors
- Playback stalls, rebuffering begins
- Millions of concurrent streams affected
Solution: Multi-CDN deployment with automatic failover. If one CDN has issues, traffic routes to another. Open Connect has multiple origin clusters geographically distributed. Devices can switch CDN transparently if segment requests fail.
Scenario 2: New Show Releases Simultaneously Worldwide
Situation: A highly anticipated show releases globally at midnight UTC. 10 million users try to start playback simultaneously.
Analysis:
- Encoding pipeline must complete all quality levels before release
- CDN edge caches start cold for a new title
- License servers receive burst of requests
Solution: Pre-encode content days before release. Pre-position popular titles at ISP locations. License server scales horizontally; licenses are cached for their validity period to reduce server load.
Scenario 3: ABR Algorithm Causes Quality Flutter
Situation: Users report constantly changing video quality, creating a jarring viewing experience.
Analysis:
- ABR switches quality too frequently
- Bandwidth measurements fluctuate (wireless networks)
- Buffer thresholds trigger rapid up/down switching
Solution: Implement stability windows. Once you switch to a quality level, stay there for at least 30 seconds. Use weighted throughput averages that emphasize recent samples. Build in safety margins (use 70% of measured throughput for decisions).
Failure Flow Diagrams
Stream Playback Initialization
graph TD
A[User Clicks Play] --> B[Get Manifest from CDN]
B --> C[Parse Quality Levels]
C --> D[Select Initial Quality]
D --> E[Request Video Segment]
E --> F{Segment Available?}
F -->|No| G[Try Alternative CDN]
F -->|Yes| H[Download Segment]
G --> E
H --> I[Request DRM License]
I --> J[License Server]
J --> K{Check Entitlement?}
K -->|No| L[Return Error]
K -->|Yes| M[Return License Key]
M --> N[Decrypt Segment]
N --> O[Decode and Display]
O --> P[Buffer Next Segment]
P --> E
ABR Quality Selection
graph TD
A[Monitor Buffer Level] --> B{Buffer < 10s?}
B -->|Yes| C[Select Lower Quality]
B -->|No| D{Buffer > 60s?}
D -->|Yes| E[Select Higher Quality]
D -->|No| F[Maintain Current Quality]
C --> G[Download at New Quality]
E --> G
F --> H[Continue Current Quality]
G --> H
H --> A
Multi-CDN Failover
graph TD
A[Request Segment] --> B[Primary CDN]
B --> C{Response OK?}
C -->|Yes| D[Deliver to Device]
C -->|No| E[Try Secondary CDN]
E --> F{Response OK?}
F -->|Yes| G[Deliver to Device]
F -->|No| H[Fallback to Origin Direct]
H --> I{Origin Available?}
I -->|Yes| J[Deliver to Device]
I -->|No| K[Show Error]
D --> L[Report CDN Health]
G --> L
Quick Recap
- Netflix’s Open Connect CDN places servers at ISP locations worldwide.
- Adaptive bitrate streaming selects quality based on bandwidth and buffer health.
- DRM (Widevine, PlayReady, FairPlay) protects content on each platform.
- Entitlement service enforces subscription tier and concurrent stream limits.
- Session sync allows “continue watching” across devices.
Copy/Paste Checklist
- [ ] Implement ABR with buffer-aware quality selection
- [ ] Use multi-CDN with failover
- [ ] Cache CDN responses aggressively
- [ ] Enforce concurrent stream limits per subscription
- [ ] Implement session sync for continue watching
- [ ] Monitor stream startup time and rebuffer ratio
- [ ] Test ABR on real networks, not just labs
For more on CDN design, see our CDN Deep Dive guide. For database strategies, see NoSQL Databases. For caching patterns, see Distributed Caching.