System Design: Real-Time Chat System Architecture
Deep dive into chat system design. Learn about WebSocket management, message delivery, presence systems, scalability patterns, and E2E encryption.
Chat systems like WhatsApp, Slack, and Discord handle millions of concurrent users sending messages in real-time. The challenges are fundamentally different from request-response web applications: persistent connections, low-latency delivery, and handling offline users.
This case study walks through designing a scalable chat system.
Requirements Analysis
Functional Requirements
Users should be able to:
- Send and receive messages in real-time
- Create and join chat rooms or direct messages
- See who is online (presence)
- Receive push notifications when offline
- Send text, images, and files
- Search through message history
Non-Functional Requirements
The system needs:
- Message delivery under 100ms for online users
- Support 100,000+ concurrent connections per server
- Handle 10 million daily active users
- Store message history indefinitely
- Maintain 99.99% availability
Architecture Overview
graph TB
A[Mobile App] -->|WSS| B[WebSocket Gateway]
A -->|HTTPS| C[REST API Gateway]
B --> D[Message Router]
D --> E[Message Store]
D --> F[Presence Service]
D --> G[Notification Service]
C --> E
C --> H[User Service]
E --> I[(Database)]
G --> J[(Push Service)]
G --> K[(Queue)]
WebSocket Connections
WebSockets provide persistent bidirectional connections, essential for real-time chat.
Connection Flow
sequenceDiagram
participant C as Client
participant W as WebSocket Gateway
participant R as Message Router
participant M as Message Store
C->>W: CONNECT {auth_token}
W->>W: Validate token
W->>W: Extract user_id
W->>R: Register connection {user_id, connection_id}
R-->>W: Connection registered
W-->>C: Connection established
C->>W: SEND {recipient_id, content}
W->>R: Route message {from, to, content}
R->>M: Store message
M-->>R: Message stored
R->>R: Find recipient connection
R->>W: Deliver to recipient
W-->>C: MESSAGE {from, content}
Note over C,W: Bidirectional messages flow through router
WebSocket Server Implementation
import asyncio
import json
import websockets
from dataclasses import dataclass
from typing import Dict, Set
from uuid import uuid4

@dataclass
class Connection:
    user_id: str
    connection_id: str
    websocket: websockets.WebSocketServerProtocol
    subscribed_rooms: Set[str]

class WebSocketGateway:
    def __init__(self, message_router: MessageRouter):
        self.router = message_router
        self.connections: Dict[str, Connection] = {}
        self.user_connections: Dict[str, Set[str]] = {}

    async def handle_connection(self, websocket: websockets.WebSocketServerProtocol, path: str):
        # Authenticate
        token = websocket.request_headers.get("Authorization", "").replace("Bearer ", "")
        user_id = await self.authenticate(token)
        if not user_id:
            await websocket.close(4001, "Authentication failed")
            return

        connection_id = str(uuid4())
        connection = Connection(
            user_id=user_id,
            connection_id=connection_id,
            websocket=websocket,
            subscribed_rooms=set()
        )

        # Register connection
        self.connections[connection_id] = connection
        if user_id not in self.user_connections:
            self.user_connections[user_id] = set()
        self.user_connections[user_id].add(connection_id)

        # Register with router
        await self.router.register_connection(user_id, connection_id)

        try:
            # Handle incoming messages
            async for message in websocket:
                await self.handle_message(connection, message)
        except websockets.ConnectionClosed:
            pass
        finally:
            await self.cleanup(connection)

    async def handle_message(self, connection: Connection, raw_message: str):
        message = json.loads(raw_message)
        msg_type = message.get("type")

        if msg_type == "message":
            await self.router.route_message(
                from_user=connection.user_id,
                to_user=message["recipient_id"],
                content=message["content"],
                connection_id=connection.connection_id
            )
        elif msg_type == "subscribe":
            room_id = message["room_id"]
            connection.subscribed_rooms.add(room_id)
            await self.router.subscribe_to_room(connection.user_id, room_id)
        elif msg_type == "typing":
            await self.router.send_typing_indicator(
                from_user=connection.user_id,
                to_user=message["recipient_id"]
            )
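The `cleanup` call above is referenced but not shown. A minimal sketch of the bookkeeping it must undo (an in-memory stand-in with hypothetical names, mirroring the registration logic) could look like:

```python
from typing import Dict, Set

class ConnectionRegistry:
    """In-memory bookkeeping mirroring the gateway's register/cleanup steps."""
    def __init__(self):
        self.connections: Dict[str, str] = {}          # connection_id -> user_id
        self.user_connections: Dict[str, Set[str]] = {}

    def register(self, user_id: str, connection_id: str):
        self.connections[connection_id] = user_id
        self.user_connections.setdefault(user_id, set()).add(connection_id)

    def cleanup(self, connection_id: str) -> bool:
        """Remove a connection; return True if the user has no connections left."""
        user_id = self.connections.pop(connection_id, None)
        if user_id is None:
            return False
        conns = self.user_connections.get(user_id, set())
        conns.discard(connection_id)
        if not conns:
            # Last device disconnected: the caller should mark the user offline
            self.user_connections.pop(user_id, None)
            return True
        return False

registry = ConnectionRegistry()
registry.register("alice", "c1")
registry.register("alice", "c2")
print(registry.cleanup("c1"))  # False - alice still has c2 open
print(registry.cleanup("c2"))  # True - last connection gone, mark offline
```

The boolean return matters: a user with two devices should not flip to offline when only one of them disconnects.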
Message Routing
The message router is the heart of the chat system, handling delivery to online and offline users.
Message Router Service
class MessageRouter:
    def __init__(
        self,
        message_store: MessageStore,
        presence_service: PresenceService,
        notification_service: NotificationService
    ):
        self.message_store = message_store
        self.presence_service = presence_service
        self.notification_service = notification_service
        self.online_connections: Dict[str, Set[str]] = {}

    async def register_connection(self, user_id: str, connection_id: str):
        if user_id not in self.online_connections:
            self.online_connections[user_id] = set()
        self.online_connections[user_id].add(connection_id)
        await self.presence_service.set_online(user_id)

    async def route_message(
        self,
        from_user: str,
        to_user: str,
        content: str,
        connection_id: str = None
    ):
        # Store message first (durability)
        message = await self.message_store.store(
            from_user=from_user,
            to_user=to_user,
            content=content
        )

        # Deliver to online recipients; no push notification needed
        if to_user in self.online_connections:
            for conn_id in self.online_connections[to_user]:
                await self.deliver_message(conn_id, message)
        else:
            # User offline - send push notification
            await self.notification_service.send_push_notification(
                user_id=to_user,
                message=message
            )

        # Acknowledge to sender if online
        if connection_id:
            await self.send_ack(connection_id, message)
        return message

    async def deliver_message(self, connection_id: str, message: Message):
        gateway = self.get_gateway_for_connection(connection_id)
        await gateway.send_to_connection(connection_id, {
            "type": "message",
            "message_id": message.id,
            "from": message.from_user,
            "content": message.content,
            "timestamp": message.created_at.isoformat()
        })
Message Storage
Messages must be stored durably and remain efficiently queryable.
Database Schema
-- Messages table
CREATE TABLE messages (
    id BIGSERIAL PRIMARY KEY,
    conversation_id TEXT NOT NULL,
    sender_id BIGINT NOT NULL,
    content TEXT NOT NULL,
    content_type TEXT DEFAULT 'text', -- text, image, file
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    delivered_at TIMESTAMP WITH TIME ZONE,
    read_at TIMESTAMP WITH TIME ZONE
);

-- PostgreSQL does not support inline INDEX clauses; create indexes separately
CREATE INDEX idx_messages_conversation ON messages (conversation_id, created_at DESC);
CREATE INDEX idx_messages_sender ON messages (sender_id, created_at DESC);

-- Conversations table
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    type TEXT NOT NULL, -- direct, group
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

CREATE INDEX idx_conversations_updated ON conversations (updated_at DESC);

-- Conversation members
CREATE TABLE conversation_members (
    conversation_id TEXT NOT NULL REFERENCES conversations(id),
    user_id BIGINT NOT NULL,
    joined_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    last_read_at TIMESTAMP WITH TIME ZONE,
    unread_count BIGINT DEFAULT 0,
    PRIMARY KEY (conversation_id, user_id)
);

-- Index for finding user's conversations
CREATE INDEX idx_members_user ON conversation_members (user_id);
Message Store Service
import json
from datetime import datetime
from typing import List

class MessageStore:
    def __init__(self, db: Database, cache: RedisCache):
        self.db = db
        self.cache = cache

    def generate_conversation_id(self, user1: str, user2: str) -> str:
        # Consistent ordering for DMs
        sorted_users = sorted([user1, user2])
        return f"dm:{sorted_users[0]}:{sorted_users[1]}"

    async def store(
        self,
        from_user: str,
        to_user: str,
        content: str,
        conversation_id: str = None
    ) -> Message:
        if not conversation_id:
            conversation_id = self.generate_conversation_id(from_user, to_user)

        # Ensure conversation exists
        await self._ensure_conversation_exists(conversation_id)

        # Insert message
        row = await self.db.fetch_one("""
            INSERT INTO messages (conversation_id, sender_id, content)
            VALUES ($1, $2, $3)
            RETURNING *
        """, conversation_id, from_user, content)
        message = Message(**row)

        # Update conversation timestamp
        await self.db.execute("""
            UPDATE conversations
            SET updated_at = NOW()
            WHERE id = $1
        """, conversation_id)

        # Invalidate cache
        await self.cache.delete(f"conversation:{conversation_id}:messages")
        return message

    async def get_messages(
        self,
        conversation_id: str,
        limit: int = 50,
        before: datetime = None
    ) -> List[Message]:
        # Check cache first (only the first page is cached)
        cache_key = f"conversation:{conversation_id}:messages"
        if not before:
            cached = await self.cache.get(cache_key)
            if cached:
                return [Message(**m) for m in json.loads(cached)]

        # Query database
        query = """
            SELECT * FROM messages
            WHERE conversation_id = $1
        """
        params = [conversation_id]
        if before:
            query += " AND created_at < $2"
            params.append(before)
        # LIMIT must use the NEXT positional placeholder, not the current count
        query += f" ORDER BY created_at DESC LIMIT ${len(params) + 1}"
        params.append(limit)

        rows = await self.db.fetch(query, *params)
        messages = [Message(**row) for row in rows]

        # Cache if first page
        if not before:
            await self.cache.setex(
                cache_key,
                300,
                json.dumps([m.dict() for m in messages])
            )
        return messages
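The conversation-ID scheme is order-independent, so both participants of a DM derive the same ID. A quick standalone check of that property:

```python
def generate_conversation_id(user1: str, user2: str) -> str:
    # Sorting makes the ID identical regardless of who initiates the DM
    sorted_users = sorted([user1, user2])
    return f"dm:{sorted_users[0]}:{sorted_users[1]}"

print(generate_conversation_id("bob", "alice"))  # dm:alice:bob
print(generate_conversation_id("alice", "bob"))  # dm:alice:bob
```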
Presence System
The presence system shows which users are online and what their activity status is.
Presence Service
class PresenceService:
    def __init__(self, redis: Redis):
        self.redis = redis
        self.PRESENCE_TTL = 60  # Seconds; refreshed by heartbeats

    async def set_online(self, user_id: str):
        key = f"presence:{user_id}"
        await self.redis.setex(key, self.PRESENCE_TTL, json.dumps({
            "status": "online",
            "last_seen": datetime.utcnow().isoformat()
        }))
        # Publish presence change
        await self.redis.publish("presence_updates", json.dumps({
            "user_id": user_id,
            "status": "online"
        }))

    async def set_offline(self, user_id: str):
        key = f"presence:{user_id}"
        await self.redis.setex(key, self.PRESENCE_TTL, json.dumps({
            "status": "offline",
            "last_seen": datetime.utcnow().isoformat()
        }))
        await self.redis.publish("presence_updates", json.dumps({
            "user_id": user_id,
            "status": "offline"
        }))

    async def get_presence(self, user_id: str) -> Dict:
        key = f"presence:{user_id}"
        data = await self.redis.get(key)
        if data:
            return json.loads(data)
        return {"status": "offline"}

    async def get_bulk_presence(self, user_ids: List[str]) -> Dict[str, Dict]:
        keys = [f"presence:{uid}" for uid in user_ids]
        values = await self.redis.mget(keys)
        result = {}
        for uid, value in zip(user_ids, values):
            if value:
                result[uid] = json.loads(value)
            else:
                result[uid] = {"status": "offline"}
        return result
Heartbeat Mechanism
class PresenceHeartbeat:
    def __init__(self, presence_service: PresenceService):
        self.presence_service = presence_service
        self.heartbeat_interval = 30  # Must be shorter than PRESENCE_TTL

    async def start_heartbeat(self, websocket, user_id: str):
        while True:
            try:
                await asyncio.sleep(self.heartbeat_interval)
                await websocket.send(json.dumps({"type": "heartbeat_ack"}))
                await self.presence_service.refresh_heartbeat(user_id)
            except websockets.ConnectionClosed:
                await self.presence_service.set_offline(user_id)
                break

# Defined on PresenceService: extend the TTL on each heartbeat
async def refresh_heartbeat(self, user_id: str):
    key = f"presence:{user_id}"
    data = {
        "status": "online",
        "last_seen": datetime.utcnow().isoformat()
    }
    await self.redis.setex(key, self.PRESENCE_TTL, json.dumps(data))
Push Notifications
When users are offline, push notifications deliver new messages.
Notification Service
class NotificationService:
    def __init__(
        self,
        push_service: PushService,
        user_service: UserService
    ):
        self.push_service = push_service
        self.user_service = user_service

    async def send_push_notification(
        self,
        user_id: str,
        message: Message
    ):
        # Get user's push tokens
        tokens = await self.user_service.get_push_tokens(user_id)
        if not tokens:
            return

        # Get sender info
        sender = await self.user_service.get_user(message.from_user)

        # Send to all user's devices
        await asyncio.gather(*[
            self.push_service.send(
                token=token,
                notification={
                    "title": f"Message from {sender.display_name}",
                    "body": message.content[:100],
                    "data": {
                        "type": "message",
                        "conversation_id": message.conversation_id,
                        "message_id": message.id
                    }
                }
            )
            for token in tokens
        ])
Scalability Patterns
Connection Distribution
graph TB
L[Load Balancer]
W1[WS Server 1<br/>10K connections]
W2[WS Server 2<br/>10K connections]
W3[WS Server 3<br/>10K connections]
W4[WS Server N<br/>10K connections]
L -->|Route| W1
L -->|Route| W2
L -->|Route| W3
L -->|Route| W4
subgraph "Message Bus"
K[Kafka/RabbitMQ]
end
W1 --> K
W2 --> K
W3 --> K
W4 --> K
Horizontal Scaling
import hashlib
from bisect import bisect_left
from typing import List

class ConsistentHashRouter:
    def __init__(self, nodes: List[str]):
        self.ring = {}
        self.sorted_keys = []
        self._build_ring(nodes)

    def _build_ring(self, nodes: List[str]):
        for node in nodes:
            for i in range(100):  # Virtual nodes
                key = self._hash(f"{node}:{i}")
                self.ring[key] = node
        self.sorted_keys = sorted(self.ring.keys())

    def get_node(self, key: str) -> str:
        hash_key = self._hash(key)
        idx = bisect_left(self.sorted_keys, hash_key) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Usage: Route message to the server holding recipient's connection
async def route_to_user(user_id: str, message: Message):
    server = consistent_hash.get_node(user_id)
    await kafka.send("messageDelivery", {
        "server": server,
        "user_id": user_id,
        "message": message.dict()
    })
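The key property of consistent hashing is worth verifying: the same user always maps to the same server, and removing a node remaps only the keys that lived on it. A self-contained demo (repeating the ring logic so it runs standalone; names like `HashRing` and `ws-1` are illustrative):

```python
import hashlib
from bisect import bisect_left
from typing import List

class HashRing:
    """Minimal consistent-hash ring (same scheme as ConsistentHashRouter)."""
    def __init__(self, nodes: List[str], vnodes: int = 100):
        self.ring = {}
        for node in nodes:
            for i in range(vnodes):
                self.ring[self._hash(f"{node}:{i}")] = node
        self.sorted_keys = sorted(self.ring)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        idx = bisect_left(self.sorted_keys, self._hash(key)) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

full = HashRing(["ws-1", "ws-2", "ws-3"])
smaller = HashRing(["ws-1", "ws-2"])  # ws-3 removed

# The same user always lands on the same server
assert full.get_node("user:42") == full.get_node("user:42")

# Keys that lived on surviving nodes stay put when ws-3 disappears
moved = sum(
    1 for uid in range(1000)
    if full.get_node(f"user:{uid}") in ("ws-1", "ws-2")
    and smaller.get_node(f"user:{uid}") != full.get_node(f"user:{uid}")
)
print(moved)  # 0 - only ws-3's keys get redistributed
```

This stability under membership change is exactly why naive `hash(user_id) % N` routing is avoided: with modulo routing, removing one server remaps nearly every key.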
End-to-End Encryption
For privacy, implement end-to-end encryption (E2EE).
Signal Protocol Overview
graph TB
A[Alice Key Pair] -->|Identity Key| PAB[Public Address Book]
B[Bob Key Pair] -->|Identity Key| PAB
PAB -->|Bob's Identity Key| AS[Alice's Device]
PAB -->|Alice's Identity Key| BS[Bob's Device]
AS -->|Pre-Key Bundle| BS
BS -->|Pre-Key Bundle| AS
AS <-->|X3DH Key Exchange| BS
BS <-->|Double Ratchet| AS
Key Exchange Implementation
class KeyExchange:
    def generate_key_pair(self) -> Tuple[bytes, bytes]:
        private_key = Curve25519.generate_private()
        public_key = Curve25519.public_from_private(private_key)
        return private_key, public_key

    def generate_prekey_bundle(
        self,
        identity_private: bytes,
        identity_public: bytes
    ) -> PrekeyBundle:
        signed_prekey_private, signed_prekey_public = self.generate_key_pair()
        one_time_prekey_private, one_time_prekey_public = self.generate_key_pair()

        # Sign the signed prekey with the long-term identity key
        signature = self.sign(identity_private, signed_prekey_public)

        return PrekeyBundle(
            identity_public=identity_public,
            signed_prekey_public=signed_prekey_public,
            signed_prekey_signature=signature,
            one_time_prekey_public=one_time_prekey_public
        )

    async def x3dh_key_agreement(
        self,
        my_identity_private: bytes,
        their_bundle: PrekeyBundle
    ) -> bytes:
        # X3DH performs four DH computations, mixing a fresh ephemeral key
        # with both parties' long-term and pre-keys
        ephemeral_private, ephemeral_public = self.generate_key_pair()
        dh1 = Curve25519.diffie_hellman(my_identity_private, their_bundle.signed_prekey_public)
        dh2 = Curve25519.diffie_hellman(ephemeral_private, their_bundle.identity_public)
        dh3 = Curve25519.diffie_hellman(ephemeral_private, their_bundle.signed_prekey_public)
        dh4 = Curve25519.diffie_hellman(ephemeral_private, their_bundle.one_time_prekey_public)

        # Combine DH outputs
        shared_secret = dh1 + dh2 + dh3 + dh4
        return HKDF(shared_secret, b"X3DH")
Message Encryption
class MessageEncryptor:
    def __init__(self, root_key: bytes):
        self.root_key = root_key

    def encrypt_message(
        self,
        message: str,
        chain_key: bytes,
        ratchet_key: bytes
    ) -> EncryptedMessage:
        # Derive message key from chain key
        message_key = HKDF(chain_key, b"MessageKey")

        # Encrypt with AES-GCM
        ciphertext = AESGCM.encrypt(message_key, message)

        # Advance chain key for forward secrecy
        next_chain_key = HKDF(chain_key, b"ChainKey")

        # The message key is never transmitted; only the ciphertext leaves the device
        return EncryptedMessage(
            ciphertext=ciphertext,
            next_chain_key=next_chain_key
        )
API Design
REST Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/conversations | List user’s conversations |
| GET | /api/v1/conversations/:id/messages | Get messages |
| POST | /api/v1/conversations | Create DM or group |
| POST | /api/v1/conversations/:id/messages | Send message (offline) |
| GET | /api/v1/presence/:user_id | Get user presence |
| PUT | /api/v1/presence | Update own presence |
| POST | /api/v1/devices | Register push token |
WebSocket Messages
| Type | Direction | Payload |
|---|---|---|
| message | bidirectional | {to, content} or {from, content} |
| ack | server to client | {message_id} |
| typing | bidirectional | {to} or {from} |
| presence | server to client | {user_id, status} |
| heartbeat | bidirectional | {} |
| subscribe | client to server | {room_id} |
Conclusion
Building a scalable chat system requires:
- WebSocket gateways for persistent connections
- Message routers that handle online and offline delivery
- Durable message storage with efficient queries
- Presence tracking with heartbeat mechanism
- Push notifications for offline users
- Consistent hashing for horizontal scaling
When to Use / When Not to Use
| Scenario | Recommendation |
|---|---|
| Real-time messaging apps | Use WebSockets with message queue |
| Team collaboration tools | Add presence and typing indicators |
| Customer support chat | Add conversation persistence and search |
| High-volume broadcast (1:N) | Use pub/sub message broker fanout |
When TO Use WebSocket Scaling with Sticky Sessions
- Session affinity important: When connection state is local to server and expensive to rebuild
- Moderate scale: When you can manage server affinity at load balancer level
- Simpler operations: When you prefer simpler debugging and tracing
When NOT to Use Sticky Sessions
- Very high scale: When you need to scale to thousands of servers
- Geographic distribution: When users connect to nearest region, not a specific server
- Transparent failover: When connection should survive server restarts
Offline Sync and Conflict Resolution
When users come back online after being offline, their device must sync missed messages without creating duplicates or conflicts.
Conflict Resolution Strategies
class ConflictResolver:
    """Resolve message sync conflicts using message IDs and timestamps"""
    def __init__(self):
        self.vector_clocks = {}  # {message_id: vector_clock}

    def resolve_messages(
        self,
        local_messages: List[Message],
        server_messages: List[Message]
    ) -> List[Message]:
        """Merge local and server messages, removing duplicates"""
        all_messages = {}
        for msg in local_messages + server_messages:
            msg_id = msg.id
            if msg_id not in all_messages:
                all_messages[msg_id] = msg
            else:
                # Conflict - same ID, possibly different content
                winner = self._resolve_conflict(all_messages[msg_id], msg)
                all_messages[msg_id] = winner
        return sorted(all_messages.values(), key=lambda m: m.timestamp)

    def _resolve_conflict(self, msg1: Message, msg2: Message) -> Message:
        """Timestamp tiebreak: the later message wins"""
        if msg2.timestamp > msg1.timestamp:
            return msg2
        return msg1
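The merge logic above can be exercised end-to-end with a minimal stand-in message type (the `Msg` dataclass and `merge` helper here are illustrative, compressing the resolver into one function):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Msg:
    id: str
    content: str
    timestamp: float

def merge(local: List[Msg], server: List[Msg]) -> List[Msg]:
    """Dedup by message ID; on conflict the later timestamp wins."""
    merged = {}
    for m in local + server:
        if m.id not in merged or m.timestamp > merged[m.id].timestamp:
            merged[m.id] = m
    return sorted(merged.values(), key=lambda m: m.timestamp)

local = [Msg("m1", "hi", 1.0), Msg("m2", "draft", 2.0)]
server = [Msg("m2", "edited", 3.0), Msg("m3", "reply", 4.0)]
result = merge(local, server)
print([m.content for m in result])  # ['hi', 'edited', 'reply']
```

Note that `m2` appears once in the output: the server's later edit replaced the stale local copy, and no duplicate was created.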
Message Sync Protocol
sequenceDiagram
participant D as Device (Offline)
participant S as Sync Server
D->>D: User comes online
D->>S: GET /sync?last_sync=1709123456
S-->>D: {messages: [...], cursor: "abc123"}
loop Until Caught Up
D->>S: GET /sync?last_sync=cursor
S-->>D: {messages: [...], cursor: "def456"}
Note over D: Insert messages, update cursor
end
D->>D: Display "You're all caught up"
Offline Queue Management
class OfflineMessageQueue:
    """Queue messages sent while offline, sync when online"""
    def __init__(self, db: Database):
        self.db = db

    async def queue_message(self, user_id: int, message: Message):
        """Queue message sent offline"""
        await self.db.execute("""
            INSERT INTO offline_outbox (user_id, message_json, created_at)
            VALUES ($1, $2, NOW())
        """, user_id, json.dumps(message.dict()))

    async def sync_pending(self, user_id: int) -> List[Message]:
        """Get and delete pending messages"""
        rows = await self.db.fetch("""
            SELECT message_json FROM offline_outbox
            WHERE user_id = $1
            ORDER BY created_at ASC
        """, user_id)
        await self.db.execute("""
            DELETE FROM offline_outbox WHERE user_id = $1
        """, user_id)
        return [Message(**json.loads(row['message_json'])) for row in rows]
WebSocket Scaling Patterns
Sticky Sessions (Session Affinity)
graph TB
L[Load Balancer<br/>Sticky Enabled]
W1[WS Server 1]
W2[WS Server 2]
W3[WS Server N]
L -->|JSESSIONID=W1| W1
L -->|JSESSIONID=W2| W2
L -->|JSESSIONID=W3| W3
The same user always routes to the same server.
Pros: Simple to implement, no shared state needed.
Cons: A server restart disconnects its users, and scaling is harder.
Broker Fanout Pattern
graph TB
L[Load Balancer]
W1[WS Server 1]
W2[WS Server 2]
W3[WS Server N]
K[(Redis Pub/Sub)]
L --> W1
L --> W2
L --> W3
W1 --> K
W2 --> K
W3 --> K
K -->|Fanout| W1
K -->|Fanout| W2
K -->|Fanout| W3
A message is published once, and every subscribed server receives a copy.
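Setting Redis specifics aside, the fanout pattern itself can be sketched with an in-memory broker (a hypothetical single-process stand-in; in production the publish crosses the network to every gateway):

```python
from collections import defaultdict
from typing import Callable, Dict, List

class Broker:
    """In-memory stand-in for Redis pub/sub: publish once, every subscriber gets a copy."""
    def __init__(self):
        self.subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]):
        self.subscribers[channel].append(handler)

    def publish(self, channel: str, payload: str) -> int:
        # One publish fans out to every registered handler on the channel
        for handler in self.subscribers[channel]:
            handler(payload)
        return len(self.subscribers[channel])

broker = Broker()
delivered = []
# Two gateway servers subscribe to the same conversation channel
broker.subscribe("conversation:42", lambda msg: delivered.append(("ws-1", msg)))
broker.subscribe("conversation:42", lambda msg: delivered.append(("ws-2", msg)))
count = broker.publish("conversation:42", "hello")
print(count, delivered)  # 2 [('ws-1', 'hello'), ('ws-2', 'hello')]
```

Each gateway then filters for the connections it actually holds, so the sender never needs to know which server the recipient is attached to.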
Hybrid Architecture (Production Recommended)
class HybridWebSocketGateway:
    """
    Combines sticky sessions for connection
    with broker fanout for message delivery
    """
    def __init__(
        self,
        redis: Redis,
        message_router: MessageRouter,
        consistent_hash: ConsistentHashRouter
    ):
        self.redis = redis
        self.router = message_router
        self.hash = consistent_hash

    async def route_message(self, message: Message):
        """
        Route message through pub/sub so all
        server instances receive it
        """
        # Publish to Redis channel for this conversation
        channel = f"conversation:{message.conversation_id}"
        await self.redis.publish(channel, json.dumps(message.dict()))

    async def subscribe(self, connection_id: str, conversation_ids: List[str]):
        """Subscribe to message channels"""
        pubsub = self.redis.pubsub()
        for conv_id in conversation_ids:
            await pubsub.subscribe(f"conversation:{conv_id}")
        # Listen in background, deliver to connection
        asyncio.create_task(self._pubsub_listener(connection_id, pubsub))
Message Ordering Guarantees
Total Order vs Causal Order
| Guarantee | Description | Implementation |
|---|---|---|
| Total Order | All messages in exact global order | Single sequencer; Lamport clocks |
| Causal Order | Causally related messages in order; unrelated can be interleaved | Vector clocks |
| FIFO per sender | Messages from same sender arrive in send order | Sender sequence numbers |
class MessageOrdering:
    """Implement causal ordering with vector clocks"""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.vector_clock = {node_id: 0}

    def increment_clock(self):
        self.vector_clock[self.node_id] += 1
        return self.vector_clock.copy()

    def send_message(self, content: str) -> Message:
        clock = self.increment_clock()
        return Message(
            content=content,
            vector_clock=clock,
            sender=self.node_id
        )

    def receive_message(self, message: Message) -> bool:
        """Return True if message can be delivered now (causal ordering)"""
        # Check BEFORE merging clocks; merging first would hide missing messages
        if not self._check_causal_delivery(message):
            return False  # Buffer and retry once earlier messages arrive

        # Merge the sender's clock into ours
        for node, val in message.vector_clock.items():
            self.vector_clock[node] = max(
                self.vector_clock.get(node, 0),
                val
            )
        # Increment for the receive event
        self.vector_clock[self.node_id] += 1
        return True

    def _check_causal_delivery(self, message: Message) -> bool:
        """Verify we have all causally preceding messages"""
        for node, val in message.vector_clock.items():
            if node == self.node_id:
                continue
            if self.vector_clock.get(node, 0) < val - 1:
                return False  # Missing a preceding message
        return True
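The hold-until-ready behavior can be demonstrated with a self-contained simplification (field names like `VMsg` and `CausalReceiver` are illustrative): a message is held back until its causal predecessor has arrived.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class VMsg:
    sender: str
    content: str
    clock: Dict[str, int]

class CausalReceiver:
    """Deliver messages only after all causally preceding ones (simplified)."""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.clock: Dict[str, int] = {}

    def deliverable(self, msg: VMsg) -> bool:
        # Sender's counter must be exactly the next one we expect;
        # anything else the sender had seen, we must have seen too.
        for node, val in msg.clock.items():
            if node == msg.sender:
                if val != self.clock.get(node, 0) + 1:
                    return False
            elif self.clock.get(node, 0) < val:
                return False
        return True

    def deliver(self, msg: VMsg):
        # Merge the sender's clock only on actual delivery
        for node, val in msg.clock.items():
            self.clock[node] = max(self.clock.get(node, 0), val)

bob = CausalReceiver("B")
m1 = VMsg("A", "hello", {"A": 1})
m2 = VMsg("A", "follow-up", {"A": 2})
print(bob.deliverable(m2))  # False - m1 hasn't arrived yet, buffer m2
bob.deliver(m1)
print(bob.deliverable(m2))  # True - predecessor seen, safe to deliver
```

This is FIFO-per-sender plus cross-sender causality, which is what chat users actually perceive as "in order".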
Production Failure Scenarios
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| WebSocket gateway crash | Users disconnect, must reconnect | Run N+1 instances; health checks; client auto-reconnect |
| Message queue backlog | Messages delayed, offline users fall further behind | Priority queues; max backlog limits; batch catchup |
| Presence cache failure | Users appear offline incorrectly | Redis cluster with replication; fallback to DB query |
| Push notification service down | Offline users don’t get notified | Queue notifications; retry with backoff; SMS fallback |
| Database replica lag | Message history incomplete | Read from primary for recent messages; accept lag for history |
| Device conflict on sync | Duplicate messages or lost messages | Vector clocks; idempotent message IDs; conflict resolution |
Capacity Estimation
Connection Capacity
Per WebSocket server (4 vCPU, 8GB RAM):
- Max connections: 10,000-50,000
- Messages/second: 1,000-5,000
For 100,000 concurrent users:
- Minimum servers: 100,000 / 10,000 = 10 servers
- With redundancy (2x): 20 servers
- Peak message rate: 100,000 users × 1 msg/user/10sec = 10,000 msg/s
Storage Estimation
Messages per day: 10M DAU × 10 messages/day = 100M messages
Message size (avg): 500 bytes + 200 bytes metadata = 700 bytes
Daily storage: 100M × 700 = 70 GB
5-year storage: 70GB × 365 × 5 = 127 TB
With 3x replication: ~380 TB
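The storage estimate above is easy to sanity-check in a few lines (five years of 70 GB/day comes to about 128 TB, matching the figure above within rounding):

```python
# Back-of-envelope numbers from the estimates above
dau = 10_000_000
messages_per_user_per_day = 10
bytes_per_message = 700  # 500 content + 200 metadata

daily_messages = dau * messages_per_user_per_day
daily_storage_gb = daily_messages * bytes_per_message / 1e9
five_year_tb = daily_storage_gb * 365 * 5 / 1000
replicated_tb = five_year_tb * 3

print(f"{daily_messages:,} messages/day")              # 100,000,000 messages/day
print(f"{daily_storage_gb:.0f} GB/day")                # 70 GB/day
print(f"{five_year_tb:.0f} TB over 5 years")           # 128 TB over 5 years
print(f"~{replicated_tb:.0f} TB with 3x replication")  # ~383 TB
```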
Observability Checklist
Metrics to Capture
- websocket_connections_active (gauge) - per server and total
- messages_delivered_total (counter) - by type (direct, group, system)
- message_delivery_latency_seconds (histogram) - end-to-end, P50/P95/P99
- presence_updates_total (counter) - online/offline transitions
- offline_sync_duration_seconds (histogram) - time to sync missed messages
- push_notification_sent_total (counter) - by channel (APNS, FCM, SMS)
Logs to Emit
{
"timestamp": "2026-03-22T10:30:00Z",
"event": "message_delivered",
"message_id": "msg_abc123",
"conversation_id": "conv_xyz",
"recipient_id": "user_456",
"delivery_latency_ms": 45,
"channel": "websocket"
}
Alerts to Configure
| Alert | Threshold | Severity |
|---|---|---|
| Connection count > 80% capacity | 80% max | Warning |
| Message latency P99 > 200ms | 200ms | Warning |
| Presence update lag > 10s | 10s | Critical |
| Offline sync duration > 30s | 30s | Warning |
| Push notification queue > 10K | 10000 | Warning |
Security Checklist
- WebSocket connections over WSS (TLS)
- JWT token validation on connection
- Message content encryption (E2EE for sensitive chats)
- Rate limiting per user (prevent spam)
- Input sanitization (XSS prevention)
- Push notification content limited (no PII in notifications)
- Audit logging for moderation actions
- Device management (revoke stolen device sessions)
Common Pitfalls / Anti-Patterns
Pitfall 1: Storing Messages in WebSocket Server Memory
Problem: Messages sent while user is temporarily disconnected are lost if stored only in server memory.
Solution: Store all messages in persistent storage (DB) immediately. Deliver from storage when user reconnects. Use message queue as buffer, not source of truth.
Pitfall 2: Not Handling Reconnection Gracefully
Problem: When client reconnects, duplicate messages appear or gaps in conversation.
Solution: Implement idempotent message delivery with client-side deduplication. Use cursor-based pagination for loading message history.
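Client-side deduplication can be as simple as tracking seen message IDs (a minimal sketch; a production client would also bound the seen-set and persist it across restarts):

```python
class DedupingInbox:
    """Drop redelivered messages by client-generated message ID."""
    def __init__(self):
        self.seen = set()
        self.messages = []

    def receive(self, message_id: str, content: str) -> bool:
        """Return True if the message was new and accepted."""
        if message_id in self.seen:
            return False  # Duplicate from a reconnect replay - ignore it
        self.seen.add(message_id)
        self.messages.append(content)
        return True

inbox = DedupingInbox()
print(inbox.receive("m1", "hello"))  # True - first delivery
print(inbox.receive("m1", "hello"))  # False - replayed after reconnect
print(inbox.messages)                # ['hello'] - no duplicate shown
```

Because the ID is generated on the client before the first send, the same ID arrives with every retry, and dedup works even when the server delivers at-least-once.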
Pitfall 3: Presence Flooding
Problem: Every presence change triggers notifications to all connected clients, creating broadcast storms in large channels.
Solution: Batch presence updates (send every 5 seconds, not every change). Use bloom filters for “who is online” queries instead of broadcasting.
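A batched presence publisher can be sketched as follows (hypothetical and synchronous for clarity; a real implementation would call flush from a 5-10 second timer):

```python
from collections import OrderedDict

class PresenceBatcher:
    """Coalesce presence changes; flush publishes only the latest state per user."""
    def __init__(self):
        self.pending = OrderedDict()  # user_id -> latest status

    def record(self, user_id: str, status: str):
        # Later updates overwrite earlier ones within the window
        self.pending[user_id] = status

    def flush(self) -> dict:
        """Called on a timer: one aggregated update instead of one per change."""
        batch, self.pending = dict(self.pending), OrderedDict()
        return batch

batcher = PresenceBatcher()
batcher.record("alice", "online")
batcher.record("bob", "online")
batcher.record("alice", "offline")  # flaps within the window collapse
print(batcher.flush())  # {'alice': 'offline', 'bob': 'online'}
print(batcher.flush())  # {} - nothing pending after flush
```

Note how alice's online/offline flap collapsed into a single final state, so subscribers never see the intermediate churn.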
Pitfall 4: Single Point of Failure for Message Ordering
Problem: Using a single sequencer node for total ordering creates a bottleneck and failure point.
Solution: Use causal ordering (vector clocks) instead of total ordering. Most applications only need causal order anyway.
Interview Q&A
Q: How do you handle message delivery when the recipient is offline?
A: When the recipient is offline, the message router stores the message in durable storage first. Then it triggers a push notification to the user’s device. When the user comes back online, the client syncs missed messages using a cursor-based pagination protocol, fetching messages since the last known sync timestamp.
Q: Why is vector clock ordering better than total ordering for chat?
A: Total ordering requires a single sequencer node that sees all messages. This creates a bottleneck and a single point of failure. Causal ordering only guarantees that causally related messages arrive in order. Unrelated messages can be interleaved freely. Most chat applications only need causal ordering, which is cheaper to implement and more fault-tolerant.
Q: How does presence tracking scale in large channels?
A: Presence changes are batched rather than broadcast immediately. The system sends presence updates every 5-10 seconds, not on every change. For “who is online” queries, bloom filters can approximate membership without broadcasting to all members. For large channels, presence is often not shown at all or shown as “many people online.”
Q: What is the offline sync conflict resolution strategy?
A: Messages are identified by client-generated idempotent message IDs. When syncing, the server returns all messages since the last sync cursor. The client merges local and server messages, deduplicating by message ID. For conflicts (same ID, different content), vector clocks determine which is causally newer, or timestamps break ties.
Scenario Drills
Scenario 1: WebSocket Gateway Server Crash
Situation: The WebSocket server handling 10,000 connections crashes unexpectedly.
Analysis:
- All 10,000 users disconnect simultaneously
- Client auto-reconnect kicks in but overwhelms surviving servers
- Messages sent during disconnect are in message store, not lost
Solution: Run N+1 instances with health checks. Clients implement exponential backoff on reconnect. Sticky sessions at load balancer ensure reconnection routes to same server. Messages are stored in durable DB before delivery, so offline messages survive server death.
Scenario 2: Presence Flood in Large Channel
Situation: 1,000 users join a popular channel simultaneously. Each presence change is broadcast to all members.
Analysis:
- Each join generates 1 presence update broadcast to 999 users
- 1,000 joins = 999,000 presence messages
- Network flooding and server CPU spike
Solution: Batch presence updates into 5-10 second windows. Instead of broadcasting each change, publish aggregated state. Use bloom filters for approximate “is user online” queries. For very large channels, hide individual presence.
Scenario 3: Offline User Comes Back Online After Days
Situation: A user returns after being offline for a week with thousands of missed messages.
Analysis:
- Message history is large; fetching all at once causes timeout
- Push notification storm if all messages trigger notifications
- User sees huge badge count
Solution: Cursor-based pagination syncs in batches. Messages are fetched incrementally with cursors. Push notifications are coalesced into a single “you have messages” notification rather than one per message. The app shows “loading” state while catching up.
Failure Flow Diagrams
Message Delivery Flow
graph TD
A[Sender Sends Message] --> B[WebSocket Gateway]
B --> C[Message Router]
C --> D[Store in Database]
D --> E{Recipient Online?}
E -->|Yes| F[Find Connection]
F --> G[Deliver via WS Gateway]
E -->|No| H[Queue Push Notification]
G --> I[Send ACK to Sender]
H --> J[APNS/FCM]
J --> K[Device Receives]
K --> L[User Opens App]
L --> M[Sync Missed Messages]
WebSocket Connection Lifecycle
graph TD
A[Client Connects] --> B[Authenticate Token]
B --> C{Valid?}
C -->|No| D[Close Connection 4001]
C -->|Yes| E[Register Connection]
E --> F[Subscribe to Conversations]
F --> G[Connection Established]
G --> H[Heartbeat Loop]
H --> I{Heartbeat OK?}
I -->|No| J[Mark Offline]
I -->|Yes| K{Message Received?}
K -->|Yes| L[Process Message]
K -->|No| H
L --> H
J --> M[Cleanup Connection]
Presence Update Batching
graph LR
A[User Online Event] --> B[Presence Service]
B --> C{Batch Window Open?}
C -->|No| D[Start 5s Timer]
C -->|Yes| E[Add to Batch Buffer]
D --> E
E --> F{Window Elapsed?}
F -->|No| E
F -->|Yes| G[Publish Aggregated Update]
G --> H[Subscribers Receive]
Quick Recap
- WebSocket gateways scale horizontally using broker fanout or sticky sessions.
- Store messages persistently first, then deliver; don’t use memory as source of truth.
- Use vector clocks for causal ordering; only require total order when truly necessary.
- Conflict resolution on offline sync uses timestamps, vector clocks, or application logic.
- Presence should be batched and cached, not broadcast on every change.
Copy/Paste Checklist
- [ ] WebSocket gateway with Redis pub/sub fanout
- [ ] Messages stored in DB before delivery
- [ ] Client auto-reconnect with idempotent delivery
- [ ] Cursor-based pagination for message history
- [ ] Vector clocks for causal ordering
- [ ] Conflict resolution on offline sync
- [ ] Presence batched updates (5-10 second windows)
- [ ] Push notification fallback for offline users
- [ ] Rate limiting per connection
- [ ] E2EE encryption for sensitive conversations
For more on real-time systems, see our Message Queue Types guide. For message queues that power chat delivery, see the RabbitMQ and Apache Kafka guides.