System Design: Real-Time Chat System Architecture
Deep dive into chat system design. Learn about WebSocket management, message delivery, presence systems, scalability patterns, and E2E encryption.
Chat systems like WhatsApp, Slack, and Discord handle millions of concurrent users sending messages in real time. The challenges differ fundamentally from those of request-response web applications: persistent connections, low-latency delivery, and delivery to users who are offline.
This case study walks through designing a scalable chat system.
Requirements Analysis
Functional Requirements
Users should be able to:
- Send and receive messages in real-time
- Create and join chat rooms or direct messages
- See who is online (presence)
- Receive push notifications when offline
- Send text, images, and files
- Search through message history
Non-Functional Requirements
The system needs:
- Message delivery under 100ms for online users
- Support 100,000+ concurrent connections across the gateway fleet (roughly 10,000 to 50,000 per server)
- Handle 10 million daily active users
- Store message history indefinitely
- Maintain 99.99% availability
Architecture Overview
graph TB
A[Mobile App] -->|WSS| B[WebSocket Gateway]
A -->|HTTPS| C[REST API Gateway]
B --> D[Message Router]
D --> E[Message Store]
D --> F[Presence Service]
D --> G[Notification Service]
C --> E
C --> H[User Service]
E --> I[(Database)]
G --> J[(Push Service)]
G --> K[(Queue)]
WebSocket Connections
WebSockets provide persistent bidirectional connections, essential for real-time chat.
Connection Flow
sequenceDiagram
participant C as Client
participant W as WebSocket Gateway
participant R as Message Router
participant M as Message Store
C->>W: CONNECT {auth_token}
W->>W: Validate token
W->>W: Extract user_id
W->>R: Register connection {user_id, connection_id}
R-->>W: Connection registered
W-->>C: Connection established
C->>W: SEND {recipient_id, content}
W->>R: Route message {from, to, content}
R->>M: Store message
M-->>R: Message stored
R->>R: Find recipient connection
R->>W: Deliver to recipient
W-->>C: MESSAGE {from, content}
Note over C,W: Bidirectional messages flow through router
WebSocket Server Implementation
import asyncio
import json
import websockets
from dataclasses import dataclass
from typing import Dict, Set
from uuid import uuid4

@dataclass
class Connection:
    user_id: str
    connection_id: str
    websocket: websockets.WebSocketServerProtocol
    subscribed_rooms: Set[str]

class WebSocketGateway:
    # authenticate() and cleanup() are referenced below but not shown in this excerpt
    def __init__(self, message_router: "MessageRouter"):
        self.router = message_router
        self.connections: Dict[str, Connection] = {}      # connection_id -> Connection
        self.user_connections: Dict[str, Set[str]] = {}   # user_id -> connection_ids

    async def handle_connection(self, websocket: websockets.WebSocketServerProtocol, path: str):
        # Authenticate
        token = websocket.request_headers.get("Authorization", "").replace("Bearer ", "")
        user_id = await self.authenticate(token)
        if not user_id:
            await websocket.close(4001, "Authentication failed")
            return

        connection_id = str(uuid4())
        connection = Connection(
            user_id=user_id,
            connection_id=connection_id,
            websocket=websocket,
            subscribed_rooms=set()
        )

        # Register connection
        self.connections[connection_id] = connection
        if user_id not in self.user_connections:
            self.user_connections[user_id] = set()
        self.user_connections[user_id].add(connection_id)

        # Register with router
        await self.router.register_connection(user_id, connection_id)

        try:
            # Handle incoming messages
            async for message in websocket:
                await self.handle_message(connection, message)
        except websockets.ConnectionClosed:
            pass
        finally:
            await self.cleanup(connection)

    async def handle_message(self, connection: Connection, raw_message: str):
        message = json.loads(raw_message)
        msg_type = message.get("type")

        if msg_type == "message":
            await self.router.route_message(
                from_user=connection.user_id,
                to_user=message["recipient_id"],
                content=message["content"],
                connection_id=connection.connection_id
            )
        elif msg_type == "subscribe":
            room_id = message["room_id"]
            connection.subscribed_rooms.add(room_id)
            await self.router.subscribe_to_room(connection.user_id, room_id)
        elif msg_type == "typing":
            await self.router.send_typing_indicator(
                from_user=connection.user_id,
                to_user=message["recipient_id"]
            )
Message Routing
The message router is the heart of the chat system, handling delivery to online and offline users.
Message Router Service
from typing import Dict, Optional, Set

class MessageRouter:
    # get_gateway_for_connection() and send_ack() are referenced below but not shown
    def __init__(
        self,
        message_store: MessageStore,
        presence_service: PresenceService,
        notification_service: NotificationService
    ):
        self.message_store = message_store
        self.presence_service = presence_service
        self.notification_service = notification_service
        self.online_connections: Dict[str, Set[str]] = {}  # user_id -> connection_ids

    async def register_connection(self, user_id: str, connection_id: str):
        if user_id not in self.online_connections:
            self.online_connections[user_id] = set()
        self.online_connections[user_id].add(connection_id)
        await self.presence_service.set_online(user_id)

    async def route_message(
        self,
        from_user: str,
        to_user: str,
        content: str,
        connection_id: Optional[str] = None
    ):
        # Store message first (durability)
        message = await self.message_store.store(
            from_user=from_user,
            to_user=to_user,
            content=content
        )

        # Deliver to online recipients
        if to_user in self.online_connections:
            for conn_id in self.online_connections[to_user]:
                await self.deliver_message(conn_id, message)
            # No push notification needed
        else:
            # User offline - send push notification
            await self.notification_service.send_push_notification(
                user_id=to_user,
                message=message
            )

        # Acknowledge to sender if online
        if connection_id:
            await self.send_ack(connection_id, message)
        return message

    async def deliver_message(self, connection_id: str, message: Message):
        gateway = self.get_gateway_for_connection(connection_id)
        await gateway.send_to_connection(connection_id, {
            "type": "message",
            "message_id": message.id,
            "from": message.sender_id,  # matches the sender_id column in the messages table
            "content": message.content,
            "timestamp": message.created_at.isoformat()
        })
Message Storage
Messages must be stored durably and remain queryable.
Database Schema
-- Messages table
CREATE TABLE messages (
id BIGSERIAL PRIMARY KEY,
conversation_id TEXT NOT NULL,
sender_id BIGINT NOT NULL,
content TEXT NOT NULL,
content_type TEXT DEFAULT 'text', -- text, image, file
metadata JSONB DEFAULT '{}',
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
delivered_at TIMESTAMP WITH TIME ZONE,
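read_at TIMESTAMP WITH TIME ZONE
);
-- PostgreSQL has no inline INDEX clause in CREATE TABLE; create indexes separately
CREATE INDEX idx_messages_conversation ON messages (conversation_id, created_at DESC);
CREATE INDEX idx_messages_sender ON messages (sender_id, created_at DESC);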
-- Conversations table
CREATE TABLE conversations (
id TEXT PRIMARY KEY,
type TEXT NOT NULL, -- direct, group
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
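updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- PostgreSQL has no inline INDEX clause in CREATE TABLE; create the index separately
CREATE INDEX idx_conversations_updated ON conversations (updated_at DESC);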
-- Conversation members
CREATE TABLE conversation_members (
conversation_id TEXT NOT NULL REFERENCES conversations(id),
user_id BIGINT NOT NULL,
joined_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
last_read_at TIMESTAMP WITH TIME ZONE,
unread_count BIGINT DEFAULT 0,
PRIMARY KEY (conversation_id, user_id)
);
-- Index for finding user's conversations
CREATE INDEX idx_members_user ON conversation_members(user_id);
Message Store Service
import json
from datetime import datetime
from typing import List, Optional

class MessageStore:
    # _ensure_conversation_exists() is referenced below but not shown
    def __init__(self, db: Database, cache: RedisCache):
        self.db = db
        self.cache = cache

    def generate_conversation_id(self, user1: str, user2: str) -> str:
        # Consistent ordering for DMs
        sorted_users = sorted([user1, user2])
        return f"dm:{sorted_users[0]}:{sorted_users[1]}"

    async def store(
        self,
        from_user: str,
        to_user: str,
        content: str,
        conversation_id: Optional[str] = None
    ) -> Message:
        if not conversation_id:
            conversation_id = self.generate_conversation_id(from_user, to_user)

        # Ensure conversation exists
        await self._ensure_conversation_exists(conversation_id)

        # Insert message
        row = await self.db.fetch_one("""
            INSERT INTO messages (conversation_id, sender_id, content)
            VALUES ($1, $2, $3)
            RETURNING *
        """, conversation_id, from_user, content)
        message = Message(**row)

        # Update conversation timestamp
        await self.db.execute("""
            UPDATE conversations
            SET updated_at = NOW()
            WHERE id = $1
        """, conversation_id)

        # Invalidate cache
        await self.cache.delete(f"conversation:{conversation_id}:messages")
        return message

    async def get_messages(
        self,
        conversation_id: str,
        limit: int = 50,
        before: Optional[datetime] = None
    ) -> List[Message]:
        # Check cache first (only the newest page is cached)
        cache_key = f"conversation:{conversation_id}:messages"
        if not before:
            cached = await self.cache.get(cache_key)
            if cached:
                return [Message(**m) for m in json.loads(cached)]

        # Query database
        query = """
            SELECT * FROM messages
            WHERE conversation_id = $1
        """
        params = [conversation_id]
        if before:
            query += " AND created_at < $2"
            params.append(before)
        # Append the limit first so its placeholder number matches its position
        params.append(limit)
        query += f" ORDER BY created_at DESC LIMIT ${len(params)}"

        rows = await self.db.fetch(query, *params)
        messages = [Message(**row) for row in rows]

        # Cache if first page
        if not before:
            await self.cache.setex(
                cache_key,
                300,
                # default=str serializes datetime fields such as created_at
                json.dumps([m.dict() for m in messages], default=str)
            )
        return messages
Presence System
Presence tells users who is online and what their activity status is.
Presence Service
import json
from datetime import datetime, timezone
from typing import Dict, List

class PresenceService:
    def __init__(self, redis: Redis):
        self.redis = redis
        self.PRESENCE_TTL = 60  # Seconds; twice the 30-second heartbeat interval

    async def set_online(self, user_id: str):
        key = f"presence:{user_id}"
        await self.redis.setex(key, self.PRESENCE_TTL, json.dumps({
            "status": "online",
            "last_seen": datetime.now(timezone.utc).isoformat()
        }))
        # Publish presence change
        await self.redis.publish("presence_updates", json.dumps({
            "user_id": user_id,
            "status": "online"
        }))

    async def set_offline(self, user_id: str):
        key = f"presence:{user_id}"
        await self.redis.setex(key, self.PRESENCE_TTL, json.dumps({
            "status": "offline",
            "last_seen": datetime.now(timezone.utc).isoformat()
        }))
        await self.redis.publish("presence_updates", json.dumps({
            "user_id": user_id,
            "status": "offline"
        }))

    async def get_presence(self, user_id: str) -> Dict:
        key = f"presence:{user_id}"
        data = await self.redis.get(key)
        if data:
            return json.loads(data)
        # Key missing or expired: no heartbeat within the TTL
        return {"status": "offline"}

    async def get_bulk_presence(self, user_ids: List[str]) -> Dict[str, Dict]:
        keys = [f"presence:{uid}" for uid in user_ids]
        values = await self.redis.mget(keys)
        result = {}
        for uid, value in zip(user_ids, values):
            if value:
                result[uid] = json.loads(value)
            else:
                result[uid] = {"status": "offline"}
        return result
Heartbeat Mechanism
class PresenceHeartbeat:
    def __init__(self, presence_service: PresenceService):
        self.presence_service = presence_service
        self.heartbeat_interval = 30  # Seconds; must stay below PRESENCE_TTL

    async def start_heartbeat(self, websocket, user_id: str):
        while True:
            try:
                await asyncio.sleep(self.heartbeat_interval)
                await websocket.send(json.dumps({"type": "heartbeat_ack"}))
                await self.presence_service.refresh_heartbeat(user_id)
            except websockets.ConnectionClosed:
                await self.presence_service.set_offline(user_id)
                break

# This method belongs on PresenceService: it uses self.redis and
# self.PRESENCE_TTL, which PresenceHeartbeat does not have.
async def refresh_heartbeat(self, user_id: str):
    key = f"presence:{user_id}"
    data = {
        "status": "online",
        "last_seen": datetime.utcnow().isoformat()
    }
    # Re-setting the key extends its TTL without publishing a presence change
    await self.redis.setex(key, self.PRESENCE_TTL, json.dumps(data))
Push Notifications
When users are offline, push notifications deliver new messages.
Notification Service
import asyncio

class NotificationService:
    def __init__(
        self,
        push_service: PushService,
        user_service: UserService
    ):
        self.push_service = push_service
        self.user_service = user_service

    async def send_push_notification(
        self,
        user_id: str,
        message: Message
    ):
        # Get user's push tokens
        tokens = await self.user_service.get_push_tokens(user_id)
        if not tokens:
            return

        # Get sender info (sender_id matches the messages table column)
        sender = await self.user_service.get_user(message.sender_id)

        # Send to all user's devices concurrently
        await asyncio.gather(*[
            self.push_service.send(
                token=token,
                notification={
                    "title": f"Message from {sender.display_name}",
                    "body": message.content[:100],
                    "data": {
                        "type": "message",
                        "conversation_id": message.conversation_id,
                        "message_id": message.id
                    }
                }
            )
            for token in tokens
        ])
API Design
REST Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/conversations | List user’s conversations |
| GET | /api/v1/conversations/:id/messages | Get messages |
| POST | /api/v1/conversations | Create DM or group |
| POST | /api/v1/conversations/:id/messages | Send message (offline) |
| GET | /api/v1/presence/:user_id | Get user presence |
| PUT | /api/v1/presence | Update own presence |
| POST | /api/v1/devices | Register push token |
WebSocket Messages
| Type | Direction | Payload |
|---|---|---|
| message | bidirectional | {recipient_id, content} (client to server) or {from, content} (server to client) |
| ack | server to client | {message_id} |
| typing | bidirectional | {recipient_id} (client to server) or {from} (server to client) |
| presence | server to client | {user_id, status} |
| heartbeat | bidirectional | {} |
| subscribe | client to server | {room_id} |
Scalability Patterns
Connection Distribution
graph TB
L[Load Balancer]
W1[WS Server 1<br/>10K connections]
W2[WS Server 2<br/>10K connections]
W3[WS Server 3<br/>10K connections]
W4[WS Server N<br/>10K connections]
L -->|Route| W1
L -->|Route| W2
L -->|Route| W3
L -->|Route| W4
subgraph Message_Bus["Kafka/RabbitMQ"]
K[Message Bus]
end
W1 --> K
W2 --> K
W3 --> K
W4 --> K
Horizontal Scaling
import hashlib
from bisect import bisect_left
from typing import List

class ConsistentHashRouter:
    def __init__(self, nodes: List[str]):
        self.ring = {}
        self.sorted_keys = []
        self._build_ring(nodes)

    def _build_ring(self, nodes: List[str]):
        for node in nodes:
            for i in range(100):  # Virtual nodes
                key = self._hash(f"{node}:{i}")
                self.ring[key] = node
        self.sorted_keys = sorted(self.ring.keys())

    def get_node(self, key: str) -> str:
        hash_key = self._hash(key)
        # First ring position at or after the hash; the modulo wraps past the end
        idx = bisect_left(self.sorted_keys, hash_key) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

    def _hash(self, key: str) -> int:
        # MD5 is used only to spread keys evenly, not for security
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Usage: Route message to the server holding recipient's connection
# (consistent_hash is a ConsistentHashRouter instance; kafka is a producer client)
async def route_to_user(user_id: str, message: Message):
    server = consistent_hash.get_node(user_id)
    await kafka.send("messageDelivery", {
        "server": server,
        "user_id": user_id,
        "message": message.dict()
    })
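The reason to prefer a hash ring over `hash(user_id) % server_count` is that adding a gateway should move only a fraction of users. The sketch below rebuilds a compact ring using the same idea as `ConsistentHashRouter` and measures how many users change servers when a fourth node joins. The server names, the user ID format, and the helper names are assumptions made for illustration.

```python
import hashlib
from bisect import bisect_left

def ring_hash(key: str) -> int:
    # Same hashing approach as ConsistentHashRouter._hash
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=100):
    # Map every virtual node position to its physical node
    ring = {ring_hash(f"{n}:{i}"): n for n in nodes for i in range(vnodes)}
    return ring, sorted(ring)

def lookup(ring, keys, user_id):
    # First ring position at or after the user's hash, wrapping at the end
    idx = bisect_left(keys, ring_hash(user_id)) % len(keys)
    return ring[keys[idx]]

def moved_fraction(old_nodes, new_nodes, user_count=10_000):
    old_ring, old_keys = build_ring(old_nodes)
    new_ring, new_keys = build_ring(new_nodes)
    moved = sum(
        lookup(old_ring, old_keys, f"user-{i}") != lookup(new_ring, new_keys, f"user-{i}")
        for i in range(user_count)
    )
    return moved / user_count

servers = ["ws-1", "ws-2", "ws-3"]  # hypothetical gateway names
fraction = moved_fraction(servers, servers + ["ws-4"])
print(f"{fraction:.1%} of users moved after adding ws-4")
```

With four equally weighted nodes, roughly a quarter of users should move, and every moved user lands on the new node. Users on the other three gateways keep their existing connections.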
End-to-End Encryption
For privacy, implement end-to-end encryption (E2EE).
Signal Protocol Overview
graph TB
A[Alice Key Pair] -->|Identity Key| PAB[Public Address Book]
B[Bob Key Pair] -->|Identity Key| PAB
PAB -->|Bob's Identity Key| AS[Alice's Device]
PAB -->|Alice's Identity Key| BS[Bob's Device]
AS -->|Pre-Key Bundle| BS
BS -->|Pre-Key Bundle| AS
AS <-->|X3DH Key Exchange| BS
BS <-->|Double Ratchet| AS
Key Exchange Implementation
from typing import Tuple

# Curve25519, HKDF, PrekeyBundle, and self.sign() stand in for a vetted crypto library

class KeyExchange:
    def generate_identity_key_pair(self) -> Tuple[bytes, bytes]:
        private_key = Curve25519.generate_private()
        public_key = Curve25519.public_from_private(private_key)
        return private_key, public_key

    def generate_prekey_bundle(self, identity_private: bytes, identity_public: bytes) -> PrekeyBundle:
        # Reuse the long-term identity key; only the prekeys are generated fresh.
        # The prekey private halves must be stored locally to complete later handshakes.
        signed_prekey_private, signed_prekey_public = self.generate_identity_key_pair()
        one_time_prekey_private, one_time_prekey_public = self.generate_identity_key_pair()

        # Sign the signed prekey with the identity key
        signature = self.sign(identity_private, signed_prekey_public)
        return PrekeyBundle(
            identity_public=identity_public,
            signed_prekey_public=signed_prekey_public,
            signed_prekey_signature=signature,
            one_time_prekey_public=one_time_prekey_public
        )

    async def x3dh_key_agreement(
        self,
        my_identity_private: bytes,
        their_bundle: PrekeyBundle
    ) -> Tuple[bytes, bytes]:
        # The initiator should verify their_bundle.signed_prekey_signature first.
        # X3DH mixes a fresh ephemeral key with the long-term identity key.
        ephemeral_private, ephemeral_public = self.generate_identity_key_pair()

        dh1 = Curve25519.diffie_hellman(my_identity_private, their_bundle.signed_prekey_public)
        dh2 = Curve25519.diffie_hellman(ephemeral_private, their_bundle.identity_public)
        dh3 = Curve25519.diffie_hellman(ephemeral_private, their_bundle.signed_prekey_public)
        dh4 = Curve25519.diffie_hellman(ephemeral_private, their_bundle.one_time_prekey_public)

        # Combine DH outputs
        shared_secret = dh1 + dh2 + dh3 + dh4
        # The peer needs ephemeral_public to derive the same secret
        return HKDF(shared_secret, b"X3DH"), ephemeral_public
Message Encryption
class MessageEncryptor:
    def __init__(self, root_key: bytes):
        self.root_key = root_key

    def encrypt_message(
        self,
        message: str,
        chain_key: bytes,
        ratchet_key: bytes  # not used in this simplified step
    ) -> EncryptedMessage:
        # Generate message key from chain key
        message_key = HKDF(chain_key, b"MessageKey")

        # Encrypt with AES-GCM
        ciphertext = AESGCM.encrypt(message_key, message)

        # Advance chain key
        next_chain_key = HKDF(chain_key, b"ChainKey")

        # Only the ciphertext goes on the wire; message_key and next_chain_key
        # are local ratchet state and must never be transmitted
        return EncryptedMessage(
            ciphertext=ciphertext,
            message_key=message_key,
            next_chain_key=next_chain_key
        )
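The chain-key step above can be tried with the standard library alone. This is a minimal sketch of a symmetric-key ratchet that derives both keys with HMAC-SHA256 over distinct one-byte constants, a common choice for this kind of KDF chain. It swaps `HKDF` for `hmac`, and the function name `kdf_chain_step` and the all-zero demo key are assumptions for illustration, not part of the design above.

```python
import hashlib
import hmac
from typing import Tuple

def kdf_chain_step(chain_key: bytes) -> Tuple[bytes, bytes]:
    """Derive (message_key, next_chain_key) from the current chain key."""
    # Distinct constants keep the two outputs independent of each other
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key

chain_key = bytes(32)  # all-zero key for demonstration only
message_keys = []
for _ in range(3):
    # Each message gets its own key, then the chain moves forward
    message_key, chain_key = kdf_chain_step(chain_key)
    message_keys.append(message_key)

print([key.hex()[:8] for key in message_keys])
```

Because each step is one-way, a leaked message key does not reveal earlier chain keys, which is the property the ratchet is there to provide.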
Offline Sync and Conflict Resolution
When users come back online after being offline, their device must sync missed messages without creating duplicates or conflicts.
Conflict Resolution Strategies
from typing import List

class ConflictResolver:
    """Resolve message sync conflicts (simplified: last writer wins by timestamp)"""

    def __init__(self):
        # Reserved for a vector-clock comparison; unused by the timestamp rule below
        self.vector_clocks = {}  # {message_id: vector_clock}

    def resolve_messages(
        self,
        local_messages: List[Message],
        server_messages: List[Message]
    ) -> List[Message]:
        """Merge local and server messages, removing duplicates"""
        all_messages = {}
        for msg in local_messages + server_messages:
            msg_id = msg.id
            if msg_id not in all_messages:
                all_messages[msg_id] = msg
            else:
                # Conflict: same ID seen twice, keep one version
                winner = self._resolve_conflict(all_messages[msg_id], msg)
                all_messages[msg_id] = winner
        return sorted(all_messages.values(), key=lambda m: m.timestamp)

    def _resolve_conflict(self, msg1: Message, msg2: Message) -> Message:
        """Later timestamp wins; ties keep the first version seen"""
        if msg2.timestamp > msg1.timestamp:
            return msg2
        return msg1
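The resolver above falls back to timestamps, which depend on device clocks. A vector-clock comparison can instead tell whether one version causally follows another or whether the two are concurrent. The sketch below shows that comparison; the `compare` function and the string results are illustrative names, and clocks are assumed to be plain dicts mapping a device ID to a counter.

```python
from typing import Dict

VectorClock = Dict[str, int]

def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for a relative to b."""
    nodes = set(a) | set(b)
    # A missing entry counts as zero
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b, so b wins
    if b_le_a:
        return "after"       # b happened before a, so a wins
    return "concurrent"      # neither saw the other; needs a tie-break rule

print(compare({"phone": 2, "laptop": 1}, {"phone": 1, "laptop": 1}))
print(compare({"phone": 1}, {"laptop": 1}))
```

Only the "concurrent" case needs a fallback such as timestamps or an application-specific merge.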
Message Sync Protocol
sequenceDiagram
participant D as Device (Offline)
participant S as Sync Server
D->>D: User comes online
D->>S: GET /sync?last_sync=1709123456
S-->>D: {messages: [...], cursor: "abc123"}
loop Until Caught Up
D->>S: GET /sync?last_sync=cursor
S-->>D: {messages: [...], cursor: "def456"}
Note over D: Insert messages, update cursor
end
D->>D: Display "You're all caught up"
Offline Queue Management
import json
from typing import List

class OfflineMessageQueue:
    """Queue messages sent while offline, sync when online"""

    def __init__(self, db: Database):
        self.db = db

    async def queue_message(self, user_id: int, message: Message):
        """Queue message sent offline"""
        await self.db.execute("""
            INSERT INTO offline_outbox (user_id, message_json, created_at)
            VALUES ($1, $2, NOW())
        """, user_id, json.dumps(message.dict()))

    async def sync_pending(self, user_id: int) -> List[Message]:
        """Get and delete pending messages"""
        # Delete and read in one statement so a message queued between a
        # separate SELECT and DELETE cannot be removed without being returned
        rows = await self.db.fetch("""
            DELETE FROM offline_outbox
            WHERE user_id = $1
            RETURNING message_json, created_at
        """, user_id)
        # DELETE ... RETURNING does not guarantee row order, so sort here
        rows = sorted(rows, key=lambda row: row['created_at'])
        return [Message(**json.loads(row['message_json'])) for row in rows]
Message Ordering Guarantees
Total Order vs Causal Order
| Guarantee | Description | Implementation |
|---|---|---|
| Total Order | All messages in exact global order | Single sequencer; Lamport clocks |
| Causal Order | Causally related messages in order; unrelated can be interleaved | Vector clocks |
| FIFO per sender | Messages from same sender arrive in send order | Sender sequence numbers |
class MessageOrdering:
    """Implement causal ordering with vector clocks"""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.vector_clock = {node_id: 0}

    def increment_clock(self):
        self.vector_clock[self.node_id] += 1
        return self.vector_clock.copy()

    def send_message(self, content: str) -> Message:
        clock = self.increment_clock()
        return Message(
            content=content,
            vector_clock=clock,
            sender=self.node_id
        )

    def receive_message(self, message: Message) -> bool:
        """Return True if message can be delivered now; otherwise buffer and retry"""
        # Check before merging: merging first would make the check always pass
        if not self._check_causal_delivery(message):
            return False
        # Delivered: fold the sender's clock into ours.
        # Entries count sent messages only, so receiving does not bump our own entry.
        for node, val in message.vector_clock.items():
            self.vector_clock[node] = max(
                self.vector_clock.get(node, 0),
                val
            )
        return True

    def _check_causal_delivery(self, message: Message) -> bool:
        """Verify we have all causally preceding messages"""
        for node, val in message.vector_clock.items():
            seen = self.vector_clock.get(node, 0)
            if node == message.sender:
                if val != seen + 1:
                    return False  # Not the sender's next message
            elif val > seen:
                return False  # Missing a message the sender had already seen
        return True
Trade-off Analysis
| Factor | Sticky Sessions | Broker Fanout | Hybrid Approach |
|---|---|---|---|
| Complexity | Low | Medium | Medium-High |
| Horizontal scale | Poor (session affinity) | Excellent | Excellent |
| Message delivery | Direct | Via pub/sub | Via pub/sub |
| State management | Local to server | Shared (Redis) | Shared |
| Failover | Session lost | Seamless | Seamless |
| Geographic distribution | Poor | Excellent | Excellent |
When to Use / When Not to Use
| Scenario | Recommendation |
|---|---|
| Real-time messaging apps | Use WebSockets with message queue |
| Team collaboration tools | Add presence and typing indicators |
| Customer support chat | Add conversation persistence and search |
| High-volume broadcast (1:N) | Use pub/sub message broker fanout |
When TO Use WebSocket Scaling with Sticky Sessions
- Session affinity important: When connection state is local to server and expensive to rebuild
- Moderate scale: When you can manage server affinity at load balancer level
- Simpler operations: When you prefer simpler debugging and tracing
When NOT to Use Sticky Sessions
- Very high scale: When you need to scale to thousands of servers
- Geographic distribution: When users connect to nearest region, not a specific server
- Transparent failover: When connection should survive server restarts
Production Failure Scenarios
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| WebSocket gateway crash | Users disconnect, must reconnect | Run N+1 instances; health checks; client auto-reconnect |
| Message queue backlog | Messages delayed, offline users fall further behind | Priority queues; max backlog limits; batch catchup |
| Presence cache failure | Users appear offline incorrectly | Redis cluster with replication; fallback to DB query |
| Push notification service down | Offline users don’t get notified | Queue notifications; retry with backoff; SMS fallback |
| Database replica lag | Message history incomplete | Read from primary for recent messages; accept lag for history |
| Device conflict on sync | Duplicate messages or lost messages | Vector clocks; idempotent message IDs; conflict resolution |
Real-world Failure Scenarios
Scenario 1: Slack Message Reordering Incident
What happened: A network partition between Slack’s primary message store and its read replica caused messages to be delivered out of order for several user sessions, leading to confused conversations and duplicate message deliveries.
Root cause: The read replica fell behind during the partition and was promoted as the new primary without a full resync. Messages written to the old primary during the failover window were not fully replicated before the role swap, causing gaps and reordering in the message timeline.
Impact: Users in affected workspaces saw messages appearing in wrong chronological order. Some users received the same message twice. The incident lasted approximately 25 minutes before the engineering team detected and resolved the inconsistency.
Lesson learned: Implement strict consistency guarantees for message ordering using vector clocks or monotonic sequence IDs. Always perform a full data integrity check before completing a primary failover. Build monitoring that alerts on replication lag exceeding acceptable thresholds.
Scenario 2: WhatsApp Broadcast Storm
What happened: A bug in WhatsApp’s group message fanout logic triggered a feedback loop that generated a broadcast storm, overwhelming message routing infrastructure and causing service-wide delays.
Root cause: When a group message delivery failed for a subset of recipients, the retry logic inadvertently re-triggered the original broadcast event. This created a feedback loop that multiplied message volume exponentially until the queue infrastructure was saturated.
Impact: All WhatsApp messages experienced delivery delays of up to several hours. The broadcast storm consumed significant infrastructure capacity, slowing down both group and individual message delivery globally. The issue was identified and throttled within 2 hours.
Lesson learned: Implement idempotent message delivery with deduplication at the consumer level. Add rate limiting and circuit breakers to retry logic to prevent feedback loops. Build back-pressure mechanisms that alert and slow down producers before infrastructure is overwhelmed.
Scenario 3: Discord WebSocket Connection Storm
What happened: Discord experienced a connection storm after a software update caused WebSocket clients to reconnect in rapid succession, exhausting server connection limits and causing widespread disconnections.
Root cause: The update changed the WebSocket heartbeat interval to a much shorter value. When deployed, a large percentage of connected clients simultaneously detected a “missed” heartbeat (because the interval change was interpreted as a timeout) and attempted to reconnect at the same time. The resulting connection storm saturated available file descriptors on the gateway servers.
Impact: Millions of users were disconnected simultaneously and experienced significant difficulty reconnecting. Voice and video calls were dropped. The incident persisted for approximately 45 minutes while the team implemented emergency rate limiting and rolled back the heartbeat interval change.
Lesson learned: When changing connection management parameters, roll them out incrementally with connection pool monitoring. Implement client-side jitter to prevent thundering herd on reconnect. Set hard limits on connection establishment rates at the gateway level to absorb connection storms gracefully.
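Client-side jitter, as recommended in the lesson above, is usually combined with exponential backoff. The sketch below uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name, base delay, and cap are assumptions chosen for illustration.

```python
import random

def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)

# Seeded generator so the demo is repeatable; real clients would use the default
rng = random.Random(42)
delays = [reconnect_delay(attempt, rng=rng) for attempt in range(8)]
for attempt, delay in enumerate(delays):
    print(f"attempt {attempt}: wait {delay:.2f}s")
```

Because every client draws its own random delay, clients that were disconnected at the same moment spread their reconnects across the window instead of arriving together.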
Capacity Estimation
Connection Capacity
Per WebSocket server (4 vCPU, 8GB RAM):
- Max connections: 10,000-50,000
- Messages/second: 1,000-5,000
For 100,000 concurrent users:
- Minimum servers: 100,000 / 10,000 = 10 servers
- With redundancy (2x): 20 servers
- Peak message rate: 100,000 users × 1 msg/user/10sec = 10,000 msg/s
Storage Estimation
Messages per day: 10M DAU × 10 messages/day = 100M messages
Message size (avg): 500 bytes + 200 bytes metadata = 700 bytes
Daily storage: 100M × 700 = 70 GB
5-year storage: 70 GB × 365 × 5 ≈ 128 TB
With 3x replication: ~380 TB
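The storage arithmetic above can be reproduced in a few lines, which makes it easy to rerun with different assumptions. The inputs are the figures stated in this section; decimal units (1 GB = 10^9 bytes) and the variable names are assumptions of this sketch.

```python
DAU = 10_000_000
MESSAGES_PER_USER_PER_DAY = 10
BYTES_PER_MESSAGE = 500 + 200          # content plus metadata
REPLICATION_FACTOR = 3
YEARS = 5

messages_per_day = DAU * MESSAGES_PER_USER_PER_DAY
daily_gb = messages_per_day * BYTES_PER_MESSAGE / 1e9
five_year_tb = daily_gb * 365 * YEARS / 1e3
replicated_tb = five_year_tb * REPLICATION_FACTOR
avg_msgs_per_sec = messages_per_day / 86_400   # average write rate, not peak

print(f"daily: {daily_gb:.0f} GB")
print(f"{YEARS}-year: {five_year_tb:.2f} TB, replicated: {replicated_tb:.2f} TB")
print(f"average write rate: {avg_msgs_per_sec:.0f} msg/s")
```

The average write rate comes out near 1,160 messages per second, well below the 10,000 msg/s peak estimate, so storage growth rather than write throughput is the larger long-term cost under these assumptions.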
Common Pitfalls / Anti-Patterns
Pitfall 1: Storing Messages in WebSocket Server Memory
Problem: Messages sent while a user is temporarily disconnected are lost if they are held only in server memory.
Solution: Store all messages in persistent storage (DB) immediately. Deliver from storage when user reconnects. Use message queue as buffer, not source of truth.
Pitfall 2: Not Handling Reconnection Gracefully
Problem: When a client reconnects, duplicate messages appear or gaps open up in the conversation.
Solution: Implement idempotent message delivery with client-side deduplication. Use cursor-based pagination for loading message history.
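Client-side deduplication can be as small as a bounded set of recently seen message IDs. The sketch below assumes each message is a dict carrying a unique `message_id`; the class name, the capacity, and the helper `apply_incoming` are illustrative choices rather than part of any particular client.

```python
from collections import OrderedDict

class SeenMessageIds:
    """Bounded set of recently seen message IDs (least recently seen evicted first)."""

    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def first_time(self, message_id: str) -> bool:
        if message_id in self._seen:
            self._seen.move_to_end(message_id)  # keep recently replayed IDs around
            return False
        self._seen[message_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)      # evict the oldest entry
        return True

def apply_incoming(seen: SeenMessageIds, timeline: list, incoming: list) -> list:
    # Append only messages this client has not already shown
    for msg in incoming:
        if seen.first_time(msg["message_id"]):
            timeline.append(msg)
    return timeline

seen = SeenMessageIds(capacity=100)
timeline = []
apply_incoming(seen, timeline, [{"message_id": "m1"}, {"message_id": "m2"}])
# After a reconnect the server replays m2 alongside the new m3
apply_incoming(seen, timeline, [{"message_id": "m2"}, {"message_id": "m3"}])
print([m["message_id"] for m in timeline])
```

The server can then safely resend anything it is unsure was delivered, since a replay costs only a lookup on the client.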
Pitfall 3: Presence Flooding
Problem: Every presence change triggers notifications to all connected clients, creating broadcast storms in large channels.
Solution: Batch presence updates (send every 5 seconds, not every change). Use bloom filters for “who is online” queries instead of broadcasting.
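Batching means collecting changes for a window and publishing one aggregated update in which only each user's latest status survives. The sketch below passes the clock in as an argument so the behavior is easy to follow and test; the class name, method names, and the `publish` callback are assumptions for illustration.

```python
from typing import Callable, Dict, Optional

class PresenceBatcher:
    """Coalesce presence changes and publish one aggregated update per window."""

    def __init__(self, publish: Callable[[Dict[str, str]], None], window_seconds: float = 5.0):
        self.publish = publish
        self.window_seconds = window_seconds
        self._pending: Dict[str, str] = {}
        self._window_started_at: Optional[float] = None

    def record(self, user_id: str, status: str, now: float) -> None:
        # The first change after a flush opens a new window
        if self._window_started_at is None:
            self._window_started_at = now
        # A later status for the same user overwrites the earlier one
        self._pending[user_id] = status

    def maybe_flush(self, now: float) -> bool:
        if self._window_started_at is None:
            return False
        if now - self._window_started_at < self.window_seconds:
            return False
        self.publish(dict(self._pending))
        self._pending.clear()
        self._window_started_at = None
        return True

published = []
batcher = PresenceBatcher(published.append)
batcher.record("u1", "online", now=0.0)
batcher.record("u1", "offline", now=1.0)
batcher.record("u1", "online", now=2.0)
batcher.record("u2", "online", now=3.0)
early = batcher.maybe_flush(now=4.9)    # window still open
flushed = batcher.maybe_flush(now=5.0)  # window elapsed, one update goes out
print(published)
```

Four raw changes collapse into a single published update, and a user who flaps between states within the window produces no extra traffic.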
Pitfall 4: Single Point of Failure for Message Ordering
Problem: Using a single sequencer node for total ordering creates a bottleneck and failure point.
Solution: Use causal ordering (vector clocks) instead of total ordering. Most applications only need causal order anyway.
Quick Recap
- WebSocket gateways scale horizontally using broker fanout or sticky sessions.
- Store messages persistently first, then deliver; don’t use memory as source of truth.
- Use vector clocks for causal ordering; only require total order when truly necessary.
- Conflict resolution on offline sync uses timestamps, vector clocks, or application logic.
- Presence should be batched and cached, not broadcast on every change.
Copy/Paste Checklist
- [ ] WebSocket gateway with Redis pub/sub fanout
- [ ] Messages stored in DB before delivery
- [ ] Client auto-reconnect with idempotent delivery
- [ ] Cursor-based pagination for message history
- [ ] Vector clocks for causal ordering
- [ ] Conflict resolution on offline sync
- [ ] Presence batched updates (5-10 second windows)
- [ ] Push notification fallback for offline users
- [ ] Rate limiting per connection
- [ ] E2EE for sensitive conversations
Observability Checklist
Metrics to Capture
- websocket_connections_active (gauge) - Per server, total
- messages_delivered_total (counter) - By type (direct, group, system)
- message_delivery_latency_seconds (histogram) - End-to-end, P50/P95/P99
- presence_updates_total (counter) - Online/offline transitions
- offline_sync_duration_seconds (histogram) - Time to sync missed messages
- push_notification_sent_total (counter) - By channel (APNS, FCM, SMS)
Logs to Emit
{
"timestamp": "2026-03-22T10:30:00Z",
"event": "message_delivered",
"message_id": "msg_abc123",
"conversation_id": "conv_xyz",
"recipient_id": "user_456",
"delivery_latency_ms": 45,
"channel": "websocket"
}
Alerts to Configure
| Alert | Threshold | Severity |
|---|---|---|
| Connection count > 80% capacity | 80% max | Warning |
| Message latency P99 > 200ms | 200ms | Warning |
| Presence update lag > 10s | 10s | Critical |
| Offline sync duration > 30s | 30s | Warning |
| Push notification queue > 10K | 10000 | Warning |
Security Checklist
- WebSocket connections over WSS (TLS)
- JWT token validation on connection
- Message content encryption (E2EE for sensitive chats)
- Rate limiting per user (prevent spam)
- Input sanitization (XSS prevention)
- Push notification content limited (no PII in notifications)
- Audit logging for moderation actions
- Device management (revoke stolen device sessions)
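Per-user rate limiting from the checklist above is commonly done with a token bucket, which allows short bursts while capping the sustained rate. The sketch below takes the current time as an argument so it stays deterministic; the class name and the example rate and burst values are assumptions, and a production gateway would typically keep one bucket per user or per connection.

```python
class TokenBucket:
    """Allow bursts up to `burst` messages, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int, now: float = 0.0):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)   # start full so a fresh user can burst
        self.updated_at = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at burst
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate_per_sec=1.0, burst=3)
burst_results = [bucket.allow(now=0.0) for _ in range(4)]
after_one_second = bucket.allow(now=1.0)
print(burst_results, after_one_second)
```

The first three messages pass, the fourth is rejected, and one more is allowed after a second of refill.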
Scenario Drills
Scenario 1: WebSocket Gateway Server Crash
Situation: The WebSocket server handling 10,000 connections crashes unexpectedly.
Analysis:
- All 10,000 users disconnect simultaneously
- Client auto-reconnect kicks in but overwhelms surviving servers
- Messages sent during disconnect are in message store, not lost
Solution: Run N+1 instances with health checks so the load balancer routes reconnecting clients to surviving servers. Clients implement exponential backoff with jitter on reconnect so they do not all arrive at once. Messages are stored in a durable DB before delivery, so messages sent during the outage survive the server's death.
Scenario 2: Presence Flood in Large Channel
Situation: 1,000 users join a popular channel simultaneously. Each presence change is broadcast to all members.
Analysis:
- Each join generates 1 presence update broadcast to 999 users
- 1,000 joins = 999,000 presence messages
- Network flooding and server CPU spike
Solution: Batch presence updates into 5-10 second windows. Instead of broadcasting each change, publish aggregated state. Use bloom filters for approximate “is user online” queries. For very large channels, hide individual presence.
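A Bloom filter answers "might this user be online?" from a compact bit array, with no false negatives and a tunable false positive rate. The sketch below is a minimal version built on `hashlib`; the class name, the bit array size, and the hash count are assumptions chosen for the demo, not tuned values.

```python
import hashlib

class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions by salting the item with the hash index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

online = BloomFilter()
for i in range(500):
    online.add(f"user-{i}")

false_positives = sum(online.might_contain(f"absent-{i}") for i in range(1000))
print(online.might_contain("user-42"), false_positives)
```

A plain Bloom filter cannot remove entries, so a user going offline is not reflected until the filter is rebuilt. Rebuilding it once per batching window is one way to keep it reasonably fresh.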
Scenario 3: Offline User Comes Back Online After Days
Situation: A user returns after being offline for a week with thousands of missed messages.
Analysis:
- Message history is large; fetching all at once causes timeout
- Push notification storm if all messages trigger notifications
- User sees huge badge count
Solution: Cursor-based pagination syncs in batches. Messages are fetched incrementally with cursors. Push notifications are coalesced into a single “you have messages” notification rather than one per message. The app shows “loading” state while catching up.
Failure Flow Diagrams
Message Delivery Flow
graph TD
A[Sender Sends Message] --> B[WebSocket Gateway]
B --> C[Message Router]
C --> D[Store in Database]
D --> E{Recipient Online?}
E -->|Yes| F[Find Connection]
F --> G[Deliver via WS Gateway]
E -->|No| H[Queue Push Notification]
G --> I[Send ACK to Sender]
H --> J[APNS/FCM]
J --> K[Device Receives]
K --> L[User Opens App]
L --> M[Sync Missed Messages]
WebSocket Connection Lifecycle
graph TD
A[Client Connects] --> B[Authenticate Token]
B --> C{Valid?}
C -->|No| D[Close Connection 4001]
C -->|Yes| E[Register Connection]
E --> F[Subscribe to Conversations]
F --> G[Connection Established]
G --> H[Heartbeat Loop]
H --> I{Heartbeat OK?}
I -->|No| J[Mark Offline]
I -->|Yes| K{Message Received?}
K -->|Yes| L[Process Message]
K -->|No| H
L --> H
J --> M[Cleanup Connection]
Presence Update Batching
graph LR
A[User Online Event] --> B[Presence Service]
B --> C{Batch Window Open?}
C -->|No| D[Start 5s Timer]
C -->|Yes| E[Add to Batch Buffer]
D --> E
E --> F{Window Elapsed?}
F -->|No| E
F -->|Yes| G[Publish Aggregated Update]
G --> H[Subscribers Receive]
Interview Questions
When the recipient is offline, the message router stores the message in durable storage first. Then it triggers a push notification to the user's device. When the user comes back online, the client syncs missed messages using a cursor-based pagination protocol, fetching messages since the last known sync timestamp.
Total ordering is usually implemented with a single sequencer node (or a consensus protocol) that sees all messages. This creates a bottleneck and, with a lone sequencer, a single point of failure. Causal ordering only guarantees that causally related messages arrive in order. Unrelated messages can be interleaved freely. Most chat applications only need causal ordering, which is cheaper to implement and more fault-tolerant.
Presence changes are batched rather than broadcast immediately. The system sends presence updates every 5-10 seconds, not on every change. For "who is online" queries, bloom filters can approximate membership without broadcasting to all members. For large channels, presence is often not shown at all or shown as "many people online."
Messages are identified by client-generated message IDs that double as idempotency keys. When syncing, the server returns all messages since the last sync cursor. The client merges local and server messages, deduplicating by message ID. For conflicts (same ID, different content), vector clocks determine which is causally newer, or timestamps break ties.
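The merge step can be sketched as follows. This version uses the simpler timestamp tie-break rather than vector clocks, and the `Message` fields are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Message:
    message_id: str    # client-generated ID, assumed globally unique
    created_at: float  # timestamp in seconds (assumed field)
    content: str


def merge_messages(local: List[Message], server: List[Message]) -> List[Message]:
    """Merge two message lists, deduplicating by message_id.

    On conflict (same ID, different content) the copy with the later
    timestamp wins; on an exact tie the server copy wins.
    """
    merged: Dict[str, Message] = {m.message_id: m for m in local}
    for msg in server:
        existing = merged.get(msg.message_id)
        if existing is None or msg.created_at >= existing.created_at:
            merged[msg.message_id] = msg
    # Sort by timestamp, using the ID as a stable tie-breaker.
    return sorted(merged.values(), key=lambda m: (m.created_at, m.message_id))
```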
Offset-based pagination (LIMIT 100 OFFSET 500) falls apart at scale. As offset grows, the database still scans all those rows before returning your 100. With millions of messages per conversation, page 1000 is slow.
Cursor-based pagination solves this. Instead of "give me rows 500-600", you say "give me messages after message ID XYZ". The database uses the index on (conversation_id, created_at DESC) to seek straight to the cursor position, so fetching the next page costs about the same no matter how deep you are in the conversation.
The cursor can be a message ID or a timestamp — both work with the composite index. The tradeoff is that you can't jump to arbitrary pages; you can only move forward or backward sequentially.
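A small runnable sketch of cursor (keyset) pagination using Python's built-in `sqlite3`. The table layout, the index name, and the use of a monotonically increasing integer `id` as the cursor are assumptions for illustration; a production schema would more likely index on (conversation_id, created_at) as described above.

```python
import sqlite3


def fetch_page(conn, conversation_id, after_id=0, limit=100):
    """Return (rows, next_cursor) for messages with id greater than after_id."""
    rows = conn.execute(
        "SELECT id, body FROM messages "
        "WHERE conversation_id = ? AND id > ? "
        "ORDER BY id LIMIT ?",
        (conversation_id, after_id, limit),
    ).fetchall()
    # The cursor is the last id seen; it stays unchanged when no rows remain.
    next_cursor = rows[-1][0] if rows else after_id
    return rows, next_cursor


# Demo data: 250 messages in one conversation (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, conversation_id TEXT, body TEXT)"
)
conn.execute("CREATE INDEX idx_conv_id ON messages (conversation_id, id)")
conn.executemany(
    "INSERT INTO messages (conversation_id, body) VALUES (?, ?)",
    [("c1", f"msg {i}") for i in range(1, 251)],
)
```

A client would call `fetch_page` repeatedly, passing the returned cursor back in, until it receives an empty page.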
WebSockets give you a persistent TCP connection. Once established, sending a message is essentially free — no connection overhead. The tradeoff is infrastructure complexity: WebSockets are stateful, so scaling horizontally means either sticky sessions (same user routes to same server) or a pub/sub layer so all servers can receive messages for all users.
Long-polling works over standard HTTP. The client makes a request, the server holds it open until there's data (or a timeout), then responds and the client immediately opens a new request. It's simpler to scale because every request is stateless, but it generates more HTTP overhead and has higher latency.
For most chat applications, WebSockets are the right call. The infrastructure complexity is manageable with modern tooling. Long-polling is a reasonable fallback for enterprise environments with strict proxies that block WebSocket upgrades.
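Rate limiting needs to be per-user-per-conversation as well as per-user. A single global limit of 100 messages per minute still lets Alice send one message each to 100 different people, while a per-conversation limit on its own does nothing to cap how many conversations she floods. Combining a per-recipient limit (for example, 1 message per second) with a limit on how many distinct recipients she can message per minute covers both cases.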
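Redis counters work well here. The key format is something like ratelimit:user123:conversation456:minute. Increment on each message, set an expiry equal to the window, and check the count before delivering. Strictly, a per-minute key gives a fixed window; a true sliding window is commonly built from a sorted set of message timestamps. If over the limit, return 429 with a Retry-After header.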
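To make the windowing logic concrete, here is an in-memory sliding window limiter. It stands in for the Redis-backed version and is only a sketch: the class name and the caller-supplied `now` timestamp are assumptions, and a real deployment would keep this state in a shared store.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        # key -> timestamps of accepted messages, oldest first
        self._events = defaultdict(deque)

    def allow(self, user_id: str, conversation_id: str, now: float) -> bool:
        key = f"ratelimit:{user_id}:{conversation_id}"
        events = self._events[key]
        # Drop timestamps that have slid out of the window.
        while events and events[0] <= now - self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False  # the caller would answer 429 with Retry-After
        events.append(now)
        return True
```

Rejected messages are not recorded, so a client that keeps retrying while throttled does not extend its own penalty.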
Presence updates deserve separate limits. A misbehaving client could send presence heartbeats every second; you want to rate limit those lower than actual messages.
Don't build this yourself. Elasticsearch or Meilisearch handle tokenization, stemming, ranking, and fuzzy matching — all things you'd otherwise have to implement from scratch. The indexing happens asynchronously: when a message is stored, it's also sent to a queue, and a worker indexes it separately from the hot path.
Scope matters. Global search across all conversations is expensive and usually not what users want. Searching within a single conversation is faster and more useful. Some systems also support searching across a subset of conversations (e.g., "search my starred conversations").
Index freshness vs. search volume is a tradeoff. If you index synchronously, search results are always fresh but you've added latency to every message send. Async indexing (what most systems do) means search results might be a few seconds behind, which is acceptable for most use cases.
Typing indicators are pure ephemeral state — they're not important enough to persist. If the server crashes and you lose who's typing, nobody notices.
Client-side: debounce aggressively. If a user types 10 characters in a second, you shouldn't send 10 typing events. Wait until 300-500ms have passed since the last keystroke before sending one event.
Server-side: batch and broadcast. Instead of notifying all conversation members every time someone starts typing, batch notifications into 2-3 second windows. One user typing rapidly generates one notification, not twenty. Redis sorted sets with timestamps work well for tracking who is typing in a conversation.
State expires automatically. If no typing event comes in for 5-10 seconds, the user has either stopped or their client crashed. The indicator disappears without requiring explicit stop events.
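The expiry behaviour can be sketched with a plain dictionary standing in for the Redis sorted set. `TypingTracker` and its methods are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```python
class TypingTracker:
    def __init__(self, ttl_seconds: float = 5.0):
        self.ttl = ttl_seconds
        # (conversation_id, user_id) -> time of the last typing event
        self._last_event = {}

    def on_typing(self, conversation_id: str, user_id: str, now: float) -> None:
        self._last_event[(conversation_id, user_id)] = now

    def who_is_typing(self, conversation_id: str, now: float) -> list:
        # Purge entries older than the TTL; no explicit "stop" event is needed.
        expired = [k for k, t in self._last_event.items() if now - t > self.ttl]
        for key in expired:
            del self._last_event[key]
        return sorted(u for (c, u) in self._last_event if c == conversation_id)
```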
The core challenge: the same user might have the app open on their laptop and phone simultaneously. Messages arrive on one device; they need to appear on the other. Read state needs to sync too — if you read a message on your phone, you shouldn't see it as unread on your laptop.
Each device maintains a cursor — the ID of the last message it has synced. When a device comes online (or periodically while connected), it asks the server for any messages since that cursor. The server returns them; the client merges them with local messages. Vector clocks or simple timestamps handle ordering.
Read receipts are a separate problem. When you mark a message as read on one device, that read state needs to fan out to all your other devices and to the sender. This can be handled via the same message queue infrastructure as regular messages.
On mobile, be careful about sync frequency. Aggressive sync drains battery. Quiesce sync when the app is backgrounded and catch up when it returns to foreground.
The Signal Protocol uses the Double Ratchet algorithm to achieve two properties that normal encryption schemes lack.
Forward secrecy means that if your long-term key is compromised, past messages stay safe. This works because messages are encrypted with ephemeral session keys derived from the root key. The root key itself is derived through Diffie-Hellman ratcheting, and once a message key is used, it's deleted. Compromising today's key doesn't reveal yesterday's messages.
Break-in recovery (also called post-compromise security) means that after a device is compromised, the protocol can return to a secure state without tearing down the session and running a new handshake. Each time the conversation changes direction, the Diffie-Hellman ratchet mixes fresh key material into the root key. An attacker who saw the old state but not the new private keys loses access again after a few messages, provided they no longer control the device.
The combination makes it computationally expensive for an attacker to decrypt messages — they'd need to compromise each key in the ratchet chain sequentially.
Idempotent message IDs are the foundation. The client generates a UUID before sending and reuses it on every retry, and the server checks whether that ID already exists before processing. ID generation moves to the client, which is fine because the client is the source of truth for its own sends; the deduplication check itself still runs on the server.
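At the server level, you need a fast lookup. Redis with the message ID as a key works for active conversations. For cross-conversation deduplication across millions of messages, you'd use a bloom filter or a distributed set. A bloom filter saves memory at the cost of occasional false positives, so a positive hit should be confirmed against the message store before a message is dropped as a duplicate.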
The tricky part is cleanup. Message IDs accumulate forever if you keep them. A common approach is to partition by time window — keep deduplication state for 7 days, then expire. Most real duplicates happen immediately due to network retries anyway.
For groups with many members, the deduplication check should happen before fanout, not during. One server checks the ID, then it routes to all recipients without re-checking.
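A sketch of the server-side check with time-based cleanup, using an in-memory dictionary in place of Redis. The names and the explicit `now` parameter are illustrative assumptions; with Redis the expiry would normally be handled by a TTL on each key.

```python
class DedupCache:
    def __init__(self, retention_seconds: float = 7 * 24 * 3600):
        self.retention = retention_seconds
        self._seen = {}  # message_id -> time the ID was first seen

    def first_time(self, message_id: str, now: float) -> bool:
        """Return True if the ID is new (process it), False if it is a duplicate."""
        seen_at = self._seen.get(message_id)
        if seen_at is not None and now - seen_at < self.retention:
            return False
        self._seen[message_id] = now
        return True

    def purge(self, now: float) -> int:
        """Drop IDs older than the retention window; return how many were removed."""
        stale = [m for m, t in self._seen.items() if now - t >= self.retention]
        for message_id in stale:
            del self._seen[message_id]
        return len(stale)
```

For group messages, `first_time` would be called once before fanout, and the result reused for every recipient.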
Optimistic updates show the message immediately in the UI as if it were sent, before the server confirms it. This feels fast — zero perceived latency. The risk is that if the send fails, you have to roll back the UI, which is jarring for users and requires careful state management.
Confirmed delivery waits for a server ACK before showing the message as "sent." This is safer but slower. If the roundtrip is 50ms, users notice. On mobile with variable latency, it gets worse.
Most apps use optimistic updates with local queuing. The message goes into a "pending" state, shows in the UI, and if the server ACK doesn't come back within a timeout (say, 10 seconds), it's marked as failed with a retry button. The user sees it immediately but knows it's not confirmed.
The middle ground some apps use: optimistic updates for text, confirmed delivery for important actions like payments or message edits. Text is low-stakes; editing a message or sending a large file isn't.
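The pending, sent, and failed states of an optimistic send can be sketched as a small client-side outbox. This is an illustrative model only: `Outbox`, `Status`, and the explicit timestamps are assumptions, and a real client would also drive the UI and retries from these transitions.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    SENT = "sent"
    FAILED = "failed"


@dataclass
class OutgoingMessage:
    message_id: str
    content: str
    queued_at: float
    status: Status = Status.PENDING


class Outbox:
    def __init__(self, ack_timeout: float = 10.0):
        self.ack_timeout = ack_timeout
        self.messages = {}

    def send(self, message_id: str, content: str, now: float) -> OutgoingMessage:
        # The message is shown in the UI immediately, in the pending state.
        msg = OutgoingMessage(message_id, content, queued_at=now)
        self.messages[message_id] = msg
        return msg

    def on_ack(self, message_id: str) -> None:
        # A server ACK confirms the message, even if it arrives late.
        msg = self.messages.get(message_id)
        if msg is not None:
            msg.status = Status.SENT

    def expire(self, now: float) -> None:
        # Anything still pending after the timeout is marked failed (retry button).
        for msg in self.messages.values():
            if msg.status is Status.PENDING and now - msg.queued_at >= self.ack_timeout:
                msg.status = Status.FAILED
```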
Read state is one of the harder problems in multi-device chat. When you read a message on your phone, that needs to sync to your laptop, your tablet, and any web sessions — and it needs to happen fast enough that you don't see it as unread again.
The core mechanism: read receipts flow through the same message infrastructure as regular messages. When you mark something read, that event goes to the server, which fans it out to all your other devices via their active WebSocket connections. All devices converge on the same read cursor.
The tricky part is ordering. If you read message 100 on your phone while message 99 is still syncing to your laptop, the laptop might think you've only read up to message 98. The fix is to only advance the read cursor to the highest consecutive message number, not just the latest.
For conversations with thousands of messages, you don't want to sync read state on every single read. Batch it — send a read receipt every 5-10 seconds or when leaving the conversation, not on every scroll position change.
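The "highest consecutive message" rule for the read cursor can be expressed in a few lines. The sketch assumes each conversation has dense, increasing sequence numbers, which the text implies but does not state; the function name is hypothetical.

```python
from typing import Set


def advance_read_cursor(current: int, received: Set[int], read_up_to: int) -> int:
    """Advance the read cursor through consecutive received messages only.

    current:    highest sequence number already marked read on this device
    received:   sequence numbers this device has actually synced
    read_up_to: sequence number another device reports as read
    """
    cursor = current
    while cursor + 1 <= read_up_to and (cursor + 1) in received:
        cursor += 1
    return cursor  # never moves backward


# The laptop has 98 and 100 but is still waiting for 99:
laptop_cursor = advance_read_cursor(97, {98, 100}, read_up_to=100)
```

Once message 99 arrives, calling the function again with the updated `received` set moves the cursor on to 100.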
Real-time spam detection operates at two levels: rate limiting and content classification.
Rate limiting is straightforward. Redis sliding window counters track messages per user per conversation. If someone sends more than 1 message per second to the same person, or more than 10 messages per minute to different people, flag them. This catches bots immediately without analyzing content.
Content classification is harder. You don't want to scan every message synchronously — that adds latency. Instead, asynchronously analyze messages after delivery. A queue collects messages, a worker runs detection (simple: keyword matching, link density, account age checks; complex: ML models), and if flagged, the content is deleted and the user is warned or suspended.
For real-time response to spam waves (coordinated attacks), use reputation scores. New accounts have low reputation. Accounts with history of good behavior have high reputation. Low reputation users get slower delivery — messages go through a moderation queue before reaching recipients. This buys time without blocking legitimate users.
Naive E2EE for groups encrypts every message separately for every recipient, so a single message costs O(n) encryptions and a group in which everyone talks costs O(n²) overall. With 100 members, that's 99 encryptions per message. At scale, this doesn't work.
The practical approach is sender keys. Each member generates one key and shares it with every other member once (during group join), over the existing pairwise encrypted channels. When sending, they encrypt the message once with their sender key, and recipients decrypt with their copy of that key. The per-message cost drops to a single encryption; the O(n) work moves to the one-time key distribution.
When someone joins late, you need to send them all the sender keys for the current group state. When someone leaves, you need to remove their access to future messages. Group key management is its own sub-problem — you need a protocol for updating keys when membership changes.
For very large groups (broadcast-style with thousands), E2EE often isn't practical. Most large group apps either accept that admins can read messages (operational security model) or they restrict E2EE to small groups where it's tractable.
Offset-based pagination says "give me rows 500 through 600." The database still scans the first 500 rows before returning the 100 you asked for. With millions of messages per conversation, page 10,000 is effectively a full table scan.
Cursor-based pagination says "give me messages after message ID XYZ." The database uses the composite index on (conversation_id, created_at DESC, id) and jumps directly to the cursor position. The query is O(log n) regardless of where you are in the conversation.
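The cursor can be a message ID or a timestamp. A message ID is better, provided IDs sort in creation order (a server-assigned sequence number or a time-ordered ID rather than a random UUID), because it handles concurrent inserts cleanly: if a new message arrives while you're paginating, you won't see duplicates or gaps. Timestamps have precision limits; concurrent inserts at the same millisecond are ambiguous unless the ID is used as a tie-breaker.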
The tradeoff is that cursors don't support random access. You can't jump to page 100 directly. You have to walk forward from the beginning. For most use cases this is fine — users rarely need to jump to arbitrary points in chat history.
"Last seen" seems simple — store the timestamp of the user's last activity. It's not simple.
The privacy problem: showing exactly when someone was online is intrusive. "She was last active at 11:43pm" tells you someone stayed up late, which you shouldn't know. Most apps blur this: "online," "last seen 5 minutes ago," "last seen today."
The performance problem: every presence update writes to storage. If you track exact timestamps on every keystroke, you have millions of writes per day for a popular app. Instead, round to the nearest 5-minute bucket or only update on explicit presence changes (connect, disconnect, app background/foreground), not on every message.
For showing "last seen" to other users, don't expose exact timestamps from recent activity. Use relative phrases (active 5m ago, active 2h ago) that decay to larger buckets over time. Nobody needs to know exactly when someone was online in real-time except in specific contexts.
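One way to sketch the decaying buckets is a function from elapsed seconds to a coarse label. The thresholds and wording below are assumptions chosen for illustration, not values taken from any particular app.

```python
def last_seen_label(seconds_ago: float) -> str:
    """Map time since last activity to a deliberately coarse label."""
    if seconds_ago < 60:
        return "online"
    if seconds_ago < 3600:
        # Round down to 5-minute buckets, with a floor of 5 minutes.
        minutes = int(seconds_ago // 300) * 5
        return f"active {max(minutes, 5)}m ago"
    if seconds_ago < 86400:
        return f"active {int(seconds_ago // 3600)}h ago"
    return "active a while ago"
```

The same bucketing can be applied on the write path, so storage is updated only when the label would change.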
Network transitions are a silent killer of WebSocket connections. When a phone switches from WiFi to cellular, the TCP connection doesn't migrate — it dies. The app has to reconnect.
On mobile, a connection can sometimes survive a brief handoff in which the device keeps its IP address, such as roaming between access points on the same WiFi network. A full network switch (WiFi off, cellular on) changes the IP address and always breaks the connection.
The client-side fix: implement exponential backoff reconnection with jitter. When the connection drops, wait a random interval (0-2 seconds), then reconnect; if that attempt fails, double the upper bound of the interval before each further retry, up to a cap. Don't hammer the server immediately after a network change, because the network stack needs a moment to stabilize.
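A minimal sketch of the delay calculation, using the "full jitter" variant in which the wait is drawn uniformly between zero and an exponentially growing ceiling. The `base` and `cap` values are assumptions; the function name is hypothetical.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0, rng=random) -> float:
    """Return a reconnect delay in seconds for the given attempt (0-based).

    The ceiling doubles with each attempt (base * 2**attempt) up to `cap`,
    and the actual delay is a uniform random value in [0, ceiling].
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

The randomness spreads reconnects out, so thousands of clients that lost the same server do not all retry at the same instant.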
At the protocol level, use a session token that's independent of the TCP connection. The client identifies its session, the server knows which messages it has received, and the client can resume from where it left off without a full re-sync. This is especially important for mobile where connections are inherently unstable.
Edit and delete are surprisingly hard because they need to work consistently across all a user's devices and all recipients.
For delete: the cleanest model is soft delete with a tombstone. The message stays in the database but is marked deleted. When a client loads messages, it checks the tombstone and hides the content, showing "This message was deleted" instead. This works across all devices because the tombstone propagates like a regular message.
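The tombstone model can be sketched with a nullable `deleted_at` field. The record layout and function names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredMessage:
    message_id: str
    content: str
    deleted_at: Optional[float] = None  # tombstone marker; None means live


def soft_delete(msg: StoredMessage, now: float) -> None:
    # The row stays in storage; only the tombstone is set.
    msg.deleted_at = now


def render(msg: StoredMessage) -> str:
    # Clients hide the content of tombstoned messages.
    if msg.deleted_at is not None:
        return "This message was deleted"
    return msg.content
```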
For edit: edits should create a new message ID with a reference to the original. The original is marked as edited (but not deleted — you still see it was edited). Clients display the latest version but can show edit history if needed. The server rejects edits on messages older than a time window (say, 15 minutes) to prevent abuse.
The hardest case: what if you've read the original message on device A, the sender edits it on device B, and device A is offline when the edit happens? Device A comes online, syncs, and needs to show the edited content without marking it as unread. This requires tracking which messages each device has acknowledged and only updating read state for genuinely new content.
Further Reading
- Message Queue Types — Kafka and RabbitMQ patterns
- RabbitMQ — Message queue fundamentals
- Apache Kafka — Distributed event streaming
- WebSocket Scaling — Connection distribution patterns
- Distributed Caching — Redis pub/sub patterns
For more on real-time systems, see our Message Queue Types guide. For message queues that power chat delivery, see the RabbitMQ and Apache Kafka guides.
Conclusion
Building a scalable chat system requires:
- WebSocket gateways for persistent connections
- Message routers that handle online and offline delivery
- Durable message storage with efficient queries
- Presence tracking with a heartbeat mechanism
- Push notifications for offline users
- Consistent hashing for horizontal scaling