System Design: Real-Time Chat System Architecture

Deep dive into chat system design. Learn about WebSocket management, message delivery, presence systems, scalability patterns, and E2E encryption.

Reading time: 38 min read | Author: GeekWorkBench


Chat systems like WhatsApp, Slack, and Discord handle millions of concurrent users sending messages in real time. The challenges differ fundamentally from those of request-response web applications: connections are persistent, delivery must be low-latency, and messages must still reach users who are offline.

This case study walks through designing a scalable chat system.

Introduction

Requirements Analysis

Functional Requirements

Users should be able to:

  • Send and receive messages in real-time
  • Create and join chat rooms or direct messages
  • See who is online (presence)
  • Receive push notifications when offline
  • Send text, images, and files
  • Search through message history

Non-Functional Requirements

The system needs:

  • Message delivery under 100ms for online users
  • Support 100,000+ concurrent connections across the gateway fleet
  • Handle 10 million daily active users
  • Store message history indefinitely
  • Maintain 99.99% availability

Architecture Overview

graph TB
    A[Mobile App] -->|WSS| B[WebSocket Gateway]
    A -->|HTTPS| C[REST API Gateway]
    B --> D[Message Router]
    D --> E[Message Store]
    D --> F[Presence Service]
    D --> G[Notification Service]
    C --> E
    C --> H[User Service]
    E --> I[(Database)]
    G --> J[(Push Service)]
    G --> K[(Queue)]

WebSocket Connections

WebSockets provide persistent bidirectional connections, essential for real-time chat.

Connection Flow

sequenceDiagram
    participant C as Client
    participant W as WebSocket Gateway
    participant R as Message Router
    participant M as Message Store

    C->>W: CONNECT {auth_token}
    W->>W: Validate token
    W->>W: Extract user_id
    W->>R: Register connection {user_id, connection_id}
    R-->>W: Connection registered
    W-->>C: Connection established

    C->>W: SEND {recipient_id, content}
    W->>R: Route message {from, to, content}
    R->>M: Store message
    M-->>R: Message stored
    R->>R: Find recipient connection
    R->>W: Deliver to recipient
    W-->>C: MESSAGE {from, content}

    Note over C,W: Bidirectional messages flow through router

WebSocket Server Implementation

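from __future__ import annotations  # lets MessageRouter be referenced before it is defined

import asyncio
import json
import websockets
from dataclasses import dataclass
from typing import Dict, Set
from uuid import uuid4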

@dataclass
class Connection:
    user_id: str
    connection_id: str
    websocket: websockets.WebSocketServerProtocol
    subscribed_rooms: Set[str]

class WebSocketGateway:
    def __init__(self, message_router: MessageRouter):
        self.router = message_router
        self.connections: Dict[str, Connection] = {}
        self.user_connections: Dict[str, Set[str]] = {}

    async def handle_connection(self, websocket: websockets.WebSocketServerProtocol, path: str):
        # Authenticate
        token = websocket.request_headers.get("Authorization", "").replace("Bearer ", "")
        user_id = await self.authenticate(token)

        if not user_id:
            await websocket.close(4001, "Authentication failed")
            return

        connection_id = str(uuid4())
        connection = Connection(
            user_id=user_id,
            connection_id=connection_id,
            websocket=websocket,
            subscribed_rooms=set()
        )

        # Register connection
        self.connections[connection_id] = connection
        if user_id not in self.user_connections:
            self.user_connections[user_id] = set()
        self.user_connections[user_id].add(connection_id)

        # Register with router
        await self.router.register_connection(user_id, connection_id)

        try:
            # Handle incoming messages
            async for message in websocket:
                await self.handle_message(connection, message)
        except websockets.ConnectionClosed:
            pass
        finally:
            await self.cleanup(connection)

    async def handle_message(self, connection: Connection, raw_message: str):
        message = json.loads(raw_message)

        msg_type = message.get("type")

        if msg_type == "message":
            await self.router.route_message(
                from_user=connection.user_id,
                to_user=message["recipient_id"],
                content=message["content"],
                connection_id=connection.connection_id
            )
        elif msg_type == "subscribe":
            room_id = message["room_id"]
            connection.subscribed_rooms.add(room_id)
            await self.router.subscribe_to_room(connection.user_id, room_id)
        elif msg_type == "typing":
            await self.router.send_typing_indicator(
                from_user=connection.user_id,
                to_user=message["recipient_id"]
            )
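
The handler above indexes fields such as message["recipient_id"] directly, so a malformed frame would raise inside the receive loop and drop the connection. A minimal validation sketch follows, assuming the three frame types shown above; `parse_frame` and `REQUIRED_FIELDS` are hypothetical names, not part of the original gateway.

```python
import json
from typing import Optional

# Hypothetical table of required fields per frame type; the field names
# follow the handler above.
REQUIRED_FIELDS = {
    "message": ("recipient_id", "content"),
    "subscribe": ("room_id",),
    "typing": ("recipient_id",),
}


def parse_frame(raw_message: str) -> Optional[dict]:
    """Return the decoded frame, or None if it is malformed."""
    try:
        frame = json.loads(raw_message)
    except json.JSONDecodeError:
        return None
    if not isinstance(frame, dict):
        return None
    msg_type = frame.get("type")
    if not isinstance(msg_type, str) or msg_type not in REQUIRED_FIELDS:
        return None
    # Reject frames that omit any field the handler will index.
    if any(field not in frame for field in REQUIRED_FIELDS[msg_type]):
        return None
    return frame
```

A gateway could call this before dispatching and reply with an error frame when it returns None, rather than letting the exception end the connection.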

Message Routing

The message router is the heart of the chat system, handling delivery to online and offline users.

Message Router Service

class MessageRouter:
    def __init__(
        self,
        message_store: MessageStore,
        presence_service: PresenceService,
        notification_service: NotificationService
    ):
        self.message_store = message_store
        self.presence_service = presence_service
        self.notification_service = notification_service
        self.online_connections: Dict[str, Set[str]] = {}

    async def register_connection(self, user_id: str, connection_id: str):
        if user_id not in self.online_connections:
            self.online_connections[user_id] = set()
        self.online_connections[user_id].add(connection_id)
        await self.presence_service.set_online(user_id)

    async def route_message(
        self,
        from_user: str,
        to_user: str,
        content: str,
        connection_id: str = None
    ):
        # Store message first (durability)
        message = await self.message_store.store(
            from_user=from_user,
            to_user=to_user,
            content=content
        )

        # Deliver to online recipients
        if to_user in self.online_connections:
            for conn_id in self.online_connections[to_user]:
                await self.deliver_message(conn_id, message)
            # No push notification needed
        else:
            # User offline - send push notification
            await self.notification_service.send_push_notification(
                user_id=to_user,
                message=message
            )

        # Acknowledge to sender if online
        if connection_id:
            await self.send_ack(connection_id, message)

        return message

    async def deliver_message(self, connection_id: str, message: Message):
        gateway = self.get_gateway_for_connection(connection_id)
        await gateway.send_to_connection(connection_id, {
            "type": "message",
            "message_id": message.id,
            "from": message.from_user,
            "content": message.content,
            "timestamp": message.created_at.isoformat()
        })

Message Storage

Messages must be stored durably and remain queryable.

Database Schema

-- Messages table
CREATE TABLE messages (
    id BIGSERIAL PRIMARY KEY,
    conversation_id TEXT NOT NULL,
    sender_id BIGINT NOT NULL,
    content TEXT NOT NULL,
    content_type TEXT DEFAULT 'text',  -- text, image, file
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    delivered_at TIMESTAMP WITH TIME ZONE,
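    read_at TIMESTAMP WITH TIME ZONE
);

-- PostgreSQL has no inline INDEX clause in CREATE TABLE; create indexes separately
CREATE INDEX idx_messages_conversation ON messages (conversation_id, created_at DESC);
CREATE INDEX idx_messages_sender ON messages (sender_id, created_at DESC);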

-- Conversations table
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    type TEXT NOT NULL,  -- direct, group
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

CREATE INDEX idx_conversations_updated ON conversations (updated_at DESC);

-- Conversation members
CREATE TABLE conversation_members (
    conversation_id TEXT NOT NULL REFERENCES conversations(id),
    user_id BIGINT NOT NULL,
    joined_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    last_read_at TIMESTAMP WITH TIME ZONE,
    unread_count BIGINT DEFAULT 0,

    PRIMARY KEY (conversation_id, user_id)
);

-- Index for finding user's conversations
CREATE INDEX idx_members_user ON conversation_members(user_id);

Message Store Service

class MessageStore:
    def __init__(self, db: Database, cache: RedisCache):
        self.db = db
        self.cache = cache

    def generate_conversation_id(self, user1: str, user2: str) -> str:
        # Consistent ordering for DMs
        sorted_users = sorted([user1, user2])
        return f"dm:{sorted_users[0]}:{sorted_users[1]}"

    async def store(
        self,
        from_user: str,
        to_user: str,
        content: str,
        conversation_id: str = None
    ) -> Message:
        if not conversation_id:
            conversation_id = self.generate_conversation_id(from_user, to_user)

        # Ensure conversation exists
        await self._ensure_conversation_exists(conversation_id)

        # Insert message
        row = await self.db.fetch_one("""
            INSERT INTO messages (conversation_id, sender_id, content)
            VALUES ($1, $2, $3)
            RETURNING *
        """, conversation_id, from_user, content)

        message = Message(**row)

        # Update conversation timestamp
        await self.db.execute("""
            UPDATE conversations
            SET updated_at = NOW()
            WHERE id = $1
        """, conversation_id)

        # Invalidate cache
        await self.cache.delete(f"conversation:{conversation_id}:messages")

        return message

    async def get_messages(
        self,
        conversation_id: str,
        limit: int = 50,
        before: datetime = None
    ) -> List[Message]:
        # Check cache first
        cache_key = f"conversation:{conversation_id}:messages"
        if not before:
            cached = await self.cache.get(cache_key)
            if cached:
                return [Message(**m) for m in json.loads(cached)]

        # Query database
        query = """
            SELECT * FROM messages
            WHERE conversation_id = $1
        """
        params = [conversation_id]

        if before:
            query += " AND created_at < $2"
            params.append(before)

        # Append the limit first so the placeholder number matches its position
        params.append(limit)
        query += f" ORDER BY created_at DESC LIMIT ${len(params)}"

        rows = await self.db.fetch(query, *params)
        messages = [Message(**row) for row in rows]

        # Cache if first page
        if not before:
            await self.cache.setex(
                cache_key,
                300,
                json.dumps([m.dict() for m in messages])
            )

        return messages
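
Paginating on created_at alone can skip messages that share a timestamp with the last row of a page. A common remedy is a composite cursor of (created_at, id). The sketch below illustrates the idea on an in-memory list; the `Row` tuple and `page_before` are hypothetical stand-ins for the table and query. In PostgreSQL the same predicate can be written as a row comparison, `(created_at, id) < ($2, $3)`.

```python
from typing import List, Optional, Tuple

# Hypothetical minimal message row: (created_at, id)
Row = Tuple[int, int]


def page_before(rows: List[Row], cursor: Optional[Row], limit: int) -> List[Row]:
    """Return up to `limit` rows strictly older than `cursor`, newest first."""
    ordered = sorted(rows, reverse=True)
    if cursor is not None:
        # Tuple comparison orders by created_at, then by id as a tie-breaker.
        ordered = [row for row in ordered if row < cursor]
    return ordered[:limit]


rows = [(100, 1), (100, 2), (100, 3), (90, 4)]
first = page_before(rows, None, 2)        # [(100, 3), (100, 2)]
second = page_before(rows, first[-1], 2)  # [(100, 1), (90, 4)]
```

With a timestamp-only cursor of 100, the second page would have started at (90, 4) and message 1 would never be shown.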

Presence System

Presence tells users who is online and what their activity status is.

Presence Service

class PresenceService:
    def __init__(self, redis: Redis):
        self.redis = redis
        self.PRESENCE_TTL = 60  # Key TTL in seconds (2x the 30s heartbeat interval)

    async def set_online(self, user_id: str):
        key = f"presence:{user_id}"
        await self.redis.setex(key, self.PRESENCE_TTL, json.dumps({
            "status": "online",
            "last_seen": datetime.utcnow().isoformat()
        }))

        # Publish presence change
        await self.redis.publish("presence_updates", json.dumps({
            "user_id": user_id,
            "status": "online"
        }))

    async def set_offline(self, user_id: str):
        key = f"presence:{user_id}"
        await self.redis.setex(key, self.PRESENCE_TTL, json.dumps({
            "status": "offline",
            "last_seen": datetime.utcnow().isoformat()
        }))

        await self.redis.publish("presence_updates", json.dumps({
            "user_id": user_id,
            "status": "offline"
        }))

    async def get_presence(self, user_id: str) -> Dict:
        key = f"presence:{user_id}"
        data = await self.redis.get(key)
        if data:
            return json.loads(data)
        return {"status": "offline"}

    async def get_bulk_presence(self, user_ids: List[str]) -> Dict[str, Dict]:
        keys = [f"presence:{uid}" for uid in user_ids]
        values = await self.redis.mget(keys)

        result = {}
        for uid, value in zip(user_ids, values):
            if value:
                result[uid] = json.loads(value)
            else:
                result[uid] = {"status": "offline"}

        return result

Heartbeat Mechanism

class PresenceHeartbeat:
    def __init__(self, presence_service: PresenceService):
        self.presence_service = presence_service
        self.heartbeat_interval = 30

    async def start_heartbeat(self, websocket, user_id: str):
        while True:
            try:
                await asyncio.sleep(self.heartbeat_interval)
                await websocket.send(json.dumps({"type": "heartbeat_ack"}))
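                await self.refresh_heartbeat(user_id)
            except websockets.ConnectionClosed:
                await self.presence_service.set_offline(user_id)
                break

    async def refresh_heartbeat(self, user_id: str):
        # Re-arm the presence key's TTL without publishing a presence change.
        # The Redis client and TTL live on the presence service, not on this class.
        redis = self.presence_service.redis
        ttl = self.presence_service.PRESENCE_TTL
        key = f"presence:{user_id}"
        data = {
            "status": "online",
            "last_seen": datetime.utcnow().isoformat()
        }
        await redis.setex(key, ttl, json.dumps(data))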
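
The interplay between the 30-second heartbeat and the 60-second TTL can be sketched without Redis. The class below is an illustrative in-memory model, not the service above: `InMemoryPresence` is a hypothetical name, and the clock is passed in explicitly so the behavior is deterministic.

```python
class InMemoryPresence:
    """Toy model of TTL-based presence: a user is online until the TTL lapses."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._expires_at = {}  # user_id -> time at which presence expires

    def heartbeat(self, user_id: str, now: float) -> None:
        # Each heartbeat pushes the expiry forward, like SETEX on the same key.
        self._expires_at[user_id] = now + self.ttl

    def is_online(self, user_id: str, now: float) -> bool:
        return self._expires_at.get(user_id, float("-inf")) > now


presence = InMemoryPresence(ttl_seconds=60)
presence.heartbeat("alice", now=0)
presence.is_online("alice", now=59)  # True: within the TTL
presence.is_online("alice", now=61)  # False: no heartbeat arrived in time
```

Because the TTL is twice the heartbeat interval, one missed heartbeat does not flip a user to offline; two in a row do.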

Push Notifications

When users are offline, push notifications deliver new messages.

Notification Service

class NotificationService:
    def __init__(
        self,
        push_service: PushService,
        user_service: UserService
    ):
        self.push_service = push_service
        self.user_service = user_service

    async def send_push_notification(
        self,
        user_id: str,
        message: Message
    ):
        # Get user's push tokens
        tokens = await self.user_service.get_push_tokens(user_id)

        if not tokens:
            return

        # Get sender info
        sender = await self.user_service.get_user(message.from_user)

        # Send to all user's devices
        await asyncio.gather(*[
            self.push_service.send(
                token=token,
                notification={
                    "title": f"Message from {sender.display_name}",
                    "body": message.content[:100],
                    "data": {
                        "type": "message",
                        "conversation_id": message.conversation_id,
                        "message_id": message.id
                    }
                }
            )
            for token in tokens
        ])
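
One caveat with a bare asyncio.gather: if a single device token is rejected, the first exception propagates to the caller and the failure of one device hides the outcome of the others. A small sketch of a more tolerant fan-out follows, assuming a hypothetical `send` coroutine in place of the push service; `send_to_all` and `fake_send` are illustrative names.

```python
import asyncio
from typing import Awaitable, Callable, List


async def send_to_all(
    tokens: List[str],
    send: Callable[[str], Awaitable[None]],
) -> List[str]:
    """Send to every token and return the tokens whose send failed."""
    results = await asyncio.gather(
        *(send(token) for token in tokens),
        return_exceptions=True,  # collect failures instead of raising the first one
    )
    return [
        token
        for token, result in zip(tokens, results)
        if isinstance(result, Exception)
    ]


async def fake_send(token: str) -> None:
    # Stand-in for a push provider call; rejects tokens marked as expired.
    if token.startswith("expired"):
        raise RuntimeError("token rejected")


failed = asyncio.run(send_to_all(["ios-1", "expired-2", "android-3"], fake_send))
print(failed)  # ['expired-2']
```

The returned list gives the caller a natural place to prune stale tokens or schedule retries.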

API Design

REST Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/v1/conversations | List user’s conversations |
| GET | /api/v1/conversations/:id/messages | Get messages |
| POST | /api/v1/conversations | Create DM or group |
| POST | /api/v1/conversations/:id/messages | Send message (offline) |
| GET | /api/v1/presence/:user_id | Get user presence |
| PUT | /api/v1/presence | Update own presence |
| POST | /api/v1/devices | Register push token |

WebSocket Messages

| Type | Direction | Payload |
| --- | --- | --- |
| message | bidirectional | {to, content} or {from, content} |
| ack | server to client | {message_id} |
| typing | bidirectional | {to} or {from} |
| presence | server to client | {user_id, status} |
| heartbeat | bidirectional | {} |
| subscribe | client to server | {room_id} |

Scalability Patterns

Connection Distribution

graph TB
    L[Load Balancer]
    W1[WS Server 1<br/>10K connections]
    W2[WS Server 2<br/>10K connections]
    W3[WS Server 3<br/>10K connections]
    W4[WS Server N<br/>10K connections]

    L -->|Route| W1
    L -->|Route| W2
    L -->|Route| W3
    L -->|Route| W4

    subgraph Message_Bus["Kafka/RabbitMQ"]
        K[Message Bus]
    end

    W1 --> K
    W2 --> K
    W3 --> K
    W4 --> K

Horizontal Scaling

import hashlib
from bisect import bisect_left
from typing import List

class ConsistentHashRouter:
    def __init__(self, nodes: List[str]):
        self.ring = {}
        self.sorted_keys = []
        self._build_ring(nodes)

    def _build_ring(self, nodes: List[str]):
        for node in nodes:
            for i in range(100):  # Virtual nodes
                key = self._hash(f"{node}:{i}")
                self.ring[key] = node
        self.sorted_keys = sorted(self.ring.keys())

    def get_node(self, key: str) -> str:
        hash_key = self._hash(key)
        idx = bisect_left(self.sorted_keys, hash_key) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Usage: Route message to the server holding recipient's connection.
# This only works if connections are assigned to servers with the same hash ring.
async def route_to_user(user_id: str, message: Message):
    server = consistent_hash.get_node(user_id)
    await kafka.send("messageDelivery", {
        "server": server,
        "user_id": user_id,
        "message": message.dict()
    })
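
The reason to prefer a hash ring over `hash(user_id) % N` is that adding a server moves only a fraction of users. The sketch below rebuilds the same ring as standalone functions and measures what happens when a fourth server joins; the server names and the 1,000 sample users are made up for illustration.

```python
import hashlib
from bisect import bisect_left
from typing import Dict, List


def _hash(key: str) -> int:
    # MD5 is used only to spread keys around the ring, not for security.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


def build_ring(nodes: List[str], vnodes: int = 100) -> Dict[int, str]:
    return {_hash(f"{node}:{i}"): node for node in nodes for i in range(vnodes)}


def assignments(nodes: List[str], keys: List[str]) -> Dict[str, str]:
    ring = build_ring(nodes)
    sorted_keys = sorted(ring)
    result = {}
    for key in keys:
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect_left(sorted_keys, _hash(key)) % len(sorted_keys)
        result[key] = ring[sorted_keys[idx]]
    return result


users = [f"user-{n}" for n in range(1000)]
before = assignments(["ws-1", "ws-2", "ws-3"], users)
after = assignments(["ws-1", "ws-2", "ws-3", "ws-4"], users)
moved = [u for u in users if before[u] != after[u]]
print(f"{len(moved)} of {len(users)} users moved")
```

Every user that moves lands on the new server; nobody is shuffled between the existing three. With modulo hashing, most users would change servers.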

End-to-End Encryption

For privacy, implement end-to-end encryption (E2EE).

Signal Protocol Overview

graph TB
    A[Alice Key Pair] -->|Identity Key| PAB[Public Address Book]
    B[Bob Key Pair] -->|Identity Key| PAB

    PAB -->|Bob's Identity Key| AS[Alice's Device]
    PAB -->|Alice's Identity Key| BS[Bob's Device]

    AS -->|Pre-Key Bundle| BS
    BS -->|Pre-Key Bundle| AS

    AS <-->|X3DH Key Exchange| BS
    BS <-->|Double Ratchet| AS

Key Exchange Implementation

class KeyExchange:
    def generate_identity_key_pair(self) -> Tuple[bytes, bytes]:
        private_key = Curve25519.generate_private()
        public_key = Curve25519.public_from_private(private_key)
        return private_key, public_key

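    def generate_prekey_bundle(self, identity_key: bytes) -> PrekeyBundle:
        # Reuse the caller's long-term identity key; generating a fresh one
        # per bundle would change the user's identity on every upload
        identity_private = identity_key
        identity_public = Curve25519.public_from_private(identity_private)
        # The prekey private halves must be persisted by the caller so that
        # incoming sessions can be completed later
        signed_prekey_private, signed_prekey_public = self.generate_identity_key_pair()
        one_time_prekey_private, one_time_prekey_public = self.generate_identity_key_pair()

        # Sign the signed prekey
        signature = self.sign(identity_private, signed_prekey_public)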

        return PrekeyBundle(
            identity_public=identity_public,
            signed_prekey_public=signed_prekey_public,
            signed_prekey_signature=signature,
            one_time_prekey_public=one_time_prekey_public
        )

    async def x3dh_key_agreement(
        self,
        my_identity_key: bytes,
        my_ephemeral_key: bytes,
        their_bundle: PrekeyBundle
    ) -> bytes:
        # Perform X3DH key agreement (initiator side). The my_* arguments are
        # private keys; the bundle fields are public keys. The caller should
        # verify signed_prekey_signature before trusting the bundle.
        dh1 = Curve25519.diffie_hellman(my_identity_key, their_bundle.signed_prekey_public)
        dh2 = Curve25519.diffie_hellman(my_ephemeral_key, their_bundle.identity_public)
        dh3 = Curve25519.diffie_hellman(my_ephemeral_key, their_bundle.signed_prekey_public)
        dh4 = Curve25519.diffie_hellman(my_ephemeral_key, their_bundle.one_time_prekey_public)

        # Combine DH outputs
        shared_secret = dh1 + dh2 + dh3 + dh4
        return HKDF(shared_secret, b"X3DH")

Message Encryption

class MessageEncryptor:
    def __init__(self, root_key: bytes):
        self.root_key = root_key

    def encrypt_message(
        self,
        message: str,
        chain_key: bytes,
        ratchet_key: bytes
    ) -> EncryptedMessage:
        # Generate message key from chain key
        message_key = HKDF(chain_key, b"MessageKey")

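        # Encrypt with AES-GCM (a real implementation also needs a unique nonce per message)
        ciphertext = AESGCM.encrypt(message_key, message)

        # Advance chain key
        next_chain_key = HKDF(chain_key, b"ChainKey")

        # message_key and next_chain_key are local ratchet state;
        # only the ciphertext may leave the device
        return EncryptedMessage(
            ciphertext=ciphertext,
            message_key=message_key,
            next_chain_key=next_chain_key
        )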
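
The chain-key step above is the symmetric half of the Double Ratchet: each message consumes one key and the chain moves forward so old keys cannot be re-derived from new ones. The sketch below shows that step with only the standard library. It swaps in HMAC-SHA256 with the constants 0x01 and 0x02, which is the construction the Double Ratchet specification suggests, rather than the labeled HKDF helper used above; `ratchet_step` and `derive_message_keys` are illustrative names.

```python
import hashlib
import hmac
from typing import List, Tuple


def ratchet_step(chain_key: bytes) -> Tuple[bytes, bytes]:
    """Derive (message_key, next_chain_key) from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key


def derive_message_keys(chain_key: bytes, count: int) -> List[bytes]:
    """Advance the chain `count` times, collecting one key per message."""
    keys = []
    for _ in range(count):
        message_key, chain_key = ratchet_step(chain_key)
        keys.append(message_key)
    return keys


# Both parties start from the same chain key (all zeros here purely for the
# demo) and therefore derive the same sequence of per-message keys.
sender_keys = derive_message_keys(b"\x00" * 32, 3)
receiver_keys = derive_message_keys(b"\x00" * 32, 3)
```

Since HMAC is one-way, an attacker who learns a later chain key still cannot recover the keys of earlier messages.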

Offline Sync and Conflict Resolution

When users come back online after being offline, their device must sync missed messages without creating duplicates or conflicts.

Conflict Resolution Strategies

class ConflictResolver:
    """Resolve message sync conflicts (last-write-wins by timestamp in this sketch)"""

    def __init__(self):
        self.vector_clocks = {}  # {message_id: vector_clock}

    def resolve_messages(
        self,
        local_messages: List[Message],
        server_messages: List[Message]
    ) -> List[Message]:
        """Merge local and server messages, removing duplicates"""

        all_messages = {}
        for msg in local_messages + server_messages:
            msg_id = msg.id

            if msg_id not in all_messages:
                all_messages[msg_id] = msg
            else:
                # Conflict - use vector clock to resolve
                winner = self._resolve_conflict(all_messages[msg_id], msg)
                all_messages[msg_id] = winner

        return sorted(all_messages.values(), key=lambda m: m.timestamp)

    def _resolve_conflict(self, msg1: Message, msg2: Message) -> Message:
        """Last-write-wins: the later timestamp wins (no vector clock is compared here)"""
        if msg2.timestamp > msg1.timestamp:
            return msg2
        return msg1

Message Sync Protocol

sequenceDiagram
    participant D as Device (Offline)
    participant S as Sync Server

    D->>D: User comes online
    D->>S: GET /sync?last_sync=1709123456
    S-->>D: {messages: [...], cursor: "abc123"}

    loop Until Caught Up
        D->>S: GET /sync?last_sync=cursor
        S-->>D: {messages: [...], cursor: "def456"}
        Note over D: Insert messages, update cursor
    end

    D->>D: Display "You're all caught up"
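
The catch-up loop in the diagram can be sketched as a small client function. The diagram does not say how the server signals the end of the backlog, so this sketch assumes it returns a null cursor once the device is caught up; `sync_all`, `fake_fetch`, and the string-index cursor are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# (messages, next_cursor); next_cursor is None once the client is caught up
Page = Tuple[List[str], Optional[str]]


def sync_all(
    fetch_page: Callable[[Optional[str]], Page],
    start_cursor: Optional[str] = None,
) -> List[str]:
    """Fetch pages until the server reports there is nothing left."""
    messages: List[str] = []
    cursor = start_cursor
    while True:
        page, next_cursor = fetch_page(cursor)
        messages.extend(page)
        if next_cursor is None:
            return messages
        cursor = next_cursor


# In-memory stand-in for the sync endpoint: the cursor is an index into LOG.
LOG = [f"m{i}" for i in range(5)]


def fake_fetch(cursor: Optional[str], page_size: int = 2) -> Page:
    start = int(cursor) if cursor else 0
    page = LOG[start:start + page_size]
    end = start + len(page)
    return page, (str(end) if end < len(LOG) else None)


synced = sync_all(fake_fetch)
```

A real client would persist the cursor after each page so an interrupted sync resumes where it stopped instead of starting over.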

Offline Queue Management

class OfflineMessageQueue:
    """Queue messages sent while offline, sync when online"""

    def __init__(self, db: Database):
        self.db = db

    async def queue_message(self, user_id: int, message: Message):
        """Queue message sent offline"""
        await self.db.execute("""
            INSERT INTO offline_outbox (user_id, message_json, created_at)
            VALUES ($1, $2, NOW())
        """, user_id, json.dumps(message.dict()))

    async def sync_pending(self, user_id: int) -> List[Message]:
        """Get and delete pending messages"""
        # Delete and read in one statement so a message queued between a
        # separate SELECT and DELETE cannot be removed without being returned
        rows = await self.db.fetch("""
            DELETE FROM offline_outbox
            WHERE user_id = $1
            RETURNING message_json, created_at
        """, user_id)

        # RETURNING does not guarantee order, so sort oldest first here
        rows = sorted(rows, key=lambda row: row['created_at'])
        return [Message(**json.loads(row['message_json'])) for row in rows]

Message Ordering Guarantees

Total Order vs Causal Order

| Guarantee | Description | Implementation |
| --- | --- | --- |
| Total Order | All messages in exact global order | Single sequencer; Lamport clocks |
| Causal Order | Causally related messages in order; unrelated can be interleaved | Vector clocks |
| FIFO per sender | Messages from same sender arrive in send order | Sender sequence numbers |

class MessageOrdering:
    """Implement causal ordering with vector clocks"""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.vector_clock = {node_id: 0}

    def increment_clock(self):
        self.vector_clock[self.node_id] += 1
        return self.vector_clock.copy()

    def send_message(self, content: str) -> Message:
        clock = self.increment_clock()
        return Message(
            content=content,
            vector_clock=clock,
            sender=self.node_id
        )

    def receive_message(self, message: Message) -> bool:
        """Return True if message should be delivered (causal ordering)"""
        # Check before merging: merging first would make the check always pass
        if not self._check_causal_delivery(message):
            return False  # Caller should buffer the message and retry later

        # Merge the sender's clock into ours. Our own counter is not bumped
        # on receive, so it keeps counting only the messages we have sent.
        for node, val in message.vector_clock.items():
            self.vector_clock[node] = max(
                self.vector_clock.get(node, 0),
                val
            )
        return True

    def _check_causal_delivery(self, message: Message) -> bool:
        """Verify we have all causally preceding messages"""
        for node, val in message.vector_clock.items():
            seen = self.vector_clock.get(node, 0)
            if node == message.sender:
                if val != seen + 1:
                    return False  # Gap or duplicate from this sender
            elif val > seen:
                return False  # Sender saw a message we have not delivered yet
        return True
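
What makes vector clocks useful for causal ordering is that two clocks can be compared to tell whether one event happened before the other or whether they are concurrent. A minimal comparison sketch follows; `compare` is an illustrative helper and treats a missing entry as zero.

```python
from typing import Dict


def compare(a: Dict[str, int], b: Dict[str, int]) -> str:
    """Classify clock `a` relative to `b`: before, after, equal, or concurrent."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


compare({"alice": 1}, {"alice": 2})                          # "before"
compare({"alice": 2, "bob": 0}, {"alice": 1, "bob": 1})      # "concurrent"
```

Concurrent messages have no causal relationship, so a chat client is free to order them by any stable tie-breaker such as timestamp or sender ID.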

Trade-off Analysis

| Factor | Sticky Sessions | Broker Fanout | Hybrid Approach |
| --- | --- | --- | --- |
| Complexity | Low | Medium | Medium-High |
| Horizontal scale | Poor (session affinity) | Excellent | Excellent |
| Message delivery | Direct | Via pub/sub | Via pub/sub |
| State management | Local to server | Shared (Redis) | Shared |
| Failover | Session lost | Seamless | Seamless |
| Geographic distribution | Poor | Excellent | Excellent |

When to Use / When Not to Use

| Scenario | Recommendation |
| --- | --- |
| Real-time messaging apps | Use WebSockets with message queue |
| Team collaboration tools | Add presence and typing indicators |
| Customer support chat | Add conversation persistence and search |
| High-volume broadcast (1:N) | Use pub/sub message broker fanout |

When TO Use WebSocket Scaling with Sticky Sessions

  • Session affinity important: When connection state is local to server and expensive to rebuild
  • Moderate scale: When you can manage server affinity at load balancer level
  • Simpler operations: When you prefer simpler debugging and tracing

When NOT to Use Sticky Sessions

  • Very high scale: When you need to scale to thousands of servers
  • Geographic distribution: When users connect to nearest region, not a specific server
  • Transparent failover: When connection should survive server restarts

Production Failure Scenarios

| Failure Scenario | Impact | Mitigation |
| --- | --- | --- |
| WebSocket gateway crash | Users disconnect, must reconnect | Run N+1 instances; health checks; client auto-reconnect |
| Message queue backlog | Messages delayed, offline users fall further behind | Priority queues; max backlog limits; batch catchup |
| Presence cache failure | Users appear offline incorrectly | Redis cluster with replication; fallback to DB query |
| Push notification service down | Offline users don’t get notified | Queue notifications; retry with backoff; SMS fallback |
| Database replica lag | Message history incomplete | Read from primary for recent messages; accept lag for history |
| Device conflict on sync | Duplicate messages or lost messages | Vector clocks; idempotent message IDs; conflict resolution |

Real-world Failure Scenarios

Scenario 1: Slack Message Reordering Incident

What happened: A network partition between Slack’s primary message store and its read replica caused messages to be delivered out of order for several user sessions, leading to confused conversations and duplicate message deliveries.

Root cause: The read replica fell behind during the partition and was promoted as the new primary without a full resync. Messages written to the old primary during the failover window were not fully replicated before the role swap, causing gaps and reordering in the message timeline.

Impact: Users in affected workspaces saw messages appearing in wrong chronological order. Some users received the same message twice. The incident lasted approximately 25 minutes before the engineering team detected and resolved the inconsistency.

Lesson learned: Implement strict consistency guarantees for message ordering using vector clocks or monotonic sequence IDs. Always perform a full data integrity check before completing a primary failover. Build monitoring that alerts on replication lag exceeding acceptable thresholds.

Scenario 2: WhatsApp Broadcast Storm

What happened: A bug in WhatsApp’s group message fanout logic triggered a feedback loop that generated a broadcast storm, overwhelming message routing infrastructure and causing service-wide delays.

Root cause: When a group message delivery failed for a subset of recipients, the retry logic inadvertently re-triggered the original broadcast event. This created a feedback loop that multiplied message volume exponentially until the queue infrastructure was saturated.

Impact: All WhatsApp messages experienced delivery delays of up to several hours. The broadcast storm consumed significant infrastructure capacity, slowing down both group and individual message delivery globally. The issue was identified and throttled within 2 hours.

Lesson learned: Implement idempotent message delivery with deduplication at the consumer level. Add rate limiting and circuit breakers to retry logic to prevent feedback loops. Build back-pressure mechanisms that alert and slow down producers before infrastructure is overwhelmed.

Scenario 3: Discord WebSocket Connection Storm

What happened: Discord experienced a connection storm after a software update caused WebSocket clients to reconnect in rapid succession, exhausting server connection limits and causing widespread disconnections.

Root cause: The update changed the WebSocket heartbeat interval to a much shorter value. When deployed, a large percentage of connected clients simultaneously detected a “missed” heartbeat (because the interval change was interpreted as a timeout) and attempted to reconnect at the same time. The resulting connection storm saturated available file descriptors on the gateway servers.

Impact: Millions of users were disconnected simultaneously and experienced significant difficulty reconnecting. Voice and video calls were dropped. The incident persisted for approximately 45 minutes while the team implemented emergency rate limiting and rolled back the heartbeat interval change.

Lesson learned: When changing connection management parameters, roll them out incrementally with connection pool monitoring. Implement client-side jitter to prevent thundering herd on reconnect. Set hard limits on connection establishment rates at the gateway level to absorb connection storms gracefully.


Capacity Estimation

Connection Capacity

Per WebSocket server (4 vCPU, 8GB RAM):
- Max connections: 10,000-50,000
- Messages/second: 1,000-5,000

For 100,000 concurrent users:
- Minimum servers: 100,000 / 10,000 = 10 servers
- With redundancy (2x): 20 servers
- Peak message rate: 100,000 users × 1 msg/user/10sec = 10,000 msg/s

Storage Estimation

Messages per day: 10M DAU × 10 messages/day = 100M messages
Message size (avg): 500 bytes + 200 bytes metadata = 700 bytes
Daily storage: 100M × 700 = 70 GB
5-year storage: 70 GB × 365 × 5 ≈ 128 TB
With 3x replication: ~380 TB
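
The estimates above can be reproduced with a few lines of arithmetic, which makes it easy to rerun them with different assumptions. The constants are taken from this section; decimal units (1 GB = 10^9 bytes) are assumed.

```python
import math

# Connection capacity (figures from this section)
CONCURRENT_USERS = 100_000
CONNECTIONS_PER_SERVER = 10_000   # conservative end of the 10,000-50,000 range
SECONDS_BETWEEN_MESSAGES = 10     # 1 message per user every 10 seconds

min_servers = math.ceil(CONCURRENT_USERS / CONNECTIONS_PER_SERVER)  # 10
servers_with_redundancy = min_servers * 2                           # 20
peak_msgs_per_sec = CONCURRENT_USERS / SECONDS_BETWEEN_MESSAGES     # 10,000

# Storage
DAU = 10_000_000
MESSAGES_PER_USER_PER_DAY = 10
BYTES_PER_MESSAGE = 500 + 200     # content + metadata

messages_per_day = DAU * MESSAGES_PER_USER_PER_DAY                  # 100M
daily_gb = messages_per_day * BYTES_PER_MESSAGE / 1e9               # 70 GB
five_year_tb = daily_gb * 365 * 5 / 1000                            # 127.75 TB
replicated_tb = five_year_tb * 3                                    # 383.25 TB

print(min_servers, servers_with_redundancy, peak_msgs_per_sec)
print(daily_gb, five_year_tb, replicated_tb)
```

The storage figure ignores indexes, attachments, and compression, so it is best read as an order-of-magnitude estimate.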

Common Pitfalls / Anti-Patterns

Pitfall 1: Storing Messages in WebSocket Server Memory

Problem: Messages sent while user is temporarily disconnected are lost if stored only in server memory.

Solution: Store all messages in persistent storage (DB) immediately. Deliver from storage when user reconnects. Use message queue as buffer, not source of truth.

Pitfall 2: Not Handling Reconnection Gracefully

Problem: When client reconnects, duplicate messages appear or gaps in conversation.

Solution: Implement idempotent message delivery with client-side deduplication. Use cursor-based pagination for loading message history.
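
Client-side deduplication can be as simple as remembering recently seen message IDs. The sketch below keeps a bounded set so memory does not grow without limit; `Deduplicator` and its `max_ids` parameter are illustrative, and a real client would size the window to cover its longest expected replay.

```python
from collections import OrderedDict


class Deduplicator:
    """Remember the most recent message IDs and reject repeats."""

    def __init__(self, max_ids: int = 10_000):
        self.max_ids = max_ids
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def accept(self, message_id: str) -> bool:
        """Return True the first time an ID is seen, False for a duplicate."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)  # keep recently repeated IDs
            return False
        self._seen[message_id] = None
        if len(self._seen) > self.max_ids:
            self._seen.popitem(last=False)      # evict the oldest ID
        return True


dedup = Deduplicator()
dedup.accept("msg_abc123")  # True: first delivery, render it
dedup.accept("msg_abc123")  # False: replayed after reconnect, drop it
```

This only works if message IDs are assigned once, by the sender or the server, and reused on every redelivery.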

Pitfall 3: Presence Flooding

Problem: Every presence change triggers notifications to all connected clients, creating broadcast storms in large channels.

Solution: Batch presence updates (send every 5 seconds, not every change). Use bloom filters for “who is online” queries instead of broadcasting.
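
Batching works because only the latest status per user matters within a window. The sketch below coalesces changes and hands them out when flushed; `PresenceBatcher` is a hypothetical name, and the timer that would call `flush` every few seconds is left out to keep the example deterministic.

```python
from typing import Dict


class PresenceBatcher:
    """Coalesce presence changes so each user appears once per flush."""

    def __init__(self) -> None:
        self._pending: Dict[str, str] = {}

    def record(self, user_id: str, status: str) -> None:
        # A later change within the same window overwrites the earlier one.
        self._pending[user_id] = status

    def flush(self) -> Dict[str, str]:
        """Return the pending batch and start a new window."""
        batch, self._pending = self._pending, {}
        return batch


batcher = PresenceBatcher()
batcher.record("alice", "online")
batcher.record("alice", "offline")  # flapping connection
batcher.record("bob", "online")
batch = batcher.flush()             # {'alice': 'offline', 'bob': 'online'}
```

Three changes become one broadcast with two entries, and a user who flaps repeatedly within a window costs no more than one update.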

Pitfall 4: Single Point of Failure for Message Ordering

Problem: Using a single sequencer node for total ordering creates a bottleneck and failure point.

Solution: Use causal ordering (vector clocks) instead of total ordering. Most applications only need causal order anyway.


Quick Recap

  • WebSocket gateways scale horizontally using broker fanout or sticky sessions.
  • Store messages persistently first, then deliver; don’t use memory as source of truth.
  • Use vector clocks for causal ordering; only require total order when truly necessary.
  • Conflict resolution on offline sync uses timestamps, vector clocks, or application logic.
  • Presence should be batched and cached, not broadcast on every change.

Copy/Paste Checklist

- [ ] WebSocket gateway with Redis pub/sub fanout
- [ ] Messages stored in DB before delivery
- [ ] Client auto-reconnect with idempotent delivery
- [ ] Cursor-based pagination for message history
- [ ] Vector clocks for causal ordering
- [ ] Conflict resolution on offline sync
- [ ] Presence batched updates (5-10 second windows)
- [ ] Push notification fallback for offline users
- [ ] Rate limiting per connection
- [ ] E2EE encryption for sensitive conversations

Observability Checklist

Metrics to Capture

  • websocket_connections_active (gauge) - Per server, total
  • messages_delivered_total (counter) - By type (direct, group, system)
  • message_delivery_latency_seconds (histogram) - End-to-end, P50/P95/P99
  • presence_updates_total (counter) - Online/offline transitions
  • offline_sync_duration_seconds (histogram) - Time to sync missed messages
  • push_notification_sent_total (counter) - By channel (APNS, FCM, SMS)

Logs to Emit

{
  "timestamp": "2026-03-22T10:30:00Z",
  "event": "message_delivered",
  "message_id": "msg_abc123",
  "conversation_id": "conv_xyz",
  "recipient_id": "user_456",
  "delivery_latency_ms": 45,
  "channel": "websocket"
}

Alerts to Configure

| Alert | Threshold | Severity |
| --- | --- | --- |
| Connection count > 80% capacity | 80% max | Warning |
| Message latency P99 > 200ms | 200ms | Warning |
| Presence update lag > 10s | 10s | Critical |
| Offline sync duration > 30s | 30s | Warning |
| Push notification queue > 10K | 10000 | Warning |

Security Checklist

  • WebSocket connections over WSS (TLS)
  • JWT token validation on connection
  • Message content encryption (E2EE for sensitive chats)
  • Rate limiting per user (prevent spam)
  • Input sanitization (XSS prevention)
  • Push notification content limited (no PII in notifications)
  • Audit logging for moderation actions
  • Device management (revoke stolen device sessions)

Scenario Drills

Scenario 1: WebSocket Gateway Server Crash

Situation: The WebSocket server handling 10,000 connections crashes unexpectedly.

Analysis:

  • All 10,000 users disconnect simultaneously
  • Client auto-reconnect kicks in but overwhelms surviving servers
  • Messages sent during disconnect are in message store, not lost

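Solution: Run N+1 instances with health checks so the load balancer stops routing to the dead server. Clients implement exponential backoff with jitter on reconnect so the surviving servers are not overwhelmed. Because the crashed server is gone, reconnections must be spread across the healthy instances rather than pinned to the old one by sticky sessions. Messages are stored in a durable DB before delivery, so messages sent during the outage survive server death.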

Scenario 2: Presence Flood in Large Channel

Situation: 1,000 users join a popular channel simultaneously. Each presence change is broadcast to all members.

Analysis:

  • Each join generates 1 presence update broadcast to 999 users
  • 1,000 joins = 999,000 presence messages
  • Network flooding and server CPU spike

Solution: Batch presence updates into 5-10 second windows. Instead of broadcasting each change, publish aggregated state. Use bloom filters for approximate “is user online” queries. For very large channels, hide individual presence.
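
The batching described above can be sketched as a small buffer that keeps only the latest state per user and emits one aggregated update per window. This is an in-memory illustration with hypothetical names; time is passed in explicitly so the behavior is easy to follow, whereas a real service would flush from a timer.

```python
class PresenceBatcher:
    """Collapse presence changes into one aggregated update per window."""

    def __init__(self, window_seconds: float = 5.0):
        self.window_seconds = window_seconds
        self.pending = {}              # user_id -> latest status
        self.window_started_at = None

    def record(self, user_id: str, status: str, now: float):
        """Buffer a change; return the aggregated update once the window elapses."""
        if self.window_started_at is None:
            self.window_started_at = now
        self.pending[user_id] = status  # later changes overwrite earlier ones
        if now - self.window_started_at >= self.window_seconds:
            return self.flush()
        return None

    def flush(self) -> dict:
        """Hand back the buffered state and start a fresh window."""
        update, self.pending = self.pending, {}
        self.window_started_at = None
        return update


batcher = PresenceBatcher(window_seconds=5.0)
batcher.record("user_1", "online", now=0.0)
batcher.record("user_1", "offline", now=1.0)   # flapping collapses to one entry
batcher.record("user_2", "online", now=2.0)
print(batcher.record("user_3", "online", now=5.0))
```

If all 1,000 joins land in one window, each member receives a single aggregated update, so the fan-out is about 1,000 deliveries instead of 999,000.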

Scenario 3: Offline User Comes Back Online After Days

Situation: A user returns after being offline for a week with thousands of missed messages.

Analysis:

  • Message history is large; fetching all at once causes timeout
  • Push notification storm if all messages trigger notifications
  • User sees huge badge count

Solution: Cursor-based pagination syncs in batches. Messages are fetched incrementally with cursors. Push notifications are coalesced into a single “you have messages” notification rather than one per message. The app shows “loading” state while catching up.


Failure Flow Diagrams

Message Delivery Flow

graph TD
    A[Sender Sends Message] --> B[WebSocket Gateway]
    B --> C[Message Router]
    C --> D[Store in Database]
    D --> E{Recipient Online?}
    E -->|Yes| F[Find Connection]
    F --> G[Deliver via WS Gateway]
    E -->|No| H[Queue Push Notification]
    G --> I[Send ACK to Sender]
    H --> J[APNS/FCM]
    J --> K[Device Receives]
    K --> L[User Opens App]
    L --> M[Sync Missed Messages]

WebSocket Connection Lifecycle

graph TD
    A[Client Connects] --> B[Authenticate Token]
    B --> C{Valid?}
    C -->|No| D[Close Connection 4001]
    C -->|Yes| E[Register Connection]
    E --> F[Subscribe to Conversations]
    F --> G[Connection Established]
    G --> H[Heartbeat Loop]
    H --> I{Heartbeat OK?}
    I -->|No| J[Mark Offline]
    I -->|Yes| K{Message Received?}
    K -->|Yes| L[Process Message]
    K -->|No| H
    L --> H
    J --> M[Cleanup Connection]

Presence Update Batching

graph LR
    A[User Online Event] --> B[Presence Service]
    B --> C{Batch Window Open?}
    C -->|No| D[Start 5s Timer]
    C -->|Yes| E[Add to Batch Buffer]
    D --> E
    E --> F{Window Elapsed?}
    F -->|No| E
    F -->|Yes| G[Publish Aggregated Update]
    G --> H[Subscribers Receive]

Interview Questions

1. How do you handle message delivery when the recipient is offline?

When the recipient is offline, the message router stores the message in durable storage first. Then it triggers a push notification to the user's device. When the user comes back online, the client syncs missed messages using a cursor-based pagination protocol, fetching messages since the last known sync timestamp.

2. Why is vector clock ordering better than total ordering for chat?

Total ordering requires a single sequencer node that sees all messages. This creates a bottleneck and a single point of failure. Causal ordering only guarantees that causally related messages arrive in order. Unrelated messages can be interleaved freely. Most chat applications only need causal ordering, which is cheaper to implement and more fault-tolerant.

3. How does presence tracking scale in large channels?

Presence changes are batched rather than broadcast immediately. The system sends presence updates every 5-10 seconds, not on every change. For "who is online" queries, bloom filters can approximate membership without broadcasting to all members. For large channels, presence is often not shown at all or shown as "many people online."

4. What is the offline sync conflict resolution strategy?

Messages are identified by client-generated idempotent message IDs. When syncing, the server returns all messages since the last sync cursor. The client merges local and server messages, deduplicating by message ID. For conflicts (same ID, different content), vector clocks determine which is causally newer, or timestamps break ties.

5. How would you design a message store schema for efficient pagination?

Offset-based pagination (LIMIT 100 OFFSET 500) falls apart at scale. As offset grows, the database still scans all those rows before returning your 100. With millions of messages per conversation, page 1000 is slow.

Cursor-based pagination solves this. Instead of "give me rows 500-600", you say "give me messages after message ID XYZ". The database uses the index on (conversation_id, created_at DESC) to seek directly to the cursor position, so fetching the next page costs about the same regardless of where you are in the conversation.

The cursor can be a message ID or a timestamp — both work with the composite index. The tradeoff is that you can't jump to arbitrary pages; you can only move forward or backward sequentially.

6. What are the trade-offs between WebSockets and HTTP long-polling for chat applications?

WebSockets give you a persistent TCP connection. Once established, sending a message is essentially free — no connection overhead. The tradeoff is infrastructure complexity: WebSockets are stateful, so scaling horizontally means either sticky sessions (same user routes to same server) or a pub/sub layer so all servers can receive messages for all users.

Long-polling works over standard HTTP. The client makes a request, the server holds it open until there's data (or a timeout), then responds and the client immediately opens a new request. It's simpler to scale because every request is stateless, but it generates more HTTP overhead and has higher latency.

For most chat applications, WebSockets are the right call. The infrastructure complexity is manageable with modern tooling. Long-polling is a reasonable fallback for enterprise environments with strict proxies that block WebSocket upgrades.

7. How do you implement rate limiting in a chat system?

Rate limiting needs to be per-user-per-conversation as well as per-user. A single global limit of 100 messages per minute lets Alice flood one recipient with all 100, while a per-recipient limit on its own lets her send one message each to hundreds of people. Combining a per-conversation limit (say, 1 message per second per recipient) with a cap on total sends covers both cases.

Redis sliding window counters work well here. The key format is something like ratelimit:user123:conversation456:minute. Increment on each message, check the count before delivering. If over the limit, return 429 with a Retry-After header.

Presence updates deserve separate limits. A misbehaving client could send presence heartbeats every second; you want to rate limit those lower than actual messages.
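
As a rough illustration of the sliding-window idea, the sketch below keeps a log of recent send times per (user, conversation) pair. It is an in-memory stand-in rather than a Redis implementation, and the class and method names are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within a sliding time window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)  # (user_id, conversation_id) -> timestamps

    def allow(self, user_id: str, conversation_id: str, now: float) -> bool:
        timestamps = self.events[(user_id, conversation_id)]
        # Drop timestamps that have slid out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True

    def retry_after(self, user_id: str, conversation_id: str, now: float) -> float:
        """Seconds until the oldest event leaves the window (0 if not limited)."""
        timestamps = self.events[(user_id, conversation_id)]
        if len(timestamps) < self.limit:
            return 0.0
        return max(0.0, timestamps[0] + self.window_seconds - now)


limiter = SlidingWindowLimiter(limit=2, window_seconds=60.0)
print(limiter.allow("user123", "conversation456", now=0.0))
print(limiter.allow("user123", "conversation456", now=10.0))
print(limiter.allow("user123", "conversation456", now=20.0))
print(limiter.retry_after("user123", "conversation456", now=20.0))
```

The `retry_after` value is what a server could place in a Retry-After header or an equivalent error frame.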

8. What strategies would you use for full-text message search in a chat application?

Don't build this yourself. Elasticsearch or Meilisearch handle tokenization, stemming, ranking, and fuzzy matching — all things you'd otherwise have to implement from scratch. The indexing happens asynchronously: when a message is stored, it's also sent to a queue, and a worker indexes it separately from the hot path.

Scope matters. Global search across all conversations is expensive and usually not what users want. Searching within a single conversation is faster and more useful. Some systems also support searching across a subset of conversations (e.g., "search my starred conversations").

Index freshness vs. search volume is a tradeoff. If you index synchronously, search results are always fresh but you've added latency to every message send. Async indexing (what most systems do) means search results might be a few seconds behind, which is acceptable for most use cases.

9. How would you design an efficient typing indicator system?

Typing indicators are pure ephemeral state — they're not important enough to persist. If the server crashes and you lose who's typing, nobody notices.

Client-side: debounce aggressively. If a user types 10 characters in a second, you shouldn't send 10 typing events. Wait until 300-500ms have passed since the last keystroke before sending one event.

Server-side: batch and broadcast. Instead of notifying all conversation members every time someone starts typing, batch notifications into 2-3 second windows. One user typing rapidly generates one notification, not twenty. Redis sorted sets with timestamps work well for tracking who is typing in a conversation.

State expires automatically. If no typing event comes in for 5-10 seconds, the user has either stopped or their client crashed. The indicator disappears without requiring explicit stop events.
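
The automatic expiry can be sketched with a map of last-event timestamps. This is an in-memory illustration with hypothetical names; a sorted set keyed by timestamp, as mentioned above, would play the same role in a shared store.

```python
class TypingTracker:
    """Track who is typing; entries expire without explicit stop events."""

    def __init__(self, ttl_seconds: float = 5.0):
        self.ttl_seconds = ttl_seconds
        self.last_event = {}  # (conversation_id, user_id) -> timestamp

    def mark_typing(self, conversation_id: str, user_id: str, now: float) -> None:
        self.last_event[(conversation_id, user_id)] = now

    def who_is_typing(self, conversation_id: str, now: float) -> list:
        # Remove entries whose last typing event is older than the TTL.
        expired = [
            key for key, ts in self.last_event.items()
            if now - ts >= self.ttl_seconds
        ]
        for key in expired:
            del self.last_event[key]
        return sorted(
            user for (conv, user) in self.last_event if conv == conversation_id
        )


tracker = TypingTracker(ttl_seconds=5.0)
tracker.mark_typing("conv_xyz", "alice", now=0.0)
tracker.mark_typing("conv_xyz", "bob", now=3.0)
print(tracker.who_is_typing("conv_xyz", now=4.0))
print(tracker.who_is_typing("conv_xyz", now=6.0))
```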

10. What are the key considerations for multi-device message synchronization?

The core challenge: the same user might have the app open on their laptop and phone simultaneously. Messages arrive on one device; they need to appear on the other. Read state needs to sync too — if you read a message on your phone, you shouldn't see it as unread on your laptop.

Each device maintains a cursor — the ID of the last message it has synced. When a device comes online (or periodically while connected), it asks the server for any messages since that cursor. The server returns them; the client merges them with local messages. Vector clocks or simple timestamps handle ordering.

Read receipts are a separate problem. When you mark a message as read on one device, that read state needs to fan out to all your other devices and to the sender. This can be handled via the same message queue infrastructure as regular messages.

On mobile, be careful about sync frequency. Aggressive sync drains battery. Quiesce sync when the app is backgrounded and catch up when it returns to foreground.

11. How does the Signal Protocol provide forward secrecy and break-in recovery?

The Signal Protocol uses the Double Ratchet algorithm to achieve two properties that normal encryption schemes lack.

Forward secrecy means that if your long-term key is compromised, past messages stay safe. This works because messages are encrypted with ephemeral session keys derived from the root key. The root key itself is derived through Diffie-Hellman ratcheting, and once a message key is used, it's deleted. Compromising today's key doesn't reveal yesterday's messages.

Break-in recovery (also called post-compromise security) means that after a device's session keys are compromised, the protocol can heal without tearing down and re-establishing the session. Each time the conversation changes direction, the parties mix a fresh Diffie-Hellman output into the root key, so once a few messages have been exchanged the attacker's stolen keys no longer derive the current ones.

The combination limits the damage of any single compromise: a stolen key exposes only a narrow window of messages, and an attacker who wants more would have to keep compromising the device as the ratchet advances.
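
The forward-secrecy half of this can be illustrated with a symmetric key-derivation chain. This is a teaching sketch only, not the Signal Protocol itself: it omits the Diffie-Hellman ratchet that provides break-in recovery, the seed and the single-byte constants are illustrative choices, and it should not be used as real cryptography.

```python
import hashlib
import hmac


def ratchet_step(chain_key: bytes) -> tuple:
    """Derive (next_chain_key, message_key) from the current chain key.

    HMAC is one-way, so holding a later chain key does not reveal
    earlier chain keys or the message keys derived from them.
    """
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


# Illustrative starting point; a real session derives this from a key agreement.
chain_key = hashlib.sha256(b"demo shared secret, not a real key").digest()

message_keys = []
for _ in range(3):
    chain_key, message_key = ratchet_step(chain_key)
    message_keys.append(message_key)  # use once, then delete in a real client

print([key.hex()[:8] for key in message_keys])
```

Each message gets its own key, and deleting a key after use means a later compromise of `chain_key` cannot recover it.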

12. How would you handle message deduplication at scale in a distributed chat system?

Idempotent message IDs are the foundation. The client generates a UUID before sending, and the server checks if that ID already exists before processing. This moves ID generation from the server to the client, which is fine because the client is the source of truth for its own sends; a retry of the same send carries the same ID, so the server can recognize it.

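At the server level, you need a fast lookup. Redis with the message ID as a key works for active conversations. For cross-conversation deduplication across millions of messages, you'd use a bloom filter or a distributed set. A bloom filter trades a small false-positive rate for a much smaller memory footprint; because a false positive would drop a legitimate message, a filter hit should be confirmed against the authoritative store.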

The tricky part is cleanup. Message IDs accumulate forever if you keep them. A common approach is to partition by time window — keep deduplication state for 7 days, then expire. Most real duplicates happen immediately due to network retries anyway.

For groups with many members, the deduplication check should happen before fanout, not during. One server checks the ID, then it routes to all recipients without re-checking.
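
A minimal sketch of time-windowed deduplication state, assuming timestamps are non-decreasing so that insertion order matches age. The class name is hypothetical, and a shared store with key expiry would replace the dictionary in a multi-server deployment.

```python
SEVEN_DAYS = 7 * 24 * 3600


class DedupCache:
    """Remember message IDs for a fixed window, then forget them."""

    def __init__(self, ttl_seconds: float = SEVEN_DAYS):
        self.ttl_seconds = ttl_seconds
        self.first_seen = {}  # message_id -> timestamp, oldest first

    def is_new(self, message_id: str, now: float) -> bool:
        """Return True the first time an ID is seen within the window."""
        self._expire(now)
        if message_id in self.first_seen:
            return False
        self.first_seen[message_id] = now
        return True

    def _expire(self, now: float) -> None:
        # Dicts preserve insertion order, so the first key is the oldest.
        while self.first_seen:
            oldest_id = next(iter(self.first_seen))
            if now - self.first_seen[oldest_id] < self.ttl_seconds:
                break
            del self.first_seen[oldest_id]


cache = DedupCache()
print(cache.is_new("msg_abc123", now=0))   # first delivery
print(cache.is_new("msg_abc123", now=1))   # network retry, rejected
```

Running this check once before fanout means the per-recipient delivery path never needs to consult the cache.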

13. What are the trade-offs between optimistic UI updates versus confirmed delivery in chat applications?

Optimistic updates show the message immediately in the UI as if it were sent, before the server confirms it. This feels fast — zero perceived latency. The risk is that if the send fails, you have to roll back the UI, which is jarring for users and requires careful state management.

Confirmed delivery waits for a server ACK before showing the message as "sent." This is safer but slower. If the roundtrip is 50ms, users notice. On mobile with variable latency, it gets worse.

Most apps use optimistic updates with local queuing. The message goes into a "pending" state, shows in the UI, and if the server ACK doesn't come back within a timeout (say, 10 seconds), it's marked as failed with a retry button. The user sees it immediately but knows it's not confirmed.

The middle ground some apps use: optimistic updates for text, confirmed delivery for important actions like payments or message edits. Text is low-stakes; editing a message or sending a large file isn't.
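
The pending/sent/failed lifecycle can be sketched as a small client-side outbox. The names are hypothetical and time is passed in explicitly to keep the example deterministic; a real client would drive `expire` from a timer.

```python
class Outbox:
    """Client-side state for optimistically displayed messages."""

    def __init__(self, ack_timeout_seconds: float = 10.0):
        self.ack_timeout_seconds = ack_timeout_seconds
        self.status = {}   # message_id -> "pending" | "sent" | "failed"
        self.sent_at = {}  # message_id -> time of the latest send attempt

    def send(self, message_id: str, now: float) -> None:
        """Show the message immediately as pending and start the ACK timer."""
        self.status[message_id] = "pending"
        self.sent_at[message_id] = now

    def on_ack(self, message_id: str) -> None:
        """Server confirmed the message; a late ACK also clears a failure."""
        if message_id in self.status:
            self.status[message_id] = "sent"

    def expire(self, now: float) -> list:
        """Mark pending messages whose ACK timed out as failed; return their IDs."""
        failed = [
            mid for mid, state in self.status.items()
            if state == "pending"
            and now - self.sent_at[mid] >= self.ack_timeout_seconds
        ]
        for mid in failed:
            self.status[mid] = "failed"
        return failed


outbox = Outbox(ack_timeout_seconds=10.0)
outbox.send("msg_1", now=0.0)
outbox.on_ack("msg_1")
outbox.send("msg_2", now=5.0)
print(outbox.expire(now=15.0))  # msg_2 never got an ACK
```

A retry is simply another `send` with the same message ID, which pairs naturally with idempotent IDs on the server.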

14. How do you prevent read state from getting inconsistent across multiple devices?

Read state is one of the harder problems in multi-device chat. When you read a message on your phone, that needs to sync to your laptop, your tablet, and any web sessions — and it needs to happen fast enough that you don't see it as unread again.

The core mechanism: read receipts flow through the same message infrastructure as regular messages. When you mark something read, that event goes to the server, which fans it out to all your other devices via their active WebSocket connections. All devices converge on the same read cursor.

The tricky part is ordering. If you read message 100 on your phone while message 99 is still syncing to your laptop, the laptop might think you've only read up to message 98. The fix is to only advance the read cursor to the highest consecutive message number, not just the latest.

For conversations with thousands of messages, you don't want to sync read state on every single read. Batch it — send a read receipt every 5-10 seconds or when leaving the conversation, not on every scroll position change.
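
The "highest consecutive" rule can be expressed in a few lines, assuming messages in a conversation carry contiguous sequence numbers (an assumption of this sketch; the function name is hypothetical).

```python
def apply_read_receipt(local_cursor: int, read_up_to: int, synced: set) -> int:
    """Return the new read cursor after a receipt from another device.

    The cursor advances only across sequence numbers this device has
    actually synced, and it never moves backwards.
    """
    cursor = local_cursor
    while cursor < read_up_to and (cursor + 1) in synced:
        cursor += 1
    return cursor


# The phone reports "read up to 100", but message 99 has not reached the laptop yet.
laptop_synced = {96, 97, 98, 100}
cursor = apply_read_receipt(97, 100, laptop_synced)
print(cursor)  # stops at 98

# Once 99 arrives, re-applying the same receipt completes the catch-up.
laptop_synced.add(99)
print(apply_read_receipt(cursor, 100, laptop_synced))
```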

15. How would you design a system to detect and suspend spammy users in real-time?

Real-time spam detection operates at two levels: rate limiting and content classification.

Rate limiting is straightforward. Redis sliding window counters track messages per user per conversation. If someone sends more than 1 message per second to the same person, or more than 10 messages per minute to different people, flag them. This catches bots immediately without analyzing content.

Content classification is harder. You don't want to scan every message synchronously — that adds latency. Instead, asynchronously analyze messages after delivery. A queue collects messages, a worker runs detection (simple: keyword matching, link density, account age checks; complex: ML models), and if flagged, the content is deleted and the user is warned or suspended.

For real-time response to spam waves (coordinated attacks), use reputation scores. New accounts have low reputation. Accounts with history of good behavior have high reputation. Low reputation users get slower delivery — messages go through a moderation queue before reaching recipients. This buys time without blocking legitimate users.

16. What architectural changes would you make to support end-to-end encrypted group chats with hundreds of members?

Naive pairwise E2EE costs O(n) encryptions per message, because every message is encrypted separately for every recipient. With 100 members, that's 99 encryptions per message, and O(n²) work across the group when everyone is active. At scale, this doesn't work.

The practical approach is sender keys. Each member generates one key that they share with every other member once (during group join), over the existing pairwise encrypted channels. When sending, they encrypt the message once with their sender key, and the server fans out that single ciphertext. Recipients decrypt with the sender's key. This reduces the cost to one encryption per message; the O(n) work moves to key distribution, which happens only when membership changes.

When someone joins late, you need to send them all the sender keys for the current group state. When someone leaves, you need to remove their access to future messages. Group key management is its own sub-problem — you need a protocol for updating keys when membership changes.

For very large groups (broadcast-style with thousands), E2EE often isn't practical. Most large group apps either accept that admins can read messages (operational security model) or they restrict E2EE to small groups where it's tractable.

17. How does cursor-based pagination work for loading message history, and why is it better than offset-based?

Offset-based pagination says "give me rows 500 through 600." The database still scans the first 500 rows before returning the 100 you asked for. With millions of messages per conversation, page 10,000 is effectively a full table scan.

Cursor-based pagination says "give me messages after message ID XYZ." The database uses the composite index on (conversation_id, created_at DESC, id) and jumps directly to the cursor position. The query is O(log n) regardless of where you are in the conversation.

The cursor can be a message ID or a timestamp. A time-sortable message ID (for example, a sequence number or a Snowflake-style ID) is better because it handles concurrent inserts cleanly: if a new message arrives while you're paginating, you won't see duplicates or gaps. Timestamps have precision limits; concurrent inserts at the same millisecond are ambiguous unless the ID is used as a tiebreaker.

The tradeoff is that cursors don't support random access. You can't jump to page 100 directly. You have to walk forward from the beginning. For most use cases this is fine — users rarely need to jump to arbitrary points in chat history.
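
Here is a small runnable sketch of keyset (cursor) pagination using SQLite from the Python standard library. The table layout, index name, and `fetch_page` helper are assumptions for illustration; the cursor is the (created_at, id) pair of the last row returned, with the ID breaking ties between equal timestamps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, conversation_id TEXT, created_at INTEGER, body TEXT)"
)
conn.execute(
    "CREATE INDEX idx_conv_time ON messages (conversation_id, created_at DESC, id DESC)"
)
# Ten messages; several share a timestamp to exercise the tiebreaker.
conn.executemany(
    "INSERT INTO messages (id, conversation_id, created_at, body) VALUES (?, ?, ?, ?)",
    [(i, "conv_xyz", 1000 + i // 2, f"message {i}") for i in range(1, 11)],
)


def fetch_page(conversation_id, cursor=None, limit=4):
    """Return (rows, next_cursor), newest first. cursor is (created_at, id)."""
    sql = "SELECT id, created_at, body FROM messages WHERE conversation_id = ? "
    params = [conversation_id]
    if cursor is not None:
        created_at, message_id = cursor
        # Strictly older than the cursor row, using id to break timestamp ties.
        sql += "AND (created_at < ? OR (created_at = ? AND id < ?)) "
        params += [created_at, created_at, message_id]
    sql += "ORDER BY created_at DESC, id DESC LIMIT ?"
    params.append(limit)
    rows = conn.execute(sql, params).fetchall()
    next_cursor = (rows[-1][1], rows[-1][0]) if len(rows) == limit else None
    return rows, next_cursor


page, cursor = fetch_page("conv_xyz")
while page:
    print([row[0] for row in page])
    if cursor is None:
        break
    page, cursor = fetch_page("conv_xyz", cursor)
```

Each query filters on the indexed columns, so the database can seek to the cursor instead of counting past skipped rows.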

18. How would you handle the "last seen" feature without creating privacy or performance issues?

"Last seen" seems simple — store the timestamp of the user's last activity. It's not simple.

The privacy problem: showing exactly when someone was online is intrusive. "She was last active at 11:43pm" tells you someone stayed up late, which you shouldn't know. Most apps blur this: "online," "last seen 5 minutes ago," "last seen today."

The performance problem: every presence update writes to storage. If you track exact timestamps on every keystroke, you have millions of writes per day for a popular app. Instead, round to the nearest 5-minute bucket or only update on explicit presence changes (connect, disconnect, app background/foreground), not on every message.

For showing "last seen" to other users, don't expose exact timestamps from recent activity. Use relative phrases (active 5m ago, active 2h ago) that decay to larger buckets over time. Nobody needs to know exactly when someone was online in real-time except in specific contexts.
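
Both ideas, coarse storage buckets and decaying display labels, fit in a short sketch. The thresholds and label wording below are illustrative assumptions, not values prescribed by the text.

```python
BUCKET_SECONDS = 5 * 60


def bucket_timestamp(ts: int) -> int:
    """Round down to a 5-minute bucket so activity within a bucket needs no new write."""
    return ts - ts % BUCKET_SECONDS


def last_seen_label(last_active: int, now: int) -> str:
    """Turn a last-activity time into a deliberately coarse label."""
    elapsed = max(0, now - last_active)
    if elapsed < BUCKET_SECONDS:
        return "online"
    if elapsed < 3600:
        minutes = elapsed // BUCKET_SECONDS * 5  # 5-minute granularity
        return f"active {minutes}m ago"
    if elapsed < 86400:
        return f"active {elapsed // 3600}h ago"
    return "active a while ago"


print(bucket_timestamp(1_000_123))
print(last_seen_label(0, 12 * 60))
print(last_seen_label(0, 2 * 3600 + 100))
```

A write is only needed when `bucket_timestamp` of the new activity differs from the stored value, which caps presence writes at one per user per bucket.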

19. What strategies exist for managing WebSocket connections during network transitions (e.g., WiFi to cellular)?

Network transitions are a silent killer of WebSocket connections. When a phone switches from WiFi to cellular, the TCP connection doesn't migrate — it dies. The app has to reconnect.

On mobile, the OS handles network transitions transparently at the IP level for a few seconds. Your WebSocket connection might survive a brief WiFi handoff. But a full network switch (WiFi off, cellular on) always breaks connections.

The client-side fix: implement exponential backoff reconnection with jitter. When the connection drops, wait a random interval (0-2 seconds), then reconnect. Don't hammer the server immediately after a network change — the network stack needs a moment to stabilize.

At the protocol level, use a session token that's independent of the TCP connection. The client identifies its session, the server knows which messages it has received, and the client can resume from where it left off without a full re-sync. This is especially important for mobile where connections are inherently unstable.
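
The reconnect strategy above can be sketched as a "full jitter" backoff function: the ceiling doubles with each attempt up to a cap, and the actual delay is drawn uniformly below it. The parameter values are illustrative; with `base=2.0`, the first attempt waits a random 0-2 seconds as described.

```python
import random


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 30.0, rng=random) -> float:
    """Delay in seconds before a zero-based reconnect attempt, with full jitter."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


# Print the delay schedule a client might follow after a dropped connection.
rng = random.Random(42)  # seeded only so the demo is repeatable
for attempt in range(6):
    delay = backoff_delay(attempt, rng=rng)
    print(f"attempt {attempt}: wait up to {min(30.0, 2.0 * 2 ** attempt):.0f}s, chose {delay:.2f}s")
```

Randomizing the whole interval, rather than adding a small jitter to a fixed delay, spreads reconnects from many clients more evenly after a shared outage.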

20. How do you design for message edit and delete semantics that are consistent across devices?

Edit and delete are surprisingly hard because they need to work consistently across all a user's devices and all recipients.

For delete: the cleanest model is soft delete with a tombstone. The message stays in the database but is marked deleted. When a client loads messages, it checks the tombstone and hides the content, showing "This message was deleted" instead. This works across all devices because the tombstone propagates like a regular message.

For edit: edits should create a new message ID with a reference to the original. The original is marked as edited (but not deleted — you still see it was edited). Clients display the latest version but can show edit history if needed. The server rejects edits on messages older than a time window (say, 15 minutes) to prevent abuse.

The hardest case: what if you've read the original message on device A, the sender edits it on device B, and device A is offline when the edit happens? Device A comes online, syncs, and needs to show the edited content without marking it as unread. This requires tracking which messages each device has acknowledged and only updating read state for genuinely new content.


Further Reading

For more on real-time systems, see our Message Queue Types guide. For message queues that power chat delivery, see the RabbitMQ and Apache Kafka guides.

Conclusion

Building a scalable chat system requires:

  • WebSocket gateways for persistent connections
  • Message routers that handle online and offline delivery
  • Durable message storage with efficient queries
  • Presence tracking with heartbeat mechanism
  • Push notifications for offline users
  • Consistent hashing for horizontal scaling
