System Design: Real-Time Chat System Architecture
Deep dive into chat system design. Learn about WebSocket management, message delivery, presence systems, scalability patterns, and E2E encryption.
Chat systems like WhatsApp, Slack, and Discord handle millions of concurrent users sending messages in real-time. The challenges are fundamentally different from request-response web applications: persistent connections, low-latency delivery, and handling offline users.
This case study walks through designing a scalable chat system.
Requirements Analysis
Functional Requirements
Users should be able to:
- Send and receive messages in real-time
- Create and join chat rooms or direct messages
- See who is online (presence)
- Receive push notifications when offline
- Send text, images, and files
- Search through message history
Non-Functional Requirements
The system needs:
- Message delivery under 100ms for online users
- Support 100,000+ concurrent connections per server
- Handle 10 million daily active users
- Store message history indefinitely
- Maintain 99.99% availability
Architecture Overview
graph TB
A[Mobile App] -->|WSS| B[WebSocket Gateway]
A -->|HTTPS| C[REST API Gateway]
B --> D[Message Router]
D --> E[Message Store]
D --> F[Presence Service]
D --> G[Notification Service]
C --> E
C --> H[User Service]
E --> I[(Database)]
G --> J[(Push Service)]
G --> K[(Queue)]
WebSocket Connections
WebSockets provide persistent bidirectional connections, essential for real-time chat.
Connection Flow
sequenceDiagram
participant C as Client
participant W as WebSocket Gateway
participant R as Message Router
participant M as Message Store
C->>W: CONNECT {auth_token}
W->>W: Validate token
W->>W: Extract user_id
W->>R: Register connection {user_id, connection_id}
R-->>W: Connection registered
W-->>C: Connection established
C->>W: SEND {recipient_id, content}
W->>R: Route message {from, to, content}
R->>M: Store message
M-->>R: Message stored
R->>R: Find recipient connection
R->>W: Deliver to recipient
W-->>C: MESSAGE {from, content}
Note over C,W: Bidirectional messages flow through router
WebSocket Server Implementation
import asyncio
import json
import websockets
from dataclasses import dataclass
from typing import Dict, Set
from uuid import uuid4

@dataclass
class Connection:
    user_id: str
    connection_id: str
    websocket: websockets.WebSocketServerProtocol
    subscribed_rooms: Set[str]

class WebSocketGateway:
    def __init__(self, message_router: MessageRouter):
        self.router = message_router
        self.connections: Dict[str, Connection] = {}
        self.user_connections: Dict[str, Set[str]] = {}

    async def handle_connection(self, websocket: websockets.WebSocketServerProtocol, path: str):
        # Authenticate
        token = websocket.request_headers.get("Authorization", "").replace("Bearer ", "")
        user_id = await self.authenticate(token)
        if not user_id:
            await websocket.close(4001, "Authentication failed")
            return

        connection_id = str(uuid4())
        connection = Connection(
            user_id=user_id,
            connection_id=connection_id,
            websocket=websocket,
            subscribed_rooms=set()
        )

        # Register connection
        self.connections[connection_id] = connection
        if user_id not in self.user_connections:
            self.user_connections[user_id] = set()
        self.user_connections[user_id].add(connection_id)

        # Register with router
        await self.router.register_connection(user_id, connection_id)

        try:
            # Handle incoming messages
            async for message in websocket:
                await self.handle_message(connection, message)
        except websockets.ConnectionClosed:
            pass
        finally:
            await self.cleanup(connection)

    async def handle_message(self, connection: Connection, raw_message: str):
        message = json.loads(raw_message)
        msg_type = message.get("type")

        if msg_type == "message":
            await self.router.route_message(
                from_user=connection.user_id,
                to_user=message["recipient_id"],
                content=message["content"],
                connection_id=connection.connection_id
            )
        elif msg_type == "subscribe":
            room_id = message["room_id"]
            connection.subscribed_rooms.add(room_id)
            await self.router.subscribe_to_room(connection.user_id, room_id)
        elif msg_type == "typing":
            await self.router.send_typing_indicator(
                from_user=connection.user_id,
                to_user=message["recipient_id"]
            )
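The `cleanup` call above is referenced but not shown. A minimal sketch of the bookkeeping it must undo (an in-memory stand-in with hypothetical names, mirroring the registration logic) could look like:

```python
from typing import Dict, Set

class ConnectionRegistry:
    """In-memory bookkeeping mirroring the gateway's register/cleanup steps."""
    def __init__(self):
        self.connections: Dict[str, str] = {}          # connection_id -> user_id
        self.user_connections: Dict[str, Set[str]] = {}

    def register(self, user_id: str, connection_id: str):
        self.connections[connection_id] = user_id
        self.user_connections.setdefault(user_id, set()).add(connection_id)

    def cleanup(self, connection_id: str) -> bool:
        """Remove a connection; return True if the user has no connections left."""
        user_id = self.connections.pop(connection_id, None)
        if user_id is None:
            return False
        conns = self.user_connections.get(user_id, set())
        conns.discard(connection_id)
        if not conns:
            # Last device disconnected: the caller should mark the user offline
            self.user_connections.pop(user_id, None)
            return True
        return False

registry = ConnectionRegistry()
registry.register("alice", "c1")
registry.register("alice", "c2")
print(registry.cleanup("c1"))  # False - alice still has c2 open
print(registry.cleanup("c2"))  # True - last connection gone, mark offline
```

The boolean return matters: a user with two devices should not flip to offline when only one of them disconnects.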
Message Routing
The message router is the heart of the chat system, handling delivery to online and offline users.
Message Router Service
class MessageRouter:
    def __init__(
        self,
        message_store: MessageStore,
        presence_service: PresenceService,
        notification_service: NotificationService
    ):
        self.message_store = message_store
        self.presence_service = presence_service
        self.notification_service = notification_service
        self.online_connections: Dict[str, Set[str]] = {}

    async def register_connection(self, user_id: str, connection_id: str):
        if user_id not in self.online_connections:
            self.online_connections[user_id] = set()
        self.online_connections[user_id].add(connection_id)
        await self.presence_service.set_online(user_id)

    async def route_message(
        self,
        from_user: str,
        to_user: str,
        content: str,
        connection_id: str = None
    ):
        # Store message first (durability)
        message = await self.message_store.store(
            from_user=from_user,
            to_user=to_user,
            content=content
        )

        # Deliver to online recipients; no push notification needed
        if to_user in self.online_connections:
            for conn_id in self.online_connections[to_user]:
                await self.deliver_message(conn_id, message)
        else:
            # User offline - send push notification
            await self.notification_service.send_push_notification(
                user_id=to_user,
                message=message
            )

        # Acknowledge to sender if online
        if connection_id:
            await self.send_ack(connection_id, message)
        return message

    async def deliver_message(self, connection_id: str, message: Message):
        gateway = self.get_gateway_for_connection(connection_id)
        await gateway.send_to_connection(connection_id, {
            "type": "message",
            "message_id": message.id,
            "from": message.from_user,
            "content": message.content,
            "timestamp": message.created_at.isoformat()
        })
Message Storage
Messages must be stored durably and remain efficiently queryable.
Database Schema
-- Messages table
CREATE TABLE messages (
    id BIGSERIAL PRIMARY KEY,
    conversation_id TEXT NOT NULL,
    sender_id BIGINT NOT NULL,
    content TEXT NOT NULL,
    content_type TEXT DEFAULT 'text', -- text, image, file
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    delivered_at TIMESTAMP WITH TIME ZONE,
    read_at TIMESTAMP WITH TIME ZONE
);

-- PostgreSQL does not support inline INDEX clauses; create indexes separately
CREATE INDEX idx_messages_conversation ON messages (conversation_id, created_at DESC);
CREATE INDEX idx_messages_sender ON messages (sender_id, created_at DESC);

-- Conversations table
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    type TEXT NOT NULL, -- direct, group
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

CREATE INDEX idx_conversations_updated ON conversations (updated_at DESC);

-- Conversation members
CREATE TABLE conversation_members (
    conversation_id TEXT NOT NULL REFERENCES conversations(id),
    user_id BIGINT NOT NULL,
    joined_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    last_read_at TIMESTAMP WITH TIME ZONE,
    unread_count BIGINT DEFAULT 0,
    PRIMARY KEY (conversation_id, user_id)
);

-- Index for finding user's conversations
CREATE INDEX idx_members_user ON conversation_members (user_id);
Message Store Service
import json
from datetime import datetime
from typing import List

class MessageStore:
    def __init__(self, db: Database, cache: RedisCache):
        self.db = db
        self.cache = cache

    def generate_conversation_id(self, user1: str, user2: str) -> str:
        # Consistent ordering for DMs
        sorted_users = sorted([user1, user2])
        return f"dm:{sorted_users[0]}:{sorted_users[1]}"

    async def store(
        self,
        from_user: str,
        to_user: str,
        content: str,
        conversation_id: str = None
    ) -> Message:
        if not conversation_id:
            conversation_id = self.generate_conversation_id(from_user, to_user)

        # Ensure conversation exists
        await self._ensure_conversation_exists(conversation_id)

        # Insert message
        row = await self.db.fetch_one("""
            INSERT INTO messages (conversation_id, sender_id, content)
            VALUES ($1, $2, $3)
            RETURNING *
        """, conversation_id, from_user, content)
        message = Message(**row)

        # Update conversation timestamp
        await self.db.execute("""
            UPDATE conversations
            SET updated_at = NOW()
            WHERE id = $1
        """, conversation_id)

        # Invalidate cache
        await self.cache.delete(f"conversation:{conversation_id}:messages")
        return message

    async def get_messages(
        self,
        conversation_id: str,
        limit: int = 50,
        before: datetime = None
    ) -> List[Message]:
        # Check cache first (only the first page is cached)
        cache_key = f"conversation:{conversation_id}:messages"
        if not before:
            cached = await self.cache.get(cache_key)
            if cached:
                return [Message(**m) for m in json.loads(cached)]

        # Query database
        query = """
            SELECT * FROM messages
            WHERE conversation_id = $1
        """
        params = [conversation_id]
        if before:
            query += " AND created_at < $2"
            params.append(before)
        # LIMIT must use the NEXT positional placeholder, not the current count
        query += f" ORDER BY created_at DESC LIMIT ${len(params) + 1}"
        params.append(limit)

        rows = await self.db.fetch(query, *params)
        messages = [Message(**row) for row in rows]

        # Cache if first page
        if not before:
            await self.cache.setex(
                cache_key,
                300,
                json.dumps([m.dict() for m in messages])
            )
        return messages
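The conversation-ID scheme is order-independent, so both participants of a DM derive the same ID. A quick standalone check of that property:

```python
def generate_conversation_id(user1: str, user2: str) -> str:
    # Sorting makes the ID identical regardless of who initiates the DM
    sorted_users = sorted([user1, user2])
    return f"dm:{sorted_users[0]}:{sorted_users[1]}"

print(generate_conversation_id("bob", "alice"))  # dm:alice:bob
print(generate_conversation_id("alice", "bob"))  # dm:alice:bob
```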
Presence System
The presence system shows which users are online and what their activity status is.
Presence Service
class PresenceService:
    def __init__(self, redis: Redis):
        self.redis = redis
        self.PRESENCE_TTL = 60  # Seconds; refreshed by heartbeats

    async def set_online(self, user_id: str):
        key = f"presence:{user_id}"
        await self.redis.setex(key, self.PRESENCE_TTL, json.dumps({
            "status": "online",
            "last_seen": datetime.utcnow().isoformat()
        }))
        # Publish presence change
        await self.redis.publish("presence_updates", json.dumps({
            "user_id": user_id,
            "status": "online"
        }))

    async def set_offline(self, user_id: str):
        key = f"presence:{user_id}"
        await self.redis.setex(key, self.PRESENCE_TTL, json.dumps({
            "status": "offline",
            "last_seen": datetime.utcnow().isoformat()
        }))
        await self.redis.publish("presence_updates", json.dumps({
            "user_id": user_id,
            "status": "offline"
        }))

    async def get_presence(self, user_id: str) -> Dict:
        key = f"presence:{user_id}"
        data = await self.redis.get(key)
        if data:
            return json.loads(data)
        return {"status": "offline"}

    async def get_bulk_presence(self, user_ids: List[str]) -> Dict[str, Dict]:
        keys = [f"presence:{uid}" for uid in user_ids]
        values = await self.redis.mget(keys)
        result = {}
        for uid, value in zip(user_ids, values):
            if value:
                result[uid] = json.loads(value)
            else:
                result[uid] = {"status": "offline"}
        return result
Heartbeat Mechanism
class PresenceHeartbeat:
    def __init__(self, presence_service: PresenceService):
        self.presence_service = presence_service
        self.heartbeat_interval = 30  # Must be shorter than PRESENCE_TTL

    async def start_heartbeat(self, websocket, user_id: str):
        while True:
            try:
                await asyncio.sleep(self.heartbeat_interval)
                await websocket.send(json.dumps({"type": "heartbeat_ack"}))
                await self.presence_service.refresh_heartbeat(user_id)
            except websockets.ConnectionClosed:
                await self.presence_service.set_offline(user_id)
                break

# Defined on PresenceService: extend the TTL on each heartbeat
async def refresh_heartbeat(self, user_id: str):
    key = f"presence:{user_id}"
    data = {
        "status": "online",
        "last_seen": datetime.utcnow().isoformat()
    }
    await self.redis.setex(key, self.PRESENCE_TTL, json.dumps(data))
Push Notifications
When users are offline, push notifications deliver new messages.
Notification Service
class NotificationService:
    def __init__(
        self,
        push_service: PushService,
        user_service: UserService
    ):
        self.push_service = push_service
        self.user_service = user_service

    async def send_push_notification(
        self,
        user_id: str,
        message: Message
    ):
        # Get user's push tokens
        tokens = await self.user_service.get_push_tokens(user_id)
        if not tokens:
            return

        # Get sender info
        sender = await self.user_service.get_user(message.from_user)

        # Send to all user's devices
        await asyncio.gather(*[
            self.push_service.send(
                token=token,
                notification={
                    "title": f"Message from {sender.display_name}",
                    "body": message.content[:100],
                    "data": {
                        "type": "message",
                        "conversation_id": message.conversation_id,
                        "message_id": message.id
                    }
                }
            )
            for token in tokens
        ])
Scalability Patterns
Connection Distribution
graph TB
L[Load Balancer]
W1[WS Server 1<br/>10K connections]
W2[WS Server 2<br/>10K connections]
W3[WS Server 3<br/>10K connections]
W4[WS Server N<br/>10K connections]
L -->|Route| W1
L -->|Route| W2
L -->|Route| W3
L -->|Route| W4
subgraph "Message Bus"
K[Kafka/RabbitMQ]
end
W1 --> K
W2 --> K
W3 --> K
W4 --> K
Horizontal Scaling
import hashlib
from bisect import bisect_left
from typing import List

class ConsistentHashRouter:
    def __init__(self, nodes: List[str]):
        self.ring = {}
        self.sorted_keys = []
        self._build_ring(nodes)

    def _build_ring(self, nodes: List[str]):
        for node in nodes:
            for i in range(100):  # Virtual nodes
                key = self._hash(f"{node}:{i}")
                self.ring[key] = node
        self.sorted_keys = sorted(self.ring.keys())

    def get_node(self, key: str) -> str:
        hash_key = self._hash(key)
        idx = bisect_left(self.sorted_keys, hash_key) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Usage: Route message to the server holding recipient's connection
async def route_to_user(user_id: str, message: Message):
    server = consistent_hash.get_node(user_id)
    await kafka.send("messageDelivery", {
        "server": server,
        "user_id": user_id,
        "message": message.dict()
    })
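The key property of consistent hashing is worth verifying: the same user always maps to the same server, and removing a node remaps only the keys that lived on it. A self-contained demo (repeating the ring logic so it runs standalone; names like `HashRing` and `ws-1` are illustrative):

```python
import hashlib
from bisect import bisect_left
from typing import List

class HashRing:
    """Minimal consistent-hash ring (same scheme as ConsistentHashRouter)."""
    def __init__(self, nodes: List[str], vnodes: int = 100):
        self.ring = {}
        for node in nodes:
            for i in range(vnodes):
                self.ring[self._hash(f"{node}:{i}")] = node
        self.sorted_keys = sorted(self.ring)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        idx = bisect_left(self.sorted_keys, self._hash(key)) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

full = HashRing(["ws-1", "ws-2", "ws-3"])
smaller = HashRing(["ws-1", "ws-2"])  # ws-3 removed

# The same user always lands on the same server
assert full.get_node("user:42") == full.get_node("user:42")

# Keys that lived on surviving nodes stay put when ws-3 disappears
moved = sum(
    1 for uid in range(1000)
    if full.get_node(f"user:{uid}") in ("ws-1", "ws-2")
    and smaller.get_node(f"user:{uid}") != full.get_node(f"user:{uid}")
)
print(moved)  # 0 - only ws-3's keys get redistributed
```

This stability under membership change is exactly why naive `hash(user_id) % N` routing is avoided: with modulo routing, removing one server remaps nearly every key.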
End-to-End Encryption
For privacy, implement end-to-end encryption (E2EE).
Signal Protocol Overview
graph TB
A[Alice Key Pair] -->|Identity Key| PAB[Public Address Book]
B[Bob Key Pair] -->|Identity Key| PAB
PAB -->|Bob's Identity Key| AS[Alice's Device]
PAB -->|Alice's Identity Key| BS[Bob's Device]
AS -->|Pre-Key Bundle| BS
BS -->|Pre-Key Bundle| AS
AS <-->|X3DH Key Exchange| BS
BS <-->|Double Ratchet| AS
Key Exchange Implementation
class KeyExchange:
    def generate_key_pair(self) -> Tuple[bytes, bytes]:
        private_key = Curve25519.generate_private()
        public_key = Curve25519.public_from_private(private_key)
        return private_key, public_key

    def generate_prekey_bundle(
        self,
        identity_private: bytes,
        identity_public: bytes
    ) -> PrekeyBundle:
        signed_prekey_private, signed_prekey_public = self.generate_key_pair()
        one_time_prekey_private, one_time_prekey_public = self.generate_key_pair()

        # Sign the signed prekey with the long-term identity key
        signature = self.sign(identity_private, signed_prekey_public)

        return PrekeyBundle(
            identity_public=identity_public,
            signed_prekey_public=signed_prekey_public,
            signed_prekey_signature=signature,
            one_time_prekey_public=one_time_prekey_public
        )

    async def x3dh_key_agreement(
        self,
        my_identity_private: bytes,
        their_bundle: PrekeyBundle
    ) -> bytes:
        # X3DH performs four DH computations, mixing a fresh ephemeral key
        # with both parties' long-term and pre-keys
        ephemeral_private, ephemeral_public = self.generate_key_pair()
        dh1 = Curve25519.diffie_hellman(my_identity_private, their_bundle.signed_prekey_public)
        dh2 = Curve25519.diffie_hellman(ephemeral_private, their_bundle.identity_public)
        dh3 = Curve25519.diffie_hellman(ephemeral_private, their_bundle.signed_prekey_public)
        dh4 = Curve25519.diffie_hellman(ephemeral_private, their_bundle.one_time_prekey_public)

        # Combine DH outputs
        shared_secret = dh1 + dh2 + dh3 + dh4
        return HKDF(shared_secret, b"X3DH")
Message Encryption
class MessageEncryptor:
    def __init__(self, root_key: bytes):
        self.root_key = root_key

    def encrypt_message(
        self,
        message: str,
        chain_key: bytes,
        ratchet_key: bytes
    ) -> EncryptedMessage:
        # Derive message key from chain key
        message_key = HKDF(chain_key, b"MessageKey")

        # Encrypt with AES-GCM
        ciphertext = AESGCM.encrypt(message_key, message)

        # Advance chain key for forward secrecy
        next_chain_key = HKDF(chain_key, b"ChainKey")

        # The message key is never transmitted; only the ciphertext leaves the device
        return EncryptedMessage(
            ciphertext=ciphertext,
            next_chain_key=next_chain_key
        )
API Design
REST Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/conversations | List user’s conversations |
| GET | /api/v1/conversations/:id/messages | Get messages |
| POST | /api/v1/conversations | Create DM or group |
| POST | /api/v1/conversations/:id/messages | Send message (offline) |
| GET | /api/v1/presence/:user_id | Get user presence |
| PUT | /api/v1/presence | Update own presence |
| POST | /api/v1/devices | Register push token |
WebSocket Messages
| Type | Direction | Payload |
|---|---|---|
| message | bidirectional | {to, content} or {from, content} |
| ack | server to client | {message_id} |
| typing | bidirectional | {to} or {from} |
| presence | server to client | {user_id, status} |
| heartbeat | bidirectional | {} |
| subscribe | client to server | {room_id} |
Conclusion
Building a scalable chat system requires:
- WebSocket gateways for persistent connections
- Message routers that handle online and offline delivery
- Durable message storage with efficient queries
- Presence tracking with heartbeat mechanism
- Push notifications for offline users
- Consistent hashing for horizontal scaling
When to Use / When Not to Use
| Scenario | Recommendation |
|---|---|
| Real-time messaging apps | Use WebSockets with message queue |
| Team collaboration tools | Add presence and typing indicators |
| Customer support chat | Add conversation persistence and search |
| High-volume broadcast (1:N) | Use pub/sub message broker fanout |
When TO Use WebSocket Scaling with Sticky Sessions
- Session affinity important: When connection state is local to server and expensive to rebuild
- Moderate scale: When you can manage server affinity at load balancer level
- Simpler operations: When you prefer simpler debugging and tracing
When NOT to Use Sticky Sessions
- Very high scale: When you need to scale to thousands of servers
- Geographic distribution: When users connect to nearest region, not a specific server
- Transparent failover: When connection should survive server restarts
Offline Sync and Conflict Resolution
When users come back online after being offline, their device must sync missed messages without creating duplicates or conflicts.
Conflict Resolution Strategies
class ConflictResolver:
    """Resolve message sync conflicts using message IDs and timestamps"""
    def __init__(self):
        self.vector_clocks = {}  # {message_id: vector_clock}

    def resolve_messages(
        self,
        local_messages: List[Message],
        server_messages: List[Message]
    ) -> List[Message]:
        """Merge local and server messages, removing duplicates"""
        all_messages = {}
        for msg in local_messages + server_messages:
            msg_id = msg.id
            if msg_id not in all_messages:
                all_messages[msg_id] = msg
            else:
                # Conflict - same ID, possibly different content
                winner = self._resolve_conflict(all_messages[msg_id], msg)
                all_messages[msg_id] = winner
        return sorted(all_messages.values(), key=lambda m: m.timestamp)

    def _resolve_conflict(self, msg1: Message, msg2: Message) -> Message:
        """Timestamp tiebreak: the later message wins"""
        if msg2.timestamp > msg1.timestamp:
            return msg2
        return msg1
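The merge logic above can be exercised end-to-end with a minimal stand-in message type (the `Msg` dataclass and `merge` helper here are illustrative, compressing the resolver into one function):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Msg:
    id: str
    content: str
    timestamp: float

def merge(local: List[Msg], server: List[Msg]) -> List[Msg]:
    """Dedup by message ID; on conflict the later timestamp wins."""
    merged = {}
    for m in local + server:
        if m.id not in merged or m.timestamp > merged[m.id].timestamp:
            merged[m.id] = m
    return sorted(merged.values(), key=lambda m: m.timestamp)

local = [Msg("m1", "hi", 1.0), Msg("m2", "draft", 2.0)]
server = [Msg("m2", "edited", 3.0), Msg("m3", "reply", 4.0)]
result = merge(local, server)
print([m.content for m in result])  # ['hi', 'edited', 'reply']
```

Note that `m2` appears once in the output: the server's later edit replaced the stale local copy, and no duplicate was created.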
Message Sync Protocol
sequenceDiagram
participant D as Device (Offline)
participant S as Sync Server
D->>D: User comes online
D->>S: GET /sync?last_sync=1709123456
S-->>D: {messages: [...], cursor: "abc123"}
loop Until Caught Up
D->>S: GET /sync?last_sync=cursor
S-->>D: {messages: [...], cursor: "def456"}
Note over D: Insert messages, update cursor
end
D->>D: Display "You're all caught up"
Offline Queue Management
class OfflineMessageQueue:
    """Queue messages sent while offline, sync when online"""
    def __init__(self, db: Database):
        self.db = db

    async def queue_message(self, user_id: int, message: Message):
        """Queue message sent offline"""
        await self.db.execute("""
            INSERT INTO offline_outbox (user_id, message_json, created_at)
            VALUES ($1, $2, NOW())
        """, user_id, json.dumps(message.dict()))

    async def sync_pending(self, user_id: int) -> List[Message]:
        """Get and delete pending messages"""
        rows = await self.db.fetch("""
            SELECT message_json FROM offline_outbox
            WHERE user_id = $1
            ORDER BY created_at ASC
        """, user_id)
        await self.db.execute("""
            DELETE FROM offline_outbox WHERE user_id = $1
        """, user_id)
        return [Message(**json.loads(row['message_json'])) for row in rows]
WebSocket Scaling Patterns
Sticky Sessions (Session Affinity)
graph TB
L[Load Balancer<br/>Sticky Enabled]
W1[WS Server 1]
W2[WS Server 2]
W3[WS Server N]
L -->|JSESSIONID=W1| W1
L -->|JSESSIONID=W2| W2
L -->|JSESSIONID=W3| W3
The same user always routes to the same server.
Pros: Simple to implement, no shared state needed.
Cons: A server restart disconnects its users, and scaling is harder.
Broker Fanout Pattern
graph TB
L[Load Balancer]
W1[WS Server 1]
W2[WS Server 2]
W3[WS Server N]
K[(Redis Pub/Sub)]
L --> W1
L --> W2
L --> W3
W1 --> K
W2 --> K
W3 --> K
K -->|Fanout| W1
K -->|Fanout| W2
K -->|Fanout| W3
A message is published once, and every subscribed server receives a copy.
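Setting Redis specifics aside, the fanout pattern itself can be sketched with an in-memory broker (a hypothetical single-process stand-in; in production the publish crosses the network to every gateway):

```python
from collections import defaultdict
from typing import Callable, Dict, List

class Broker:
    """In-memory stand-in for Redis pub/sub: publish once, every subscriber gets a copy."""
    def __init__(self):
        self.subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]):
        self.subscribers[channel].append(handler)

    def publish(self, channel: str, payload: str) -> int:
        # One publish fans out to every registered handler on the channel
        for handler in self.subscribers[channel]:
            handler(payload)
        return len(self.subscribers[channel])

broker = Broker()
delivered = []
# Two gateway servers subscribe to the same conversation channel
broker.subscribe("conversation:42", lambda msg: delivered.append(("ws-1", msg)))
broker.subscribe("conversation:42", lambda msg: delivered.append(("ws-2", msg)))
count = broker.publish("conversation:42", "hello")
print(count, delivered)  # 2 [('ws-1', 'hello'), ('ws-2', 'hello')]
```

Each gateway then filters for the connections it actually holds, so the sender never needs to know which server the recipient is attached to.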
Hybrid Architecture (Production Recommended)
class HybridWebSocketGateway:
    """
    Combines sticky sessions for connection
    with broker fanout for message delivery
    """
    def __init__(
        self,
        redis: Redis,
        message_router: MessageRouter,
        consistent_hash: ConsistentHashRouter
    ):
        self.redis = redis
        self.router = message_router
        self.hash = consistent_hash

    async def route_message(self, message: Message):
        """
        Route message through pub/sub so all
        server instances receive it
        """
        # Publish to Redis channel for this conversation
        channel = f"conversation:{message.conversation_id}"
        await self.redis.publish(channel, json.dumps(message.dict()))

    async def subscribe(self, connection_id: str, conversation_ids: List[str]):
        """Subscribe to message channels"""
        pubsub = self.redis.pubsub()
        for conv_id in conversation_ids:
            await pubsub.subscribe(f"conversation:{conv_id}")
        # Listen in background, deliver to connection
        asyncio.create_task(self._pubsub_listener(connection_id, pubsub))
Message Ordering Guarantees
Total Order vs Causal Order
| Guarantee | Description | Implementation |
|---|---|---|
| Total Order | All messages in exact global order | Single sequencer; Lamport clocks |
| Causal Order | Causally related messages in order; unrelated can be interleaved | Vector clocks |
| FIFO per sender | Messages from same sender arrive in send order | Sender sequence numbers |
class MessageOrdering:
    """Implement causal ordering with vector clocks"""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.vector_clock = {node_id: 0}

    def increment_clock(self):
        self.vector_clock[self.node_id] += 1
        return self.vector_clock.copy()

    def send_message(self, content: str) -> Message:
        clock = self.increment_clock()
        return Message(
            content=content,
            vector_clock=clock,
            sender=self.node_id
        )

    def receive_message(self, message: Message) -> bool:
        """Return True if message can be delivered now (causal ordering)"""
        # Check BEFORE merging clocks; merging first would hide missing messages
        if not self._check_causal_delivery(message):
            return False  # Buffer and retry once earlier messages arrive

        # Merge the sender's clock into ours
        for node, val in message.vector_clock.items():
            self.vector_clock[node] = max(
                self.vector_clock.get(node, 0),
                val
            )
        # Increment for the receive event
        self.vector_clock[self.node_id] += 1
        return True

    def _check_causal_delivery(self, message: Message) -> bool:
        """Verify we have all causally preceding messages"""
        for node, val in message.vector_clock.items():
            if node == self.node_id:
                continue
            if self.vector_clock.get(node, 0) < val - 1:
                return False  # Missing a preceding message
        return True
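The hold-until-ready behavior can be demonstrated with a self-contained simplification (field names like `VMsg` and `CausalReceiver` are illustrative): a message is held back until its causal predecessor has arrived.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class VMsg:
    sender: str
    content: str
    clock: Dict[str, int]

class CausalReceiver:
    """Deliver messages only after all causally preceding ones (simplified)."""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.clock: Dict[str, int] = {}

    def deliverable(self, msg: VMsg) -> bool:
        # Sender's counter must be exactly the next one we expect;
        # anything else the sender had seen, we must have seen too.
        for node, val in msg.clock.items():
            if node == msg.sender:
                if val != self.clock.get(node, 0) + 1:
                    return False
            elif self.clock.get(node, 0) < val:
                return False
        return True

    def deliver(self, msg: VMsg):
        # Merge the sender's clock only on actual delivery
        for node, val in msg.clock.items():
            self.clock[node] = max(self.clock.get(node, 0), val)

bob = CausalReceiver("B")
m1 = VMsg("A", "hello", {"A": 1})
m2 = VMsg("A", "follow-up", {"A": 2})
print(bob.deliverable(m2))  # False - m1 hasn't arrived yet, buffer m2
bob.deliver(m1)
print(bob.deliverable(m2))  # True - predecessor seen, safe to deliver
```

This is FIFO-per-sender plus cross-sender causality, which is what chat users actually perceive as "in order".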
Production Failure Scenarios
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| WebSocket gateway crash | Users disconnect, must reconnect | Run N+1 instances; health checks; client auto-reconnect |
| Message queue backlog | Messages delayed, offline users fall further behind | Priority queues; max backlog limits; batch catchup |
| Presence cache failure | Users appear offline incorrectly | Redis cluster with replication; fallback to DB query |
| Push notification service down | Offline users don’t get notified | Queue notifications; retry with backoff; SMS fallback |
| Database replica lag | Message history incomplete | Read from primary for recent messages; accept lag for history |
| Device conflict on sync | Duplicate messages or lost messages | Vector clocks; idempotent message IDs; conflict resolution |
Capacity Estimation
Connection Capacity
Per WebSocket server (4 vCPU, 8GB RAM):
- Max connections: 10,000-50,000
- Messages/second: 1,000-5,000
For 100,000 concurrent users:
- Minimum servers: 100,000 / 10,000 = 10 servers
- With redundancy (2x): 20 servers
- Peak message rate: 100,000 users × 1 msg/user/10sec = 10,000 msg/s
Storage Estimation
Messages per day: 10M DAU × 10 messages/day = 100M messages
Message size (avg): 500 bytes + 200 bytes metadata = 700 bytes
Daily storage: 100M × 700 = 70 GB
5-year storage: 70GB × 365 × 5 = 127 TB
With 3x replication: ~380 TB
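The storage estimate above is easy to sanity-check in a few lines (five years of 70 GB/day comes to about 128 TB, matching the figure above within rounding):

```python
# Back-of-envelope numbers from the estimates above
dau = 10_000_000
messages_per_user_per_day = 10
bytes_per_message = 700  # 500 content + 200 metadata

daily_messages = dau * messages_per_user_per_day
daily_storage_gb = daily_messages * bytes_per_message / 1e9
five_year_tb = daily_storage_gb * 365 * 5 / 1000
replicated_tb = five_year_tb * 3

print(f"{daily_messages:,} messages/day")              # 100,000,000 messages/day
print(f"{daily_storage_gb:.0f} GB/day")                # 70 GB/day
print(f"{five_year_tb:.0f} TB over 5 years")           # 128 TB over 5 years
print(f"~{replicated_tb:.0f} TB with 3x replication")  # ~383 TB
```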
Observability Checklist
Metrics to Capture
- websocket_connections_active (gauge) - per server and total
- messages_delivered_total (counter) - by type (direct, group, system)
- message_delivery_latency_seconds (histogram) - end-to-end, P50/P95/P99
- presence_updates_total (counter) - online/offline transitions
- offline_sync_duration_seconds (histogram) - time to sync missed messages
- push_notification_sent_total (counter) - by channel (APNS, FCM, SMS)
Logs to Emit
{
"timestamp": "2026-03-22T10:30:00Z",
"event": "message_delivered",
"message_id": "msg_abc123",
"conversation_id": "conv_xyz",
"recipient_id": "user_456",
"delivery_latency_ms": 45,
"channel": "websocket"
}
Alerts to Configure
| Alert | Threshold | Severity |
|---|---|---|
| Connection count > 80% capacity | 80% max | Warning |
| Message latency P99 > 200ms | 200ms | Warning |
| Presence update lag > 10s | 10s | Critical |
| Offline sync duration > 30s | 30s | Warning |
| Push notification queue > 10K | 10000 | Warning |
Security Checklist
- WebSocket connections over WSS (TLS)
- JWT token validation on connection
- Message content encryption (E2EE for sensitive chats)
- Rate limiting per user (prevent spam)
- Input sanitization (XSS prevention)
- Push notification content limited (no PII in notifications)
- Audit logging for moderation actions
- Device management (revoke stolen device sessions)
Common Pitfalls / Anti-Patterns
Pitfall 1: Storing Messages in WebSocket Server Memory
Problem: Messages sent while user is temporarily disconnected are lost if stored only in server memory.
Solution: Store all messages in persistent storage (DB) immediately. Deliver from storage when user reconnects. Use message queue as buffer, not source of truth.
Pitfall 2: Not Handling Reconnection Gracefully
Problem: When client reconnects, duplicate messages appear or gaps in conversation.
Solution: Implement idempotent message delivery with client-side deduplication. Use cursor-based pagination for loading message history.
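Client-side deduplication can be as simple as tracking seen message IDs (a minimal sketch; a production client would also bound the seen-set and persist it across restarts):

```python
class DedupingInbox:
    """Drop redelivered messages by client-generated message ID."""
    def __init__(self):
        self.seen = set()
        self.messages = []

    def receive(self, message_id: str, content: str) -> bool:
        """Return True if the message was new and accepted."""
        if message_id in self.seen:
            return False  # Duplicate from a reconnect replay - ignore it
        self.seen.add(message_id)
        self.messages.append(content)
        return True

inbox = DedupingInbox()
print(inbox.receive("m1", "hello"))  # True - first delivery
print(inbox.receive("m1", "hello"))  # False - replayed after reconnect
print(inbox.messages)                # ['hello'] - no duplicate shown
```

Because the ID is generated on the client before the first send, the same ID arrives with every retry, and dedup works even when the server delivers at-least-once.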
Pitfall 3: Presence Flooding
Problem: Every presence change triggers notifications to all connected clients, creating broadcast storms in large channels.
Solution: Batch presence updates (send every 5 seconds, not every change). Use bloom filters for “who is online” queries instead of broadcasting.
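A batched presence publisher can be sketched as follows (hypothetical and synchronous for clarity; a real implementation would call flush from a 5-10 second timer):

```python
from collections import OrderedDict

class PresenceBatcher:
    """Coalesce presence changes; flush publishes only the latest state per user."""
    def __init__(self):
        self.pending = OrderedDict()  # user_id -> latest status

    def record(self, user_id: str, status: str):
        # Later updates overwrite earlier ones within the window
        self.pending[user_id] = status

    def flush(self) -> dict:
        """Called on a timer: one aggregated update instead of one per change."""
        batch, self.pending = dict(self.pending), OrderedDict()
        return batch

batcher = PresenceBatcher()
batcher.record("alice", "online")
batcher.record("bob", "online")
batcher.record("alice", "offline")  # flaps within the window collapse
print(batcher.flush())  # {'alice': 'offline', 'bob': 'online'}
print(batcher.flush())  # {} - nothing pending after flush
```

Note how alice's online/offline flap collapsed into a single final state, so subscribers never see the intermediate churn.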
Pitfall 4: Single Point of Failure for Message Ordering
Problem: Using a single sequencer node for total ordering creates a bottleneck and failure point.
Solution: Use causal ordering (vector clocks) instead of total ordering. Most applications only need causal order anyway.
Interview Q&A
Q: How do you handle message delivery when the recipient is offline?
A: When the recipient is offline, the message router stores the message in durable storage first. Then it triggers a push notification to the user’s device. When the user comes back online, the client syncs missed messages using a cursor-based pagination protocol, fetching messages since the last known sync timestamp.
Q: Why is vector clock ordering better than total ordering for chat?
A: Total ordering requires a single sequencer node that sees all messages. This creates a bottleneck and a single point of failure. Causal ordering only guarantees that causally related messages arrive in order. Unrelated messages can be interleaved freely. Most chat applications only need causal ordering, which is cheaper to implement and more fault-tolerant.
Q: How does presence tracking scale in large channels?
A: Presence changes are batched rather than broadcast immediately. The system sends presence updates every 5-10 seconds, not on every change. For “who is online” queries, bloom filters can approximate membership without broadcasting to all members. For large channels, presence is often not shown at all or shown as “many people online.”
Q: What is the offline sync conflict resolution strategy?
A: Messages are identified by client-generated idempotent message IDs. When syncing, the server returns all messages since the last sync cursor. The client merges local and server messages, deduplicating by message ID. For conflicts (same ID, different content), vector clocks determine which is causally newer, or timestamps break ties.
Scenario Drills
Scenario 1: WebSocket Gateway Server Crash
Situation: The WebSocket server handling 10,000 connections crashes unexpectedly.
Analysis:
- All 10,000 users disconnect simultaneously
- Client auto-reconnect kicks in but overwhelms surviving servers
- Messages sent during disconnect are in message store, not lost
Solution: Run N+1 instances with health checks. Clients implement exponential backoff on reconnect. Sticky sessions at load balancer ensure reconnection routes to same server. Messages are stored in durable DB before delivery, so offline messages survive server death.
Scenario 2: Presence Flood in Large Channel
Situation: 1,000 users join a popular channel simultaneously. Each presence change is broadcast to all members.
Analysis:
- Each join generates 1 presence update broadcast to 999 users
- 1,000 joins = 999,000 presence messages
- Network flooding and server CPU spike
Solution: Batch presence updates into 5-10 second windows. Instead of broadcasting each change, publish aggregated state. Use bloom filters for approximate “is user online” queries. For very large channels, hide individual presence.
Scenario 3: Offline User Comes Back Online After Days
Situation: A user returns after being offline for a week with thousands of missed messages.
Analysis:
- Message history is large; fetching all at once causes timeout
- Push notification storm if all messages trigger notifications
- User sees huge badge count
Solution: Cursor-based pagination syncs in batches. Messages are fetched incrementally with cursors. Push notifications are coalesced into a single “you have messages” notification rather than one per message. The app shows “loading” state while catching up.
Failure Flow Diagrams
Message Delivery Flow
graph TD
A[Sender Sends Message] --> B[WebSocket Gateway]
B --> C[Message Router]
C --> D[Store in Database]
D --> E{Recipient Online?}
E -->|Yes| F[Find Connection]
F --> G[Deliver via WS Gateway]
E -->|No| H[Queue Push Notification]
G --> I[Send ACK to Sender]
H --> J[APNS/FCM]
J --> K[Device Receives]
K --> L[User Opens App]
L --> M[Sync Missed Messages]
WebSocket Connection Lifecycle
graph TD
A[Client Connects] --> B[Authenticate Token]
B --> C{Valid?}
C -->|No| D[Close Connection 4001]
C -->|Yes| E[Register Connection]
E --> F[Subscribe to Conversations]
F --> G[Connection Established]
G --> H[Heartbeat Loop]
H --> I{Heartbeat OK?}
I -->|No| J[Mark Offline]
I -->|Yes| K{Message Received?}
K -->|Yes| L[Process Message]
K -->|No| H
L --> H
J --> M[Cleanup Connection]
Presence Update Batching
graph LR
A[User Online Event] --> B[Presence Service]
B --> C{Batch Window Open?}
C -->|No| D[Start 5s Timer]
C -->|Yes| E[Add to Batch Buffer]
D --> E
E --> F{Window Elapsed?}
F -->|No| E
F -->|Yes| G[Publish Aggregated Update]
G --> H[Subscribers Receive]
Quick Recap
- WebSocket gateways scale horizontally using broker fanout or sticky sessions.
- Store messages persistently first, then deliver; don’t use memory as source of truth.
- Use vector clocks for causal ordering; only require total order when truly necessary.
- Conflict resolution on offline sync uses timestamps, vector clocks, or application logic.
- Presence should be batched and cached, not broadcast on every change.
Copy/Paste Checklist
- [ ] WebSocket gateway with Redis pub/sub fanout
- [ ] Messages stored in DB before delivery
- [ ] Client auto-reconnect with idempotent delivery
- [ ] Cursor-based pagination for message history
- [ ] Vector clocks for causal ordering
- [ ] Conflict resolution on offline sync
- [ ] Presence batched updates (5-10 second windows)
- [ ] Push notification fallback for offline users
- [ ] Rate limiting per connection
- [ ] E2EE encryption for sensitive conversations
For more on real-time systems, see our Message Queue Types guide. For message queues that power chat delivery, see the RabbitMQ and Apache Kafka guides.