System Design: Real-Time Chat System Architecture

Deep dive into chat system design. Learn about WebSocket management, message delivery, presence systems, scalability patterns, and E2E encryption.



Chat systems like WhatsApp, Slack, and Discord handle millions of concurrent users sending messages in real-time. The challenges are fundamentally different from request-response web applications: persistent connections, low-latency delivery, and handling offline users.

This case study walks through designing a scalable chat system.

Requirements Analysis

Functional Requirements

Users should be able to:

  • Send and receive messages in real-time
  • Create and join chat rooms or direct messages
  • See who is online (presence)
  • Receive push notifications when offline
  • Send text, images, and files
  • Search through message history

Non-Functional Requirements

The system needs:

  • Message delivery under 100ms for online users
  • Support 100,000+ concurrent connections
  • Handle 10 million daily active users
  • Store message history indefinitely
  • Maintain 99.99% availability

Architecture Overview

graph TB
    A[Mobile App] -->|WSS| B[WebSocket Gateway]
    A -->|HTTPS| C[REST API Gateway]
    B --> D[Message Router]
    D --> E[Message Store]
    D --> F[Presence Service]
    D --> G[Notification Service]
    C --> E
    C --> H[User Service]
    E --> I[(Database)]
    G --> J[(Push Service)]
    G --> K[(Queue)]

WebSocket Connections

WebSockets provide persistent bidirectional connections, essential for real-time chat.

Connection Flow

sequenceDiagram
    participant C as Client
    participant W as WebSocket Gateway
    participant R as Message Router
    participant M as Message Store

    C->>W: CONNECT {auth_token}
    W->>W: Validate token
    W->>W: Extract user_id
    W->>R: Register connection {user_id, connection_id}
    R-->>W: Connection registered
    W-->>C: Connection established

    C->>W: SEND {recipient_id, content}
    W->>R: Route message {from, to, content}
    R->>M: Store message
    M-->>R: Message stored
    R->>R: Find recipient connection
    R->>W: Deliver to recipient
    W-->>C: MESSAGE {from, content}

    Note over C,W: Bidirectional messages flow through router

WebSocket Server Implementation

import asyncio
import json
import websockets
from dataclasses import dataclass
from typing import Dict, Set
from uuid import uuid4

@dataclass
class Connection:
    user_id: str
    connection_id: str
    websocket: websockets.WebSocketServerProtocol
    subscribed_rooms: Set[str]

class WebSocketGateway:
    def __init__(self, message_router: MessageRouter):
        self.router = message_router
        self.connections: Dict[str, Connection] = {}
        self.user_connections: Dict[str, Set[str]] = {}

    async def handle_connection(self, websocket: websockets.WebSocketServerProtocol, path: str):
        # Authenticate
        token = websocket.request_headers.get("Authorization", "").replace("Bearer ", "")
        user_id = await self.authenticate(token)

        if not user_id:
            await websocket.close(4001, "Authentication failed")
            return

        connection_id = str(uuid4())
        connection = Connection(
            user_id=user_id,
            connection_id=connection_id,
            websocket=websocket,
            subscribed_rooms=set()
        )

        # Register connection
        self.connections[connection_id] = connection
        if user_id not in self.user_connections:
            self.user_connections[user_id] = set()
        self.user_connections[user_id].add(connection_id)

        # Register with router
        await self.router.register_connection(user_id, connection_id)

        try:
            # Handle incoming messages
            async for message in websocket:
                await self.handle_message(connection, message)
        except websockets.ConnectionClosed:
            pass
        finally:
            await self.cleanup(connection)

    async def handle_message(self, connection: Connection, raw_message: str):
        message = json.loads(raw_message)

        msg_type = message.get("type")

        if msg_type == "message":
            await self.router.route_message(
                from_user=connection.user_id,
                to_user=message["recipient_id"],
                content=message["content"],
                connection_id=connection.connection_id
            )
        elif msg_type == "subscribe":
            room_id = message["room_id"]
            connection.subscribed_rooms.add(room_id)
            await self.router.subscribe_to_room(connection.user_id, room_id)
        elif msg_type == "typing":
            await self.router.send_typing_indicator(
                from_user=connection.user_id,
                to_user=message["recipient_id"]
            )

Message Routing

The message router is the heart of the chat system, handling delivery to online and offline users.

Message Router Service

class MessageRouter:
    def __init__(
        self,
        message_store: MessageStore,
        presence_service: PresenceService,
        notification_service: NotificationService
    ):
        self.message_store = message_store
        self.presence_service = presence_service
        self.notification_service = notification_service
        self.online_connections: Dict[str, Set[str]] = {}

    async def register_connection(self, user_id: str, connection_id: str):
        if user_id not in self.online_connections:
            self.online_connections[user_id] = set()
        self.online_connections[user_id].add(connection_id)
        await self.presence_service.set_online(user_id)

    async def route_message(
        self,
        from_user: str,
        to_user: str,
        content: str,
        connection_id: str = None
    ):
        # Store message first (durability)
        message = await self.message_store.store(
            from_user=from_user,
            to_user=to_user,
            content=content
        )

        # Deliver to online recipients
        if to_user in self.online_connections:
            for conn_id in self.online_connections[to_user]:
                await self.deliver_message(conn_id, message)
            # No push notification needed
        else:
            # User offline - send push notification
            await self.notification_service.send_push_notification(
                user_id=to_user,
                message=message
            )

        # Acknowledge to sender if online
        if connection_id:
            await self.send_ack(connection_id, message)

        return message

    async def deliver_message(self, connection_id: str, message: Message):
        gateway = self.get_gateway_for_connection(connection_id)
        await gateway.send_to_connection(connection_id, {
            "type": "message",
            "message_id": message.id,
            "from": message.from_user,
            "content": message.content,
            "timestamp": message.created_at.isoformat()
        })

Message Storage

Messages must be stored durably and remain efficiently queryable.

Database Schema

-- Messages table
CREATE TABLE messages (
    id BIGSERIAL PRIMARY KEY,
    conversation_id TEXT NOT NULL,
    sender_id BIGINT NOT NULL,
    content TEXT NOT NULL,
    content_type TEXT DEFAULT 'text',  -- text, image, file
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    delivered_at TIMESTAMP WITH TIME ZONE,
    read_at TIMESTAMP WITH TIME ZONE
);

-- PostgreSQL has no inline INDEX clause; create indexes separately
CREATE INDEX idx_messages_conversation ON messages (conversation_id, created_at DESC);
CREATE INDEX idx_messages_sender ON messages (sender_id, created_at DESC);

-- Conversations table
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    type TEXT NOT NULL,  -- direct, group
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

CREATE INDEX idx_conversations_updated ON conversations (updated_at DESC);

-- Conversation members
CREATE TABLE conversation_members (
    conversation_id TEXT NOT NULL REFERENCES conversations(id),
    user_id BIGINT NOT NULL,
    joined_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    last_read_at TIMESTAMP WITH TIME ZONE,
    unread_count BIGINT DEFAULT 0,

    PRIMARY KEY (conversation_id, user_id)
);

-- Index for finding user's conversations
CREATE INDEX idx_members_user ON conversation_members(user_id);
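The unread_count column above must be maintained on every send and read. A minimal sketch of that bookkeeping, using in-memory SQLite in place of Postgres (the two helper functions are illustrative, not part of the schema):

```python
import sqlite3

# In-memory stand-in for the conversation_members table above
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE conversation_members (
        conversation_id TEXT NOT NULL,
        user_id INTEGER NOT NULL,
        unread_count INTEGER DEFAULT 0,
        PRIMARY KEY (conversation_id, user_id)
    )
""")

def on_message_stored(conversation_id: str, sender_id: int) -> None:
    # Every member except the sender gains one unread message
    conn.execute("""
        UPDATE conversation_members
        SET unread_count = unread_count + 1
        WHERE conversation_id = ? AND user_id != ?
    """, (conversation_id, sender_id))

def on_conversation_read(conversation_id: str, user_id: int) -> None:
    # Opening the conversation resets the reader's counter
    conn.execute("""
        UPDATE conversation_members
        SET unread_count = 0
        WHERE conversation_id = ? AND user_id = ?
    """, (conversation_id, user_id))

conn.executemany(
    "INSERT INTO conversation_members (conversation_id, user_id) VALUES (?, ?)",
    [("dm:1:2", 1), ("dm:1:2", 2)],
)
on_message_stored("dm:1:2", sender_id=1)
counts = dict(conn.execute(
    "SELECT user_id, unread_count FROM conversation_members WHERE conversation_id = 'dm:1:2'"
))
print(counts)  # {1: 0, 2: 1}
```

Doing the increment in the same transaction as the message insert keeps the counter from drifting under concurrent sends.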

Message Store Service

class MessageStore:
    def __init__(self, db: Database, cache: RedisCache):
        self.db = db
        self.cache = cache

    def generate_conversation_id(self, user1: str, user2: str) -> str:
        # Consistent ordering for DMs
        sorted_users = sorted([user1, user2])
        return f"dm:{sorted_users[0]}:{sorted_users[1]}"

    async def store(
        self,
        from_user: str,
        to_user: str,
        content: str,
        conversation_id: str = None
    ) -> Message:
        if not conversation_id:
            conversation_id = self.generate_conversation_id(from_user, to_user)

        # Ensure conversation exists
        await self._ensure_conversation_exists(conversation_id)

        # Insert message
        row = await self.db.fetch_one("""
            INSERT INTO messages (conversation_id, sender_id, content)
            VALUES ($1, $2, $3)
            RETURNING *
        """, conversation_id, from_user, content)

        message = Message(**row)

        # Update conversation timestamp
        await self.db.execute("""
            UPDATE conversations
            SET updated_at = NOW()
            WHERE id = $1
        """, conversation_id)

        # Invalidate cache
        await self.cache.delete(f"conversation:{conversation_id}:messages")

        return message

    async def get_messages(
        self,
        conversation_id: str,
        limit: int = 50,
        before: datetime = None
    ) -> List[Message]:
        # Check cache first
        cache_key = f"conversation:{conversation_id}:messages"
        if not before:
            cached = await self.cache.get(cache_key)
            if cached:
                return [Message(**m) for m in json.loads(cached)]

        # Query database
        query = """
            SELECT * FROM messages
            WHERE conversation_id = $1
        """
        params = [conversation_id]

        if before:
            query += " AND created_at < $2"
            params.append(before)

        params.append(limit)
        query += f" ORDER BY created_at DESC LIMIT ${len(params)}"

        rows = await self.db.fetch(query, *params)
        messages = [Message(**row) for row in rows]

        # Cache if first page
        if not before:
            await self.cache.setex(
                cache_key,
                300,
                json.dumps([m.dict() for m in messages])
            )

        return messages

Presence System

The presence system shows which users are online and their activity status.

Presence Service

class PresenceService:
    def __init__(self, redis: Redis):
        self.redis = redis
        self.PRESENCE_TTL = 60  # Seconds before a missed heartbeat marks the user offline

    async def set_online(self, user_id: str):
        key = f"presence:{user_id}"
        await self.redis.setex(key, self.PRESENCE_TTL, json.dumps({
            "status": "online",
            "last_seen": datetime.utcnow().isoformat()
        }))

        # Publish presence change
        await self.redis.publish("presence_updates", json.dumps({
            "user_id": user_id,
            "status": "online"
        }))

    async def set_offline(self, user_id: str):
        key = f"presence:{user_id}"
        await self.redis.setex(key, self.PRESENCE_TTL, json.dumps({
            "status": "offline",
            "last_seen": datetime.utcnow().isoformat()
        }))

        await self.redis.publish("presence_updates", json.dumps({
            "user_id": user_id,
            "status": "offline"
        }))

    async def get_presence(self, user_id: str) -> Dict:
        key = f"presence:{user_id}"
        data = await self.redis.get(key)
        if data:
            return json.loads(data)
        return {"status": "offline"}

    async def get_bulk_presence(self, user_ids: List[str]) -> Dict[str, Dict]:
        keys = [f"presence:{uid}" for uid in user_ids]
        values = await self.redis.mget(keys)

        result = {}
        for uid, value in zip(user_ids, values):
            if value:
                result[uid] = json.loads(value)
            else:
                result[uid] = {"status": "offline"}

        return result

Heartbeat Mechanism

class PresenceHeartbeat:
    def __init__(self, presence_service: PresenceService):
        self.presence_service = presence_service
        self.heartbeat_interval = 30  # Half the presence TTL

    async def start_heartbeat(self, websocket, user_id: str):
        while True:
            try:
                await asyncio.sleep(self.heartbeat_interval)
                await websocket.send(json.dumps({"type": "heartbeat"}))
                # set_online rewrites the presence key, refreshing its TTL
                # before it expires
                await self.presence_service.set_online(user_id)
            except websockets.ConnectionClosed:
                await self.presence_service.set_offline(user_id)
                break

Push Notifications

When users are offline, push notifications deliver new messages.

Notification Service

class NotificationService:
    def __init__(
        self,
        push_service: PushService,
        user_service: UserService
    ):
        self.push_service = push_service
        self.user_service = user_service

    async def send_push_notification(
        self,
        user_id: str,
        message: Message
    ):
        # Get user's push tokens
        tokens = await self.user_service.get_push_tokens(user_id)

        if not tokens:
            return

        # Get sender info
        sender = await self.user_service.get_user(message.from_user)

        # Send to all user's devices
        await asyncio.gather(*[
            self.push_service.send(
                token=token,
                notification={
                    "title": f"Message from {sender.display_name}",
                    "body": message.content[:100],
                    "data": {
                        "type": "message",
                        "conversation_id": message.conversation_id,
                        "message_id": message.id
                    }
                }
            )
            for token in tokens
        ])

Scalability Patterns

Connection Distribution

graph TB
    L[Load Balancer]
    W1[WS Server 1<br/>10K connections]
    W2[WS Server 2<br/>10K connections]
    W3[WS Server 3<br/>10K connections]
    W4[WS Server N<br/>10K connections]

    L -->|Route| W1
    L -->|Route| W2
    L -->|Route| W3
    L -->|Route| W4

    subgraph "Message Bus"
        K[Kafka/RabbitMQ]
    end

    W1 --> K
    W2 --> K
    W3 --> K
    W4 --> K

Horizontal Scaling

import hashlib
from bisect import bisect_left
from typing import List

class ConsistentHashRouter:
    def __init__(self, nodes: List[str]):
        self.ring = {}
        self.sorted_keys = []
        self._build_ring(nodes)

    def _build_ring(self, nodes: List[str]):
        for node in nodes:
            for i in range(100):  # Virtual nodes
                key = self._hash(f"{node}:{i}")
                self.ring[key] = node
        self.sorted_keys = sorted(self.ring.keys())

    def get_node(self, key: str) -> str:
        hash_key = self._hash(key)
        idx = bisect_left(self.sorted_keys, hash_key) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Usage: Route message to the server holding recipient's connection
async def route_to_user(user_id: str, message: Message):
    server = consistent_hash.get_node(user_id)
    await kafka.send("messageDelivery", {
        "server": server,
        "user_id": user_id,
        "message": message.dict()
    })

End-to-End Encryption

For privacy, implement end-to-end encryption (E2EE).

Signal Protocol Overview

graph TB
    A[Alice Key Pair] -->|Identity Key| PAB[Public Address Book]
    B[Bob Key Pair] -->|Identity Key| PAB

    PAB -->|Bob's Identity Key| AS[Alice's Device]
    PAB -->|Alice's Identity Key| BS[Bob's Device]

    AS -->|Pre-Key Bundle| BS
    BS -->|Pre-Key Bundle| AS

    AS <-->|X3DH Key Exchange| BS
    BS <-->|Double Ratchet| AS

Key Exchange Implementation

class KeyExchange:
    def generate_identity_key_pair(self) -> Tuple[bytes, bytes]:
        private_key = Curve25519.generate_private()
        public_key = Curve25519.public_from_private(private_key)
        return private_key, public_key

    def generate_prekey_bundle(self, identity_private: bytes) -> PrekeyBundle:
        identity_public = Curve25519.public_from_private(identity_private)
        signed_prekey_private, signed_prekey_public = self.generate_identity_key_pair()
        one_time_prekey_private, one_time_prekey_public = self.generate_identity_key_pair()

        # Sign the signed prekey with the long-term identity key
        signature = self.sign(identity_private, signed_prekey_public)

        return PrekeyBundle(
            identity_public=identity_public,
            signed_prekey_public=signed_prekey_public,
            signed_prekey_signature=signature,
            one_time_prekey_public=one_time_prekey_public
        )

    async def x3dh_key_agreement(
        self,
        my_identity_private: bytes,
        their_bundle: PrekeyBundle
    ) -> bytes:
        # X3DH mixes the long-term identity key with a fresh ephemeral key
        # so each session derives a unique shared secret
        ephemeral_private, ephemeral_public = self.generate_identity_key_pair()

        dh1 = Curve25519.diffie_hellman(my_identity_private, their_bundle.signed_prekey_public)
        dh2 = Curve25519.diffie_hellman(ephemeral_private, their_bundle.identity_public)
        dh3 = Curve25519.diffie_hellman(ephemeral_private, their_bundle.signed_prekey_public)
        dh4 = Curve25519.diffie_hellman(ephemeral_private, their_bundle.one_time_prekey_public)

        # Combine DH outputs
        shared_secret = dh1 + dh2 + dh3 + dh4
        return HKDF(shared_secret, b"X3DH")

Message Encryption

class MessageEncryptor:
    def __init__(self, root_key: bytes):
        self.root_key = root_key

    def encrypt_message(
        self,
        message: str,
        chain_key: bytes,
        ratchet_key: bytes
    ) -> EncryptedMessage:
        # Generate message key from chain key
        message_key = HKDF(chain_key, b"MessageKey")

        # Encrypt with AES-GCM
        ciphertext = AESGCM.encrypt(message_key, message)

        # Advance chain key
        next_chain_key = HKDF(chain_key, b"ChainKey")

        # Never transmit the message key; only ciphertext leaves the device
        return EncryptedMessage(
            ciphertext=ciphertext,
            next_chain_key=next_chain_key
        )

API Design

REST Endpoints

| Method | Endpoint                           | Description               |
|--------|------------------------------------|---------------------------|
| GET    | /api/v1/conversations              | List user's conversations |
| GET    | /api/v1/conversations/:id/messages | Get messages              |
| POST   | /api/v1/conversations              | Create DM or group        |
| POST   | /api/v1/conversations/:id/messages | Send message (offline)    |
| GET    | /api/v1/presence/:user_id          | Get user presence         |
| PUT    | /api/v1/presence                   | Update own presence       |
| POST   | /api/v1/devices                    | Register push token       |

WebSocket Messages

| Type      | Direction        | Payload                                     |
|-----------|------------------|---------------------------------------------|
| message   | bidirectional    | {recipient_id, content} or {from, content}  |
| ack       | server to client | {message_id}                                |
| typing    | bidirectional    | {recipient_id} or {from}                    |
| presence  | server to client | {user_id, status}                           |
| heartbeat | bidirectional    | {}                                          |
| subscribe | client to server | {room_id}                                   |
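As a concrete illustration, these frames are plain JSON; the field names follow the gateway handler earlier in this article and are an assumption about the exact wire format:

```python
import json

# Example client-to-server frames (field names match the handle_message
# dispatcher; the precise schema is an implementation choice)
send_frame = json.dumps({"type": "message", "recipient_id": "user_456", "content": "hi"})
typing_frame = json.dumps({"type": "typing", "recipient_id": "user_456"})
subscribe_frame = json.dumps({"type": "subscribe", "room_id": "room_42"})

def parse_frame(raw: str) -> dict:
    """Decode a frame and reject anything without a type tag."""
    frame = json.loads(raw)
    if "type" not in frame:
        raise ValueError("frame missing 'type'")
    return frame

print(parse_frame(send_frame)["type"])  # message
```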

Conclusion

Building a scalable chat system requires:

  • WebSocket gateways for persistent connections
  • Message routers that handle online and offline delivery
  • Durable message storage with efficient queries
  • Presence tracking with heartbeat mechanism
  • Push notifications for offline users
  • Consistent hashing for horizontal scaling

When to Use / When Not to Use

| Scenario                    | Recommendation                          |
|-----------------------------|-----------------------------------------|
| Real-time messaging apps    | Use WebSockets with message queue       |
| Team collaboration tools    | Add presence and typing indicators      |
| Customer support chat       | Add conversation persistence and search |
| High-volume broadcast (1:N) | Use pub/sub message broker fanout       |

When TO Use WebSocket Scaling with Sticky Sessions

  • Session affinity important: When connection state is local to server and expensive to rebuild
  • Moderate scale: When you can manage server affinity at load balancer level
  • Simpler operations: When you prefer simpler debugging and tracing

When NOT to Use Sticky Sessions

  • Very high scale: When you need to scale to thousands of servers
  • Geographic distribution: When users connect to nearest region, not a specific server
  • Transparent failover: When connection should survive server restarts

Offline Sync and Conflict Resolution

When users come back online after being offline, their device must sync missed messages without creating duplicates or conflicts.

Conflict Resolution Strategies

class ConflictResolver:
    """Merge local and server message sets after an offline period"""

    def resolve_messages(
        self,
        local_messages: List[Message],
        server_messages: List[Message]
    ) -> List[Message]:
        """Merge local and server messages, removing duplicates"""

        all_messages = {}
        for msg in local_messages + server_messages:
            msg_id = msg.id

            if msg_id not in all_messages:
                all_messages[msg_id] = msg
            else:
                # Conflict - use vector clock to resolve
                winner = self._resolve_conflict(all_messages[msg_id], msg)
                all_messages[msg_id] = winner

        return sorted(all_messages.values(), key=lambda m: m.timestamp)

    def _resolve_conflict(self, msg1: Message, msg2: Message) -> Message:
        """Last-write-wins: the later timestamp survives"""
        if msg2.timestamp > msg1.timestamp:
            return msg2
        return msg1

Message Sync Protocol

sequenceDiagram
    participant D as Device (Offline)
    participant S as Sync Server

    D->>D: User comes online
    D->>S: GET /sync?last_sync=1709123456
    S-->>D: {messages: [...], cursor: "abc123"}

    loop Until Caught Up
        D->>S: GET /sync?last_sync=cursor
        S-->>D: {messages: [...], cursor: "def456"}
        Note over D: Insert messages, update cursor
    end

    D->>D: Display "You're all caught up"
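The loop in the diagram can be sketched as follows. `fetch_page` stands in for the GET /sync call and is faked here with an in-memory page map so the control flow is runnable; the cursor values are illustrative:

```python
from typing import Optional, Tuple, List

# Fake server responses keyed by cursor: (messages, next_cursor)
PAGES = {
    None: (["msg1", "msg2"], "abc123"),
    "abc123": (["msg3"], "def456"),
    "def456": ([], None),  # empty page: caught up
}

def fetch_page(cursor: Optional[str]) -> Tuple[List[str], Optional[str]]:
    # Stand-in for: GET /sync?last_sync=<cursor>
    return PAGES[cursor]

def sync_missed_messages() -> List[str]:
    messages: List[str] = []
    cursor: Optional[str] = None
    while True:
        page, next_cursor = fetch_page(cursor)
        messages.extend(page)
        if not page or next_cursor is None:
            break  # "You're all caught up"
        cursor = next_cursor  # persist cursor so a crash resumes here
    return messages

print(sync_missed_messages())  # ['msg1', 'msg2', 'msg3']
```

Persisting the cursor after each page makes the sync resumable: a client that crashes mid-sync re-requests only the unseen pages.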

Offline Queue Management

class OfflineMessageQueue:
    """Queue messages sent while offline, sync when online"""

    def __init__(self, db: Database):
        self.db = db

    async def queue_message(self, user_id: int, message: Message):
        """Queue message sent offline"""
        await self.db.execute("""
            INSERT INTO offline_outbox (user_id, message_json, created_at)
            VALUES ($1, $2, NOW())
        """, user_id, json.dumps(message.dict()))

    async def sync_pending(self, user_id: int) -> List[Message]:
        """Get and delete pending messages"""
        rows = await self.db.fetch("""
            SELECT message_json FROM offline_outbox
            WHERE user_id = $1
            ORDER BY created_at ASC
        """, user_id)

        await self.db.execute("""
            DELETE FROM offline_outbox WHERE user_id = $1
        """, user_id)

        return [Message(**json.loads(row['message_json'])) for row in rows]

WebSocket Scaling Patterns

Sticky Sessions (Session Affinity)

graph TB
    L[Load Balancer<br/>Sticky Enabled]
    W1[WS Server 1]
    W2[WS Server 2]
    W3[WS Server N]

    L -->|JSESSIONID=W1| W1
    L -->|JSESSIONID=W2| W2
    L -->|JSESSIONID=W3| W3

Same user always routes to the same server.

Pros: Simple to implement; no shared state needed.
Cons: A server restart disconnects its users; load is hard to rebalance.

Broker Fanout Pattern

graph TB
    L[Load Balancer]
    W1[WS Server 1]
    W2[WS Server 2]
    W3[WS Server N]

    K[(Redis Pub/Sub)]

    L --> W1
    L --> W2
    L --> W3

    W1 --> K
    W2 --> K
    W3 --> K

    K -->|Fanout| W1
    K -->|Fanout| W2
    K -->|Fanout| W3

The broker receives each message once and delivers it to all subscribed servers.

class HybridWebSocketGateway:
    """
    Combines sticky sessions for connection
    with broker fanout for message delivery
    """

    def __init__(
        self,
        redis: Redis,
        message_router: MessageRouter,
        consistent_hash: ConsistentHashRouter
    ):
        self.redis = redis
        self.router = message_router
        self.hash = consistent_hash

    async def route_message(self, message: Message):
        """
        Route message through pub/sub so all
        server instances receive it
        """
        # Publish to Redis channel for this conversation
        channel = f"conversation:{message.conversation_id}"
        await self.redis.publish(channel, json.dumps(message.dict()))

    async def subscribe(self, connection_id: str, conversation_ids: List[str]):
        """Subscribe to message channels"""
        pubsub = self.redis.pubsub()

        for conv_id in conversation_ids:
            await pubsub.subscribe(f"conversation:{conv_id}")

        # Listen in background, deliver to connection
        asyncio.create_task(self._pubsub_listener(connection_id, pubsub))

Message Ordering Guarantees

Total Order vs Causal Order

| Guarantee       | Description                                                    | Implementation                   |
|-----------------|----------------------------------------------------------------|----------------------------------|
| Total order     | All messages in exact global order                             | Single sequencer; Lamport clocks |
| Causal order    | Causally related messages in order; unrelated can be interleaved | Vector clocks                  |
| FIFO per sender | Messages from same sender arrive in send order                 | Sender sequence numbers          |

class MessageOrdering:
    """Implement causal ordering with vector clocks"""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.vector_clock = {node_id: 0}

    def increment_clock(self):
        self.vector_clock[self.node_id] += 1
        return self.vector_clock.copy()

    def send_message(self, content: str) -> Message:
        clock = self.increment_clock()
        return Message(
            content=content,
            vector_clock=clock,
            sender=self.node_id
        )

    def receive_message(self, message: Message) -> bool:
        """Return True if the message can be delivered now (causal ordering)"""
        # Check causality BEFORE merging clocks; merging first hides
        # exactly the gap this check looks for
        if not self._check_causal_delivery(message):
            return False  # Buffer until the preceding messages arrive

        # Deliverable: merge the sender's clock into ours
        for node, val in message.vector_clock.items():
            self.vector_clock[node] = max(
                self.vector_clock.get(node, 0),
                val
            )

        # Increment for the receive event
        self.vector_clock[self.node_id] += 1
        return True

    def _check_causal_delivery(self, message: Message) -> bool:
        """Verify we have seen all causally preceding messages"""
        for node, val in message.vector_clock.items():
            if node == message.sender:
                # Must be the next message in sequence from this sender
                if val != self.vector_clock.get(node, 0) + 1:
                    return False
            elif node != self.node_id and self.vector_clock.get(node, 0) < val:
                return False  # Missing a message the sender had already seen
        return True

Production Failure Scenarios

| Failure Scenario               | Impact                                              | Mitigation                                                 |
|--------------------------------|-----------------------------------------------------|------------------------------------------------------------|
| WebSocket gateway crash        | Users disconnect, must reconnect                    | Run N+1 instances; health checks; client auto-reconnect    |
| Message queue backlog          | Messages delayed, offline users fall further behind | Priority queues; max backlog limits; batch catchup         |
| Presence cache failure         | Users appear offline incorrectly                    | Redis cluster with replication; fallback to DB query       |
| Push notification service down | Offline users don't get notified                    | Queue notifications; retry with backoff; SMS fallback      |
| Database replica lag           | Message history incomplete                          | Read recent messages from primary; accept lag for history  |
| Device conflict on sync        | Duplicate messages or lost messages                 | Vector clocks; idempotent message IDs; conflict resolution |

Capacity Estimation

Connection Capacity

Per WebSocket server (4 vCPU, 8GB RAM):
- Max connections: 10,000-50,000
- Messages/second: 1,000-5,000

For 100,000 concurrent users:
- Minimum servers: 100,000 / 10,000 = 10 servers
- With redundancy (2x): 20 servers
- Peak message rate: 100,000 users × 1 msg/user/10sec = 10,000 msg/s

Storage Estimation

Messages per day: 10M DAU × 10 messages/day = 100M messages
Message size (avg): 500 bytes + 200 bytes metadata = 700 bytes
Daily storage: 100M × 700 = 70 GB
5-year storage: 70GB × 365 × 5 = 127 TB
With 3x replication: ~380 TB
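The arithmetic above, checked in code (every input is this section's assumption, not a measurement):

```python
# Back-of-envelope storage estimate from the stated assumptions
dau = 10_000_000
messages_per_user_per_day = 10
bytes_per_message = 500 + 200  # avg content + metadata

messages_per_day = dau * messages_per_user_per_day   # 100M messages/day
daily_bytes = messages_per_day * bytes_per_message   # 70 GB/day
five_year_bytes = daily_bytes * 365 * 5
replicated_bytes = five_year_bytes * 3               # 3x replication

print(f"daily: {daily_bytes / 1e9:.0f} GB")
print(f"5 years: {five_year_bytes / 1e12:.0f} TB")
print(f"replicated: {replicated_bytes / 1e12:.0f} TB")
```

Keeping the estimate as a script makes it trivial to re-run when an assumption (DAU, message size, retention) changes.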

Observability Checklist

Metrics to Capture

  • websocket_connections_active (gauge) - Per server, total
  • messages_delivered_total (counter) - By type (direct, group, system)
  • message_delivery_latency_seconds (histogram) - End-to-end, P50/P95/P99
  • presence_updates_total (counter) - Online/offline transitions
  • offline_sync_duration_seconds (histogram) - Time to sync missed messages
  • push_notification_sent_total (counter) - By channel (APNS, FCM, SMS)

Logs to Emit

{
  "timestamp": "2026-03-22T10:30:00Z",
  "event": "message_delivered",
  "message_id": "msg_abc123",
  "conversation_id": "conv_xyz",
  "recipient_id": "user_456",
  "delivery_latency_ms": 45,
  "channel": "websocket"
}

Alerts to Configure

| Alert                           | Threshold | Severity |
|---------------------------------|-----------|----------|
| Connection count > 80% capacity | 80% max   | Warning  |
| Message latency P99 > 200ms     | 200ms     | Warning  |
| Presence update lag > 10s       | 10s       | Critical |
| Offline sync duration > 30s     | 30s       | Warning  |
| Push notification queue > 10K   | 10,000    | Warning  |

Security Checklist

  • WebSocket connections over WSS (TLS)
  • JWT token validation on connection
  • Message content encryption (E2EE for sensitive chats)
  • Rate limiting per user (prevent spam)
  • Input sanitization (XSS prevention)
  • Push notification content limited (no PII in notifications)
  • Audit logging for moderation actions
  • Device management (revoke stolen device sessions)
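The per-user rate limiting item is commonly implemented as a token bucket. A minimal sketch (the rate and burst numbers are illustrative):

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per user, e.g. 5 messages/second with bursts of 10
buckets = {}

def check_rate_limit(user_id: str) -> bool:
    bucket = buckets.setdefault(user_id, TokenBucket(rate=5, capacity=10))
    return bucket.allow()

allowed = [check_rate_limit("user_1") for _ in range(12)]
print(allowed.count(True))  # 10 — the burst; further sends are throttled
```

In production the bucket state would live in Redis so every gateway instance enforces the same limit.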

Common Pitfalls / Anti-Patterns

Pitfall 1: Storing Messages in WebSocket Server Memory

Problem: Messages sent while user is temporarily disconnected are lost if stored only in server memory.

Solution: Store all messages in persistent storage (DB) immediately. Deliver from storage when user reconnects. Use message queue as buffer, not source of truth.

Pitfall 2: Not Handling Reconnection Gracefully

Problem: When client reconnects, duplicate messages appear or gaps in conversation.

Solution: Implement idempotent message delivery with client-side deduplication. Use cursor-based pagination for loading message history.
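A sketch of that client-side deduplication, assuming the server echoes client-generated idempotent message IDs (the class and its bound are illustrative):

```python
from collections import OrderedDict

class MessageDeduplicator:
    """Drop replayed messages by ID, with bounded memory for long-lived clients."""

    def __init__(self, max_seen: int = 10_000):
        self.seen: OrderedDict = OrderedDict()
        self.max_seen = max_seen

    def accept(self, message_id: str) -> bool:
        """Return True the first time an ID is seen, False on replays."""
        if message_id in self.seen:
            return False
        self.seen[message_id] = None
        if len(self.seen) > self.max_seen:
            self.seen.popitem(last=False)  # evict the oldest ID
        return True

dedup = MessageDeduplicator()
results = [dedup.accept(m) for m in ["m1", "m2", "m1", "m3", "m2"]]
print(results)  # [True, True, False, True, False]
```

The eviction bound trades perfect deduplication for flat memory; very old replays past the window are instead caught by the sync cursor.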

Pitfall 3: Presence Flooding

Problem: Every presence change triggers notifications to all connected clients, creating broadcast storms in large channels.

Solution: Batch presence updates (send every 5 seconds, not every change). Use bloom filters for “who is online” queries instead of broadcasting.
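The batching fix can be sketched as a coalescing buffer flushed on a timer; the `publish` callback and flush cadence are assumptions, not a fixed API:

```python
from typing import Callable, Dict, List

class PresenceBatcher:
    """Coalesce presence changes per user and flush them in one frame."""

    def __init__(self, publish: Callable[[dict], None]):
        self.publish = publish
        self.pending: Dict[str, str] = {}  # user_id -> latest status

    def record(self, user_id: str, status: str) -> None:
        # Later changes overwrite earlier ones within the window
        self.pending[user_id] = status

    def flush(self) -> None:
        # Called every 5-10 seconds by a timer (not shown)
        if self.pending:
            self.publish({"type": "presence_batch", "updates": self.pending})
            self.pending = {}

sent: List[dict] = []
batcher = PresenceBatcher(publish=sent.append)
batcher.record("u1", "online")
batcher.record("u2", "online")
batcher.record("u1", "offline")  # coalesced: only the final state is sent
batcher.flush()
print(len(sent), sent[0]["updates"])
```

Instead of 3 broadcasts, one frame with the net state goes out; a user who flapped within the window produces no visible churn at all.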

Pitfall 4: Single Point of Failure for Message Ordering

Problem: Using a single sequencer node for total ordering creates a bottleneck and failure point.

Solution: Use causal ordering (vector clocks) instead of total ordering. Most applications only need causal order anyway.


Interview Q&A

Q: How do you handle message delivery when the recipient is offline?

A: When the recipient is offline, the message router stores the message in durable storage first. Then it triggers a push notification to the user’s device. When the user comes back online, the client syncs missed messages using a cursor-based pagination protocol, fetching messages since the last known sync timestamp.

Q: Why is vector clock ordering better than total ordering for chat?

A: Total ordering requires a single sequencer node that sees all messages. This creates a bottleneck and a single point of failure. Causal ordering only guarantees that causally related messages arrive in order. Unrelated messages can be interleaved freely. Most chat applications only need causal ordering, which is cheaper to implement and more fault-tolerant.

Q: How does presence tracking scale in large channels?

A: Presence changes are batched rather than broadcast immediately. The system sends presence updates every 5-10 seconds, not on every change. For “who is online” queries, Bloom filters can approximate membership without broadcasting to all members. For large channels, presence is often not shown at all or shown as “many people online.”

Q: What is the offline sync conflict resolution strategy?

A: Messages are identified by client-generated idempotent message IDs. When syncing, the server returns all messages since the last sync cursor. The client merges local and server messages, deduplicating by message ID. For conflicts (same ID, different content), vector clocks determine which is causally newer, or timestamps break ties.


Scenario Drills

Scenario 1: WebSocket Gateway Server Crash

Situation: The WebSocket server handling 10,000 connections crashes unexpectedly.

Analysis:

  • All 10,000 users disconnect simultaneously
  • Client auto-reconnect kicks in but overwhelms surviving servers
  • Messages sent during disconnect are in message store, not lost

Solution: Run N+1 instances with health checks. Clients implement exponential backoff on reconnect. Sticky sessions at load balancer ensure reconnection routes to same server. Messages are stored in durable DB before delivery, so offline messages survive server death.
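The client-side backoff can use “full jitter”: each retry waits a random delay drawn from an exponentially growing range, which spreads the thundering herd of 10,000 reconnects over time. A sketch (parameter values are illustrative):

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=6, rng=random.random):
    """Full-jitter exponential backoff: attempt n waits a uniform random
    delay in [0, min(cap, base * 2**n)). The injectable `rng` makes the
    schedule testable."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]
```

With the default parameters, the upper bounds grow 1s, 2s, 4s, 8s, 16s, 32s, and the randomness decorrelates clients that all disconnected at the same instant.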

Scenario 2: Presence Flood in Large Channel

Situation: 1,000 users join a popular channel simultaneously. Each presence change is broadcast to all members.

Analysis:

  • Each join generates 1 presence update broadcast to 999 users
  • 1,000 joins = 999,000 presence messages
  • Network flooding and server CPU spike

Solution: Batch presence updates into 5-10 second windows. Instead of broadcasting each change, publish aggregated state. Use Bloom filters for approximate “is user online” queries. For very large channels, hide individual presence.

Scenario 3: Offline User Comes Back Online After Days

Situation: A user returns after being offline for a week with thousands of missed messages.

Analysis:

  • Message history is large; fetching all at once causes timeout
  • Push notification storm if all messages trigger notifications
  • User sees huge badge count

Solution: Sync in batches with cursor-based pagination, fetching messages incrementally instead of all at once. Coalesce push notifications into a single “you have new messages” notification rather than one per message. The app shows a “loading” state while catching up.
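The incremental sync loop might look like this; `fetch_page` stands in for an assumed server API that returns `(messages, next_cursor)`:

```python
def sync_missed(fetch_page, cursor, page_size=100):
    """Pull missed messages in bounded batches until the server returns a
    short page, then hand back the new cursor for the next sync.

    `fetch_page(cursor, limit) -> (messages, next_cursor)` is an assumed
    server API, not part of any specific product's protocol."""
    out = []
    while True:
        page, cursor = fetch_page(cursor, page_size)
        out.extend(page)
        if len(page) < page_size:
            # A short (or empty) page means we've caught up.
            return out, cursor
```

Bounding each request at `page_size` messages keeps individual round-trips fast even after a week offline, and the returned cursor is persisted so the next sync starts where this one ended.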


Failure Flow Diagrams

Message Delivery Flow

graph TD
    A[Sender Sends Message] --> B[WebSocket Gateway]
    B --> C[Message Router]
    C --> D[Store in Database]
    D --> E{Recipient Online?}
    E -->|Yes| F[Find Connection]
    F --> G[Deliver via WS Gateway]
    E -->|No| H[Queue Push Notification]
    G --> I[Send ACK to Sender]
    H --> J[APNS/FCM]
    J --> K[Device Receives]
    K --> L[User Opens App]
    L --> M[Sync Missed Messages]

WebSocket Connection Lifecycle

graph TD
    A[Client Connects] --> B[Authenticate Token]
    B --> C{Valid?}
    C -->|No| D[Close Connection 4001]
    C -->|Yes| E[Register Connection]
    E --> F[Subscribe to Conversations]
    F --> G[Connection Established]
    G --> H[Heartbeat Loop]
    H --> I{Heartbeat OK?}
    I -->|No| J[Mark Offline]
    I -->|Yes| K{Message Received?}
    K -->|Yes| L[Process Message]
    K -->|No| H
    L --> H
    J --> M[Cleanup Connection]

Presence Update Batching

graph LR
    A[User Online Event] --> B[Presence Service]
    B --> C{Batch Window Open?}
    C -->|No| D[Start 5s Timer]
    C -->|Yes| E[Add to Batch Buffer]
    D --> E
    E --> F{Window Elapsed?}
    F -->|No| E
    F -->|Yes| G[Publish Aggregated Update]
    G --> H[Subscribers Receive]

Quick Recap

  • WebSocket gateways scale horizontally using broker fanout or sticky sessions.
  • Store messages persistently first, then deliver; don’t use memory as source of truth.
  • Use vector clocks for causal ordering; only require total order when truly necessary.
  • Conflict resolution on offline sync uses timestamps, vector clocks, or application logic.
  • Presence should be batched and cached, not broadcast on every change.

Copy/Paste Checklist

- [ ] WebSocket gateway with Redis pub/sub fanout
- [ ] Messages stored in DB before delivery
- [ ] Client auto-reconnect with idempotent delivery
- [ ] Cursor-based pagination for message history
- [ ] Vector clocks for causal ordering
- [ ] Conflict resolution on offline sync
- [ ] Presence batched updates (5-10 second windows)
- [ ] Push notification fallback for offline users
- [ ] Rate limiting per connection
- [ ] E2E encryption for sensitive conversations

For more on real-time systems, see our Message Queue Types guide. For message queues that power chat delivery, see the RabbitMQ and Apache Kafka guides.
