Apache Cassandra: Distributed Column Store Built for Scale

Explore Apache Cassandra's peer-to-peer architecture, CQL query language, tunable consistency, compaction strategies, and use cases at scale.


Apache Cassandra: The Distributed Column Store for High Write Throughput

Facebook built Cassandra in 2007 to power inbox search: it needed to store and search hundreds of millions of messages with low latency, and existing solutions did not scale. Cassandra borrowed its distribution and replication design from Amazon’s Dynamo and its storage and data model from Google’s Bigtable.

Facebook open-sourced it in 2008, and Apache made it a top-level project in 2010. Today Apple runs some of the largest Cassandra deployments, Discord uses it for message storage, Spotify for time-series data, and Netflix for analytics. Cassandra’s strength is write-heavy throughput that scales near-linearly as you add nodes, with no single point of failure.


Architecture Fundamentals

Cassandra uses a peer-to-peer architecture. Every node is equal. No master, no coordinator bottleneck, no special node whose failure brings down the cluster. Data distributes via consistent hashing, and any node can serve any request.

A 100-node Cassandra cluster has no bottleneck at any particular node. Writes spread evenly, letting the cluster handle traffic spikes that would crush a single-node database.
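A minimal sketch of the consistent-hashing lookup, assuming a toy four-node ring and using hashlib.md5 as a dependency-free stand-in for Cassandra's MurmurHash3 (the node names are invented for illustration):

```python
# Toy consistent-hashing ring: each node owns the token range ending at
# its own token, and every client computes the same owner for a key.
import bisect
import hashlib

def token(key: str) -> int:
    # Stand-in hash; Cassandra actually uses MurmurHash3 over the partition key.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**64

# Place four nodes on the ring at the tokens of their own names.
ring = sorted((token(f"node-{i}"), f"node-{i}") for i in range(4))
tokens = [t for t, _ in ring]

def owner(partition_key: str) -> str:
    t = token(partition_key)
    idx = bisect.bisect_left(tokens, t) % len(ring)  # wrap around the ring
    return ring[idx][1]

print(owner("customer-123"))  # every client computes the same owner
```

Because the mapping is deterministic, any node (or driver) can locate data without asking a master, which is the property the peer-to-peer design relies on.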

graph TD
    A[Client] --> B[Node 1]
    A --> C[Node 2]
    A --> D[Node 3]
    A --> E[Node N]

    B -->|Gossip| C
    C -->|Gossip| D
    D -->|Gossip| B

    B --- G[(Partition 1)]
    C --- H[(Partition 2)]
    D --- I[(Partition N)]

The gossip protocol keeps all nodes aware of cluster state. Each node periodically exchanges state with a few others, and information spreads cluster-wide. Gossip handles additions, failures, and recoveries without central coordination.
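The dissemination pattern can be sketched with a toy simulation (invented node IDs, not the real gossip digest format), showing how knowledge of a new member reaches every node without central coordination:

```python
# Toy gossip round: each round, every node exchanges membership state
# with one random peer, so new information spreads in a few rounds.
import random

random.seed(7)  # deterministic for the example
nodes = {i: {"members": {i}} for i in range(8)}
nodes[0]["members"].add(99)  # node 0 learns about a new member

rounds = 0
while any(99 not in n["members"] for n in nodes.values()):
    rounds += 1
    for i in nodes:
        peer = random.choice([j for j in nodes if j != i])
        # Exchange state both ways, like a gossip digest and ack.
        merged = nodes[i]["members"] | nodes[peer]["members"]
        nodes[i]["members"] = nodes[peer]["members"] = merged

print(f"cluster converged after {rounds} gossip rounds")
```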


Data Model

Cassandra’s data model looks like SQL but differs in important ways:

  • Keyspace: Container for tables, similar to a database schema
  • Table: Collection of rows sharing a schema and a primary key definition
  • Row: Single record identified by its primary key (partition key plus any clustering columns)
  • Column: Name-value pair with a data type

-- Create a keyspace with replication strategy
CREATE KEYSPACE IF NOT EXISTS orders
WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': 3
};

-- Create a table with a simple primary key
CREATE TABLE orders.customers (
    customer_id UUID,
    email TEXT,
    name TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY (customer_id)
);

-- Table with a compound primary key (partition key + clustering key)
CREATE TABLE orders.order_items (
    customer_id UUID,        -- partition key
    order_id TIMEUUID,       -- clustering key
    product_id UUID,
    quantity INT,
    price DECIMAL,
    PRIMARY KEY (customer_id, order_id)
);

The partition key decides which node stores the data. The clustering key controls sort order within a partition. This design optimizes for the common pattern: fetching all items for a specific customer ordered by time.
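A sketch of the storage layout this schema targets, using invented toy rows: each partition holds its rows sorted by the clustering key, so one sequential read returns a customer's orders already in order:

```python
# Partition key -> rows kept sorted by clustering key (toy model of a
# Cassandra partition; order_id is simplified to an int).
import bisect
from collections import defaultdict

partitions = defaultdict(list)

def insert(customer_id, order_id, item):
    # insort keeps each partition's rows sorted by the clustering key.
    bisect.insort(partitions[customer_id], (order_id, item))

insert("cust-1", 3, "book")
insert("cust-1", 1, "pen")
insert("cust-2", 2, "mug")
insert("cust-1", 2, "lamp")

# One partition scan returns the customer's orders already sorted.
print(partitions["cust-1"])  # [(1, 'pen'), (2, 'lamp'), (3, 'book')]
```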


CQL: Cassandra Query Language

CQL looks like SQL but behaves differently. Key restrictions:

  • No JOINs (denormalize instead)
  • No subqueries in most contexts
  • No aggregate queries across partitions (without Spark)
  • Queries must restrict the partition key or use a secondary index

-- Insert data
INSERT INTO orders.customers (customer_id, email, name, created_at)
VALUES (uuid(), 'alice@example.com', 'Alice Smith', toTimestamp(now()));

-- Query by partition key (efficient)
SELECT * FROM orders.customers WHERE customer_id = ?;

-- Query by non-primary key (requires secondary index - avoid in hot paths)
SELECT * FROM orders.customers WHERE email = 'alice@example.com';

-- Query by clustering key range (efficient within partition)
SELECT * FROM orders.order_items
WHERE customer_id = ? AND order_id > minTimeuuid('2024-01-01');

These restrictions exist for good reason. JOINs across distributed tables need network round-trips between nodes, which kills performance. By requiring partition-targeted queries, Cassandra routes requests directly to relevant nodes.


Tunable Consistency

Tunable consistency is Cassandra’s most powerful feature. Choose consistency per query, trading off between consistency and performance:

from uuid import uuid4

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['192.168.1.1', '192.168.1.2'])
session = cluster.connect('orders')

# Write with quorum consistency (strongly consistent when paired with QUORUM reads);
# in the Python driver, consistency is set on the statement, not on execute()
insert = SimpleStatement(
    "INSERT INTO customers (customer_id, email) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM
)
session.execute(insert, [uuid4(), 'alice@example.com'])

# Read with ONE consistency (fast, possibly stale)
select = SimpleStatement(
    "SELECT * FROM customers WHERE customer_id = %s",
    consistency_level=ConsistencyLevel.ONE
)
session.execute(select, [customer_id])

Common consistency levels:

| Level | Description | Use Case |
|---|---|---|
| ONE | Any single replica | Maximum speed, potentially stale |
| TWO | Two replicas | Better freshness, still fast |
| THREE | Three replicas | Fresher data, higher latency |
| QUORUM | Majority of replicas (RF/2 + 1) | Balanced consistency |
| ALL | All replicas | Strongest consistency, slowest |
| LOCAL_QUORUM | Quorum in the local DC | Low latency for multi-DC |
| EACH_QUORUM | Quorum in every DC | Strong multi-DC consistency |

With 3 replicas, quorum means 2 nodes must acknowledge writes. This gives strong consistency while tolerating one node down.
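The quorum arithmetic is easy to check: QUORUM is floor(RF/2) + 1, and overlapping read and write quorums are what make QUORUM reads observe QUORUM writes:

```python
# Quorum arithmetic: with replication factor RF, QUORUM = RF // 2 + 1.
# If write replicas + read replicas > RF, every read quorum overlaps
# every write quorum in at least one replica.
def quorum(rf: int) -> int:
    return rf // 2 + 1

for rf in (3, 5):
    w = r = quorum(rf)
    print(f"RF={rf}: quorum={w}, overlap guaranteed: {w + r > rf}")
# RF=3: quorum=2, overlap guaranteed: True
# RF=5: quorum=3, overlap guaranteed: True
```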


Client-Side Token-Aware Routing

Cassandra drivers compute the Murmur3 token of the partition key and route each query directly to a replica that owns that token range. This avoids an extra coordinator hop on every query.

from uuid import UUID

from cassandra.cluster import Cluster

# Token-aware routing is enabled by default in recent drivers
cluster = Cluster(['192.168.1.1', '192.168.1.2', '192.168.1.3'])
session = cluster.connect('orders')

# Prepared statements let the driver extract the routing key, compute its
# Murmur3 token, find which node owns that token range, and send the
# query directly to that node.
select = session.prepare(
    "SELECT * FROM orders.customers WHERE customer_id = ?"
)

# Example UUID for illustration; the driver routes to the owning replica
result = session.execute(
    select, [UUID('c87b6a41-0001-4d6a-9d2c-111111111111')]
)

Without token-aware routing:

A query for customer-123 would land on an arbitrary coordinator node, which would then forward it to the replicas that own the partition. This adds an extra network hop.

With token-aware routing:

The driver knows which replica owns the partition. It sends the query directly to that node.

graph TD
    A[Client with Token-Aware Driver] -->|Direct query to replica| B[Replica 1 - owns partition]
    A -->|Coordinator hop| C[Any Node]
    C -->|Forwards query| B

How token mapping works:

Cassandra uses MurmurHash3 for token computation. The token range (-2^63 to 2^63 - 1) is divided among the nodes. The driver builds a token map from cluster metadata:

# Driver token map inspection
from cassandra.cluster import Cluster

cluster = Cluster(['192.168.1.1'])
session = cluster.connect()

# The token map is populated from cluster metadata
token_map = cluster.metadata.token_map

# get_replicas takes the keyspace and the *serialized* partition key bytes
replicas = cluster.metadata.get_replicas('orders', b'customer-123')
print(f"Replicas for this partition: {replicas}")

Token-aware routing and consistency levels:

Token-aware routing works with all consistency levels. For QUORUM, the driver simply picks a replica as the coordinator; that coordinator then sends a data request to one replica and digest requests to the others, waiting until a quorum has responded.

# Coordinator behavior at QUORUM consistency with 3 replicas:
# 1. Driver computes the token -> sends the query to a replica (the coordinator)
# 2. Coordinator reads the full data from one replica
# 3. Coordinator sends digest queries to the other replicas in parallel
# 4. Waits for 2 responses (quorum)
# 5. If digests match, returns the result; if not, fetches full data and reconciles

statement = SimpleStatement(
    query,
    consistency_level=ConsistencyLevel.QUORUM  # coordinator handles the fan-out
)
session.execute(statement, parameters)

When token-aware routing matters most:

| Scenario | Impact |
|---|---|
| Single-partition reads | High: eliminates the coordinator hop |
| Cross-partition queries (full scans, token ranges) | Low: must fan out anyway |
| Batch statements | Medium: each statement is routed independently |
| ALLOW FILTERING queries | None: requires a full table scan |

Token-aware routing limitations:

  • Only helps queries with partition key equality (e.g., WHERE customer_id = ?)
  • Partition-range scans (e.g., WHERE token(customer_id) > ?) fan out across nodes
  • Batch statements spanning multiple partition keys fan out to multiple nodes
  • ALLOW FILTERING defeats token awareness entirely

Driver configuration for token-aware routing:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

# Explicit token-aware policy plus per-workload execution profiles;
# in driver 3.x+, policies and timeouts live on execution profiles
profiles = {
    EXEC_PROFILE_DEFAULT: ExecutionProfile(
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc='us-east-1')
        )
    ),
    'fast': ExecutionProfile(
        consistency_level=ConsistencyLevel.ONE,
        request_timeout=2.0
    ),
    'strong': ExecutionProfile(
        consistency_level=ConsistencyLevel.QUORUM,
        request_timeout=10.0
    ),
}

cluster = Cluster(
    ['192.168.1.1', '192.168.1.2'],
    execution_profiles=profiles
)

session = cluster.connect('orders')
session.execute(query, execution_profile='fast')  # fast, possibly stale reads

Lightweight Transactions (LWT)

Cassandra supports linearizable consistency for single-partition writes through Lightweight Transactions (LWT), implemented using a Paxos consensus protocol. The IF clause in CQL triggers LWT behavior.

-- LWT: Only insert if user@example.com does not exist
INSERT INTO users (user_id, email, created_at)
VALUES ('user-123', 'user@example.com', toTimestamp(now()))
IF NOT EXISTS;

-- LWT: Conditional update. CQL cannot compute balance - 50 server-side on a
-- regular column, so read the current balance first, then compare-and-set:
UPDATE account_balances
SET balance = 450
WHERE account_id = 'acc-123'
IF balance = 500;  -- only succeeds if the balance is unchanged since the read

-- LWT: Compare-and-set pattern
UPDATE counters
SET count = 100
WHERE page_id = 'homepage'
IF count = 99;  -- Only succeeds if count is currently 99

How LWT works internally:

LWT runs a multi-phase Paxos round per partition:

  1. Prepare/Promise: The coordinator proposes a ballot and asks replicas to promise to reject older ballots
  2. Read: The coordinator reads the current row to evaluate the IF condition
  3. Propose/Accept: If a majority promised and the condition holds, the coordinator asks replicas to accept the value
  4. Commit: Replicas commit the value and the coordinator responds to the client

sequenceDiagram
    participant C as Coordinator
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3

    C->>R1: Prepare (p=1, v=value)
    C->>R2: Prepare (p=1, v=value)
    C->>R3: Prepare (p=1, v=value)
    R1-->>C: Promise(p=1)
    R2-->>C: Promise(p=1)
    R3-->>C: Promise(p=1)
    Note over C: Majority promised
    C->>R1: Accept (p=1, v=value)
    C->>R2: Accept (p=1, v=value)
    C->>R3: Accept (p=1, v=value)
    R1-->>C: Accepted
    R2-->>C: Accepted
    R3-->>C: Accepted
    Note over C: Majority accepted
    C->>R1: Commit
    C->>R2: Commit
    C->>R3: Commit

Latency cost:

An LWT write takes up to four round trips between the coordinator and the replicas (prepare, read, propose, commit), compared to one for a standard write. Rough figures with 3 replicas:

| Operation | Latency (local DC) | Latency (cross-DC) |
|---|---|---|
| Standard write (QUORUM) | ~2-5 ms | ~15-30 ms |
| LWT write (QUORUM) | ~10-20 ms | ~50-100 ms |

LWT latency is higher because every operation runs a full Paxos round. When same-DC linearizability is enough, use LOCAL_SERIAL for the Paxos phase and LOCAL_QUORUM for the commit.

LWT contention and failure:

When multiple clients attempt an LWT on the same partition simultaneously, their proposals clash: only one ballot wins each round. The losers either see their condition fail (applied = False) or hit a WriteTimeout during the CAS phase and must retry.

import time
from datetime import datetime, timedelta, timezone

from cassandra import ConsistencyLevel, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['192.168.1.1', '192.168.1.2', '192.168.1.3'])
session = cluster.connect('users')

# CQL cannot do timestamp arithmetic inside an IF clause, so compute the
# cutoff client-side and bind it
cutoff = datetime.now(timezone.utc) - timedelta(hours=1)

query = SimpleStatement("""
    UPDATE user_sessions
    SET last_active = toTimestamp(now())
    WHERE user_id = 'user-123'
    IF last_active < %s
""", consistency_level=ConsistencyLevel.LOCAL_QUORUM)

# LWT with retry logic for contention
for attempt in range(5):
    try:
        result = session.execute(query, [cutoff])
        if result.was_applied:
            print("Update succeeded")
        else:
            print("Condition not met - session recently active")
        break
    except WriteTimeout:
        if attempt < 4:
            time.sleep(0.1 * (2 ** attempt))  # exponential backoff
            continue
        raise

When to use LWT:

| Use Case | LWT Appropriate? |
|---|---|
| Ensure unique email on user creation | Yes: IF NOT EXISTS |
| Prevent double-spend / race conditions | Yes: compare-and-set on the balance |
| Distributed locks | Maybe: a dedicated coordination service is often a better fit |
| Counter increments | No: use Cassandra counters |
| High-throughput writes | No: LWT is roughly 4-10x slower |
| Multi-partition transactions | No: LWT is single-partition only |

LWT and Cassandra driver:

The Python driver executes LWTs like any other statement, but conditional writes should not be blindly retried by a retry policy: a timed-out CAS may already have been applied, so a retry can double-apply. FallthroughRetryPolicy surfaces errors to the application instead:

from cassandra.cluster import Cluster
from cassandra.policies import FallthroughRetryPolicy

cluster = Cluster(
    ['192.168.1.1', '192.168.1.2', '192.168.1.3'],
    default_retry_policy=FallthroughRetryPolicy()  # surface errors; retry in app code
)

Common LWT mistakes:

  • Using LWT on high-contention rows (concurrent updates repeatedly clash and retry)
  • Assuming LWT spans multiple partitions (it cannot)
  • Not handling CAS WriteTimeout errors with appropriate backoff
  • Using LWT for counters (counters have their own, non-linearizable increment path)

Materialized Views

Materialized views (MVs) in Cassandra automatically maintain a denormalized view of a base table. When rows in the base table change, Cassandra asynchronously updates the view. This differs from manual denormalization where your application must write to multiple tables.

-- Base table: orders by customer
CREATE TABLE orders.customers (
    customer_id UUID,
    order_id TIMEUUID,
    order_total DECIMAL,
    order_status TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY (customer_id, order_id)
);

-- Materialized view: orders indexed by order_id
CREATE MATERIALIZED VIEW orders.orders_by_id AS
    SELECT order_id, customer_id, order_total, order_status, created_at
    FROM orders.customers
    WHERE order_id IS NOT NULL AND customer_id IS NOT NULL
    PRIMARY KEY (order_id, customer_id);

-- Materialized view: orders indexed by status
-- (the view's primary key must include every base primary key column)
CREATE MATERIALIZED VIEW orders.orders_by_status AS
    SELECT order_status, order_id, customer_id, order_total, created_at
    FROM orders.customers
    WHERE order_status IS NOT NULL AND order_id IS NOT NULL
          AND customer_id IS NOT NULL
    PRIMARY KEY (order_status, order_id, customer_id);

How materialized view updates work:

  1. A write arrives at a base table replica
  2. The replica applies the write to the base table
  3. The replica generates a view mutation for each affected view
  4. View mutations are sent (via a local batchlog) to the paired view replicas
  5. View updates apply asynchronously - the base write does not wait for them

graph LR
    A[Write to Base Table] --> B[Coordinator]
    B --> C[Base Table SSTable]
    B --> D[View Mutation]
    D --> E[View 1 SSTable]
    D --> F[View 2 SSTable]
    D --> G[View N SSTable]

Eventual consistency trade-offs:

View updates are asynchronous and best-effort. This has important implications:

| Aspect | Behavior |
|---|---|
| View staleness | Views may lag the base table by milliseconds to seconds |
| View failures | Failed view updates are replayed from the batchlog, but can lag behind |
| Lost writes | If a base write partially fails across replicas, views can diverge until repair |
| Delete propagation | Deletes on the base table propagate to views as tombstones |

from cassandra.cluster import Cluster

cluster = Cluster(['192.168.1.1', '192.168.1.2'])
session = cluster.connect('orders')

# Read from materialized view - may be slightly stale
rows = session.execute("""
    SELECT * FROM orders.orders_by_id
    WHERE order_id = %s
""", parameters=['some-order-uuid'])

# For critical reads requiring the latest data, query the base table by
# its own primary key (the base table cannot be queried by order_id alone)
rows = session.execute("""
    SELECT * FROM orders.customers
    WHERE customer_id = %s AND order_id = %s
""", parameters=['some-customer-uuid', 'some-order-uuid'])

When materialized views work well:

| Scenario | MV Works? |
|---|---|
| Write-heavy workloads with multiple read patterns | Yes: writes maintain views automatically |
| Single-row updates across views | Yes: atomic per partition |
| Time-series with multiple access patterns | Yes: create views for different time ranges |
| Read-heavy workloads that can tolerate stale reads | Yes: views reduce read amplification |

When materialized views break down:

| Scenario | MV Problems |
|---|---|
| High contention on a base row | View updates may be delayed or lost |
| Updates that change the view key | Become a delete plus insert in the view, doubling write cost |
| Very high write rates | View update backlog can grow unbounded |
| Strong consistency requirements | Views are eventually consistent |
| Complex aggregations | MVs cannot aggregate - they only denormalize |

View update failure handling:

Cassandra tracks view build progress in the system.views_builds_in_progress and system.built_views tables. You can monitor builds, and if a view falls permanently out of sync, the usual remedy is dropping and recreating it:

# Check view build status
nodetool viewbuildstatus orders orders_by_id

-- Check build progress from cqlsh
SELECT * FROM system.views_builds_in_progress;
SELECT * FROM system.built_views;

Materialized view limitations:

  • The view primary key must include all primary key columns from the base table
  • At most one non-primary-key column from the base table may be added to the view's primary key
  • You cannot write to a view directly; all writes go through the base table
  • View rows inherit the TTL of the base row; you cannot set a separate default TTL on a view

Write Path and Commit Log

Cassandra optimizes for write throughput. When you write:

  1. Write appends to the commit log (durability)
  2. Data goes to the memtable (in-memory buffer)
  3. Client gets acknowledgment immediately
  4. Memtable flushes to SSTable when full

The write path is append-only, which is fast on disk. B-tree databases must find the right location and overwrite; Cassandra just appends.
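The four steps above can be sketched in a few lines (a toy model with invented names, not Cassandra's actual storage engine):

```python
# Toy write path: append to a commit log for durability, buffer in a
# memtable, acknowledge, and flush an immutable sorted SSTable when full.
import json

class TinyWritePath:
    def __init__(self, flush_at=3):
        self.commit_log = []   # append-only durability record
        self.memtable = {}     # in-memory buffer, sorted at flush time
        self.sstables = []     # immutable sorted runs on "disk"
        self.flush_at = flush_at

    def write(self, key, value):
        self.commit_log.append(json.dumps({key: value}))  # 1. durability
        self.memtable[key] = value                        # 2. in-memory
        if len(self.memtable) >= self.flush_at:           # 4. flush when full
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}
        return "ack"                                      # 3. acknowledge

db = TinyWritePath()
for k in ("c", "a", "b"):
    db.write(k, k.upper())
print(db.sstables)  # [[('a', 'A'), ('b', 'B'), ('c', 'C')]]
```

Note that no step seeks to an existing disk location: both the commit log and the flushed SSTable are pure appends, which is why the write path stays fast under load.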

sequenceDiagram
    participant C as Client
    participant M as Memtable
    participant L as Commit Log
    participant S as SSTable

    C->>L: Append to Commit Log (durability)
    C->>M: Write to Memtable
    M-->>C: Write Acknowledged
    Note over M: When memtable is full
    M->>S: Flush to SSTable

SSTables (Sorted String Tables) are immutable files on disk. Once written, never modified. Compaction merges SSTables and removes deleted data.
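A sketch of what compaction does with two immutable runs, assuming toy (timestamp, value) cells where None stands in for a tombstone:

```python
# Toy compaction: merge two SSTables, keep the newest value per key
# (highest write timestamp wins), and drop deleted entries.
old = {"a": (1, "A1"), "b": (1, "B1"), "c": (1, "C1")}
new = {"b": (2, "B2"), "c": (2, None)}  # c was deleted at t=2

merged = {}
for table in (old, new):
    for key, (ts, value) in table.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)

# The compacted output is sorted and contains no dead data.
compacted = {k: v for k, (ts, v) in sorted(merged.items()) if v is not None}
print(compacted)  # {'a': 'A1', 'b': 'B2'}
```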


Compaction Strategies

Compaction merges SSTables and reclaims space after deletions or updates. Cassandra offers:

SizeTiered Compaction (STCS):

  • Merges SSTables of similar size
  • Good for write-heavy workloads
  • Obsolete data lingers until enough similar-sized SSTables accumulate

TimeWindow Compaction (TWCS):

  • Groups SSTables by time window
  • Ideal for time-series data
  • Simplifies TTL and tombstone management

Leveled Compaction (LCS):

  • Organizes SSTables into non-overlapping levels of bounded size
  • Better read performance (at most one SSTable per level can hold a given row)
  • More write amplification from re-compacting data into levels

-- Specify compaction strategy when creating table
CREATE TABLE metrics.sensor_data (
    sensor_id UUID,
    timestamp TIMESTAMP,
    value DOUBLE,
    PRIMARY KEY (sensor_id, timestamp)
) WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 1
};

For time-series with TTLs, TWCS simplifies tombstone cleanup. For general workloads, choose LCS or STCS depending on your read/write ratio.

Compaction Strategy Selection Decision Tree

Choosing the right compaction strategy requires understanding your workload characteristics. Here is a practical decision framework:

graph TD
    A[Start: Choose Compaction Strategy] --> B{Data Model?}
    B -->|Time-series / TTL| C[TWCS - TimeWindow Compaction]
    B -->|Wide partitions / Heavy writes| D[STCS - SizeTiered Compaction]
    B -->|General workload / Balanced| E{Latency Priority?}
    E -->|Optimize reads| F[LCS - Leveled Compaction]
    E -->|Optimize writes| D
    E -->|Mixed workload| G{Partition Size?}
    G -->|Wide partitions >100MB| D
    G -->|Narrow partitions| H[TWCS or LCS]

    C --> I[Best for: IoT, metrics, event logs]
    D --> J[Best for: Write-heavy, archival]
    F --> K[Best for: Read-heavy, time-series with range queries]
    H --> L[Consider: LCS if reads matter, TWCS if time-bucketed]

Decision matrix by workload type:

| Workload | Primary Strategy | Alternative | Why |
|---|---|---|---|
| Time-series with TTL | TWCS | - | TTL expiration aligns with window compaction |
| Write-heavy (logs, events) | STCS | TWCS (if time-bucketed) | Append-heavy, infrequent reads |
| Read-heavy (user profiles) | LCS | TWCS | Frequent reads benefit from leveled layout |
| Mixed (balanced) | TWCS | LCS | Time-bucketed queries work well with TWCS |
| Heavy deletes | TWCS | STCS | Tombstones expire with time windows |
| Wide partitions | STCS | - | LCS struggles with partitions > 10 MB |
| Counter tables | STCS | - | Counters see many updates and benefit from STCS |

Compaction strategy characteristics:

| Strategy | SSTable Organization | Write Amplification | Read Amplification | Space Amplification | Best For |
|---|---|---|---|---|---|
| STCS | Size tiers (unbounded count) | Low | High (may scan several similar-sized SSTables) | High (duplicates persist until merged) | Write-heavy |
| LCS | L0 plus leveled L1+ | High (data is rewritten at each level) | Low (about one SSTable per level) | Low (~10% overhead) | Read-heavy |
| TWCS | One tier set per time window | Low | Low for time-ranged reads | Low (expired windows drop whole SSTables) | Time-series |

How to switch compaction strategies:

Compaction strategy is a table property, changed with ALTER TABLE rather than nodetool:

-- Change the strategy (takes effect cluster-wide, no restart needed)
ALTER TABLE orders.customers
WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 1
};

Existing SSTables are reorganized gradually as compaction runs. To monitor or force the transition:

# Watch compaction progress during the transition
nodetool compactionstats

# Optionally force a rewrite of existing SSTables under the new strategy
nodetool compact orders customers

Compaction strategy and Cassandra version:

| Strategy | Cassandra 2.1 | Cassandra 3.0+ | Notes |
|---|---|---|---|
| STCS | Yes | Yes | The default strategy |
| LCS | Yes | Yes | Improved significantly in 3.0+ |
| TWCS | No | Yes | Added in 3.0.8 |
| DTCS | Yes (deprecated) | Deprecated, removed in 4.0 | Use TWCS instead |

Common compaction mistakes:

  1. Using LCS for wide partitions: LCS assumes partitions fit in a single SSTable level. Wide partitions cause compaction failures and read timeouts.

  2. Using TWCS without time-based access: TWCS groups data by timestamp. If queries span multiple time windows, performance degrades.

  3. Assuming a strategy change takes effect immediately: Existing SSTables are only reorganized as compaction runs; until then, reads still span the old layout.

  4. Setting TWCS windows too small: Windows < 1 hour cause excessive SSTable fragmentation. Windows > 1 day defeat the tombstone purging purpose.

  5. Ignoring compaction during strategy change: The table is vulnerable to tombstone accumulation during the transition period.

Tombstone Handling

When you delete data in Cassandra, the deletion is not immediate. Cassandra writes a tombstone - a marker indicating the data is deleted - and the actual removal happens during compaction.

Why tombstones exist:

Cassandra’s append-only SSTable design means data cannot be overwritten in place. Deletions are writes that mark data as dead. The tombstone persists until compaction removes both the tombstone and the dead data it marks.

-- This writes a tombstone, not immediate deletion
DELETE FROM orders.order_items
WHERE order_id = 'order-123';

Tombstone lifecycle:

  1. Delete request arrives at coordinator
  2. Coordinator writes tombstone to SSTable with the deletion timestamp
  3. Tombstone is visible immediately - queries exclude the deleted data
  4. During compaction, tombstones older than gc_grace_seconds are removed along with the data they mark
  5. If a replica missed the delete and repair does not run within gc_grace_seconds, compaction can purge the tombstone while the stale copy survives elsewhere, and the deleted data can “resurrect”
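The purge rule in steps 4-5 reduces to a timestamp comparison against gc_grace_seconds (a sketch with invented timestamps):

```python
# A tombstone (and the data it shadows) may only be dropped during
# compaction once it is older than gc_grace_seconds.
import time

GC_GRACE_SECONDS = 864_000  # the 10-day default

def purgeable(tombstone_written_at: float, now: float = None) -> bool:
    now = time.time() if now is None else now
    return now - tombstone_written_at > GC_GRACE_SECONDS

now = 1_700_000_000.0  # arbitrary example clock
print(purgeable(now - 5 * 86_400, now))   # 5-day-old tombstone: False
print(purgeable(now - 11 * 86_400, now))  # 11-day-old tombstone: True
```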

gc_grace_seconds and resurrection risk:

-- Table with gc_grace_seconds = 10 days (the default);
-- this setting is critical for managing tombstones
CREATE TABLE orders.order_items (
    order_id UUID,
    item_id UUID,
    quantity INT,
    PRIMARY KEY (order_id, item_id)
) WITH gc_grace_seconds = 864000;  -- 10 days in seconds

The gc_grace_seconds setting controls how long tombstones must survive before compaction may purge them. This window gives anti-entropy repair time to propagate deletions to every replica. If a replica missed the original deletion and repair does not run within the window, the tombstone can be purged while the stale copy survives, and the deleted data can resurrect from that replica.

| Scenario | gc_grace_seconds | Risk |
|---|---|---|
| Single datacenter, daily repairs | 86400 (1 day) | Low: repairs catch missed deletions quickly |
| Multi-DC, weekly repairs | 432000 (5 days) | Medium |
| Default, with regular repairs | 864000 (10 days) | Low: large window for repair to complete |
| No repairs ever | Any value | High: tombstones are never safely removable |

Tombstone impact on read performance:

Queries must scan through tombstones to find live data. A partition with many tombstones can cause read timeouts even when the query returns few results.

-- This query scans all tombstones in the partition
-- May timeout on partitions with 100k+ tombstones
SELECT * FROM orders.order_items
WHERE order_id = 'big-order-with-many-items';

Diagnosing tombstone problems:

# Check tombstone statistics per table
nodetool tablestats orders.order_items -H

# Look for "Average tombstones per slice" in the output;
# high values indicate read pressure from tombstone scanning

# Check compaction history
nodetool compactionhistory | grep order_items

-- In cqlsh, trace a query to see how many tombstones it scans
TRACING ON;
SELECT * FROM orders.order_items WHERE order_id = 'test';
TRACING OFF;

Mitigation strategies:

  1. Time-window bucketing: Split time-series into daily or weekly partitions. Old partitions with tombstones are not queried.
CREATE TABLE metrics.sensor_data (
    sensor_id UUID,
    date TEXT,  -- '2024-01-15' bucket
    timestamp TIMESTAMP,
    value DOUBLE,
    PRIMARY KEY ((sensor_id, date), timestamp)
);

  2. Prevent wide rows with mixed active/deleted data: Separate active entities from archived ones.

  3. Use TTL instead of explicit deletes: TTL tombstones are handled naturally by TWCS.

  4. Reduce gc_grace_seconds with frequent repairs: If you run repair daily, you can safely reduce gc_grace_seconds.

  5. Monitor partition statistics: Identify partitions with excessive tombstones before they cause outages.

# Sketch: check for tombstone pressure by parsing `nodetool tablestats`
# output (per-partition tombstone counts are not exposed via CQL;
# system.size_estimates only holds partition count and size estimates)
import subprocess

out = subprocess.run(
    ["nodetool", "tablestats", "orders.order_items"],
    capture_output=True, text=True, check=True
).stdout

for line in out.splitlines():
    line = line.strip()
    # e.g. "Average tombstones per slice (last five minutes): 1234.0"
    if line.lower().startswith("average tombstones per slice"):
        value = float(line.split(":")[1])
        if value > 1000:
            print(f"High tombstone density - consider archival: {line}")

Anti-Entropy Repair

Cassandra uses anti-entropy repair to synchronize data between replicas and ensure consistency. The nodetool repair command triggers this process, which compares Merkle trees between replicas to find and fix discrepancies.

How Merkle trees work in Cassandra:

  1. Each replica builds a Merkle tree from its SSTable data
  2. Trees are hash-based binary trees where leaf nodes contain hash values of data ranges
  3. Replicas exchange tree roots to detect differences
  4. When roots differ, nodes traverse to find the exact range that diverged
  5. Only the differing data ranges are exchanged and repaired
graph TD
    A[Replica 1 SSTable] --> B[Merkle Tree Root Hash]
    C[Replica 2 SSTable] --> D[Merkle Tree Root Hash]
    B --> E{Compare Roots}
    D --> E
    E -->|Match| F[No repair needed]
    E -->|Mismatch| G[Traverse tree to find differing range]
    G --> H[Exchange only differing data]
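The comparison in steps 1-5 can be sketched with a toy Merkle tree over four data ranges (sha256 as the hash, invented rows):

```python
# Toy Merkle comparison: hash leaf ranges, roll up to a root, and if the
# roots differ, locate the leaf range that diverged so only that range
# needs to be streamed between replicas.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaves(ranges):
    # One hash per data range (a real tree hashes row contents per token range).
    return [h(repr(r).encode()) for r in ranges]

def root(leaf_hashes):
    layer = leaf_hashes
    while len(layer) > 1:
        layer = [h(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

replica1 = leaves([("a", 1), ("b", 2), ("c", 3), ("d", 4)])
replica2 = leaves([("a", 1), ("b", 2), ("c", 99), ("d", 4)])  # range 2 diverged

diff = []
if root(replica1) != root(replica2):
    diff = [i for i, (x, y) in enumerate(zip(replica1, replica2)) if x != y]
    print(f"repair only leaf ranges {diff}")  # repair only leaf ranges [2]
```

Exchanging roots costs a few bytes regardless of data size, which is why replicas can cheaply confirm they are in sync before streaming anything.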

Operational considerations:

  • Running repair: nodetool repair -pr repairs only the node's primary ranges (run it on every node to cover the cluster)
  • Frequency: Run at least once per gc_grace_seconds window; weekly is a common baseline
  • Resource cost: Repair is expensive - it reads SSTables to build Merkle trees and generates network traffic
  • Incremental repair: The default since 2.2; it skips already-repaired SSTables (use -full to force a full repair)
  • Pending load: nodetool compactionstats shows pending tasks before running repair

Common repair issues:

| Problem | Symptom | Mitigation |
|---|---|---|
| Repair running too long | Cluster slowdown | Use incremental repair |
| Merkle tree memory | OOM on large partitions | Subdivide large partitions |
| Overlap with writes | Inconsistency during repair | Repair during low-write windows |
| Node down during repair | Incomplete sync | Re-run repair after the node recovers |

Anti-entropy vs Read Repair:

Anti-entropy repair is proactive (it runs periodically), while read repair is reactive (it happens during reads): for reads above consistency ONE, the coordinator compares replica digests and fixes any inconsistencies it discovers. Anti-entropy catches issues that read repair misses on rarely-read data.

Read Repair: The Passive Consistency Path on Every Read

Read repair is Cassandra’s mechanism for passively repairing inconsistencies during normal read operations. It happens automatically without requiring separate repair jobs.

How read repair works:

When you read with a consistency level greater than ONE, Cassandra checks consistency across replicas and repairs discrepancies as part of the read path.

sequenceDiagram
    participant C as Coordinator
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3

    C->>R1: Send data query
    C->>R2: Send data query
    C->>R3: Send data query
    R1-->>C: Data (digest: abc123)
    R2-->>C: Data (digest: abc123)
    R3-->>C: Data (digest: xyz789)  <- Digest mismatch!

    Note over C: Digests differ - R3 has stale data
    C->>R3: Send full data repair
    R3-->>C: Latest data
    C->>R1: Confirm sync (optional digest)
    C->>R2: Confirm sync (optional digest)
    Note over C: Return latest to client

Read repair with QUORUM (3 replicas):

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(['192.168.1.1', '192.168.1.2', '192.168.1.3'])
session = cluster.connect('orders')

# Read with QUORUM consistency
# Coordinator queries all 3 replicas
# Waits for 2 responses (quorum)
# If digests differ, repair happens as part of read
query = SimpleStatement("""
    SELECT * FROM orders.customers WHERE customer_id = %s
""", consistency_level=ConsistencyLevel.QUORUM)

result = session.execute(query, ['customer-123'])
# During this read:
# 1. Coordinator sends to all 3 replicas
# 2. R1 and R2 return data with digest
# 3. Coordinator compares digests
# 4. If R3 digest differs, coordinator fetches latest from R1/R2 and sends to R3
# 5. Client gets consistent data

Read repair probability:

| Consistency Level | Read Repair Chance |
|---|---|
| ONE | None: no digest comparison |
| LOCAL_ONE | None: no digest comparison |
| QUORUM | Possible: triggered when a stale replica is among the responders |
| ALL | Highest: every replica is checked |
| EACH_QUORUM | High: every DC is checked |

With RF=3 and QUORUM, two of the three possible two-replica quorums contain any given stale replica, so a single stale copy participates in roughly two thirds of reads; a repair fires only when the mismatch is actually detected through the digests.
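A quick combinatorial check of the QUORUM case, assuming RF=3 and exactly one stale replica:

```python
# With RF=3 and a quorum of 2, count how many of the possible quorums
# include the one stale replica.
from itertools import combinations

replicas = ("r1", "r2", "r3")
stale = "r3"

quorums = list(combinations(replicas, 2))  # all 2-of-3 quorums
hit = sum(1 for q in quorums if stale in q)
print(f"{hit}/{len(quorums)} quorums include the stale replica")
```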

Read repair at ONE consistency:

Reads at ONE contact a single replica, so no digests are compared inline. Before Cassandra 4.0, tables could opt into probabilistic background read repair via table options (both were removed in 4.0 because they provided little benefit):

-- Table-specific read repair chance (Cassandra 3.x and earlier)
CREATE TABLE orders.customers (
    customer_id UUID,
    email TEXT,
    name TEXT,
    PRIMARY KEY (customer_id)
) WITH read_repair_chance = 0.1
  AND dclocal_read_repair_chance = 0.05;

dclocal_read_repair_chance vs read_repair_chance:

  • dclocal_read_repair_chance: Probability of read repair for the local DC only (runs faster, does not wait for cross-DC)
  • read_repair_chance: Probability of read repair across all DCs (waits for all replicas)

| Setting | Scope | Latency Impact | Use Case |
| --- | --- | --- | --- |
| dclocal_read_repair_chance | Local DC only | Lower - no cross-DC coordination | Multi-DC with local-first reads |
| read_repair_chance | All DCs | Higher - cross-DC coordination | Critical data requiring global consistency |

Read repair and write latency interaction:

Read repair happens asynchronously after the coordinator receives enough responses to satisfy the consistency level. This means:

# QUORUM read timeline:
# T+0ms: Coordinator sends to all replicas
# T+2ms: QUORUM (2 replicas) respond - coordinator returns to client
# T+2ms to T+10ms: Background read repair to stale replica (if any)

The client sees low latency (just waiting for quorum), but the repair happens in the background.
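A toy calculation (with hypothetical per-replica response times) shows why the client only pays for the quorum, never for the slowest replica:

```python
# Hypothetical response times for RF=3; r3 is the slow (and possibly stale) replica
replica_latency_ms = {'r1': 2.1, 'r2': 1.8, 'r3': 9.4}

rf = len(replica_latency_ms)
quorum = rf // 2 + 1                            # 2 for RF=3
arrival_order = sorted(replica_latency_ms.values())
client_latency = arrival_order[quorum - 1]      # coordinator answers after the 2nd reply

print(f"client-visible latency: {client_latency} ms")
print("any repair of r3 continues in the background")
```

Here the client sees 2.1 ms even though r3 takes 9.4 ms; the 9.4 ms cost is paid only by the background repair path.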

Monitoring read repair:

# Read repair statistics (attempted / mismatch counts)
nodetool netstats

# Coordinator-level read latency histograms
nodetool proxyhistograms

# Enable debug logging for read repair
nodetool setlogginglevel org.apache.cassandra.service.reads.repair DEBUG
Read repair counters are also exposed over JMX under org.apache.cassandra.metrics:type=ReadRepair (for example RepairedBlocking and RepairedBackground), so any JMX-capable monitoring setup - the Prometheus JMX exporter, Datadog, or plain jconsole - can chart them alongside read latencies.

When read repair is insufficient:

Read repair only fixes inconsistencies on partitions that are actively read. Partitions that are never read accumulate inconsistencies until anti-entropy repair runs. This is why anti-entropy repair (nodetool repair) is essential for long-term consistency, especially for rarely-accessed data.

| Scenario | Read Repair | Anti-Entropy Repair |
| --- | --- | --- |
| Frequently accessed partitions | Keeps them consistent | Rarely needed |
| Rarely accessed partitions | Does not run | Required periodically |
| Large partitions | Can cause latency spikes | Better - runs in background |
| Cross-DC consistency | dclocal only doesn't cover | Full cluster repair |

Data Center Awareness

NetworkTopologyStrategy defines replication rules that respect data center boundaries:

  • Lower latency by reading from local replicas
  • Disaster resilience via cross-DC replication
  • Workload isolation (analytics vs. transactional)
CREATE KEYSPACE orders WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'us-east-1': 3,      -- 3 replicas in primary DC
    'us-west-2': 3,      -- 3 replicas in backup DC
    'eu-west-1': 1        -- 1 replica in EU for GDPR
};

For cross-DC replication, the LOCAL_ONE, LOCAL_QUORUM, and EACH_QUORUM consistency levels let you control whether a request waits on remote data centers before acknowledging.

Multi-DC Consistency Pitfalls

Cassandra’s tunable consistency works differently across datacenters, and EACH_QUORUM has surprising behavior during DC failures that catches many users.

EACH_QUORUM behavior during DC failure:

EACH_QUORUM requires a quorum in every datacenter before responding to the client. If one DC is down, writes fail entirely even if all other DCs are available.

import uuid

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(['192.168.1.1', '192.168.1.2'])  # us-east-1
session = cluster.connect('orders')

# EACH_QUORUM write - waits for quorum in EACH datacenter
# If us-west-2 DC is down, this write FAILS even if us-east-1 quorum succeeds
query = SimpleStatement("""
    INSERT INTO orders.customers (customer_id, email, name)
    VALUES (%s, %s, %s)
""", consistency_level=ConsistencyLevel.EACH_QUORUM)

try:
    # customer_id is a UUID column, so bind a UUID value
    session.execute(query, [uuid.uuid4(), 'a@b.com', 'Alice'])
except Exception as e:
    print(f"EACH_QUORUM failed: {e}")
    # If us-west-2 is down, write fails even though us-east-1 succeeded

The EACH_QUORUM failure scenario:

With replication factor 3 in each of 2 DCs (6 total replicas):

| DC | Replicas | Quorum Needed | Status |
| --- | --- | --- | --- |
| us-east-1 | 3 | 2 | Available |
| us-west-2 | 3 | 2 | DOWN |

  • EACH_QUORUM write: Requires 2 from us-east-1 AND 2 from us-west-2. FAILS because us-west-2 is unavailable.
  • LOCAL_QUORUM write: Requires 2 from us-east-1 only. SUCCEEDS.
  • QUORUM write: Requires 4 total across both DCs. FAILS because us-west-2 replicas cannot respond.
graph TD
    A[Write with EACH_QUORUM] --> B{DC us-west-2 available?}
    B -->|No| C[Write FAILS - cannot satisfy EACH_QUORUM]
    B -->|Yes| D[Write succeeds if both DCs have quorum]
    C --> E[Even though us-east-1 has quorum]
    E --> F[us-east-1: 3/3 replicas acknowledge]
    F --> G[But requirement: quorum in EVERY DC]
    G --> C
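The three outcomes above can be checked with a small model (a sketch; the DC names and replica counts follow the example, and LOCAL_QUORUM is evaluated from a us-east-1 coordinator):

```python
def quorum(n):
    """Quorum for n replicas: floor(n/2) + 1."""
    return n // 2 + 1

def write_succeeds(level, replicas, live):
    """Can a write satisfy the given consistency level?

    replicas / live map DC name -> replica count / currently live replicas.
    """
    if level == 'EACH_QUORUM':       # quorum in EVERY datacenter
        return all(live[dc] >= quorum(replicas[dc]) for dc in replicas)
    if level == 'LOCAL_QUORUM':      # quorum in the coordinator's DC only
        return live['us-east-1'] >= quorum(replicas['us-east-1'])
    if level == 'QUORUM':            # quorum of ALL replicas cluster-wide
        return sum(live.values()) >= quorum(sum(replicas.values()))
    raise ValueError(level)

replicas = {'us-east-1': 3, 'us-west-2': 3}
live = {'us-east-1': 3, 'us-west-2': 0}      # us-west-2 is down

for level in ('EACH_QUORUM', 'LOCAL_QUORUM', 'QUORUM'):
    ok = write_succeeds(level, replicas, live)
    print(level, 'succeeds' if ok else 'FAILS')
```

With us-west-2 down, only LOCAL_QUORUM succeeds: EACH_QUORUM needs 2 live replicas in the dead DC, and global QUORUM needs 4 of 6 replicas but only 3 are reachable.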

Why teams pick EACH_QUORUM incorrectly:

EACH_QUORUM seems appealing for “strong consistency across DCs”, but most applications do not need it. LOCAL_QUORUM for both reads and writes gives you strong consistency within each DC and eventual consistency across DCs, which matches how most applications work.

| Consistency Level | Use When |
| --- | --- |
| EACH_QUORUM | Financial transactions requiring all DCs to acknowledge; accepting that any DC failure causes unavailability |
| LOCAL_QUORUM | Most multi-DC deployments; strong locally, eventual across DCs |
| LOCAL_ONE | Non-critical reads; can tolerate stale local data |

Read consistency cross-DC:

EACH_QUORUM is a write-only consistency level, so reads behave unexpectedly if you try it:

# EACH_QUORUM is NOT supported for reads - the server rejects it
query = SimpleStatement("""
    SELECT * FROM orders.customers WHERE customer_id = %s
""", consistency_level=ConsistencyLevel.EACH_QUORUM)
# Executing this raises an InvalidRequest error (EACH_QUORUM is writes-only)

For reads that must reflect writes from every DC, use QUORUM, which waits for a majority of all replicas cluster-wide; for low-latency local reads, use LOCAL_QUORUM or LOCAL_ONE.

DC failure during write:

When a DC goes down mid-write with EACH_QUORUM:

  1. Coordinator receives write
  2. Successfully writes to available DCs (us-east-1: 3/3)
  3. Fails to write to down DC (us-west-2: 0/3)
  4. Returns error to client
  5. Down DC has no record of the write

When the DC comes back up, it replays hints (if hints are enabled) but may miss writes that completed during the outage window.
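The outage-window caveat can be quantified with a quick sketch (using the default 3-hour hint window; the 5-hour outage is a made-up example):

```python
# Hints are generated only while the target has been down for less than
# max_hint_window; writes arriving after that point get no hint and can
# only be restored by anti-entropy repair (nodetool repair).
max_hint_window_s = 3 * 60 * 60      # default: 3 hours
outage_s = 5 * 60 * 60               # a 5-hour DC outage

unhinted_s = max(0, outage_s - max_hint_window_s)
print(f"{unhinted_s / 3600:.0f} h of writes have no hints and need repair")
```

This is why a long DC outage should always be followed by a full repair, even when hinted handoff is enabled.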

Mitigation strategies:

  1. Use LOCAL_QUORUM instead of EACH_QUORUM unless you have a specific requirement for all-DC acknowledgment.

  2. Configure hints replay for DC failure recovery:

# cassandra.yaml
hints_directory: /var/lib/cassandra/hints
max_hint_window_in_ms: 10800000   # stop generating hints after 3 hours of downtime
hints_flush_period_in_ms: 10000
  3. Monitor DC availability:

from cassandra.cluster import Cluster

cluster = Cluster(['192.168.1.1'])
session = cluster.connect()

# The driver's control connection tracks node liveness; system.peers
# does not expose up/down status, but the cluster metadata does
for host in cluster.metadata.all_hosts():
    if host.datacenter == 'us-west-2':
        print(f"DC us-west-2 node: {host.address}, "
              f"status: {'up' if host.is_up else 'down'}")
  4. Set write acknowledgments appropriately:

-- For global acknowledgment requirement (financial/regulatory):
-- keep durable commit-log writes (the default) and write at EACH_QUORUM
CREATE KEYSPACE orders WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'us-east-1': 3,
    'us-west-2': 3
} AND durable_writes = true;

-- For local-first writes (most applications):
-- Just use LOCAL_QUORUM for writes

Secondary Indexes

Secondary indexes in Cassandra are local to each node. When you create an index on an attribute, each node indexes its local data.

-- Create secondary index on frequently queried column
CREATE INDEX IF NOT EXISTS idx_customer_email
ON orders.customers (email);

Secondary indexes work best for columns of moderate cardinality whose values are spread across many partitions. They break down for high-cardinality attributes like UUIDs, where every node carries a massive index, and for very low-cardinality attributes like booleans, where an index lookup touches most of the data anyway.

For high-cardinality lookups, use a separate table. Denormalization is expected in Cassandra and often beats indexed queries.
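The lookup-table pattern can be sketched in plain Python (two dicts standing in for two Cassandra tables; the names are illustrative, and in a real cluster these would be two INSERTs per write):

```python
# Base table: customer_id -> row; lookup table: email -> customer_id
customers_by_id = {}
customer_id_by_email = {}

def insert_customer(customer_id, email, name):
    # In Cassandra: one INSERT per table, keeping the lookup table in sync
    customers_by_id[customer_id] = {'email': email, 'name': name}
    customer_id_by_email[email] = customer_id

def find_by_email(email):
    # One single-partition read in the lookup table, then a point read -
    # no scatter-gather across every node, unlike a secondary index
    cid = customer_id_by_email.get(email)
    return None if cid is None else customers_by_id[cid]

insert_customer('c-1', 'a@b.com', 'Alice')
print(find_by_email('a@b.com'))
```

The cost is write amplification (two writes instead of one) in exchange for reads that always hit a known partition.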


Use Cases Where Cassandra Excels

Cassandra fits when:

  • High write throughput: Logging, IoT sensor data, event streams
  • Time-series data: Metrics, sensor readings, user activity
  • Geo-distributed data: Multi-region with local reads
  • Always-available writes: Dynamo-style “your write will never be rejected”

Cassandra struggles when:

  • Read-heavy with complex queries: PostgreSQL or Elasticsearch
  • ACID transactions across entities: Cassandra’s per-partition transactions cannot span entities
  • Small datasets: Overhead only makes sense at scale
  • Strong consistency: QUORUM or ALL work, but expect latency


When to Use / When Not to Use Cassandra

When to Use Cassandra

For write-heavy workloads at scale, Cassandra truly shines. IoT sensor data, log aggregation, and event streams all play to Cassandra’s strengths because it is built from the ground up for append-heavy access patterns. If you need to write to any region and read locally with tunable consistency, multi-region active-active deployment comes naturally. Cassandra also never rejects writes — that Dynamo-style “your write will never be rejected” semantics means no capacity planning surprises. Time-series data works well too, since timestamp-based access patterns map naturally to Cassandra’s partition model.

When Not to Use Cassandra

That said, Cassandra is not the right tool for every job. Read-heavy workloads with complex queries do not play to its strengths — it is not a general-purpose database, and complex secondary-index queries are slow and do not scale. Per-partition transactions only means cross-entity operations require saga patterns or similar workarounds. If strong consistency is a hard requirement, QUORUM reads and writes give you linearizability but with latency trade-offs you may not want. And for small datasets, the overhead of Cassandra’s architecture only pays off once you reach meaningful scale.

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Node failure during write | Write succeeds if replication factor met; failed node goes into hinted handoff recovery | Monitor node health; use nodetool repair regularly; hinted handoff queues have a 3-hour window |
| Tombstone accumulation | Deleted data leaves tombstones; queries must scan past them; read latency spikes | Configure gc_grace_seconds appropriately; run compaction regularly; avoid delete-heavy workloads without compaction strategy |
| Partition size exceeds limit | Wide rows with millions of cells cause read timeouts, memory pressure | Monitor partition size; split large partitions; use bucketing to limit partition width |
| Repair-induced latency spike | nodetool repair is I/O intensive; can cause read latency spikes on affected nodes | Run repairs during low-traffic windows; use incremental repair; consider Cassandra Reaper for scheduling |
| Consistency level misconfiguration | ONE reads may return stale data from failed replica | Use QUORUM for important reads; LOCAL_ONE for low-latency acceptable-staleness reads; understand each level's semantics |
| Batch statement misuse | Logged batch is fine; unlogged batch across multiple partitions is an anti-pattern | Use logged batches only for ordered updates to the same partition; treat batch as a last resort |
| SSTable compaction pressure | Compaction struggles to keep up with write throughput; disk fills | Choose compaction strategy based on workload (TWCS for time-series, LCS for general); monitor compaction backlog |
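The bucketing mitigation above can be sketched as a composite partition key (a day bucket here; bucket granularity is workload-dependent and a hypothetical choice in this sketch):

```python
from datetime import datetime, timezone

def partition_key(sensor_id, ts):
    """Compose a (sensor_id, day) partition key so one sensor's readings
    are split into one bounded partition per day instead of a single
    ever-growing partition."""
    return (sensor_id, ts.strftime('%Y-%m-%d'))

ts = datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)
print(partition_key('sensor-42', ts))   # ('sensor-42', '2024-03-05')
```

In CQL this corresponds to PRIMARY KEY ((sensor_id, day), ts): the double parentheses make both columns part of the partition key, and queries supply the day bucket explicitly.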

Summary

Cassandra is purpose-built for write-heavy, globally distributed workloads where availability matters more than strong consistency. Its peer-to-peer architecture removes single points of failure, and tunable consistency lets you dial in the right trade-off per query.

Key architectural decisions:

  • Append-only write path for throughput
  • Peer-to-peer gossip for coordination
  • Consistency levels from ONE to ALL
  • Compaction strategies for different access patterns
  • NetworkTopologyStrategy for DC awareness

Cassandra rewards good data modeling. Design tables around your query patterns, denormalize aggressively, and pick the right compaction strategy. Do this and you have a database scaling to millions of writes per second with predictable latency.

Facebook chose Cassandra to handle billions of messages with five-nines availability. That same architecture serves IoT platforms, messaging systems, and analytics pipelines today.
