Apache Cassandra: Distributed Column Store Built for Scale
Explore Apache Cassandra's peer-to-peer architecture, CQL query language, tunable consistency, compaction strategies, and use cases at scale.
Facebook built Cassandra in 2007 to solve inbox search, needing to store and search hundreds of millions of messages with low latency. Existing solutions did not scale. Cassandra borrowed ideas from Amazon’s Dynamo and Google’s Bigtable.
Facebook open-sourced it in 2008, and Apache promoted it to a top-level project in 2010. Today Discord uses it for message storage, Spotify for time-series data, Netflix for viewing analytics, and Apple runs some of the largest deployments in existence. Cassandra's strength is write-heavy throughput that scales linearly as you add nodes, with no single point of failure.
Architecture Fundamentals
Cassandra uses a peer-to-peer architecture. Every node is equal. No master, no coordinator bottleneck, no special node whose failure brings down the cluster. Data distributes via consistent hashing, and any node can serve any request.
A 100-node Cassandra cluster has no bottleneck at any particular node. Writes spread evenly, letting the cluster handle traffic spikes that would crush a single-node database.
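The ring lookup behind this can be sketched in a few lines. This is a toy model: md5 stands in for Cassandra's Murmur3 partitioner, and the node names are illustrative.

```python
import bisect
import hashlib

def token(key: str) -> int:
    # Stand-in for the Murmur3 partitioner: any stable, well-distributed
    # integer hash works for the sketch
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring: each node owns the token range up to its position."""
    def __init__(self, nodes):
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def owner(self, partition_key: str) -> str:
        # First node at or after the key's token owns it (wrapping at the end)
        i = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return self._ring[i][1]

ring = Ring([f"node-{i}" for i in range(4)])
# Any node (or client) can compute the owner locally - no central lookup needed
print(ring.owner("customer-123"))
```

Because the mapping is pure arithmetic over shared metadata, there is no directory service to consult and no coordinator to overload.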
graph TD
A[Client] --> B[Node 1]
A --> C[Node 2]
A --> D[Node 3]
A --> E[Node N]
B -->|Gossip| C
C -->|Gossip| D
D -->|Gossip| B
B --- G[(Partition 1)]
C --- H[(Partition 2)]
D --- I[(Partition N)]
The gossip protocol keeps all nodes aware of cluster state. Each node periodically exchanges state with a few others, and information spreads cluster-wide. Gossip handles additions, failures, and recoveries without central coordination.
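A toy simulation of that spreading behavior, under loose assumptions: real gossip exchanges versioned heartbeat state rather than bare membership sets, and the peer count and node names here are made up.

```python
import random

def gossip_round(state, peers_per_round=2):
    """One gossip round: every node exchanges its known-nodes set with a few random peers."""
    nodes = list(state)
    for node in nodes:
        for peer in random.sample([n for n in nodes if n != node], peers_per_round):
            merged = state[node] | state[peer]   # both sides learn what the other knows
            state[node] = state[peer] = merged
    return state

# Each node initially knows only itself plus one seed node
state = {f"node-{i}": {f"node-{i}", "node-0"} for i in range(8)}
rounds = 0
while any(known != set(state) for known in state.values()):
    state = gossip_round(state)
    rounds += 1
print(f"cluster state converged in {rounds} rounds")
```

Information spreads epidemically, so convergence takes on the order of log(N) rounds even in large clusters.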
Data Model
Cassandra’s data model looks like SQL but differs in important ways:
- Keyspace: Container for tables, like a database schema
- Table: Collection of rows with a primary key
- Row: Single record identified by its primary key (partition key plus any clustering columns)
- Column: Name-value pair with a data type
-- Create a keyspace with replication strategy
CREATE KEYSPACE IF NOT EXISTS orders
WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'datacenter1': 3
};
-- Create a table with a simple primary key
CREATE TABLE orders.customers (
customer_id UUID,
email TEXT,
name TEXT,
created_at TIMESTAMP,
PRIMARY KEY (customer_id)
);
-- Table with a compound primary key (partition key + clustering key)
CREATE TABLE orders.order_items (
customer_id UUID, -- partition key
order_id TIMEUUID, -- clustering key
product_id UUID,
quantity INT,
price DECIMAL,
PRIMARY KEY (customer_id, order_id)
);
The partition key decides which node stores the data. The clustering key controls sort order within a partition. This design optimizes for the common pattern: fetching all items for a specific customer ordered by time.
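A rough sketch of that physical layout, using a plain dict of sorted lists. String order IDs stand in for TIMEUUIDs here; the point is only that rows co-locate by partition key and stay sorted by clustering key.

```python
import bisect
from collections import defaultdict

# Toy layout: one sorted row list per partition
partitions = defaultdict(list)

def insert(customer_id, order_id, item):
    # All rows for a customer land in one partition, kept sorted by clustering key
    bisect.insort(partitions[customer_id], (order_id, item))

def items_for(customer_id, since=None):
    # Reads within a partition are a contiguous, already-sorted scan
    rows = partitions[customer_id]
    if since is None:
        return rows
    return [r for r in rows if r[0] > since]

insert("cust-1", "2024-01-02T10:00", "widget")
insert("cust-1", "2024-01-01T09:00", "gadget")
insert("cust-2", "2024-01-01T09:30", "gizmo")
print(items_for("cust-1"))  # sorted by order_id within the partition
```

Fetching a customer's recent orders is a single-partition, pre-sorted range read, which is exactly the access pattern the schema was designed for.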
CQL: Cassandra Query Language
CQL looks like SQL but behaves differently. Key restrictions:
- No JOINs (denormalize instead)
- No subqueries in most contexts
- No aggregate queries across partitions (without Spark)
- Queries must use the primary key or an index
-- Insert data
INSERT INTO orders.customers (customer_id, email, name, created_at)
VALUES (uuid(), 'alice@example.com', 'Alice Smith', toTimestamp(now()));
-- Query by partition key (efficient)
SELECT * FROM orders.customers WHERE customer_id = ?;
-- Query by non-primary key (requires secondary index - avoid in hot paths)
SELECT * FROM orders.customers WHERE email = 'alice@example.com';
-- Query by clustering key range (efficient within partition)
SELECT * FROM orders.order_items
WHERE customer_id = ? AND order_id > minTimeuuid('2024-01-01');
These restrictions exist for good reason. JOINs across distributed tables need network round-trips between nodes, which kills performance. By requiring partition-targeted queries, Cassandra routes requests directly to relevant nodes.
Tunable Consistency
Tunable consistency is Cassandra’s most powerful feature. Choose consistency per query, trading off between consistency and performance:
from uuid import uuid4
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel
cluster = Cluster(['192.168.1.1', '192.168.1.2'])
session = cluster.connect('orders')
# Write with quorum consistency (strongly consistent writes);
# the consistency level is set on the statement, not on execute()
insert = SimpleStatement(
    "INSERT INTO customers (customer_id, email) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM
)
session.execute(insert, [uuid4(), 'alice@example.com'])
# Read with ONE consistency (fast, possibly stale)
select = SimpleStatement(
    "SELECT * FROM customers WHERE customer_id = %s",
    consistency_level=ConsistencyLevel.ONE
)
session.execute(select, [customer_id])
Common consistency levels:
| Level | Description | Use Case |
|---|---|---|
| ONE | Any single replica | Maximum speed, potentially stale |
| TWO | Two replicas | Better freshness, still fast |
| THREE | Three replicas | Fresh data, higher latency |
| QUORUM | Majority of replicas (N/2+1) | Balanced consistency |
| ALL | All replicas | Strongest consistency, slowest |
| LOCAL_QUORUM | Quorum in local DC | Low latency for multi-DC |
| EACH_QUORUM | Quorum in every DC | Strong multi-DC consistency |
With 3 replicas, quorum means 2 nodes must acknowledge writes. This gives strong consistency while tolerating one node down.
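The arithmetic is simple enough to sketch. The key property is that read and write quorums overlap whenever R + W > N, so a quorum read always touches at least one replica that saw the latest quorum write.

```python
def quorum(replication_factor: int) -> int:
    # Majority of replicas: floor(N/2) + 1
    return replication_factor // 2 + 1

# Quorum reads see quorum writes because the two sets must intersect
for n in (3, 5):
    r = w = quorum(n)
    assert r + w > n   # overlap guarantee: R + W > N
    print(f"RF={n}: quorum={r}, tolerates {n - r} node(s) down")
```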
Client-Side Token-Aware Routing
Cassandra’s driver computes the token (MurmurHash) of the partition key and routes queries directly to the replica that owns that token range. This avoids hitting every node in the cluster for every query.
from cassandra.cluster import Cluster
# Token-aware routing is enabled by default in recent drivers:
# the default policy is TokenAwarePolicy wrapping DCAwareRoundRobinPolicy
cluster = Cluster(['192.168.1.1', '192.168.1.2', '192.168.1.3'])
session = cluster.connect('orders')
# The driver computes the Murmur3 token of the bound partition key,
# finds which node owns that token range,
# and sends the query directly to that node
result = session.execute(
    "SELECT * FROM orders.customers WHERE customer_id = %s",
    [customer_id]  # driver routes to a replica owning this partition
)
Without token-aware routing:
A query for customer-123's partition would first land on an arbitrary coordinator node, which would then forward it to the replicas and relay the response. This adds an extra network hop.
With token-aware routing:
The driver knows which replica owns the partition. It sends the query directly to that node.
graph TD
A[Client with Token-Aware Driver] -->|Direct query to replica| B[Replica 1 - owns partition]
A -->|Coordinator hop| C[Any Node]
C -->|Forwards query| B
How token mapping works:
Cassandra uses MurmurHash3 for token computation. The Murmur3 token range (-2^63 to 2^63-1) is divided among nodes. The driver builds a token map from cluster metadata:
# Driver token map inspection
from cassandra.cluster import Cluster
cluster = Cluster(['192.168.1.1'])
session = cluster.connect()
# The driver maintains a token map built from cluster metadata
token_map = cluster.metadata.token_map
# Look up which replicas own a given partition key
# (get_replicas expects the serialized partition key bytes)
replicas = cluster.metadata.get_replicas('orders', b'customer-123')
print(f"Replicas for the partition: {replicas}")
Token-aware routing and consistency levels:
Token-aware routing works with all consistency levels. For QUORUM, the chosen coordinator (ideally itself a replica) serves or requests the full data from one replica and sends digest requests to the others in parallel, waiting until enough responses satisfy the quorum.
# Behavior at QUORUM consistency with 3 replicas:
# 1. Driver computes the token -> routes to a replica that owns it
# 2. That coordinator obtains the full data from one replica (often itself)
# 3. It sends digest queries to the other replicas in parallel
# 4. It waits until 2 responses (quorum) are in
# 5. If digests match, return result; if not, fetch full data and repair
stmt = SimpleStatement(query, consistency_level=ConsistencyLevel.QUORUM)
session.execute(stmt, parameters)  # coordinator handles the parallel digest reads
When token-aware routing matters most:
| Scenario | Impact |
|---|---|
| Single-partition reads | High - eliminates coordinator hop |
| Cross-partition queries (multi-key IN, token range scans) | Low - must fan out anyway |
| Batch statements | Medium - each statement routed independently |
| ALLOW FILTERING queries | None - requires full table scan |
Token-aware routing limitations:
- Only works for queries with partition key equality (e.g., WHERE customer_id = ?)
- Range queries over the partition key (e.g., WHERE token(customer_id) > ?) span many nodes and cannot target a single replica
- Batch statements with multiple partition keys fan out to multiple nodes
- ALLOW FILTERING defeats token awareness entirely
Driver configuration for token-aware routing:
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy
from cassandra import ConsistencyLevel
# Execution profiles carry the load-balancing policy plus per-query settings;
# profiles are registered on the Cluster and referenced by name at execute time
fast_profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc='us-east-1')
    ),
    consistency_level=ConsistencyLevel.ONE,
    request_timeout=2.0
)
strong_profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc='us-east-1')
    ),
    consistency_level=ConsistencyLevel.QUORUM,
    request_timeout=10.0
)
cluster = Cluster(
    ['192.168.1.1', '192.168.1.2'],
    execution_profiles={EXEC_PROFILE_DEFAULT: strong_profile, 'fast': fast_profile}
)
session = cluster.connect('orders')
session.execute(query, execution_profile='fast')  # Fast local reads
Lightweight Transactions (LWT)
Cassandra supports linearizable consistency for single-partition writes through Lightweight Transactions (LWT), implemented using a Paxos consensus protocol. The IF clause in CQL triggers LWT behavior.
-- LWT: Only insert if user@example.com does not exist
INSERT INTO users (user_id, email, created_at)
VALUES ('user-123', 'user@example.com', toTimestamp(now()))
IF NOT EXISTS;
-- LWT: Conditional update - only update if current balance >= amount
UPDATE account_balances
SET balance = balance - 50
WHERE account_id = 'acc-123'
IF balance >= 50;
-- LWT: Compare-and-set pattern
UPDATE counters
SET count = 100
WHERE page_id = 'homepage'
IF count = 99; -- Only succeeds if count is currently 99
How LWT works internally:
LWT runs a Paxos round per partition (the full protocol also includes a read phase to evaluate the IF condition):
- Prepare phase: coordinator picks a ballot and asks replicas to promise not to accept lower ballots
- Accept phase: if a majority of replicas promise, the coordinator sends the accept message with the proposed value
- Commit phase: replicas commit the value and the coordinator responds to the client
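A minimal single-round sketch of the promise/accept logic. This is a teaching model, not Cassandra's implementation: real LWT derives ballots from timestamps, adds the condition-evaluation read, and persists Paxos state per partition; class and variable names here are illustrative.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1          # highest ballot promised so far
        self.accepted = None        # (ballot, value) last accepted, if any

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted   # promise, plus any previously accepted value
        return False, None

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    """One Paxos round: needs a majority of promises, then a majority of accepts."""
    majority = len(acceptors) // 2 + 1
    promises = [a.prepare(ballot) for a in acceptors]
    if sum(ok for ok, _ in promises) < majority:
        return None                      # lost the prepare phase - retry with a higher ballot
    # If any acceptor already accepted a value, we must re-propose it
    prior = [acc for ok, acc in promises if ok and acc]
    chosen = max(prior)[1] if prior else value
    accepts = sum(a.accept(ballot, chosen) for a in acceptors)
    return chosen if accepts >= majority else None

replicas = [Acceptor() for _ in range(3)]
print(propose(replicas, ballot=1, value="row-v1"))   # the round chooses "row-v1"
```

Note the re-proposal rule: a later round with a higher ballot cannot overwrite an already-accepted value, which is what makes concurrent conditional writes safe.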
sequenceDiagram
participant C as Coordinator
participant R1 as Replica 1
participant R2 as Replica 2
participant R3 as Replica 3
C->>R1: Prepare (p=1, v=value)
C->>R2: Prepare (p=1, v=value)
C->>R3: Prepare (p=1, v=value)
R1-->>C: Promise(p=1)
R2-->>C: Promise(p=1)
R3-->>C: Promise(p=1)
Note over C: Majority promised
C->>R1: Accept (p=1, v=value)
C->>R2: Accept (p=1, v=value)
C->>R3: Accept (p=1, v=value)
R1-->>C: Accepted
R2-->>C: Accepted
R3-->>C: Accepted
Note over C: Majority accepted
C->>R1: Commit
C->>R2: Commit
C->>R3: Commit
Latency cost:
LWT rounds add 4 network hops compared to a standard write. With 3 replicas across a datacenter:
| Operation | Latency (local DC) | Latency (cross-DC) |
|---|---|---|
| Standard write (QUORUM) | ~2-5ms | ~15-30ms |
| LWT write (QUORUM) | ~10-20ms | ~50-100ms |
LWT latency is higher because every operation goes through Paxos. Use LOCAL_QUORUM for same-DC operations when possible.
LWT contention and failure:
When multiple clients attempt LWT on the same partition simultaneously, only one proposal wins each round. The losers see their condition fail (applied = False) or hit a write timeout and must retry.
import time
from datetime import datetime, timedelta, timezone
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel, WriteTimeout
cluster = Cluster(['192.168.1.1', '192.168.1.2', '192.168.1.3'])
session = cluster.connect('users')
# LWT with retry logic for contention
query = SimpleStatement("""
    UPDATE user_sessions
    SET last_active = toTimestamp(now())
    WHERE user_id = 'user-123'
    IF last_active < %s
""", consistency_level=ConsistencyLevel.LOCAL_QUORUM)
cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
for attempt in range(5):
    try:
        result = session.execute(query, [cutoff])
        if result.was_applied:
            print("Update succeeded")
        else:
            print("Condition not met - session recently active")
        break
    except WriteTimeout:
        if attempt < 4:
            time.sleep(0.1 * (2 ** attempt))  # Exponential backoff under contention
            continue
        raise
When to use LWT:
| Use Case | LWT Appropriate |
|---|---|
| Ensure unique email on user creation | Yes - IF NOT EXISTS |
| Prevent double-spend / race conditions | Yes - IF balance >= amount |
| Distributed locks | Yes - with TTL-based lease patterns, though a dedicated coordination service is often a better fit |
| Counter increments | No - use Cassandra counters |
| High-throughput writes | No - LWT is 4-10x slower |
| Multi-partition transactions | No - LWT is single partition only |
LWT and Cassandra driver:
The Python driver surfaces the LWT applied flag on the result set, but configure the retry policy so LWTs are not blindly retried:
from cassandra.cluster import Cluster
from cassandra.policies import FallthroughRetryPolicy
cluster = Cluster(
    ['192.168.1.1', '192.168.1.2', '192.168.1.3'],
    default_retry_policy=FallthroughRetryPolicy()  # surface errors; retry LWTs in application code
)
Common LWT mistakes:
- Using LWT on high-contention rows (multiple concurrent updates fail repeatedly)
- Assuming LWT spans multiple partitions (it cannot)
- Not handling timeout and unavailable errors with appropriate backoff
- Using LWT for counters (counters have their own dedicated, non-Paxos implementation)
Materialized Views
Materialized views (MVs) in Cassandra automatically maintain a denormalized view of a base table. When rows in the base table change, Cassandra asynchronously updates the view. This differs from manual denormalization where your application must write to multiple tables.
-- Base table: orders by customer
CREATE TABLE orders.customers (
customer_id UUID,
order_id TIMEUUID,
order_total DECIMAL,
order_status TEXT,
created_at TIMESTAMP,
PRIMARY KEY (customer_id, order_id)
);
-- Materialized view: orders indexed by order_id
-- (the WHERE clause must cover every column in the view's primary key)
CREATE MATERIALIZED VIEW orders.orders_by_id AS
SELECT order_id, customer_id, order_total, order_status, created_at
FROM orders.customers
WHERE order_id IS NOT NULL AND customer_id IS NOT NULL
PRIMARY KEY (order_id, customer_id);
-- Materialized view: orders indexed by status
-- (the view's primary key must include all base-table primary key columns)
CREATE MATERIALIZED VIEW orders.orders_by_status AS
SELECT order_status, order_id, customer_id, order_total, created_at
FROM orders.customers
WHERE order_status IS NOT NULL AND order_id IS NOT NULL AND customer_id IS NOT NULL
PRIMARY KEY (order_status, order_id, customer_id);
How materialized view updates work:
- Write arrives at base table
- Coordinator writes to base table SSTable
- Local view building mutation is created
- View update is written to view’s partition
- View updates are atomic per partition but eventual across views
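The fan-out can be sketched with plain dicts. This model is synchronous for clarity (Cassandra applies view mutations asynchronously, as noted above); the table and column names are illustrative.

```python
# Toy materialized-view maintenance: a base write derives one mutation per view,
# each keyed by that view's partition key
base = {}                                  # (customer_id, order_id) -> row
views = {"by_id": {}, "by_status": {}}

def write_order(customer_id, order_id, status, total):
    row = {"customer_id": customer_id, "order_id": order_id,
           "status": status, "total": total}
    old = base.get((customer_id, order_id))
    base[(customer_id, order_id)] = row
    # Each view is re-keyed: if the view key changed, the old view row must be removed
    if old and old["status"] != status:
        del views["by_status"][(old["status"], order_id)]
    views["by_id"][order_id] = row
    views["by_status"][(status, order_id)] = row

write_order("cust-1", "ord-1", "PENDING", 99.0)
write_order("cust-1", "ord-1", "SHIPPED", 99.0)   # status change moves the view row
print(views["by_status"])                          # only the SHIPPED entry remains
```

The delete-then-insert step is why updates that change a view key column are the expensive case: every such base write implies a tombstone in the view.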
graph LR
A[Write to Base Table] --> B[Coordinator]
B --> C[Base Table SSTable]
B --> D[View Mutation]
D --> E[View 1 SSTable]
D --> F[View 2 SSTable]
D --> G[View N SSTable]
Eventual consistency trade-offs:
View updates are asynchronous and best-effort. This has important implications:
| Aspect | Behavior |
|---|---|
| View staleness | Views may lag base table by milliseconds to seconds |
| View failures | Failed view updates are not retried automatically |
| Lost writes | If a base write fails after succeeding on some replicas, view updates may be inconsistent |
| Delete propagation | Deletes from base table propagate to views |
from cassandra.cluster import Cluster
cluster = Cluster(['192.168.1.1', '192.168.1.2'])
session = cluster.connect('orders')
# Read from materialized view - may be slightly stale
rows = session.execute("""
SELECT * FROM orders.orders_by_id
WHERE order_id = %s
""", parameters=['some-order-uuid'])
# For critical reads requiring latest data, query the base table instead
# (the base table is keyed by customer_id, so the partition key is required)
rows = session.execute("""
    SELECT * FROM orders.customers
    WHERE customer_id = %s AND order_id = %s
""", parameters=['some-customer-uuid', 'some-order-uuid'])
When materialized views work well:
| Scenario | MV Works |
|---|---|
| Write-heavy workloads with multiple read patterns | Yes - writes maintain views automatically |
| Single-row updates across views | Yes - atomic per partition |
| Time-series with multiple access patterns | Yes - create views for different time ranges |
| Read-heavy workloads that can tolerate stale reads | Yes - views reduce read amplification |
When materialized views break down:
| Scenario | MV Problems |
|---|---|
| High contention on base row | View updates may be lost or delayed |
| Updates that change the view key | MV does not support updating the partition key |
| Very high write rates | View update backlog can grow unbounded |
| Strong consistency requirements | Views are eventually consistent |
| Complex aggregations | MVs cannot aggregate - they only denormalize |
View update failure handling:
Cassandra tracks view update failures in system.views_builds_in_progress and system.built_views. You can monitor and rebuild:
# Check view build status
nodetool viewbuildstatus orders orders_by_id
# Rebuild a materialized view if it falls out of sync
nodetool rebuildview orders orders_by_id
# Check for pending view updates
SELECT * FROM system.views_builds_in_progress;
SELECT * FROM system.built_views;
Materialized view limitations:
- View primary key must include all primary key columns from the base table
- Cannot update a column that is part of the view's primary key
- At most one non-primary-key column from the base table can appear in the view's primary key
- The view's TTLs come from the base-table writes; you cannot set a separate default TTL on the view
Write Path and Commit Log
Cassandra optimizes for write throughput. When you write:
- Write appends to the commit log (durability)
- Data goes to the memtable (in-memory buffer)
- Client gets acknowledgment immediately
- Memtable flushes to SSTable when full
The write path is append-only, which is fast on disk. B-tree databases must find the right location and overwrite; Cassandra just appends.
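The steps above can be sketched as a toy write path (structures and the flush threshold are illustrative):

```python
# Toy write path: append to a commit log for durability, buffer in a sorted
# memtable, flush to an immutable SSTable when the buffer fills
commit_log = []        # append-only durability record
memtable = {}          # in-memory buffer, sorted at flush time
sstables = []          # immutable on-disk segments (plain lists here)
MEMTABLE_LIMIT = 3

def write(key, value):
    commit_log.append((key, value))       # 1. sequential append (fast on disk)
    memtable[key] = value                 # 2. buffer in memory
    # 3. the client would be acknowledged here
    if len(memtable) >= MEMTABLE_LIMIT:   # 4. flush a full memtable to an SSTable
        sstables.append(sorted(memtable.items()))
        memtable.clear()

for i in range(4):
    write(f"k{i}", f"v{i}")
print(sstables)   # one flushed, sorted segment: k0..k2
print(memtable)   # k3 still buffered in memory
```

Nothing on this path does a random-access disk write, which is the core of Cassandra's write throughput.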
sequenceDiagram
participant C as Client
participant M as Memtable
participant L as Commit Log
participant S as SSTable
C->>L: Append to Commit Log (durability)
C->>M: Write to Memtable
M-->>C: Write Acknowledged
Note over M: When memtable is full
M->>S: Flush to SSTable
SSTables (Sorted String Tables) are immutable files on disk. Once written, never modified. Compaction merges SSTables and removes deleted data.
Compaction Strategies
Compaction merges SSTables and reclaims space after deletions or updates. Cassandra offers:
SizeTiered Compaction (STCS):
- Groups SSTables of similar size
- Good for write-heavy workloads
- Obsolete and deleted data lingers longer before compaction reclaims it
TimeWindow Compaction (TWCS):
- Groups SSTables by time window
- Ideal for time-series
- Simplifies TTL management
Leveled Compaction (LCS):
- Restricts SSTable size per level
- Better read performance (check at most one SSTable per level)
- More writes due to compaction
-- Specify compaction strategy when creating table
CREATE TABLE metrics.sensor_data (
sensor_id UUID,
timestamp TIMESTAMP,
value DOUBLE,
PRIMARY KEY (sensor_id, timestamp)
) WITH compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'HOURS',
'compaction_window_size': 1
};
For time-series with TTLs, TWCS simplifies tombstone cleanup. For general workloads, choose LCS or STCS depending on your read/write ratio.
Compaction Strategy Selection Decision Tree
Choosing the right compaction strategy requires understanding your workload characteristics. Here is a practical decision framework:
graph TD
A[Start: Choose Compaction Strategy] --> B{Data Model?}
B -->|Time-series / TTL| C[TWCS - TimeWindow Compaction]
B -->|Wide partitions / Heavy writes| D[STCS - SizeTiered Compaction]
B -->|General workload / Balanced| E{Latency Priority?}
E -->|Optimize reads| F[LCS - Leveled Compaction]
E -->|Optimize writes| D
E -->|Mixed workload| G{Partition Size?}
G -->|Wide partitions >100MB| D
G -->|Narrow partitions| H[TWCS or LCS]
C --> I[Best for: IoT, metrics, event logs]
D --> J[Best for: Write-heavy, archival]
F --> K[Best for: Read-heavy, time-series with range queries]
H --> L[Consider: LCS if reads matter, TWCS if time-bucketed]
Decision matrix by workload type:
| Workload | Primary Strategy | Alternative | Why |
|---|---|---|---|
| Time-series with TTL | TWCS | - | TTL expiration aligns with window compaction |
| Write-heavy (logs, events) | STCS | TWCS (if time-bucketed) | Append-heavy, infrequent reads |
| Read-heavy (user profiles) | LCS | TWCS | Frequent reads benefit from leveled layout |
| Mixed (balanced) | TWCS | LCS | Time-bucketed queries work well with TWCS |
| Heavy deletes | TWCS | STCS | Tombstones expire with time windows |
| Wide partitions | STCS | - | LCS struggles with partitions > 10MB |
| Counter tables | STCS | - | Counters have many updates, benefit from STCS |
Compaction strategy characteristics:
| Strategy | SSTable Levels | Write Amplification | Read Amplification | Space Amplification | Best For |
|---|---|---|---|---|---|
| STCS | N (unbounded) | 1x (low) | High (scans similar-sized SSTables) | High (may have duplicates) | Write-heavy |
| LCS | L0 + leveled L1-L7 | High (data is rewritten as it moves down levels) | Low (at most 1 SSTable per level) | Low (~10% overhead by design) | Read-heavy |
| TWCS | N per window | Low | Low | Low (timestamps co-located) | Time-series |
How to switch compaction strategies:
# WARNING: changing compaction strategy reorganizes data over time -
# understand the implications before doing this on a loaded cluster
# 1. Change the strategy via CQL ALTER TABLE (applies to future compactions)
cqlsh -e "ALTER TABLE orders.customers WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'HOURS',
  'compaction_window_size': 1
}"
# 2. Optionally rewrite existing SSTables under the new strategy
nodetool upgradesstables -a orders customers
# 3. Monitor compaction during the transition
nodetool compactionstats
Compaction strategy and Cassandra version:
| Strategy | Cassandra 2.1 | Cassandra 3.0+ | Notes |
|---|---|---|---|
| STCS | Yes | Yes | Default in 2.1 |
| LCS | Yes | Yes | Improved significantly in 3.0+ |
| TWCS | No | Yes | Added in 3.0.8 |
| DTCS | Yes | Yes (deprecated once TWCS landed) | Use TWCS instead; DTCS removed in later releases |
Common compaction mistakes:
- Using LCS for wide partitions: LCS assumes partitions fit in a single SSTable level. Wide partitions cause compaction failures and read timeouts.
- Using TWCS without time-based access: TWCS groups data by timestamp. If queries span many time windows, performance degrades.
- Changing strategies without rewriting SSTables: existing SSTables keep their old layout until upgradesstables rewrites them.
- Setting TWCS windows too small: windows under 1 hour cause excessive SSTable fragmentation; windows over a day defeat the tombstone-purging purpose.
- Ignoring compaction during a strategy change: the table is vulnerable to tombstone accumulation during the transition period.
Tombstone Handling
When you delete data in Cassandra, the deletion is not immediate. Cassandra writes a tombstone - a marker indicating the data is deleted - and the actual removal happens during compaction.
Why tombstones exist:
Cassandra’s append-only SSTable design means data cannot be overwritten in place. Deletions are writes that mark data as dead. The tombstone persists until compaction removes both the tombstone and the dead data it marks.
-- This writes a tombstone, not immediate deletion
DELETE FROM orders.order_items
WHERE order_id = 'order-123';
Tombstone lifecycle:
- Delete request arrives at the coordinator
- Coordinator writes a tombstone to the SSTable with the deletion timestamp
- Tombstone is visible immediately - queries exclude the deleted data
- During compaction, tombstones older than gc_grace_seconds are removed along with the data they shadow
- If a replica missed the deletion and repair does not run within gc_grace_seconds, that replica's stale copy can "resurrect" once the tombstone is purged elsewhere
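A toy model of the lifecycle (timestamps, structures, and the single-SSTable simplification are illustrative):

```python
# Toy tombstone lifecycle: a delete writes a timestamped marker; compaction
# drops the marker (and the data it shadows) only after gc_grace_seconds
GC_GRACE_SECONDS = 864_000    # 10 days, the default

def read(sstable, key, now):
    cell = sstable.get(key)
    if cell is None or cell["tombstone"]:
        return None           # tombstones shadow the data immediately
    return cell["value"]

def compact(sstable, now):
    # Keep live cells, and keep tombstones still inside the grace window
    return {k: c for k, c in sstable.items()
            if not c["tombstone"] or now - c["ts"] < GC_GRACE_SECONDS}

t0 = 1_700_000_000
table = {"order-123": {"value": "3 widgets", "tombstone": False, "ts": t0}}
table["order-123"] = {"value": None, "tombstone": True, "ts": t0 + 60}   # DELETE

print(read(table, "order-123", t0 + 120))             # None: shadowed at once
print("order-123" in compact(table, t0 + 3600))       # True: inside gc_grace
print("order-123" in compact(table, t0 + 1_000_000))  # False: marker purged
```

The grace window in `compact` is exactly the race described above: purge the marker before every replica has seen the delete and the stale copy wins.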
gc_grace_seconds and resurrection risk:
-- Table with gc_grace_seconds = 10 days (default)
-- This setting is critical for managing tombstones
CREATE TABLE orders.order_items (
order_id UUID,
item_id UUID,
quantity INT,
PRIMARY KEY (order_id, item_id)
) WITH gc_grace_seconds = 864000; -- 10 days in seconds
The gc_grace_seconds setting controls how long tombstones must survive before compaction can delete them. This window allows anti-entropy repair to propagate deletions to all replicas. If you run compaction before gc_grace_seconds elapses and a replica missed the original deletion, the deleted data can resurrect on that replica.
| Scenario | gc_grace_seconds | Risk |
|---|---|---|
| Single datacenter, frequent repairs | 86400 (1 day) | Low - repairs catch missed deletions quickly |
| Multi-DC, less frequent repairs | 432000 (5 days) | Medium - the window must still exceed the repair interval |
| Infrequent repairs | 864000 (10 days, default) | Lower - the larger window leaves more time for repair |
| No repairs ever | Any value | High - deletions may never reach all replicas; resurrection is likely |
Tombstone impact on read performance:
Queries must scan through tombstones to find live data. A partition with many tombstones can cause read timeouts even when the query returns few results.
-- This query scans all tombstones in the partition
-- May timeout on partitions with 100k+ tombstones
SELECT * FROM orders.order_items
WHERE order_id = 'big-order-with-many-items';
Diagnosing tombstone problems:
# Check tombstone pressure per table
nodetool tablestats orders.order_items -H
# Look for "Average tombstones per slice" and "Maximum tombstones per slice"
# Higher numbers indicate read pressure from tombstone scanning
# Enable trace for queries to see tombstone scanning
TRACING ON;
SELECT * FROM orders.order_items WHERE order_id = 'test';
TRACING OFF;
# Check compaction history
nodetool compactionhistory | grep order_items
Mitigation strategies:
- Time-window bucketing: Split time-series into daily or weekly partitions. Old partitions with tombstones are not queried.
CREATE TABLE metrics.sensor_data (
sensor_id UUID,
date TEXT, -- '2024-01-15' bucket
timestamp TIMESTAMP,
value DOUBLE,
PRIMARY KEY ((sensor_id, date), timestamp)
);
- Prevent wide rows with mixed active/deleted data: separate active entities from archived ones.
- Use TTL instead of explicit deletes: TTL-expired tombstones are handled cleanly by TWCS.
- Reduce gc_grace_seconds with frequent repairs: if you run repair daily, you can safely lower gc_grace_seconds.
- Monitor partition statistics: identify partitions with excessive tombstones before they cause outages.
# system.size_estimates has no tombstone column; per-partition tombstone counts
# come from nodetool tablestats or sstablemetadata. What you can query via CQL
# is partition-size estimates, a useful proxy for spotting problem partitions.
from cassandra.cluster import Cluster
cluster = Cluster(['192.168.1.1'])
session = cluster.connect()
rows = session.execute("""
    SELECT range_start, range_end, mean_partition_size, partitions_count
    FROM system.size_estimates
    WHERE keyspace_name = 'orders' AND table_name = 'order_items'
""")
for row in rows:
    if row.mean_partition_size > 100 * 1024 * 1024:  # > 100 MB mean partition
        print(f"Range {row.range_start}..{row.range_end}: "
              f"mean partition {row.mean_partition_size} bytes - investigate")
Anti-Entropy Repair
Cassandra uses anti-entropy repair to synchronize data between replicas and ensure consistency. The nodetool repair command triggers this process, which compares Merkle trees between replicas to find and fix discrepancies.
How Merkle trees work in Cassandra:
- Each replica builds a Merkle tree from its SSTable data
- Trees are hash-based binary trees where leaf nodes contain hash values of data ranges
- Replicas exchange tree roots to detect differences
- When roots differ, nodes traverse to find the exact range that diverged
- Only the differing data ranges are exchanged and repaired
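A toy Merkle comparison over four data ranges. The sha256 hash and range labels are illustrative, and real Cassandra sizes trees per token range and descends level by level rather than jumping straight to the leaves; the payoff shown here is the same: only the mismatched range needs streaming.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle(leaves):
    """Build levels bottom-up; level[0] is the leaf hashes, level[-1] the root."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h((prev[i] + prev[i + 1]).encode())
                       for i in range(0, len(prev), 2)])
    return levels

def diff_leaves(a, b):
    """Compare roots first; look at leaves only when the roots disagree."""
    if a[-1] == b[-1]:
        return []                 # roots match - replicas are in sync
    return [i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y]

r1 = merkle([b"range-0", b"range-1", b"range-2", b"range-3"])
r2 = merkle([b"range-0", b"range-1", b"STALE!!", b"range-3"])
print(diff_leaves(r1, r2))        # [2]: only this range needs to be streamed
```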
graph TD
A[Replica 1 SSTable] --> B[Merkle Tree Root Hash]
C[Replica 2 SSTable] --> D[Merkle Tree Root Hash]
B --> E{Compare Roots}
D --> E
E -->|Match| F[No repair needed]
E -->|Mismatch| G[Traverse tree to find differing range]
G --> H[Exchange only differing data]
Operational considerations:
- Running repair: nodetool repair -pr repairs only the node's primary ranges (run it on every node)
- Frequency: run repair at least once per gc_grace_seconds window; weekly is a common baseline, daily for critical data
- Resource cost: repair is expensive - it reads entire SSTables and generates network traffic
- Incremental repair: repairs only data not already marked repaired, spreading load across runs
- Pending load: nodetool compactionstats shows pending tasks before running repair
Common repair issues:
| Problem | Symptom | Mitigation |
|---|---|---|
| Repair running too long | Cluster slowdown | Use incremental repair |
| Merkle tree memory | OOM on large partitions | Subdivide large partitions |
| Overlap with writes | Inconsistency during repair | Repair during low-write windows |
| Node down during repair | Incomplete sync | Re-run repair after node recovers |
Anti-entropy vs Read Repair:
Anti-entropy repair is proactive (runs periodically), while read repair is reactive (happens during reads). Read repair compares replica digests during reads at higher consistency levels and fixes inconsistencies it discovers. Anti-entropy catches issues on data that reads alone would miss.
Read Repair: The Passive Consistency Path on Every Read
Read repair is Cassandra’s mechanism for passively repairing inconsistencies during normal read operations. It happens automatically without requiring separate repair jobs.
How read repair works:
When you read with a consistency level greater than ONE, Cassandra checks consistency across replicas and repairs discrepancies as part of the read path.
sequenceDiagram
participant C as Coordinator
participant R1 as Replica 1
participant R2 as Replica 2
participant R3 as Replica 3
C->>R1: Send data query
C->>R2: Send data query
C->>R3: Send data query
R1-->>C: Data (digest: abc123)
R2-->>C: Data (digest: abc123)
R3-->>C: Data (digest: xyz789) <- Digest mismatch!
Note over C: Digests differ - R3 has stale data
C->>R3: Send full data repair
R3-->>C: Latest data
C->>R1: Confirm sync (optional digest)
C->>R2: Confirm sync (optional digest)
Note over C: Return latest to client
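The digest-compare-and-repair flow in the diagram can be sketched as follows. The md5 digests and replica names are illustrative, and real resolution is per-cell last-write-wins rather than whole-row:

```python
import hashlib

# Toy read repair: the coordinator compares replica digests, resolves by
# write timestamp, and pushes the winner back to any stale replica
replicas = {
    "r1": {"value": "alice@new.com", "ts": 200},
    "r2": {"value": "alice@new.com", "ts": 200},
    "r3": {"value": "alice@old.com", "ts": 100},   # missed the latest write
}

def digest(cell):
    return hashlib.md5(f"{cell['value']}@{cell['ts']}".encode()).hexdigest()

def coordinated_read(replicas):
    digests = {name: digest(cell) for name, cell in replicas.items()}
    if len(set(digests.values())) > 1:
        # Mismatch: resolve by last-write-wins, then repair stale replicas
        latest = max(replicas.values(), key=lambda c: c["ts"])
        for name, cell in replicas.items():
            if cell["ts"] < latest["ts"]:
                replicas[name] = dict(latest)      # repair write to stale replica
        return latest["value"]
    return next(iter(replicas.values()))["value"]

print(coordinated_read(replicas))   # client always sees the newest value
print(replicas["r3"]["value"])      # r3 has been repaired as a side effect
```

The client never observes the stale value; the repair is a side effect of the read.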
Read repair with QUORUM (3 replicas):
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel
cluster = Cluster(['192.168.1.1', '192.168.1.2', '192.168.1.3'])
session = cluster.connect('orders')
# Read with QUORUM consistency
# Coordinator queries all 3 replicas
# Waits for 2 responses (quorum)
# If digests differ, repair happens as part of read
query = SimpleStatement("""
SELECT * FROM orders.customers WHERE customer_id = %s
""", consistency_level=ConsistencyLevel.QUORUM)
result = session.execute(query, ['customer-123'])
# During this read:
# 1. Coordinator sends to all 3 replicas
# 2. R1 and R2 return data with digest
# 3. Coordinator compares digests
# 4. If R3 digest differs, coordinator fetches latest from R1/R2 and sends to R3
# 5. Client gets consistent data
Read repair probability:
| Consistency Level | Read Repair Chance |
|---|---|
| ONE | 0% - no digest check |
| LOCAL_ONE | 0% - no digest check |
| QUORUM | ~33% (1 replica may differ with RF=3) |
| ALL | 100% - all replicas checked |
| EACH_QUORUM | 100% - all DCs checked |
With RF=3 and QUORUM, two of the three replicas participate in every read, so any mismatch between those two is always detected; a stale third replica only gets repaired on reads in which it happens to respond.
Read repair at ONE consistency:
Before Cassandra 4.0, tables could opt into probabilistic background read repair even at ONE consistency. These were per-table CQL options (not cassandra.yaml settings), and both were removed in 4.0:
read_repair_page_size: 1000
-- Table-specific read repair chance
CREATE TABLE orders.customers (
customer_id UUID,
email TEXT,
name TEXT,
PRIMARY KEY (customer_id)
) WITH read_repair_chance = 0.1
AND dclocal_read_repair_chance = 0.05;
dclocal_read_repair_chance vs read_repair_chance:
- dclocal_read_repair_chance: probability of read repair across the local DC only (faster; does not wait for cross-DC replicas)
- read_repair_chance: probability of read repair across all DCs (waits for all replicas)
| Setting | Scope | Latency Impact | Use Case |
|---|---|---|---|
| dclocal_read_repair_chance | Local DC only | Lower - no cross-DC coordination | Multi-DC with local-first reads |
| read_repair_chance | All DCs | Higher - cross-DC coordination | Critical data requiring global consistency |
Read repair and write latency interaction:
Read repair happens asynchronously after the coordinator receives enough responses to satisfy the consistency level. This means:
# QUORUM read timeline:
# T+0ms: Coordinator sends to all replicas
# T+2ms: QUORUM (2 replicas) respond - coordinator returns to client
# T+2ms to T+10ms: Background read repair to stale replica (if any)
The client sees low latency (just waiting for quorum), but the repair happens in the background.
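The digest-comparison step can be sketched in plain Python (a toy model, not driver code; replica contents are made up):

```python
# Toy model of QUORUM read repair: the coordinator takes the newest version
# among the contacted replicas and pushes it to any stale one.
def quorum_read_with_repair(replicas: dict):
    """replicas maps name -> (value, timestamp). Returns (value, repaired)."""
    contacted = dict(list(replicas.items())[:2])  # QUORUM with RF=3: 2 replicas
    # The newest timestamp wins
    winner_value, winner_ts = max(contacted.values(), key=lambda vt: vt[1])
    repaired = []
    for name, (value, ts) in contacted.items():
        if ts < winner_ts:  # digest mismatch: push the newest version
            replicas[name] = (winner_value, winner_ts)
            repaired.append(name)
    return winner_value, repaired

replicas = {"R1": ("alice@new.com", 200), "R2": ("alice@old.com", 100)}
value, repaired = quorum_read_with_repair(replicas)
print(value, repaired)  # newest value wins; stale replica R2 is repaired
```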
Monitoring read repair:
# ReadRepairStage thread pool activity (active, pending, completed tasks)
nodetool tpstats | grep -i readrepair
# Raise logging if you need to trace individual repairs; the package name
# varies by version (org.apache.cassandra.service.reads in 4.x)
nodetool setlogginglevels org.apache.cassandra.service.reads DEBUG
Per-table and cluster-wide read repair counters are exposed over JMX under org.apache.cassandra.metrics:type=ReadRepair (for example, RepairedBlocking and RepairedBackground). Collect them with jconsole or a metrics exporter; read repair activity is not recorded in CQL-queryable system tables.
When read repair is insufficient:
Read repair only fixes inconsistencies on partitions that are actively read. Partitions that are never read accumulate inconsistencies until anti-entropy repair runs. This is why anti-entropy repair (nodetool repair) is essential for long-term consistency, especially for rarely-accessed data.
| Scenario | Read Repair | Anti-Entropy Repair |
|---|---|---|
| Frequently accessed partitions | Keeps them consistent | Rarely needed |
| Rarely accessed partitions | Does not run | Required periodically |
| Large partitions | Can cause latency spikes | Better - runs in background |
| Cross-DC consistency | dclocal_read_repair_chance skips remote DCs | Covers the full cluster |
Data Center Awareness
NetworkTopologyStrategy defines replication rules that respect data center boundaries:
- Lower latency by reading from local replicas
- Disaster resilience via cross-DC replication
- Workload isolation (analytics vs. transactional)
CREATE KEYSPACE orders WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'us-east-1': 3, -- 3 replicas in primary DC
'us-west-2': 3, -- 3 replicas in backup DC
'eu-west-1': 1 -- 1 replica in EU for GDPR
};
For cross-DC replication, LOCAL_ONE, LOCAL_QUORUM, and EACH_QUORUM let you control whether a request waits on remote datacenters, and therefore how much cross-DC round trips affect latency.
Multi-DC Consistency Pitfalls
Cassandra’s tunable consistency behaves differently across datacenters, and EACH_QUORUM has surprising behavior during DC failures that catches many teams off guard.
EACH_QUORUM behavior during DC failure:
EACH_QUORUM requires a quorum in every datacenter before responding to the client. If one DC is down, writes fail entirely even if all other DCs are available.
from uuid import uuid4
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel
cluster = Cluster(['192.168.1.1', '192.168.1.2'])  # contact points in us-east-1
session = cluster.connect('orders')
# EACH_QUORUM write - waits for quorum in EACH datacenter
# If us-west-2 DC is down, this write FAILS even if us-east-1 quorum succeeds
query = SimpleStatement("""
    INSERT INTO orders.customers (customer_id, email, name)
    VALUES (%s, %s, %s)
""", consistency_level=ConsistencyLevel.EACH_QUORUM)
try:
    session.execute(query, [uuid4(), 'a@b.com', 'Alice'])  # customer_id is a UUID
except Exception as e:
    print(f"EACH_QUORUM failed: {e}")
    # If us-west-2 is down, the write fails even though us-east-1 succeeded
The EACH_QUORUM failure scenario:
With replication factor 3 in each of 2 DCs (6 total replicas):
| DC | Replicas | Quorum Needed | Status |
|---|---|---|---|
| us-east-1 | 3 | 2 | Available |
| us-west-2 | 3 | 2 | DOWN |
- EACH_QUORUM write: requires 2 from us-east-1 AND 2 from us-west-2. FAILS because us-west-2 is unavailable.
- LOCAL_QUORUM write: requires 2 from us-east-1 only. SUCCEEDS.
- QUORUM write: requires 4 of the 6 total replicas. FAILS because the 3 us-west-2 replicas cannot respond.
graph TD
A[Write with EACH_QUORUM] --> B{DC us-west-2 available?}
B -->|No| C[Write FAILS - cannot satisfy EACH_QUORUM]
B -->|Yes| D[Write succeeds if both DCs have quorum]
C --> E[Even though us-east-1 has quorum]
E --> F[us-east-1: 3/3 replicas acknowledge]
F --> G[But requirement: quorum in EVERY DC]
G --> C
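The scenario above can be turned into a small calculator (a pure-Python sketch; DC names and replica counts mirror the example):

```python
# Sketch: does a write succeed at a given consistency level, given how many
# replicas are alive per DC? RF is 3 in each DC, as in the example above.
def quorum(rf: int) -> int:
    return rf // 2 + 1

def write_succeeds(level: str, alive: dict, rf: dict, local_dc: str) -> bool:
    if level == "EACH_QUORUM":   # quorum in every DC
        return all(alive[dc] >= quorum(rf[dc]) for dc in rf)
    if level == "LOCAL_QUORUM":  # quorum in the coordinator's DC only
        return alive[local_dc] >= quorum(rf[local_dc])
    if level == "QUORUM":        # quorum over all replicas cluster-wide
        return sum(alive.values()) >= quorum(sum(rf.values()))
    raise ValueError(level)

rf = {"us-east-1": 3, "us-west-2": 3}
alive = {"us-east-1": 3, "us-west-2": 0}   # us-west-2 is down

for level in ("EACH_QUORUM", "LOCAL_QUORUM", "QUORUM"):
    print(level, write_succeeds(level, alive, rf, "us-east-1"))
# EACH_QUORUM False, LOCAL_QUORUM True, QUORUM False (needs 4 of 6)
```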
Why teams pick EACH_QUORUM incorrectly:
EACH_QUORUM seems appealing for “strong consistency across DCs,” but most applications do not need it. LOCAL_QUORUM gives you strong consistency within each DC with eventual consistency across DCs, which matches how most applications actually work.
| Consistency Level | Use When |
|---|---|
| EACH_QUORUM | Financial transactions requiring all DCs to acknowledge; accepting that any DC failure causes unavailability |
| LOCAL_QUORUM | Most multi-DC deployments; strong locally, eventual across DCs |
| LOCAL_ONE | Non-critical reads; can tolerate stale local data |
Read consistency cross-DC:
Reads also behave unexpectedly with EACH_QUORUM:
# EACH_QUORUM read - returns data if ALL DCs have quorum available
# If any DC is down, returns unavailable
query = SimpleStatement("""
SELECT * FROM orders.customers WHERE customer_id = %s
""", consistency_level=ConsistencyLevel.EACH_QUORUM)
For reads, EACH_QUORUM waits for a quorum of responses from every DC and returns the newest version among them, so every read's latency is bounded by the slowest datacenter, which is rarely what you want.
DC failure during write:
When a DC goes down mid-write with EACH_QUORUM:
- Coordinator receives write
- Successfully writes to available DCs (us-east-1: 3/3)
- Fails to write to down DC (us-west-2: 0/3)
- Returns error to client
- Down DC has no record of the write
When the DC comes back up, it replays hints (if hints are enabled) but may miss writes that completed during the outage window.
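Hinted handoff can be modeled as a per-replica queue of missed writes (a toy sketch; real hints are stored on disk and replayed subject to the hint window):

```python
# Toy hinted handoff: the coordinator stores a hint for each down replica
# and replays it when the replica recovers, unless the hint has expired.
HINT_WINDOW = 3 * 3600  # seconds; mirrors the 3-hour default window

class Coordinator:
    def __init__(self, replicas):
        self.up = {r: True for r in replicas}
        self.data = {r: {} for r in replicas}
        self.hints = {r: [] for r in replicas}   # (key, value, written_at)

    def write(self, key, value, now):
        for r in self.up:
            if self.up[r]:
                self.data[r][key] = value
            else:
                self.hints[r].append((key, value, now))  # queue a hint

    def recover(self, replica, now):
        self.up[replica] = True
        for key, value, written_at in self.hints[replica]:
            if now - written_at <= HINT_WINDOW:  # expired hints are dropped
                self.data[replica][key] = value
        self.hints[replica] = []

c = Coordinator(["R1", "R2", "R3"])
c.up["R3"] = False
c.write("order-1", "shipped", now=0)
c.recover("R3", now=600)            # back within the window: hint replayed
print(c.data["R3"].get("order-1"))  # shipped
```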
Mitigation strategies:
- Use LOCAL_QUORUM instead of EACH_QUORUM unless you have a specific requirement for all-DC acknowledgment.
- Configure hinted handoff for DC failure recovery:
# cassandra.yaml
hints_directory: /var/lib/cassandra/hints
max_hint_window_in_ms: 10800000 # 3 hours; hints stop accumulating for nodes down longer
hints_flush_period_in_ms: 10000
- Monitor DC availability:
from cassandra.cluster import Cluster
cluster = Cluster(['192.168.1.1'])
session = cluster.connect()
# system.peers has no index on data_center, so filter client-side
rows = session.execute("SELECT peer, data_center, rack, rpc_address FROM system.peers")
for row in rows:
    if row.data_center == 'us-west-2':
        print(f"DC us-west-2 node: {row.rpc_address}")
# system.peers does not record up/down state; use nodetool status, or the
# driver's cluster metadata, where each Host exposes is_up
for host in cluster.metadata.all_hosts():
    print(host.address, host.datacenter, host.is_up)
- Set write acknowledgments appropriately:
-- Acknowledgment scope is set per query via the consistency level, not in
-- the keyspace definition; the keyspace controls replica placement:
CREATE KEYSPACE orders WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3,
  'us-west-2': 3
} AND DURABLE_WRITES = true; -- commit-log durability (the default)
-- For global acknowledgment (financial/regulatory): write at EACH_QUORUM
-- For local-first writes (most applications): write at LOCAL_QUORUM
Secondary Indexes
Secondary indexes in Cassandra are local to each node. When you create an index on an attribute, each node indexes its local data.
-- Create secondary index on frequently queried column
CREATE INDEX IF NOT EXISTS idx_customer_email
ON orders.customers (email);
Secondary indexes work best for moderate-cardinality columns spread across many partitions. They break down for high-cardinality columns like UUIDs: every query must fan out to all nodes, and each node's index grows nearly as large as the data itself.
For high-cardinality lookups, use a separate table. Denormalization is expected in Cassandra and often beats indexed queries.
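The lookup-table pattern can be sketched application-side (plain dicts stand in for the two Cassandra tables; the table and column names are illustrative):

```python
# Denormalization sketch: instead of indexing customers.email, maintain a
# second table keyed by email and write to both tables on every insert.
customers_by_id = {}     # stands in for orders.customers (PK: customer_id)
customers_by_email = {}  # stands in for orders.customers_by_email (PK: email)

def insert_customer(customer_id, email, name):
    row = {"customer_id": customer_id, "email": email, "name": name}
    customers_by_id[customer_id] = row   # write 1: primary table
    customers_by_email[email] = row      # write 2: query table

def find_by_email(email):
    # Single-partition read instead of a cluster-wide index scan
    return customers_by_email.get(email)

insert_customer("cust-123", "alice@example.com", "Alice")
print(find_by_email("alice@example.com")["customer_id"])  # cust-123
```

The cost is one extra write per insert, which Cassandra's write path absorbs cheaply; the benefit is that every email lookup hits exactly one partition.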
Use Cases Where Cassandra Excels
Cassandra fits when:
- High write throughput: Logging, IoT sensor data, event streams
- Time-series data: Metrics, sensor readings, user activity
- Geo-distributed data: Multi-region with local reads
- Always-available writes: Dynamo-style “your write will never be rejected”
Cassandra struggles when:
- Read-heavy with complex queries: PostgreSQL or Elasticsearch
- ACID transactions across entities: Cassandra’s per-partition transactions cannot span entities
- Small datasets: Overhead only makes sense at scale
- Strong consistency: QUORUM or ALL work, but expect latency
Related Concepts
- NoSQL Databases covers the broader NoSQL landscape
- CAP Theorem explains why Cassandra chose tunable availability
- Consistent Hashing describes the partitioning mechanism
- Database Replication explains the replica synchronization model
When to Use / When Not to Use Cassandra
When to Use Cassandra
For write-heavy workloads at scale, Cassandra truly shines. IoT sensor data, log aggregation, and event streams all play to Cassandra’s strengths because it is built from the ground up for append-heavy access patterns. If you need to write to any region and read locally with tunable consistency, multi-region active-active deployment comes naturally. Cassandra also never rejects writes — that Dynamo-style “your write will never be rejected” semantics means no capacity planning surprises. Time-series data works well too, since timestamp-based access patterns map naturally to Cassandra’s partition model.
When Not to Use Cassandra
That said, Cassandra is not the right tool for every job. Read-heavy workloads with complex queries do not play to its strengths: it is not a general-purpose database, and complex secondary-index queries are slow and do not scale. Atomicity is per-partition only, so cross-entity operations require saga patterns or similar workarounds. If strong consistency is a hard requirement, QUORUM reads and writes give it to you at a latency cost, and true linearizability requires lightweight transactions. And for small datasets, the overhead of Cassandra’s architecture only pays off once you reach meaningful scale.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Node failure during write | Write succeeds if replication factor met; failed node goes into hinted handoff recovery | Monitor node health; use nodetool repair regularly; hinted handoff queues have a 3-hour window |
| Tombstone accumulation | Deleted data leaves tombstones; queries must scan past them; read latency spikes | Configure gc_grace_seconds appropriately; run compaction regularly; avoid delete-heavy workloads without compaction strategy |
| Partition size exceeds limit | Wide rows with millions of cells cause read timeouts, memory pressure | Monitor partition size; split large partitions; use bucketing to limit partition width |
| Repair-induced latency spike | nodetool repair is I/O intensive; can cause read latency spikes on affected nodes | Run repairs during low-traffic windows; use incremental repair; consider Cassandra reaper for scheduling |
| Consistency level misconfiguration | ONE reads may return stale data from failed replica | Use QUORUM for important reads; LOCAL_ONE for low-latency acceptable-staleness reads; understand each level’s semantics |
| Batch statement misuse | Logged batch is fine; unlogged batch across multiple partitions is an anti-pattern | Use logged batches only for ordered updates to the same partition; treat batch as a last resort |
| SSTable compaction pressure | Compaction struggles to keep up with write throughput; disk fills | Choose compaction strategy based on workload (TWCS for time-series, LCS for general); monitor compaction backlog |
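The bucketing mitigation from the table above can be sketched as follows (day-sized buckets; the key layout is illustrative):

```python
# Sketch: cap partition width by adding a time bucket to the partition key.
# The key becomes (sensor_id, day_bucket) instead of just sensor_id, so one
# sensor's readings are split across one partition per day.
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400

def partition_key(sensor_id, ts):
    day_bucket = int(ts.timestamp()) // SECONDS_PER_DAY
    return (sensor_id, day_bucket)

t1 = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc)
t3 = datetime(2024, 1, 2, 0, 1, tzinfo=timezone.utc)

print(partition_key("sensor-7", t1) == partition_key("sensor-7", t2))  # True
print(partition_key("sensor-7", t2) == partition_key("sensor-7", t3))  # False
```

Queries then include the bucket in the WHERE clause, so a day of readings is still a single-partition read while no partition grows without bound.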
Summary
Cassandra is purpose-built for write-heavy, globally distributed workloads where availability matters more than strong consistency. Its peer-to-peer architecture removes single points of failure, and tunable consistency lets you dial in the right trade-off per query.
Key architectural decisions:
- Append-only write path for throughput
- Peer-to-peer gossip for coordination
- Consistency levels from ONE to ALL
- Compaction strategies for different access patterns
- NetworkTopologyStrategy for DC awareness
Cassandra rewards good data modeling. Design tables around your query patterns, denormalize aggressively, and pick the right compaction strategy. Do this and you have a database scaling to millions of writes per second with predictable latency.
Facebook chose Cassandra to handle billions of messages with five-nines availability. That same architecture serves IoT platforms, messaging systems, and analytics pipelines today.