Apache Cassandra: Distributed Column Store Built for Scale
Explore Apache Cassandra's peer-to-peer architecture, CQL query language, tunable consistency, compaction strategies, and use cases at scale.
Facebook built Cassandra in 2007 to solve inbox search, needing to store and search hundreds of millions of messages with low latency. Existing solutions did not scale. Cassandra borrowed ideas from Amazon’s Dynamo and Google’s Bigtable.
Facebook open-sourced it in 2008, and Apache promoted it to a top-level project in 2010. Today Discord uses it for message storage, Spotify for time-series data, Netflix for viewing analytics, and Apple runs some of the largest deployments in existence. Cassandra's strength is write-heavy throughput that scales linearly as you add nodes, with no single point of failure.
Architecture Fundamentals
Cassandra uses a peer-to-peer architecture. Every node is equal. No master, no coordinator bottleneck, no special node whose failure brings down the cluster. Data distributes via consistent hashing, and any node can serve any request.
A 100-node Cassandra cluster has no bottleneck at any particular node. Writes spread evenly, letting the cluster handle traffic spikes that would crush a single-node database.
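The ring lookup behind this can be sketched in a few lines. This is a toy model: md5 stands in for Cassandra's Murmur3 partitioner, and the node names are illustrative.

```python
import bisect
import hashlib

def token(key: str) -> int:
    # Stand-in for the Murmur3 partitioner: any stable, well-distributed
    # integer hash works for the sketch
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring: each node owns the token range up to its position."""
    def __init__(self, nodes):
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def owner(self, partition_key: str) -> str:
        # First node at or after the key's token owns it (wrapping at the end)
        i = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return self._ring[i][1]

ring = Ring([f"node-{i}" for i in range(4)])
# Any node (or client) can compute the owner locally - no central lookup needed
print(ring.owner("customer-123"))
```

Because the mapping is pure arithmetic over shared metadata, there is no directory service to consult and no coordinator to overload.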
graph TD
A[Client] --> B[Node 1]
A --> C[Node 2]
A --> D[Node 3]
A --> E[Node N]
B -->|Gossip| C
C -->|Gossip| D
D -->|Gossip| B
B --- G[(Partition 1)]
C --- H[(Partition 2)]
D --- I[(Partition N)]
The gossip protocol keeps all nodes aware of cluster state. Each node periodically exchanges state with a few others, and information spreads cluster-wide. Gossip handles additions, failures, and recoveries without central coordination.
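A toy simulation of that spreading behavior, under loose assumptions: real gossip exchanges versioned heartbeat state rather than bare membership sets, and the peer count and node names here are made up.

```python
import random

def gossip_round(state, peers_per_round=2):
    """One gossip round: every node exchanges its known-nodes set with a few random peers."""
    nodes = list(state)
    for node in nodes:
        for peer in random.sample([n for n in nodes if n != node], peers_per_round):
            merged = state[node] | state[peer]   # both sides learn what the other knows
            state[node] = state[peer] = merged
    return state

# Each node initially knows only itself plus one seed node
state = {f"node-{i}": {f"node-{i}", "node-0"} for i in range(8)}
rounds = 0
while any(known != set(state) for known in state.values()):
    state = gossip_round(state)
    rounds += 1
print(f"cluster state converged in {rounds} rounds")
```

Information spreads epidemically, so convergence takes on the order of log(N) rounds even in large clusters.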
Data Model
Cassandra’s data model looks like SQL but differs in important ways:
- Keyspace: Container for tables, like a database schema
- Table: Collection of rows with a primary key
- Row: Single record identified by its primary key (partition key plus any clustering columns)
- Column: Name-value pair with a data type
-- Create a keyspace with replication strategy
CREATE KEYSPACE IF NOT EXISTS orders
WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'datacenter1': 3
};
-- Create a table with a simple primary key
CREATE TABLE orders.customers (
customer_id UUID,
email TEXT,
name TEXT,
created_at TIMESTAMP,
PRIMARY KEY (customer_id)
);
-- Table with a compound primary key (partition key + clustering key)
CREATE TABLE orders.order_items (
customer_id UUID, -- partition key
order_id TIMEUUID, -- clustering key
product_id UUID,
quantity INT,
price DECIMAL,
PRIMARY KEY (customer_id, order_id)
);
The partition key decides which node stores the data. The clustering key controls sort order within a partition. This design optimizes for the common pattern: fetching all items for a specific customer ordered by time.
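A rough sketch of that physical layout, using a plain dict of sorted lists. String order IDs stand in for TIMEUUIDs here; the point is only that rows co-locate by partition key and stay sorted by clustering key.

```python
import bisect
from collections import defaultdict

# Toy layout: one sorted row list per partition
partitions = defaultdict(list)

def insert(customer_id, order_id, item):
    # All rows for a customer land in one partition, kept sorted by clustering key
    bisect.insort(partitions[customer_id], (order_id, item))

def items_for(customer_id, since=None):
    # Reads within a partition are a contiguous, already-sorted scan
    rows = partitions[customer_id]
    if since is None:
        return rows
    return [r for r in rows if r[0] > since]

insert("cust-1", "2024-01-02T10:00", "widget")
insert("cust-1", "2024-01-01T09:00", "gadget")
insert("cust-2", "2024-01-01T09:30", "gizmo")
print(items_for("cust-1"))  # sorted by order_id within the partition
```

Fetching a customer's recent orders is a single-partition, pre-sorted range read, which is exactly the access pattern the schema was designed for.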
CQL: Cassandra Query Language
CQL looks like SQL but behaves differently. Key restrictions:
- No JOINs (denormalize instead)
- No subqueries in most contexts
- No aggregate queries across partitions (without Spark)
- Queries must use the primary key or an index
-- Insert data
INSERT INTO orders.customers (customer_id, email, name, created_at)
VALUES (uuid(), 'alice@example.com', 'Alice Smith', toTimestamp(now()));
-- Query by partition key (efficient)
SELECT * FROM orders.customers WHERE customer_id = ?;
-- Query by non-primary key (requires secondary index - avoid in hot paths)
SELECT * FROM orders.customers WHERE email = 'alice@example.com';
-- Query by clustering key range (efficient within partition)
SELECT * FROM orders.order_items
WHERE customer_id = ? AND order_id > minTimeuuid('2024-01-01');
These restrictions exist for good reason. JOINs across distributed tables need network round-trips between nodes, which kills performance. By requiring partition-targeted queries, Cassandra routes requests directly to relevant nodes.
Tunable Consistency
Tunable consistency is Cassandra’s most powerful feature. Choose consistency per query, trading off between consistency and performance:
from uuid import uuid4
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel
cluster = Cluster(['192.168.1.1', '192.168.1.2'])
session = cluster.connect('orders')
# Write with quorum consistency (strongly consistent writes);
# the consistency level is set on the statement, not on execute()
insert = SimpleStatement(
    "INSERT INTO customers (customer_id, email) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM
)
session.execute(insert, [uuid4(), 'alice@example.com'])
# Read with ONE consistency (fast, possibly stale)
select = SimpleStatement(
    "SELECT * FROM customers WHERE customer_id = %s",
    consistency_level=ConsistencyLevel.ONE
)
session.execute(select, [customer_id])
Common consistency levels:
| Level | Description | Use Case |
|---|---|---|
| ONE | Any single replica | Maximum speed, potentially stale |
| TWO | Two replicas | Better freshness, still fast |
| THREE | Three replicas | Fresh data, higher latency |
| QUORUM | Majority of replicas (N/2+1) | Balanced consistency |
| ALL | All replicas | Strongest consistency, slowest |
| LOCAL_QUORUM | Quorum in local DC | Low latency for multi-DC |
| EACH_QUORUM | Quorum in every DC | Strong multi-DC consistency |
With 3 replicas, quorum means 2 nodes must acknowledge writes. This gives strong consistency while tolerating one node down.
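The arithmetic is simple enough to sketch. The key property is that read and write quorums overlap whenever R + W > N, so a quorum read always touches at least one replica that saw the latest quorum write.

```python
def quorum(replication_factor: int) -> int:
    # Majority of replicas: floor(N/2) + 1
    return replication_factor // 2 + 1

# Quorum reads see quorum writes because the two sets must intersect
for n in (3, 5):
    r = w = quorum(n)
    assert r + w > n   # overlap guarantee: R + W > N
    print(f"RF={n}: quorum={r}, tolerates {n - r} node(s) down")
```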
Client-Side Token-Aware Routing
Cassandra’s driver computes the token (MurmurHash) of the partition key and routes queries directly to the replica that owns that token range. This avoids hitting every node in the cluster for every query.
from cassandra.cluster import Cluster
# Token-aware routing is enabled by default in recent drivers:
# the default policy is TokenAwarePolicy wrapping DCAwareRoundRobinPolicy
cluster = Cluster(['192.168.1.1', '192.168.1.2', '192.168.1.3'])
session = cluster.connect('orders')
# The driver computes the Murmur3 token of the bound partition key,
# finds which node owns that token range,
# and sends the query directly to that node
result = session.execute(
    "SELECT * FROM orders.customers WHERE customer_id = %s",
    [customer_id]  # driver routes to a replica owning this partition
)
Without token-aware routing:
A query for customer-123's partition would first land on an arbitrary coordinator node, which would then forward it to the replicas and relay the response. This adds an extra network hop.
With token-aware routing:
The driver knows which replica owns the partition. It sends the query directly to that node.
graph TD
A[Client with Token-Aware Driver] -->|Direct query to replica| B[Replica 1 - owns partition]
A -->|Coordinator hop| C[Any Node]
C -->|Forwards query| B
How token mapping works:
Cassandra uses MurmurHash3 for token computation. The Murmur3 token range (-2^63 to 2^63-1) is divided among nodes. The driver builds a token map from cluster metadata:
# Driver token map inspection
from cassandra.cluster import Cluster
cluster = Cluster(['192.168.1.1'])
session = cluster.connect()
# The driver maintains a token map built from cluster metadata
token_map = cluster.metadata.token_map
# Look up which replicas own a given partition key
# (get_replicas expects the serialized partition key bytes)
replicas = cluster.metadata.get_replicas('orders', b'customer-123')
print(f"Replicas for the partition: {replicas}")
Token-aware routing and consistency levels:
Token-aware routing works with all consistency levels. For QUORUM, the chosen coordinator (ideally itself a replica) serves or requests the full data from one replica and sends digest requests to the others in parallel, waiting until enough responses satisfy the quorum.
# Behavior at QUORUM consistency with 3 replicas:
# 1. Driver computes the token -> routes to a replica that owns it
# 2. That coordinator obtains the full data from one replica (often itself)
# 3. It sends digest queries to the other replicas in parallel
# 4. It waits until 2 responses (quorum) are in
# 5. If digests match, return result; if not, fetch full data and repair
stmt = SimpleStatement(query, consistency_level=ConsistencyLevel.QUORUM)
session.execute(stmt, parameters)  # coordinator handles the parallel digest reads
When token-aware routing matters most:
| Scenario | Impact |
|---|---|
| Single-partition reads | High - eliminates coordinator hop |
| Cross-partition queries (multi-key IN, token range scans) | Low - must fan out anyway |
| Batch statements | Medium - each statement routed independently |
| ALLOW FILTERING queries | None - requires full table scan |
Token-aware routing limitations:
- Only works for queries with partition key equality (e.g., WHERE customer_id = ?)
- Range queries over the partition key (e.g., WHERE token(customer_id) > ?) span many nodes and cannot target a single replica
- Batch statements with multiple partition keys fan out to multiple nodes
- ALLOW FILTERING defeats token awareness entirely
Driver configuration for token-aware routing:
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy
from cassandra import ConsistencyLevel
# Execution profiles carry the load-balancing policy plus per-query settings;
# profiles are registered on the Cluster and referenced by name at execute time
fast_profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc='us-east-1')
    ),
    consistency_level=ConsistencyLevel.ONE,
    request_timeout=2.0
)
strong_profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc='us-east-1')
    ),
    consistency_level=ConsistencyLevel.QUORUM,
    request_timeout=10.0
)
cluster = Cluster(
    ['192.168.1.1', '192.168.1.2'],
    execution_profiles={EXEC_PROFILE_DEFAULT: strong_profile, 'fast': fast_profile}
)
session = cluster.connect('orders')
session.execute(query, execution_profile='fast')  # Fast local reads
Lightweight Transactions (LWT)
Cassandra supports linearizable consistency for single-partition writes through Lightweight Transactions (LWT), implemented using a Paxos consensus protocol. The IF clause in CQL triggers LWT behavior.
-- LWT: Only insert if user@example.com does not exist
INSERT INTO users (user_id, email, created_at)
VALUES ('user-123', 'user@example.com', toTimestamp(now()))
IF NOT EXISTS;
-- LWT: Conditional update - only update if current balance >= amount
UPDATE account_balances
SET balance = balance - 50
WHERE account_id = 'acc-123'
IF balance >= 50;
-- LWT: Compare-and-set pattern
UPDATE counters
SET count = 100
WHERE page_id = 'homepage'
IF count = 99; -- Only succeeds if count is currently 99
How LWT works internally:
LWT runs a Paxos round per partition (the full protocol also includes a read phase to evaluate the IF condition):
- Prepare phase: coordinator picks a ballot and asks replicas to promise not to accept lower ballots
- Accept phase: if a majority of replicas promise, the coordinator sends the accept message with the proposed value
- Commit phase: replicas commit the value and the coordinator responds to the client
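A minimal single-round sketch of the promise/accept logic. This is a teaching model, not Cassandra's implementation: real LWT derives ballots from timestamps, adds the condition-evaluation read, and persists Paxos state per partition; class and variable names here are illustrative.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1          # highest ballot promised so far
        self.accepted = None        # (ballot, value) last accepted, if any

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted   # promise, plus any previously accepted value
        return False, None

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    """One Paxos round: needs a majority of promises, then a majority of accepts."""
    majority = len(acceptors) // 2 + 1
    promises = [a.prepare(ballot) for a in acceptors]
    if sum(ok for ok, _ in promises) < majority:
        return None                      # lost the prepare phase - retry with a higher ballot
    # If any acceptor already accepted a value, we must re-propose it
    prior = [acc for ok, acc in promises if ok and acc]
    chosen = max(prior)[1] if prior else value
    accepts = sum(a.accept(ballot, chosen) for a in acceptors)
    return chosen if accepts >= majority else None

replicas = [Acceptor() for _ in range(3)]
print(propose(replicas, ballot=1, value="row-v1"))   # the round chooses "row-v1"
```

Note the re-proposal rule: a later round with a higher ballot cannot overwrite an already-accepted value, which is what makes concurrent conditional writes safe.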
sequenceDiagram
participant C as Coordinator
participant R1 as Replica 1
participant R2 as Replica 2
participant R3 as Replica 3
C->>R1: Prepare (p=1, v=value)
C->>R2: Prepare (p=1, v=value)
C->>R3: Prepare (p=1, v=value)
R1-->>C: Promise(p=1)
R2-->>C: Promise(p=1)
R3-->>C: Promise(p=1)
Note over C: Majority promised
C->>R1: Accept (p=1, v=value)
C->>R2: Accept (p=1, v=value)
C->>R3: Accept (p=1, v=value)
R1-->>C: Accepted
R2-->>C: Accepted
R3-->>C: Accepted
Note over C: Majority accepted
C->>R1: Commit
C->>R2: Commit
C->>R3: Commit
Latency cost:
LWT rounds add 4 network hops compared to a standard write. With 3 replicas across a datacenter:
| Operation | Latency (local DC) | Latency (cross-DC) |
|---|---|---|
| Standard write (QUORUM) | ~2-5ms | ~15-30ms |
| LWT write (QUORUM) | ~10-20ms | ~50-100ms |
LWT latency is higher because every operation goes through Paxos. Use LOCAL_QUORUM for same-DC operations when possible.
LWT contention and failure:
When multiple clients attempt LWT on the same partition simultaneously, only one proposal wins each round. The losers see their condition fail (applied = False) or hit a write timeout and must retry.
import time
from datetime import datetime, timedelta, timezone
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel, WriteTimeout
cluster = Cluster(['192.168.1.1', '192.168.1.2', '192.168.1.3'])
session = cluster.connect('users')
# LWT with retry logic for contention
query = SimpleStatement("""
    UPDATE user_sessions
    SET last_active = toTimestamp(now())
    WHERE user_id = 'user-123'
    IF last_active < %s
""", consistency_level=ConsistencyLevel.LOCAL_QUORUM)
cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
for attempt in range(5):
    try:
        result = session.execute(query, [cutoff])
        if result.was_applied:
            print("Update succeeded")
        else:
            print("Condition not met - session recently active")
        break
    except WriteTimeout:
        if attempt < 4:
            time.sleep(0.1 * (2 ** attempt))  # Exponential backoff under contention
            continue
        raise
When to use LWT:
| Use Case | LWT Appropriate |
|---|---|
| Ensure unique email on user creation | Yes - IF NOT EXISTS |
| Prevent double-spend / race conditions | Yes - IF balance >= amount |
| Distributed locks | Yes - with TTL-based lease patterns, though a dedicated coordination service is often a better fit |
| Counter increments | No - use Cassandra counters |
| High-throughput writes | No - LWT is 4-10x slower |
| Multi-partition transactions | No - LWT is single partition only |
LWT and Cassandra driver:
The Python driver surfaces the LWT applied flag on the result set, but configure the retry policy so LWTs are not blindly retried:
from cassandra.cluster import Cluster
from cassandra.policies import FallthroughRetryPolicy
cluster = Cluster(
    ['192.168.1.1', '192.168.1.2', '192.168.1.3'],
    default_retry_policy=FallthroughRetryPolicy()  # surface errors; retry LWTs in application code
)
Common LWT mistakes:
- Using LWT on high-contention rows (multiple concurrent updates fail repeatedly)
- Assuming LWT spans multiple partitions (it cannot)
- Not handling timeout and unavailable errors with appropriate backoff
- Using LWT for counters (counters have their own dedicated, non-Paxos implementation)
Materialized Views
Materialized views (MVs) in Cassandra automatically maintain a denormalized view of a base table. When rows in the base table change, Cassandra asynchronously updates the view. This differs from manual denormalization where your application must write to multiple tables.
-- Base table: orders by customer
CREATE TABLE orders.customers (
customer_id UUID,
order_id TIMEUUID,
order_total DECIMAL,
order_status TEXT,
created_at TIMESTAMP,
PRIMARY KEY (customer_id, order_id)
);
-- Materialized view: orders indexed by order_id
-- (the WHERE clause must cover every column in the view's primary key)
CREATE MATERIALIZED VIEW orders.orders_by_id AS
SELECT order_id, customer_id, order_total, order_status, created_at
FROM orders.customers
WHERE order_id IS NOT NULL AND customer_id IS NOT NULL
PRIMARY KEY (order_id, customer_id);
-- Materialized view: orders indexed by status
-- (the view's primary key must include all base-table primary key columns)
CREATE MATERIALIZED VIEW orders.orders_by_status AS
SELECT order_status, order_id, customer_id, order_total, created_at
FROM orders.customers
WHERE order_status IS NOT NULL AND order_id IS NOT NULL AND customer_id IS NOT NULL
PRIMARY KEY (order_status, order_id, customer_id);
How materialized view updates work:
- Write arrives at base table
- Coordinator writes to base table SSTable
- Local view building mutation is created
- View update is written to view’s partition
- View updates are atomic per partition but eventual across views
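The fan-out can be sketched with plain dicts. This model is synchronous for clarity (Cassandra applies view mutations asynchronously, as noted above); the table and column names are illustrative.

```python
# Toy materialized-view maintenance: a base write derives one mutation per view,
# each keyed by that view's partition key
base = {}                                  # (customer_id, order_id) -> row
views = {"by_id": {}, "by_status": {}}

def write_order(customer_id, order_id, status, total):
    row = {"customer_id": customer_id, "order_id": order_id,
           "status": status, "total": total}
    old = base.get((customer_id, order_id))
    base[(customer_id, order_id)] = row
    # Each view is re-keyed: if the view key changed, the old view row must be removed
    if old and old["status"] != status:
        del views["by_status"][(old["status"], order_id)]
    views["by_id"][order_id] = row
    views["by_status"][(status, order_id)] = row

write_order("cust-1", "ord-1", "PENDING", 99.0)
write_order("cust-1", "ord-1", "SHIPPED", 99.0)   # status change moves the view row
print(views["by_status"])                          # only the SHIPPED entry remains
```

The delete-then-insert step is why updates that change a view key column are the expensive case: every such base write implies a tombstone in the view.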
graph LR
A[Write to Base Table] --> B[Coordinator]
B --> C[Base Table SSTable]
B --> D[View Mutation]
D --> E[View 1 SSTable]
D --> F[View 2 SSTable]
D --> G[View N SSTable]
Eventual consistency trade-offs:
View updates are asynchronous and best-effort. This has important implications:
| Aspect | Behavior |
|---|---|
| View staleness | Views may lag base table by milliseconds to seconds |
| View failures | Failed view updates are not retried automatically |
| Lost writes | If a base write fails after succeeding on some replicas, view updates may be inconsistent |
| Delete propagation | Deletes from base table propagate to views |
from cassandra.cluster import Cluster
cluster = Cluster(['192.168.1.1', '192.168.1.2'])
session = cluster.connect('orders')
# Read from materialized view - may be slightly stale
rows = session.execute("""
SELECT * FROM orders.orders_by_id
WHERE order_id = %s
""", parameters=['some-order-uuid'])
# For critical reads requiring latest data, query the base table instead
# (the base table is keyed by customer_id, so the partition key is required)
rows = session.execute("""
    SELECT * FROM orders.customers
    WHERE customer_id = %s AND order_id = %s
""", parameters=['some-customer-uuid', 'some-order-uuid'])
When materialized views work well:
| Scenario | MV Works |
|---|---|
| Write-heavy workloads with multiple read patterns | Yes - writes maintain views automatically |
| Single-row updates across views | Yes - atomic per partition |
| Time-series with multiple access patterns | Yes - create views for different time ranges |
| Read-heavy workloads that can tolerate stale reads | Yes - views reduce read amplification |
When materialized views break down:
| Scenario | MV Problems |
|---|---|
| High contention on base row | View updates may be lost or delayed |
| Updates that change the view key | MV does not support updating the partition key |
| Very high write rates | View update backlog can grow unbounded |
| Strong consistency requirements | Views are eventually consistent |
| Complex aggregations | MVs cannot aggregate - they only denormalize |
View update failure handling:
Cassandra tracks view update failures in system.views_builds_in_progress and system.built_views. You can monitor and rebuild:
# Check view build status
nodetool viewbuildstatus orders orders_by_id
# Rebuild a materialized view if it falls out of sync
nodetool rebuildview orders orders_by_id
# Check for pending view updates
SELECT * FROM system.views_builds_in_progress;
SELECT * FROM system.built_views;
Materialized view limitations:
- View primary key must include all primary key columns from the base table
- Cannot update a column that is part of the view's primary key
- At most one non-primary-key column from the base table can appear in the view's primary key
- The view's TTLs come from the base-table writes; you cannot set a separate default TTL on the view
Write Path and Commit Log
Cassandra optimizes for write throughput. When you write:
- Write appends to the commit log (durability)
- Data goes to the memtable (in-memory buffer)
- Client gets acknowledgment immediately
- Memtable flushes to SSTable when full
The write path is append-only, which is fast on disk. B-tree databases must find the right location and overwrite; Cassandra just appends.
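The steps above can be sketched as a toy write path (structures and the flush threshold are illustrative):

```python
# Toy write path: append to a commit log for durability, buffer in a sorted
# memtable, flush to an immutable SSTable when the buffer fills
commit_log = []        # append-only durability record
memtable = {}          # in-memory buffer, sorted at flush time
sstables = []          # immutable on-disk segments (plain lists here)
MEMTABLE_LIMIT = 3

def write(key, value):
    commit_log.append((key, value))       # 1. sequential append (fast on disk)
    memtable[key] = value                 # 2. buffer in memory
    # 3. the client would be acknowledged here
    if len(memtable) >= MEMTABLE_LIMIT:   # 4. flush a full memtable to an SSTable
        sstables.append(sorted(memtable.items()))
        memtable.clear()

for i in range(4):
    write(f"k{i}", f"v{i}")
print(sstables)   # one flushed, sorted segment: k0..k2
print(memtable)   # k3 still buffered in memory
```

Nothing on this path does a random-access disk write, which is the core of Cassandra's write throughput.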
sequenceDiagram
participant C as Client
participant M as Memtable
participant L as Commit Log
participant S as SSTable
C->>L: Append to Commit Log (durability)
C->>M: Write to Memtable
M-->>C: Write Acknowledged
Note over M: When memtable is full
M->>S: Flush to SSTable
SSTables (Sorted String Tables) are immutable files on disk. Once written, never modified. Compaction merges SSTables and removes deleted data.
Compaction Strategies
Compaction merges SSTables and reclaims space after deletions or updates. Cassandra offers:
SizeTiered Compaction (STCS):
- Groups SSTables of similar size
- Good for write-heavy workloads
- Obsolete and deleted data lingers longer before compaction reclaims it
TimeWindow Compaction (TWCS):
- Groups SSTables by time window
- Ideal for time-series
- Simplifies TTL management
Leveled Compaction (LCS):
- Restricts SSTable size per level
- Better read performance (check at most one SSTable per level)
- More writes due to compaction
-- Specify compaction strategy when creating table
CREATE TABLE metrics.sensor_data (
sensor_id UUID,
timestamp TIMESTAMP,
value DOUBLE,
PRIMARY KEY (sensor_id, timestamp)
) WITH compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'HOURS',
'compaction_window_size': 1
};
For time-series with TTLs, TWCS simplifies tombstone cleanup. For general workloads, choose LCS or STCS depending on your read/write ratio.
Compaction Strategy Selection Decision Tree
Choosing the right compaction strategy requires understanding your workload characteristics. Here is a practical decision framework:
graph TD
A[Start: Choose Compaction Strategy] --> B{Data Model?}
B -->|Time-series / TTL| C[TWCS - TimeWindow Compaction]
B -->|Wide partitions / Heavy writes| D[STCS - SizeTiered Compaction]
B -->|General workload / Balanced| E{Latency Priority?}
E -->|Optimize reads| F[LCS - Leveled Compaction]
E -->|Optimize writes| D
E -->|Mixed workload| G{Partition Size?}
G -->|Wide partitions >100MB| D
G -->|Narrow partitions| H[TWCS or LCS]
C --> I[Best for: IoT, metrics, event logs]
D --> J[Best for: Write-heavy, archival]
F --> K[Best for: Read-heavy, time-series with range queries]
H --> L[Consider: LCS if reads matter, TWCS if time-bucketed]
Decision matrix by workload type:
| Workload | Primary Strategy | Alternative | Why |
|---|---|---|---|
| Time-series with TTL | TWCS | - | TTL expiration aligns with window compaction |
| Write-heavy (logs, events) | STCS | TWCS (if time-bucketed) | Append-heavy, infrequent reads |
| Read-heavy (user profiles) | LCS | TWCS | Frequent reads benefit from leveled layout |
| Mixed (balanced) | TWCS | LCS | Time-bucketed queries work well with TWCS |
| Heavy deletes | TWCS | STCS | Tombstones expire with time windows |
| Wide partitions | STCS | - | LCS struggles with partitions > 10MB |
| Counter tables | STCS | - | Counters have many updates, benefit from STCS |
Compaction strategy characteristics:
| Strategy | SSTable Levels | Write Amplification | Read Amplification | Space Amplification | Best For |
|---|---|---|---|---|---|
| STCS | N (unbounded) | 1x (low) | High (scans similar-sized SSTables) | High (may have duplicates) | Write-heavy |
| LCS | L0 + leveled L1-L7 | High (data is rewritten as it moves down levels) | Low (at most 1 SSTable per level) | Low (~10% overhead by design) | Read-heavy |
| TWCS | N per window | Low | Low | Low (timestamps co-located) | Time-series |
How to switch compaction strategies:
# WARNING: changing compaction strategy reorganizes data over time -
# understand the implications before doing this on a loaded cluster
# 1. Change the strategy via CQL ALTER TABLE (applies to future compactions)
cqlsh -e "ALTER TABLE orders.customers WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'HOURS',
  'compaction_window_size': 1
}"
# 2. Optionally rewrite existing SSTables under the new strategy
nodetool upgradesstables -a orders customers
# 3. Monitor compaction during the transition
nodetool compactionstats
Compaction strategy and Cassandra version:
| Strategy | Cassandra 2.1 | Cassandra 3.0+ | Notes |
|---|---|---|---|
| STCS | Yes | Yes | Default in 2.1 |
| LCS | Yes | Yes | Improved significantly in 3.0+ |
| TWCS | No | Yes | Added in 3.0.8 |
| DTCS | Yes | Yes (deprecated once TWCS landed) | Use TWCS instead; DTCS removed in later releases |
Common compaction mistakes:
- Using LCS for wide partitions: LCS assumes partitions fit in a single SSTable level. Wide partitions cause compaction failures and read timeouts.
- Using TWCS without time-based access: TWCS groups data by timestamp. If queries span many time windows, performance degrades.
- Changing strategies without rewriting SSTables: existing SSTables keep their old layout until upgradesstables rewrites them.
- Setting TWCS windows too small: windows under 1 hour cause excessive SSTable fragmentation; windows over a day defeat the tombstone-purging purpose.
- Ignoring compaction during a strategy change: the table is vulnerable to tombstone accumulation during the transition period.
Tombstone Handling
When you delete data in Cassandra, the deletion is not immediate. Cassandra writes a tombstone - a marker indicating the data is deleted - and the actual removal happens during compaction.
Why tombstones exist:
Cassandra’s append-only SSTable design means data cannot be overwritten in place. Deletions are writes that mark data as dead. The tombstone persists until compaction removes both the tombstone and the dead data it marks.
-- This writes a tombstone, not immediate deletion
DELETE FROM orders.order_items
WHERE order_id = 'order-123';
Tombstone lifecycle:
- Delete request arrives at the coordinator
- Coordinator writes a tombstone to the SSTable with the deletion timestamp
- Tombstone is visible immediately - queries exclude the deleted data
- During compaction, tombstones older than gc_grace_seconds are removed along with the data they shadow
- If a replica missed the deletion and repair does not run within gc_grace_seconds, that replica's stale copy can "resurrect" once the tombstone is purged elsewhere
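A toy model of the lifecycle (timestamps, structures, and the single-SSTable simplification are illustrative):

```python
# Toy tombstone lifecycle: a delete writes a timestamped marker; compaction
# drops the marker (and the data it shadows) only after gc_grace_seconds
GC_GRACE_SECONDS = 864_000    # 10 days, the default

def read(sstable, key, now):
    cell = sstable.get(key)
    if cell is None or cell["tombstone"]:
        return None           # tombstones shadow the data immediately
    return cell["value"]

def compact(sstable, now):
    # Keep live cells, and keep tombstones still inside the grace window
    return {k: c for k, c in sstable.items()
            if not c["tombstone"] or now - c["ts"] < GC_GRACE_SECONDS}

t0 = 1_700_000_000
table = {"order-123": {"value": "3 widgets", "tombstone": False, "ts": t0}}
table["order-123"] = {"value": None, "tombstone": True, "ts": t0 + 60}   # DELETE

print(read(table, "order-123", t0 + 120))             # None: shadowed at once
print("order-123" in compact(table, t0 + 3600))       # True: inside gc_grace
print("order-123" in compact(table, t0 + 1_000_000))  # False: marker purged
```

The grace window in `compact` is exactly the race described above: purge the marker before every replica has seen the delete and the stale copy wins.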
gc_grace_seconds and resurrection risk:
-- Table with gc_grace_seconds = 10 days (default)
-- This setting is critical for managing tombstones
CREATE TABLE orders.order_items (
order_id UUID,
item_id UUID,
quantity INT,
PRIMARY KEY (order_id, item_id)
) WITH gc_grace_seconds = 864000; -- 10 days in seconds
The gc_grace_seconds setting controls how long tombstones must survive before compaction can delete them. This window allows anti-entropy repair to propagate deletions to all replicas. If you run compaction before gc_grace_seconds elapses and a replica missed the original deletion, the deleted data can resurrect on that replica.
| Scenario | gc_grace_seconds | Risk |
|---|---|---|
| Single datacenter, frequent repairs | 86400 (1 day) | Low - repairs catch missed deletions quickly |
| Multi-DC, less frequent repairs | 432000 (5 days) | Medium - the window must still exceed the repair interval |
| Infrequent repairs | 864000 (10 days, default) | Lower - the larger window leaves more time for repair |
| No repairs ever | Any value | High - deletions may never reach all replicas; resurrection is likely |
Tombstone impact on read performance:
Queries must scan through tombstones to find live data. A partition with many tombstones can cause read timeouts even when the query returns few results.
-- This query scans all tombstones in the partition
-- May timeout on partitions with 100k+ tombstones
SELECT * FROM orders.order_items
WHERE order_id = 'big-order-with-many-items';
Diagnosing tombstone problems:
# Check tombstone pressure per table
nodetool tablestats orders.order_items -H
# Look for "Average tombstones per slice" and "Maximum tombstones per slice"
# Higher numbers indicate read pressure from tombstone scanning
# Enable trace for queries to see tombstone scanning
TRACING ON;
SELECT * FROM orders.order_items WHERE order_id = 'test';
TRACING OFF;
# Check compaction history
nodetool compactionhistory | grep order_items
Mitigation strategies:
- Time-window bucketing: Split time-series into daily or weekly partitions. Old partitions with tombstones are not queried.
CREATE TABLE metrics.sensor_data (
sensor_id UUID,
date TEXT, -- '2024-01-15' bucket
timestamp TIMESTAMP,
value DOUBLE,
PRIMARY KEY ((sensor_id, date), timestamp)
);
- Prevent wide rows with mixed active/deleted data: separate active entities from archived ones.
- Use TTL instead of explicit deletes: TTL-expired tombstones are handled cleanly by TWCS.
- Reduce gc_grace_seconds with frequent repairs: if you run repair daily, you can safely lower gc_grace_seconds.
- Monitor partition statistics: identify partitions with excessive tombstones before they cause outages.
# system.size_estimates has no tombstone column; per-partition tombstone counts
# come from nodetool tablestats or sstablemetadata. What you can query via CQL
# is partition-size estimates, a useful proxy for spotting problem partitions.
from cassandra.cluster import Cluster
cluster = Cluster(['192.168.1.1'])
session = cluster.connect()
rows = session.execute("""
    SELECT range_start, range_end, mean_partition_size, partitions_count
    FROM system.size_estimates
    WHERE keyspace_name = 'orders' AND table_name = 'order_items'
""")
for row in rows:
    if row.mean_partition_size > 100 * 1024 * 1024:  # > 100 MB mean partition
        print(f"Range {row.range_start}..{row.range_end}: "
              f"mean partition {row.mean_partition_size} bytes - investigate")
Anti-Entropy Repair
Cassandra uses anti-entropy repair to synchronize data between replicas and ensure consistency. The nodetool repair command triggers this process, which compares Merkle trees between replicas to find and fix discrepancies.
How Merkle trees work in Cassandra:
- Each replica builds a Merkle tree from its SSTable data
- Trees are hash-based binary trees where leaf nodes contain hash values of data ranges
- Replicas exchange tree roots to detect differences
- When roots differ, nodes traverse to find the exact range that diverged
- Only the differing data ranges are exchanged and repaired
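A toy Merkle comparison over four data ranges. The sha256 hash and range labels are illustrative, and real Cassandra sizes trees per token range and descends level by level rather than jumping straight to the leaves; the payoff shown here is the same: only the mismatched range needs streaming.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle(leaves):
    """Build levels bottom-up; level[0] is the leaf hashes, level[-1] the root."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h((prev[i] + prev[i + 1]).encode())
                       for i in range(0, len(prev), 2)])
    return levels

def diff_leaves(a, b):
    """Compare roots first; look at leaves only when the roots disagree."""
    if a[-1] == b[-1]:
        return []                 # roots match - replicas are in sync
    return [i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y]

r1 = merkle([b"range-0", b"range-1", b"range-2", b"range-3"])
r2 = merkle([b"range-0", b"range-1", b"STALE!!", b"range-3"])
print(diff_leaves(r1, r2))        # [2]: only this range needs to be streamed
```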
graph TD
A[Replica 1 SSTable] --> B[Merkle Tree Root Hash]
C[Replica 2 SSTable] --> D[Merkle Tree Root Hash]
B --> E{Compare Roots}
D --> E
E -->|Match| F[No repair needed]
E -->|Mismatch| G[Traverse tree to find differing range]
G --> H[Exchange only differing data]
Operational considerations:
- Running repair: nodetool repair -pr repairs only the node's primary ranges (run it on every node)
- Frequency: run repair at least once per gc_grace_seconds window; weekly is a common baseline, daily for critical data
- Resource cost: repair is expensive - it reads entire SSTables and generates network traffic
- Incremental repair: repairs only data not already marked repaired, spreading load across runs
- Pending load: nodetool compactionstats shows pending tasks before running repair
Common repair issues:
| Problem | Symptom | Mitigation |
|---|---|---|
| Repair running too long | Cluster slowdown | Use incremental repair |
| Merkle tree memory | OOM on large partitions | Subdivide large partitions |
| Overlap with writes | Inconsistency during repair | Repair during low-write windows |
| Node down during repair | Incomplete sync | Re-run repair after node recovers |
Anti-entropy vs Read Repair:
Anti-entropy repair is proactive (runs periodically), while read repair is reactive (happens during reads). Read repair compares replica digests during reads at higher consistency levels and fixes inconsistencies it discovers. Anti-entropy catches issues on data that reads alone would miss.
Read Repair: The Passive Consistency Path on Every Read
Read repair is Cassandra’s mechanism for passively repairing inconsistencies during normal read operations. It happens automatically without requiring separate repair jobs.
How read repair works:
When you read with a consistency level greater than ONE, Cassandra checks consistency across replicas and repairs discrepancies as part of the read path.
sequenceDiagram
participant C as Coordinator
participant R1 as Replica 1
participant R2 as Replica 2
participant R3 as Replica 3
C->>R1: Send data query
C->>R2: Send data query
C->>R3: Send data query
R1-->>C: Data (digest: abc123)
R2-->>C: Data (digest: abc123)
R3-->>C: Data (digest: xyz789) <- Digest mismatch!
Note over C: Digests differ - R3 has stale data
C->>R3: Send full data repair
R3-->>C: Latest data
C->>R1: Confirm sync (optional digest)
C->>R2: Confirm sync (optional digest)
Note over C: Return latest to client
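The digest-compare-and-repair flow in the diagram can be sketched as follows. The md5 digests and replica names are illustrative, and real resolution is per-cell last-write-wins rather than whole-row:

```python
import hashlib

# Toy read repair: the coordinator compares replica digests, resolves by
# write timestamp, and pushes the winner back to any stale replica
replicas = {
    "r1": {"value": "alice@new.com", "ts": 200},
    "r2": {"value": "alice@new.com", "ts": 200},
    "r3": {"value": "alice@old.com", "ts": 100},   # missed the latest write
}

def digest(cell):
    return hashlib.md5(f"{cell['value']}@{cell['ts']}".encode()).hexdigest()

def coordinated_read(replicas):
    digests = {name: digest(cell) for name, cell in replicas.items()}
    if len(set(digests.values())) > 1:
        # Mismatch: resolve by last-write-wins, then repair stale replicas
        latest = max(replicas.values(), key=lambda c: c["ts"])
        for name, cell in replicas.items():
            if cell["ts"] < latest["ts"]:
                replicas[name] = dict(latest)      # repair write to stale replica
        return latest["value"]
    return next(iter(replicas.values()))["value"]

print(coordinated_read(replicas))   # client always sees the newest value
print(replicas["r3"]["value"])      # r3 has been repaired as a side effect
```

The client never observes the stale value; the repair is a side effect of the read.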
Read repair with QUORUM (3 replicas):
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel
cluster = Cluster(['192.168.1.1', '192.168.1.2', '192.168.1.3'])
session = cluster.connect('orders')
# Read with QUORUM consistency
# Coordinator queries all 3 replicas
# Waits for 2 responses (quorum)
# If digests differ, repair happens as part of read
query = SimpleStatement("""
SELECT * FROM orders.customers WHERE customer_id = %s
""", consistency_level=ConsistencyLevel.QUORUM)
result = session.execute(query, ['customer-123'])
# During this read:
# 1. Coordinator sends to all 3 replicas
# 2. R1 and R2 return data with digest
# 3. Coordinator compares digests
# 4. If R3 digest differs, coordinator fetches latest from R1/R2 and sends to R3
# 5. Client gets consistent data
Read repair probability:
| Consistency Level | Read Repair Chance |
|---|---|
| ONE | 0% - no digest check |
| LOCAL_ONE | 0% - no digest check |
| QUORUM | ~33% (1 replica may differ with RF=3) |
| ALL | 100% - all replicas checked |
| EACH_QUORUM | 100% - all DCs checked |
With RF=3 and QUORUM, two of the three replicas participate in every read, so any mismatch between those two is always detected; a stale third replica only gets repaired on reads in which it happens to respond.
Read repair at ONE consistency:
Before Cassandra 4.0, tables could opt into probabilistic background read repair even at ONE consistency. These were per-table CQL options (not cassandra.yaml settings), and both were removed in 4.0:
read_repair_page_size: 1000
-- Table-specific read repair chance
CREATE TABLE orders.customers (
customer_id UUID,
email TEXT,
name TEXT,
PRIMARY KEY (customer_id)
) WITH read_repair_chance = 0.1
AND dclocal_read_repair_chance = 0.05;
dclocal_read_repair_chance vs read_repair_chance:
- dclocal_read_repair_chance: probability of read repair across the local DC only (faster; does not wait for cross-DC replicas)
- read_repair_chance: probability of read repair across all DCs (waits for all replicas)
| Setting | Scope | Latency Impact | Use Case |
|---|---|---|---|
| dclocal_read_repair_chance | Local DC only | Lower - no cross-DC coordination | Multi-DC with local-first reads |
| read_repair_chance | All DCs | Higher - cross-DC coordination | Critical data requiring global consistency |
Read repair and write latency interaction:
Read repair happens asynchronously after the coordinator receives enough responses to satisfy the consistency level. This means:
# QUORUM read timeline:
# T+0ms: Coordinator sends to all replicas
# T+2ms: QUORUM (2 replicas) respond - coordinator returns to client
# T+2ms to T+10ms: Background read repair to stale replica (if any)
The client sees low latency (just waiting for quorum), but the repair happens in the background.
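The digest-comparison step can be sketched in plain Python (a toy model, not driver code; replica contents are made up):

```python
# Toy model of QUORUM read repair: the coordinator takes the newest version
# among the contacted replicas and pushes it to any stale one.
def quorum_read_with_repair(replicas: dict):
    """replicas maps name -> (value, timestamp). Returns (value, repaired)."""
    contacted = dict(list(replicas.items())[:2])  # QUORUM with RF=3: 2 replicas
    # The newest timestamp wins
    winner_value, winner_ts = max(contacted.values(), key=lambda vt: vt[1])
    repaired = []
    for name, (value, ts) in contacted.items():
        if ts < winner_ts:  # digest mismatch: push the newest version
            replicas[name] = (winner_value, winner_ts)
            repaired.append(name)
    return winner_value, repaired

replicas = {"R1": ("alice@new.com", 200), "R2": ("alice@old.com", 100)}
value, repaired = quorum_read_with_repair(replicas)
print(value, repaired)  # newest value wins; stale replica R2 is repaired
```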
Monitoring read repair:
# ReadRepairStage thread pool activity (active, pending, completed tasks)
nodetool tpstats | grep -i readrepair
# Raise logging if you need to trace individual repairs; the package name
# varies by version (org.apache.cassandra.service.reads in 4.x)
nodetool setlogginglevels org.apache.cassandra.service.reads DEBUG
Per-table and cluster-wide read repair counters are exposed over JMX under org.apache.cassandra.metrics:type=ReadRepair (for example, RepairedBlocking and RepairedBackground). Collect them with jconsole or a metrics exporter; read repair activity is not recorded in CQL-queryable system tables.
When read repair is insufficient:
Read repair only fixes inconsistencies on partitions that are actively read. Partitions that are never read accumulate inconsistencies until anti-entropy repair runs. This is why anti-entropy repair (nodetool repair) is essential for long-term consistency, especially for rarely-accessed data.
| Scenario | Read Repair | Anti-Entropy Repair |
|---|---|---|
| Frequently accessed partitions | Keeps them consistent | Rarely needed |
| Rarely accessed partitions | Does not run | Required periodically |
| Large partitions | Can cause latency spikes | Better - runs in background |
| Cross-DC consistency | dclocal_read_repair_chance skips remote DCs | Covers the full cluster |
Data Center Awareness
NetworkTopologyStrategy defines replication rules that respect data center boundaries:
- Lower latency by reading from local replicas
- Disaster resilience via cross-DC replication
- Workload isolation (analytics vs. transactional)
CREATE KEYSPACE orders WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'us-east-1': 3, -- 3 replicas in primary DC
'us-west-2': 3, -- 3 replicas in backup DC
'eu-west-1': 1 -- 1 replica in EU for GDPR
};
For cross-DC replication, LOCAL_ONE, LOCAL_QUORUM, and EACH_QUORUM let you control whether a request waits on remote datacenters, and therefore how much cross-DC round trips affect latency.
Multi-DC Consistency Pitfalls
Cassandra’s tunable consistency behaves differently across datacenters, and EACH_QUORUM has surprising behavior during DC failures that catches many teams off guard.
EACH_QUORUM behavior during DC failure:
EACH_QUORUM requires a quorum in every datacenter before responding to the client. If one DC is down, writes fail entirely even if all other DCs are available.
from uuid import uuid4
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel
cluster = Cluster(['192.168.1.1', '192.168.1.2'])  # contact points in us-east-1
session = cluster.connect('orders')
# EACH_QUORUM write - waits for quorum in EACH datacenter
# If us-west-2 DC is down, this write FAILS even if us-east-1 quorum succeeds
query = SimpleStatement("""
    INSERT INTO orders.customers (customer_id, email, name)
    VALUES (%s, %s, %s)
""", consistency_level=ConsistencyLevel.EACH_QUORUM)
try:
    session.execute(query, [uuid4(), 'a@b.com', 'Alice'])  # customer_id is a UUID
except Exception as e:
    print(f"EACH_QUORUM failed: {e}")
    # If us-west-2 is down, the write fails even though us-east-1 succeeded
The EACH_QUORUM failure scenario:
With replication factor 3 in each of 2 DCs (6 total replicas):
| DC | Replicas | Quorum Needed | Status |
|---|---|---|---|
| us-east-1 | 3 | 2 | Available |
| us-west-2 | 3 | 2 | DOWN |
- EACH_QUORUM write: requires 2 from us-east-1 AND 2 from us-west-2. FAILS because us-west-2 is unavailable.
- LOCAL_QUORUM write: requires 2 from us-east-1 only. SUCCEEDS.
- QUORUM write: requires 4 of the 6 total replicas. FAILS because the 3 us-west-2 replicas cannot respond.
graph TD
A[Write with EACH_QUORUM] --> B{DC us-west-2 available?}
B -->|No| C[Write FAILS - cannot satisfy EACH_QUORUM]
B -->|Yes| D[Write succeeds if both DCs have quorum]
C --> E[Even though us-east-1 has quorum]
E --> F[us-east-1: 3/3 replicas acknowledge]
F --> G[But requirement: quorum in EVERY DC]
G --> C
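The scenario above can be turned into a small calculator (a pure-Python sketch; DC names and replica counts mirror the example):

```python
# Sketch: does a write succeed at a given consistency level, given how many
# replicas are alive per DC? RF is 3 in each DC, as in the example above.
def quorum(rf: int) -> int:
    return rf // 2 + 1

def write_succeeds(level: str, alive: dict, rf: dict, local_dc: str) -> bool:
    if level == "EACH_QUORUM":   # quorum in every DC
        return all(alive[dc] >= quorum(rf[dc]) for dc in rf)
    if level == "LOCAL_QUORUM":  # quorum in the coordinator's DC only
        return alive[local_dc] >= quorum(rf[local_dc])
    if level == "QUORUM":        # quorum over all replicas cluster-wide
        return sum(alive.values()) >= quorum(sum(rf.values()))
    raise ValueError(level)

rf = {"us-east-1": 3, "us-west-2": 3}
alive = {"us-east-1": 3, "us-west-2": 0}   # us-west-2 is down

for level in ("EACH_QUORUM", "LOCAL_QUORUM", "QUORUM"):
    print(level, write_succeeds(level, alive, rf, "us-east-1"))
# EACH_QUORUM False, LOCAL_QUORUM True, QUORUM False (needs 4 of 6)
```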
Why teams pick EACH_QUORUM incorrectly:
EACH_QUORUM seems appealing for “strong consistency across DCs,” but most applications do not need it. LOCAL_QUORUM gives you strong consistency within each DC with eventual consistency across DCs, which matches how most applications actually work.
| Consistency Level | Use When |
|---|---|
| EACH_QUORUM | Financial transactions requiring all DCs to acknowledge; accepting that any DC failure causes unavailability |
| LOCAL_QUORUM | Most multi-DC deployments; strong locally, eventual across DCs |
| LOCAL_ONE | Non-critical reads; can tolerate stale local data |
Read consistency cross-DC:
Reads also behave unexpectedly with EACH_QUORUM:
# EACH_QUORUM read - returns data if ALL DCs have quorum available
# If any DC is down, returns unavailable
query = SimpleStatement("""
SELECT * FROM orders.customers WHERE customer_id = %s
""", consistency_level=ConsistencyLevel.EACH_QUORUM)
For reads, EACH_QUORUM waits for a quorum of responses from every DC and returns the newest version among them, so every read's latency is bounded by the slowest datacenter, which is rarely what you want.
DC failure during write:
When a DC goes down mid-write with EACH_QUORUM:
- Coordinator receives write
- Successfully writes to available DCs (us-east-1: 3/3)
- Fails to write to down DC (us-west-2: 0/3)
- Returns error to client
- Down DC has no record of the write
When the DC comes back up, it replays hints (if hints are enabled) but may miss writes that completed during the outage window.
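Hinted handoff can be modeled as a per-replica queue of missed writes (a toy sketch; real hints are stored on disk and replayed subject to the hint window):

```python
# Toy hinted handoff: the coordinator stores a hint for each down replica
# and replays it when the replica recovers, unless the hint has expired.
HINT_WINDOW = 3 * 3600  # seconds; mirrors the 3-hour default window

class Coordinator:
    def __init__(self, replicas):
        self.up = {r: True for r in replicas}
        self.data = {r: {} for r in replicas}
        self.hints = {r: [] for r in replicas}   # (key, value, written_at)

    def write(self, key, value, now):
        for r in self.up:
            if self.up[r]:
                self.data[r][key] = value
            else:
                self.hints[r].append((key, value, now))  # queue a hint

    def recover(self, replica, now):
        self.up[replica] = True
        for key, value, written_at in self.hints[replica]:
            if now - written_at <= HINT_WINDOW:  # expired hints are dropped
                self.data[replica][key] = value
        self.hints[replica] = []

c = Coordinator(["R1", "R2", "R3"])
c.up["R3"] = False
c.write("order-1", "shipped", now=0)
c.recover("R3", now=600)            # back within the window: hint replayed
print(c.data["R3"].get("order-1"))  # shipped
```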
Mitigation strategies:
- Use LOCAL_QUORUM instead of EACH_QUORUM unless you have a specific requirement for all-DC acknowledgment.
- Configure hinted handoff for DC failure recovery:
# cassandra.yaml
hints_directory: /var/lib/cassandra/hints
max_hint_window_in_ms: 10800000 # 3 hours; hints stop accumulating for nodes down longer
hints_flush_period_in_ms: 10000
- Monitor DC availability:
from cassandra.cluster import Cluster
cluster = Cluster(['192.168.1.1'])
session = cluster.connect()
# system.peers has no index on data_center, so filter client-side
rows = session.execute("SELECT peer, data_center, rack, rpc_address FROM system.peers")
for row in rows:
    if row.data_center == 'us-west-2':
        print(f"DC us-west-2 node: {row.rpc_address}")
# system.peers does not record up/down state; use nodetool status, or the
# driver's cluster metadata, where each Host exposes is_up
for host in cluster.metadata.all_hosts():
    print(host.address, host.datacenter, host.is_up)
- Set write acknowledgments appropriately:
-- Acknowledgment scope is set per query via the consistency level, not in
-- the keyspace definition; the keyspace controls replica placement:
CREATE KEYSPACE orders WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3,
  'us-west-2': 3
} AND DURABLE_WRITES = true; -- commit-log durability (the default)
-- For global acknowledgment (financial/regulatory): write at EACH_QUORUM
-- For local-first writes (most applications): write at LOCAL_QUORUM
Secondary Indexes
Secondary indexes in Cassandra are local to each node. When you create an index on an attribute, each node indexes its local data.
-- Create secondary index on frequently queried column
CREATE INDEX IF NOT EXISTS idx_customer_email
ON orders.customers (email);
Secondary indexes work best for moderate-cardinality columns spread across many partitions. They break down for high-cardinality columns like UUIDs: every query must fan out to all nodes, and each node's index grows nearly as large as the data itself.
For high-cardinality lookups, use a separate table. Denormalization is expected in Cassandra and often beats indexed queries.
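The lookup-table pattern can be sketched application-side (plain dicts stand in for the two Cassandra tables; the table and column names are illustrative):

```python
# Denormalization sketch: instead of indexing customers.email, maintain a
# second table keyed by email and write to both tables on every insert.
customers_by_id = {}     # stands in for orders.customers (PK: customer_id)
customers_by_email = {}  # stands in for orders.customers_by_email (PK: email)

def insert_customer(customer_id, email, name):
    row = {"customer_id": customer_id, "email": email, "name": name}
    customers_by_id[customer_id] = row   # write 1: primary table
    customers_by_email[email] = row      # write 2: query table

def find_by_email(email):
    # Single-partition read instead of a cluster-wide index scan
    return customers_by_email.get(email)

insert_customer("cust-123", "alice@example.com", "Alice")
print(find_by_email("alice@example.com")["customer_id"])  # cust-123
```

The cost is one extra write per insert, which Cassandra's write path absorbs cheaply; the benefit is that every email lookup hits exactly one partition.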
Use Cases Where Cassandra Excels
Cassandra fits when:
- High write throughput: Logging, IoT sensor data, event streams
- Time-series data: Metrics, sensor readings, user activity
- Geo-distributed data: Multi-region with local reads
- Always-available writes: Dynamo-style “your write will never be rejected”
Cassandra struggles when:
- Read-heavy with complex queries: PostgreSQL or Elasticsearch
- ACID transactions across entities: Cassandra’s per-partition transactions cannot span entities
- Small datasets: Overhead only makes sense at scale
- Strong consistency: QUORUM or ALL work, but expect latency
Related Concepts
- NoSQL Databases covers the broader NoSQL landscape
- CAP Theorem explains why Cassandra chose tunable availability
- Consistent Hashing describes the partitioning mechanism
- Database Replication explains the replica synchronization model
When to Use / When Not to Use Cassandra
When to Use Cassandra
For write-heavy workloads at scale, Cassandra truly shines. IoT sensor data, log aggregation, and event streams all play to Cassandra’s strengths because it is built from the ground up for append-heavy access patterns. If you need to write to any region and read locally with tunable consistency, multi-region active-active deployment comes naturally. Cassandra also never rejects writes — that Dynamo-style “your write will never be rejected” semantics means no capacity planning surprises. Time-series data works well too, since timestamp-based access patterns map naturally to Cassandra’s partition model.
When Not to Use Cassandra
That said, Cassandra is not the right tool for every job. Read-heavy workloads with complex queries do not play to its strengths: it is not a general-purpose database, and complex secondary-index queries are slow and do not scale. Atomicity is per-partition only, so cross-entity operations require saga patterns or similar workarounds. If strong consistency is a hard requirement, QUORUM reads and writes give it to you at a latency cost, and true linearizability requires lightweight transactions. And for small datasets, the overhead of Cassandra’s architecture only pays off once you reach meaningful scale.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Node failure during write | Write succeeds if replication factor met; failed node goes into hinted handoff recovery | Monitor node health; use nodetool repair regularly; hinted handoff queues have a 3-hour window |
| Tombstone accumulation | Deleted data leaves tombstones; queries must scan past them; read latency spikes | Configure gc_grace_seconds appropriately; run compaction regularly; avoid delete-heavy workloads without compaction strategy |
| Partition size exceeds limit | Wide rows with millions of cells cause read timeouts, memory pressure | Monitor partition size; split large partitions; use bucketing to limit partition width |
| Repair-induced latency spike | nodetool repair is I/O intensive; can cause read latency spikes on affected nodes | Run repairs during low-traffic windows; use incremental repair; consider Cassandra reaper for scheduling |
| Consistency level misconfiguration | ONE reads may return stale data from failed replica | Use QUORUM for important reads; LOCAL_ONE for low-latency acceptable-staleness reads; understand each level’s semantics |
| Batch statement misuse | Logged batch is fine; unlogged batch across multiple partitions is an anti-pattern | Use logged batches only for ordered updates to the same partition; treat batch as a last resort |
| SSTable compaction pressure | Compaction struggles to keep up with write throughput; disk fills | Choose compaction strategy based on workload (TWCS for time-series, LCS for general); monitor compaction backlog |
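The bucketing mitigation from the table above can be sketched as follows (day-sized buckets; the key layout is illustrative):

```python
# Sketch: cap partition width by adding a time bucket to the partition key.
# The key becomes (sensor_id, day_bucket) instead of just sensor_id, so one
# sensor's readings are split across one partition per day.
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400

def partition_key(sensor_id, ts):
    day_bucket = int(ts.timestamp()) // SECONDS_PER_DAY
    return (sensor_id, day_bucket)

t1 = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc)
t3 = datetime(2024, 1, 2, 0, 1, tzinfo=timezone.utc)

print(partition_key("sensor-7", t1) == partition_key("sensor-7", t2))  # True
print(partition_key("sensor-7", t2) == partition_key("sensor-7", t3))  # False
```

Queries then include the bucket in the WHERE clause, so a day of readings is still a single-partition read while no partition grows without bound.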
Summary
Cassandra is purpose-built for write-heavy, globally distributed workloads where availability matters more than strong consistency. Its peer-to-peer architecture removes single points of failure, and tunable consistency lets you dial in the right trade-off per query.
Key architectural decisions:
- Append-only write path for throughput
- Peer-to-peer gossip for coordination
- Consistency levels from ONE to ALL
- Compaction strategies for different access patterns
- NetworkTopologyStrategy for DC awareness
Cassandra rewards good data modeling. Design tables around your query patterns, denormalize aggressively, and pick the right compaction strategy. Do this and you have a database scaling to millions of writes per second with predictable latency.
Facebook chose Cassandra to handle billions of messages with five-nines availability. That same architecture serves IoT platforms, messaging systems, and analytics pipelines today.