Column-Family Databases: Cassandra and HBase Architecture
Column-family databases organize data into column families, rows, and columns. They are optimized for write-heavy workloads, massive scale, and queries that retrieve many columns of a single row. The “wide-column” model sounds like a spreadsheet but behaves very differently from relational databases.
Cassandra and HBase are the two major players in this space. They share the wide-column model but differ significantly in architecture and operational characteristics.
The Wide-Column Model
Despite the name, column-family databases are not column-oriented like analytical databases (ClickHouse, Redshift). They store data row by row, but group columns into families.
-- Cassandra CQL looks like SQL but isn't
CREATE TABLE users (
user_id uuid PRIMARY KEY,
profile frozen<profile_type>,
preferences map<text, text>,
created_at timestamp
);
Under the hood, data is stored as:
RowKey: user-123
|
+-- Column Family: profile
| +-- Column: first_name, value: "Alice", timestamp: 1709256000
| +-- Column: last_name, value: "Chen", timestamp: 1709256000
| +-- Column: bio, value: "Engineer", timestamp: 1709256000
|
+-- Column Family: preferences
+-- Column: theme, value: "dark", timestamp: 1709256000
+-- Column: language, value: "en", timestamp: 1709256000
Each cell has a value and a timestamp. The timestamp drives last-write-wins conflict resolution between replicas, and in HBase it also supports cell versioning (a configurable number of versions is kept per cell).
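Last-write-wins resolution can be sketched in a few lines (a hypothetical Cell class; real storage engines reconcile cells like this during reads and compaction):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    """A single (value, timestamp) pair, as stored per column."""
    value: str
    timestamp: int  # microseconds since epoch, supplied by the writer

def resolve(replica_cells):
    """Last-write-wins: the cell with the highest timestamp survives."""
    return max(replica_cells, key=lambda c: c.timestamp)

# Two replicas hold different versions of the same column
stale = Cell("Alice", 1709256000_000000)
fresh = Cell("Alicia", 1709256060_000000)
print(resolve([stale, fresh]).value)  # the later write wins
```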
Cassandra: Distributed by Default
Cassandra is designed as a distributed, masterless database. Every node can serve any request. Data is replicated across nodes based on a configurable replication factor.
Partition Key Design
The partition key determines data distribution. All columns with the same partition key reside on the same node(s).
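How the partition key alone determines placement can be sketched with a toy three-node cluster (md5 modulo node count standing in for Cassandra's Murmur3 token ring with vnodes):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # toy cluster

def node_for(partition_key: str) -> str:
    """Hash the partition key and map the resulting token onto a node.

    Cassandra uses Murmur3 over a token ring; md5 modulo node count is a
    simplification that shows the same idea: the partition key alone
    decides where the data lives.
    """
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return NODES[token % len(NODES)]

# Every row sharing a partition key lands on the same node
assert node_for("user-123") == node_for("user-123")
```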
-- Good: high cardinality column as partition key
CREATE TABLE orders_by_user (
user_id uuid,
order_id timeuuid,
total decimal,
status text,
items list<text>,
PRIMARY KEY (user_id, order_id)
);
-- Query: get all orders for a user (efficient - single partition lookup)
SELECT * FROM orders_by_user WHERE user_id = ?;
-- Bad: low cardinality as partition key
CREATE TABLE orders_by_status (
user_id uuid,
order_id timeuuid,
status text,
PRIMARY KEY (status, order_id)
);
-- Query: get all orders with status = 'pending' (scans many partitions)
SELECT * FROM orders_by_status WHERE status = 'pending';
Data Modeling for Query Patterns
In Cassandra, you model for queries, not entities. You create separate tables for each access pattern.
-- Table for user profile lookups
CREATE TABLE user_profiles (
user_id uuid PRIMARY KEY,
email text,
name text,
created_at timestamp
);
-- Table for looking up users by email
CREATE TABLE user_emails (
email text PRIMARY KEY,
user_id uuid
);
-- Table for getting all users (for admin views)
CREATE TABLE all_users (
tenant_id uuid,
user_id uuid,
email text,
name text,
PRIMARY KEY (tenant_id, user_id)
);
-- Table for time-based activity queries
CREATE TABLE user_activity (
user_id uuid,
activity_date date,
activity_hour int,
event_type text,
event_data map<text, text>,
PRIMARY KEY ((user_id), activity_date, activity_hour, event_type)
);
Consistency Levels
Cassandra offers tunable consistency. You can trade consistency for availability.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(['cassandra-node-1', 'cassandra-node-2', 'cassandra-node-3'])
session = cluster.connect('mykeyspace')

# Write with quorum consistency (most common); the Python driver sets
# consistency on the statement, not as an execute() keyword
insert = SimpleStatement(
    "INSERT INTO user_profiles (user_id, email, name) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM
)
session.execute(insert, [user_id, email, name])

# Read with quorum
select = SimpleStatement(
    "SELECT * FROM user_profiles WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM
)
result = session.execute(select, [user_id])
# Consistency levels:
# ONE: one replica acknowledges (fastest, weakest)
# TWO / THREE: exactly two or three replicas acknowledge
# QUORUM: a majority of replicas (floor(RF/2) + 1)
# ALL: every replica; strongest, but unavailable if any replica is down
# LOCAL_ONE: one replica in the local datacenter
# LOCAL_QUORUM: a quorum within the local datacenter
Time-Series Modeling
Cassandra handles time-series data well when modeled correctly.
-- Time-series table with bucketing
CREATE TABLE sensor_readings (
sensor_id text,
bucket date, -- daily bucket
reading_time timestamp,
temperature float,
humidity float,
PRIMARY KEY ((sensor_id, bucket), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);
-- Query a specific sensor for a day
SELECT * FROM sensor_readings
WHERE sensor_id = 'temp-sensor-001'
AND bucket = '2024-03-15';
-- Query for a time range within a bucket
SELECT * FROM sensor_readings
WHERE sensor_id = 'temp-sensor-001'
AND bucket = '2024-03-15'
AND reading_time > '2024-03-15T10:00:00'
AND reading_time < '2024-03-15T14:00:00';
The bucket prevents unbounded partition growth: each partition holds one day's worth of data for one sensor.
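Deriving the composite partition key on the client side is trivial; a sketch with hypothetical helper names:

```python
from datetime import datetime, timezone

def bucket_for(ts: datetime) -> str:
    """Daily bucket component of the partition key (YYYY-MM-DD)."""
    return ts.date().isoformat()

def partition_key(sensor_id: str, ts: datetime) -> tuple:
    """Composite partition key: one partition per sensor per day."""
    return (sensor_id, bucket_for(ts))

reading_time = datetime(2024, 3, 15, 10, 30, tzinfo=timezone.utc)
print(partition_key("temp-sensor-001", reading_time))
# ('temp-sensor-001', '2024-03-15')
```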
The diagram below previews HBase's storage path; the next section covers HBase in detail.
flowchart TD
W["Write Request"]
WL["WAL<br/>(Write-Ahead Log)"]
MS["MemStore<br/>(In-Memory)"]
HF["HFiles<br/>(Disk)"]
RS["Read Request"]
BC["BlockCache"]
Merge["HFiles<br/>(Merged)"]
W --> WL --> MS -->|Flush when full| HF
RS --> BC
BC --> MS
Merge -->|Compaction| HF
HF -.->|"On read"| Merge
HBase’s write path goes through WAL (durability) then MemStore (speed). Reads check BlockCache first, then MemStore, then merged HFiles. Compaction runs in the background to merge HFiles and remove deleted cells.
HBase: Hadoop’s Column Store
HBase is built on top of HDFS (Hadoop Distributed File System). It integrates with the Hadoop ecosystem and is optimized for sequential reads and writes.
Architecture
HBase stores data in HFiles on HDFS. Regions serve ranges of row keys and split as they grow. HBase uses ZooKeeper for coordination.
Client -> ZooKeeper -> Meta Table (lists regions) -> Region Server
|
v
HFile (stored on HDFS)
Data Model
HBase uses a sparse, multi-dimensional sorted map.
Key: (Row Key, Column Family, Column Qualifier, Timestamp)
Value: byte[]
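That sorted-map model can be illustrated with a plain Python dict iterated in key order (a sketch; note that HBase actually orders timestamps newest-first within a cell):

```python
# HBase's data model as a plain sorted map: the key is the tuple
# (row key, column family, qualifier, timestamp), the value is bytes
cells = {}

def put(row, family, qualifier, ts, value: bytes):
    cells[(row, family, qualifier, ts)] = value

put("user-123", "cf", "event_type", 1709256000, b"purchase")
put("user-123", "cf", "event_type", 1709256060, b"refund")
put("user-456", "cf", "event_type", 1709256030, b"view")

# Iterating in sorted key order keeps all of a row's cells adjacent,
# which is why row-key prefix scans are cheap
for key in sorted(cells):
    print(key, cells[key])
```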
from datetime import datetime
from happybase import Connection

connection = Connection('hbase-host')
table = connection.table('user_events')

# Store events (row key combines user and event time for range scans)
row_key = f'user-123#{datetime.now().isoformat()}'
table.put(row_key, {
    'cf:event_type': 'purchase',
    'cf:amount': '129.99',
    'cf:product_id': 'SKU-001',
    'cf:timestamp': datetime.now().isoformat()
})

# Scan with filters
for key, data in table.scan(row_start='user-123',
                            row_stop='user-124',
                            filter="SingleColumnValueFilter('cf', 'event_type', =, 'binary:purchase')"):
    print(key, data)

# Get a specific row
row = table.row('user-123#2024-03-15T10:30:00')
Row Key Design
Row keys in HBase determine locality. Sequential keys cause write hotspots.
# Bad: sequential keys concentrate all new writes on one region server
row_key = f'order-{order_id:010d}'  # order-0000000001, order-0000000002...
# Better: salted or randomized keys distribute writes,
# at the cost of fan-out complexity on reads
import uuid
row_key = f'{uuid.uuid4().hex}#user-123'
# For time-series: a date prefix keeps one day's keys scannable while the
# random component spreads that day's writes across the day's regions
row_key = f'2024-03-15#{uuid.uuid4().hex}#user-123'
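A hash-derived salt, unlike a purely random prefix, can be recomputed at read time from the natural key. A sketch (hypothetical helper, 16 salt buckets assumed):

```python
import hashlib

def salted_key(natural_key: str, buckets: int = 16) -> str:
    """Prefix the key with a stable, hash-derived salt so sequential
    natural keys spread across `buckets` distinct key ranges. Because the
    salt is derived from the key itself, point reads can recompute it."""
    salt = int(hashlib.md5(natural_key.encode()).hexdigest(), 16) % buckets
    return f"{salt:02d}#{natural_key}"

# Sequential order IDs no longer sort into one region
print(salted_key("order-0000000001"))
print(salted_key("order-0000000002"))
# Full scans must now fan out across all 16 salt prefixes
```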
Read/Write Paths
HBase writes go through a Write-Ahead Log (WAL) and MemStore before being flushed to HFiles.
Write: Client -> WAL -> MemStore (in memory) -> Flush to HFile
Read:  Client -> BlockCache -> MemStore -> HFiles (merged)
Compaction merges HFiles, removing deleted or overwritten cells.
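A simplified major compaction can be sketched as a merge of sorted runs that keeps only the newest cell per key and drops keys whose newest cell is a delete marker (toy HFile tuples, not the real on-disk format):

```python
import heapq

# Each "HFile" is a sorted list of (key, timestamp, value) cells;
# value None marks a delete tombstone
hfile_1 = [("row1", 100, "v1"), ("row2", 100, "old")]
hfile_2 = [("row2", 200, None), ("row3", 150, "v3")]  # row2 deleted later

def compact(*hfiles):
    """Merge sorted HFiles, keep the newest cell per key, and drop keys
    whose newest cell is a tombstone: a simplified major compaction."""
    merged = {}
    for key, ts, value in heapq.merge(*hfiles):
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return [(k, ts, v) for k, (ts, v) in sorted(merged.items()) if v is not None]

print(compact(hfile_1, hfile_2))
# [('row1', 100, 'v1'), ('row3', 150, 'v3')]
```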
Comparing Cassandra and HBase
| Characteristic | Cassandra | HBase |
|---|---|---|
| Architecture | Masterless, every node equal | Master/RegionServer with ZooKeeper |
| Consistency | Tunable per-query | Strong consistency only |
| Write Performance | Excellent | Good |
| Query Language | CQL (SQL-like) | No native SQL (requires Phoenix) |
| Secondary Indexes | Limited, local | Limited |
| Hadoop Integration | Limited | Native MapReduce, Spark |
| Operational Complexity | Lower (no ZooKeeper) | Higher (ZooKeeper, HDFS) |
| Global Scans | Possible but expensive | Efficient via MapReduce |
| Use Case Fit | Write-heavy, multi-datacenter | Hadoop ecosystem, sequential access |
When to Use / When Not to Use Column-Family Databases
Use them when:
- You need to handle millions of writes per second with horizontal scaling
- Your access patterns are known upfront and map to partition keys
- You need multi-datacenter replication with tunable consistency (Cassandra)
- Time-series data with bucketing fits your workload
- You’re in the Hadoop ecosystem and need native Spark/Hive integration (HBase)
Do not use them when:
- You need ad-hoc queries on arbitrary columns — CQL is not SQL and joins are not supported
- Your team can’t manage ZooKeeper, HDFS, and region management (HBase)
- You need strong consistency across wide range queries with complex aggregation
- Your data model changes frequently — wide-column schemas require rethinking partition keys
- You need real-time secondary indexing across non-key columns
Partition Key Strategies
The partition key is the most critical design decision.
Goals for Partition Keys
- Distribute writes evenly across partitions
- Keep related data in the same partition
- Avoid partitions that grow too large
- Enable efficient range queries when needed
-- E-commerce: multiple access patterns
-- Pattern 1: User-centric
CREATE TABLE user_orders (
user_id uuid,
order_id timeuuid,
total decimal,
items list<text>,
PRIMARY KEY (user_id, order_id)
);
-- Pattern 2: Product-centric (for "who bought this?")
CREATE TABLE product_orders (
product_id uuid,
order_id timeuuid,
user_id uuid,
quantity int,
PRIMARY KEY (product_id, order_id)
);
-- Pattern 3: Time-based (for analytics)
CREATE TABLE daily_sales (
date date,
product_id uuid,
order_id timeuuid,
quantity int,
PRIMARY KEY (date, product_id, order_id)
) WITH CLUSTERING ORDER BY (product_id ASC, order_id DESC);
Avoiding Hotspots
-- Bad: one row per partition, no efficient range queries
CREATE TABLE events (
event_id bigint,
data text,
PRIMARY KEY (event_id)
);
-- Every event is its own tiny partition; range scans must hit every node,
-- and with an order-preserving partitioner new writes hotspot one node
-- Good: hash the key to distribute
CREATE TABLE events (
event_id bigint,
partition_key text, -- first part of composite key
data text,
PRIMARY KEY ((partition_key), event_id)
);
-- Better: use a natural key that has cardinality
CREATE TABLE user_events (
user_id uuid,
event_id timeuuid,
data text,
PRIMARY KEY ((user_id), event_id)
);
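To see why a hash partitioner spreads even strictly sequential keys, a toy experiment (md5 standing in for Murmur3, four imaginary nodes):

```python
import hashlib
from collections import Counter

def token(key: str) -> int:
    """Stand-in for Murmur3: map a partition key to a token."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Strictly increasing event IDs hash to well-spread tokens, so a hash
# partitioner distributes them evenly across the toy nodes; the real
# problem with event_id-only keys is partition granularity, not placement
placement = Counter(token(f"event-{i}") % 4 for i in range(10_000))
print(placement)  # each node receives roughly 2,500 keys
```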
Time-Series Data Modeling
Time-series is a natural fit for wide-column stores.
-- Cassandra: bucket by time, query by partition
CREATE TABLE metrics (
metric_name text,
bucket text, -- YYYY-MM-DD
timestamp timestamp,
value double,
tags map<text, text>,
PRIMARY KEY ((metric_name, bucket), timestamp)
);
-- Query: get all CPU metrics for a host on a day
SELECT * FROM metrics
WHERE metric_name = 'cpu.usage'
AND bucket = '2024-03-15'
AND timestamp > '2024-03-15T00:00:00'
AND timestamp < '2024-03-15T23:59:59';
# HBase: store metrics with a composite row key
# Row key: metric_name#hostname#timestamp
from happybase import Connection
import time

def store_metric(connection, metric_name, hostname, value, tags=None):
    table = connection.table('metrics')
    timestamp = int(time.time() * 1000)
    row_key = f'{metric_name}#{hostname}#{timestamp:013d}'
    data = {'cf:value': str(value), 'cf:timestamp': str(timestamp)}
    if tags:
        for k, v in tags.items():
            data[f'cf:tag.{k}'] = v
    table.put(row_key, data)

def query_metrics(connection, metric_name, hostname, start_time, end_time):
    table = connection.table('metrics')
    start_row = f'{metric_name}#{hostname}#{start_time:013d}'
    end_row = f'{metric_name}#{hostname}#{end_time:013d}'
    return table.scan(row_start=start_row, row_stop=end_row)
Consistency Tradeoffs
Wide-column databases make interesting consistency tradeoffs.
Cassandra’s Tunable Consistency
With RF=3, QUORUM means 2 of 3 replicas:
- Write: 2 of 3 nodes must acknowledge
- Read: 2 of 3 nodes must respond
- A QUORUM read after a QUORUM write always overlaps the write's replica set (R + W > RF), so it sees the latest acknowledged write; drop either side to ONE and you may read stale data
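Whether a given read/write pairing guarantees overlap follows from the rule R + W > RF; a sketch:

```python
def overlaps(rf: int, write_cl: int, read_cl: int) -> bool:
    """True when every read replica set must intersect the write replica
    set, which guarantees the read sees the latest acknowledged write."""
    return read_cl + write_cl > rf

quorum = 3 // 2 + 1  # QUORUM with RF=3 resolves to 2 replicas

print(overlaps(3, quorum, quorum))  # True: QUORUM write + QUORUM read intersect
print(overlaps(3, quorum, 1))       # False: write QUORUM, read ONE can be stale
```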
# Read repair: when replica responses mismatch on a read,
# Cassandra repairs the stale replicas asynchronously
# For linearizable reads of data managed by lightweight
# transactions (LWTs), set SERIAL consistency on the statement
stmt = SimpleStatement(..., consistency_level=ConsistencyLevel.SERIAL)
session.execute(stmt)
# LWTs run Paxos under the hood; regular reads and writes
# keep their normal consistency levels
HBase’s Strong Consistency
HBase guarantees strong consistency. Every read sees the latest write. This comes at the cost of potential latency during region splits or failover.
Handling Inconsistency
# Application-level retry for timed-out writes
import time
from cassandra import WriteTimeout

def write_with_retry(session, key, value, max_retries=3):
    for attempt in range(max_retries):
        try:
            session.execute(
                "INSERT INTO events (key, value) VALUES (%s, %s)",
                [key, value]
            )
            return
        except WriteTimeout:
            if attempt == max_retries - 1:
                raise
            time.sleep(0.1 * (attempt + 1))  # linear backoff
# Conflict resolution strategies:
# 1. Last-write-wins (use TTL for automatic cleanup)
# 2. Application-defined merge (examine timestamps, values)
# 3. CRDTs (Conflict-free Replicated Data Types)
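Strategy 3 can be illustrated with the simplest CRDT, a last-write-wins register (a sketch, not a production implementation):

```python
class LWWRegister:
    """Last-write-wins register: a tiny state-based CRDT. Merging two
    replicas keeps the (timestamp, value) pair with the higher timestamp,
    so all replicas converge regardless of merge order."""
    def __init__(self, value=None, timestamp=0):
        self.value, self.timestamp = value, timestamp

    def set(self, value, timestamp):
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

    def merge(self, other):
        self.set(other.value, other.timestamp)

a, b = LWWRegister(), LWWRegister()
a.set("draft", timestamp=1)
b.set("published", timestamp=2)
a.merge(b)  # replicas exchange state in either direction
b.merge(a)
print(a.value, b.value)  # both converge to "published"
```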
Capacity Estimation: Partition Sizing and Bloom Filter Memory
Cassandra partitions grow until they hit two limits: 2 billion cells per partition (a hard internal limit) or 100MB of data per partition (a practical soft limit for read performance). Beyond 100MB, reads become slow because deserializing a large partition consumes significant heap.
The formula for estimating partition size: average row size times rows per partition. If each row is 1KB and you expect 50,000 rows per partition, that is 50MB — manageable. If rows grow to 10KB over time (more columns get added), you are at 500MB per partition and reads start timing out. Budget headroom: size partitions for peak row count, not average.
Cassandra uses bloom filters to avoid reading SSTables that do not contain the requested partition key. Each SSTable has its own bloom filter, sized at roughly 1-2 bytes per partition key. For a node holding 1 billion partition keys, that works out to roughly 1-2GB of bloom filter memory, which matches the commonly cited rule of thumb of 1-2GB per billion partitions. That is a meaningful but affordable cost, and it is why bloom filters are so effective at eliminating unnecessary disk reads.
HBase also uses bloom filters at the StoreFile level. The block size (default 64KB or 128KB) determines how many disk reads a bloom filter can eliminate. With a bloom filter, HBase can skip entire blocks when the row key is confirmed absent. Without it, HBase reads the block index and then the data block.
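The mechanics behind those per-SSTable filters can be sketched with a minimal bloom filter (illustrative sizes; real implementations tune bit count and hash count to a target false positive ratio):

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash functions set k bits per key.
    Membership tests can false-positive but never false-negative, which
    is what makes it safe for skipping SSTables/HFiles entirely."""
    def __init__(self, size_bits=8192, hashes=3):
        self.size, self.k, self.bits = size_bits, hashes, 0

    def _positions(self, key: str):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("user-123")
print(bf.might_contain("user-123"))  # True; never a false negative
print(bf.might_contain("user-999"))  # almost certainly False at this load
```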
For capacity planning: estimate your partition count by dividing total data by the target partition size (100MB). With 1TB of data and a 100MB target, that is at least 10,000 logical partitions; with replication factor 3, about 30,000 partition replicas spread across the cluster. Size the cluster so each node's share stays manageable; a commonly cited Cassandra guideline is 1-2TB of data per node.
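The arithmetic above, as a quick sanity check (decimal units assumed for round numbers):

```python
def partition_count(total_bytes: int, target_partition_bytes: int, rf: int = 3):
    """Minimum logical partitions for a data set, plus the total number
    of partition replicas once the replication factor is applied."""
    logical = total_bytes // target_partition_bytes
    return logical, logical * rf

TB, MB = 10**12, 10**6  # decimal units
logical, replicas = partition_count(1 * TB, 100 * MB, rf=3)
print(logical, replicas)  # 10000 30000
```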
Observability Hooks: Cassandra Nodetool and HBase Master Metrics
For Cassandra, nodetool is the primary operational interface. The commands that matter in production: nodetool status shows which nodes are up and how much data each owns. nodetool tpstats shows thread pool statistics — if DroppedMessages is non-zero, your cluster is overloaded and messages are being dropped. nodetool tablestats gives per-table read/write latency, tombstone counts, and bloom filter false positive rates. nodetool proxyhistograms shows the full distribution of latencies across all coordinators.
The bloom filter false positive ratio tells you whether bloom filters are working. A ratio above 1% means too many unnecessary disk reads are occurring: the bloom filter is not eliminating enough lookups for absent keys. This usually happens when partition sizes grow too large or when bloom_filter_fp_chance was set too high at table creation.
For HBase, the HBase Master UI exposes region assignment, load distribution, and compaction queue depth. The RegionServer metrics are more critical: MemStoreSize (if it approaches hbase.regionserver.global.memstore.size, writes will be blocked), StoreFileCount (too many StoreFiles indicate compaction is lagging), and compactionQueueSize. If the compaction queue grows faster than it is consumed, read latency degrades.
HBase also exposes metrics via JMX: HRegionServer.metrics shows per-region read/write latencies. The SlowGet metric tracks get operations taking longer than hbase.slow.get.threshold (default 1000ms). If SlowGet is climbing, your BlockCache is likely thrashing or HFiles have grown too large.
For both systems, monitor your compaction impact. Cassandra’s nodetool compactionhistory shows compaction throughput over time. HBase’s admin.getCompactionState() per table shows whether minor or major compactions are running. Compaction I/O competes with normal read/write I/O — if latency spikes coincide with compactions, consider throttling compaction throughput or scheduling it during off-peak hours.
Real-World Case Study: Instagram’s Cassandra at Scale
Instagram’s Cassandra deployment is one of the largest in the world, storing hundreds of terabytes across their cluster. Their primary use case was the activity log — who commented on your post, who followed you, who liked your photo. The problem was that at Instagram’s scale, writing to a user’s activity log meant writing to a single wide partition keyed by user ID, and celebrity accounts (accounts with millions of followers) created partitions that dwarfed regular user partitions.
Their first approach used Cassandra’s wide rows directly: each user had one massive partition containing all their notifications. This worked until they hit the partition size limit and read latencies spiked for users with large follower counts. Their fix was bucketing within the partition — instead of one row per user, they used bucketed keys like user_id#0 through user_id#15 to split one partition into 16 smaller ones. This kept partition size manageable but added query complexity: to get all notifications, they had to query 16 buckets.
The operational lesson from Instagram’s experience: Cassandra’s partition model works until it does not, and when it breaks, it breaks at the worst time (when you are already at scale). Instagram’s team recommends planning partition sizes with 10x headroom over your expected growth — if you expect 10,000 rows per partition, design for 1,000 to leave room for the unexpected.
Interview Questions
Q: You are designing a Cassandra table for an IoT sensor network. Each sensor sends 100 readings per minute, and you need to query readings by sensor and time range. How do you model the partition key?
Model with a composite partition key (sensor_id, bucket) where bucket is a daily or hourly time bucket depending on retention requirements. A daily bucket for each sensor keeps partition size bounded: 100 readings/minute × 60 minutes × 24 hours = 144,000 rows per partition per day, which at 200 bytes per row is roughly 29MB per partition per day. Manageable but tight. If readings are larger or you need finer time buckets, switch to hourly bucketing. The alternative is using the sensor ID as the sole partition key and using clustering columns for time, but that puts all days for one sensor in one partition — unbounded growth.
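The sizing arithmetic from that answer, checked in code (200 bytes per row is an assumption):

```python
READINGS_PER_MIN = 100
ROW_BYTES = 200  # assumed average row size

rows_per_day = READINGS_PER_MIN * 60 * 24        # readings per sensor per day
partition_mb = rows_per_day * ROW_BYTES / 10**6  # daily partition size in MB

print(rows_per_day, partition_mb)  # 144000 28.8
```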
Q: Your Cassandra cluster has one node showing 3x higher latency than other nodes despite similar data sizes. What do you investigate first?
Check nodetool tpstats on the slow node for non-zero DroppedMessages, which indicates the node is overloaded and cannot keep up with requests. Then check nodetool compactionstats — if compactions are running constantly, the node is I/O bound. Also check nodetool tablestats for that node specifically to see if one table has unusually large partitions or a high bloom filter false positive rate. A single large partition forces Cassandra to read more data per query, and if that partition is hosted on one node, that node bears the extra load while others do not.
Q: When would you choose HBase over Cassandra?
HBase makes sense when you are already in the Hadoop ecosystem and need native integration with MapReduce, Spark, or Hive for bulk processing of your data. HBase also makes sense when you need strong consistency guarantees — Cassandra’s tunable consistency means you can accidentally read stale data if your consistency level does not match your requirements. HBase is also preferable when your access patterns are predominantly sequential reads and writes (which HDFS handles well) and when you need efficient wide scans across large tables. If you do not have ZooKeeper expertise and are not in the Hadoop ecosystem, HBase’s operational complexity is a liability.
Q: What happens internally when a Cassandra partition becomes too large?
Cassandra does not split a partition automatically. As a partition approaches the 100MB soft limit, compaction logs large-partition warnings (compaction_large_partition_warning_threshold_mb), and reads must deserialize progressively more data from disk into memory, which increases read latency and heap pressure. At the 2 billion cell hard limit, further writes to the partition fail outright. The practical fix is to redesign the data model to split the partition using bucketing (adding a bucket number to the partition key), which creates separate physical partitions for what was previously one logical partition.
Common Production Failures
Monotonically increasing partition key causing write hotspots: You use event_id bigint as the primary key and insert events with increasing IDs. With an order-preserving partitioner (or sequential row keys in HBase), all new writes land on the same node and throughput collapses under load. Even under Cassandra's default Murmur3 hash partitioner, which does spread sequential keys, a bare event_id gives you one row per partition and no efficient range queries. The solution is a partition key with natural cardinality — something like (user_id, time_bucket) spreads writes across nodes while keeping related rows together.
Unbounded partition growing past Cassandra’s 2B cell limit: You model IoT sensor data with (sensor_id, timestamp) as the primary key but forget to bucket by date. One sensor running for 10 years generates millions of columns in a single partition, eventually hitting internal limits and causing read timeouts. Bucket time-series data by day or week — this is standard practice for a reason.
HBase region hotspot from sequential row keys: You use sequential IDs like order-00000001 as row keys. All recent orders cluster on one region server, which gets overwhelmed while others sit idle. UUIDs, hash-based keys, or reversed timestamps all spread writes more evenly.
Cassandra consistency misconfigured for the replication factor: You set consistency_level=ONE with a replication factor of 3, thinking you're getting good coverage. With ONE, you read from a single replica, which might hold stale data while another replica has the latest write. QUORUM with replication_factor=3 requires 2 of 3 replicas for each operation; using it for both reads and writes guarantees the replica sets overlap, which is what delivers actual consistency.
HBase compaction starving compute resources: HBase runs major compaction at night, merging hundreds of HFiles. On large tables this saturates disk I/O and network bandwidth, and reads slow to a crawl right when your users are waking up. Schedule compactions during off-peak hours or use incremental compaction strategies.
Wrong partition key preventing efficient time-range queries: You model (user_id, order_id) as the primary key, then need to query all orders within a date range regardless of user. Cassandra scatters these across partitions, so range scans hit every node. If time-range access is a real use case, put the date or time bucket in the partition key from the start.
Security Checklist
- Enable Cassandra's authenticator and authorizer settings to enforce role-based access control on keyspaces and tables
- Use TLS for all inter-node communication (Cassandra's server_encryption_options) to prevent traffic sniffing between nodes
- For HBase: enable Kerberos authentication via hbase.security.authentication = kerberos; use ACLs at the namespace and table levels
- Restrict HBase's Thrift and REST ports to internal networks; disable the HBase Master UI port on public interfaces
- Encrypt data at rest: HDFS transparent encryption covers HBase; open-source Cassandra has no built-in SSTable encryption, so use filesystem or volume encryption (enterprise distributions add transparent data encryption)
- Use GRANT/REVOKE on roles in Cassandra to limit access to specific keyspaces and tables (Cassandra has no row- or partition-level access control)
Common Pitfalls and Anti-Patterns
Unbounded partition size: A partition with no upper size limit causes read and write performance to degrade as it grows. Cassandra's 2 billion cell limit is a hard ceiling, and partitions beyond roughly 100MB already hurt reads; HBase has no explicit limit, but very wide rows cause memory pressure. Fix: use bucketed partition keys to split large partitions, and monitor partition statistics continuously.
Tombstone overload during reads: Cassandra marks deleted data with tombstones; queries that must scan many tombstones become slow. Fix: avoid query patterns that scan heavily deleted partitions, tune gc_grace_seconds, and purge droppable tombstones with nodetool garbagecollect.
Using sequential keys as the sole partition key in Cassandra: Keys like timeuuid or monotonic counters used alone yield one row per partition, blocking efficient range scans, and hotspot a single node under an order-preserving partitioner. Fix: pair a high-cardinality natural key with a time bucket in the partition key, or add a randomizing element to the key design.
HBase pre-splitting misconfiguration: Starting with a single region causes all initial writes to hit one region server until a split occurs. Fix: pre-split tables using a hex-split strategy or hash-based split keys.
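Generating evenly spaced split points for a hex-prefixed key space might look like this (hypothetical helper; the result can be passed to the HBase shell's SPLITS option):

```python
def hex_split_keys(n_regions: int, key_width: int = 2) -> list:
    """Evenly spaced hex boundaries for pre-splitting a table whose row
    keys start with a fixed-width hex prefix (e.g. an md5-derived salt)."""
    space = 16 ** key_width
    step = space // n_regions
    return [format(i * step, f"0{key_width}x") for i in range(1, n_regions)]

# 8 regions need 7 split points; usable with the HBase shell, e.g.
#   create 'events', 'cf', SPLITS => ['20', '40', ...]
print(hex_split_keys(8))  # ['20', '40', '60', '80', 'a0', 'c0', 'e0']
```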
Ignoring bloom filter memory: Bloom filters in Cassandra can consume significant memory on nodes with very high partition counts. Fix: monitor the BloomFilterFalsePositives and BloomFilterFalseRatio metrics, and tune bloom_filter_fp_chance where memory is tight.
Quick Recap Checklist
- Wide-column databases suit write-heavy workloads with predictable access patterns and horizontal scaling needs
- Design partition keys for even write distribution; avoid sequential keys and unbounded partition sizes
- Cassandra offers tunable consistency per query; HBase provides strong consistency with ZooKeeper coordination
- Prefer denormalized query tables over Cassandra secondary indexes; use bulk load tools for HBase to avoid region splitting during ingestion
- Monitor tombstone counts, partition sizes, and bloom filter metrics continuously
Conclusion
Wide-column databases trade query flexibility for write scalability and operational simplicity at scale. The data model requires understanding your access patterns upfront, but when modeled correctly, these databases handle massive write volumes efficiently.
Cassandra suits applications needing multi-datacenter replication, tunable consistency, and a SQL-like query interface. HBase suits applications deeply integrated with Hadoop, needing strong consistency and sequential access patterns.
Partition key design determines everything. A poor partition key causes hotspots, inefficient queries, or unbounded partition growth. Spend time on your query patterns before you design your schema — it is the one decision you cannot easily reverse.
For more on NoSQL databases, see the NoSQL overview. For time-series specific databases, see Time-Series Databases. For scaling strategies, see Horizontal Sharding.