Document Databases: MongoDB and CouchDB Data Modeling
Learn MongoDB and CouchDB data modeling, embedding vs referencing, schema validation, and when document stores fit better than relational databases.
Document databases store data as JSON-like documents. Applications that work with hierarchical or tree-shaped data often fit this model well. Unlike relational databases that normalize data across tables, document stores keep related data together in a single document.
This affects everything from how you query to how you scale.
How Document Databases Work
A document database stores self-contained data units. Each document is a JSON object with fields and values. Collections or buckets group related documents together.
// Example MongoDB document
{
"_id": ObjectId("..."),
"username": "alice",
"profile": {
"firstName": "Alice",
"lastName": "Chen",
"bio": "Software engineer at a fintech startup",
"avatar": "https://example.com/alice.jpg",
"preferences": {
"theme": "dark",
"notifications": true,
"language": "en"
}
},
"orders": [
{
"orderId": "ORD-001",
"items": ["laptop", "mouse"],
"total": 1299.99,
"status": "delivered"
},
{
"orderId": "ORD-002",
"items": ["keyboard"],
"total": 149.99,
"status": "processing"
}
],
"createdAt": ISODate("2024-01-15")
}
The entire user profile lives in one document. No joins required.
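As a sketch of what that buys you: reading a nested field or computing a derived value needs only the one document already in hand. Plain Node.js below, with the stored document inlined as an object for illustration:

```javascript
// The user document from above, inlined as a plain object for illustration.
const user = {
  username: "alice",
  profile: { preferences: { theme: "dark", notifications: true } },
  orders: [
    { orderId: "ORD-001", total: 1299.99, status: "delivered" },
    { orderId: "ORD-002", total: 149.99, status: "processing" },
  ],
};

// One fetch returns the whole aggregate; nested reads and derived
// values need no join against other tables or collections.
const theme = user.profile.preferences.theme;
const lifetimeSpend = user.orders.reduce((sum, o) => sum + o.total, 0);

console.log(theme);                     // "dark"
console.log(lifetimeSpend.toFixed(2));  // "1449.98"
```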
flowchart TD
subgraph App["Application Layer"]
C1["Client 1"]
C2["Client 2"]
C3["Client N"]
end
subgraph Router["Query Router / mongos"]
QR["Shard Selector"]
end
subgraph Shards["Sharded Cluster"]
S1["Shard 1<br/>(Chunks: _id 0-100)"]
S2["Shard 2<br/>(Chunks: _id 101-200)"]
S3["Shard 3<br/>(Chunks: _id 201-300)"]
end
subgraph Config["Config Servers"]
CS["Config Server<br/>(Metadata)"]
end
C1 & C2 & C3 --> QR
QR --> S1 & S2 & S3
QR -.->|"Chunk routing table"| CS
MongoDB routes queries through mongos routers that consult config servers for chunk metadata. Each shard holds a subset of documents based on the shard key. The application never directly addresses shards — it talks to the router, which handles the distribution.
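The routing step can be sketched in a few lines. The hypothetical routeToShard helper below mimics what a mongos-style router does with range-based chunks: find the chunk whose range covers the shard key value and return the owning shard (ranges invented for illustration):

```javascript
// Chunk routing table the router would fetch from the config servers.
// Ranges are invented for illustration; min is inclusive, max exclusive.
const chunkTable = [
  { min: 0,   max: 101, shard: "shard1" },
  { min: 101, max: 201, shard: "shard2" },
  { min: 201, max: 301, shard: "shard3" },
];

// Route a shard key value to the shard owning the covering chunk.
function routeToShard(table, shardKeyValue) {
  const chunk = table.find(
    (c) => shardKeyValue >= c.min && shardKeyValue < c.max
  );
  if (!chunk) throw new Error("no chunk covers key " + shardKeyValue);
  return chunk.shard;
}

console.log(routeToShard(chunkTable, 42));  // "shard1"
console.log(routeToShard(chunkTable, 150)); // "shard2"
```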
Embedding vs Referencing
The biggest decision in document database design is whether to embed or reference.
Embedding Documents
Embedding places related data directly inside the parent document.
// Embedded design - all data in one document
{
"orderId": "ORD-001",
"customer": {
"name": "Bob Martinez",
"email": "bob@example.com",
"address": {
"street": "123 Main St",
"city": "Seattle",
"state": "WA",
"zip": "98101"
}
},
"items": [
{ "sku": "LAPTOP-001", "name": "Gaming Laptop", "qty": 1, "price": 1299.99 },
{ "sku": "MOUSE-002", "name": "Wireless Mouse", "qty": 2, "price": 49.99 }
],
"total": 1399.97
}
Embedding works well when data is almost always accessed together, the embedded data stays small, the relationship is composition not association, and you need atomic updates to the whole unit.
The problem with unbounded arrays:
// Problematic: unbounded array grows indefinitely
{
"userId": "user123",
"activityLog": [
{ "action": "login", "timestamp": "2024-01-01" },
{ "action": "purchase", "timestamp": "2024-01-02" },
// ... grows forever
]
}
MongoDB enforces a 16MB BSON document size limit. More practically, documents beyond a few megabytes cause performance problems with indexing, caching, and network transfer.
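One common mitigation is to cap the array at write time. MongoDB can do this in the update itself with $push plus $each and $slice, e.g. { $push: { activityLog: { $each: [event], $slice: -100 } } }. The sketch below simulates that capped-push behavior in plain JavaScript:

```javascript
// Simulates MongoDB's { $push: { $each: [...], $slice: -cap } } update:
// append the new entries, then keep only the newest `cap` elements.
function pushCapped(arr, entries, cap) {
  const combined = arr.concat(entries);
  return combined.slice(Math.max(0, combined.length - cap));
}

let activityLog = [];
for (let day = 1; day <= 250; day++) {
  activityLog = pushCapped(activityLog, [{ action: "login", day }], 100);
}

console.log(activityLog.length); // 100
console.log(activityLog[0].day); // 151 (oldest retained entry)
```

Older entries rotated out this way can be archived to a separate collection if they must be kept.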
Referencing Documents
Referencing stores related data in separate documents and links them by ID.
// Referenced design - separate collections
// Collection: customers
{
"_id": ObjectId("..."),
"name": "Carol Johnson",
"email": "carol@example.com"
}
// Collection: orders
{
"_id": ObjectId("..."),
"customerId": ObjectId("..."), // Reference to customers
"items": [...],
"total": 299.99
}
// Collection: products
{
"_id": ObjectId("..."),
"sku": "KEYBOARD-001",
"name": "Mechanical Keyboard",
"price": 149.99
}
Referencing works well when related data is accessed independently, arrays would grow without bound, you need to share data across multiple parents, and the relationship is association not composition.
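When you reference, either the application or a server-side $lookup stage resolves the link. A minimal application-side join, with in-memory arrays standing in for the two collections:

```javascript
// Stand-ins for the customers and orders collections.
const customers = [
  { _id: "c1", name: "Carol Johnson", email: "carol@example.com" },
  { _id: "c2", name: "Dan Lee", email: "dan@example.com" },
];
const orders = [
  { _id: "o1", customerId: "c1", total: 299.99 },
  { _id: "o2", customerId: "c1", total: 149.99 },
  { _id: "o3", customerId: "c2", total: 89.5 },
];

// Application-side join: index customers by _id, then attach each
// order's customer. This is the work a $lookup stage does server-side.
const byId = new Map(customers.map((c) => [c._id, c]));
const joined = orders.map((o) => ({ ...o, customer: byId.get(o.customerId) }));

console.log(joined[0].customer.name); // "Carol Johnson"
```

The extra round trip (or $lookup cost) is the price paid for a single source of truth.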
The Hybrid Approach
Most real-world schemas use both patterns.
// Hybrid: embed small, stable data; reference large or shared data
{
"blogPost": {
"title": "Understanding Document Databases",
"slug": "understanding-document-databases",
"author": {
// Embed author summary for display
"id": ObjectId("..."),
"name": "Sam Wilson",
"avatar": "https://example.com/sam.jpg"
},
"tags": ["mongodb", "nosql", "data-modeling"],
"publishedAt": ISODate("2024-03-15"),
"content": "...",
"comments": [
// Small, bounded array - OK to embed
{ "author": "user1", "text": "Great explanation!", "date": "..." }
]
},
// Large, unbounded data stored separately
"commentReplies": [...],
"postAnalytics": {...}
}
Embedding vs Referencing: Trade-off Summary
| Aspect | Embedding | Referencing |
|---|---|---|
| Read performance | Single document fetch | Multiple round trips or $lookup aggregation |
| Write atomicity | Atomic within one document | Updates to parent/child independent |
| Data duplication | Denormalized — same data in multiple docs | Normalized — single source of truth |
| Array growth risk | Unbounded arrays hit 16MB limit | No document size concerns |
| Data sharing | Hard to share embedded data across parents | Same child doc can reference multiple parents |
| Update overhead | Updating shared data requires multiple updates | Updating shared data in one place |
| Query flexibility | Limited to data co-located in one doc | Can query children independently |
| Best for | Composition (order contains order-lines), bounded arrays | Association (products in many orders, authors of many posts) |
Schema Validation
Document databases started as schema-less. Modern implementations offer flexible validation.
MongoDB Schema Validation
// Create collection with validators
db.createCollection("products", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["name", "price", "sku"],
properties: {
name: {
bsonType: "string",
description: "Product name is required",
},
price: {
bsonType: "number",
minimum: 0,
description: "Price must be non-negative",
},
sku: {
bsonType: "string",
pattern: "^[A-Z]{3}-[0-9]{6}$",
description: "SKU must match pattern XXX-000000",
},
tags: {
bsonType: "array",
items: { bsonType: "string" },
},
},
},
},
});
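The same rules can also be checked client-side before inserting. The validator below mirrors the $jsonSchema rules above in plain JavaScript; it is a sketch, not MongoDB's actual validation engine:

```javascript
// Mirrors the $jsonSchema rules: required fields, types, a minimum,
// and the SKU pattern XXX-000000.
const SKU_PATTERN = /^[A-Z]{3}-[0-9]{6}$/;

function validateProduct(doc) {
  const errors = [];
  if (typeof doc.name !== "string") errors.push("name must be a string");
  if (typeof doc.price !== "number" || doc.price < 0)
    errors.push("price must be a non-negative number");
  if (typeof doc.sku !== "string" || !SKU_PATTERN.test(doc.sku))
    errors.push("sku must match XXX-000000");
  if (doc.tags !== undefined &&
      (!Array.isArray(doc.tags) || !doc.tags.every((t) => typeof t === "string")))
    errors.push("tags must be an array of strings");
  return errors;
}

console.log(validateProduct({ name: "Desk", price: 199, sku: "FUR-000123" })); // []
console.log(validateProduct({ name: "Desk", price: -5, sku: "bad" }).length);  // 2
```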
CouchDB Design Documents
CouchDB uses design documents to define views and validation functions.
// Design document with validation (CouchDB stores these functions as
// strings; they are shown unquoted here for readability)
{
"_id": "_design/ecommerce",
"_rev": "1-abc123",
"validate_doc_update": function(newDoc, oldDoc, userCtx) {
if (!newDoc.name) {
throw({forbidden: "Products must have a name"});
}
if (typeof newDoc.price !== "number" || newDoc.price < 0) {
throw({forbidden: "Price must be a non-negative number"});
}
},
"views": {
"by_price": {
"map": "function(doc) { if (doc.price) emit(doc.price, doc.name); }"
}
}
}
When Document Stores Make Sense
Content management systems store articles with variable schemas. One post might have an “author” field, another might have “coAuthors”, another might have “translations”. Relational tables would require complex EAV patterns or frequent schema migrations.
Product catalogs with varied attributes work naturally as documents. A clothing product has size and color; a digital product has download links and license keys; a subscription has billing cycles.
User profiles and preferences vary widely. Some users fill out every field; others leave most blank. Document stores handle this variation without NULL handling nightmares.
Real-time analytics with a time-series flavor also fit: event data is embedded in documents that are written once and read frequently.
Applications with complex hierarchical data like organizational charts, file systems, or family trees map naturally to document structures.
When NOT to Use Document Databases
Normalized data requirements demand referential integrity across collections. If you need to enforce that every order references a valid customer, and deleting a customer cascades to all their orders, relational databases handle this better.
Complex multi-document transactions spanning multiple collections are problematic. MongoDB has supported multi-document transactions since version 4.0 (and distributed transactions across shards since 4.2), but the performance and operational costs are significant. If you need ACID guarantees across many documents, consider whether a relational approach or a purpose-built transaction engine is simpler.
Well-structured tabular data with fixed schemas that rarely change performs just as well, often better, in relational databases. If your data model is stable and your access patterns are simple joins, the “flexibility” of document stores becomes a liability.
BI and analytical workloads with complex aggregations across many dimensions work better in columnar stores or specialized analytics databases.
JSON Document Tradeoffs
Document databases embrace JSON, but this has implications.
Advantages
JSON documents are self-describing. You can read any document and understand its structure without consulting an external schema.
Schema evolution becomes simpler. Adding new fields does not require ALTER TABLE statements.
The programming model feels natural for developers. Most web APIs use JSON, so mapping between storage and application is straightforward.
Challenges
Without enforced schemas, your application code must handle data inconsistency. One document might have price: "29.99" as a string; another might have price: 29.99 as a number.
Denormalization means updates can touch multiple documents. If you embed customer name in every order, updating the customer’s name requires updating every order document.
Query optimization requires understanding your access patterns. Unlike relational databases where the query planner chooses indexes, you must design indexes that match your queries.
Query Patterns and Indexing
MongoDB and CouchDB support secondary indexes, but they work differently.
MongoDB Indexes
// Create indexes to support query patterns
db.orders.createIndex({ customerId: 1, createdAt: -1 });
db.orders.createIndex({ status: 1, priority: 1 });
db.products.createIndex({ tags: 1 });
db.products.createIndex({ price: 1 });
// Compound index field order matters
// Index on (a, b) supports queries on {a} and {a, b}
// Does NOT support queries on {b} alone
// Text indexes for search
db.articles.createIndex({ title: "text", content: "text" });
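The prefix rule in the comments above can be sketched as a predicate. indexCanServe below is a simplification (real planners also handle reordered equality fields and sorts), but it captures the core constraint that queried fields must form a prefix of the index's key list:

```javascript
// Simplified prefix-rule check: an index on (a, b, ...) can serve a
// query whose fields are exactly the first N index fields.
// Real query planners also handle reordered equality fields and sorts.
function indexCanServe(indexFields, queryFields) {
  return queryFields.every((field, i) => indexFields[i] === field);
}

const idx = ["customerId", "createdAt"];
console.log(indexCanServe(idx, ["customerId"]));                // true
console.log(indexCanServe(idx, ["customerId", "createdAt"]));   // true
console.log(indexCanServe(idx, ["createdAt"]));                 // false
```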
CouchDB Views
CouchDB uses MapReduce views defined in design documents.
// View definitions (CouchDB stores map/reduce functions as JSON strings,
// as in by_price above; shown unquoted here for readability)
{
"views": {
"orders_by_customer": {
"map": function(doc) {
if (doc.type === "order") {
emit([doc.customerId, doc.createdAt], {
total: doc.total,
itemCount: doc.items.length
});
}
}
},
"revenue_by_month": {
"map": function(doc) {
if (doc.type === "order" && doc.status === "completed") {
var month = doc.createdAt.substring(0, 7); // YYYY-MM
emit(month, doc.total);
}
},
"reduce": "_sum"
}
}
}
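To make the view mechanics concrete, here is a plain JavaScript simulation of the revenue_by_month view: run the map logic over every document, then apply the _sum reduce per key. The sample documents are invented for illustration:

```javascript
// Sample order documents (invented for illustration).
const docs = [
  { type: "order", status: "completed", createdAt: "2024-01-05", total: 100 },
  { type: "order", status: "completed", createdAt: "2024-01-20", total: 50 },
  { type: "order", status: "completed", createdAt: "2024-02-03", total: 75 },
  { type: "order", status: "pending",   createdAt: "2024-02-10", total: 999 },
];

// Map phase: emit (month, total) for completed orders, as the view does.
const emitted = [];
for (const doc of docs) {
  if (doc.type === "order" && doc.status === "completed") {
    emitted.push([doc.createdAt.substring(0, 7), doc.total]);
  }
}

// Reduce phase: CouchDB's built-in _sum groups by key and sums values.
const revenueByMonth = {};
for (const [month, total] of emitted) {
  revenueByMonth[month] = (revenueByMonth[month] || 0) + total;
}

console.log(revenueByMonth); // { "2024-01": 150, "2024-02": 75 }
```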
Common Production Failures
Hotspot documents causing shard imbalance: You choose a shard key like userId for an event-logging collection. A single active user generates 10x more events than average. All that user’s documents land on one shard, which becomes the bottleneck while other shards sit idle. A better shard key spreads writes across nodes — something with high cardinality that doesn’t correlate with user activity.
Unbounded array hitting the 16MB document limit: You embed an activityLog array in user documents without limits. A power user accumulates 5 years of activity — the document grows past 16MB and MongoDB rejects further inserts. The options: archive old activities to separate documents, move to a separate activities collection with referencing, or cap the array and rotate oldest entries out.
Embedding at the wrong granularity causing update storms: You embed individual orderLine items inside an order document. When a product name changes, you update every order that contains that product — one product rename, thousands of document writes. If reference data changes frequently, referencing avoids this cascade.
Missing index on referenced fields: You use customerId references across orders but skip the index. Queries like “find all orders for customer X” do a full collection scan. As soon as you introduce referencing patterns, create createIndex({ customerId: 1 }).
Schema validation blocking legacy documents: You add JSON schema validation to an existing collection with documents that don’t conform. Writes start failing and your application starts returning 500s. Validate your existing dataset before enabling strict validation, or set the validation level to only apply to new inserts.
Capacity Estimation: Shard Sizing for MongoDB
MongoDB sharding splits data across nodes by shard key. Each shard handles a range of shard key values. Sizing shards correctly prevents hot spots and ensures even distribution.
A useful sizing rule: divide your working set, not your total dataset, by the number of shards; each shard needs at least that much RAM to serve reads from memory. If your working set (frequently accessed data) is 50 GB and you have 3 shards, each shard needs ~17 GB of RAM. MongoDB's guidance is that the working set should fit in RAM on each shard.
For a collection with 500 million documents averaging 1 KB each, your raw storage is 500 GB. With a shard key that spreads writes evenly (e.g., userId for user-centric data), each of 5 shards holds 100 GB. If reads are concentrated on recent data (last 30 days, ~100 GB), you need enough RAM per shard to hold the working set — not the entire dataset.
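The arithmetic above, spelled out in decimal units to match the 1 KB and 500 GB figures in the text:

```javascript
// Capacity figures from the text, decimal units (1 KB = 1000 bytes, 1 GB = 1e9 bytes).
const docCount = 500e6;    // 500 million documents
const avgDocBytes = 1000;  // ~1 KB each
const shards = 5;

const totalGB = (docCount * avgDocBytes) / 1e9; // raw storage
const perShardGB = totalGB / shards;            // with an evenly distributing shard key

console.log(totalGB);    // 500
console.log(perShardGB); // 100
```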
Chunk size matters: MongoDB splits chunks at 128MB by default (64MB before version 6.0). If a chunk grows beyond maxChunkSize, MongoDB splits it and the balancer migrates chunks to even out distribution. Unbounded arrays in documents can cause chunk sizes to balloon unexpectedly, triggering surprise migrations.
$collStats tells you per-collection storage and shard distribution:
db.getSiblingDB("mydb").runCommand({ collStats: "orders", scale: 1 });
Key fields: count (document count), size (total bytes), storageSize (physical bytes allocated), totalIndexSize. If storageSize / count is growing unexpectedly, documents are accumulating oversized fields or arrays that were never cleaned up.
Observability Hooks: Document Size Metrics and Shard Balance
For MongoDB, these metrics matter most:
Document size growth rate: track storageSize over time. A collection growing faster than expected usually means unbounded arrays or denormalized fields accumulating. Use db.collection.stats() and graph it in your monitoring tool.
Shard balance: db.adminCommand({ balancerStatus: 1 }) shows whether the balancer is running, and recent chunk migrations are recorded in the config.changelog collection. An unbalanced cluster means some shards serve more traffic than others: reads and writes skew toward the overloaded shard.
Chunk distribution: sh.status() summarizes chunk counts per shard, and db.getSiblingDB('config').chunks.find() lists all chunks and their shard assignments. If one shard owns more than 40% of chunks, your shard key has a hot spot.
For CouchDB, node-level metrics live at /_node/_local/_stats (including request counts and latency by method), and per-database info from GET /{db} reports document count, file size on disk, and active data size. If file size greatly exceeds active data size, compaction is overdue and read performance degrades.
Real-World Case Study: Airbnb’s MongoDB Migration
Airbnb migrated their search infrastructure from a monolithic PostgreSQL setup to MongoDB with sharding. Their problem: search needed to handle queries across millions of listings with sub-100ms latency, while writes (booking events, reviews, pricing updates) poured in continuously. PostgreSQL replicas could not keep up with write volume during peak hours.
Their migration was staged over 18 months — they ran dual-write to both systems, validated query results matched, then cut over search reads in batches. The lesson from their experience: document databases need honest upfront work on access patterns. They spent months modeling their document structure before writing a single production query. Embed vs reference decisions made early were expensive to change later.
The specific pattern they used: listing documents embedded amenity lists, calendar availability, and review summaries. High-volume write paths (availability updates) went to a separate collection with referencing, not embedding. This kept listing document size manageable and avoided update storms when availability windows changed.
What they got wrong initially: they embedded booking history in user documents. A frequent traveler accumulated years of booking history, and those documents grew past the working set threshold — reads became disk-bound instead of memory-bound. They fixed it by moving booking history to a separate collection.
Interview Questions
Q: You are designing a MongoDB schema for an e-commerce product catalog. Some products have custom attributes (a camera has megapixels, a shirt has size/color). How do you approach schema design?
Use a hybrid: fixed fields for common attributes (name, price, description, category) with a flexible attributes object for product-specific fields. Index on category and price for filtering. For product-specific filters, MongoDB can query across attributes if you create partial indexes on specific attribute paths. The tradeoff: flexible attributes are harder to aggregate across products (you cannot easily sum megapixels across all cameras without knowing which products are cameras). If cross-product aggregation matters, a relational or column-store is better. If each product type is queried within its own type, document flexibility wins.
Q: A MongoDB collection has grown to 50 GB across 5 shards, but one shard is using 25 GB while the others use 6 GB each. What is happening and how do you fix it?
Your shard key is causing hot-spot writes: most documents route to one shard. Common causes: using a monotonically increasing _id as part of the shard key (new documents all go to the chunk with the highest range), or a shard key with low cardinality (e.g., country with 5 values on a global dataset). Diagnose with sh.status() or by inspecting the config.chunks collection. Fixing it means changing the shard key: migrate to a new collection with a better key, use reshardCollection (MongoDB 5.0+), or add suffix fields with refineCollectionShardKey (4.4+) if extra suffix fields are enough to spread the load. Hashed sharding on the key is another option when you do not need range queries on it.
Q: What is the difference between CouchDB and MongoDB from an operational perspective?
CouchDB is masterless — every node can accept writes and replicas sync via continuous replication. There is no single point of failure and no need for a replica set failover process. MongoDB has a primary in each replica set; failover promotes a secondary and the application driver reroutes. CouchDB’s replication model is simpler operationally for multi-region deployments but conflict resolution is your problem — if the same document is edited on two nodes during a partition, CouchDB stores both versions and you resolve conflicts in application code. MongoDB’s write concern lets you tune durability vs latency, and conflict resolution is simpler (last write wins at the replica set level). CouchDB wins for edge nodes, mobile sync, and situations where you want peer-to-peer replication. MongoDB wins when you want stronger consistency guarantees and more expressive query language.
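A sketch of the application-side conflict resolution CouchDB expects: given all live revisions of a document (the winner plus its _conflicts), pick a deterministic winner and delete the losers. Here the tiebreaker is a newest-updatedAt rule on an assumed updatedAt field; the HTTP fetch/delete steps are noted in comments since they need a live server:

```javascript
// Deterministic conflict resolution: keep the revision with the newest
// updatedAt (an assumed application-level field) and mark the rest
// for deletion.
function resolveConflicts(revisions) {
  const sorted = [...revisions].sort(
    (a, b) => new Date(b.updatedAt) - new Date(a.updatedAt)
  );
  return { winner: sorted[0], losers: sorted.slice(1) };
}

// Revisions as fetched via GET /{db}/{id}?conflicts=true plus follow-up
// GETs for each conflicting rev (values invented for illustration).
const revisions = [
  { _rev: "2-aaa", bio: "engineer",  updatedAt: "2024-03-01T10:00:00Z" },
  { _rev: "2-bbb", bio: "architect", updatedAt: "2024-03-02T09:00:00Z" },
];

const { winner, losers } = resolveConflicts(revisions);
console.log(winner._rev);               // "2-bbb"
console.log(losers.map((r) => r._rev)); // ["2-aaa"]
// In a real app: PUT the winner's content, DELETE each losing revision.
```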
Security Checklist
- Enable authentication on every MongoDB/CouchDB node; use SCRAM-SHA-256 or equivalent for MongoDB, and Cookie Auth + TLS for CouchDB
- Implement field-level encryption for highly sensitive document sections using a KMIP-compliant key management service
- Use TLS for all internal replica set and shard communication to prevent man-in-the-middle attacks on replica traffic
- Configure CouchDB’s require_valid_user setting to enforce authentication on all endpoints; remove the default admin account
- Audit document access by enabling MongoDB’s auditLog or using CouchDB’s log rotation for access logs
- Restrict network exposure of document database ports using firewall rules or VPC peering; never expose MongoDB default port 27017 to the internet
Common Pitfalls and Anti-Patterns
Unbounded array growth: Storing unbounded arrays in documents (e.g., a comments array that grows indefinitely) causes document size to exceed the 16MB limit and slows reads. Fix: use a separate collection for items that grow unbounded and reference by document ID.
Hotspot documents: A document that receives 10x more writes than other documents becomes a write bottleneck in sharded clusters. Fix: add a shard key that distributes write load, such as a user or tenant identifier, rather than using a sequential _id.
Fetching too much data by skipping projections: Using find() without a projection returns entire documents, including large embedded arrays you do not need. Fix: specify which fields to return with a projection document, and use the $elemMatch projection operator to return only matching array elements.
Denormalizing mutable data: Embedding fields that change frequently (like a user’s last login timestamp) inside a document requires updating every document that embeds that field. Fix: reference mutable data by ID rather than embedding it.
Ignoring schema validation: Unvalidated schemas allow documents with missing or malformed fields to be inserted, causing application errors. Fix: use MongoDB’s JSON Schema validation or CouchDB’s design document schema constraints.
Quick Recap Checklist
- Document databases suit hierarchical data, variable schemas, and read-complete-document access patterns
- Embed documents when data is always fetched together; reference when data is shared across documents
- Choose a shard key that distributes write load evenly and avoids hotspot documents
- Use projections to return only the fields your application needs
- Implement schema validation to catch malformed documents early
- Monitor document size (MongoDB enforces a hard 16MB limit; CouchDB’s max_document_size is configurable, defaulting to 8MB in recent versions) and watch for unbounded array growth
Conclusion
Document databases work well when your data is hierarchical, your schema varies between entities, or your access patterns favor reading complete units of related data. The embedding vs referencing decision is the most critical modeling choice, and it requires understanding your query patterns upfront.
For content management, user profiles, product catalogs, and applications with complex but self-contained data units, document databases often outperform relational alternatives. For data with strict referential integrity requirements, complex multi-document transactions, or stable tabular structures, relational databases remain the better choice.
Model honestly: understand your data relationships, access patterns, and consistency requirements before choosing a database model.
For more on NoSQL varieties, see the NoSQL Databases overview. To learn about schema design principles, see Schema Design. For comparison with key-value stores, see Key-Value Stores.