Document Databases: MongoDB and CouchDB Data Modeling
Learn MongoDB and CouchDB data modeling, embedding vs referencing, schema validation, and when document stores fit better than relational databases.
Document databases store data as JSON-like documents. Applications that work with hierarchical or tree-shaped data often fit this model well. Unlike relational databases that normalize data across tables, document stores keep related data together in a single document.
This affects everything from how you query to how you scale.
Introduction
Document databases store related data together as JSON-like documents rather than across normalized tables. The difference matters most when your data structure varies between entities, evolves frequently, or maps naturally to hierarchical shapes. The tradeoff is weaker support for joins, costlier multi-document transactions, and fewer built-in integrity guarantees than relational databases, in exchange for schema flexibility and faster queries on document-centric data.
This guide covers document database data modeling (embedding versus referencing), schema validation approaches, how indexing and querying differ from relational databases, and the specific strengths and weaknesses of MongoDB and CouchDB. It also covers when document databases are the right choice versus when a relational or multi-model database serves better.
How Document Databases Work
A document database stores self-contained data units. Each document is a JSON object with fields and values. Collections or buckets group related documents together.
// Example MongoDB document
{
"_id": ObjectId("..."),
"username": "alice",
"profile": {
"firstName": "Alice",
"lastName": "Chen",
"bio": "Software engineer at a fintech startup",
"avatar": "https://example.com/alice.jpg",
"preferences": {
"theme": "dark",
"notifications": true,
"language": "en"
}
},
"orders": [
{
"orderId": "ORD-001",
"items": ["laptop", "mouse"],
"total": 1299.99,
"status": "delivered"
},
{
"orderId": "ORD-002",
"items": ["keyboard"],
"total": 149.99,
"status": "processing"
}
],
"createdAt": ISODate("2024-01-15")
}
The entire user profile lives in one document. No joins required.
flowchart TD
subgraph App["Application Layer"]
C1["Client 1"]
C2["Client 2"]
C3["Client N"]
end
subgraph Router["Query Router / mongos"]
QR["Shard Selector"]
end
subgraph Shards["Sharded Cluster"]
S1["Shard 1<br/>(Chunks: _id 0-100)"]
S2["Shard 2<br/>(Chunks: _id 101-200)"]
S3["Shard 3<br/>(Chunks: _id 201-300)"]
end
subgraph Config["Config Servers"]
CS["Config Server<br/>(Metadata)"]
end
C1 & C2 & C3 --> QR
QR --> S1 & S2 & S3
QR -.->|"Chunk routing table"| CS
MongoDB routes queries through mongos routers that consult config servers for chunk metadata. Each shard holds a subset of documents based on the shard key. The application never directly addresses shards — it talks to the router, which handles the distribution.
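The routing step can be sketched in a few lines of plain JavaScript. This is a simplified illustration, not how mongos is implemented: the chunkTable below is a hypothetical stand-in for the config server metadata, it reuses the inclusive ranges from the diagram, and routeToShard is an illustrative name.

```javascript
// Hypothetical chunk table mirroring the diagram; real metadata lives on the config servers.
const chunkTable = [
  { min: 0, max: 100, shard: "Shard 1" },
  { min: 101, max: 200, shard: "Shard 2" },
  { min: 201, max: 300, shard: "Shard 3" },
];

// Return the shard whose chunk range contains the shard key value, or null if none does.
function routeToShard(shardKeyValue, chunks = chunkTable) {
  const chunk = chunks.find((c) => shardKeyValue >= c.min && shardKeyValue <= c.max);
  return chunk ? chunk.shard : null;
}

console.log(routeToShard(150)); // Shard 2
```

A query that includes the shard key can be sent to one shard this way; a query without it has to be broadcast to every shard.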
Embedding vs Referencing
The biggest decision in document database design is whether to embed or reference.
Embedding Documents
Embedding places related data directly inside the parent document.
// Embedded design - all data in one document
{
"orderId": "ORD-001",
"customer": {
"name": "Bob Martinez",
"email": "bob@example.com",
"address": {
"street": "123 Main St",
"city": "Seattle",
"state": "WA",
"zip": "98101"
}
},
"items": [
{ "sku": "LAPTOP-001", "name": "Gaming Laptop", "qty": 1, "price": 1299.99 },
{ "sku": "MOUSE-002", "name": "Wireless Mouse", "qty": 2, "price": 49.99 }
],
"total": 1399.97
}
Embedding works well when data is almost always accessed together, the embedded data stays small, the relationship is composition not association, and you need atomic updates to the whole unit.
The problem with unbounded arrays:
// Problematic: unbounded array grows indefinitely
{
"userId": "user123",
"activityLog": [
{ "action": "login", "timestamp": "2024-01-01" },
{ "action": "purchase", "timestamp": "2024-01-02" },
// ... grows forever
]
}
MongoDB has a 16MB document size limit. More practically, documents over a few MBs cause performance problems with indexing and network transfer.
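One way to catch runaway documents early is to estimate their size in application code. The sketch below is an assumption-laden approximation: it measures JSON byte length, which only roughly tracks BSON size, and the helper names and the 50% warning fraction are illustrative choices, not MongoDB settings.

```javascript
const MAX_BSON_BYTES = 16 * 1024 * 1024; // MongoDB's per-document limit

// Rough size estimate: JSON byte length only approximates the BSON size.
function approxDocBytes(doc) {
  return Buffer.byteLength(JSON.stringify(doc), "utf8");
}

// Hypothetical guard: flag a document once it passes a fraction of the limit.
function isNearSizeLimit(doc, fraction = 0.5) {
  return approxDocBytes(doc) > MAX_BSON_BYTES * fraction;
}

const user = { userId: "user123", activityLog: [] };
for (let i = 0; i < 1000; i++) {
  user.activityLog.push({ action: "login", timestamp: "2024-01-01" });
}
console.log(approxDocBytes(user), isNearSizeLimit(user));
```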
Referencing Documents
Referencing stores related data in separate documents and links them by ID.
// Referenced design - separate collections
// Collection: customers
{
"_id": ObjectId("..."),
"name": "Carol Johnson",
"email": "carol@example.com"
}
// Collection: orders
{
"_id": ObjectId("..."),
"customerId": ObjectId("..."), // Reference to customers
"items": [...],
"total": 299.99
}
// Collection: products
{
"_id": ObjectId("..."),
"sku": "KEYBOARD-001",
"name": "Mechanical Keyboard",
"price": 149.99
}
Referencing works well when related data is accessed independently, arrays would grow without bound, you need to share data across multiple parents, and the relationship is association not composition.
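With references, the application (or a $lookup stage) has to resolve the IDs. A minimal sketch of an application-level join follows, using in-memory arrays as stand-ins for the two collections; the string IDs and the joinOrdersWithCustomers name are illustrative assumptions.

```javascript
// Illustrative in-memory stand-ins for the customers and orders collections.
const customers = [
  { _id: "c1", name: "Carol Johnson", email: "carol@example.com" },
];
const orders = [
  { _id: "o1", customerId: "c1", total: 299.99 },
  { _id: "o2", customerId: "c1", total: 149.99 },
];

// Resolve each order's customerId reference against a lookup map built once.
function joinOrdersWithCustomers(orderDocs, customerDocs) {
  const byId = new Map(customerDocs.map((c) => [c._id, c]));
  return orderDocs.map((o) => ({ ...o, customer: byId.get(o.customerId) ?? null }));
}

const joined = joinOrdersWithCustomers(orders, customers);
console.log(joined[0].customer.name); // Carol Johnson
```

Against a real database the second collection would be fetched with a single query on the collected IDs rather than one query per order.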
The Hybrid Approach
Most real-world schemas use both patterns.
// Hybrid: embed small, stable data; reference large or shared data
{
"blogPost": {
"title": "Understanding Document Databases",
"slug": "understanding-document-databases",
"author": {
// Embed author summary for display
"id": ObjectId("..."),
"name": "Sam Wilson",
"avatar": "https://example.com/sam.jpg"
},
"tags": ["mongodb", "nosql", "data-modeling"],
"publishedAt": ISODate("2024-03-15"),
"content": "...",
"comments": [
// Small, bounded array - OK to embed
{ "author": "user1", "text": "Great explanation!", "date": "..." }
]
},
// Large, unbounded data stored separately
"commentReplies": [...],
"postAnalytics": {...}
}
Embedding vs Referencing: Trade-off Summary
| Aspect | Embedding | Referencing |
|---|---|---|
| Read performance | Single document fetch | Multiple round trips or $lookup aggregation |
| Write atomicity | Atomic within one document | Updates to parent/child independent |
| Data duplication | Denormalized — same data in multiple docs | Normalized — single source of truth |
| Array growth risk | Unbounded arrays hit 16MB limit | No document size concerns |
| Data sharing | Hard to share embedded data across parents | Same child doc can be referenced by multiple parents |
| Update overhead | Updating shared data requires multiple updates | Updating shared data in one place |
| Query flexibility | Limited to data co-located in one doc | Can query children independently |
| Best for | Composition (order contains order-lines), bounded arrays | Association (products in many orders, authors of many posts) |
Schema Validation
Document databases started as schema-less. Modern implementations offer flexible validation.
MongoDB Schema Validation
// Create collection with validators
db.createCollection("products", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["name", "price", "sku"],
properties: {
name: {
bsonType: "string",
description: "Product name is required",
},
price: {
bsonType: "number",
minimum: 0,
description: "Price must be non-negative",
},
sku: {
bsonType: "string",
pattern: "^[A-Z]{3}-[0-9]{6}$",
description: "SKU must match pattern XXX-000000",
},
tags: {
bsonType: "array",
items: { bsonType: "string" },
},
},
},
},
});
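Before turning a validator like this on, it can help to run the same rules over existing data in application code. The function below is a hand-written approximation of the schema above, not MongoDB's validator; validateProduct and its error messages are illustrative.

```javascript
const SKU_PATTERN = /^[A-Z]{3}-[0-9]{6}$/;

// Return a list of rule violations; an empty list means the document passes these checks.
function validateProduct(doc) {
  const errors = [];
  if (typeof doc.name !== "string") errors.push("name must be a string");
  if (typeof doc.price !== "number" || doc.price < 0) errors.push("price must be a non-negative number");
  if (typeof doc.sku !== "string" || !SKU_PATTERN.test(doc.sku)) errors.push("sku must match XXX-000000");
  const tagsOk = doc.tags === undefined ||
    (Array.isArray(doc.tags) && doc.tags.every((t) => typeof t === "string"));
  if (!tagsOk) errors.push("tags must be an array of strings");
  return errors;
}

console.log(validateProduct({ name: "Keyboard", price: 149.99, sku: "KEY-000001" })); // []
console.log(validateProduct({ name: "Mouse", price: "49.99", sku: "MOUSE-002" }).length); // 2
```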
CouchDB Design Documents
CouchDB uses design documents to define views and validation functions.
// Design document with validation (CouchDB stores functions as strings; validate_doc_update is shown unquoted here for readability)
{
"_id": "_design/ecommerce",
"_rev": "1-abc123",
"validate_doc_update": function(newDoc, oldDoc, userCtx) {
if (!newDoc.name) {
throw({forbidden: "Products must have a name"});
}
if (typeof newDoc.price !== "number" || newDoc.price < 0) {
throw({forbidden: "Price must be a non-negative number"});
}
},
"views": {
"by_price": {
"map": "function(doc) { if (doc.price) emit(doc.price, doc.name); }"
}
}
}
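Because a validation function is ordinary JavaScript, it can be exercised locally before it is uploaded. The sketch below copies the checks from the design document into a plain function; checkDoc and the userCtx value are hypothetical test scaffolding, not part of CouchDB.

```javascript
// The same checks as the design document's validate_doc_update, as a plain function.
function validateDocUpdate(newDoc, oldDoc, userCtx) {
  if (!newDoc.name) {
    throw { forbidden: "Products must have a name" };
  }
  if (typeof newDoc.price !== "number" || newDoc.price < 0) {
    throw { forbidden: "Price must be a non-negative number" };
  }
}

// Hypothetical helper: run the validator and report the rejection reason, if any.
function checkDoc(newDoc) {
  try {
    validateDocUpdate(newDoc, null, { name: "tester", roles: [] });
    return "ok";
  } catch (err) {
    return err.forbidden;
  }
}

console.log(checkDoc({ name: "Keyboard", price: 149.99 })); // ok
console.log(checkDoc({ name: "Keyboard", price: -1 })); // Price must be a non-negative number
```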
When Document Stores Make Sense
Content management systems store articles with variable schemas. One post might have an “author” field, another might have “coAuthors”, another might have “translations”. Relational tables would require complex EAV patterns or frequent schema migrations.
Product catalogs with varied attributes work naturally as documents. A clothing product has size and color; a digital product has download links and license keys; a subscription has billing cycles.
User profiles and preferences vary widely. Some users fill out every field; others leave most blank. Document stores handle this variation without NULL handling nightmares.
Real-time analytics with time-series elements embed event data in documents written once and read frequently.
Applications with complex hierarchical data like organizational charts, file systems, or family trees map naturally to document structures.
When NOT to Use Document Databases
Normalized data requirements demand referential integrity across collections. If you need to enforce that every order references a valid customer, and deleting a customer cascades to all their orders, relational databases handle this better.
Complex multi-document transactions spanning multiple collections are problematic. MongoDB supports multi-document transactions across collections and, since version 4.2, across shards, but the operational complexity is significant. If you need ACID guarantees across multiple documents, consider whether a relational approach or a purpose-built transaction engine is simpler.
Well-structured tabular data with fixed schemas that rarely change performs just as well, often better, in relational databases. If your data model is stable and your access patterns are simple joins, the “flexibility” of document stores becomes a liability.
BI and analytical workloads with complex aggregations across many dimensions work better in columnar stores or specialized analytics databases.
JSON Document Tradeoffs
Document databases embrace JSON, but this has implications.
Advantages
JSON documents are self-describing. You can read any document and understand its structure without consulting an external schema.
Schema evolution becomes simpler. Adding new fields does not require ALTER TABLE statements.
The programming model feels natural for developers. Most web APIs use JSON, so mapping between storage and application is straightforward.
Challenges
Without enforced schemas, your application code must handle data inconsistency. One document might have price: "29.99" as a string; another might have price: 29.99 as a number.
Denormalization means updates can touch multiple documents. If you embed customer name in every order, updating the customer’s name requires updating every order document.
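Query optimization requires understanding your access patterns. As in a relational database, the query planner can only choose among indexes that exist, and because there are few joins to fall back on, both the document structure and the indexes must be designed around the queries you expect to run.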
Query Patterns and Indexing
MongoDB and CouchDB support secondary indexes, but they work differently.
MongoDB Indexes
// Create indexes to support query patterns
db.orders.createIndex({ customerId: 1, createdAt: -1 });
db.orders.createIndex({ status: 1, priority: 1 });
db.products.createIndex({ tags: 1 });
db.products.createIndex({ price: 1 });
// Field order in a compound index matters
// Index on (a, b) supports queries on {a} and {a, b}
// Does NOT support queries on {b} alone
// Text indexes for search
db.articles.createIndex({ title: "text", content: "text" });
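The prefix rule in the comments above can be expressed as a small check. This is a deliberately simplified model: it ignores sort direction, range predicates, and the other factors the real planner weighs, and queryUsesIndexPrefix is an illustrative name.

```javascript
// Simplified check: do the queried fields form a leading prefix of the index keys?
function queryUsesIndexPrefix(indexFields, queryFields) {
  if (queryFields.length === 0 || queryFields.length > indexFields.length) return false;
  const prefix = new Set(indexFields.slice(0, queryFields.length));
  return queryFields.every((f) => prefix.has(f));
}

const index = ["customerId", "createdAt"]; // from createIndex({ customerId: 1, createdAt: -1 })
console.log(queryUsesIndexPrefix(index, ["customerId"])); // true
console.log(queryUsesIndexPrefix(index, ["customerId", "createdAt"])); // true
console.log(queryUsesIndexPrefix(index, ["createdAt"])); // false
```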
CouchDB Views
CouchDB uses MapReduce views defined in design documents.
// View definition (map functions are stored as strings; shown unquoted here for readability)
{
"views": {
"orders_by_customer": {
"map": function(doc) {
if (doc.type === "order") {
emit([doc.customerId, doc.createdAt], {
total: doc.total,
itemCount: doc.items.length
});
}
}
},
"revenue_by_month": {
"map": function(doc) {
if (doc.type === "order" && doc.status === "completed") {
var month = doc.createdAt.substring(0, 7); // YYYY-MM
emit(month, doc.total);
}
},
"reduce": "_sum"
}
}
}
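To see what the revenue_by_month view computes, the map and reduce steps can be simulated in plain JavaScript. This is only a sketch of the idea: in CouchDB emit is a global provided by the view server, whereas here it is passed in as a parameter, and the sample documents are made up for illustration.

```javascript
// Illustrative documents; field names follow the view definition above.
const docs = [
  { type: "order", status: "completed", createdAt: "2024-01-15T10:00:00Z", total: 100 },
  { type: "order", status: "completed", createdAt: "2024-01-20T10:00:00Z", total: 50 },
  { type: "order", status: "completed", createdAt: "2024-02-03T10:00:00Z", total: 75 },
  { type: "order", status: "processing", createdAt: "2024-02-04T10:00:00Z", total: 999 },
];

// Run a map function over every document, collecting what it emits.
function runMap(mapFn, documents) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });
  documents.forEach((doc) => mapFn(doc, emit));
  return rows;
}

// Group rows by key and sum the values, similar to reduce "_sum" queried with group=true.
function sumByKey(rows) {
  const totals = {};
  for (const { key, value } of rows) totals[key] = (totals[key] || 0) + value;
  return totals;
}

const rows = runMap((doc, emit) => {
  if (doc.type === "order" && doc.status === "completed") {
    emit(doc.createdAt.substring(0, 7), doc.total); // key is YYYY-MM
  }
}, docs);
console.log(sumByKey(rows)); // { '2024-01': 150, '2024-02': 75 }
```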
Common Production Failures
Hotspot documents causing shard imbalance: You choose a shard key like userId for an event-logging collection. A single active user generates 10x more events than average. All that user’s documents land on one shard, which becomes the bottleneck while other shards sit idle. A better shard key spreads writes across nodes — something with high cardinality that doesn’t correlate with user activity.
Unbounded array hitting the 16MB document limit: You embed an activityLog array in user documents without limits. A power user accumulates 5 years of activity, the document reaches 16MB, and MongoDB rejects any update that would grow it further. The options: archive old activities to separate documents, move to a separate activities collection with referencing, or cap the array and rotate oldest entries out.
Embedding at the wrong granularity causing update storms: You embed individual orderLine items inside an order document. When a product name changes, you update every order that contains that product — one product rename, thousands of document writes. If reference data changes frequently, referencing avoids this cascade.
Missing index on referenced fields: You use customerId references across orders but skip the index. Queries like “find all orders for customer X” do a full collection scan. As soon as you introduce referencing patterns, index the reference field, for example with db.orders.createIndex({ customerId: 1 }).
Schema validation blocking legacy documents: You add JSON schema validation to an existing collection with documents that don’t conform. Updates to those documents start failing and your application starts returning 500s. Validate your existing dataset before enabling strict validation, or set validationLevel to "moderate" so that existing non-conforming documents are not checked on update, optionally with validationAction "warn" while you clean up.
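The "cap the array" option from the unbounded-array failure above can be sketched as an update document. The $push operator with $each and a negative $slice is standard MongoDB update syntax; the buildCappedPush helper, the users collection, and the limit of 500 are illustrative assumptions.

```javascript
// Build an update document that appends one entry and keeps only the newest maxEntries.
function buildCappedPush(field, entry, maxEntries) {
  return {
    $push: {
      [field]: { $each: [entry], $slice: -maxEntries },
    },
  };
}

const update = buildCappedPush("activityLog", { action: "login", timestamp: "2024-01-01" }, 500);
console.log(JSON.stringify(update));
// In mongosh this could be applied with, for example:
// db.users.updateOne({ userId: "user123" }, update);
```

Capping trades history for a bounded document size, so it fits cases where only recent entries are read from the parent document.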
Capacity Estimation: Shard Sizing for MongoDB
MongoDB sharding splits data across nodes by shard key. Each shard handles a range of shard key values. Sizing shards correctly prevents hot spots and ensures even distribution.
A rule of thumb for sizing: divide the working set (the frequently accessed data and indexes) by the number of shards, and make sure the result fits in each shard's RAM. If your working set is 50 GB and you have 3 shards, each shard needs roughly 17 GB of memory available for caching to serve reads from memory. MongoDB's guidance is that the working set should fit in RAM on each shard.
For a collection with 500 million documents averaging 1 KB each, your raw storage is 500 GB. With a shard key that spreads writes evenly (e.g., userId for user-centric data), each of 5 shards holds 100 GB. If reads are concentrated on recent data (last 30 days, ~100 GB), you need enough RAM per shard to hold the working set — not the entire dataset.
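The arithmetic in this example can be written out as a small helper. It assumes data and working set spread evenly across shards and uses decimal gigabytes to match the round numbers above; estimateShardSizing is an illustrative name, and real sizing also has to account for indexes, compression, and growth.

```javascript
const GB = 1e9; // decimal gigabytes, matching the rough arithmetic above

// Estimate per-shard storage and per-shard working set, assuming an even spread.
function estimateShardSizing({ docCount, avgDocBytes, shardCount, workingSetGB }) {
  const totalGB = (docCount * avgDocBytes) / GB;
  return {
    totalGB,
    storagePerShardGB: totalGB / shardCount,
    workingSetPerShardGB: workingSetGB / shardCount,
  };
}

const sizing = estimateShardSizing({
  docCount: 500_000_000,
  avgDocBytes: 1000, // ~1 KB per document
  shardCount: 5,
  workingSetGB: 100,
});
console.log(sizing); // { totalGB: 500, storagePerShardGB: 100, workingSetPerShardGB: 20 }
```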
Chunk size matters: MongoDB manages chunks around a configurable size, 64MB by default in older releases and 128MB from MongoDB 6.0. When one shard holds noticeably more data than the others, the balancer migrates ranges to even things out. Unbounded arrays in documents can make chunks balloon and trigger unexpected migrations.
The collStats command reports per-collection storage and, on a sharded cluster, a per-shard breakdown. Run it against the database that holds the collection, not against admin:
db.runCommand({ collStats: "orders", scale: 1 });
Key fields: count (document count), size (total uncompressed bytes), storageSize (physical bytes allocated), totalIndexSize. If avgObjSize (size / count) is growing unexpectedly, documents are accumulating oversized fields or arrays that were never cleaned up.
Observability Hooks: Document Size Metrics and Shard Balance
For MongoDB, these metrics matter most:
Document size growth rate: track avgObjSize and storageSize over time. A collection growing faster than expected usually means unbounded arrays or denormalized fields accumulating. Use db.collection.stats() and graph it in your monitoring tool.
Shard balance: db.adminCommand({ balancerStatus: 1 }) shows whether the balancer is enabled and currently running a round. sh.status() summarizes chunk distribution and recent migrations, and the config.changelog collection records moveChunk events. An unbalanced cluster means some shards serve more traffic than others, with reads and writes skewing toward the overloaded shard.
Chunk distribution: db.mycollection.getShardDistribution() reports data size and document count per shard. db.getSiblingDB('config').chunks.find() lists all chunks and their shard assignments. If one shard owns far more than its even share (for example more than 40% of chunks in a cluster of four or more shards), your shard key likely has a hot spot.
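The 40% check can be automated once the chunk list has been fetched. The sketch below works on plain objects shaped like config.chunks documents (only the shard field is used); the function names, shard names, and sample data are illustrative, and chunk count is only a proxy for data size.

```javascript
// Compute each shard's share of chunks from documents shaped like { shard: "..." }.
function chunkShares(chunks) {
  const counts = {};
  for (const c of chunks) counts[c.shard] = (counts[c.shard] || 0) + 1;
  const shares = {};
  for (const [shard, n] of Object.entries(counts)) shares[shard] = n / chunks.length;
  return shares;
}

// Flag shards holding more than `threshold` of all chunks (0.4 mirrors the 40% guideline).
function overloadedShards(chunks, threshold = 0.4) {
  return Object.entries(chunkShares(chunks))
    .filter(([, share]) => share > threshold)
    .map(([shard]) => shard);
}

// Illustrative sample: 6 of 10 chunks on one shard.
const sample = [
  ...Array(6).fill({ shard: "shard01" }),
  ...Array(2).fill({ shard: "shard02" }),
  ...Array(2).fill({ shard: "shard03" }),
];
console.log(overloadedShards(sample)); // [ 'shard01' ]
```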
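For CouchDB, request counts and latency by method are exposed at /_node/_local/_stats, while per-database document count and disk usage come from the database info endpoint (GET /{db}: doc_count, sizes.file, sizes.active). The gap between sizes.file and sizes.active is space that compaction can reclaim; if that gap approaches the file size, compaction is overdue and read performance degrades.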
Real-World Case Study: Airbnb’s MongoDB Migration
Airbnb migrated their search infrastructure from a monolithic PostgreSQL setup to MongoDB with sharding. Their problem: search needed to handle queries across millions of listings with sub-100ms latency, while writes (booking events, reviews, pricing updates) poured in continuously. PostgreSQL replicas could not keep up with write volume during peak hours.
Their migration was staged over 18 months — they ran dual-write to both systems, validated query results matched, then cut over search reads in batches. The lesson from their experience: document databases need honest upfront work on access patterns. They spent months modeling their document structure before writing a single production query. Embed vs reference decisions made early were expensive to change later.
The specific pattern they used: listing documents embedded amenity lists, calendar availability, and review summaries. High-volume write paths (availability updates) went to a separate collection with referencing, not embedding. This kept listing document size manageable and avoided update storms when availability windows changed.
What they got wrong initially: they embedded booking history in user documents. A frequent traveler accumulated years of booking history, and those documents grew past the working set threshold — reads became disk-bound instead of memory-bound. They fixed it by moving booking history to a separate collection.
Security Checklist
- Enable authentication on every MongoDB/CouchDB node; use SCRAM-SHA-256 or equivalent for MongoDB, and Cookie Auth + TLS for CouchDB
- Implement field-level encryption for highly sensitive document sections using a KMIP-compliant key management service
- Use TLS for all internal replica set and shard communication to prevent man-in-the-middle attacks on replica traffic
- Configure CouchDB’s require_valid_user to enforce authentication on all endpoints; remove default admin account
- Audit document access by enabling MongoDB’s auditLog or using CouchDB’s log rotation for access logs
- Restrict network exposure of document database ports using firewall rules or VPC peering; never expose MongoDB default port 27017 to the internet
Common Pitfalls / Anti-Patterns
Unbounded array growth: Storing unbounded arrays in documents (e.g., a comments array that grows indefinitely) causes document size to exceed the 16MB limit and slows reads. Fix: use a separate collection for items that grow unbounded and reference by document ID.
Hotspot documents: A document or key range that receives 10x more writes than the rest becomes a write bottleneck in sharded clusters. Fix: choose a shard key with high cardinality that distributes write load, such as a hashed identifier or a compound key, rather than a monotonically increasing value like the default _id.
Fetching too much data: Using find() without a projection returns entire documents, including large arrays you do not need. Fix: always specify which fields to return with a projection document, and use $slice or $elemMatch in the projection when you only need part of an array.
Denormalizing mutable data: Embedding fields that change frequently (like a user’s last login timestamp) inside other documents requires updating every document that embeds that field. Fix: reference mutable data by ID rather than embedding it.
Ignoring schema validation: Unvalidated schemas allow documents with missing or malformed fields to be inserted, causing application errors. Fix: use MongoDB’s JSON Schema validation or a validate_doc_update function in a CouchDB design document.
Quick Recap Checklist
- Document databases suit hierarchical data, variable schemas, and read-complete-document access patterns
- Embed documents when data is always fetched together; reference when data is shared across documents
- Choose a shard key that distributes write load evenly and avoids hotspot documents
- Use projections to return only the fields your application needs
- Implement schema validation to catch malformed documents early
- Monitor document size (16MB hard limit in MongoDB; configurable max_document_size in CouchDB, 8MB by default in 3.x) and watch for unbounded array growth
ACID vs BASE in Document Databases
MongoDB supports multi-document transactions (ACID) but with different tradeoffs than relational databases.
flowchart LR
subgraph "MongoDB Transaction Model"
T["Start Transaction"]
R["Read/Write Multiple Collections"]
C["Commit or Abort"]
end
T --> R --> C
What MongoDB transactions guarantee:
- Atomic: all operations in a transaction succeed or none do
- Consistent: transaction moves database from one valid state to another
- Isolated: concurrent transactions do not see each other’s partial changes
- Durable: committed transaction survives server failure
What MongoDB transactions do not guarantee:
- Does not replace proper schema design — transactions mask bad modeling
- Does not fix distributed system problems — cross-shard transactions have higher latency
- Does not eliminate the need to understand write patterns — transactions across multiple collections have higher overhead
When to use transactions: Keeping a logical operation that spans collections atomic (e.g., order + inventory). When data consistency within a business operation matters more than raw write throughput.
When not to use transactions: As a substitute for embedding related data. High-frequency updates where transaction overhead would hurt throughput. Distributed writes across shards — latency is high.
Query Optimizer Behavior
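MongoDB’s query planner chooses execution plans empirically: it trials the candidate plans for a query shape, picks the most efficient one, and caches the winner for later queries of the same shape.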
// Force a specific index (use sparingly)
db.orders.find({ status: "pending" }).hint({ status: 1, createdAt: -1 });
// Explain query plan
db.orders.find({ customerId: "C123" }).explain("executionStats");
// Expected: an IXSCAN stage, not COLLSCAN
// Look for IXSCAN in queryPlanner.winningPlan (often as the inputStage of a FETCH stage)
The query planner automatically chooses indexed plans when indexes exist. Use explain() to verify — do not assume an index is being used.
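Checking a plan by eye gets tedious, so the stage names can be collected programmatically. The sketch below walks a winningPlan tree through its inputStage and inputStages links; the sample object is hand-written to resemble explain() output and omits most real fields, and collectStages is an illustrative helper.

```javascript
// Collect stage names from a winningPlan tree by following inputStage / inputStages.
function collectStages(plan) {
  if (!plan) return [];
  const children = plan.inputStages || (plan.inputStage ? [plan.inputStage] : []);
  return [plan.stage, ...children.flatMap(collectStages)];
}

// Hand-written sample shaped like explain() output; real output has many more fields.
const sampleExplain = {
  queryPlanner: {
    winningPlan: {
      stage: "FETCH",
      inputStage: { stage: "IXSCAN", indexName: "customerId_1_createdAt_-1" },
    },
  },
};

const stages = collectStages(sampleExplain.queryPlanner.winningPlan);
console.log(stages); // [ 'FETCH', 'IXSCAN' ]
console.log(stages.includes("COLLSCAN")); // false
```

A check like this could run in a test suite against the queries an application depends on, failing when a COLLSCAN appears.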
Trade-off Analysis: Document vs Relational
| Scenario | Document Database | Relational Database |
|---|---|---|
| Frequent schema changes | Schema-agnostic, flexible | ALTER TABLE overhead |
| Hierarchical data retrieval | Single document fetch | Multiple JOINs |
| Complex aggregations | Aggregation pipeline | SQL GROUP BY |
| Transactions across multiple entities | Multi-document transactions (higher latency) | ACID transactions (lower overhead) |
| Ad-hoc querying on arbitrary fields | Requires indexing planning | Query planner handles dynamically |
| Stable, well-understood schema | Overkill | Natural fit |
Interview Questions
Use a hybrid: fixed fields for common attributes (name, price, description, category) with a flexible attributes object for product-specific fields. Index on category and price for filtering. For product-specific filters, MongoDB can query across attributes if you create a wildcard index on the attributes object or regular indexes on specific attribute paths. The tradeoff: flexible attributes are harder to aggregate across products (you cannot easily sum megapixels across all cameras without knowing which products are cameras). If cross-product aggregation matters, a relational or column-store is better. If each product type is queried within its own type, document flexibility wins.
Your shard key is causing hot spot writes, with most documents routing to one shard. Common causes: using a monotonically increasing value such as _id as the shard key (new documents all go to the chunk with the highest range), or a shard key with low cardinality (e.g., country with 5 values on a global dataset). Diagnose with db.mycollection.getShardDistribution() or by inspecting config.chunks. The fix is changing the shard key: refineCollectionShardKey (MongoDB 4.4+) adds suffix fields to an existing key to raise its cardinality, reshardCollection (MongoDB 5.0+) rewrites the collection under a new key such as a hashed one, and on older versions you migrate to a new collection with a better key. Each option needs a maintenance plan, because resharding copies the whole collection.
CouchDB is masterless: every node can accept writes and replicas sync via continuous replication. There is no single point of failure and no need for a replica set failover process. MongoDB has a primary in each replica set; failover promotes a secondary and the application driver reroutes. CouchDB's replication model is simpler operationally for multi-region deployments but conflict resolution is your problem: if the same document is edited on two nodes during a partition, CouchDB keeps both revisions, picks a deterministic winner, and leaves the real resolution to application code. MongoDB's write concern lets you tune durability vs latency, and because all writes in a replica set go through one primary, replicas do not produce conflicting versions of a document. CouchDB wins for edge nodes, mobile sync, and situations where you want peer-to-peer replication. MongoDB wins when you want stronger consistency guarantees and more expressive query language.
Embedding colocates related data in a single document, enabling fast single-document reads and atomic writes. Choose embedding when data is almost always accessed together, the embedded data stays bounded in size, and the relationship is composition (order contains order lines). Referencing stores data in separate documents and links via ID. Choose referencing when related data is accessed independently, arrays would grow without bound, you need to share data across multiple parents, or the relationship is association rather than composition. The hybrid approach (embed small, stable data; reference large or shared data) handles most real-world schemas. Avoid unbounded array growth in embedded documents — it hits the 16MB limit and slows reads.
Change streams expose a real-time feed of document changes for a collection, a database, or an entire deployment. The driver subscribes to the stream and receives notifications for insert, update, replace, and delete operations. Use cases: triggering downstream processes when data changes (e.g., search index updates, cache invalidation, event-driven microservices). Change streams are resumable: you can store the resume token and reconnect after a disconnect. Limitations: they only work on replica sets or sharded clusters (not standalone mongod), and events are delivered only after the change is majority-committed, so the stream may lag under heavy write load. To observe changes across several collections, open the stream on the database or the whole deployment rather than on a single collection.
MMAPv1 (deprecated in MongoDB 4.0, removed in 4.2) mapped data files into memory: reads went through the OS page cache and writes used memory-mapped files. Collection-level locking meant only one write could modify a collection at a time. WiredTiger is the default since MongoDB 3.2. It uses its own cache and compression (Snappy by default, with zlib and Zstd available), supports document-level concurrency (multiple writes can modify the same collection simultaneously), and provides better write throughput under load. Choose WiredTiger for production for its compression and write concurrency. MMAPv1 survives only on legacy deployments that have not yet been upgraded.
Two main approaches: database-per-tenant or a shared collection with a tenant discriminator. Database-per-tenant provides the strongest isolation: each tenant gets their own MongoDB database with independent auth, indexes, and backup scope. This scales to hundreds of tenants but operational complexity grows linearly. The shared-collection approach adds a tenantId field to every document and relies on every query filtering by it. This scales to thousands of tenants but requires careful index design (tenantId as the leading index field) and query filters to prevent cross-tenant data leakage. For strict isolation with small tenant counts (dozens), use separate databases. For larger tenant counts with less strict isolation requirements, use shared collections with mandatory tenantId filters in application code.
The 16MB limit is a hard ceiling on any single document including all fields, arrays, and nested objects. This typically manifests when storing large activity logs, chat histories, or versioned content in a single document. Design patterns to handle it: break large documents into separate documents in the same collection (e.g., one document per chat message instead of one document per conversation), use a separate collection for historical data with referencing, or store large binary data in GridFS (which chunks data into 255KB documents). For unbounded arrays, store items as separate documents with a shared parentId reference rather than embedding the entire array.
MongoDB supports multi-document transactions starting in version 4.0, expanding to cross-shard transactions in version 4.2. Transactions use a snapshot isolation level — reads see a consistent view at the transaction start time. Performance trade-offs: transactions add overhead — they require coordination across participants, and snapshot isolation holds locks longer than in single-document atomic operations. For sharded clusters, transactions involve a transaction coordinator that manages commit across shards. Best practice: keep transactions short (sub-second), avoid transactions that span many documents or many shards, and use transactions only when business logic genuinely requires atomicity across multiple collections. For most cases, designing documents to keep related data together eliminates the need for transactions.
$lookup performs a left outer join to another collection within the same database. The joined collection's data is embedded in the result document as an array. Use $lookup when you need to combine data from two collections in a single query and the result set is bounded. Application-level joins (fetch Collection A, then fetch relevant documents from Collection B based on IDs) are better when you need to join across databases (MongoDB does not support cross-database $lookup), when the joined collection is very large (pipeline the join to avoid scanning large collections), or when you need to join on non-indexed fields. $lookup with let and $match pipeline can push filtering into the joined collection for better performance.
For zero-downtime migrations, use the expand-contract pattern: first add the new field (or structure) alongside the old one, deploy application code that writes to both old and new locations, backfill the new field for existing documents, then remove the old field once all data is migrated. For adding an index, MongoDB 4.2 and later build all indexes without blocking writes for the duration of the build and ignore the background option; on older versions, pass background: true. For removing a field, first stop reading and writing it, then remove it in a subsequent release. For large collections, use batched updates with bulk operations and hint to process in chunks, tracking progress in a separate document. Avoid running db.collection.updateMany({}, {$unset: {oldField: 1}}) in one shot on a live large collection: it rewrites every document and can cause replication lag and latency spikes.
A covered query is satisfied entirely by an index without touching the actual documents. MongoDB can answer the query using only the index data (filter and projection both use index fields). Use explain() to check that the winning plan contains IXSCAN with no FETCH stage and that totalDocsExamined is 0; a FETCH or COLLSCAN stage means the query reads documents. To cover a query: create an index that includes all fields used in the filter and projection, ensure the sort fields are in the same index if sorting, and use projection to exclude fields outside the index (especially _id, unless it is part of the index). Covered queries are fastest but cannot serve queries that need fields not in the index; in those cases, the index still reduces the scan scope significantly.
By default, MongoDB reads from the primary in a replica set (read preference primary). With secondary read preference, reads may return stale data if the secondary lags the primary. Under failover (primary becomes unavailable, a secondary is elected), writes acknowledged by the old primary with a write concern weaker than "majority" may not have replicated to the new primary and can be rolled back. Applications can handle this by: writing with w: "majority" for data that must survive failover, using a read preference appropriate for each query type (primary for critical data, secondaryPreferred for reads that tolerate staleness), checking rs.status() for replication lag, and implementing retry logic for errors such as NotWritablePrimary or NotPrimaryOrSecondary. For the strongest read guarantee, use readConcern: "linearizable" on primary reads, which returns only data reflecting all majority-acknowledged writes that completed before the read began.
$match filters whole documents: only documents that satisfy the condition continue to later stages, so it reduces the number of documents that subsequent stages process. $filter is an array operator that filters elements within an array field while preserving the surrounding document structure. Use $match early in the pipeline to reduce document volume (MongoDB can also use indexes for a $match at the start of a pipeline). Use $filter inside a stage such as $project or $addFields when you need to keep the document but only some of its array elements. $filter does not reduce document count; it only modifies array contents within each document.
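The difference is easiest to see side by side. The pipeline below uses the orders array from the example user document; the emulate function is a plain-JavaScript imitation of the two stages, included only to show what each one does to the data.

```javascript
// $match drops whole documents; $filter trims the orders array inside each survivor.
const pipeline = [
  { $match: { "orders.status": "delivered" } },
  {
    $addFields: {
      orders: {
        $filter: {
          input: "$orders",
          as: "order",
          cond: { $eq: ["$$order.status", "delivered"] },
        },
      },
    },
  },
];

// Plain-JavaScript emulation of the same two steps, for illustration only.
function emulate(docs) {
  return docs
    .filter((doc) => doc.orders.some((o) => o.status === "delivered")) // like $match
    .map((doc) => ({
      ...doc,
      orders: doc.orders.filter((o) => o.status === "delivered"), // like $filter
    }));
}

// Usage in mongosh (assumed collection name "users"):
// db.users.aggregate(pipeline);
```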
For posts whose content mixes several block types, use a base post document with fixed fields (title, author, publishedAt, status) and a flexible content array whose elements carry a type discriminator: { type: "video", url: "...", thumbnail: "..." }, { type: "gallery", images: [...] }. This allows polymorphic content blocks while keeping the document structure consistent. Alternative: store content blocks in a separate content_blocks collection with a postId reference, which removes the per-document size ceiling. The embedded approach (content blocks as array elements in the post document) is simpler for queries like "get post with all content" but can hit the 16 MB document size limit if content grows very large. A hybrid works well: embed the first few content blocks for display speed and store additional blocks in a separate collection.
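A discriminator field is only useful if each type's shape is enforced somewhere. The sketch below does this in application code; the block types, their required fields, and the embed limit of three are assumptions chosen for illustration, and the same rules could instead live in a collection validator.

```javascript
// Hypothetical block types and the fields each one requires.
const REQUIRED_FIELDS = {
  text: ["body"],
  video: ["url", "thumbnail"],
  gallery: ["images"],
};

// Returns a list of problems; an empty list means the block is valid.
function validateBlock(block) {
  const required = REQUIRED_FIELDS[block.type];
  if (!required) return [`unknown block type: ${block.type}`];
  return required
    .filter((field) => block[field] === undefined)
    .map((field) => `${block.type} block is missing "${field}"`);
}

// Hybrid layout: embed the first few blocks, reference the rest by postId.
function splitBlocks(postId, blocks, embedLimit = 3) {
  return {
    embedded: blocks.slice(0, embedLimit),
    overflow: blocks.slice(embedLimit).map((block, i) => ({
      postId,
      position: embedLimit + i, // preserves the original ordering
      ...block,
    })),
  };
}
```

The embedded array would go into the post document, and the overflow documents into the separate content_blocks collection described above.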
explain() accepts three verbosity modes. queryPlanner shows the winning plan chosen by the query optimizer without executing it, which is useful for understanding the plan structure and index usage without the cost of execution. executionStats executes the query and returns actual statistics (execution time, documents and index keys examined versus documents returned); it is the most useful mode for diagnosing slow queries because it shows whether the plan was efficient. allPlansExecution also returns execution statistics for the candidate plans the optimizer rejected, which is useful when the optimizer is choosing a suboptimal plan and you want to see the alternatives. Use executionStats for routine query tuning: an efficient plan has totalDocsExamined and totalKeysExamined close to nReturned.
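The efficiency check can be reduced to two ratios. This sketch reads the standard executionStats fields (nReturned, totalDocsExamined, totalKeysExamined, executionTimeMillis); the function name and the collection in the usage comment are illustrative, and what counts as an acceptable ratio is left to the reader.

```javascript
// Summarizes how much work the server did per returned document.
// Ratios near 1 suggest an efficient plan; large ratios suggest a missing or poor index.
function summarizeExecution(explain) {
  const stats = explain.executionStats;
  const returned = Math.max(stats.nReturned, 1); // avoid dividing by zero
  return {
    millis: stats.executionTimeMillis,
    docsPerResult: stats.totalDocsExamined / returned,
    keysPerResult: stats.totalKeysExamined / returned,
  };
}

// Usage in mongosh (hypothetical "orders" collection):
// summarizeExecution(db.orders.find({ status: "delivered" }).explain("executionStats"));
```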
Each aggregation pipeline stage has a 100 MB memory limit by default. When a stage exceeds this limit (e.g., $group accumulating too many distinct keys, $sort on large data sets), MongoDB either fails the operation or, when disk use is allowed, spills intermediate results to temporary files. Disk use is allowed when you pass allowDiskUse: true, and by default starting in MongoDB 6.0. Signs of spilling: explain("executionStats") reports usedDisk: true on the affected stage (with spill counters in newer versions), and serverStatus metrics may also expose spill counters, depending on the server version. To avoid spilling: add $match early to filter data before expensive stages, ensure early pipeline stages reduce document count, and consider limiting the number of distinct groups in $group operations. Treat allowDiskUse: true as a safety net rather than a fix: it prevents the memory-limit error but slows the query. For large aggregations, structure the pipeline so the leading stages can use indexes to reduce data volume.
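A pipeline shaped to keep the expensive stages small might look like the sketch below. The orders collection, its fields (status, createdAt, customerId, total), and the suggested index are assumptions for illustration.

```javascript
// Builds a revenue-per-customer pipeline that filters before it groups.
function buildRevenuePipeline(since) {
  return [
    // Filter first so $group sees as few documents as possible; an index such as
    // { status: 1, createdAt: 1 } could serve this stage.
    { $match: { status: "delivered", createdAt: { $gte: since } } },
    { $group: { _id: "$customerId", revenue: { $sum: "$total" } } },
    { $sort: { revenue: -1 } },
    { $limit: 100 },
  ];
}

// Allows $group and $sort to spill to disk instead of failing at the memory limit.
const aggregateOptions = { allowDiskUse: true };

// Usage in mongosh:
// db.orders.aggregate(buildRevenuePipeline(ISODate("2024-01-01")), aggregateOptions);
```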
Change streams provide near-real-time notifications (typically sub-second latency) with lower resource overhead, because you receive only the changes that occurred. Polling queries the database at intervals regardless of changes, using more resources and introducing delay proportional to the poll interval. Change streams are more efficient for high-frequency updates and deliver changes in a guaranteed order. The trade-offs: change streams require a replica set or sharded cluster, and each stream is scoped to one collection, one database, or the whole deployment. Polling works on any MongoDB deployment and can run arbitrary queries across collections or databases. For moderate change volumes (thousands per second), change streams are more efficient. For infrequent changes or simple sync patterns, polling may be simpler to implement. Change streams also support resumability (store the resume token and pass it back after a restart), while polling has no built-in continuity mechanism, so you must track a timestamp or cursor yourself.
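Resumability comes down to persisting the _id of each change event and handing it back when the stream is reopened. The sketch below separates that logic from the driver so it can be read on its own; loadToken and saveToken in the usage comment are hypothetical persistence helpers, and the in-memory Map stands in for whatever the consumer maintains.

```javascript
// Builds the options for collection.watch(); resumeAfter is set only when a token was saved.
function buildWatchOptions(savedToken) {
  const options = { fullDocument: "updateLookup" };
  if (savedToken) options.resumeAfter = savedToken;
  return options;
}

// Applies one change event to an in-memory view and returns the token to persist.
function applyChange(view, change) {
  const id = String(change.documentKey._id);
  if (change.operationType === "delete") view.delete(id);
  else if (change.fullDocument) view.set(id, change.fullDocument);
  return change._id; // the event's _id is its resume token
}

// Usage with a recent Node.js driver (not run here; loadToken/saveToken are hypothetical):
// const stream = collection.watch([], buildWatchOptions(await loadToken()));
// for await (const change of stream) await saveToken(applyChange(view, change));
```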
When multi-document transactions are not viable (sharded clusters with long operations, or performance constraints), use the saga pattern with compensating transactions: each operation in a business process has a corresponding undo operation. If step 3 fails, steps 1 and 2 are undone by executing their compensating actions in reverse order. Implementation: store the saga state in a separate collection (saga ID, current step, status), execute each step as an idempotent operation, record the completed step, and on failure run the compensating actions for all completed steps. This approach works across sharded clusters and avoids holding a transaction open for the duration of the process. Trade-off: the saga pattern does not provide snapshot isolation, so concurrent operations can observe intermediate states. Design sagas so that intermediate states are safe to expose, or accept the trade-off.
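The control flow of a saga is small enough to sketch directly. In this minimal version the progress log is an in-memory array; in a real system it would be the saga-state document described above, and the step names and actions in the usage comment are hypothetical.

```javascript
// Runs steps in order; on failure, runs the compensations of the completed
// steps in reverse order. Each step is { name, action, compensate }.
async function runSaga(steps, log = []) {
  const completed = [];
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
      log.push({ step: step.name, status: "done" });
    } catch (err) {
      log.push({ step: step.name, status: "failed" });
      for (const done of completed.reverse()) {
        await done.compensate();
        log.push({ step: done.name, status: "compensated" });
      }
      return { ok: false, error: err, log };
    }
  }
  return { ok: true, log };
}

// Usage (hypothetical steps backed by idempotent driver calls):
// await runSaga([
//   { name: "reserveStock", action: reserve, compensate: release },
//   { name: "chargeCard", action: charge, compensate: refund },
// ]);
```

The sketch assumes compensations themselves succeed; a production saga would also persist and retry failed compensations.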
For workloads that combine real-time dashboards with historical analytics, use a dual-model approach: real-time data lives in MongoDB with indexes optimized for current queries, while historical aggregations are pre-computed and stored in separate collections or a separate system (TimescaleDB, ClickHouse). For real-time dashboards: use MongoDB with lean queries (projection to return only needed fields), covered indexes where possible, and result caching (Redis) for frequently accessed aggregations. For historical analytics: materialize aggregations on a schedule (hourly, daily) using background jobs that write to an analytics collection. The analytics collection is append-only and can use lower-cost storage. Queries for historical data hit the pre-aggregated collection, not the raw operational data. This separation keeps operational writes fast and analytics queries efficient, because the two workloads no longer compete for the same indexes.
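The scheduled materialization step can be sketched with a $group followed by $merge. The collection names (orders, orders_hourly) and fields (createdAt, total) are assumptions, and $dateTrunc requires MongoDB 5.0 or later.

```javascript
// Builds a pipeline that rolls up one time window of orders into hourly buckets
// and writes them to an analytics collection.
function buildHourlyRollup(from, to) {
  return [
    { $match: { createdAt: { $gte: from, $lt: to } } },
    {
      $group: {
        _id: { $dateTrunc: { date: "$createdAt", unit: "hour" } },
        orders: { $sum: 1 },
        revenue: { $sum: "$total" },
      },
    },
    // Replacing a bucket that already exists makes a re-run of the same window safe.
    {
      $merge: {
        into: "orders_hourly",
        on: "_id",
        whenMatched: "replace",
        whenNotMatched: "insert",
      },
    },
  ];
}

// Usage from a scheduled job in mongosh:
// db.orders.aggregate(buildHourlyRollup(windowStart, windowEnd));
```

Run the job over whole-hour windows, so that a re-run replaces each bucket with complete totals rather than partial ones.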
Further Reading
Official Documentation
- MongoDB Manual — CRUD operations, aggregation pipeline, and data modeling guides
- CouchDB Documentation — Design documents, replication, and conflict resolution
- MongoDB University — Free courses on MongoDB data modeling and administration
Books and References
- “MongoDB Applied Design Patterns” by Rick Copeland — Document modeling patterns for common use cases
- “NoSQL Distilled” by Pramod Sadalage and Martin Fowler — Overview of NoSQL patterns and tradeoffs
Conclusion
Document databases work well when your data is hierarchical, your schema varies between entities, or your access patterns favor reading complete units of related data. The embedding vs referencing decision is the most critical modeling choice, and it requires understanding your query patterns upfront.
For content management, user profiles, product catalogs, and applications with complex but self-contained data units, document databases often outperform relational alternatives. For data with strict referential integrity requirements, complex multi-document transactions, or stable tabular structures, relational databases remain the better choice.
Model honestly: understand your data relationships, access patterns, and consistency requirements before choosing a database model.
For more on NoSQL varieties, see the NoSQL Databases overview. To learn about schema design principles, see Schema Design. For comparison with key-value stores, see Key-Value Stores.