Incremental Loads: Processing Only What Changed

Incremental loads reduce pipeline cost and latency. Learn watermark strategies, upsert patterns, and how to handle late-arriving data.



A full load of your orders table takes four hours. Your pipeline runs daily. Tomorrow the full load takes five hours. In six months it takes eight. The table grows, the load gets longer, and eventually your daily pipeline is running near-continuously with no time left for anything else.

Incremental loads solve this. Instead of reloading everything, load only what changed since the last run. A table with 500 million rows becomes a daily delta of 50,000 rows. The pipeline runs in minutes, not hours.

The Watermark Concept

An incremental load needs a way to identify what changed since the last run. A watermark is a marker that tracks the position of the last successful load.

-- Simple incremental load using a watermark column
-- Watermark stored in a control table
DECLARE @last_watermark DATETIME, @current_watermark DATETIME;
SELECT @last_watermark = last_extracted_at FROM pipeline_control WHERE pipeline_name = 'orders';
SET @current_watermark = GETDATE();  -- capture the upper bound once

-- Extract only records changed since the last watermark
SELECT * FROM orders
WHERE updated_at > @last_watermark
  AND updated_at <= @current_watermark;

-- Advance the watermark to the same upper bound after a successful load
-- (calling GETDATE() again here would skip rows changed during the extract)
UPDATE pipeline_control
SET last_extracted_at = @current_watermark
WHERE pipeline_name = 'orders';

The watermark column must be reliable. It must update whenever the row changes, and it must be indexed for efficient range queries. updated_at or modified_at columns serve this purpose when the source system maintains them correctly.
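The control-table flow above can be sketched end to end in Python. This is a minimal illustration using an in-memory SQLite database as a stand-in for both the source and the control table; the table and column names (`orders`, `pipeline_control`, `updated_at`, `last_extracted_at`) follow the article, and everything else is illustrative.

```python
import sqlite3

# Stand-in source table and watermark control table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, updated_at TEXT);
    CREATE TABLE pipeline_control (pipeline_name TEXT PRIMARY KEY, last_extracted_at TEXT);
    INSERT INTO orders VALUES (1, '2024-03-01'), (2, '2024-03-05'), (3, '2024-03-09');
    INSERT INTO pipeline_control VALUES ('orders', '2024-03-04');
""")

def extract_delta(conn, pipeline, high_water):
    # Read the last watermark, pull only rows changed since then,
    # and advance the watermark to the captured upper bound.
    (low_water,) = conn.execute(
        "SELECT last_extracted_at FROM pipeline_control WHERE pipeline_name = ?",
        (pipeline,)).fetchone()
    rows = conn.execute(
        "SELECT order_id FROM orders "
        "WHERE updated_at > ? AND updated_at <= ? ORDER BY order_id",
        (low_water, high_water)).fetchall()
    conn.execute(
        "UPDATE pipeline_control SET last_extracted_at = ? WHERE pipeline_name = ?",
        (high_water, pipeline))
    return [r[0] for r in rows]

delta = extract_delta(conn, "orders", "2024-03-10")
print(delta)  # [2, 3] -- only the orders updated after the 2024-03-04 watermark
```

Running `extract_delta` again with the same upper bound returns an empty delta, because the watermark has already advanced.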

Watermark Strategies

Column-based watermarks

The simplest approach. Choose a timestamp or integer column that reflects when a row changed. This requires the source system to maintain the column reliably.

Best for: Tables with a clear updated_at column maintained by triggers or application code.

Risk: Some systems do not update the timestamp on all changes (for example, updates that only touch specific columns might skip the trigger).

ID-based watermarks

Track the maximum ID from the last run. Any row with an ID greater than the watermark is new or changed.

DECLARE @last_max_id BIGINT;
SELECT @last_max_id = last_max_id FROM pipeline_control WHERE pipeline_name = 'orders';

SELECT * FROM orders WHERE order_id > @last_max_id ORDER BY order_id;

Best for: Tables where IDs are sequential and monotonically increasing.

Risk: Updated rows with IDs below the watermark are missed. ID-based watermarks only catch new rows, not modified rows.
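A toy illustration (not from the article's code) makes the failure mode concrete: an update to a row below the watermark is invisible to the ID filter.

```python
# Rows as the source table sees them; IDs are sequential.
rows = [
    {"order_id": 1, "status": "shipped"},  # existing row
    {"order_id": 2, "status": "placed"},
    {"order_id": 3, "status": "placed"},   # new row since the last run
]
last_max_id = 2  # ID watermark from the previous run

# An update lands on a row below the watermark:
rows[0]["status"] = "delivered"

# The ID filter catches only the new row; the update is silently missed.
delta = [r for r in rows if r["order_id"] > last_max_id]
print([r["order_id"] for r in delta])  # [3]
```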

Hybrid watermarks

Combine column-based and ID-based approaches. Use the timestamp for ordering and the ID for completeness checking.

-- Hybrid: timestamp for ordering, ID for completeness checking
SELECT * FROM orders
WHERE updated_at > @last_watermark
   OR order_id > @last_max_id
ORDER BY updated_at, order_id;

Log-based watermarks (CDC)

Read the database transaction log. The log contains every change with a unique LSN (Log Sequence Number). Track the last processed LSN as your watermark. This catches all changes, including updates that do not touch the timestamp column.
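The LSN mechanics can be sketched as follows. The event shapes and field names here are illustrative stand-ins for a real change stream, not an actual CDC API.

```python
# Each change-log event carries a monotonically increasing LSN.
change_log = [
    {"lsn": 101, "op": "insert", "order_id": 1},
    {"lsn": 102, "op": "update", "order_id": 1},  # caught even if updated_at never changed
    {"lsn": 103, "op": "delete", "order_id": 1},  # deletes appear too
]

last_lsn = 101  # watermark: the last LSN applied by the previous run

# Consume only events past the watermark, advancing it per applied event.
new_events = [e for e in change_log if e["lsn"] > last_lsn]
for event in new_events:
    last_lsn = event["lsn"]

print(len(new_events), last_lsn)  # 2 103
```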

For more on this approach, see Change Data Capture.

Upsert: Inserting and Updating

Once you have extracted the delta, you need to write it to the destination. An upsert (or merge) inserts new rows and updates existing rows.

-- PostgreSQL upsert (INSERT ON CONFLICT)
INSERT INTO dim_customers (customer_id, email, name, updated_at)
SELECT customer_id, email, name, updated_at
FROM stg_customers
ON CONFLICT (customer_id) DO UPDATE SET
    email = EXCLUDED.email,
    name = EXCLUDED.name,
    updated_at = EXCLUDED.updated_at;

-- Snowflake merge
MERGE INTO dim_customers target
USING stg_customers source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
    email = source.email,
    name = source.name,
    updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT
    (customer_id, email, name, updated_at)
    VALUES (source.customer_id, source.email, source.name, source.updated_at);

-- BigQuery merge
MERGE INTO dim_customers target
USING stg_customers source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
    email = source.email,
    name = source.name,
    updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT
    (customer_id, email, name, updated_at)
    VALUES (source.customer_id, source.email, source.name, source.updated_at);

Handling Late-Arriving Data

A record arrives late. An order placed on March 25th gets a shipping update on March 27th, two days after the pipeline already processed March 25th's data. What happens?

Late-arriving data requires one of these strategies:

Accept the latency: The late update arrives after the daily aggregation for March 25th is complete. The March 25th daily total is wrong until someone notices and corrects it.

Reprocess the period: When late data arrives, backfill the affected period. The daily aggregation for March 25th is recomputed with the late update included. This requires the pipeline to support reprocessing historical windows.

Accumulating snapshot: Use a fact table that accepts updates to historical periods. The snapshot for March 25th gets updated when late data arrives. Final numbers are only correct after the late-arrival window closes.
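The "reprocess the period" strategy can be sketched as a targeted backfill: when a late event arrives, recompute only the aggregate for the day it belongs to. The data and dates here are illustrative.

```python
# Events already processed for March 25th.
events = [
    {"order_date": "2024-03-25", "amount": 100},
    {"order_date": "2024-03-25", "amount": 50},
]

def daily_total(events, day):
    # Recomputable aggregate for a single day.
    return sum(e["amount"] for e in events if e["order_date"] == day)

totals = {"2024-03-25": daily_total(events, "2024-03-25")}  # 150

# A late March 25th event arrives on March 27th:
late = {"order_date": "2024-03-25", "amount": 25}
events.append(late)

# Backfill only the affected day, not the whole history.
totals[late["order_date"]] = daily_total(events, late["order_date"])
print(totals["2024-03-25"])  # 175
```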

For more on handling late data in streaming contexts, see Apache Kafka and Exactly-Once Semantics.

Idempotency

Incremental pipelines run multiple times against the same source data. The first run succeeds. The second run encounters the same records (the watermark has not advanced yet). The pipeline must handle this gracefully without creating duplicates.

Idempotent writes ensure that running the pipeline twice produces the same result as running it once. The upsert pattern handles this naturally: if the pipeline runs twice with the same records, the second run updates rows that already exist with the same values.

If your pipeline uses inserts rather than upserts, deduplication on a unique key is essential.
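The idempotency of an upsert can be demonstrated directly. This sketch uses SQLite, whose INSERT ... ON CONFLICT mirrors the PostgreSQL syntax shown earlier; table and column names follow the article, and the data is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customers (customer_id INTEGER PRIMARY KEY, email TEXT)")

def upsert(conn, customer_id, email):
    # Insert a new row, or update the existing row on key conflict.
    conn.execute("""
        INSERT INTO dim_customers (customer_id, email) VALUES (?, ?)
        ON CONFLICT (customer_id) DO UPDATE SET email = excluded.email
    """, (customer_id, email))

# Apply the same "delta" twice, as a retried run would.
for _ in range(2):
    upsert(conn, 1, "a@example.com")
    upsert(conn, 2, "b@example.com")

rows = conn.execute(
    "SELECT customer_id, email FROM dim_customers ORDER BY customer_id").fetchall()
print(rows)  # [(1, 'a@example.com'), (2, 'b@example.com')] -- no duplicates
```

Two runs produce exactly the same destination state as one run, which is the definition of an idempotent write.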

Change Data Capture as Incremental Load

CDC is an incremental load strategy where the change stream itself is the source. Rather than querying a table for changes, you consume a stream of changes. The watermark is the offset in the change stream.

flowchart LR
    DB[(Database)] -->|CDC| Kafka[Kafka]
    Kafka -->|offset=5000| Pipeline[Pipeline]
    Pipeline -->|processed| Warehouse[(Data Warehouse)]
    Pipeline -->|commit offset 5000| Kafka

CDC captures inserts, updates, and deletes. Your pipeline processes the change events and applies them to the destination. This is fundamentally different from polling-based incremental loads that only see the current state.

CDC eliminates the need for watermark columns in the source system. The change stream is the source of truth for what changed. See Change Data Capture for more.

Common Pitfalls

Non-atomic watermark updates

The pipeline extracts data, then updates the watermark, then loads. If the load fails, the watermark has already advanced. The next run skips the records that failed to load. Use transactions or two-phase commit to ensure the watermark advances only after successful load.
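A minimal sketch of the transactional fix, assuming the destination and the control table live in the same database (here SQLite, with illustrative schemas): the load and the watermark update commit together, so a failed load leaves the watermark untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("CREATE TABLE warehouse (order_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE pipeline_control (pipeline_name TEXT PRIMARY KEY, last_extracted_at TEXT)")
conn.execute("INSERT INTO pipeline_control VALUES ('orders', '2024-03-04')")

def load_and_advance(conn, rows, new_watermark, fail=False):
    try:
        conn.execute("BEGIN")
        conn.executemany("INSERT INTO warehouse VALUES (?)", rows)
        if fail:
            raise RuntimeError("simulated load failure")
        conn.execute(
            "UPDATE pipeline_control SET last_extracted_at = ? WHERE pipeline_name = 'orders'",
            (new_watermark,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # neither the rows nor the watermark persist

load_and_advance(conn, [(1,), (2,)], "2024-03-10", fail=True)
(watermark,) = conn.execute(
    "SELECT last_extracted_at FROM pipeline_control WHERE pipeline_name = 'orders'").fetchone()
print(watermark)  # '2024-03-04' -- unchanged, so the next run retries the same window
```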

Watermark column drift

The updated_at column does not always reflect the logical time of the change. A bulk UPDATE statement that touches a million rows might update their updated_at to the current time even though the business event happened earlier. This causes late-arriving data issues.

Missing deletes

If your source system marks records as deleted (soft delete) but the incremental query does not filter them out, deleted records persist in the warehouse. CDC change streams include delete operations. Polling-based incremental loads require explicit handling of soft-delete flags.

Tombstone records in Kafka

When consuming from Kafka with CDC, deleted records appear as tombstone records (value is null). Consumer logic must explicitly handle deletes and not skip these records or treat them as errors.
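Tombstone handling reduces to treating a null value as a delete when applying the stream to keyed state. The message shapes below are illustrative, not a real Kafka client API.

```python
# (key, value) pairs as a compacted CDC topic would deliver them.
messages = [
    ("cust-1", {"email": "a@example.com"}),
    ("cust-2", {"email": "b@example.com"}),
    ("cust-1", None),  # tombstone: cust-1 was deleted upstream
]

state = {}
for key, value in messages:
    if value is None:
        state.pop(key, None)  # apply the delete instead of skipping the record
    else:
        state[key] = value

print(sorted(state))  # ['cust-2']
```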

When to Use Incremental Loads

Use incremental loads when:

  • Source tables exceed tens of millions of rows where full reload is impractical
  • Pipeline latency requirements are measured in minutes rather than hours
  • Source systems can reliably identify changed records (CDC, watermark columns, or log-based)
  • Downstream consumers can handle slightly stale data (most analytical workloads)

Do not use incremental loads when:

  • Full table size is manageable (under a few million rows) and simplicity matters more than efficiency
  • Source cannot identify changes (no watermark column, no CDC, no change log)
  • Regulatory requirements mandate complete historical accuracy on every load
  • Late-arriving data would create consistency issues that are harder to fix than full reloads

Trade-off Table: Watermark Strategies

Aspect                  | Column-based                | ID-based             | Hybrid         | Log-based (CDC)
Setup complexity        | Low                         | Low                  | Medium         | High
Catches all changes     | Sometimes                   | No                   | Yes            | Yes
Handles updated rows    | Sometimes                   | No                   | Yes            | Yes
Handles deletes         | No                          | No                   | No             | Yes
Source system load      | Low (indexed query)         | Low (indexed query)  | Low            | Minimal (WAL only)
Late-arriving data      | Supported if re-watermarked | Not supported        | Supported      | Supported
Requires source changes | Timestamp column            | Sequential ID column | Both           | CDC infrastructure
Best for                | Simple OLTP sources         | Append-only tables   | Most scenarios | Kafka/DB with WAL

Column-based watermarks are the simplest and work when the source reliably maintains timestamp columns. ID-based watermarks only catch new rows, not updates — use them only for append-only tables. Hybrid combines both for completeness. Log-based CDC is the most robust but requires the most infrastructure.

Capacity Estimation for Watermark Storage

Watermark storage is modest but grows with pipeline count.

Control table size: Each pipeline stores 1-2 watermark values (timestamp, max ID, LSN offset). A pipeline control table with 1,000 pipelines stores roughly 1,000 rows. At 100 bytes per row, that is 100 KB total.

Watermark precision vs storage: A timestamp watermark typically occupies 8 bytes; higher-precision types can take slightly more depending on the database. ID watermarks depend on the ID type (8 bytes for BIGINT, 16 for UUID). LSN offsets in PostgreSQL WAL are 8 bytes.

Practical sizing: A control table with 500 pipelines, each tracking 2 watermark values, uses under 1 MB. The storage cost is negligible. The complexity is in ensuring watermark updates are transactional with the load itself.

The real capacity concern is not storage but the range query that filters on the watermark. Ensure the watermark column on the source table is indexed. A full scan of a 500M-row table to find the rows past the watermark defeats the purpose.

Observability Checklist for Incremental Loads

Track these metrics on every incremental pipeline:

Watermark freshness: When was the watermark last advanced? A watermark that has not moved in 48 hours means the pipeline is not processing new data.

Delta size per run: How many records were extracted per run? A sudden drop to zero may mean the source has no new changes. A sudden spike may mean a backlog is clearing.

Upsert success rate: What percentage of upserted records matched vs inserted? Drifting match rates indicate changing source data patterns.

Late-arriving data rate: What percentage of extracted records have a watermark value older than expected? High late-arrival rates indicate source system issues or clock skew.

Consumer lag per topic/partition: For CDC-based incremental loads, track Kafka consumer lag. Rising lag means the pipeline cannot keep up with source changes.

-- Example: Watermark freshness check
SELECT
    pipeline_name,
    MAX(last_watermark_at) AS last_watermark,
    NOW() - MAX(last_watermark_at) AS watermark_age,
    SUM(CASE WHEN status = 'success' THEN rows_processed ELSE 0 END) AS rows_last_run
FROM pipeline_control
GROUP BY pipeline_name
HAVING NOW() - MAX(last_watermark_at) > INTERVAL '24 hours';

Alert on: watermark older than 24 hours without explanation, upsert match rate changing by more than 10% week-over-week, consumer lag exceeding 5 minutes.

Quick Recap

  • Incremental loads avoid full table scans by tracking what changed since the last run.
  • Watermarks can be timestamps, IDs, or log positions (LSNs or Kafka offsets). Each has different trade-offs.
  • Upsert (MERGE/INSERT ON CONFLICT) handles both inserts and updates idempotently.
  • Late-arriving data requires a strategy: accept staleness, reprocess the window, or use accumulating snapshots.
  • CDC is the most robust incremental strategy when you have the infrastructure for it.

Conclusion

Incremental loads are essential for production data pipelines at scale. Full loads become untenable as tables grow. The alternative is tracking what changed and processing only the delta.

Watermarks are the mechanism for tracking progress. They can be timestamps, IDs, or log positions. Upserts ensure idempotent writes. Late-arriving data requires a strategy. CDC is the most robust incremental strategy but requires change data capture infrastructure.

For more on pipeline patterns, see Extract-Transform-Load, Backfills, and Pipeline Orchestration.

