Incremental Loads: Processing Only What Changed

Incremental loads reduce pipeline cost and latency. Learn watermark strategies, upsert patterns, and how to handle late-arriving data.



A full load of your orders table takes four hours. Your pipeline runs daily. Tomorrow the full load takes five hours. In six months it takes eight. The table grows, the load gets longer, and eventually your daily pipeline is running near-continuously with no time left for anything else.

Incremental loads solve this. Instead of reloading everything, load only what changed since the last run. A table with 500 million rows becomes a daily delta of 50,000 rows. The pipeline runs in minutes, not hours.

The Watermark Concept

An incremental load needs a way to identify what changed since the last run. A watermark is a marker that tracks the position of the last successful load.

-- Simple incremental load using a watermark column
-- Watermark stored in a control table
DECLARE @last_watermark DATETIME, @current_watermark DATETIME;
SELECT @last_watermark = last_extracted_at FROM pipeline_control WHERE pipeline_name = 'orders';
SET @current_watermark = GETDATE();  -- capture the upper bound once

-- Extract only records changed since the last watermark
SELECT * FROM orders
WHERE updated_at > @last_watermark
  AND updated_at <= @current_watermark;

-- Advance the watermark to the same upper bound after a successful load
-- (calling GETDATE() again here would skip rows changed during the extract)
UPDATE pipeline_control
SET last_extracted_at = @current_watermark
WHERE pipeline_name = 'orders';

The watermark column must be reliable. It must update whenever the row changes, and it must be indexed for efficient range queries. updated_at or modified_at columns serve this purpose when the source system maintains them correctly.
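The control-table flow above can be sketched end to end in Python. This is a minimal illustration using an in-memory SQLite database as a stand-in for both the source and the control table; the table and column names (`orders`, `pipeline_control`, `updated_at`, `last_extracted_at`) follow the article, and everything else is illustrative.

```python
import sqlite3

# Stand-in source table and watermark control table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, updated_at TEXT);
    CREATE TABLE pipeline_control (pipeline_name TEXT PRIMARY KEY, last_extracted_at TEXT);
    INSERT INTO orders VALUES (1, '2024-03-01'), (2, '2024-03-05'), (3, '2024-03-09');
    INSERT INTO pipeline_control VALUES ('orders', '2024-03-04');
""")

def extract_delta(conn, pipeline, high_water):
    # Read the last watermark, pull only rows changed since then,
    # and advance the watermark to the captured upper bound.
    (low_water,) = conn.execute(
        "SELECT last_extracted_at FROM pipeline_control WHERE pipeline_name = ?",
        (pipeline,)).fetchone()
    rows = conn.execute(
        "SELECT order_id FROM orders "
        "WHERE updated_at > ? AND updated_at <= ? ORDER BY order_id",
        (low_water, high_water)).fetchall()
    conn.execute(
        "UPDATE pipeline_control SET last_extracted_at = ? WHERE pipeline_name = ?",
        (high_water, pipeline))
    return [r[0] for r in rows]

delta = extract_delta(conn, "orders", "2024-03-10")
print(delta)  # [2, 3] -- only the orders updated after the 2024-03-04 watermark
```

Running `extract_delta` again with the same upper bound returns an empty delta, because the watermark has already advanced.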

Watermark Strategies

Column-based watermarks

The simplest approach. Choose a timestamp or integer column that reflects when a row changed. This requires the source system to maintain the column reliably.

Best for: Tables with a clear updated_at column maintained by triggers or application code.

Risk: Some systems do not update the timestamp on all changes (for example, updates that only touch specific columns might skip the trigger).

ID-based watermarks

Track the maximum ID from the last run. Any row with an ID greater than the watermark is new or changed.

DECLARE @last_max_id BIGINT;
SELECT @last_max_id = last_max_id FROM pipeline_control WHERE pipeline_name = 'orders';

SELECT * FROM orders WHERE order_id > @last_max_id ORDER BY order_id;

Best for: Tables where IDs are sequential and monotonically increasing.

Risk: Updated rows with IDs below the watermark are missed. ID-based watermarks only catch new rows, not modified rows.
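A toy illustration (not from the article's code) makes the failure mode concrete: an update to a row below the watermark is invisible to the ID filter.

```python
# Rows as the source table sees them; IDs are sequential.
rows = [
    {"order_id": 1, "status": "shipped"},  # existing row
    {"order_id": 2, "status": "placed"},
    {"order_id": 3, "status": "placed"},   # new row since the last run
]
last_max_id = 2  # ID watermark from the previous run

# An update lands on a row below the watermark:
rows[0]["status"] = "delivered"

# The ID filter catches only the new row; the update is silently missed.
delta = [r for r in rows if r["order_id"] > last_max_id]
print([r["order_id"] for r in delta])  # [3]
```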

Hybrid watermarks

Combine column-based and ID-based approaches. Use the timestamp for ordering and the ID for completeness checking.

-- Hybrid: timestamp for ordering, ID for completeness checking
SELECT * FROM orders
WHERE updated_at > @last_watermark
   OR order_id > @last_max_id
ORDER BY updated_at, order_id;

Log-based watermarks (CDC)

Read the database transaction log. The log contains every change with a unique LSN (Log Sequence Number). Track the last processed LSN as your watermark. This catches all changes, including updates that do not touch the timestamp column.
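The LSN mechanics can be sketched as follows. The event shapes and field names here are illustrative stand-ins for a real change stream, not an actual CDC API.

```python
# Each change-log event carries a monotonically increasing LSN.
change_log = [
    {"lsn": 101, "op": "insert", "order_id": 1},
    {"lsn": 102, "op": "update", "order_id": 1},  # caught even if updated_at never changed
    {"lsn": 103, "op": "delete", "order_id": 1},  # deletes appear too
]

last_lsn = 101  # watermark: the last LSN applied by the previous run

# Consume only events past the watermark, advancing it per applied event.
new_events = [e for e in change_log if e["lsn"] > last_lsn]
for event in new_events:
    last_lsn = event["lsn"]

print(len(new_events), last_lsn)  # 2 103
```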

For more on this approach, see Change Data Capture.

Upsert: Inserting and Updating

Once you have extracted the delta, you need to write it to the destination. An upsert (or merge) inserts new rows and updates existing rows.

-- PostgreSQL upsert (INSERT ON CONFLICT)
INSERT INTO dim_customers (customer_id, email, name, updated_at)
SELECT customer_id, email, name, updated_at
FROM stg_customers
ON CONFLICT (customer_id) DO UPDATE SET
    email = EXCLUDED.email,
    name = EXCLUDED.name,
    updated_at = EXCLUDED.updated_at;

-- Snowflake merge
MERGE INTO dim_customers target
USING stg_customers source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
    email = source.email,
    name = source.name,
    updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT
    (customer_id, email, name, updated_at)
    VALUES (source.customer_id, source.email, source.name, source.updated_at);

-- BigQuery merge
MERGE INTO dim_customers target
USING stg_customers source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
    email = source.email,
    name = source.name,
    updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT
    (customer_id, email, name, updated_at)
    VALUES (source.customer_id, source.email, source.name, source.updated_at);

Handling Late-Arriving Data

A record arrives late. An order placed on March 25th gets a shipping update on March 27th, two days after the pipeline already processed March 25th's data. What happens?

Late-arriving data requires one of these strategies:

Accept the latency: The late update arrives after the daily aggregation for March 25th is complete. The March 25th daily total is wrong until someone notices and corrects it.

Reprocess the period: When late data arrives, backfill the affected period. The daily aggregation for March 25th is recomputed with the late update included. This requires the pipeline to support reprocessing historical windows.

Accumulating snapshot: Use a fact table that accepts updates to historical periods. The snapshot for March 25th gets updated when late data arrives. Final numbers are only correct after the late-arrival window closes.
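The "reprocess the period" strategy can be sketched as a targeted backfill: when a late event arrives, recompute only the aggregate for the day it belongs to. The data and dates here are illustrative.

```python
# Events already processed for March 25th.
events = [
    {"order_date": "2024-03-25", "amount": 100},
    {"order_date": "2024-03-25", "amount": 50},
]

def daily_total(events, day):
    # Recomputable aggregate for a single day.
    return sum(e["amount"] for e in events if e["order_date"] == day)

totals = {"2024-03-25": daily_total(events, "2024-03-25")}  # 150

# A late March 25th event arrives on March 27th:
late = {"order_date": "2024-03-25", "amount": 25}
events.append(late)

# Backfill only the affected day, not the whole history.
totals[late["order_date"]] = daily_total(events, late["order_date"])
print(totals["2024-03-25"])  # 175
```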

For more on handling late data in streaming contexts, see Apache Kafka and Exactly-Once Semantics.

Idempotency

Incremental pipelines run multiple times against the same source data. The first run succeeds. The second run encounters the same records (the watermark has not advanced yet). The pipeline must handle this gracefully without creating duplicates.

Idempotent writes ensure that running the pipeline twice produces the same result as running it once. The upsert pattern handles this naturally: if the pipeline runs twice with the same records, the second run updates rows that already exist with the same values.

If your pipeline uses inserts rather than upserts, deduplication on a unique key is essential.
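The idempotency of an upsert can be demonstrated directly. This sketch uses SQLite, whose INSERT ... ON CONFLICT mirrors the PostgreSQL syntax shown earlier; table and column names follow the article, and the data is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customers (customer_id INTEGER PRIMARY KEY, email TEXT)")

def upsert(conn, customer_id, email):
    # Insert a new row, or update the existing row on key conflict.
    conn.execute("""
        INSERT INTO dim_customers (customer_id, email) VALUES (?, ?)
        ON CONFLICT (customer_id) DO UPDATE SET email = excluded.email
    """, (customer_id, email))

# Apply the same "delta" twice, as a retried run would.
for _ in range(2):
    upsert(conn, 1, "a@example.com")
    upsert(conn, 2, "b@example.com")

rows = conn.execute(
    "SELECT customer_id, email FROM dim_customers ORDER BY customer_id").fetchall()
print(rows)  # [(1, 'a@example.com'), (2, 'b@example.com')] -- no duplicates
```

Two runs produce exactly the same destination state as one run, which is the definition of an idempotent write.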

Change Data Capture as Incremental Load

CDC is an incremental load strategy where the change stream itself is the source. Rather than querying a table for changes, you consume a stream of changes. The watermark is the offset in the change stream.

flowchart LR
    DB[(Database)] -->|CDC| Kafka[Kafka]
    Kafka -->|offset=5000| Pipeline[Pipeline]
    Pipeline -->|processed| Warehouse[(Data Warehouse)]
    Pipeline -->|commit offset 5000| Kafka

CDC captures inserts, updates, and deletes. Your pipeline processes the change events and applies them to the destination. This is fundamentally different from polling-based incremental loads that only see the current state.

CDC eliminates the need for watermark columns in the source system. The change stream is the source of truth for what changed. See Change Data Capture for more.

Common Pitfalls

Non-atomic watermark updates

The pipeline extracts data, then updates the watermark, then loads. If the load fails, the watermark has already advanced. The next run skips the records that failed to load. Use transactions or two-phase commit to ensure the watermark advances only after successful load.
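A minimal sketch of the transactional fix, assuming the destination and the control table live in the same database (here SQLite, with illustrative schemas): the load and the watermark update commit together, so a failed load leaves the watermark untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("CREATE TABLE warehouse (order_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE pipeline_control (pipeline_name TEXT PRIMARY KEY, last_extracted_at TEXT)")
conn.execute("INSERT INTO pipeline_control VALUES ('orders', '2024-03-04')")

def load_and_advance(conn, rows, new_watermark, fail=False):
    try:
        conn.execute("BEGIN")
        conn.executemany("INSERT INTO warehouse VALUES (?)", rows)
        if fail:
            raise RuntimeError("simulated load failure")
        conn.execute(
            "UPDATE pipeline_control SET last_extracted_at = ? WHERE pipeline_name = 'orders'",
            (new_watermark,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # neither the rows nor the watermark persist

load_and_advance(conn, [(1,), (2,)], "2024-03-10", fail=True)
(watermark,) = conn.execute(
    "SELECT last_extracted_at FROM pipeline_control WHERE pipeline_name = 'orders'").fetchone()
print(watermark)  # '2024-03-04' -- unchanged, so the next run retries the same window
```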

Watermark column drift

The updated_at column does not always reflect the logical time of the change. A bulk UPDATE statement that touches a million rows might update their updated_at to the current time even though the business event happened earlier. This causes late-arriving data issues.

Missing deletes

If your source system marks records as deleted (soft delete) but the incremental query does not filter them out, deleted records persist in the warehouse. CDC change streams include delete operations. Polling-based incremental loads require explicit handling of soft-delete flags.

Tombstone records in Kafka

When consuming from Kafka with CDC, deleted records appear as tombstone records (value is null). Consumer logic must explicitly handle deletes and not skip these records or treat them as errors.
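Tombstone handling reduces to treating a null value as a delete when applying the stream to keyed state. The message shapes below are illustrative, not a real Kafka client API.

```python
# (key, value) pairs as a compacted CDC topic would deliver them.
messages = [
    ("cust-1", {"email": "a@example.com"}),
    ("cust-2", {"email": "b@example.com"}),
    ("cust-1", None),  # tombstone: cust-1 was deleted upstream
]

state = {}
for key, value in messages:
    if value is None:
        state.pop(key, None)  # apply the delete instead of skipping the record
    else:
        state[key] = value

print(sorted(state))  # ['cust-2']
```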

When to Use Incremental Loads

Use incremental loads when:

  • Source tables exceed tens of millions of rows where full reload is impractical
  • Pipeline latency requirements are measured in minutes rather than hours
  • Source systems can reliably identify changed records (CDC, watermark columns, or log-based)
  • Downstream consumers can handle slightly stale data (most analytical workloads)

Do not use incremental loads when:

  • Full table size is manageable (under a few million rows) and simplicity matters more than efficiency
  • Source cannot identify changes (no watermark column, no CDC, no change log)
  • Regulatory requirements mandate complete historical accuracy on every load
  • Late-arriving data would create consistency issues that are harder to fix than full reloads

Trade-off Table: Watermark Strategies

Aspect                  | Column-based                | ID-based             | Hybrid         | Log-based (CDC)
Setup complexity        | Low                         | Low                  | Medium         | High
Catches all changes     | Sometimes                   | No                   | Yes            | Yes
Handles updated rows    | Sometimes                   | No                   | Yes            | Yes
Handles deletes         | No                          | No                   | No             | Yes
Source system load      | Low (indexed query)         | Low (indexed query)  | Low            | Minimal (WAL only)
Late-arriving data      | Supported if re-watermarked | Not supported        | Supported      | Supported
Requires source changes | Timestamp column            | Sequential ID column | Both           | CDC infrastructure
Best for                | Simple OLTP sources         | Append-only tables   | Most scenarios | Kafka/DB with WAL

Column-based watermarks are the simplest and work when the source reliably maintains timestamp columns. ID-based watermarks only catch new rows, not updates — use them only for append-only tables. Hybrid combines both for completeness. Log-based CDC is the most robust but requires the most infrastructure.

Capacity Estimation for Watermark Storage

Watermark storage is modest but grows with pipeline count.

Control table size: Each pipeline stores 1-2 watermark values (timestamp, max ID, LSN offset). A pipeline control table with 1,000 pipelines stores roughly 1,000 rows. At 100 bytes per row, that is 100 KB total.

Watermark precision vs storage: A timestamp watermark typically occupies 8 bytes; higher-precision types can take slightly more depending on the database. ID watermarks depend on the ID type (8 bytes for BIGINT, 16 for UUID). LSN offsets in PostgreSQL WAL are 8 bytes.

Practical sizing: A control table with 500 pipelines, each tracking 2 watermark values, uses under 1 MB. The storage cost is negligible. The complexity is in ensuring watermark updates are transactional with the load itself.

The real capacity concern is not storage but the range query that filters on the watermark. Ensure the watermark column on the source table is indexed. A full scan of a 500M-row table to find the rows past the watermark defeats the purpose.

Observability Checklist for Incremental Loads

Track these metrics on every incremental pipeline:

Watermark freshness: When was the watermark last advanced? A watermark that has not moved in 48 hours means the pipeline is not processing new data.

Delta size per run: How many records were extracted per run? A sudden drop to zero may mean the source has no new changes. A sudden spike may mean a backlog is clearing.

Upsert success rate: What percentage of upserted records matched vs inserted? Drifting match rates indicate changing source data patterns.

Late-arriving data rate: What percentage of extracted records have a watermark value older than expected? High late-arrival rates indicate source system issues or clock skew.

Consumer lag per topic/partition: For CDC-based incremental loads, track Kafka consumer lag. Rising lag means the pipeline cannot keep up with source changes.

-- Example: Watermark freshness check
SELECT
    pipeline_name,
    MAX(last_watermark_at) AS last_watermark,
    NOW() - MAX(last_watermark_at) AS watermark_age,
    SUM(CASE WHEN status = 'success' THEN rows_processed ELSE 0 END) AS rows_last_run
FROM pipeline_control
GROUP BY pipeline_name
HAVING NOW() - MAX(last_watermark_at) > INTERVAL '24 hours';

Alert on: watermark older than 24 hours without explanation, upsert match rate changing by more than 10% week-over-week, consumer lag exceeding 5 minutes.

Quick Recap

  • Incremental loads avoid full table scans by tracking what changed since the last run.
  • Watermarks can be timestamps, IDs, or log positions (LSNs or Kafka offsets). Each has different trade-offs.
  • Upsert (MERGE/INSERT ON CONFLICT) handles both inserts and updates idempotently.
  • Late-arriving data requires a strategy: accept staleness, reprocess the window, or use accumulating snapshots.
  • CDC is the most robust incremental strategy when you have the infrastructure for it.

Conclusion

Incremental loads are essential for production data pipelines at scale. Full loads become untenable as tables grow. The alternative is tracking what changed and processing only the delta.

Watermarks are the mechanism for tracking progress. They can be timestamps, IDs, or log positions. Upserts ensure idempotent writes. Late-arriving data requires a strategy. CDC is the most robust incremental strategy but requires change data capture infrastructure.

For more on pipeline patterns, see Extract-Transform-Load, Backfills, and Pipeline Orchestration.

