Backfills: Rebuilding Historical Data at Scale
Backfills reprocess historical data to fix bugs or load new sources. Learn strategies for running backfills safely without breaking production pipelines.
A bug in your transformation logic produced incorrect customer segment assignments for the past 18 months. Your revenue by segment report is wrong. Finance needs corrected numbers before the board meeting next week. You fix the bug. Now you need to reprocess 18 months of historical data.
This is a backfill. Backfills reprocess historical data through a pipeline. They are among the riskiest operations in data engineering: they run at high volume, often under time pressure, and can easily overwhelm source systems, saturate destination warehouses, or leave inconsistent state if not handled carefully.
Why Backfills Happen
Backfills are not optional pipeline operations. They are a fact of life in production data systems.
Bug fixes: A transformation bug produces incorrect results. Fixing the bug is only half the solution. Historical data must be reprocessed with the corrected logic.
Schema changes: A new column is added to a table. Historical records need the new column populated with a derived or default value.
New data sources: A new data source is connected to the pipeline. Historical data from that source needs to be loaded.
Regulation or audit: A compliance requirement demands recalculation of historical metrics under new definitions.
Warehouse migration: Moving from one data warehouse to another requires a full historical reload.
Backfill vs Normal Pipeline
A normal pipeline processes new data incrementally. A backfill reprocesses existing historical data. The difference matters operationally:
| Aspect | Normal Pipeline | Backfill |
|---|---|---|
| Data volume | Delta (new records) | Full historical range |
| Frequency | Continuous or scheduled | One-time or limited |
| Risk | Lower (small delta) | Higher (large volume) |
| Source load | Minimal | Can be significant |
| Destination write | Upsert | Truncate and reload or full upsert |
Backfills require a different approach to resource management, error handling, and validation.
Strategies for Safe Backfills
Chunking by time range
Divide the historical range into chunks and process one chunk at a time. This limits the load on source and destination systems at any point.
```python
from datetime import date, timedelta

def backfill_orders(start_date: date, end_date: date, chunk_days: int = 7):
    """Backfill orders in weekly chunks."""
    current = start_date
    while current < end_date:
        chunk_end = min(current + timedelta(days=chunk_days), end_date)
        print(f"Processing {current} to {chunk_end}")
        try:
            backfill_chunk(current, chunk_end)
            update_backfill_checkpoint(chunk_end)  # checkpoint the completed end, not the start
        except Exception as e:
            print(f"Chunk {current} to {chunk_end} failed: {e}")
            raise  # Stop and investigate
        current = chunk_end
```
Chunk size is a trade-off. Smaller chunks limit resource usage but take longer overall. Larger chunks are faster but put more pressure on systems. Start conservative (smaller chunks) and increase if the systems handle it well.
Read replicas
Backfills should not impact production source systems. If your source is a PostgreSQL primary, a backfill query scanning millions of rows will saturate the connection pool and slow down production writes.
Use a read replica for backfills. The replica mirrors the primary and is designed for read-heavy workloads. Backfill queries run against the replica, production queries against the primary.
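As a minimal routing sketch, the split can be made explicit in configuration. The connection strings here are hypothetical placeholders:

```python
# Hypothetical DSNs; substitute your own connection strings.
PRIMARY_DSN = "postgresql://primary.db.internal:5432/app"
REPLICA_DSN = "postgresql://replica.db.internal:5432/app"

def dsn_for(workload: str) -> str:
    """Route backfill reads to the replica, everything else to the primary."""
    return REPLICA_DSN if workload == "backfill" else PRIMARY_DSN
```

Making the routing a function of the workload type, rather than a hardcoded connection, keeps the backfill code path from accidentally pointing at the primary.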
Throttling and rate limiting
Even against a replica, aggressive backfill queries can cause problems. Implement throttling to limit the rate of reads and writes. Sleep between chunks if necessary.
```python
from ratelimit import limits, sleep_and_retry

@sleep_and_retry
@limits(calls=100, period=60)
def extract_chunk(start_date, end_date):
    """Extract with rate limiting: max 100 calls per minute."""
    return query_source(start_date, end_date)
```
Separate backfill pipeline
Do not run backfills through the same pipeline code path as normal incremental loads. Backfill pipelines have different resource profiles, error handling, and rollback requirements.
Create a separate backfill entry point that can:
- Process arbitrary date ranges
- Override the normal watermark logic
- Write to a shadow table or backup before overwriting
- Run at lower priority than normal pipelines
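A dedicated entry point might look like the sketch below. The flag names and defaults are illustrative, not an established CLI:

```python
import argparse
from datetime import date

def parse_backfill_args(argv=None):
    """Dedicated backfill entry point: arbitrary date range, watermark
    override, and a staging target by default (names are illustrative)."""
    p = argparse.ArgumentParser(description="Run a backfill over a date range")
    p.add_argument("--start", type=date.fromisoformat, required=True)
    p.add_argument("--end", type=date.fromisoformat, required=True)
    p.add_argument("--ignore-watermark", action="store_true",
                   help="bypass the incremental pipeline's watermark logic")
    p.add_argument("--target", default="fact_orders_staging",
                   help="write to a shadow table, never straight to production")
    return p.parse_args(argv)
```

Defaulting the target to a staging table means an operator has to opt in explicitly before a backfill touches production.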
Validation During Backfills
Backfills must include validation. A bug that affected production data might also affect the backfill logic if the bug is in shared code.
Pre-backfill validation: Validate the source data before starting the backfill. Confirm row counts, data quality, and freshness.
Incremental validation during backfill: Validate each chunk as it completes. Compare row counts, aggregate statistics, and data quality metrics against expected values.
Post-backfill validation: After the full backfill completes, run comprehensive validation on the entire dataset. Compare against alternative data sources if available.
```python
def validate_chunk(chunk_start, chunk_end, result_df):
    """Validate a backfill chunk."""
    expected_row_count = estimate_expected_rows(chunk_start, chunk_end)
    actual_row_count = len(result_df)
    # Allow up to 10% deviation from the historical estimate
    if abs(actual_row_count - expected_row_count) > expected_row_count * 0.1:
        raise ValidationError(
            f"Row count mismatch: expected ~{expected_row_count}, got {actual_row_count}"
        )
    null_rate = result_df['customer_id'].isna().mean()
    if null_rate > 0.001:
        raise ValidationError(f"Null rate in customer_id too high: {null_rate}")
    print(f"Chunk {chunk_start} to {chunk_end} validated: {actual_row_count} rows")
```
Handling Long-Running Backfills
18 months of data is a lot. A backfill might run for days. During that time, the normal pipeline continues to run and produce new incremental data.
The overlap problem: The backfill processes March 2025. The normal pipeline processes March 2025 incrementally. When the backfill completes and overwrites March 2025 data, it might overwrite the correct incremental data with the old (buggy) data.
Solutions:
Write to a new table: The backfill writes to fact_orders_v2. The normal pipeline writes to fact_orders. After the backfill completes, swap them atomically or run a migration.
Partition the backfill window: Backfill the historical range, then run the normal pipeline on the overlap period to reconcile. Or stop the normal pipeline during the backfill overlap period.
Use a backfill flag: Mark records as “backfill” vs “incremental”. Downstream queries union both, with incremental taking precedence during the overlap period.
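The new-table approach ends with an atomic swap, which can be sketched as two renames on one connection. This assumes an engine with transactional DDL (PostgreSQL qualifies); table names are illustrative:

```python
def swap_tables(conn, live: str, staging: str) -> None:
    """Promote the backfilled staging table: live -> live_old, staging -> live.
    Sketch for any DB-API connection whose engine supports RENAME."""
    cur = conn.cursor()
    cur.execute(f"ALTER TABLE {live} RENAME TO {live}_old")
    cur.execute(f"ALTER TABLE {staging} RENAME TO {live}")
    conn.commit()  # both renames become visible together
```

Keeping the old table around as `live_old` doubles as a rollback path: if the backfilled data turns out to be wrong, renaming it back restores the prior state.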
Backfill Chunking Flow
A backfill processes historical data in discrete chunks. Each chunk is a time window, processed independently, with checkpointing between chunks:
```mermaid
flowchart TD
    Start[Start: Jan 2024] --> Check1{Checkpoint exists?}
    Check1 -->|No| Chunk1[Chunk 1: Jan 1-7]
    Check1 -->|Yes| Resume[Resume from checkpoint]
    Chunk1 --> Validate1{Validate chunk?}
    Validate1 -->|Fail| Investigate[Investigate and fix]
    Investigate --> Chunk1
    Validate1 -->|Pass| Write1[Write to staging]
    Write1 --> Checkpoint1[Save checkpoint: Jan 7]
    Checkpoint1 --> Check2{More chunks?}
    Check2 -->|Yes| Chunk2[Chunk 2: Jan 8-14]
    Chunk2 --> Validate2{Validate chunk?}
    Validate2 -->|Pass| Write2[Write to staging]
    Write2 --> Checkpoint2[Save checkpoint: Jan 14]
    Checkpoint2 --> Check3{More chunks?}
    Check3 -->|Yes| Chunk3[Chunk 3: Jan 15-21]
    Check3 -->|No| Swap[Swap staging to production]
    Chunk3 --> Validate3{Validate chunk?}
    Validate3 -->|Pass| Write3[Write to staging]
    Write3 --> Checkpoint3[Save checkpoint: Jan 21]
    Checkpoint3 --> Check4{More chunks?}
    Check4 -->|No| Swap
    Swap --> End[Done]
```
The checkpoint is the critical piece. If the backfill crashes on chunk 73, it resumes from the chunk 72 checkpoint rather than restarting from scratch.
Capacity Estimation for Backfills
Backfill capacity planning focuses on source scan bandwidth and warehouse write throughput.
Source scan rate: A read replica handling backfill queries scans at roughly 50-100 MB/sec per connection. Eight parallel connections scan ~400-800 MB/sec. The limiting factor is usually replica CPU under the combined production and backfill read load.
Chunk size math: If you have 18 months of data (roughly 540 days) and want each chunk to be 7 days, that is 78 chunks (the last one partial). If each chunk takes 10 minutes, the backfill runs for ~13 hours. If each chunk takes 1 hour, you are at over 3 days.
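The arithmetic is simple enough to script, which makes it easy to compare chunk sizes before committing to one:

```python
from math import ceil

def backfill_hours(total_days: int, chunk_days: int, minutes_per_chunk: float) -> float:
    """Estimated wall-clock hours for a chunked backfill."""
    chunks = ceil(total_days / chunk_days)  # last chunk may be partial
    return chunks * minutes_per_chunk / 60

# 540 days in 7-day chunks is 78 chunks; at 10 minutes each, ~13 hours.
```

Running the estimate for a few chunk sizes up front is cheaper than discovering mid-backfill that the job will take a week.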
Warehouse write rate: A backfill that writes 100K rows per chunk with 100 byte rows writes 10 MB per chunk. Warehouse write is rarely the bottleneck for analytical backfills. The bottleneck is the source scan.
Throttling impact: If you throttle reads to 10 calls per minute to avoid saturating the replica, a scan that would take 1 hour takes 6 hours. Budget the throttle rate based on the replica’s headroom above normal production load.
Observability Checklist for Backfills
Track these during any backfill operation:
Chunk progress: Which chunk is running, which completed, which failed. A running backfill without chunk progress is stuck.
Rows per chunk: Does each chunk process roughly the same row count? Sudden changes indicate data pattern changes or source issues.
Validation metrics per chunk: Null rates, duplicate rates, and aggregate totals per chunk. Drifting validation metrics mid-backfill indicate problems.
Source replica load: CPU and connection utilization on the read replica. A backfill that degrades production queries needs throttling.
```sql
-- Example: backfill progress check
SELECT
    pipeline_name,
    MAX(CASE WHEN status = 'completed' THEN chunk_end END) AS last_completed_chunk,
    COUNT(CASE WHEN status = 'completed' THEN 1 END) AS chunks_completed,
    COUNT(CASE WHEN status = 'failed' THEN 1 END) AS chunks_failed,
    SUM(rows_processed) AS total_rows_processed
FROM backfill_checkpoints
GROUP BY pipeline_name;
```
Alert on: any failed chunk (stop immediately), source replica CPU exceeding 80%, chunk row count drifting more than 50% from the average.
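The row-count drift check can be sketched as a pass over the per-chunk counts, flagging anything that deviates from the mean by more than a configured fraction:

```python
def drifting_chunks(row_counts, threshold=0.5):
    """Indices of chunks whose row count deviates from the mean by more than
    `threshold` as a fraction (0.5 = 50%). A sketch, not a full detector."""
    mean = sum(row_counts) / len(row_counts)
    return [i for i, n in enumerate(row_counts) if abs(n - mean) > threshold * mean]
```

A rolling mean over recent chunks would be more robust for data with genuine seasonal growth, but the fixed-mean version is enough to catch a chunk that suddenly triples.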
Rollback and Recovery
Backfills can fail partway through. If a backfill of 100 chunks fails on chunk 73, you need to recover without corrupting data.
Snapshot before backfill: Take a snapshot or backup of the destination table before starting the backfill. If the backfill fails catastrophically, restore from the snapshot.
Chunk-level checkpoints: Store the last completed chunk in a checkpoint table. If the backfill restarts, it resumes from the last successful chunk rather than starting over.
```sql
-- Checkpoint table for backfill progress
CREATE TABLE backfill_checkpoints (
    pipeline_name   VARCHAR,
    chunk_start     DATE,
    chunk_end       DATE,
    status          VARCHAR,   -- 'running', 'completed', 'failed'
    completed_at    TIMESTAMP,
    rows_processed  BIGINT
);
```
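Reading the resume point back out of that table might look like the following sketch (sqlite-flavored connection, dates stored as ISO strings):

```python
from datetime import date

def resume_point(conn, pipeline: str):
    """End date of the last completed chunk, or None if starting fresh."""
    row = conn.execute(
        "SELECT MAX(chunk_end) FROM backfill_checkpoints "
        "WHERE pipeline_name = ? AND status = 'completed'",
        (pipeline,),
    ).fetchone()
    return date.fromisoformat(row[0]) if row[0] else None
```

Filtering on `status = 'completed'` matters: a chunk marked 'failed' or 'running' must not advance the resume point, or the restart would skip unprocessed data.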
Atomic chunk writes: Each chunk writes to its own staging table. After successful validation, the chunk is merged into the main table. If a chunk fails validation, the staging table remains for debugging while the main table is untouched.
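A chunk merge under that pattern can be sketched as a single transactional insert. Here `with conn` commits on success and rolls back on an exception, leaving the staging table intact for debugging (sqlite-style connection; table names are illustrative):

```python
def merge_chunk(conn, staging: str, main: str) -> None:
    """Merge a validated chunk's staging table into the main table atomically."""
    with conn:  # transaction: commit on success, rollback on error
        conn.execute(f"INSERT INTO {main} SELECT * FROM {staging}")
```

If the chunk carries a natural key, an upsert (e.g. `INSERT ... ON CONFLICT`) in place of the plain insert makes the merge safe to retry.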
When to Avoid Backfills
Some situations make backfills dangerous:
Very large datasets: A backfill of a multi-trillion row fact table might take weeks and consume significant resources. The cost and duration might exceed the value of correcting historical data.
Regulatory constraints: Some data cannot be reprocessed due to regulatory requirements. Audit logs must reflect what was true at the time, not what should have been true.
Source data no longer available: If the source system has deleted or archived historical data, the backfill cannot run. This happens when CDC retention is short or source tables are partitioned and old partitions are dropped.
Common Backfill Anti-Patterns
Backfills go wrong in predictable ways:
No checkpointing: A backfill without checkpointing restarts from the beginning when it fails. For a 3-day backfill, this is catastrophic. Always checkpoint at chunk boundaries.
Running backfills on production: A backfill that scans the production primary database will saturate connections and slow down production queries. Always use a read replica.
No validation until the end: If you only validate after the full backfill completes, you might discover problems hours after the bad data was written. Validate each chunk as it completes.
Ignoring the overlap problem: The backfill processes March 2025 while the normal pipeline also processes March 2025. Without a strategy, the backfill overwrites correct incremental data. Plan for the overlap.
Backfill pipeline same as production pipeline: Backfills have different resource profiles and failure modes. A separate backfill entry point with its own configuration avoids surprises.
Quick Recap
- Backfills reprocess historical data when bugs are fixed, schemas change, or new sources are connected.
- Chunk by time range. Use read replicas. Throttle to avoid saturating source systems.
- Checkpoint after each chunk. If the backfill fails, resume from the last successful chunk.
- Validate each chunk as it completes. Validate the full dataset after the backfill finishes.
- Plan for the overlap: the normal pipeline keeps running while the backfill processes the same period.
Conclusion
Backfills are an inevitable part of running production data pipelines. When bugs happen, when schemas change, when new sources come online, historical data must be reprocessed.
The key to safe backfills is treating them as a distinct operation with different risk profiles. Chunk the work. Use read replicas. Validate incrementally. Plan for the overlap with normal incremental pipelines.
Backfills are not a sign of failure. They are a sign that the pipeline handles the reality of evolving systems. The goal is to make backfills routine, predictable, and safe.
For related reading on pipeline reliability, see Pipeline Orchestration and Data Quality.