Backfills: Rebuilding Historical Data at Scale
Backfills reprocess historical data to fix bugs or load new sources. Learn strategies for running backfills safely without breaking production pipelines.
A bug in your transformation logic produced incorrect customer segment assignments for the past 18 months. Your revenue by segment report is wrong. Finance needs corrected numbers before the board meeting next week. You fix the bug. Now you need to reprocess 18 months of historical data.
This is a backfill. Backfills reprocess historical data through a pipeline. They are among the riskiest operations in data engineering: they run at high volume, often under time pressure, and can easily overwhelm source systems, saturate destination warehouses, or leave inconsistent state if not handled carefully.
Why Backfills Happen
Backfills are not optional pipeline operations. They are a fact of life in production data systems.
Bug fixes: A transformation bug produces incorrect results. Fixing the bug is only half the solution. Historical data must be reprocessed with the corrected logic.
Schema changes: A new column is added to a table. Historical records need the new column populated with a derived or default value.
New data sources: A new data source is connected to the pipeline. Historical data from that source needs to be loaded.
Regulation or audit: A compliance requirement demands recalculation of historical metrics under new definitions.
Warehouse migration: Moving from one data warehouse to another requires a full historical reload.
Backfill vs Normal Pipeline
A normal pipeline processes new data incrementally. A backfill reprocesses existing historical data. The difference matters operationally:
| Aspect | Normal Pipeline | Backfill |
|---|---|---|
| Data volume | Delta (new records) | Full historical range |
| Frequency | Continuous or scheduled | One-time or limited |
| Risk | Lower (small delta) | Higher (large volume) |
| Source load | Minimal | Can be significant |
| Destination write | Upsert | Truncate and reload or full upsert |
Backfills require a different approach to resource management, error handling, and validation.
Strategies for Safe Backfills
Chunking by time range
Divide the historical range into chunks and process one chunk at a time. This limits the load on source and destination systems at any point.
```python
from datetime import date, timedelta

def backfill_orders(start_date: date, end_date: date, chunk_days: int = 7):
    """Backfill orders in weekly chunks."""
    current = start_date
    while current < end_date:
        chunk_end = min(current + timedelta(days=chunk_days), end_date)
        print(f"Processing {current} to {chunk_end}")
        try:
            backfill_chunk(current, chunk_end)
            update_backfill_checkpoint(chunk_end)  # checkpoint the completed end, not the start
        except Exception as e:
            print(f"Chunk {current} to {chunk_end} failed: {e}")
            raise  # Stop and investigate
        current = chunk_end
```
Chunk size is a trade-off. Smaller chunks limit resource usage but take longer overall. Larger chunks are faster but put more pressure on systems. Start conservative (smaller chunks) and increase if the systems handle it well.
Read replicas
Backfills should not impact production source systems. If your source is a PostgreSQL primary, a backfill query scanning millions of rows will saturate the connection pool and slow down production writes.
Use a read replica for backfills. The replica mirrors the primary and is designed for read-heavy workloads. Backfill queries run against the replica, production queries against the primary.
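As a minimal routing sketch, the split can be made explicit in configuration. The connection strings here are hypothetical placeholders:

```python
# Hypothetical DSNs; substitute your own connection strings.
PRIMARY_DSN = "postgresql://primary.db.internal:5432/app"
REPLICA_DSN = "postgresql://replica.db.internal:5432/app"

def dsn_for(workload: str) -> str:
    """Route backfill reads to the replica, everything else to the primary."""
    return REPLICA_DSN if workload == "backfill" else PRIMARY_DSN
```

Making the routing a function of the workload type, rather than a hardcoded connection, keeps the backfill code path from accidentally pointing at the primary.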
Throttling and rate limiting
Even against a replica, aggressive backfill queries can cause problems. Implement throttling to limit the rate of reads and writes. Sleep between chunks if necessary.
```python
from ratelimit import limits, sleep_and_retry

@sleep_and_retry
@limits(calls=100, period=60)
def extract_chunk(start_date, end_date):
    """Extract with rate limiting: max 100 calls per minute."""
    return query_source(start_date, end_date)
```
Separate backfill pipeline
Do not run backfills through the same pipeline code path as normal incremental loads. Backfill pipelines have different resource profiles, error handling, and rollback requirements.
Create a separate backfill entry point that can:
- Process arbitrary date ranges
- Override the normal watermark logic
- Write to a shadow table or backup before overwriting
- Run at lower priority than normal pipelines
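A dedicated entry point might look like the sketch below. The flag names and defaults are illustrative, not an established CLI:

```python
import argparse
from datetime import date

def parse_backfill_args(argv=None):
    """Dedicated backfill entry point: arbitrary date range, watermark
    override, and a staging target by default (names are illustrative)."""
    p = argparse.ArgumentParser(description="Run a backfill over a date range")
    p.add_argument("--start", type=date.fromisoformat, required=True)
    p.add_argument("--end", type=date.fromisoformat, required=True)
    p.add_argument("--ignore-watermark", action="store_true",
                   help="bypass the incremental pipeline's watermark logic")
    p.add_argument("--target", default="fact_orders_staging",
                   help="write to a shadow table, never straight to production")
    return p.parse_args(argv)
```

Defaulting the target to a staging table means an operator has to opt in explicitly before a backfill touches production.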
Validation During Backfills
Backfills must include validation. A bug that affected production data might also affect the backfill logic if the bug is in shared code.
Pre-backfill validation: Validate the source data before starting the backfill. Confirm row counts, data quality, and freshness.
Incremental validation during backfill: Validate each chunk as it completes. Compare row counts, aggregate statistics, and data quality metrics against expected values.
Post-backfill validation: After the full backfill completes, run comprehensive validation on the entire dataset. Compare against alternative data sources if available.
```python
def validate_chunk(chunk_start, chunk_end, result_df):
    """Validate a backfill chunk."""
    expected_row_count = estimate_expected_rows(chunk_start, chunk_end)
    actual_row_count = len(result_df)
    # Allow up to 10% deviation from the historical estimate
    if abs(actual_row_count - expected_row_count) > expected_row_count * 0.1:
        raise ValidationError(
            f"Row count mismatch: expected ~{expected_row_count}, got {actual_row_count}"
        )
    null_rate = result_df['customer_id'].isna().mean()
    if null_rate > 0.001:
        raise ValidationError(f"Null rate in customer_id too high: {null_rate}")
    print(f"Chunk {chunk_start} to {chunk_end} validated: {actual_row_count} rows")
```
Handling Long-Running Backfills
18 months of data is a lot. A backfill might run for days. During that time, the normal pipeline continues to run and produce new incremental data.
The overlap problem: The backfill processes March 2025. The normal pipeline processes March 2025 incrementally. When the backfill completes and overwrites March 2025 data, it might overwrite the correct incremental data with the old (buggy) data.
Solutions:
Write to a new table: The backfill writes to fact_orders_v2. The normal pipeline writes to fact_orders. After the backfill completes, swap them atomically or run a migration.
Partition the backfill window: Backfill the historical range, then run the normal pipeline on the overlap period to reconcile. Or stop the normal pipeline during the backfill overlap period.
Use a backfill flag: Mark records as “backfill” vs “incremental”. Downstream queries union both, with incremental taking precedence during the overlap period.
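The new-table approach ends with an atomic swap, which can be sketched as two renames on one connection. This assumes an engine with transactional DDL (PostgreSQL qualifies); table names are illustrative:

```python
def swap_tables(conn, live: str, staging: str) -> None:
    """Promote the backfilled staging table: live -> live_old, staging -> live.
    Sketch for any DB-API connection whose engine supports RENAME."""
    cur = conn.cursor()
    cur.execute(f"ALTER TABLE {live} RENAME TO {live}_old")
    cur.execute(f"ALTER TABLE {staging} RENAME TO {live}")
    conn.commit()  # both renames become visible together
```

Keeping the old table around as `live_old` doubles as a rollback path: if the backfilled data turns out to be wrong, renaming it back restores the prior state.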
Backfill Chunking Flow
A backfill processes historical data in discrete chunks. Each chunk is a time window, processed independently, with checkpointing between chunks:
```mermaid
flowchart TD
    Start[Start: Jan 2024] --> Check1{Checkpoint exists?}
    Check1 -->|No| Chunk1[Chunk 1: Jan 1-7]
    Check1 -->|Yes| Resume[Resume from checkpoint]
    Chunk1 --> Validate1{Validate chunk?}
    Validate1 -->|Fail| Investigate[Investigate and fix]
    Investigate --> Chunk1
    Validate1 -->|Pass| Write1[Write to staging]
    Write1 --> Checkpoint1[Save checkpoint: Jan 7]
    Checkpoint1 --> Check2{More chunks?}
    Check2 -->|Yes| Chunk2[Chunk 2: Jan 8-14]
    Chunk2 --> Validate2{Validate chunk?}
    Validate2 -->|Pass| Write2[Write to staging]
    Write2 --> Checkpoint2[Save checkpoint: Jan 14]
    Checkpoint2 --> Check3{More chunks?}
    Check3 -->|Yes| Chunk3[Chunk 3: Jan 15-21]
    Check3 -->|No| Swap[Swap staging to production]
    Chunk3 --> Validate3{Validate chunk?}
    Validate3 -->|Pass| Write3[Write to staging]
    Write3 --> Checkpoint3[Save checkpoint: Jan 21]
    Checkpoint3 --> Check4{More chunks?}
    Check4 -->|No| Swap
    Swap --> End[Done]
```
The checkpoint is the critical piece. If the backfill crashes on chunk 73, it resumes from the chunk 72 checkpoint rather than restarting from scratch.
Capacity Estimation for Backfills
Backfill capacity planning focuses on source scan bandwidth and warehouse write throughput.
Source scan rate: A read replica handling backfill queries scans at roughly 50-100 MB/sec per connection. Eight parallel connections scan ~400-800 MB/sec. The limiting factor is usually replica CPU under the combined production and backfill read load.
Chunk size math: If you have 18 months of data (roughly 540 days) and want each chunk to be 7 days, that is 78 chunks (the last one partial). If each chunk takes 10 minutes, the backfill runs for ~13 hours. If each chunk takes 1 hour, you are at over 3 days.
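The arithmetic is simple enough to script, which makes it easy to compare chunk sizes before committing to one:

```python
from math import ceil

def backfill_hours(total_days: int, chunk_days: int, minutes_per_chunk: float) -> float:
    """Estimated wall-clock hours for a chunked backfill."""
    chunks = ceil(total_days / chunk_days)  # last chunk may be partial
    return chunks * minutes_per_chunk / 60

# 540 days in 7-day chunks is 78 chunks; at 10 minutes each, ~13 hours.
```

Running the estimate for a few chunk sizes up front is cheaper than discovering mid-backfill that the job will take a week.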
Warehouse write rate: A backfill that writes 100K rows per chunk with 100 byte rows writes 10 MB per chunk. Warehouse write is rarely the bottleneck for analytical backfills. The bottleneck is the source scan.
Throttling impact: If you throttle reads to 10 calls per minute to avoid saturating the replica, a scan that would take 1 hour takes 6 hours. Budget the throttle rate based on the replica’s headroom above normal production load.
Observability Checklist for Backfills
Track these during any backfill operation:
Chunk progress: Which chunk is running, which completed, which failed. A running backfill without chunk progress is stuck.
Rows per chunk: Does each chunk process roughly the same row count? Sudden changes indicate data pattern changes or source issues.
Validation metrics per chunk: Null rates, duplicate rates, and aggregate totals per chunk. Drifting validation metrics mid-backfill indicate problems.
Source replica load: CPU and connection utilization on the read replica. A backfill that degrades production queries needs throttling.
```sql
-- Example: backfill progress check
SELECT
    pipeline_name,
    MAX(CASE WHEN status = 'completed' THEN chunk_end END) AS last_completed_chunk,
    COUNT(CASE WHEN status = 'completed' THEN 1 END) AS chunks_completed,
    COUNT(CASE WHEN status = 'failed' THEN 1 END) AS chunks_failed,
    SUM(rows_processed) AS total_rows_processed
FROM backfill_checkpoints
GROUP BY pipeline_name;
```
Alert on: any failed chunk (stop immediately), source replica CPU exceeding 80%, chunk row count drifting more than 50% from the average.
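The row-count drift check can be sketched as a pass over the per-chunk counts, flagging anything that deviates from the mean by more than a configured fraction:

```python
def drifting_chunks(row_counts, threshold=0.5):
    """Indices of chunks whose row count deviates from the mean by more than
    `threshold` as a fraction (0.5 = 50%). A sketch, not a full detector."""
    mean = sum(row_counts) / len(row_counts)
    return [i for i, n in enumerate(row_counts) if abs(n - mean) > threshold * mean]
```

A rolling mean over recent chunks would be more robust for data with genuine seasonal growth, but the fixed-mean version is enough to catch a chunk that suddenly triples.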
Rollback and Recovery
Backfills can fail partway through. If a backfill of 100 chunks fails on chunk 73, you need to recover without corrupting data.
Snapshot before backfill: Take a snapshot or backup of the destination table before starting the backfill. If the backfill fails catastrophically, restore from the snapshot.
Chunk-level checkpoints: Store the last completed chunk in a checkpoint table. If the backfill restarts, it resumes from the last successful chunk rather than starting over.
```sql
-- Checkpoint table for backfill progress
CREATE TABLE backfill_checkpoints (
    pipeline_name   VARCHAR,
    chunk_start     DATE,
    chunk_end       DATE,
    status          VARCHAR,   -- 'running', 'completed', 'failed'
    completed_at    TIMESTAMP,
    rows_processed  BIGINT
);
```
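Reading the resume point back out of that table might look like the following sketch (sqlite-flavored connection, dates stored as ISO strings):

```python
from datetime import date

def resume_point(conn, pipeline: str):
    """End date of the last completed chunk, or None if starting fresh."""
    row = conn.execute(
        "SELECT MAX(chunk_end) FROM backfill_checkpoints "
        "WHERE pipeline_name = ? AND status = 'completed'",
        (pipeline,),
    ).fetchone()
    return date.fromisoformat(row[0]) if row[0] else None
```

Filtering on `status = 'completed'` matters: a chunk marked 'failed' or 'running' must not advance the resume point, or the restart would skip unprocessed data.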
Atomic chunk writes: Each chunk writes to its own staging table. After successful validation, the chunk is merged into the main table. If a chunk fails validation, the staging table remains for debugging while the main table is untouched.
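A chunk merge under that pattern can be sketched as a single transactional insert. Here `with conn` commits on success and rolls back on an exception, leaving the staging table intact for debugging (sqlite-style connection; table names are illustrative):

```python
def merge_chunk(conn, staging: str, main: str) -> None:
    """Merge a validated chunk's staging table into the main table atomically."""
    with conn:  # transaction: commit on success, rollback on error
        conn.execute(f"INSERT INTO {main} SELECT * FROM {staging}")
```

If the chunk carries a natural key, an upsert (e.g. `INSERT ... ON CONFLICT`) in place of the plain insert makes the merge safe to retry.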
When to Avoid Backfills
Some situations make backfills dangerous:
Very large datasets: A backfill of a multi-trillion row fact table might take weeks and consume significant resources. The cost and duration might exceed the value of correcting historical data.
Regulatory constraints: Some data cannot be reprocessed due to regulatory requirements. Audit logs must reflect what was true at the time, not what should have been true.
Source data no longer available: If the source system has deleted or archived historical data, the backfill cannot run. This happens when CDC retention is short or source tables are partitioned and old partitions are dropped.
Common Backfill Anti-Patterns
Backfills go wrong in predictable ways:
No checkpointing: A backfill without checkpointing restarts from the beginning when it fails. For a 3-day backfill, this is catastrophic. Always checkpoint at chunk boundaries.
Running backfills on production: A backfill that scans the production primary database will saturate connections and slow down production queries. Always use a read replica.
No validation until the end: If you only validate after the full backfill completes, you might discover problems hours after the bad data was written. Validate each chunk as it completes.
Ignoring the overlap problem: The backfill processes March 2025 while the normal pipeline also processes March 2025. Without a strategy, the backfill overwrites correct incremental data. Plan for the overlap.
Backfill pipeline same as production pipeline: Backfills have different resource profiles and failure modes. A separate backfill entry point with its own configuration avoids surprises.
Quick Recap
- Backfills reprocess historical data when bugs are fixed, schemas change, or new sources are connected.
- Chunk by time range. Use read replicas. Throttle to avoid saturating source systems.
- Checkpoint after each chunk. If the backfill fails, resume from the last successful chunk.
- Validate each chunk as it completes. Validate the full dataset after the backfill finishes.
- Plan for the overlap: the normal pipeline keeps running while the backfill processes the same period.
Conclusion
Backfills are an inevitable part of running production data pipelines. When bugs happen, when schemas change, when new sources come online, historical data must be reprocessed.
The key to safe backfills is treating them as a distinct operation with different risk profiles. Chunk the work. Use read replicas. Validate incrementally. Plan for the overlap with normal incremental pipelines.
Backfills are not a sign of failure. They are a sign that the pipeline handles the reality of evolving systems. The goal is to make backfills routine, predictable, and safe.
For related reading on pipeline reliability, see Pipeline Orchestration and Data Quality.