Extract-Transform-Load: The Foundation of Data Pipelines

ETL is the core data integration pattern. Learn how extraction, transformation, and loading work, and how modern ETL differs from classical approaches.

Reading time: 9 min


ETL has been the backbone of data integration since the 1970s. Extract data from source systems, transform it into the right shape, and load it into a destination. The pattern persists because it works. What has changed is the technology, the scale, and where transformations happen.

This post covers the three stages of ETL, classical versus modern implementations, and the trade-offs that drive architectural decisions.

Extract: Getting Data Out of Sources

Extraction pulls data from source systems. Sources can be relational databases, SaaS APIs, file systems, event streams, or legacy systems with proprietary formats.

There are several extraction patterns:

Full table extraction: Read the entire source table. Simple but expensive for large tables. Practical only for small datasets or when full refresh is acceptable.

Incremental extraction: Extract only records that changed since the last run. Requires a way to identify changed records: a timestamp column, an incrementing ID, or a change log like a database WAL (see Change Data Capture).

-- Incremental extraction using a watermark column
SELECT * FROM orders
WHERE updated_at > :last_extracted_timestamp
  AND updated_at <= :current_timestamp;

Log-based extraction: Read the database transaction log directly. This captures every change without modifying the source tables or adding query overhead. CDC tools like Debezium use this approach.
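Consumed programmatically, a log-based change event arrives as a structured envelope. The sketch below applies a simplified, Debezium-style event to an in-memory table; the `op`/`before`/`after` field names mirror Debezium's payload, but the shape here is illustrative, not the full format:

```python
def apply_change_event(table: dict, event: dict) -> None:
    """Apply one simplified CDC event to a table keyed by id."""
    op = event["op"]  # 'c' = create, 'u' = update, 'd' = delete
    if op in ("c", "u"):
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":
        table.pop(event["before"]["id"], None)

table = {}
apply_change_event(table, {"op": "c", "before": None,
                           "after": {"id": 1, "email": "a@x.com"}})
apply_change_event(table, {"op": "u", "before": {"id": 1},
                           "after": {"id": 1, "email": "b@x.com"}})
```

Replaying the full event stream in order reconstructs the current state of the source table, which is exactly what CDC-based pipelines rely on.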

API pagination: For SaaS sources, extract data through API endpoints that return paginated results. Handle rate limiting and token expiration.
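The pagination loop can be isolated from any specific API. This sketch assumes a caller-supplied `fetch_page(cursor)` function returning `(records, next_cursor)`, and uses exponential backoff as a stand-in for real rate-limit and token-refresh handling:

```python
import time

def extract_paginated(fetch_page, max_retries=3):
    """Yield records from a paginated source.

    fetch_page(cursor) is caller-supplied (an assumption here) and returns
    (records, next_cursor); next_cursor is None after the last page.
    """
    cursor = None
    while True:
        for attempt in range(max_retries):
            try:
                records, cursor = fetch_page(cursor)
                break
            except RuntimeError:          # stand-in for a 429 / rate-limit response
                time.sleep(2 ** attempt)  # exponential backoff, then retry
        else:
            raise RuntimeError("page fetch failed after retries")
        yield from records
        if cursor is None:
            return

# Usage with a fake two-page source:
pages = {None: ([1, 2], "p2"), "p2": ([3], None)}
rows = list(extract_paginated(lambda c: pages[c]))
```

Keeping the fetcher pluggable makes the loop testable without network access and reusable across SaaS sources.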

Transform: Shaping Data for Its Destination

Transformation is where the work happens. Raw source data rarely arrives in the form your destination expects. Transformations clean, filter, aggregate, join, and reshape data.

Common transformation types

Data type conversion: Strings to dates, numeric strings to integers, raw JSON to structured fields.

Deduplication: Remove duplicate records based on a key. Source systems often produce duplicates, especially in CDC scenarios.

Key substitution: Replace operational keys with surrogate keys. An order references customer_id 12345, but the data warehouse uses a different key scheme. A lookup table maps between them.

Denormalization: Join related tables to create wide denormalized records. The classic star schema fact table joins to dimension tables.

Aggregation: Roll up detailed transactions into summary records. Daily sales become monthly summaries.
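A minimal aggregation sketch, assuming each transaction carries an ISO date string and an amount (both field names are illustrative):

```python
from collections import defaultdict

def monthly_sales(transactions):
    """Roll up per-transaction amounts into per-month totals."""
    totals = defaultdict(float)
    for t in transactions:
        month = t["date"][:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[month] += t["amount"]
    return dict(totals)

summary = monthly_sales([
    {"date": "2024-01-03", "amount": 10.0},
    {"date": "2024-01-19", "amount": 5.0},
    {"date": "2024-02-01", "amount": 7.5},
])
```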

# Example: Simple deduplication transformation in Python
def deduplicate_orders(orders: list[dict]) -> list[dict]:
    seen = set()
    deduped = []
    for order in orders:
        key = (order['order_id'], order['order_line_id'])
        if key not in seen:
            seen.add(key)
            deduped.append(order)
    return deduped
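The key substitution step described above can be sketched as a lookup against a surrogate-key map. The dict-based lookup table and field names are illustrative; unknown keys are routed aside rather than silently loaded:

```python
def substitute_keys(orders, customer_key_map):
    """Replace the operational customer_id with a warehouse surrogate key."""
    resolved, unmatched = [], []
    for order in orders:
        sk = customer_key_map.get(order["customer_id"])
        if sk is None:
            unmatched.append(order)  # no mapping yet: hold for review
        else:
            resolved.append({**order, "customer_sk": sk})
    return resolved, unmatched

orders = [{"order_id": 1, "customer_id": 12345},
          {"order_id": 2, "customer_id": 99999}]
resolved, unmatched = substitute_keys(orders, {12345: 7001})
```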

Staging and validation

Good ETL pipelines persist intermediate data between stages: extract to staging, validate, then transform. If transformation fails, the staged data remains for debugging. The raw extracted data is never modified in place.
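A minimal validation pass over staged records might look like this; the required fields and the range check are illustrative stand-ins for real schema and business rules:

```python
def validate_staged(records, required=("order_id", "amount")):
    """Split staged records into valid rows and quarantined rows."""
    valid, quarantined = [], []
    for r in records:
        missing = [f for f in required if r.get(f) is None]
        if missing:
            quarantined.append({"record": r, "reason": f"missing {missing}"})
        elif r["amount"] < 0:
            quarantined.append({"record": r, "reason": "negative amount"})
        else:
            valid.append(r)
    return valid, quarantined

valid, bad = validate_staged([
    {"order_id": 1, "amount": 9.99},
    {"order_id": 2, "amount": -1.0},
    {"order_id": None, "amount": 3.0},
])
```

Quarantined records carry a reason, so debugging starts from the staged data rather than from a failed transformation.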

Load: Putting Data in Its Destination

Loading writes data to the destination. There are two primary strategies:

Full load: Truncate the destination table and reload all data. Simple, always correct, but expensive for large tables. Used for small dimensions and lookup tables.

Incremental load: Insert or update only changed records. Uses upsert patterns (INSERT ON CONFLICT in PostgreSQL, MERGE in SQL Server/BigQuery/Snowflake).

-- Snowflake merge (upsert) pattern
MERGE INTO dim_customers target
USING staging_customers source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET
        email = source.email,
        loyalty_tier = source.loyalty_tier
WHEN NOT MATCHED THEN
    INSERT (customer_id, email, loyalty_tier)
    VALUES (source.customer_id, source.email, source.loyalty_tier);

Load failure handling

Loads fail. Constraint violations, type mismatches, disk space issues. A robust ETL pipeline handles partial failures gracefully. Failed records go to a dead letter queue for investigation. Successful records commit. The pipeline does not silently skip errors.
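The routing described above can be sketched with SQLite standing in for the destination: each record either commits, or lands in a dead letter list with its error. Row-at-a-time insertion is slow and purely illustrative; real pipelines batch and use the destination's error-capture features:

```python
import sqlite3

def load_with_dlq(conn, rows):
    """Insert rows one by one; route constraint violations to a dead letter list."""
    dead_letters = []
    for row in rows:
        try:
            with conn:  # commit each successful insert
                conn.execute(
                    "INSERT INTO orders (order_id, amount) VALUES (?, ?)", row)
        except sqlite3.IntegrityError as e:
            dead_letters.append({"row": row, "error": str(e)})
    return dead_letters

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
dlq = load_with_dlq(conn, [(1, 9.9), (1, 5.0), (2, 3.0)])  # second row violates the PK
```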

Classical ETL vs Modern ETL

Classical ETL tools (Informatica, DataStage, SSIS) ran on dedicated infrastructure with built-in connectors, transformation engines, and scheduling. They were expensive, GUI-driven, and required specialized skills.

Modern ETL often means something different:

SQL-based transformation: dbt transforms data in the data warehouse using SQL. Extract and load first, then transform in-place using SQL queries. This is the ELT pattern (see ELT Pattern).

Python-based pipelines: Airflow, Prefect, or Dagster orchestrate Python-based pipelines. Extraction and loading are custom Python code. Transformations are Python functions or SQL queries.

Streaming ETL: Apache Kafka with Kafka Streams or Apache Flink processes data in real time rather than in batches. See Apache Kafka.

The ETL Pipeline in Context

ETL does not exist in isolation. An ETL pipeline is part of a larger data architecture:

flowchart LR
    Source1[Source Systems] --> Extract
    Source2[APIs / DBs] --> Extract
    Extract -->|raw| Staging[Staging Area]
    Staging -->|validate| Transform
    Transform -->|clean| Load
    Load -->|structured| Warehouse[Data Warehouse]
    Load -->|raw| Lake[Data Lake]

For more on data warehouse architecture, see Data Warehousing.

Common ETL Pitfalls

Transforming in the wrong order

Business logic that should happen in the source system ends up in the ETL pipeline. A source system with incorrect timezone handling gets corrected in ETL. This works until the source system is replaced and the correction logic disappears. Push transformation logic to the source when the source owns it.

No rollback strategy

A failed load midway through a large batch leaves the destination in a partial state. Use staging tables and atomic commits. Either the full load succeeds or the previous state is preserved.
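One way to get atomic replacement is to load into a staging table and swap it into place inside a single transaction. A sketch using SQLite (table names are illustrative; warehouses offer analogous swap or transactional-replace mechanisms):

```python
import sqlite3

def atomic_swap_load(conn, rows):
    """Load into a staging table, then atomically replace the live table.

    If anything fails before COMMIT, the previous live table is untouched.
    """
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        cur.execute("DROP TABLE IF EXISTS sales_staging")
        cur.execute("CREATE TABLE sales_staging (id INTEGER, amount REAL)")
        cur.executemany("INSERT INTO sales_staging VALUES (?, ?)", rows)
        cur.execute("DROP TABLE IF EXISTS sales")
        cur.execute("ALTER TABLE sales_staging RENAME TO sales")
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit mode; transactions managed explicitly above
atomic_swap_load(conn, [(1, 10.0), (2, 20.0)])
```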

Ignoring Slowly Changing Dimensions

When source dimension data changes, the warehouse must reflect that change appropriately (see Data Warehousing for SCD types). ETL pipelines must handle Type 2 SCD correctly.
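A Type 2 change expires the current version of a row and appends a new one. A minimal in-memory sketch, with illustrative column names (`valid_from`/`valid_to`, current version marked by `valid_to = None`):

```python
from datetime import date

def apply_scd2(history, new_row, today):
    """Apply a Type 2 change to the version history of one business key."""
    current = next((r for r in history if r["valid_to"] is None), None)
    if current is not None:
        if current["loyalty_tier"] == new_row["loyalty_tier"]:
            return history  # tracked attribute unchanged; nothing to do
        current["valid_to"] = today  # expire the old version
    history.append({**new_row, "valid_from": today, "valid_to": None})
    return history

history = [{"customer_id": 1, "loyalty_tier": "silver",
            "valid_from": date(2023, 1, 1), "valid_to": None}]
apply_scd2(history, {"customer_id": 1, "loyalty_tier": "gold"}, date(2024, 6, 1))
```

The old version survives with a closed validity window, so point-in-time queries against the dimension remain correct.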

Missing data quality checks

ETL pipelines that load data without validation pass bad data downstream. Build data quality checks into the transformation stage. Reject or quarantine records that fail validation.

ETL vs ELT vs Streaming ETL Trade-offs

The right approach depends on your workload, team skills, and infrastructure.

| Aspect | Classical ETL | ELT | Streaming ETL |
| --- | --- | --- | --- |
| Latency | Hours to days | Hours to minutes | Seconds to minutes |
| Transformation location | External engine | Data warehouse | Stream processor |
| Compute cost | Dedicated ETL infrastructure | Warehouse compute | Dedicated stream infra |
| Schema evolution | Requires pipeline update | Raw data preserves schema | Schema registry needed |
| Debugging | Moderate | Good (raw data preserved) | Challenging (distributed) |
| Best for | Complex transformations, regulated data | SQL-first teams, replay needs | Real-time use cases |

Classical ETL suits teams with existing ETL investments and complex transformation logic that predates cloud warehouses. ELT suits SQL-first teams who want replay capability and are comfortable with warehouse compute costs. Streaming ETL suits use cases requiring sub-minute latency.

Performance Considerations

ETL performance is dominated by I/O. Extracting from a source and loading to a destination are typically the slowest steps. Transformations in memory or in-database are fast by comparison.

Parallelize extraction by splitting ranges of an incrementing column. Parallelize loads by writing to different partitions or shards. Many ETL frameworks handle this automatically when configured correctly.
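Range-based parallel extraction can be sketched as: split the key space into contiguous chunks, then fetch the chunks concurrently. `fetch_range` is a caller-supplied extractor (an assumption here), e.g. one issuing `SELECT ... WHERE id BETWEEN lo AND hi`:

```python
from concurrent.futures import ThreadPoolExecutor

def id_ranges(min_id, max_id, workers):
    """Split [min_id, max_id] into contiguous chunks, one per worker."""
    step = (max_id - min_id + workers) // workers  # ceiling division
    return [(lo, min(lo + step - 1, max_id))
            for lo in range(min_id, max_id + 1, step)]

def parallel_extract(fetch_range, min_id, max_id, workers=4):
    """Run fetch_range(lo, hi) concurrently over the chunks."""
    ranges = id_ranges(min_id, max_id, workers)
    with ThreadPoolExecutor(max_workers=workers) as ex:
        results = ex.map(lambda r: fetch_range(*r), ranges)
    return [row for chunk in results for row in chunk]

# Usage with a fake extractor that just returns the ids in its range:
rows = parallel_extract(lambda lo, hi: list(range(lo, hi + 1)), 1, 10, workers=4)
```

Threads suit I/O-bound extraction; because `map` preserves chunk order, the combined result is deterministic even though fetches overlap.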

Compression reduces I/O. Extract and load compressed data where the source and destination support it. Parquet and ORC formats (see Data Formats) compress well and are faster to scan than uncompressed CSV.

Capacity Estimation for ETL Pipelines

ETL capacity planning centers on volume, parallelism, and I/O bottlenecks.

Extraction throughput: A sequential read from PostgreSQL on a decent connection scans roughly 50-100 MB/sec. Parallelize across 4 threads and you approach 200-400 MB/sec. The limiting factor is usually database CPU and connection pool saturation, not network bandwidth.

Parallelization math: If your source table has 100 million rows and you want extraction to complete in 10 minutes, you need 100M / 600sec = ~166K rows/sec throughput. At 1 KB per row, that is 166 MB/sec. Eight parallel connections at 21 MB/sec each gets you there.
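The same arithmetic, as a small helper:

```python
def required_throughput(rows, seconds, bytes_per_row):
    """Rows/sec and MB/sec needed to move `rows` within `seconds`."""
    rows_per_sec = rows / seconds
    mb_per_sec = rows_per_sec * bytes_per_row / 1_000_000
    return rows_per_sec, mb_per_sec

# 100M rows in 10 minutes at 1 KB/row:
rps, mbps = required_throughput(100_000_000, 600, 1_000)
connections = round(mbps / 21)  # at ~21 MB/sec per parallel connection
```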

Warehouse load throughput: Snowflake bulk loads achieve roughly 50-100 MB/sec per node, so a LARGE warehouse (8 nodes) can ingest on the order of ~400 MB/sec. BigQuery load jobs handle ~100 MB/sec per slot. Budget warehouse sizing based on your daily delta volume.

I/O bottleneck indicators: If extraction consistently maxes out source CPU or connections, add parallelism but watch for source degradation. If warehouse load is the bottleneck, increase warehouse size or batch size.

Observability Hooks for ETL Pipelines

Track these metrics to catch ETL problems before they cascade:

Extraction metrics: Records read per run, extraction duration, source query performance. A sudden spike in extraction time means the source query plan changed or the delta is larger than expected.

Staging metrics: Staged record count vs expected, staging area disk usage, age of staged files. Staging that grows unbounded means downstream transformation is falling behind.

Transformation metrics: Records transformed vs input, transformation duration, error count per transformation stage. Rising error rates in transformation are the most common leading indicator of data quality problems.

Load metrics: Records written per run, load duration, destination table row count delta. Track the destination row count over time to catch drift.

-- Example: ETL pipeline health check query
SELECT
    pipeline_name,
    MAX(last_run_at) AS last_run,
    SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS success_rate,
    AVG(duration_seconds) AS avg_duration,
    SUM(rows_extracted) AS total_rows_extracted,
    SUM(rows_loaded) AS total_rows_loaded
FROM pipeline_runs
WHERE last_run_at >= NOW() - INTERVAL '7 days'
GROUP BY pipeline_name;

Alert on: success rate below 95%, duration exceeding 2x the 30-day average, row count delta exceeding 3x the normal range.
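Those three alert conditions can be expressed directly; the metric names here are illustrative, mirroring the health check query above:

```python
def etl_alerts(run, baseline):
    """Evaluate the three alert conditions against a pipeline run."""
    alerts = []
    if run["success_rate"] < 0.95:
        alerts.append("success rate below 95%")
    if run["avg_duration"] > 2 * baseline["avg_duration_30d"]:
        alerts.append("duration exceeds 2x 30-day average")
    if abs(run["row_delta"]) > 3 * baseline["normal_row_delta"]:
        alerts.append("row count delta exceeds 3x normal range")
    return alerts

alerts = etl_alerts(
    {"success_rate": 0.90, "avg_duration": 500, "row_delta": 1_000},
    {"avg_duration_30d": 400, "normal_row_delta": 5_000},
)
```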

When ETL Is the Right Choice

ETL is appropriate when:

  • Data must be cleaned and reshaped before reaching the destination
  • Source systems cannot handle analytical query load
  • Historical data needs to be preserved and archived
  • Regulatory requirements demand specific data handling

For more on modern pipeline orchestration, see Pipeline Orchestration and Incremental Loads.

Quick Recap

  • ETL extracts from sources, transforms data, and loads to destinations. The pattern has been around since the 1970s.
  • Incremental extraction (watermarks, CDC) avoids full table scans on large sources.
  • Full load is simple and correct but expensive. Upsert is complex but efficient for ongoing pipelines.
  • Stage data between stages. Never transform in place on raw extracted data.
  • Watch extraction duration, staged record counts, and destination row count drift.

Conclusion

ETL is a mature pattern that solves real problems. Extract from sources, transform to the right schema and quality, load to the destination. The details matter: incremental versus full extraction, upsert versus truncate-and-reload, staging versus in-place transformation.

Modern data stacks have changed where and how ETL runs, but the fundamental pattern remains relevant. Understanding ETL is prerequisite to understanding any data integration architecture.


Related Posts

Incremental Loads: Processing Only What Changed

Incremental loads reduce pipeline cost and latency. Learn watermark strategies, upsert patterns, and how to handle late-arriving data.

#data-engineering #incremental-load #etl

Backfills: Rebuilding Historical Data at Scale

Backfills reprocess historical data to fix bugs or load new sources. Learn strategies for running backfills safely without breaking production pipelines.

#data-engineering #backfill #data-pipeline

Data Lake Architecture: Raw Data Storage at Scale

Learn how data lakes store raw data at scale for machine learning and analytics, and the patterns that prevent data swamps.

#data-engineering #data-lake #data-storage