Extract-Transform-Load: The Foundation of Data Pipelines

ETL is the core data integration pattern. Learn how extraction, transformation, and loading work, and how modern ETL differs from classical approaches.

Reading time: 9 min


ETL has been the backbone of data integration since the 1970s. Extract data from source systems, transform it into the right shape, and load it into a destination. The pattern persists because it works. What has changed is the technology, the scale, and where transformations happen.

This post covers the three stages of ETL, classical versus modern implementations, and the trade-offs that drive architectural decisions.

Extract: Getting Data Out of Sources

Extraction pulls data from source systems. Sources can be relational databases, SaaS APIs, file systems, event streams, or legacy systems with proprietary formats.

There are several extraction patterns:

Full table extraction: Read the entire source table. Simple but expensive for large tables. Practical only for small datasets or when full refresh is acceptable.

Incremental extraction: Extract only records that changed since the last run. Requires a way to identify changed records: a timestamp column, an incrementing ID, or a change log like a database WAL (see Change Data Capture).

-- Incremental extraction using a watermark column
SELECT * FROM orders
WHERE updated_at > :last_extracted_timestamp
  AND updated_at <= :current_timestamp;

Log-based extraction: Read the database transaction log directly. This captures every change without modifying the source tables or adding query overhead. CDC tools like Debezium use this approach.
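Consumed programmatically, a log-based change event arrives as a structured envelope. The sketch below applies a simplified, Debezium-style event to an in-memory table; the `op`/`before`/`after` field names mirror Debezium's payload, but the shape here is illustrative, not the full format:

```python
def apply_change_event(table: dict, event: dict) -> None:
    """Apply one simplified CDC event to a table keyed by id."""
    op = event["op"]  # 'c' = create, 'u' = update, 'd' = delete
    if op in ("c", "u"):
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":
        table.pop(event["before"]["id"], None)

table = {}
apply_change_event(table, {"op": "c", "before": None,
                           "after": {"id": 1, "email": "a@x.com"}})
apply_change_event(table, {"op": "u", "before": {"id": 1},
                           "after": {"id": 1, "email": "b@x.com"}})
```

Replaying the full event stream in order reconstructs the current state of the source table, which is exactly what CDC-based pipelines rely on.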

API pagination: For SaaS sources, extract data through API endpoints that return paginated results. Handle rate limiting and token expiration.
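The pagination loop can be isolated from any specific API. This sketch assumes a caller-supplied `fetch_page(cursor)` function returning `(records, next_cursor)`, and uses exponential backoff as a stand-in for real rate-limit and token-refresh handling:

```python
import time

def extract_paginated(fetch_page, max_retries=3):
    """Yield records from a paginated source.

    fetch_page(cursor) is caller-supplied (an assumption here) and returns
    (records, next_cursor); next_cursor is None after the last page.
    """
    cursor = None
    while True:
        for attempt in range(max_retries):
            try:
                records, cursor = fetch_page(cursor)
                break
            except RuntimeError:          # stand-in for a 429 / rate-limit response
                time.sleep(2 ** attempt)  # exponential backoff, then retry
        else:
            raise RuntimeError("page fetch failed after retries")
        yield from records
        if cursor is None:
            return

# Usage with a fake two-page source:
pages = {None: ([1, 2], "p2"), "p2": ([3], None)}
rows = list(extract_paginated(lambda c: pages[c]))
```

Keeping the fetcher pluggable makes the loop testable without network access and reusable across SaaS sources.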

Transform: Shaping Data for Its Destination

Transformation is where the work happens. Raw source data rarely arrives in the form your destination expects. Transformations clean, filter, aggregate, join, and reshape data.

Common transformation types

Data type conversion: Strings to dates, numeric strings to integers, raw JSON to structured fields.

Deduplication: Remove duplicate records based on a key. Source systems often produce duplicates, especially in CDC scenarios.

Key substitution: Replace operational keys with surrogate keys. An order references customer_id 12345, but the data warehouse uses a different key scheme. A lookup table maps between them.

Denormalization: Join related tables to create wide denormalized records. The classic star schema fact table joins to dimension tables.

Aggregation: Roll up detailed transactions into summary records. Daily sales become monthly summaries.
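A minimal aggregation sketch, assuming each transaction carries an ISO date string and an amount (both field names are illustrative):

```python
from collections import defaultdict

def monthly_sales(transactions):
    """Roll up per-transaction amounts into per-month totals."""
    totals = defaultdict(float)
    for t in transactions:
        month = t["date"][:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[month] += t["amount"]
    return dict(totals)

summary = monthly_sales([
    {"date": "2024-01-03", "amount": 10.0},
    {"date": "2024-01-19", "amount": 5.0},
    {"date": "2024-02-01", "amount": 7.5},
])
```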

# Example: Simple deduplication transformation in Python
def deduplicate_orders(orders: list[dict]) -> list[dict]:
    seen = set()
    deduped = []
    for order in orders:
        key = (order['order_id'], order['order_line_id'])
        if key not in seen:
            seen.add(key)
            deduped.append(order)
    return deduped
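The key substitution step described above can be sketched as a lookup against a surrogate-key map. The dict-based lookup table and field names are illustrative; unknown keys are routed aside rather than silently loaded:

```python
def substitute_keys(orders, customer_key_map):
    """Replace the operational customer_id with a warehouse surrogate key."""
    resolved, unmatched = [], []
    for order in orders:
        sk = customer_key_map.get(order["customer_id"])
        if sk is None:
            unmatched.append(order)  # no mapping yet: hold for review
        else:
            resolved.append({**order, "customer_sk": sk})
    return resolved, unmatched

orders = [{"order_id": 1, "customer_id": 12345},
          {"order_id": 2, "customer_id": 99999}]
resolved, unmatched = substitute_keys(orders, {12345: 7001})
```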

Staging and validation

Good ETL pipelines persist intermediate data between stages: extract to staging, validate, then transform. If transformation fails, the staged data remains for debugging. The raw extracted data is never modified in place.
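A minimal validation pass over staged records might look like this; the required fields and the range check are illustrative stand-ins for real schema and business rules:

```python
def validate_staged(records, required=("order_id", "amount")):
    """Split staged records into valid rows and quarantined rows."""
    valid, quarantined = [], []
    for r in records:
        missing = [f for f in required if r.get(f) is None]
        if missing:
            quarantined.append({"record": r, "reason": f"missing {missing}"})
        elif r["amount"] < 0:
            quarantined.append({"record": r, "reason": "negative amount"})
        else:
            valid.append(r)
    return valid, quarantined

valid, bad = validate_staged([
    {"order_id": 1, "amount": 9.99},
    {"order_id": 2, "amount": -1.0},
    {"order_id": None, "amount": 3.0},
])
```

Quarantined records carry a reason, so debugging starts from the staged data rather than from a failed transformation.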

Load: Putting Data in Its Destination

Loading writes data to the destination. There are two primary strategies:

Full load: Truncate the destination table and reload all data. Simple, always correct, but expensive for large tables. Used for small dimensions and lookup tables.

Incremental load: Insert or update only changed records. Uses upsert patterns (INSERT ON CONFLICT in PostgreSQL, MERGE in SQL Server/BigQuery/Snowflake).

-- Snowflake merge (upsert) pattern
MERGE INTO dim_customers target
USING staging_customers source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET
        email = source.email,
        loyalty_tier = source.loyalty_tier
WHEN NOT MATCHED THEN
    INSERT (customer_id, email, loyalty_tier)
    VALUES (source.customer_id, source.email, source.loyalty_tier);

Load failure handling

Loads fail. Constraint violations, type mismatches, disk space issues. A robust ETL pipeline handles partial failures gracefully. Failed records go to a dead letter queue for investigation. Successful records commit. The pipeline does not silently skip errors.
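The routing described above can be sketched with SQLite standing in for the destination: each record either commits, or lands in a dead letter list with its error. Row-at-a-time insertion is slow and purely illustrative; real pipelines batch and use the destination's error-capture features:

```python
import sqlite3

def load_with_dlq(conn, rows):
    """Insert rows one by one; route constraint violations to a dead letter list."""
    dead_letters = []
    for row in rows:
        try:
            with conn:  # commit each successful insert
                conn.execute(
                    "INSERT INTO orders (order_id, amount) VALUES (?, ?)", row)
        except sqlite3.IntegrityError as e:
            dead_letters.append({"row": row, "error": str(e)})
    return dead_letters

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
dlq = load_with_dlq(conn, [(1, 9.9), (1, 5.0), (2, 3.0)])  # second row violates the PK
```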

Classical ETL vs Modern ETL

Classical ETL tools (Informatica, DataStage, SSIS) ran on dedicated infrastructure with built-in connectors, transformation engines, and scheduling. They were expensive, GUI-driven, and required specialized skills.

Modern ETL often means something different:

SQL-based transformation: dbt transforms data in the data warehouse using SQL. Extract and load first, then transform in-place using SQL queries. This is the ELT pattern (see ELT Pattern).

Python-based pipelines: Airflow, Prefect, or Dagster orchestrate Python-based pipelines. Extraction and loading are custom Python code. Transformations are Python functions or SQL queries.

Streaming ETL: Apache Kafka with Kafka Streams or Apache Flink processes data in real time rather than in batches. See Apache Kafka.

The ETL Pipeline in Context

ETL does not exist in isolation. An ETL pipeline is part of a larger data architecture:

flowchart LR
    Source1[Source Systems] --> Extract
    Source2[APIs / DBs] --> Extract
    Extract -->|raw| Staging[Staging Area]
    Staging -->|validate| Transform
    Transform -->|clean| Load
    Load -->|structured| Warehouse[Data Warehouse]
    Load -->|raw| Lake[Data Lake]

For more on data warehouse architecture, see Data Warehousing.

Common ETL Pitfalls

Transforming in the wrong order

Business logic that should happen in the source system ends up in the ETL pipeline. A source system with incorrect timezone handling gets corrected in ETL. This works until the source system is replaced and the correction logic disappears. Push transformation logic to the source when the source owns it.

No rollback strategy

A failed load midway through a large batch leaves the destination in a partial state. Use staging tables and atomic commits. Either the full load succeeds or the previous state is preserved.
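One way to get atomic replacement is to load into a staging table and swap it into place inside a single transaction. A sketch using SQLite (table names are illustrative; warehouses offer analogous swap or transactional-replace mechanisms):

```python
import sqlite3

def atomic_swap_load(conn, rows):
    """Load into a staging table, then atomically replace the live table.

    If anything fails before COMMIT, the previous live table is untouched.
    """
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        cur.execute("DROP TABLE IF EXISTS sales_staging")
        cur.execute("CREATE TABLE sales_staging (id INTEGER, amount REAL)")
        cur.executemany("INSERT INTO sales_staging VALUES (?, ?)", rows)
        cur.execute("DROP TABLE IF EXISTS sales")
        cur.execute("ALTER TABLE sales_staging RENAME TO sales")
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit mode; transactions managed explicitly above
atomic_swap_load(conn, [(1, 10.0), (2, 20.0)])
```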

Ignoring Slowly Changing Dimensions

When source dimension data changes, the warehouse must reflect that change appropriately (see Data Warehousing for SCD types). ETL pipelines must handle Type 2 SCD correctly.
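A Type 2 change expires the current version of a row and appends a new one. A minimal in-memory sketch, with illustrative column names (`valid_from`/`valid_to`, current version marked by `valid_to = None`):

```python
from datetime import date

def apply_scd2(history, new_row, today):
    """Apply a Type 2 change to the version history of one business key."""
    current = next((r for r in history if r["valid_to"] is None), None)
    if current is not None:
        if current["loyalty_tier"] == new_row["loyalty_tier"]:
            return history  # tracked attribute unchanged; nothing to do
        current["valid_to"] = today  # expire the old version
    history.append({**new_row, "valid_from": today, "valid_to": None})
    return history

history = [{"customer_id": 1, "loyalty_tier": "silver",
            "valid_from": date(2023, 1, 1), "valid_to": None}]
apply_scd2(history, {"customer_id": 1, "loyalty_tier": "gold"}, date(2024, 6, 1))
```

The old version survives with a closed validity window, so point-in-time queries against the dimension remain correct.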

Missing data quality checks

ETL pipelines that load data without validation pass bad data downstream. Build data quality checks into the transformation stage. Reject or quarantine records that fail validation.

ETL vs ELT vs Streaming ETL Trade-offs

The right approach depends on your workload, team skills, and infrastructure.

| Aspect | Classical ETL | ELT | Streaming ETL |
| --- | --- | --- | --- |
| Latency | Hours to days | Hours to minutes | Seconds to minutes |
| Transformation location | External engine | Data warehouse | Stream processor |
| Compute cost | Dedicated ETL infrastructure | Warehouse compute | Dedicated stream infra |
| Schema evolution | Requires pipeline update | Raw data preserves schema | Schema registry needed |
| Debugging | Moderate | Good (raw data preserved) | Challenging (distributed) |
| Best for | Complex transformations, regulated data | SQL-first teams, replay needs | Real-time use cases |

Classical ETL suits teams with existing ETL investments and complex transformation logic that predates cloud warehouses. ELT suits SQL-first teams who want replay capability and are comfortable with warehouse compute costs. Streaming ETL suits use cases requiring sub-minute latency.

Performance Considerations

ETL performance is dominated by I/O. Extracting from a source and loading to a destination are typically the slowest steps. Transformations in memory or in-database are fast by comparison.

Parallelize extraction by splitting ranges of an incrementing column. Parallelize loads by writing to different partitions or shards. Many ETL frameworks handle this automatically when configured correctly.
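Range-based parallel extraction can be sketched as: split the key space into contiguous chunks, then fetch the chunks concurrently. `fetch_range` is a caller-supplied extractor (an assumption here), e.g. one issuing `SELECT ... WHERE id BETWEEN lo AND hi`:

```python
from concurrent.futures import ThreadPoolExecutor

def id_ranges(min_id, max_id, workers):
    """Split [min_id, max_id] into contiguous chunks, one per worker."""
    step = (max_id - min_id + workers) // workers  # ceiling division
    return [(lo, min(lo + step - 1, max_id))
            for lo in range(min_id, max_id + 1, step)]

def parallel_extract(fetch_range, min_id, max_id, workers=4):
    """Run fetch_range(lo, hi) concurrently over the chunks."""
    ranges = id_ranges(min_id, max_id, workers)
    with ThreadPoolExecutor(max_workers=workers) as ex:
        results = ex.map(lambda r: fetch_range(*r), ranges)
    return [row for chunk in results for row in chunk]

# Usage with a fake extractor that just returns the ids in its range:
rows = parallel_extract(lambda lo, hi: list(range(lo, hi + 1)), 1, 10, workers=4)
```

Threads suit I/O-bound extraction; because `map` preserves chunk order, the combined result is deterministic even though fetches overlap.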

Compression reduces I/O. Extract and load compressed data where the source and destination support it. Parquet and ORC formats (see Data Formats) compress well and are faster to scan than uncompressed CSV.

Capacity Estimation for ETL Pipelines

ETL capacity planning centers on volume, parallelism, and I/O bottlenecks.

Extraction throughput: A sequential read from PostgreSQL on a decent connection scans roughly 50-100 MB/sec. Parallelize across 4 threads and you approach 200-400 MB/sec. The limiting factor is usually database CPU and connection pool saturation, not network bandwidth.

Parallelization math: If your source table has 100 million rows and you want extraction to complete in 10 minutes, you need 100M / 600sec = ~166K rows/sec throughput. At 1 KB per row, that is 166 MB/sec. Eight parallel connections at 21 MB/sec each gets you there.
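The same arithmetic, as a small helper:

```python
def required_throughput(rows, seconds, bytes_per_row):
    """Rows/sec and MB/sec needed to move `rows` within `seconds`."""
    rows_per_sec = rows / seconds
    mb_per_sec = rows_per_sec * bytes_per_row / 1_000_000
    return rows_per_sec, mb_per_sec

# 100M rows in 10 minutes at 1 KB/row:
rps, mbps = required_throughput(100_000_000, 600, 1_000)
connections = round(mbps / 21)  # at ~21 MB/sec per parallel connection
```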

Warehouse load throughput: Snowflake bulk loads achieve roughly 50-100 MB/sec per node, so a LARGE warehouse (8 nodes) can ingest on the order of ~400 MB/sec. BigQuery load jobs handle ~100 MB/sec per slot. Budget warehouse sizing based on your daily delta volume.

I/O bottleneck indicators: If extraction consistently maxes out source CPU or connections, add parallelism but watch for source degradation. If warehouse load is the bottleneck, increase warehouse size or batch size.

Observability Hooks for ETL Pipelines

Track these metrics to catch ETL problems before they cascade:

Extraction metrics: Records read per run, extraction duration, source query performance. A sudden spike in extraction time means the source query plan changed or the delta is larger than expected.

Staging metrics: Staged record count vs expected, staging area disk usage, age of staged files. Staging that grows unbounded means downstream transformation is falling behind.

Transformation metrics: Records transformed vs input, transformation duration, error count per transformation stage. Rising error rates in transformation are the most common leading indicator of data quality problems.

Load metrics: Records written per run, load duration, destination table row count delta. Track the destination row count over time to catch drift.

-- Example: ETL pipeline health check query
SELECT
    pipeline_name,
    MAX(last_run_at) AS last_run,
    SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS success_rate,
    AVG(duration_seconds) AS avg_duration,
    SUM(rows_extracted) AS total_rows_extracted,
    SUM(rows_loaded) AS total_rows_loaded
FROM pipeline_runs
WHERE last_run_at >= NOW() - INTERVAL '7 days'
GROUP BY pipeline_name;

Alert on: success rate below 95%, duration exceeding 2x the 30-day average, row count delta exceeding 3x the normal range.
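Those three alert conditions can be expressed directly; the metric names here are illustrative, mirroring the health check query above:

```python
def etl_alerts(run, baseline):
    """Evaluate the three alert conditions against a pipeline run."""
    alerts = []
    if run["success_rate"] < 0.95:
        alerts.append("success rate below 95%")
    if run["avg_duration"] > 2 * baseline["avg_duration_30d"]:
        alerts.append("duration exceeds 2x 30-day average")
    if abs(run["row_delta"]) > 3 * baseline["normal_row_delta"]:
        alerts.append("row count delta exceeds 3x normal range")
    return alerts

alerts = etl_alerts(
    {"success_rate": 0.90, "avg_duration": 500, "row_delta": 1_000},
    {"avg_duration_30d": 400, "normal_row_delta": 5_000},
)
```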

When ETL Is the Right Choice

ETL is appropriate when:

  • Data must be cleaned and reshaped before reaching the destination
  • Source systems cannot handle analytical query load
  • Historical data needs to be preserved and archived
  • Regulatory requirements demand specific data handling

For more on modern pipeline orchestration, see Pipeline Orchestration and Incremental Loads.

Quick Recap

  • ETL extracts from sources, transforms data, and loads to destinations. The pattern has been around since the 1970s.
  • Incremental extraction (watermarks, CDC) avoids full table scans on large sources.
  • Full load is simple and correct but expensive. Upsert is complex but efficient for ongoing pipelines.
  • Stage data between stages. Never transform in place on raw extracted data.
  • Watch extraction duration, staged record counts, and destination row count drift.

Conclusion

ETL is a mature pattern that solves real problems. Extract from sources, transform to the right schema and quality, load to the destination. The details matter: incremental versus full extraction, upsert versus truncate-and-reload, staging versus in-place transformation.

Modern data stacks have changed where and how ETL runs, but the fundamental pattern remains relevant. Understanding ETL is prerequisite to understanding any data integration architecture.


Related Posts

Incremental Loads: Processing Only What Changed

Incremental loads reduce pipeline cost and latency. Learn watermark strategies, upsert patterns, and how to handle late-arriving data.

#data-engineering #incremental-load #etl

Backfills: Rebuilding Historical Data at Scale

Backfills reprocess historical data to fix bugs or load new sources. Learn strategies for running backfills safely without breaking production pipelines.

#data-engineering #backfill #data-pipeline

Data Lake Architecture: Raw Data Storage at Scale

Learn how data lakes store raw data at scale for machine learning and analytics, and the patterns that prevent data swamps.

#data-engineering #data-lake #data-storage