ELT Pattern: Transforming Data in the Data Warehouse
ELT flips ETL by loading raw data first, then transforming in the warehouse. Learn how modern cloud platforms enable ELT at scale.
ETL has a sequencing problem. When you transform first, you are constrained by the compute of your transformation engine. When you load first, you can leverage the full power of your data warehouse.
ELT inverts the sequence. Extract data from sources and load it into the warehouse in its raw form. Then transform the data using the warehouse’s compute engine. Modern cloud data warehouses like Snowflake, BigQuery, and Redshift have enough raw power that SQL-based transformations are often faster than external ETL engines.
The Difference Between ETL and ELT
ETL transforms data before it reaches the warehouse. You extract from source, transform using an external engine, then load to the warehouse. The warehouse receives clean, shaped data.
ELT loads raw data first. The warehouse stores exactly what the source emitted. Transformations happen inside the warehouse using SQL. The warehouse is both the destination and the transformation engine.
```mermaid
flowchart LR
    subgraph ETL[ETL Pattern]
        E1[Extract] --> T1[Transform]
        T1 --> L1[Load to Warehouse]
    end
    subgraph ELT[ELT Pattern]
        E2[Extract] --> L2[Load Raw to Warehouse]
        L2 --> T2[Transform in Warehouse]
    end
```
Why Load Raw First
A data warehouse is designed for analytical queries. Snowflake, BigQuery, and Redshift can scan billions of rows efficiently using columnar storage and massive parallelism. When you move transformations into the warehouse, you leverage hardware specifically designed for this workload.
There are other advantages:
Replay capability: Raw data is preserved. If your transformation logic changes, you can rerun it against the same raw data. ETL that transforms before loading loses the raw source.
Debugging: When a transformation produces wrong results, you can inspect the raw data to understand what went wrong. With ETL, you often have to re-extract to debug.
Schema evolution: Raw data preserves the source schema even when the warehouse schema changes. If a source adds a new column, raw data captures it without requiring schema changes to the transformation.
Simpler pipelines: Extract and load are the only pipeline steps. No external transformation engine to operate. The pipeline is SQL and scheduling, not a distributed compute framework.
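A raw load can be as small as a single COPY statement. Here is a minimal Snowflake-flavored sketch; the stage, table, and file format names are illustrative assumptions, not part of any specific pipeline:

```sql
-- Land raw data exactly as the source emitted it (names are hypothetical).
CREATE TABLE IF NOT EXISTS raw_customers (
    record     VARIANT,                               -- untouched source payload
    _loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()  -- load audit column
);

-- No transformation at load time; all shaping happens later in SQL.
COPY INTO raw_customers (record)
FROM (SELECT $1 FROM @customer_stage)
FILE_FORMAT = (TYPE = 'JSON');
```

Because the payload is stored as-is, any future change to transformation logic can replay against this table instead of re-extracting from the source.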
The dbt Revolution
dbt (data build tool) transformed how data teams think about ELT. dbt runs SQL transformations against your data warehouse. You define models in SQL, and dbt compiles and runs them in the correct order with dependency management.
```sql
-- dbt model: dim_customers.sql
{{
    config(
        materialized='incremental',
        unique_key='customer_id'
    )
}}

SELECT
    customer_id,
    email,
    first_name,
    last_name,
    CASE
        WHEN total_spend >= 1000 THEN 'premium'
        WHEN total_spend >= 500 THEN 'standard'
        ELSE 'basic'
    END AS customer_segment,
    created_at,
    updated_at
FROM {{ ref('stg_customers') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```
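On an incremental run, dbt compiles a model like this into warehouse-native DML. The following is a simplified sketch of what that compiled SQL might look like on Snowflake; the schema names and column list are assumptions, and the real compiled output varies by adapter and configuration:

```sql
-- Rough sketch of a compiled incremental run (Snowflake-style MERGE).
MERGE INTO analytics.dim_customers AS dest
USING (
    SELECT
        customer_id, email, first_name, last_name,
        CASE
            WHEN total_spend >= 1000 THEN 'premium'
            WHEN total_spend >= 500 THEN 'standard'
            ELSE 'basic'
        END AS customer_segment,
        created_at, updated_at
    FROM analytics.stg_customers
    WHERE updated_at > (SELECT MAX(updated_at) FROM analytics.dim_customers)
) AS src
ON dest.customer_id = src.customer_id   -- the configured unique_key
WHEN MATCHED THEN UPDATE SET
    email = src.email,
    customer_segment = src.customer_segment,
    updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT
    (customer_id, email, first_name, last_name,
     customer_segment, created_at, updated_at)
VALUES
    (src.customer_id, src.email, src.first_name, src.last_name,
     src.customer_segment, src.created_at, src.updated_at);
```

The point is that you write declarative SELECT logic and dbt handles the imperative merge mechanics.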
dbt models are:
- Compiled: dbt generates SQL and executes it against the warehouse
- Version controlled: Models live in git repositories with full history
- Tested: dbt tests assert contract guarantees on model outputs
- Documented: dbt auto-generates data lineage and documentation from model definitions
For more on how dbt fits into the modern stack, see dbt: The SQL-First Transformation Tool for Data Teams.
Staging Layers
A well-structured ELT pipeline has multiple transformation layers:
Raw layer: The exact data received from sources, unmodified. If the source sent “123” as a string, the raw layer stores “123” as a string. This layer is append-only.
Staging layer: Light cleaning and type casting. Convert strings to dates. Cast numeric strings to integers. Rename columns to warehouse conventions. Remove obviously bad records.
Intermediate layer: Business logic transformations. Join to dimension tables. Apply business rules. Aggregate to the grain your analytics needs.
Mart layer: Final models optimized for consumption. Star schema fact and dimension tables ready for BI tools and analyst queries.
```sql
-- Staging: light cleaning
CREATE TABLE stg_customers AS
SELECT
    customer_id::VARCHAR AS customer_id,
    email::VARCHAR AS email,
    first_name::VARCHAR AS first_name,
    NULLIF(last_name, '')::VARCHAR AS last_name,
    created_at::TIMESTAMP AS created_at,
    updated_at::TIMESTAMP AS updated_at
FROM raw_customers
WHERE email IS NOT NULL;
```
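The mart layer then builds on staging, never on raw. An illustrative dimension table might look like the following; `stg_orders` and its columns are assumptions introduced for the example:

```sql
-- Mart: consumption-ready dimension built from staging tables.
-- stg_orders and its columns are hypothetical here.
CREATE TABLE dim_customers AS
SELECT
    c.customer_id,
    c.email,
    c.first_name,
    c.last_name,
    COALESCE(SUM(o.order_total), 0) AS total_spend,  -- feeds segmentation rules
    MIN(o.created_at)               AS first_order_at
FROM stg_customers c
LEFT JOIN stg_orders o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.email, c.first_name, c.last_name;
```

Each layer only reads from the layer below it, so a logic change in the mart never requires touching raw or staging.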
SQL Transformations at Scale
The concern with ELT is that complex SQL transformations become unwieldy. SQL that spans hundreds of lines with multiple joins and window functions is hard to test and harder to debug.
dbt addresses this by:
- Breaking transformations into models (files) that can be tested individually
- Providing a ref() function that explicitly declares dependencies
- Generating a directed acyclic graph (DAG) of model execution
- Running tests on model outputs to catch issues early
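For illustration, here are two small models chained with ref(); dbt infers the execution DAG from these references, so you never hand-maintain ordering. The model and column names are hypothetical:

```sql
-- models/intermediate/int_customer_orders.sql
SELECT
    customer_id,
    COUNT(*)         AS order_count,
    SUM(order_total) AS total_spend
FROM {{ ref('stg_orders') }}
GROUP BY customer_id

-- models/marts/fct_customer_summary.sql
-- Because this model ref()s the one above, dbt always builds them in order.
SELECT
    c.customer_id,
    c.email,
    o.order_count,
    o.total_spend
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('int_customer_orders') }} o
    ON o.customer_id = c.customer_id
```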
For large-scale transformations that exceed what SQL can express, some teams use Spark or Python for the heavy lifting while keeping dbt for the orchestration and testing layer.
Data Quality in ELT
ELT puts data quality responsibility on the transformation layer rather than the extraction layer. This shift requires discipline.
```yaml
# dbt schema tests (schema.yml)
version: 2

models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
      - name: customer_segment
        tests:
          - accepted_values:
              values: ["premium", "standard", "basic"]
```
Tests run after transformations. Bad data that passes through raw and staging layers gets caught at the mart layer. This works as long as tests are comprehensive and run on every pipeline execution.
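Beyond schema tests, dbt also supports singular tests: a SQL file that selects rows violating an assertion, and passes only when the query returns zero rows. A small sketch, with a hypothetical file name and rule:

```sql
-- tests/assert_no_future_updates.sql (hypothetical singular test)
-- Returns violating rows; the test passes when this query is empty.
SELECT
    customer_id,
    updated_at
FROM {{ ref('dim_customers') }}
WHERE updated_at > CURRENT_TIMESTAMP()
```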
For a deeper look at data quality practices, see Data Quality.
Cloud Platform Considerations
Snowflake
Snowflake separates storage and compute. You can scale compute up for heavy transformation runs and down during idle periods. Snowflake’s virtual warehouses are purpose-built for analytical SQL workloads.
```sql
-- Snowflake: scale warehouse for heavy transforms
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'XLARGE';
-- Run transformations
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'SMALL';
```
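Pairing sizing with auto-suspend keeps idle compute from accruing cost between transformation runs. A sketch, with an illustrative 60-second threshold:

```sql
-- Suspend after 60 idle seconds; resume automatically on the next query.
ALTER WAREHOUSE transform_wh SET
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE;
```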
BigQuery
BigQuery separates storage and compute differently. There is no persistent warehouse. Each query spins up temporary slots. For heavy transformation workloads, you reserve slots for consistent performance.
Redshift
Redshift uses persistent clusters. RA3 nodes separate storage (S3) from compute, giving you similar elasticity to Snowflake. Spectrum allows querying S3 data directly without loading it.
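Querying S3-resident raw data through Spectrum looks roughly like this; the schema name, catalog database, and IAM role ARN are placeholders:

```sql
-- External schema over the Glue Data Catalog (role ARN is a placeholder).
CREATE EXTERNAL SCHEMA spectrum_raw
FROM DATA CATALOG
DATABASE 'raw_events_db'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<spectrum-role>';

-- Query raw data in S3 without loading it into the cluster.
SELECT event_type, COUNT(*) AS events
FROM spectrum_raw.raw_events
GROUP BY event_type;
```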
When ELT Is the Right Choice
ELT works well when:
- Your data warehouse has enough compute to handle transformations
- You want to preserve raw data for replay and debugging
- Your transformation logic changes frequently (dbt makes this manageable)
- Your team knows SQL better than Python or Scala
ETL is preferable when:
- Source systems cannot handle the query load of raw extraction
- Data must be cleaned before transport (sensitive data should not sit in raw form in the warehouse)
- Transformation logic is stable and has been battle-tested
- Regulatory requirements dictate specific handling before data reaches the warehouse
Production Failure Scenarios for ELT
ELT pipelines fail in ways that are specific to the warehouse-first approach.
dbt model failures: A dbt model fails mid-run, and the warehouse may be left with a partially built table. Configure on_schema_change explicitly for incremental models (for example, fail or append_new_columns) and validate row counts before marking the model complete.
Warehouse compute exhaustion: A complex dbt run issues too many concurrent queries and hits Snowflake’s concurrency limit, queueing everything behind it. Tune MAX_CONCURRENCY_LEVEL on the warehouse and cap parallelism with the threads setting in your dbt profile.
Raw layer abuse: Teams bypass staging and write directly to the raw layer with transformation logic mixed in. The raw layer becomes a second transformation layer, defeating the replay capability that ELT depends on. Enforce raw layer immutability as a team convention.
Staging layer bypass: Analysts write queries directly against raw data, embedding transformation logic in BI tools. This creates implicit transformation logic outside of dbt, making debugging harder and replay impossible.
```sql
-- Guard against raw layer writes that should go to staging
CREATE TABLE IF NOT EXISTS raw_orders AS
SELECT * FROM source_orders;  -- Raw is append-only

-- Use a separate staging table for cleaned data
CREATE TABLE stg_orders AS
SELECT
    order_id,
    customer_id,
    order_total::NUMERIC(10,2) AS order_total,
    created_at::TIMESTAMP AS created_at
FROM raw_orders
WHERE order_total > 0;  -- Business rule in staging, not raw
```
Trade-off Table: Snowflake vs BigQuery vs Redshift for ELT
| Aspect | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Compute elasticity | Per-second billing, auto-suspend | Slot-based, per-second | RA3 nodes, persistent |
| Max compute size | Up to 4X-Large (128 nodes) per warehouse | Slot reservations from 100 (flex) into the thousands | Clusters of up to 128 nodes |
| dbt support | Native, mature adapter | Native adapter | Native adapter |
| Schema evolution | Native VARIANT type | JSON type and repeated fields | SUPER type |
| Cost model | Credits per second | Per-second slot usage | Node hours |
| Simultaneous queries | Per-warehouse limit | Global slot pool | Per cluster |
| Best for | Enterprise, multi-team | Large-scale ad hoc | Mixed workloads |
Snowflake wins for teams that need per-second elasticity and fine-grained warehouse sizing. BigQuery wins for massive scale with unpredictable query patterns. Redshift wins when you need deep Postgres compatibility.
Anti-Patterns in ELT
Transformations in raw layer: The raw layer should contain unmodified source data. Adding transformations to raw layer queries breaks the replay capability and makes debugging harder.
No staging layer: Jumping directly from raw to mart models means every analytics query hits raw data. Mart models should be built on staging, not raw.
Monolithic dbt models: A single dbt model with 2,000 lines of SQL is hard to test and impossible to debug. Break large transformations into smaller models with explicit dependencies.
Skipping dbt tests: Running dbt without tests means you have no automated validation of model quality. Tests catch regressions before they reach production dashboards.
Observability Hooks for ELT
Track these ELT-specific metrics:
dbt test results: Track the number of dbt tests passing, failing, and skipped per run. A rising test failure rate is the earliest signal of data quality degradation.
Warehouse credit usage: Monitor credit consumption per warehouse per day. Unexpected spikes indicate runaway queries or incorrect warehouse sizing.
Model execution time: Track per-model execution time and compare against the 7-day average. Sudden increases mean a model is reading more data or the warehouse is under pressure.
Staging layer growth: Track row counts in staging tables. Tables that grow without being cleaned up indicate missing cleanup logic.
```sql
-- Example: Snowflake warehouse credit monitoring
-- (running/blocked query averages live in WAREHOUSE_LOAD_HISTORY,
--  not in the metering view, so this query tracks credits only)
SELECT
    warehouse_name,
    start_time::DATE AS usage_date,
    SUM(credits_used) AS credits_used
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name, usage_date
ORDER BY usage_date DESC;
```
Alert on: dbt test failures exceeding 1% of total tests, warehouse credit usage exceeding 2x the 30-day average for the same day of week, model execution time exceeding 3x the 7-day average.
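The 3x-runtime alert can be approximated from Snowflake’s query history, assuming each dbt model run is tagged with its model name via QUERY_TAG; that tagging convention is an assumption for this sketch, not a dbt default:

```sql
-- Flag models whose slowest run in the window exceeds 3x their average.
-- TOTAL_ELAPSED_TIME is reported in milliseconds.
SELECT
    query_tag                      AS model_name,
    AVG(total_elapsed_time) / 1000 AS avg_seconds_7d,
    MAX(total_elapsed_time) / 1000 AS max_seconds_7d
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND query_tag LIKE 'dbt_model:%'   -- hypothetical tagging convention
GROUP BY query_tag
HAVING MAX(total_elapsed_time) > 3 * AVG(total_elapsed_time);
```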
Quick Recap
- ELT loads raw data first, then transforms inside the warehouse. This preserves replay capability.
- dbt is the standard framework for managing SQL transformations in ELT pipelines.
- Raw layer is append-only — never transform there. Staging cleans and casts types, mart models serve analytics.
- Keep dbt models small and tested. Monolithic models are impossible to debug.
- Watch warehouse credit usage and model execution time as key ELT health signals.
Conclusion
ELT has become the dominant pattern in modern data stacks. Extract raw data, load it to the warehouse, transform using SQL. dbt provides the framework for managing transformation complexity at scale.
The raw-first approach gives you replay capability and easier debugging. The warehouse-first approach leverages hardware designed for analytical SQL. Together, they are a significant improvement over classical ETL for most use cases.
For more on data pipeline patterns, see Pipeline Orchestration and Incremental Loads.