ELT Pattern: Transforming Data in the Data Warehouse
ELT flips ETL by loading raw data first, then transforming in the warehouse. Learn how modern cloud platforms enable ELT at scale.
ETL has a sequencing problem. When you transform first, you are constrained by the compute of your transformation engine. When you load first, you can leverage the full power of your data warehouse.
ELT inverts the sequence. Extract data from sources and load it into the warehouse in its raw form. Then transform the data using the warehouse’s compute engine. Modern cloud data warehouses like Snowflake, BigQuery, and Redshift have enough raw power that SQL-based transformations are often faster than external ETL engines.
The Difference Between ETL and ELT
ETL transforms data before it reaches the warehouse. You extract from source, transform using an external engine, then load to the warehouse. The warehouse receives clean, shaped data.
ELT loads raw data first. The warehouse stores exactly what the source emitted. Transformations happen inside the warehouse using SQL. The warehouse is both the destination and the transformation engine.
```mermaid
flowchart LR
    subgraph ETL[ETL Pattern]
        E1[Extract] --> T1[Transform]
        T1 --> L1[Load to Warehouse]
    end
    subgraph ELT[ELT Pattern]
        E2[Extract] --> L2[Load Raw to Warehouse]
        L2 --> T2[Transform in Warehouse]
    end
```
Why Load Raw First
A data warehouse is designed for analytical queries. Snowflake, BigQuery, and Redshift can scan billions of rows efficiently using columnar storage and massive parallelism. When you move transformations into the warehouse, you leverage hardware specifically designed for this workload.
There are other advantages:
Replay capability: Raw data is preserved. If your transformation logic changes, you can rerun it against the same raw data. ETL that transforms before loading loses the raw source.
Debugging: When a transformation produces wrong results, you can inspect the raw data to understand what went wrong. With ETL, you often have to re-extract to debug.
Schema evolution: Raw data preserves the source schema even when the warehouse schema changes. If a source adds a new column, raw data captures it without requiring schema changes to the transformation.
Simpler pipelines: Extract and load are the only pipeline steps. No external transformation engine to operate. The pipeline is SQL and scheduling, not a distributed compute framework.
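A raw load can be as small as a single COPY statement. Here is a minimal Snowflake-flavored sketch; the stage, table, and file format names are illustrative assumptions, not part of any specific pipeline:

```sql
-- Land raw data exactly as the source emitted it (names are hypothetical).
CREATE TABLE IF NOT EXISTS raw_customers (
    record     VARIANT,                               -- untouched source payload
    _loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()  -- load audit column
);

-- No transformation at load time; all shaping happens later in SQL.
COPY INTO raw_customers (record)
FROM (SELECT $1 FROM @customer_stage)
FILE_FORMAT = (TYPE = 'JSON');
```

Because the payload is stored as-is, any future change to transformation logic can replay against this table instead of re-extracting from the source.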
The dbt Revolution
dbt (data build tool) transformed how data teams think about ELT. dbt runs SQL transformations against your data warehouse. You define models in SQL, and dbt compiles and runs them in the correct order with dependency management.
```sql
-- dbt model: dim_customers.sql
{{
    config(
        materialized='incremental',
        unique_key='customer_id'
    )
}}

SELECT
    customer_id,
    email,
    first_name,
    last_name,
    CASE
        WHEN total_spend >= 1000 THEN 'premium'
        WHEN total_spend >= 500 THEN 'standard'
        ELSE 'basic'
    END AS customer_segment,
    created_at,
    updated_at
FROM {{ ref('stg_customers') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```
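On an incremental run, dbt compiles a model like this into warehouse-native DML. The following is a simplified sketch of what that compiled SQL might look like on Snowflake; the schema names and column list are assumptions, and the real compiled output varies by adapter and configuration:

```sql
-- Rough sketch of a compiled incremental run (Snowflake-style MERGE).
MERGE INTO analytics.dim_customers AS dest
USING (
    SELECT
        customer_id, email, first_name, last_name,
        CASE
            WHEN total_spend >= 1000 THEN 'premium'
            WHEN total_spend >= 500 THEN 'standard'
            ELSE 'basic'
        END AS customer_segment,
        created_at, updated_at
    FROM analytics.stg_customers
    WHERE updated_at > (SELECT MAX(updated_at) FROM analytics.dim_customers)
) AS src
ON dest.customer_id = src.customer_id   -- the configured unique_key
WHEN MATCHED THEN UPDATE SET
    email = src.email,
    customer_segment = src.customer_segment,
    updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT
    (customer_id, email, first_name, last_name,
     customer_segment, created_at, updated_at)
VALUES
    (src.customer_id, src.email, src.first_name, src.last_name,
     src.customer_segment, src.created_at, src.updated_at);
```

The point is that you write declarative SELECT logic and dbt handles the imperative merge mechanics.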
dbt models are:
- Compiled: dbt generates SQL and executes it against the warehouse
- Version controlled: Models live in git repositories with full history
- Tested: dbt tests assert contract guarantees on model outputs
- Documented: dbt auto-generates data lineage and documentation from model definitions
For more on how dbt fits into the modern stack, see dbt: The SQL-First Transformation Tool for Data Teams.
Staging Layers
A well-structured ELT pipeline has multiple transformation layers:
Raw layer: The exact data received from sources, unmodified. If the source sent “123” as a string, the raw layer stores “123” as a string. This layer is append-only.
Staging layer: Light cleaning and type casting. Convert strings to dates. Cast numeric strings to integers. Rename columns to warehouse conventions. Remove obviously bad records.
Intermediate layer: Business logic transformations. Join to dimension tables. Apply business rules. Aggregate to the grain your analytics needs.
Mart layer: Final models optimized for consumption. Star schema fact and dimension tables ready for BI tools and analyst queries.
```sql
-- Staging: light cleaning
CREATE TABLE stg_customers AS
SELECT
    customer_id::VARCHAR AS customer_id,
    email::VARCHAR AS email,
    first_name::VARCHAR AS first_name,
    NULLIF(last_name, '')::VARCHAR AS last_name,
    created_at::TIMESTAMP AS created_at,
    updated_at::TIMESTAMP AS updated_at
FROM raw_customers
WHERE email IS NOT NULL;
```
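The mart layer then builds on staging, never on raw. An illustrative dimension table might look like the following; `stg_orders` and its columns are assumptions introduced for the example:

```sql
-- Mart: consumption-ready dimension built from staging tables.
-- stg_orders and its columns are hypothetical here.
CREATE TABLE dim_customers AS
SELECT
    c.customer_id,
    c.email,
    c.first_name,
    c.last_name,
    COALESCE(SUM(o.order_total), 0) AS total_spend,  -- feeds segmentation rules
    MIN(o.created_at)               AS first_order_at
FROM stg_customers c
LEFT JOIN stg_orders o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.email, c.first_name, c.last_name;
```

Each layer only reads from the layer below it, so a logic change in the mart never requires touching raw or staging.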
SQL Transformations at Scale
The concern with ELT is that complex SQL transformations become unwieldy. SQL that spans hundreds of lines with multiple joins and window functions is hard to test and harder to debug.
dbt addresses this by:
- Breaking transformations into models (files) that can be tested individually
- Providing a ref() function that explicitly declares dependencies
- Generating a directed acyclic graph (DAG) of model execution
- Running tests on model outputs to catch issues early
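For illustration, here are two small models chained with ref(); dbt infers the execution DAG from these references, so you never hand-maintain ordering. The model and column names are hypothetical:

```sql
-- models/intermediate/int_customer_orders.sql
SELECT
    customer_id,
    COUNT(*)         AS order_count,
    SUM(order_total) AS total_spend
FROM {{ ref('stg_orders') }}
GROUP BY customer_id

-- models/marts/fct_customer_summary.sql
-- Because this model ref()s the one above, dbt always builds them in order.
SELECT
    c.customer_id,
    c.email,
    o.order_count,
    o.total_spend
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('int_customer_orders') }} o
    ON o.customer_id = c.customer_id
```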
For large-scale transformations that exceed what SQL can express, some teams use Spark or Python for the heavy lifting while keeping dbt for the orchestration and testing layer.
Data Quality in ELT
ELT puts data quality responsibility on the transformation layer rather than the extraction layer. This shift requires discipline.
```yaml
# dbt schema tests (schema.yml)
version: 2

models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
      - name: customer_segment
        tests:
          - accepted_values:
              values: ["premium", "standard", "basic"]
```
Tests run after transformations. Bad data that passes through raw and staging layers gets caught at the mart layer. This works as long as tests are comprehensive and run on every pipeline execution.
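Beyond schema tests, dbt also supports singular tests: a SQL file that selects rows violating an assertion, and passes only when the query returns zero rows. A small sketch, with a hypothetical file name and rule:

```sql
-- tests/assert_no_future_updates.sql (hypothetical singular test)
-- Returns violating rows; the test passes when this query is empty.
SELECT
    customer_id,
    updated_at
FROM {{ ref('dim_customers') }}
WHERE updated_at > CURRENT_TIMESTAMP()
```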
For a deeper look at data quality practices, see Data Quality.
Cloud Platform Considerations
Snowflake
Snowflake separates storage and compute. You can scale compute up for heavy transformation runs and down during idle periods. Snowflake’s virtual warehouses are purpose-built for analytical SQL workloads.
```sql
-- Snowflake: scale warehouse for heavy transforms
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'XLARGE';
-- Run transformations
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'SMALL';
```
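Pairing sizing with auto-suspend keeps idle compute from accruing cost between transformation runs. A sketch, with an illustrative 60-second threshold:

```sql
-- Suspend after 60 idle seconds; resume automatically on the next query.
ALTER WAREHOUSE transform_wh SET
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE;
```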
BigQuery
BigQuery separates storage and compute differently. There is no persistent warehouse. Each query spins up temporary slots. For heavy transformation workloads, you reserve slots for consistent performance.
Redshift
Redshift uses persistent clusters. RA3 nodes separate storage (S3) from compute, giving you similar elasticity to Snowflake. Spectrum allows querying S3 data directly without loading it.
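Querying S3-resident raw data through Spectrum looks roughly like this; the schema name, catalog database, and IAM role ARN are placeholders:

```sql
-- External schema over the Glue Data Catalog (role ARN is a placeholder).
CREATE EXTERNAL SCHEMA spectrum_raw
FROM DATA CATALOG
DATABASE 'raw_events_db'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<spectrum-role>';

-- Query raw data in S3 without loading it into the cluster.
SELECT event_type, COUNT(*) AS events
FROM spectrum_raw.raw_events
GROUP BY event_type;
```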
When ELT Is the Right Choice
ELT works well when:
- Your data warehouse has enough compute to handle transformations
- You want to preserve raw data for replay and debugging
- Your transformation logic changes frequently (dbt makes this manageable)
- Your team knows SQL better than Python or Scala
ETL is preferable when:
- Source systems cannot handle the query load of raw extraction
- Data must be cleaned before transport (sensitive data should not sit in raw form in the warehouse)
- Transformation logic is stable and has been battle-tested
- Regulatory requirements dictate specific handling before data reaches the warehouse
Production Failure Scenarios for ELT
ELT pipelines fail in ways that are specific to the warehouse-first approach.
dbt model failures: A dbt model fails mid-run, and the warehouse may be left with a partially built table. Configure on_schema_change explicitly for incremental models (for example, fail or append_new_columns) and validate row counts before marking the model complete.
Warehouse compute exhaustion: A complex dbt run issues too many concurrent queries and hits Snowflake’s concurrency limit, queueing everything behind it. Tune MAX_CONCURRENCY_LEVEL on the warehouse and cap parallelism with the threads setting in your dbt profile.
Raw layer abuse: Teams bypass staging and write directly to the raw layer with transformation logic mixed in. The raw layer becomes a second transformation layer, defeating the replay capability that ELT depends on. Enforce raw layer immutability as a team convention.
Staging layer bypass: Analysts write queries directly against raw data, embedding transformation logic in BI tools. This creates implicit transformation logic outside of dbt, making debugging harder and replay impossible.
```sql
-- Guard against raw layer writes that should go to staging
CREATE TABLE IF NOT EXISTS raw_orders AS
SELECT * FROM source_orders;  -- Raw is append-only

-- Use a separate staging table for cleaned data
CREATE TABLE stg_orders AS
SELECT
    order_id,
    customer_id,
    order_total::NUMERIC(10,2) AS order_total,
    created_at::TIMESTAMP AS created_at
FROM raw_orders
WHERE order_total > 0;  -- Business rule in staging, not raw
```
Trade-off Table: Snowflake vs BigQuery vs Redshift for ELT
| Aspect | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Compute elasticity | Per-second billing, auto-suspend | Slot-based, per-second | RA3 nodes, persistent |
| Max compute size | Up to 4X-Large (128 nodes) per warehouse | Slot reservations from 100 (flex) into the thousands | Clusters of up to 128 nodes |
| dbt support | Native, mature adapter | Native adapter | Native adapter |
| Schema evolution | Native VARIANT type | JSON type and repeated fields | SUPER type |
| Cost model | Credits per second | Per-second slot usage | Node hours |
| Simultaneous queries | Per-warehouse limit | Global slot pool | Per cluster |
| Best for | Enterprise, multi-team | Large-scale ad hoc | Mixed workloads |
Snowflake wins for teams that need per-second elasticity and fine-grained warehouse sizing. BigQuery wins for massive scale with unpredictable query patterns. Redshift wins when you need deep Postgres compatibility.
Anti-Patterns in ELT
Transformations in raw layer: The raw layer should contain unmodified source data. Adding transformations to raw layer queries breaks the replay capability and makes debugging harder.
No staging layer: Jumping directly from raw to mart models means every analytics query hits raw data. Mart models should be built on staging, not raw.
Monolithic dbt models: A single dbt model with 2,000 lines of SQL is hard to test and impossible to debug. Break large transformations into smaller models with explicit dependencies.
Skipping dbt tests: Running dbt without tests means you have no automated validation of model quality. Tests catch regressions before they reach production dashboards.
Observability Hooks for ELT
Track these ELT-specific metrics:
dbt test results: Track the number of dbt tests passing, failing, and skipped per run. A rising test failure rate is the earliest signal of data quality degradation.
Warehouse credit usage: Monitor credit consumption per warehouse per day. Unexpected spikes indicate runaway queries or incorrect warehouse sizing.
Model execution time: Track per-model execution time and compare against the 7-day average. Sudden increases mean a model is reading more data or the warehouse is under pressure.
Staging layer growth: Track row counts in staging tables. Tables that grow without being cleaned up indicate missing cleanup logic.
```sql
-- Example: Snowflake warehouse credit monitoring
-- (running/blocked query averages live in WAREHOUSE_LOAD_HISTORY,
--  not in the metering view, so this query tracks credits only)
SELECT
    warehouse_name,
    start_time::DATE AS usage_date,
    SUM(credits_used) AS credits_used
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name, usage_date
ORDER BY usage_date DESC;
```
Alert on: dbt test failures exceeding 1% of total tests, warehouse credit usage exceeding 2x the 30-day average for the same day of week, model execution time exceeding 3x the 7-day average.
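The 3x-runtime alert can be approximated from Snowflake’s query history, assuming each dbt model run is tagged with its model name via QUERY_TAG; that tagging convention is an assumption for this sketch, not a dbt default:

```sql
-- Flag models whose slowest run in the window exceeds 3x their average.
-- TOTAL_ELAPSED_TIME is reported in milliseconds.
SELECT
    query_tag                      AS model_name,
    AVG(total_elapsed_time) / 1000 AS avg_seconds_7d,
    MAX(total_elapsed_time) / 1000 AS max_seconds_7d
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND query_tag LIKE 'dbt_model:%'   -- hypothetical tagging convention
GROUP BY query_tag
HAVING MAX(total_elapsed_time) > 3 * AVG(total_elapsed_time);
```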
Quick Recap
- ELT loads raw data first, then transforms inside the warehouse. This preserves replay capability.
- dbt is the standard framework for managing SQL transformations in ELT pipelines.
- Raw layer is append-only — never transform there. Staging cleans and casts types, mart models serve analytics.
- Keep dbt models small and tested. Monolithic models are impossible to debug.
- Watch warehouse credit usage and model execution time as key ELT health signals.
Conclusion
ELT has become the dominant pattern in modern data stacks. Extract raw data, load it to the warehouse, transform using SQL. dbt provides the framework for managing transformation complexity at scale.
The raw-first approach gives you replay capability and easier debugging. The warehouse-first approach leverages hardware designed for analytical SQL. Together, they are a significant improvement over classical ETL for most use cases.
For more on data pipeline patterns, see Pipeline Orchestration and Incremental Loads.