Data Warehouse Architecture: Building the Foundation for Analytics
Learn the core architectural patterns of data warehouses, from ETL pipelines to dimensional modeling, and how they enable business intelligence at scale.
A data warehouse is the centerpiece of any serious analytics operation. While operational databases keep the lights on, a data warehouse answers the questions that keep executives up at night: why did revenue drop last quarter, which customers are most profitable, and where are we leaving money on the table? Getting the architecture right matters because rebuilding it later is expensive and painful.
This post covers the core architectural patterns, the decisions you’ll face, and how the pieces fit together to create a system that actually serves business users rather than frustrating them.
What Makes a Data Warehouse Different
Your operational database runs OLTP (Online Transaction Processing). Every insert, update, and delete happens against a single record or small set of records, and response time matters in milliseconds. A data warehouse runs OLAP (Online Analytical Processing), where queries scan millions of rows, join across dozens of tables, and aggregate across time periods. The access patterns are fundamentally different, which is why you don’t run both workloads on the same system.
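The two query shapes look very different in practice. Here is a minimal sketch using Python’s sqlite3 as a stand-in for both systems; the orders table and its columns are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_total REAL,
        order_date TEXT
    )
""")
rows = [(i, i % 100, 10.0 * (i % 7 + 1), f"2024-{i % 12 + 1:02d}-01")
        for i in range(1, 1001)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# OLTP shape: fetch a single record by primary key -- touches one row
oltp = conn.execute(
    "SELECT order_total FROM orders WHERE order_id = ?", (42,)
).fetchone()

# OLAP shape: aggregate across the whole table -- scans every row
olap = conn.execute(
    "SELECT order_date, SUM(order_total) FROM orders GROUP BY order_date"
).fetchall()

print(oltp[0])    # one value, answered from an index lookup
print(len(olap))  # one row per date bucket, answered by a full scan
```

The OLTP query is served by the primary-key index; the OLAP query has to read every row. Multiply the table by a few billion rows and the case for separate systems makes itself.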
The canonical data warehouse architecture separates into distinct layers: source systems, a staging area, the core warehouse, and presentation databases (often called data marts).
The source layer feeds data from your operational systems—ERP, CRM, e-commerce platform, financial software, whatever runs the business. These systems care about transactions, not analytics. The staging layer acts as a landing zone where raw data sits temporarily before being cleaned and loaded. The core warehouse stores integrated, cleaned data in a structure optimized for analytical queries. Data marts are focused subsets, often oriented around a business function like finance or marketing, that provide faster query performance for specific user groups.
ETL Versus ELT: The Loading Strategy Decision
The way data moves from source to warehouse defines much of your architecture’s personality. ETL (Extract, Transform, Load) has been the traditional approach: extract data from sources, transform it in a dedicated processing layer (often on-premises), then load into the warehouse. This keeps the warehouse relatively simple but requires significant processing power outside it.
ELT (Extract, Load, Transform) flips the order. You load raw data into the warehouse first, then perform transformations using the warehouse’s own compute resources. Cloud data warehouses with massively parallel processing (Snowflake, BigQuery, Redshift) made this pattern practical. Now you can often skip a separate transformation layer entirely and let the warehouse do the heavy lifting.
For most new projects, ELT is the simpler path if you’re using a cloud warehouse. You get faster loading, less infrastructure to manage, and transformations that scale automatically with your warehouse size. ETL still makes sense when you need complex data quality checks before loading, when your source systems can’t handle the read pressure of loading raw data, or when regulatory requirements mandate specific handling of sensitive data before it touches your warehouse.
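The ELT flow can be sketched in miniature: land the raw data untouched, then let the database clean and type it in a second step. This sketch uses sqlite3 as a stand-in for a cloud warehouse, with hypothetical table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: raw data lands as-is, bad rows and all
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, order_total TEXT, order_status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "99.50", "shipped"),
     ("2", "not_a_number", "shipped"),   # bad record stays visible in raw
     ("3", "15.00", "CANCELLED")],
)

# Transform: the warehouse's own compute cleans and types the data
conn.execute("""
    CREATE TABLE clean_orders AS
    SELECT CAST(order_id AS INTEGER)  AS order_id,
           CAST(order_total AS REAL)  AS order_total,
           LOWER(order_status)        AS order_status
    FROM raw_orders
    WHERE order_total GLOB '[0-9]*'   -- keep only numeric totals
""")

clean = conn.execute(
    "SELECT order_id, order_total, order_status FROM clean_orders ORDER BY order_id"
).fetchall()
print(clean)
```

The bad record is filtered during the transform step but still sits in raw_orders for debugging, which is exactly the decoupling the next section argues for.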
The Staging Layer: Why a Raw Zone Matters
Modern architectures often implement a raw or bronze layer that keeps an exact copy of source data before any transformations. When business logic changes, you can reprocess from the same source data rather than asking for extracts again. When transformation bugs surface, you can see what the source actually contained. Auditors can verify what data you received versus what you reported.
The raw layer also decouples your loading process from your transformation process. Load first, transform second, in separate steps that can be monitored and retried independently.
-- Example: Raw table structure capturing source state
CREATE TABLE raw_orders (
load_id BIGINT,
source_system VARCHAR(50),
extracted_at TIMESTAMP,
order_id VARCHAR(50),
customer_id VARCHAR(50),
order_total DECIMAL(12,2),
order_status VARCHAR(20),
raw_json JSON,
CONSTRAINT pk_raw_orders PRIMARY KEY (load_id, order_id)
);
Notice the raw_json column. Storing the original payload alongside parsed fields gives you flexibility when downstream parsing logic changes or when you need to debug exactly what the source sent.
Dimensional Modeling: Fact and Dimension Tables
The presentation layer uses dimensional modeling, organizing data into fact tables (events to measure) and dimension tables (the who, what, when, where, why of each event). This is not the only way to organize analytical data, but it is the most practical for business users who need to write queries without a PhD in database theory.
Fact tables store quantitative metrics—orders, clicks, payments, measurements. Each row represents an event, and foreign keys link to dimension tables. Dimension tables store descriptive attributes—customer name, product category, region, date components. These are the axes along which you slice and dice facts.
A well-designed dimensional model makes query writing intuitive. “Show revenue by region and product category for Q1” becomes a straightforward aggregation joining fact_revenue to dim_region and dim_product. Compare that to the contortions required to extract the same insight from a third-normal-form transactional schema.
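That query really is a straightforward join-and-group. Here is a toy star schema in sqlite3; all table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_region  (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_revenue (
        region_id INTEGER, product_id INTEGER, amount REAL,
        FOREIGN KEY (region_id)  REFERENCES dim_region(region_id),
        FOREIGN KEY (product_id) REFERENCES dim_product(product_id)
    );
    INSERT INTO dim_region  VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO dim_product VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO fact_revenue VALUES (1, 1, 100.0), (1, 2, 50.0),
                                    (2, 1, 75.0),  (1, 1, 25.0);
""")

# "Revenue by region and product category" is a plain aggregation
result = conn.execute("""
    SELECT r.region_name, p.category, SUM(f.amount) AS revenue
    FROM fact_revenue f
    JOIN dim_region  r ON f.region_id  = r.region_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY r.region_name, p.category
    ORDER BY r.region_name, p.category
""").fetchall()
print(result)
```

A business user can read and modify that query. The third-normal-form equivalent would thread through order headers, line items, product hierarchies, and address tables to say the same thing.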
The two dominant modeling patterns are star schema and snowflake schema, covered in detail in their own posts. For now, understand that a star schema keeps each dimension in a single denormalized table joined directly to the fact table, while a snowflake schema normalizes dimensions into multiple related tables. Star is simpler and faster; snowflake is more storage-efficient and enforces consistency better when dimensions are shared.
Slowly Changing Dimensions: Handling Change Over Time
Customer addresses change. Product categories get renamed. Employee titles evolve. A pure snapshot approach would store a complete new row for every change, but that explodes storage and complicates queries. Slowly changing dimensions (SCD) provide strategies for managing this.
Type 1 simply overwrites the old value with the new one. You lose history but keep things simple. Type 2 creates a new row with effective date ranges, preserving complete history at the cost of storage and query complexity. Type 3 adds columns for previous values, letting you see both current and prior state in the same row.
Most production warehouses use a combination of these. Low-cardinality attributes that do not require historical tracking might use Type 1. High-value analytical attributes like customer segmentation tiers use Type 2. Type 3 appears less frequently, usually for specific compliance or business requirements.
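The Type 2 mechanics boil down to two steps: close the current row, then insert the new version. A sketch using sqlite3 with a hypothetical dim_customer table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id TEXT, segment TEXT,
        valid_from TEXT, valid_to TEXT, is_current INTEGER
    )
""")
conn.execute(
    "INSERT INTO dim_customer VALUES ('C1', 'bronze', '2024-01-01', '9999-12-31', 1)"
)

def scd2_update(conn, customer_id, new_segment, change_date):
    """Type 2: close the current row, then insert the new version."""
    conn.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
        (customer_id, new_segment, change_date),
    )

scd2_update(conn, "C1", "gold", "2024-06-15")

history = conn.execute(
    "SELECT segment, valid_from, valid_to, is_current "
    "FROM dim_customer WHERE customer_id = 'C1' ORDER BY valid_from"
).fetchall()
print(history)
```

Facts loaded before June 15 join to the bronze row; facts loaded after join to gold. That date-range join is the query complexity Type 2 buys its history with.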
Architecture Patterns by Scale
The architecture that works for a 50GB warehouse falls apart at 50TB. Cloud data warehouses have largely automated the scaling concerns that plagued on-premises deployments, but you still need to choose the right architectural pattern.
For small deployments (under 1TB), a single schema with star or snowflake modeling works fine. Query performance comes from proper indexing and materialized views, not from architectural complexity.
At medium scale (1-10TB), you start caring about data distribution and query parallelization. Columnar storage becomes important for query speed. Table clustering on frequently-filtered columns reduces the data scanned per query.
At large scale (10TB+), partitioning strategy becomes critical. You partition by date or another attribute that aligns with your query patterns, so most queries can prune the partitions they do not need. Distribution keys determine how data is spread across nodes for parallel processing. At this scale, schema design choices have outsized impact on performance, and you’ll want to carefully model your most important queries before finalizing the warehouse structure.
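Distribution can be pictured as hashing the distribution key to pick a node for each row. A toy simulation — `node_for` is a made-up helper; real warehouses handle placement internally:

```python
from collections import Counter
import zlib

def node_for(key: str, num_nodes: int) -> int:
    """Hash-distribute a row to a node, roughly how MPP warehouses spread data."""
    return zlib.crc32(key.encode()) % num_nodes

# A high-cardinality, evenly used key spreads rows evenly across nodes
keys = [f"customer_{i}" for i in range(10_000)]
placement = Counter(node_for(k, 4) for k in keys)
print(placement)  # roughly 2,500 rows per node
```

When the key is skewed — a handful of values carrying most of the rows — the same hashing concentrates work on a few nodes, which is the failure scenario described later in this post.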
Tools of the Trade
Modern data warehouses rarely exist in isolation. They integrate with orchestration tools (Airflow, dbt Cloud, Prefect), transformation layers (dbt, Spark, Snowflake’s Snowpark), and BI platforms (Tableau, Looker, Metabase). The warehouse itself is the core, but the surrounding ecosystem determines how pleasant the system is to operate.
dbt has become central to modern architectures, treating SQL transformations as code with version control, testing, and documentation. The dbt post covers how it fits into this ecosystem and why it has become a de facto standard for transformation workflows.
When to Use a Data Warehouse
A data warehouse is the right choice when:
- You need to answer questions that require joining data across multiple business domains (finance, CRM, supply chain)
- Query response time matters, but users can tolerate seconds rather than milliseconds
- Your data volumes are large enough that analytical queries would degrade operational systems
- You need a shared source of truth for business metrics across teams
A data warehouse is not the right choice when:
- Your primary workload is simple key-value lookups or single-record CRUD operations
- Data volumes are small enough that a well-indexed relational database handles analytical queries without impact
- You need real-time or sub-second response times for operational dashboards
- Your team lacks the SQL skills to write analytical queries against a dimensional model
Data Warehouse Architecture
The canonical data warehouse architecture flows through distinct layers. Data enters from source systems, lands in a staging area, transforms into the core warehouse, and surfaces through data marts:
flowchart TD
subgraph Sources[Source Systems]
ERP[ERP]
CRM[CRM]
Commerce[E-commerce]
Finance[Finance]
end
subgraph Staging[Staging / Raw Layer]
Raw[Raw Tables]
end
subgraph Warehouse[Core Warehouse]
Fact[Fact Tables]
Dim[Dimension Tables]
end
subgraph Marts[Data Marts]
FinanceMart[Finance Mart]
MarketingMart[Marketing Mart]
OpsMart[Operations Mart]
end
subgraph BI[BI & Analytics]
Dashboards[Dashboards]
Reports[Reports]
AdHoc[Ad-hoc Queries]
end
ERP --> Raw
CRM --> Raw
Commerce --> Raw
Finance --> Raw
Raw --> Fact
Raw --> Dim
Fact --> FinanceMart
Fact --> MarketingMart
Fact --> OpsMart
Dim --> FinanceMart
Dim --> MarketingMart
Dim --> OpsMart
FinanceMart --> Dashboards
MarketingMart --> Reports
OpsMart --> AdHoc
Each layer has a distinct purpose. Sources push raw data. Staging holds unmodified copies for recovery and reprocessing. The warehouse integrates and cleans. Data marts serve specific team needs.
Trade-off Table: ETL vs ELT vs Cloud Native
| Aspect | Traditional ETL | Traditional ELT | Cloud Native Warehouse |
|---|---|---|---|
| Transform location | External processing cluster | Warehouse compute | Warehouse compute |
| Infrastructure | Dedicated transformation servers | None extra | None extra |
| Setup complexity | High | Medium | Low |
| Scale | Fixed capacity | Elastic | Elastic |
| Cost model | Always-on servers | Credit-based | Credit-based |
| Data quality enforcement | Before load | After load | After load |
| Best for | Complex pre-load validation, sensitive data separation | Cloud warehouses, SQL-centric teams | New cloud projects, simple pipelines |
Traditional ETL suits regulated industries where sensitive data must be masked or validated before touching the warehouse. ELT suits teams that want to use warehouse compute for everything. Cloud native approaches are fastest to set up but cost scales with usage — watch credits carefully.
Production Failure Scenarios
Staging layer data corruption
Bad data enters the staging layer from a source system with a schema change the pipeline did not expect. The raw JSON column contains truncated records. Downstream transformation queries return nulls or wrong values, but the pipeline completes without errors because staging was loaded successfully.
Mitigation: validate row counts and null rates at the staging layer before marking a load complete. Reject loads where expected row counts deviate more than 10% from the moving average. Store schema version metadata with each staging load so pipeline logic can detect and handle breaking changes.
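The moving-average check can be sketched as a small guard function; the names and thresholds here are illustrative:

```python
def load_passes_validation(row_count: int,
                           recent_counts: list[float],
                           max_deviation: float = 0.10) -> bool:
    """Reject a staging load whose row count strays too far from the moving average."""
    if not recent_counts:
        return True  # first load: nothing to compare against
    moving_avg = sum(recent_counts) / len(recent_counts)
    return abs(row_count - moving_avg) / moving_avg <= max_deviation

history = [10_000, 10_200, 9_900, 10_100]       # last four loads
print(load_passes_validation(10_050, history))  # within 10% of the average
print(load_passes_validation(6_000, history))   # truncated extract: rejected
```

Running this check before marking the load complete turns a silent data-quality failure into a loud pipeline failure, which is the cheaper kind.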
Dimension table inconsistency after late-arriving records
A dimension update carries an updated_at timestamp earlier than the fact record that references it, but physically arrives after the fact has already loaded. The fact loads into the warehouse; the dimension update lands seconds later. A Type 2 dimension row is inserted with effective dates that overlap the fact record, so queries that filter by dimension attributes return different results depending on join order.
Mitigation: use a staging table for dimension updates with a batch watermark. Only commit dimension changes to the production dimension table after the fact load watermark has advanced past the dimension record’s event time.
ETL pipeline silently produces wrong aggregates
The transformation logic has a subtle bug — the revenue aggregation sums amount_usd instead of amount_eur for European sales records. The pipeline completes without errors. Incorrect totals appear in dashboards. The bug is discovered weeks later during a financial audit.
Mitigation: add data contract assertions that compare aggregate totals against expected ranges. Alert if a pipeline produces totals that deviate more than 5% from the same period last week.
Distribution key skew causing query timeouts
A new fact table was created with customer_id as the distribution key. A few large enterprise customers dominate the row count. Their rows land on the same node. Queries that join on customer_id cause a memory spill on that node and time out.
Mitigation: profile key distribution before deploying new tables. Use a hash of customer_id or a random suffix if the distribution is naturally skewed.
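A quick way to profile skew is to compare the heaviest key’s row count against the average per key. A sketch with made-up customer IDs:

```python
from collections import Counter

def skew_ratio(key_values: list[str]) -> float:
    """Ratio of the heaviest key's row count to the average -- 1.0 is perfectly even."""
    counts = Counter(key_values)
    avg = len(key_values) / len(counts)
    return max(counts.values()) / avg

# Three enterprise customers dominate a long tail of small ones
rows = (["ent_1"] * 4000 + ["ent_2"] * 3000 + ["ent_3"] * 2000
        + [f"smb_{i}" for i in range(1000)])
print(round(skew_ratio(rows), 1))  # → 401.2
```

Any threshold is a judgment call, but a ratio in the hundreds like this one means a handful of nodes will do nearly all the work for joins on that key.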
Observability Checklist
Track these at each pipeline stage:
- Staging layer: rows_loaded vs expected_row_count (alert if more than 10% off); null_rate per column; schema_version_mismatch count.
- Transformation layer: dbt test failures (block on critical failures); rows_processed vs rows_loaded; duration_ms per model; warehouse credit consumption.
- Warehouse layer: query_duration_p95 (alert if above 30 seconds); concurrent_queries approaching the warehouse size limit; warehouse credit usage vs budget.
- Data mart layer: mart_row_counts vs historical baseline; query_usage_by_team to see which marts are actually used; dashboard_refresh_latency.
-- Pipeline SLA monitoring query
SELECT
pipeline_name,
MAX(completed_at) as last_completed,
EXTRACT(EPOCH FROM (NOW() - MAX(completed_at))) / 60 as minutes_since_completion,
CASE
WHEN EXTRACT(EPOCH FROM (NOW() - MAX(completed_at))) / 60 > 60 THEN 'BREACHED'
WHEN EXTRACT(EPOCH FROM (NOW() - MAX(completed_at))) / 60 > 45 THEN 'WARNING'
ELSE 'OK'
END as sla_status
FROM pipeline_runs
GROUP BY pipeline_name;
Alert on: pipelines missing their SLA window by more than 15 minutes, row count deviation exceeding 10%, warehouse credit usage above 80% of daily budget.
Common Anti-Patterns
Building a warehouse before understanding query patterns
Teams design a warehouse schema from the data they have, not the queries they need. The resulting schema is technically correct but serves no one well. Business users cannot write the queries they need, or the queries run for minutes when they should take seconds.
Before designing anything, interview business users. Ask what questions they are trying to answer, not what data they have. Design the schema to serve those questions first, then figure out how to load the data to support it.
Using the warehouse as a staging ground
Loading raw data directly into production fact and dimension tables without a staging layer couples load failures to query availability. A failed load mid-transaction can leave the warehouse partially updated, with no way to roll back cleanly.
Always load through staging. Transformations run against staging. Only swap to production tables after validation passes.
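The validate-then-swap pattern looks roughly like this. sqlite3 stands in for the warehouse, and the rename-based swap approximates warehouse-native commands such as Snowflake’s ALTER TABLE ... SWAP WITH:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 50.0)")

# 1. Build the new version in a staging table, never touching production
conn.execute("CREATE TABLE orders_staging (order_id INTEGER, order_total REAL)")
conn.executemany("INSERT INTO orders_staging VALUES (?, ?)",
                 [(1, 50.0), (2, 75.0)])

# 2. Validate staging before it goes anywhere near users
bad_rows = conn.execute(
    "SELECT COUNT(*) FROM orders_staging WHERE order_total IS NULL"
).fetchone()[0]
assert bad_rows == 0, "validation failed; production table untouched"

# 3. Swap: rename old out, rename staging in
conn.executescript("""
    ALTER TABLE orders RENAME TO orders_old;
    ALTER TABLE orders_staging RENAME TO orders;
    DROP TABLE orders_old;
""")

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # production now serves the validated data
```

If validation fails, the production table never changes and queries keep running against the last good load.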
Ignoring the data mart layer
The warehouse is one large schema serving all use cases. Query performance degrades as the schema grows, and competing workloads from different teams interfere with each other.
The warehouse should serve production marts, not direct BI queries. Create separate marts for each major team. Let each team optimize their mart for their query patterns without affecting others.
Quick Recap
- A data warehouse serves analytical workloads — different from OLTP in access patterns, scale, and query shape.
- ETL transforms before load; ELT loads raw then transforms using warehouse compute. Choose based on your team’s skills and regulatory constraints.
- Always load through a staging layer. Never couple load failures to query availability.
- Design the schema around the questions users need answered, not the data you have.
- Monitor row counts, query latency, and warehouse credit usage at every stage.
Conclusion
Data warehouse architecture isn’t about following a textbook blueprint. It’s about making deliberate choices that match your data volume, query patterns, team capabilities, and business requirements. Start simpler than you think you need, measure actual usage patterns, and evolve from there.
The dimensional modeling principles—facts, dimensions, slowly changing dimensions—provide the conceptual framework. The physical decisions—partitioning, distribution, indexing—determine whether your queries finish in seconds or minutes. Get the architecture right, and your analytics team can focus on insights instead of database gymnastics.
For deeper dives into specific modeling approaches, see star schema and snowflake schema. To understand where raw data lives, read about data lakes. Teams building modern platforms increasingly adopt the lakehouse pattern to unify these approaches.