Data Warehousing
OLAP vs OLTP comparison. Star and snowflake schemas, fact and dimension tables, slowly changing dimensions, and columnar storage in data warehouses.
Your operational database handles transactions: orders get placed, accounts get updated, inventory changes. It is optimized for writes and short, targeted reads. It is not built for analyzing months of order history across multiple dimensions.
A data warehouse is built for analytics. It stores data from across your organization, structured for complex queries and aggregation. Where an operational system handles 100 transactions per second, a data warehouse handles queries scanning millions of rows.
OLTP vs OLAP
OLTP (Online Transaction Processing) systems handle day-to-day operations: inserts and updates dominate. Queries are simple, targeting specific records. The schema is normalized to reduce redundancy and support concurrent writes.
OLAP (Online Analytical Processing) systems handle analysis. Reads dominate—complex queries scanning large datasets. Writes are batch loads from source systems, not individual transactions. Schema is denormalized to support fast aggregations.
| Characteristic | OLTP | OLAP |
|---|---|---|
| Purpose | Day-to-day operations | Historical analysis |
| Writes | High volume, small transactions | Low volume, batch loads |
| Reads | Simple, targeted | Complex, aggregations |
| Data volume | Current data | Historical data |
| Schema | Normalized | Denormalized (star/snowflake) |
The same data often flows from OLTP systems into OLAP systems. ETL pipelines move and transform data nightly or continuously.
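A minimal sketch of that flow in Python, with hypothetical table and column names: extract order rows from the OLTP source, denormalize them with customer attributes, and produce fact rows for a batch load.

```python
# Minimal ETL sketch (table and column names are illustrative): extract rows
# from an OLTP source, denormalize with dimension attributes, load as facts.

def transform(order, customers):
    """Join an OLTP order row with its customer to build a fact row."""
    customer = customers[order["customer_id"]]
    return {
        "customer_id": order["customer_id"],
        "segment": customer["segment"],  # denormalized for fast grouping
        "quantity": order["quantity"],
        "total": order["quantity"] * order["unit_price"],
    }

# Extract: rows as they might come from the operational database.
customers = {1: {"segment": "enterprise"}, 2: {"segment": "consumer"}}
orders = [
    {"customer_id": 1, "quantity": 3, "unit_price": 10.0},
    {"customer_id": 2, "quantity": 1, "unit_price": 25.0},
]

# Transform + load: in production this would batch-insert into the warehouse.
fact_rows = [transform(o, customers) for o in orders]
print(fact_rows[0]["total"])  # 30.0
```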
Star Schema
Star schema organizes data into fact tables and dimension tables. The name matches the shape: a central fact table with dimension tables radiating outward.
Fact tables store measurements or metrics. Examples: sales transactions, page views, account balances. Fact tables contain foreign keys to dimension tables plus numeric measures.
Dimension tables store context around events. Examples: customer information, product details, time periods, geographic locations. Dimension tables contain descriptive attributes.
-- Fact table: sales
CREATE TABLE fact_sales (
sale_id BIGINT PRIMARY KEY,
customer_id INT,
product_id INT,
store_id INT,
time_id INT,
quantity INT,
unit_price DECIMAL(10,2),
total DECIMAL(10,2)
);
-- Dimension table: customers
CREATE TABLE dim_customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
email VARCHAR(255),
city VARCHAR(50),
region VARCHAR(50),
segment VARCHAR(20)
);
A typical sales query joins fact_sales to dim_customers, dim_products, dim_time, and dim_stores to produce reports by customer segment, product category, time period, or region.
flowchart LR
Fact[("fact_sales<br/>Sales Transactions")]
Fact --> DimC[("dim_customers<br/>Customer Attributes")]
Fact --> DimP[("dim_products<br/>Product Details")]
Fact --> DimT[("dim_time<br/>Calendar")]
Fact --> DimS[("dim_stores<br/>Store Attributes")]
Star schema prioritizes query performance. Joins are predictable, and the database can efficiently scan fact tables filtered by dimension keys.
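The join-and-group pattern behind such reports can be sketched in Python, treating the dimension table as a key lookup (all names and data here are illustrative):

```python
# A star-schema join reduces to a key lookup: each fact row carries dimension
# keys, and aggregation groups by a dimension attribute. Hypothetical data.
from collections import defaultdict

dim_customers = {
    1: {"segment": "enterprise", "region": "Northeast"},
    2: {"segment": "consumer", "region": "West"},
}

fact_sales = [
    {"customer_id": 1, "total": 100.0},
    {"customer_id": 1, "total": 250.0},
    {"customer_id": 2, "total": 40.0},
]

# "Revenue by customer segment": join fact to dimension, group by attribute.
revenue_by_segment = defaultdict(float)
for row in fact_sales:
    segment = dim_customers[row["customer_id"]]["segment"]
    revenue_by_segment[segment] += row["total"]

print(dict(revenue_by_segment))  # {'enterprise': 350.0, 'consumer': 40.0}
```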
Snowflake Schema
Snowflake schema normalizes dimension tables into hierarchies. Star schema stores customer city and region on the dimension table. Snowflake splits these into separate tables.
-- Normalized dimension table structure
CREATE TABLE dim_customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
email VARCHAR(255),
city_id INT,
segment_id INT
);
CREATE TABLE dim_cities (
city_id INT PRIMARY KEY,
city_name VARCHAR(50),
region_id INT
);
CREATE TABLE dim_regions (
region_id INT PRIMARY KEY,
region_name VARCHAR(50)
);
Snowflake schema reduces redundancy and uses less storage through normalization. The cost: more joins to answer the same questions, and slower queries.
Most analytical workloads use star schema. Snowflake appears in regulated industries where data redundancy creates compliance concerns, or when dimension tables are very large and normalization saves meaningful storage.
Fact Tables
Fact tables come in different shapes depending on what they measure:
Transaction fact tables record individual events. One row per transaction. Sales transactions, user logins, web page views.
Periodic snapshot fact tables record measurements at regular intervals. One row per entity per time period. Daily inventory levels, monthly account balances.
Accumulating snapshot fact tables record the lifecycle of an event. One row per order, with columns for order date, ship date, delivery date. Useful for pipeline analysis.
Fact tables tend to be large. A retailer’s fact_sales might contain billions of rows. Fact tables are typically partitioned by date to support efficient pruning and archival.
Dimension Tables
Dimension tables provide context. A customer dimension contains attributes that describe customers: name, email, address, demographic information, segment classification.
Dimension tables are typically much smaller than fact tables. A dim_customers with a few million rows supports billions of sales records. Dimension tables are often shared across multiple fact tables—dim_time appears in sales, inventory, and returns facts.
Dimension tables support filtering and grouping. “Show sales by customer segment” or “Filter to the Northeast region” operate on dimension attributes.
Slowly Changing Dimensions (SCD)
Customer addresses change. Product categories get renamed. Employee titles evolve. Data warehouses often need to track these changes over time.
Type 1: overwrite. The old value is lost. Simple, no history.
Type 2: add new row. Track history with effective date ranges. A customer who moves from Region A to Region B has two rows, each with validity dates. Queries can see the state at any point in time.
Type 3: add new column. Store both old and new values. Less common—supports seeing previous value but not full history.
Type 6 (hybrid): combines types 1, 2, and 3 with additional columns to handle multiple scenarios.
Most data warehouses use Type 2 for attributes that change rarely but matter historically—customer address, product category, employee department.
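A Type 2 change can be sketched as closing the current row and inserting a new version with validity dates. The column names (valid_from, valid_to, is_current) follow a common convention but are illustrative:

```python
# Sketch of an SCD Type 2 update: close the current version, append a new one.
# Column names (valid_from/valid_to/is_current) are a common convention,
# not a standard.
from datetime import date

def apply_scd2_change(history, customer_id, new_attrs, change_date):
    """Close the current version and append a new one."""
    for row in history:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = change_date
            row["is_current"] = False
    history.append({
        "customer_id": customer_id,
        **new_attrs,
        "valid_from": change_date,
        "valid_to": None,  # open-ended until the next change
        "is_current": True,
    })

history = [{"customer_id": 1, "region": "Region A",
            "valid_from": date(2023, 1, 1), "valid_to": None,
            "is_current": True}]
apply_scd2_change(history, 1, {"region": "Region B"}, date(2025, 6, 1))
# Two rows now exist: Region A (closed) and Region B (current).
print(len(history), history[-1]["region"])  # 2 Region B
```

Queries that need the state at a point in time filter on the validity dates; queries that need the current state filter on is_current.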
Data Lake vs Data Warehouse
Data warehouses store structured data optimized for analytics. Data lakes store raw data in native format—structured, semi-structured, unstructured—until needed.
| Characteristic | Data Warehouse | Data Lake |
|---|---|---|
| Data format | Structured | Any format |
| Schema | Defined before loading | Defined on read |
| Cost per storage | Higher | Lower (object storage) |
| Performance | Optimized for queries | Varies widely |
| Users | Business analysts | Data scientists, analysts |
| Use cases | Reporting, dashboards | ML training, exploration |
Modern architectures often combine both. Raw data lands in a data lake for data science. Pipelines transform and structure data into a data warehouse for business reporting.
Snowflake, Redshift, and BigQuery work as both. They can query structured data in optimized columnar storage and raw data in external object storage.
Columnar Storage
Traditional row storage keeps all columns for a row together. Querying specific columns requires reading entire rows.
Columnar storage keeps all values for a column together. To compute an average of a column, you read only that column. This reduces I/O for analytical queries that access a subset of columns.
Columnar storage compresses well. Same values stored contiguously enable dictionary encoding, run-length encoding, and compression schemes that achieve significant storage reduction compared to row storage.
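A toy Python sketch of why this works: a sorted, low-cardinality column stored contiguously collapses under run-length encoding.

```python
# Toy illustration of columnar compression: a low-cardinality column stored
# contiguously collapses under run-length encoding.

def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A sorted status column: 1M values, but only three distinct runs.
column = ["pending"] * 400_000 + ["shipped"] * 500_000 + ["returned"] * 100_000
encoded = run_length_encode(column)
print(encoded)  # [['pending', 400000], ['shipped', 500000], ['returned', 100000]]
```

Real columnar engines layer dictionary encoding and general-purpose compression on top of this idea, but the intuition is the same: contiguous, similar values compress far better than interleaved rows.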
Redshift, BigQuery, Snowflake, and ClickHouse use columnar storage. Parquet and ORC are columnar file formats used in data lakes and Spark workloads.
-- BigQuery columnar storage is automatic; query only the columns you need.
-- (This fact_sales is assumed to carry a denormalized order_date column.)
SELECT
segment AS customer_segment,
DATE_TRUNC(order_date, MONTH) AS month,
SUM(total) AS revenue
FROM fact_sales
JOIN dim_customers USING (customer_id)
WHERE order_date >= '2025-01-01'
GROUP BY customer_segment, month;
BigQuery does not read columns you do not specify. If you only need customer_segment, order_date, and total, the other columns are never loaded from storage.
When to Use / When Not to Use Data Warehousing
Use a data warehouse when:
- You need to analyze data across multiple source systems
- Queries scan large historical datasets (months or years of data)
- Business intelligence and reporting are primary workloads
- You need to separate analytical queries from transactional OLTP workloads
- Regulatory requirements demand audit trails and historical data retention
Do not use a data warehouse when:
- Your workload is primarily real-time operational queries
- You need sub-second query response times on current data
- You are dealing with unstructured or semi-structured data that changes shape frequently
- Your team lacks ETL expertise and data modeling skills
- Cost is a primary constraint and simple aggregation tables suffice
Star vs Snowflake vs Galaxy Schema Trade-offs
| Dimension | Star Schema | Snowflake Schema | Galaxy Schema |
|---|---|---|---|
| Normalization | Denormalized | Partially normalized | Multiple fact tables sharing dimension tables |
| Join complexity | Simple — few joins | Complex — many hierarchy joins | Complex — multiple fact joins |
| Query performance | Fastest — optimized for analytics | Slower — more joins | Depends on query shape |
| Storage efficiency | Higher — data duplicated | Lower — normalized | Mixed |
| Data redundancy | High | Low | Medium |
| Dimension table size | Large | Smaller (normalized) | Shared dimensions can be large |
| Maintenance | Easier — simpler model | Harder — referential integrity | Most complex |
| Best for | Simple analytics, BI dashboards | Regulated industries, complex hierarchies | Enterprise with multiple business processes |
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| SCD Type 2 misimplementation | Historical facts attributed to wrong dimension value | Test with known historical scenarios, audit trail on dimension changes |
| Late-arriving facts | New transactions assigned to wrong time period | Use surrogate keys, implement late-arrival handling in ETL |
| Fact table partition key causing hot partitions | Write bottleneck on single partition | Choose high-cardinality partition keys, spread writes across partitions |
| Dimension table referential integrity violations | Query returns phantom or missing rows | Enforce FK constraints in loading, validate before publish |
| ETL job failures causing missing data | Reports show incomplete periods | Implement monitoring with data freshness checks, rebuild procedures |
| Columnar storage query predicates on non-projected columns | Full column scan, slow performance | Always project only needed columns, use columnar-optimized formats |
Capacity Estimation: Fact Table Sizing and Dimension Table Growth
Data warehouse sizing requires estimating fact table growth and dimension table evolution over time.
Fact table sizing formula:
bytes_per_fact_row = sum(width_of_each_column_in_bytes)
total_fact_storage = bytes_per_fact_row × rows_per_year × years_of_history
total_with_indexes = total_fact_storage × (1 + index_overhead_factor) # factor typically 0.3-0.5, i.e. 1.3-1.5× total
For a retail sales fact table with these columns: sale_id (8), date_id (4), customer_id (4), product_id (4), store_id (4), quantity (4), unit_price (4), total (8) = 40 bytes per row. At 10M daily sales with 3 years of history:
- Rows per year: 10M × 365 = 3.65B; over 3 years: ~11B rows
- Raw fact storage: 40 bytes × 11B ≈ 438GB
- With columnar compression (10:1 typical for columnar): ~44GB compressed
- With indexes (2 secondary indexes at 30% overhead): ~57GB total
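The sizing formula can be wrapped in a small Python calculator for the full 3 years of history; the compression ratio and index overhead are assumptions to adjust per workload:

```python
# Fact table sizing calculator. Compression ratio and index overhead are
# assumed defaults; measure both on your own workload.

def fact_table_storage_gb(bytes_per_row, rows_per_day, years,
                          compression_ratio=10, index_overhead=1.3):
    rows = rows_per_day * 365 * years
    raw_gb = bytes_per_row * rows / 1e9
    compressed_gb = raw_gb / compression_ratio
    return raw_gb, compressed_gb * index_overhead

raw, total = fact_table_storage_gb(bytes_per_row=40,
                                   rows_per_day=10_000_000, years=3)
print(round(raw), round(total, 1))  # 438 56.9
```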
Dimension table growth:
Dimension tables grow by adding rows (new customers, products, time periods) and sometimes adding columns (SCD Type 2 tracking).
new_rows_per_year = current_rows × annual_growth_rate
scd_type2_storage = base_rows × avg_versions_per_key × bytes_per_row
For a customer dimension starting at 10M customers, growing 20% per year, with SCD Type 2 averaging 3 versions per customer (some never change, some change often): at year 3: 10M × 1.2^3 ≈ 17.3M base rows, but SCD Type 2 multiplies this by ~3 versions = 52M rows. At 500 bytes per row (name, address history, contact history), that is 26GB — the same dimension table that was 5GB in year 1 is 26GB in year 3 purely from SCD Type 2 versioning.
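The same arithmetic as a small Python sketch, with the growth rate and version multiplier as assumed inputs:

```python
# Dimension growth calculator. Growth rate and versions-per-key are assumed
# inputs; both vary widely by business.

def dimension_storage_gb(base_rows, annual_growth, years,
                         versions_per_key, bytes_per_row):
    base = base_rows * (1 + annual_growth) ** years
    total_rows = base * versions_per_key
    return total_rows, total_rows * bytes_per_row / 1e9

rows, gb = dimension_storage_gb(base_rows=10_000_000, annual_growth=0.20,
                                years=3, versions_per_key=3,
                                bytes_per_row=500)
print(round(rows / 1e6, 1), round(gb, 1))  # 51.8 25.9
```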
Partition planning: Partition fact tables on the time dimension (date_id or month_id). Target 1-10 partitions per query, 100M-1B rows per partition maximum. With daily partitions on 3.65B annual rows: 365 partitions per year. If you query a single month, that is 30 partitions — manageable. If you query a single day, that is 1 partition — fast.
Observability Hooks: ETL Job Monitoring and Data Freshness
Key warehouse metrics: data freshness (time since last ETL load), ETL job duration trends, and fact table storage growth rate.
-- Check data freshness per table
SELECT
table_name,
last_loaded_at,
EXTRACT(EPOCH FROM (now() - last_loaded_at)) / 3600 AS hours_since_load,
rows_loaded,
load_status
FROM etl_job_history
WHERE job_name LIKE '%daily_load%'
ORDER BY last_loaded_at DESC
LIMIT 20;
-- Monitor fact table row count growth over time (PostgreSQL catalog)
SELECT
relname AS table_name,
reltuples / 1e6 AS estimated_millions_of_rows,
pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relnamespace = 'analytics'::regnamespace
AND relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC;
Critical alerts:
# Data freshness breach
- alert: WarehouseDataStale
expr: time() - max(etl_last_success_timestamp) > 86400 # 24 hours
for: 15m
labels:
severity: critical
annotations:
summary: "Warehouse table {{ $labels.table }} has not been loaded in over 24 hours"
# ETL job duration spike
- alert: EtlJobDurationSpike
expr: etl_job_duration_seconds > 1.5 * avg_over_time(etl_job_duration_seconds[7d])
for: 10m
labels:
severity: warning
annotations:
summary: "ETL job {{ $labels.job }} running {{ $value }}s vs typical {{ $labels.typical }}s"
# Dimension table unexpectedly large
- alert: DimensionTableBloat
expr: dimension_table_row_count > 2 * dimension_table_baseline_row_count
for: 1h
labels:
severity: warning
annotations:
summary: "Dimension {{ $labels.dimension }} has grown to {{ $value }} rows vs baseline of {{ $labels.baseline }}"
Real-World Case Study: Snowflake at Databricks Scale
Databricks runs a lakehouse platform where customers process analytical workloads at massive scale. Their own internal data platform uses Snowflake for operational analytics over their customer base.
At their scale, fact table queries routinely scan billions of rows. Their optimization strategy: partition fact tables by date and cluster by the most common filter columns (customer_id, product_id). Snowflake’s micro-partitions compress data automatically, and clustering keys reduce the data scanned per query.
The challenge that bit them: SCD Type 2 on their customer_events dimension was consuming 40% of total warehouse storage. Every customer event created a new dimension row version, and with billions of events, dimension table size exploded. The fix: move to SCD Type 1 for low-importance attributes (never track history), keep SCD Type 2 only for high-importance attributes (customer tier, account status), and use a separate history table for detailed audit requirements rather than baking it into the dimension.
The lesson: SCD Type 2 multiplies dimension storage by the average number of versions per key. For large fact tables with many events per key, the dimension storage tax is significant. Design SCD Type 2 scope deliberately — apply it only where point-in-time historical accuracy is genuinely needed.
Interview Questions
Q: Your fact table has 10 years of history but queries only ever use the last 2 years. How do you reduce storage without losing historical data?
Partition by time period and archive old partitions to cheaper storage (for example, external tables over object storage or a cloud warehouse’s cold storage tier). Hot data remains in the active warehouse tier for fast queries; cold data remains queryable on demand at higher latency and lower cost. Alternatively, use columnar storage with tiering: older partitions get more aggressive compression and cheaper storage hardware. Most cloud data warehouses handle this automatically with tiered storage policies.
Q: Your dimension table is 5× larger than the fact table. What went wrong?
SCD Type 2 over-application is the likely cause. If your dimension has 100,000 base keys but SCD Type 2 tracking has created 5M rows, the dimension is bloated. Each change to any attribute of any customer creates a new row version. The fix: audit which attributes actually need SCD Type 2 tracking versus SCD Type 1 (overwrite in place). Move infrequently-changing attributes to Type 1. For detailed change history that does not need to be in the dimension, use a separate history or audit table.
Q: Late-arriving facts arrive days after a transaction. How do you handle this in a star schema?
Late-arriving facts require a mechanism to update the dimension join without corrupting historical aggregations. The standard approach: use surrogate keys for all dimension joins. When a late fact arrives, look up the surrogate key for the dimension as-of the fact date — not the current dimension value. If the customer had tier “gold” 5 days ago but is “platinum” now, late-arriving fact from 5 days ago should join on the “gold” surrogate. This requires maintaining a time-series of dimension surrogates. For systems without full surrogate key tracking, implement late-arrival handling in the ETL: insert a corrective record into the fact table with the correct surrogate key and a flag indicating correction.
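An as-of surrogate key lookup can be sketched in Python against Type 2 validity windows (table shape and names are illustrative, following the SCD Type 2 convention described earlier):

```python
# As-of surrogate lookup for late-arriving facts: find the dimension version
# whose validity window contains the fact's event date. Data is illustrative.
from datetime import date

dim_customer_versions = [
    {"surrogate_key": 101, "customer_id": 1, "tier": "gold",
     "valid_from": date(2025, 1, 1), "valid_to": date(2025, 6, 1)},
    {"surrogate_key": 102, "customer_id": 1, "tier": "platinum",
     "valid_from": date(2025, 6, 1), "valid_to": None},
]

def surrogate_as_of(versions, customer_id, event_date):
    """Return the surrogate key valid on event_date for this natural key."""
    for v in versions:
        if (v["customer_id"] == customer_id
                and v["valid_from"] <= event_date
                and (v["valid_to"] is None or event_date < v["valid_to"])):
            return v["surrogate_key"]
    raise LookupError("no dimension version covers this date")

# A fact from May joins to the "gold" version, not the current "platinum" one.
print(surrogate_as_of(dim_customer_versions, 1, date(2025, 5, 20)))  # 101
```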
Q: How do you estimate columnar storage compression ratio for a new fact table?
Columnar storage compression depends on cardinality and data patterns. Measure actual compression on a sample: load 1% of your data into the columnar format and measure the compression ratio. General rules of thumb: low-cardinality strings (status codes, category IDs) compress 10:1 or better. High-cardinality strings (UUIDs, free text) compress 2:1 or worse. Numeric types compress 5:1 to 10:1 depending on value distribution. In practice, expect 5:1 to 15:1 for well-designed star schemas. If your compression is less than 3:1, investigate high-cardinality string columns that should be numeric IDs, or reconsider the schema design.
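A rough way to run that sample measurement in Python, using zlib as a stand-in for the warehouse's columnar codec (so the ratios are only indicative, not what a real engine achieves):

```python
# Sample-based compression measurement. zlib stands in for the warehouse's
# columnar codec, so treat the ratios as relative indicators only.
import zlib

def compression_ratio(values):
    """Ratio of raw bytes to compressed bytes for a column sample."""
    raw = "\n".join(values).encode()
    return len(raw) / len(zlib.compress(raw))

# Low-cardinality column (status codes) vs high-cardinality column (IDs).
statuses = ["shipped", "pending", "returned"] * 10_000
ids = [f"id-{i:08d}" for i in range(30_000)]
print(compression_ratio(statuses) > compression_ratio(ids))  # True
```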
Related Posts
- Data Engineering Roadmap - ETL pipelines, stream processing, and data transformation patterns that complement data warehousing architecture
Conclusion
Data warehouses serve analytical workloads against large historical datasets. Star schema organizes data into fact and dimension tables optimized for aggregation. Slowly changing dimensions track attribute history. Columnar storage accelerates queries by reading only needed columns.
OLTP and OLAP systems serve different purposes. Operational systems handle transactions; analytical systems handle queries. Data flows from operational systems into warehouses via ETL pipelines.
Modern tools blur the lines—data lakes and cloud data warehouses handle both raw and structured data. The underlying principles of star schema, columnar storage, and fact/dimension separation remain relevant.
See Also
- Materialized Views — Precomputed aggregations in relational databases
- Object Storage — Storing unstructured data at scale
- Relational Databases — OLTP foundations