Slowly Changing Dimensions: History in Data Warehouses
Master Slowly Changing Dimension techniques including Type 1, Type 2, and Type 3 for maintaining historical accuracy in data warehouses.
Slowly Changing Dimensions: Tracking History in Data Warehouses
Data changes. Customer addresses change. Products get rebranded. Employees transfer departments. A data warehouse that only tracks current state loses valuable historical context.
Slowly Changing Dimensions (SCD) is the technique for handling attribute changes in dimension tables while preserving historical accuracy. The term comes from early data warehousing, where dimensions were considered “slowly” changing compared to facts which update constantly.
Understanding SCD is essential for accurate historical analysis. Get it wrong and your reports tell lies.
The Core Problem
Consider a customer dimension with a region attribute. Customer ABC was in the Northeast region until March 2025, then moved to the Southwest.
If you simply overwrite the region, what happens when you run a report showing sales by region for 2024? The customer appears in the Southwest, even though they were actually in the Northeast at that time.
This is not a minor edge case. Historical accuracy by geography, territory assignments, product categories, and organizational hierarchies is critical for business decisions.
SCD Type 1: Overwrite
Type 1 is the simplest approach. When an attribute changes, overwrite the old value with the new one.
UPDATE dim_customer
SET region = 'Southwest',
last_updated = CURRENT_TIMESTAMP
WHERE customer_id = 'ABC';
After this update, the customer appears in the Southwest for all time periods, including historical reports.
When Type 1 Makes Sense
Type 1 is appropriate when:
- Historical accuracy is not important
- The attribute is purely current-state (like a customer’s email)
- Regulatory requirements do not mandate historical tracking
-- Type 1 dimension table
CREATE TABLE dim_customer_type1 (
customer_id VARCHAR(50) PRIMARY KEY,
customer_name VARCHAR(200),
customer_email VARCHAR(200),
region VARCHAR(100),
last_updated TIMESTAMP
);
The simplicity of Type 1 is attractive. There is no additional storage cost and no complex querying. The trade-off is complete loss of historical context.
SCD Type 2: Add New Row
Type 2 preserves complete history by inserting a new row for each change. The old row remains with its historical effective dates.
CREATE TABLE dim_customer_type2 (
customer_id VARCHAR(50) NOT NULL,
customer_name VARCHAR(200),
region VARCHAR(100),
effective_date DATE NOT NULL,
expiry_date DATE,
is_current BOOLEAN,
version_number INT,
PRIMARY KEY (customer_id, effective_date)
);
When the customer moves regions:
-- Close out the current row
UPDATE dim_customer_type2
SET expiry_date = '2026-03-26',
is_current = FALSE
WHERE customer_id = 'ABC' AND is_current = TRUE;
-- Insert the new row
INSERT INTO dim_customer_type2 (
customer_id, customer_name, region,
effective_date, expiry_date, is_current, version_number
) VALUES (
'ABC', 'Customer ABC', 'Southwest',
'2026-03-27', '9999-12-31', TRUE, 2
);
Now historical queries can find the correct region by looking for the row where the fact date falls between effective_date and expiry_date.
Querying Type 2 Dimensions
SELECT
f.order_date,
f.total_amount,
d.region AS customer_region_at_time
FROM fact_sales f
JOIN dim_customer_type2 d
ON f.customer_id = d.customer_id
AND f.order_date >= d.effective_date
AND f.order_date < d.expiry_date
WHERE f.order_date BETWEEN '2025-01-01' AND '2025-12-31';
This is more complex than a simple join, but it returns accurate historical context.
Natural Keys and Surrogate Keys in Type 2
The customer_id in your dimension is the natural key. It identifies the business entity. The surrogate key (often a hash or sequence) uniquely identifies each version of that entity.
CREATE TABLE dim_customer_type2 (
surrogate_key VARCHAR(50) PRIMARY KEY, -- Hash or sequence
customer_id VARCHAR(50) NOT NULL, -- Natural key
customer_name VARCHAR(200),
region VARCHAR(100),
effective_date DATE NOT NULL,
expiry_date DATE,
is_current BOOLEAN
);
The fact table should reference the surrogate key, not the natural key. This ensures each fact points to exactly one dimension row.
SCD Type 3: Add New Attribute
Type 3 keeps both old and new values in the same row, using separate columns.
ALTER TABLE dim_customer ADD COLUMN previous_region VARCHAR(100);
ALTER TABLE dim_customer ADD COLUMN region_change_date DATE;
When a customer moves:
UPDATE dim_customer
SET previous_region = region,
region = 'Southwest',
region_change_date = CURRENT_DATE
WHERE customer_id = 'ABC';
This allows queries to reference both current and previous values.
Type 3 Query Example
SELECT
customer_id,
region AS current_region,
previous_region AS prior_region,
region_change_date,
CASE
WHEN region_change_date <= '2025-12-31' THEN previous_region
ELSE region
END AS region_at_year_end_2025
FROM dim_customer;
Type 3 only tracks one level of change. If the customer moves again, you lose the first historical value. For tracking multiple changes, Type 2 is necessary.
SCD Type 4: Mini-Dimension
Type 4 separates frequently changing attributes into a separate table called a mini-dimension.
-- Main customer dimension with stable attributes
CREATE TABLE dim_customer (
customer_key VARCHAR(50) PRIMARY KEY,
customer_id VARCHAR(50),
customer_name VARCHAR(200),
date_of_birth DATE,
gender VARCHAR(10)
);
-- Mini-dimension for changing attributes
CREATE TABLE dim_customer_profile (
profile_key VARCHAR(50) PRIMARY KEY,
customer_id VARCHAR(50),
region VARCHAR(100),
customer_tier VARCHAR(20),
credit_score INT,
effective_date DATE
);
Facts join to the appropriate mini-dimension row based on the transaction date. This avoids the performance problem of scanning large dimension tables with many version rows.
SCD Type 6: Hybrid Type 1 and Type 2
Type 6 combines Type 1 and Type 2 behaviors. A row has both current values (overwritten on change) and historical values (new rows inserted).
CREATE TABLE dim_customer_type6 (
surrogate_key VARCHAR(50) PRIMARY KEY,
customer_id VARCHAR(50),
customer_name VARCHAR(200),
region VARCHAR(100), -- Type 1: current value, overwritten
region_history VARCHAR(200), -- Type 2: historical value from when row was inserted
effective_date DATE,
is_current BOOLEAN
);
This hybrid approach is less common but useful when you need both current-state queries to be simple and historical queries to be accurate.
Implementing Type 2: A Practical Load Process
Loading Type 2 dimensions requires careful change detection.
def load_customer_dimension_type2(staging_customers, dim_customers):
"""
Load customer dimension with Type 2 handling.
staging_customers: new data from source system
dim_customers: existing dimension data
"""
for staging_row in staging_customers:
# Find existing current row
existing = find_current_customer(dim_customers, staging_row.customer_id)
if existing is None:
# New customer: insert with effective date
insert_customer({
'surrogate_key': hash(staging_row.customer_id),
'customer_id': staging_row.customer_id,
'customer_name': staging_row.customer_name,
'region': staging_row.region,
'effective_date': staging_row.extract_date,
'expiry_date': '9999-12-31',
'is_current': True
})
elif has_changed(existing, staging_row):
# Attribute change: close out old, insert new
close_out_customer(existing, staging_row.extract_date)
insert_customer({
'surrogate_key': hash(staging_row.customer_id + str(uuid)),
'customer_id': staging_row.customer_id,
'customer_name': staging_row.customer_name,
'region': staging_row.region,
'effective_date': staging_row.extract_date,
'expiry_date': '9999-12-31',
'is_current': True
})
else:
# No change: update slowly changing attributes if needed
update_customer_name(existing, staging_row.customer_name)
def has_changed(existing, new_row):
"""Check if any tracked attributes have changed."""
return (
existing.customer_name != new_row.customer_name or
existing.region != new_row.region
)
Handling Multiple SCD Types in One Dimension
In practice, a single dimension often uses multiple SCD types for different attributes.
| Attribute | SCD Type | Rationale |
|---|---|---|
| Customer Name | Type 1 | Historical accuracy not needed |
| Customer Region | Type 2 | Critical for regional analysis |
| Customer Tier | Type 3 | Need to compare current vs previous |
| Credit Score | Type 4 | Mini-dimension for frequent changes |
CREATE TABLE dim_customer (
-- Primary key
surrogate_key VARCHAR(50) PRIMARY KEY,
customer_id VARCHAR(50) NOT NULL,
-- Type 1 attributes (overwritten)
customer_name VARCHAR(200),
-- Type 2 attributes (versioned)
region VARCHAR(100),
region_effective_date DATE,
region_expiry_date DATE,
-- Type 3 attributes (current + previous)
previous_tier VARCHAR(20),
current_tier VARCHAR(20),
-- Metadata
is_current BOOLEAN,
load_timestamp TIMESTAMP
);
SCD Timeline Visualization
gantt
title Type 1 vs Type 2 vs Type 3 Behavior
dateFormat YYYY-MM-DD
section Type 1
Overwrite: 2025-01-01, 2025-06-01
section Type 2
Row v1 (Northeast): 2025-01-01, 2025-06-01
Row v2 (Southwest): 2025-06-02, 2026-03-28
section Type 3
Row (current=Southwest, prev=Northeast): 2025-06-02, 2026-03-28
Common Pitfalls
Forgetting to Close Expired Rows
The most common mistake is inserting a new row without setting the expiry_date on the old row.
-- WRONG: Old row has no expiry, creates ambiguous history
INSERT INTO dim_customer (...) VALUES (...);
-- RIGHT: Close out old row first
UPDATE dim_customer
SET expiry_date = '2026-03-26'
WHERE customer_id = 'ABC' AND is_current = TRUE;
INSERT INTO dim_customer (...) VALUES (...);
Using Natural Keys in Fact Tables
Fact tables should reference surrogate keys, not natural keys. If you reference customer_id directly, a query without the date range filter returns incorrect results.
Missing Change Detection
Comparing all attribute values on every load is expensive but necessary. Hash comparisons help.
-- Compute hash of tracking attributes
SELECT
customer_id,
MD5(CONCAT_WS('|', customer_name, region)) AS hash_diff
FROM staging_customers;
Compare hash_diff to detect changes without comparing each attribute individually.
Query Performance with Type 2
Type 2 dimensions grow over time. A customer dimension might have hundreds of rows per customer after years of updates.
Query performance suffers if you scan all versions for every query.
Mitigation strategies:
Index strategically. Index on (customer_id, effective_date, expiry_date).
Use a view for current values. Separate the current-state query path from historical queries.
CREATE VIEW dim_customer_current AS
SELECT * FROM dim_customer WHERE is_current = TRUE;
Consider partitioning. Partition the dimension table by customer_id range for faster access.
Use PIT tables. A Point-In-Time table pre-computes the correct surrogate key for each date.
Slowly Changing Dimensions Production Failure Scenarios
Dimension explosion from Type 2 on high-cardinality change attributes
A dimension like dim_order applies Type 2 to order_status. An order transitions through 20+ statuses during fulfillment. Every status change inserts a new dimension row. After a year, a single order has 20+ rows in the dimension table. Queries joining to this dimension return 20x the expected row count, producing incorrect aggregates.
Mitigation: Apply Type 2 only to attributes that genuinely need historical tracking for analytics. Status fields that change frequently belong in the fact table or use Type 1. Set a maximum version count per natural key and alert when exceeded.
Late-arriving facts referencing expired dimension rows
A fact record arrives 5 days late with a transaction date of March 20. The dimension row for the customer on March 20 was expired on March 22 when the Type 2 row closed. The query that joins on fact_date BETWEEN effective_date AND expiry_date misses the match. The late fact contributes zero to aggregations.
Mitigation: Build late-arrival handling into fact loads: for late facts, use the dimension row that was active on the fact’s transaction date, not the currently-active row. Add a valid_from and valid_to date range check that correctly matches historical dimension state to late facts.
Stale PIT table causing wrong surrogate key joins
A Point-In-Time table is supposed to map (customer_id, date) to the correct surrogate_key. The PIT table refresh runs daily. On March 28, it reflects the dimension state as of March 27. A query for March 28 returns the current surrogate key, but the March 28 dimension row has a different surrogate key because a change occurred on March 28 morning. Aggregations by surrogate key mix old and new customer attributes.
Mitigation: Refresh PIT tables more frequently for volatile dimensions, or use a “latest only” PIT strategy that returns the current surrogate key for queries without a date filter. Document PIT staleness assumptions clearly.
Concurrent load corrupting Type 2 expiry dates
Two concurrent ETL jobs load customer dimension updates. Both see the same current row and both try to close it with different expiry timestamps. One job sets expiry to March 27 02:00:00, the other to March 27 02:00:01. Both insert new rows. The result is two current rows for the same customer. Downstream joins produce Cartesian products.
Mitigation: Serialize Type 2 updates for the same natural key using a queue or row-level lock. In Snowflake/Redshift, use MERGE with a condition on is_current = TRUE and expiry_date = '9999-12-31' as part of the same transaction. Add a unique index on (customer_id, is_current) where is_current = TRUE to prevent duplicate current rows.
Slowly Changing Dimensions Quick Recap
- Type 1 overwrites: current value only, no history. Use for attributes where history does not matter.
- Type 2 adds rows: complete history preserved. Use for geography, tier, segment — anything that affects historical analysis.
- Type 3 stores current and previous as columns: good for comparing before/after on the most recent change only.
- Type 4 mini-dimension: extract frequently-changing attributes into a separate dimension to avoid scanning dozens of version rows.
- Type 6 hybrid: Type 1 + Type 2 in the same row for both current simplicity and historical accuracy.
- Change detection via hash_diff is faster than comparing every attribute individually.
- Dimension version explosion is a real problem: index strategically, use PIT tables, consider partitioning by customer_id range.
For more on dimensional modeling, see Kimball Dimensional Modeling for the foundational star schema approach, or Data Vault for enterprise-scale change tracking built into the architecture.
Slowly Changing Dimensions are a critical tool for accurate historical analysis. Choosing the right SCD type depends on your analytical requirements:
- Type 1 for attributes where current value is all that matters
- Type 2 for complete historical audit trails
- Type 3 when you need to compare before and after states
- Type 4 mini-dimensions for frequently changing attributes
Most enterprise dimensions use a combination of types. A customer dimension might use Type 2 for geography and customer tier, Type 1 for contact information, and Type 3 for comparison purposes.
The implementation complexity is real. Change detection, version management, and historical querying all require careful handling. But the alternative is reports that tell lies about your business.
For more on dimensional modeling, see Kimball Dimensional Modeling for the foundational star schema approach, or Data Vault for enterprise-scale change tracking built into the architecture.
Category
Related Posts
Data Vault: Scalable Enterprise Data Modeling
Learn Data Vault modeling methodology for building auditable, scalable enterprise data warehouses with hash keys and satellite tables.
Kimball Dimensional Modeling: Star Schema Techniques
Learn Kimball dimensional modeling techniques for building efficient star schema data warehouses with fact and dimension tables.
One Big Table: The Denormalized Approach to Data Modeling
Learn how One Big Table architecture simplifies data pipelines by combining all attributes into single wide denormalized tables.