Data Vault: Scalable Enterprise Data Modeling
Kimball star schemas work well for moderate data volumes with relatively stable business requirements. But when you are dealing with enterprise-scale data, multiple source systems, and constantly evolving business needs, a different approach helps.
Data Vault was created by Dan Linstedt in the 1990s to handle these challenges specifically. It prioritizes auditability, scalability, and adaptability over query simplicity. The trade-off is more complex schemas and more complex queries.
The Core Problem Data Vault Solves
In traditional dimensional modeling, you design your warehouse schema around what the business needs today. When source systems change, new columns get added, or business definitions shift, you end up retrofitting your carefully designed star schema.
Data Vault embraces change as a first-class concern. The methodology explicitly separates concerns into three distinct table types: Hubs, Links, and Satellites. Each type has a specific responsibility and predictable behavior when requirements change.
The result is a warehouse that can absorb changes from source systems without requiring rewrites of existing structures.
Hubs: The Business Keys
A hub table represents a core business concept. A customer, a product, an order. The hub contains only the business key and metadata about when it was loaded.
CREATE TABLE hub_customer (
    hub_customer_key VARCHAR(50) PRIMARY KEY, -- Hash of business key
    customer_id VARCHAR(50) NOT NULL,         -- Natural business key
    load_date TIMESTAMP NOT NULL,
    record_source VARCHAR(100) NOT NULL
);

CREATE TABLE hub_product (
    hub_product_key VARCHAR(50) PRIMARY KEY,
    product_id VARCHAR(50) NOT NULL,
    load_date TIMESTAMP NOT NULL,
    record_source VARCHAR(100) NOT NULL
);
The hub_customer_key is a hash of the natural business key. This decouples the warehouse from source system key formats and enables loading from multiple sources into the same hub.
Note that hubs contain no attributes. Attribute storage is delegated to satellites.
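In Python, hub key generation is a short function over a normalized business key. This is a minimal sketch; the normalization rules (trim, upper-case) are a team convention, not part of the methodology, and some teams additionally scope the hash by source system when keys are only unique per system:

```python
import hashlib

def hub_hash_key(business_key: str) -> str:
    """Compute a hub hash key from a natural business key.

    The key is normalized (trimmed, upper-cased) before hashing so that
    'cust-001 ' and 'CUST-001' resolve to the same hub record. The exact
    normalization rules are a team convention, not part of the standard.
    """
    normalized = business_key.strip().upper()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# The same business key always produces the same 64-character key,
# regardless of which source system it arrived from.
print(hub_hash_key("cust-001 ") == hub_hash_key("CUST-001"))  # True
```

Because the function is pure and deterministic, any loader in any pipeline can compute the same key without coordination.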
flowchart TB
    subgraph Hubs["Hubs (Business Keys)"]
        H1[hub_customer<br/>hub_customer_key, customer_id, load_date]
        H2[hub_product<br/>hub_product_key, product_id, load_date]
        H3[hub_order<br/>hub_order_key, order_id, load_date]
    end
    subgraph Links["Links (Relationships)"]
        L1[link_customer_order<br/>link_customer_order_key, hub_customer_key, hub_order_key]
        L2[link_product_category<br/>hub_product_key, hub_category_key]
    end
    subgraph Satellites["Satellites (Descriptive Data)"]
        S1[sat_customer_details<br/>hub_customer_key, name, email, region, load_date]
        S2[sat_product_info<br/>hub_product_key, brand, category, load_date]
        S3[sat_order_status<br/>hub_order_key, status, total_amount, load_date]
    end
    H1 --> L1
    H3 --> L1
    H2 --> L2
    H1 --> S1
    H2 --> S2
    H3 --> S3
Links: The Relationships
A link table represents relationships between hubs. If a customer places an order, there is a link. If a product belongs to a category, there is a link.
CREATE TABLE link_customer_order (
    link_customer_order_key VARCHAR(50) PRIMARY KEY,
    hub_customer_key VARCHAR(50) NOT NULL REFERENCES hub_customer(hub_customer_key),
    hub_order_key VARCHAR(50) NOT NULL REFERENCES hub_order(hub_order_key),
    load_date TIMESTAMP NOT NULL,
    record_source VARCHAR(100) NOT NULL
);
Links can connect more than two hubs for complex relationships. Many-to-many relationships are handled naturally, since every link is itself an associative table.
-- Many-to-many: customer purchases product at store
CREATE TABLE link_transaction (
    link_transaction_key VARCHAR(50) PRIMARY KEY,
    hub_customer_key VARCHAR(50) NOT NULL,
    hub_product_key VARCHAR(50) NOT NULL,
    hub_store_key VARCHAR(50) NOT NULL,
    load_date TIMESTAMP NOT NULL,
    record_source VARCHAR(100) NOT NULL
);
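A link key can be derived the same way as a hub key, by hashing the participating business keys in a fixed, agreed order. A sketch (the delimiter and the ordering convention are assumptions a team must standardize):

```python
import hashlib

def link_hash_key(*business_keys: str) -> str:
    """Hash the participating business keys into a single link key.

    Keys are joined with a delimiter so ('ab', 'c') and ('a', 'bc')
    produce different hashes. The key order must be fixed by convention
    (e.g. always customer, then product, then store), because a
    different order yields a different hash.
    """
    payload = "|".join(k.strip().upper() for k in business_keys)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Three-way transaction link: customer x product x store
key = link_hash_key("CUST-001", "PROD-42", "STORE-9")
```

The same ordering rule must be applied by every loader that writes to the link, or the same relationship will be recorded twice under two keys.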
Satellites: The Descriptive Data
Satellites hold all the descriptive attributes. They link to their parent hub or link and carry the payload of descriptive data.
CREATE TABLE sat_customer_details (
    hub_customer_key VARCHAR(50) NOT NULL REFERENCES hub_customer(hub_customer_key),
    load_date TIMESTAMP NOT NULL,
    hash_diff VARCHAR(64) NOT NULL, -- SHA-256 of attribute values for change detection
    customer_name VARCHAR(200),
    customer_email VARCHAR(200),
    customer_phone VARCHAR(50),
    customer_region VARCHAR(100),
    is_current BOOLEAN DEFAULT TRUE,
    load_end_date TIMESTAMP,
    record_source VARCHAR(100),
    PRIMARY KEY (hub_customer_key, load_date)
);
Each satellite row has a load_date and load_end_date, enabling point-in-time queries. When a customer attribute changes, a new satellite row is inserted with the new values and the previous row’s load_end_date is set.
This gives you complete change history without any special SCD handling. Type 2 behavior is baked into the satellite design.
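Change detection hinges on hash_diff: hash the descriptive payload in a fixed column order and compare against the latest satellite row. A sketch of one way to compute it (the NULL sentinel and trimming rules are assumptions to standardize per team):

```python
import hashlib

def hash_diff(attributes: dict) -> str:
    """Hash the satellite payload for change detection.

    Columns are serialized in sorted order so the hash is stable no
    matter how the source row was assembled. NULLs get an explicit
    sentinel so a missing value and an empty string do not collide.
    """
    parts = []
    for column in sorted(attributes):
        value = attributes[column]
        parts.append("<NULL>" if value is None else str(value).strip())
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

before = {"name": "Ada", "email": "ada@example.com", "region": "EU"}
after = {"name": "Ada", "email": "ada@new.example.com", "region": "EU"}
print(hash_diff(before) != hash_diff(after))  # True: insert a new row
```

If the two hashes match, the load skips the row entirely; only genuine changes produce new satellite versions.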
Hash Keys: The Link to the Business World
Data Vault uses hash keys instead of surrogate keys. The hub_customer_key is a hash of the customer_id from the source system.
-- Generate the hash key during loading (the function name varies by
-- engine: SHA256, SHA2, HASHBYTES, ...)
SELECT SHA256(customer_id) AS hub_customer_key
FROM source_customer;
Hash keys provide several benefits:
Source independence. The same customer from CRM and ERP systems generates the same hash key if you use a consistent business key. This enables matching across systems.
No sequence management. You do not need a central key generation service. The hash is computed from the data itself.
Auditability. Given a business key and source, you can recompute the hash at any time and verify data integrity.
The trade-off is longer keys (SHA-256 produces 64 hex characters) and the theoretical possibility of collisions. For practical data volumes, collision risk is negligible.
Point-in-Time Tables and Bridge Tables
Data Vault enables complex historical analysis through point-in-time tables and bridge tables.
Point-in-Time Tables
A point-in-time (PIT) table is a helper table that makes it efficient to find the correct satellite row for any point in time.
CREATE TABLE pit_customer (
    hub_customer_key VARCHAR(50) NOT NULL,
    snapshot_date TIMESTAMP NOT NULL,
    sat_customer_details_load_date TIMESTAMP,
    sat_customer_address_load_date TIMESTAMP,
    sat_customer_contact_load_date TIMESTAMP,
    PRIMARY KEY (hub_customer_key, snapshot_date)
);
Instead of scanning satellite tables and comparing dates for every query, you join through the PIT table.
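Conceptually, the PIT table pre-computes the answer to an "as of" lookup like the following Python sketch; without it, every query repeats this scan-and-compare over the satellite for every hub key:

```python
from datetime import datetime

def as_of(satellite_rows, when):
    """Return the satellite row that was current at `when`.

    Each row is a (load_date, payload) pair; the current row is the one
    with the latest load_date that is not after `when`. A PIT table
    stores exactly this answer per hub key per snapshot date.
    """
    candidates = [r for r in satellite_rows if r[0] <= when]
    return max(candidates, key=lambda r: r[0]) if candidates else None

rows = [
    (datetime(2024, 1, 1), {"region": "EU"}),
    (datetime(2024, 6, 1), {"region": "US"}),
]
print(as_of(rows, datetime(2024, 3, 15))[1])  # {'region': 'EU'}
```

The PIT table turns this per-query search into a simple equi-join on (hub_customer_key, snapshot_date).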
Bridge Tables for Hierarchies
Bridge tables handle parent-child hierarchies without the complexity of recursive CTEs.
CREATE TABLE bridge_customer_hierarchy (
    hub_customer_key VARCHAR(50) NOT NULL,
    parent_hub_customer_key VARCHAR(50) NOT NULL,
    hierarchy_level INT NOT NULL, -- 1 = direct parent, 2 = grandparent, ...
    load_date TIMESTAMP NOT NULL
);
This lets you aggregate at any level of the hierarchy efficiently.
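The bridge rows can be generated from a plain parent-child mapping by walking each key's ancestor chain, which is the pre-computation that replaces a recursive CTE at query time. A sketch:

```python
def build_bridge(parent_of):
    """Flatten a parent-child map into (descendant, ancestor, level) rows.

    parent_of maps each key to its direct parent, or None for roots.
    Level 1 is the direct parent, level 2 the grandparent, and so on.
    """
    rows = []
    for key in parent_of:
        ancestor, level = parent_of[key], 1
        while ancestor is not None:
            rows.append((key, ancestor, level))
            ancestor, level = parent_of.get(ancestor), level + 1
        # roots emit no rows: they have no ancestors
    return rows

hierarchy = {"global": None, "emea": "global", "uk": "emea"}
print(build_bridge(hierarchy))
# [('emea', 'global', 1), ('uk', 'emea', 1), ('uk', 'global', 2)]
```

Rolling up to any ancestor then becomes a plain join and GROUP BY on the bridge, with no recursion in the query.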
Loading Data into Data Vault
The Data Vault load process follows a predictable pattern.
def load_customer_satellite(source_record, hub_customer_key):
    # Compute hash of current attributes
    current_hash = compute_hash(source_record)

    # Check if this is new data or a change
    previous_row = get_latest_satellite(hub_customer_key)
    if previous_row is None or previous_row.hash_diff != current_hash:
        # Close out previous row
        if previous_row:
            update_satellite_end_date(previous_row, source_record.load_date)
        # Insert new row
        insert_satellite({
            'hub_customer_key': hub_customer_key,
            'load_date': source_record.load_date,
            'hash_diff': current_hash,
            'customer_name': source_record.name,
            'customer_email': source_record.email,
            'is_current': True,
            'record_source': source_record.system
        })
This is essentially Type 2 SCD behavior, but it emerges naturally from the loading pattern rather than requiring special handling.
Comparison with Kimball
| Aspect | Kimball Star Schema | Data Vault |
|---|---|---|
| Schema complexity | Simple star | Complex hub/link/satellite |
| Query complexity | Simple joins | Multi-stage joins with PIT tables |
| Change handling | Requires SCD strategies | Built-in historical tracking |
| Scalability | Good to ~100TB | Designed for enterprise scale |
| Auditability | Limited | Full audit trail |
| Source flexibility | Single source per dimension | Multiple sources per hub |
| Agility | Schema changes are costly | New satellites add without modifying existing |
Kimball wins on query simplicity and developer productivity. Data Vault wins on scalability, auditability, and adaptability.
For organizations with complex source landscapes and strong audit requirements, Data Vault is worth the added schema complexity. For smaller organizations with simpler needs, the Kimball approach remains practical.
For other approaches to data modeling, see Kimball Dimensional Modeling for the traditional star schema approach, or One Big Table for a denormalized alternative.
Integration Point Tables
One powerful Data Vault pattern is the integration point table, known in Data Vault 2.0 terminology as a non-historized (transactional) link. This is a link that represents a business event at a specific moment, capturing the full context of that event.
-- Instead of separate links for different event aspects:
-- Order placed link
-- Payment received link
-- Shipment confirmed link
-- Use a single integration point:
CREATE TABLE link_sales_transaction (
    link_sales_transaction_key VARCHAR(50) PRIMARY KEY,
    hub_customer_key VARCHAR(50),
    hub_product_key VARCHAR(50),
    hub_store_key VARCHAR(50),
    hub_date_key VARCHAR(50),
    hub_time_key VARCHAR(50),
    transaction_amount DECIMAL(12,2),
    load_date TIMESTAMP NOT NULL,
    record_source VARCHAR(100) NOT NULL
);
This captures the complete transaction context in one place, enabling comprehensive analysis without joins across multiple link tables.
Raw Vault and Business Vault
The Data Vault methodology distinguishes between the raw vault and the business vault.
Raw Vault contains exactly what came from the source systems, with minimal transformation. Hubs, links, and satellites match source structures. This layer preserves auditability and enables reprocessing if definitions change.
Business Vault applies business rules and derived calculations on top of the raw vault. Computed metrics, conformed reference data, and pre-built PIT and bridge structures live here.
Separating these layers gives you a clean lineage from source data to business metrics. You can always trace a number back to its origin.
When to Use Data Vault
Data Vault is particularly well-suited for:
Large-scale enterprise data warehouses where multiple source systems feed into a central repository. The hub/link/satellite structure handles source heterogeneity gracefully.
Regulatory environments requiring complete audit trails. Every record has load dates, record sources, and change history built in.
Organizations with evolving requirements where source systems change frequently. Adding a new satellite does not affect existing structures.
Merger and acquisition scenarios where you need to integrate data from acquired companies with different key structures. Hash keys enable matching across systems.
The added complexity is not free. Plan for longer development cycles and more complex query logic. Data Vault pays off over years of operation, not weeks.
Data Vault Production Failure Scenarios
Business key collision producing duplicate hub records
Two different customers in different source systems share the same natural key value (for example, both systems assign sequential customer IDs starting at 1). Because the hub key is a hash of the business key, both records resolve to the same hub_customer_key and load into the same hub. Downstream link tables then reference a single customer when they should reference two distinct entities. (A genuine SHA-256 collision between distinct inputs is cryptographically infeasible; overlapping business keys across systems are the realistic failure mode.)
Mitigation: When business keys are only unique per system, scope the hash with the record source (SHA256(CONCAT(customer_id, '|', record_source))) so identical business keys from different systems get distinct hashes. Note that this trades away automatic cross-system matching, so apply it only to keys that are not shared identifiers. For very high-volume hubs, periodically audit the number of distinct sources per business key to detect overlaps.
PIT table falling out of sync with satellites
A pipeline bug causes a satellite load to skip updating the corresponding PIT table. Queries through the PIT table return stale attribute values while direct satellite queries return current values. Reports show inconsistent results depending on which path is used.
Mitigation: Build PIT table refresh into the satellite load transaction so they always stay in sync. Add a reconciliation check after each load that compares PIT timestamps against satellite latest load dates and alerts on mismatch.
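The reconciliation check can be as simple as comparing the two sets of timestamps per hub key. A sketch with illustrative inputs (the dict-shaped inputs stand in for query results):

```python
def reconcile_pit(pit_rows, satellite_rows):
    """Flag hub keys whose PIT pointer disagrees with the satellite's
    latest load date -- the symptom of a skipped PIT refresh.

    pit_rows: {hub_key: load_date currently recorded in the PIT table}
    satellite_rows: {hub_key: latest load_date actually in the satellite}
    """
    return sorted(
        key
        for key, sat_date in satellite_rows.items()
        if pit_rows.get(key) != sat_date
    )

pit = {"c1": "2024-06-01", "c2": "2024-05-01"}
sat = {"c1": "2024-06-01", "c2": "2024-06-01"}
print(reconcile_pit(pit, sat))  # ['c2']
```

Running this after every load and alerting on a non-empty result catches the drift before reports diverge.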
Satellite version explosion from high-frequency changes
A transactional source system emits change events multiple times per minute for the same entity. Each change inserts a new satellite row with a new load_date. After a month, a single customer has 50,000 satellite rows consuming gigabytes of storage.
Mitigation: Implement satellite-level batching or CDC windowing: group changes within a configurable time window before inserting a new version. Set hash_diff comparison at the batch level rather than individual change level. Profile satellite row counts per hub key before going to production.
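One way to implement the windowing is to bucket change events into fixed intervals and keep only the last state per bucket. A sketch (the 15-minute window is an arbitrary choice; tune it per source):

```python
from datetime import datetime, timedelta

def window_changes(events, window=timedelta(minutes=15)):
    """Collapse high-frequency change events into one version per window.

    events: iterable of (timestamp, payload) pairs. Events are bucketed
    into fixed-size windows and only the last state in each bucket
    survives, so an entity updated hundreds of times within a window
    yields a single satellite row.
    """
    buckets = {}
    for ts, payload in sorted(events, key=lambda e: e[0]):
        bucket = int(ts.timestamp() // window.total_seconds())
        buckets[bucket] = (ts, payload)  # last write wins per window
    return [buckets[b] for b in sorted(buckets)]

events = [
    (datetime(2024, 1, 1, 9, 0), {"status": "new"}),
    (datetime(2024, 1, 1, 9, 5), {"status": "open"}),
    (datetime(2024, 1, 1, 9, 40), {"status": "closed"}),
]
result = window_changes(events)
# The 9:00 and 9:05 events collapse into one row; 9:40 is its own window.
```

The surviving rows then go through the normal hash_diff comparison, so a window that ends in the same state as the previous one still inserts nothing.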
Raw vault loading sequence violations
A satellite loads before its parent hub record exists (FK constraint failure), or two satellites load in parallel and one completes first, leaving the PIT table referencing a load date that doesn’t exist in the other satellite.
Mitigation: Enforce load order with dependency tracking in the orchestration layer: load hubs first, then links, then satellites, and refresh PIT tables only after all satellites for a hub have landed. Loads within a layer can safely run in parallel; loads across layers must not.
Data Vault Anti-Patterns
Treating Data Vault like a star schema. Skipping the raw vault layer because "it's simpler" and building a business vault that is just a star schema under another name. This defeats the auditability purpose and creates a schema that will need rewriting when source systems change again.
Adding satellites too early. Modeling every possible attribute before you understand which ones actually change and which source systems they come from. Satellites should be added incrementally as you discover new requirements, not upfront in one big design.
Ignoring the Business Vault layer. Staying in raw vault forever because it is “flexible.” Raw vault queries are complex and slow. The purpose of Data Vault is to eventually enable business users to query through a business-friendly layer, not to live in raw vault forever.
No automation for hash key generation. Hand-computing hashes in ad-hoc SQL instead of building a reusable hash generation function. This leads to inconsistent hash algorithms across loads and silent data quality issues.
Data Vault Quick Recap
- Hubs store business keys (customer_id, product_id); links store relationships (customer-to-order); satellites store attributes that change over time.
- Hash keys decouple the warehouse from source system key formats and enable cross-system matching.
- Raw vault preserves source data exactly as-is; business vault applies business rules on top.
- PIT tables make point-in-time queries efficient by pre-computing which satellite row was current at any given time.
- Plan for longer development cycles and more complex queries than star schema — Data Vault pays off over years, not weeks.
- Implement satellite batching to prevent row explosion from high-frequency CDC sources.