Data Warehousing
OLAP vs OLTP comparison. Star and snowflake schemas, fact and dimension tables, slowly changing dimensions, and columnar storage in data warehouses.
Your operational database handles transactions: orders get placed, accounts get updated, inventory changes. It is optimized for writes and short, targeted reads. It is not built for analyzing months of order history across multiple dimensions.
A data warehouse is built for analytics. It stores data from across your organization, structured for complex queries and aggregation. Where an operational system handles 100 transactions per second, a data warehouse handles queries scanning millions of rows.
OLTP vs OLAP
OLTP (Online Transaction Processing) systems handle day-to-day operations: inserts and updates dominate. Queries are simple, targeting specific records. The schema is normalized to reduce redundancy and support concurrent writes.
OLAP (Online Analytical Processing) systems handle analysis. Reads dominate—complex queries scanning large datasets. Writes are batch loads from source systems, not individual transactions. Schema is denormalized to support fast aggregations.
| Characteristic | OLTP | OLAP |
|---|---|---|
| Purpose | Day-to-day operations | Historical analysis |
| Writes | High volume, small transactions | Low volume, batch loads |
| Reads | Simple, targeted | Complex, aggregations |
| Data volume | Current data | Historical data |
| Schema | Normalized | Denormalized (star/snowflake) |
The same data often flows from OLTP systems into OLAP systems. ETL pipelines move and transform data nightly or continuously.
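A minimal sketch of that flow in Python, with hypothetical table and column names: extract order rows from the OLTP source, denormalize them with customer attributes, and produce fact rows for a batch load.

```python
# Minimal ETL sketch (table and column names are illustrative): extract rows
# from an OLTP source, denormalize with dimension attributes, load as facts.

def transform(order, customers):
    """Join an OLTP order row with its customer to build a fact row."""
    customer = customers[order["customer_id"]]
    return {
        "customer_id": order["customer_id"],
        "segment": customer["segment"],  # denormalized for fast grouping
        "quantity": order["quantity"],
        "total": order["quantity"] * order["unit_price"],
    }

# Extract: rows as they might come from the operational database.
customers = {1: {"segment": "enterprise"}, 2: {"segment": "consumer"}}
orders = [
    {"customer_id": 1, "quantity": 3, "unit_price": 10.0},
    {"customer_id": 2, "quantity": 1, "unit_price": 25.0},
]

# Transform + load: in production this would batch-insert into the warehouse.
fact_rows = [transform(o, customers) for o in orders]
print(fact_rows[0]["total"])  # 30.0
```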
Star Schema
Star schema organizes data into fact tables and dimension tables. The name matches the shape: a central fact table with dimension tables radiating outward.
Fact tables store measurements or metrics. Examples: sales transactions, page views, account balances. Fact tables contain foreign keys to dimension tables plus numeric measures.
Dimension tables store context around events. Examples: customer information, product details, time periods, geographic locations. Dimension tables contain descriptive attributes.
-- Fact table: sales
CREATE TABLE fact_sales (
sale_id BIGINT PRIMARY KEY,
customer_id INT,
product_id INT,
store_id INT,
time_id INT,
quantity INT,
unit_price DECIMAL(10,2),
total DECIMAL(10,2)
);
-- Dimension table: customers
CREATE TABLE dim_customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
email VARCHAR(255),
city VARCHAR(50),
region VARCHAR(50),
segment VARCHAR(20)
);
A typical sales query joins fact_sales to dim_customers, dim_products, dim_time, and dim_stores to produce reports by customer segment, product category, time period, or region.
flowchart LR
Fact[("fact_sales<br/>Sales Transactions")]
Fact --> DimC[("dim_customers<br/>Customer Attributes")]
Fact --> DimP[("dim_products<br/>Product Details")]
Fact --> DimT[("dim_time<br/>Calendar")]
Fact --> DimS[("dim_stores<br/>Store Attributes")]
Star schema prioritizes query performance. Joins are predictable, and the database can efficiently scan fact tables filtered by dimension keys.
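The join-and-group pattern behind such reports can be sketched in Python, treating the dimension table as a key lookup (all names and data here are illustrative):

```python
# A star-schema join reduces to a key lookup: each fact row carries dimension
# keys, and aggregation groups by a dimension attribute. Hypothetical data.
from collections import defaultdict

dim_customers = {
    1: {"segment": "enterprise", "region": "Northeast"},
    2: {"segment": "consumer", "region": "West"},
}

fact_sales = [
    {"customer_id": 1, "total": 100.0},
    {"customer_id": 1, "total": 250.0},
    {"customer_id": 2, "total": 40.0},
]

# "Revenue by customer segment": join fact to dimension, group by attribute.
revenue_by_segment = defaultdict(float)
for row in fact_sales:
    segment = dim_customers[row["customer_id"]]["segment"]
    revenue_by_segment[segment] += row["total"]

print(dict(revenue_by_segment))  # {'enterprise': 350.0, 'consumer': 40.0}
```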
Snowflake Schema
Snowflake schema normalizes dimension tables into hierarchies. Star schema stores customer city and region on the dimension table. Snowflake splits these into separate tables.
-- Normalized dimension table structure
CREATE TABLE dim_customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
email VARCHAR(255),
city_id INT,
segment_id INT
);
CREATE TABLE dim_cities (
city_id INT PRIMARY KEY,
city_name VARCHAR(50),
region_id INT
);
CREATE TABLE dim_regions (
region_id INT PRIMARY KEY,
region_name VARCHAR(50)
);
Snowflake schema reduces redundancy and uses less storage through normalization. The cost: more joins to answer the same questions, and slower queries.
Most analytical workloads use star schema. Snowflake appears in regulated industries where data redundancy creates compliance concerns, or when dimension tables are very large and normalization saves meaningful storage.
Fact Tables
Fact tables come in different shapes depending on what they measure:
Transaction fact tables record individual events. One row per transaction. Sales transactions, user logins, web page views.
Periodic snapshot fact tables record measurements at regular intervals. One row per entity per time period. Daily inventory levels, monthly account balances.
Accumulating snapshot fact tables record the lifecycle of an event. One row per order, with columns for order date, ship date, delivery date. Useful for pipeline analysis.
Fact tables tend to be large. A retailer’s fact_sales might contain billions of rows. Fact tables are typically partitioned by date to support efficient pruning and archival.
Dimension Tables
Dimension tables provide context. A customer dimension contains attributes that describe customers: name, email, address, demographic information, segment classification.
Dimension tables are typically much smaller than fact tables. A dim_customers with a few million rows supports billions of sales records. Dimension tables are often shared across multiple fact tables—dim_time appears in sales, inventory, and returns facts.
Dimension tables support filtering and grouping. “Show sales by customer segment” or “Filter to the Northeast region” operate on dimension attributes.
Slowly Changing Dimensions (SCD)
Customer addresses change. Product categories get renamed. Employee titles evolve. Data warehouses often need to track these changes over time.
Type 1: overwrite. The old value is lost. Simple, no history.
Type 2: add new row. Track history with effective date ranges. A customer who moves from Region A to Region B has two rows, each with validity dates. Queries can see the state at any point in time.
Type 3: add new column. Store both old and new values. Less common—supports seeing previous value but not full history.
Type 6 (hybrid): combines types 1, 2, and 3 with additional columns to handle multiple scenarios.
Most data warehouses use Type 2 for attributes that change rarely but matter historically—customer address, product category, employee department.
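A Type 2 change can be sketched as closing the current row and inserting a new version with validity dates. The column names (valid_from, valid_to, is_current) follow a common convention but are illustrative:

```python
# Sketch of an SCD Type 2 update: close the current version, append a new one.
# Column names (valid_from/valid_to/is_current) are a common convention,
# not a standard.
from datetime import date

def apply_scd2_change(history, customer_id, new_attrs, change_date):
    """Close the current version and append a new one."""
    for row in history:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = change_date
            row["is_current"] = False
    history.append({
        "customer_id": customer_id,
        **new_attrs,
        "valid_from": change_date,
        "valid_to": None,  # open-ended until the next change
        "is_current": True,
    })

history = [{"customer_id": 1, "region": "Region A",
            "valid_from": date(2023, 1, 1), "valid_to": None,
            "is_current": True}]
apply_scd2_change(history, 1, {"region": "Region B"}, date(2025, 6, 1))
# Two rows now exist: Region A (closed) and Region B (current).
print(len(history), history[-1]["region"])  # 2 Region B
```

Queries that need the state at a point in time filter on the validity dates; queries that need the current state filter on is_current.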
Data Lake vs Data Warehouse
Data warehouses store structured data optimized for analytics. Data lakes store raw data in native format—structured, semi-structured, unstructured—until needed.
| Characteristic | Data Warehouse | Data Lake |
|---|---|---|
| Data format | Structured | Any format |
| Schema | Defined before loading | Defined on read |
| Cost per storage | Higher | Lower (object storage) |
| Performance | Optimized for queries | Varies widely |
| Users | Business analysts | Data scientists, analysts |
| Use cases | Reporting, dashboards | ML training, exploration |
Modern architectures often combine both. Raw data lands in a data lake for data science. Pipelines transform and structure data into a data warehouse for business reporting.
Snowflake, Redshift, and BigQuery work as both. They can query structured data in optimized columnar storage and raw data in external object storage.
Columnar Storage
Traditional row storage keeps all columns for a row together. Querying specific columns requires reading entire rows.
Columnar storage keeps all values for a column together. To compute an average of a column, you read only that column. This reduces I/O for analytical queries that access a subset of columns.
Columnar storage compresses well. Same values stored contiguously enable dictionary encoding, run-length encoding, and compression schemes that achieve significant storage reduction compared to row storage.
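A toy Python sketch of why this works: a sorted, low-cardinality column stored contiguously collapses under run-length encoding.

```python
# Toy illustration of columnar compression: a low-cardinality column stored
# contiguously collapses under run-length encoding.

def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A sorted status column: 1M values, but only three distinct runs.
column = ["pending"] * 400_000 + ["shipped"] * 500_000 + ["returned"] * 100_000
encoded = run_length_encode(column)
print(encoded)  # [['pending', 400000], ['shipped', 500000], ['returned', 100000]]
```

Real columnar engines layer dictionary encoding and general-purpose compression on top of this idea, but the intuition is the same: contiguous, similar values compress far better than interleaved rows.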
Redshift, BigQuery, Snowflake, and ClickHouse use columnar storage. Parquet and ORC are columnar file formats used in data lakes and Spark workloads.
-- BigQuery columnar storage is automatic; query only the columns you need.
-- (This fact_sales is assumed to carry a denormalized order_date column.)
SELECT
segment AS customer_segment,
DATE_TRUNC(order_date, MONTH) AS month,
SUM(total) AS revenue
FROM fact_sales
JOIN dim_customers USING (customer_id)
WHERE order_date >= '2025-01-01'
GROUP BY customer_segment, month;
BigQuery does not read columns you do not specify. If you only need customer_segment, order_date, and total, the other columns are never loaded from storage.
When to Use / When Not to Use Data Warehousing
Use a data warehouse when:
- You need to analyze data across multiple source systems
- Queries scan large historical datasets (months or years of data)
- Business intelligence and reporting are primary workloads
- You need to separate analytical queries from transactional OLTP workloads
- Regulatory requirements demand audit trails and historical data retention
Do not use a data warehouse when:
- Your workload is primarily real-time operational queries
- You need sub-second query response times on current data
- You are dealing with unstructured or semi-structured data that changes shape frequently
- Your team lacks ETL expertise and data modeling skills
- Cost is a primary constraint and simple aggregation tables suffice
Star vs Snowflake vs Galaxy Schema Trade-offs
| Dimension | Star Schema | Snowflake Schema | Galaxy Schema |
|---|---|---|---|
| Normalization | Denormalized | Partially normalized | Multiple fact tables sharing dimension tables |
| Join complexity | Simple — few joins | Complex — many hierarchy joins | Complex — multiple fact joins |
| Query performance | Fastest — optimized for analytics | Slower — more joins | Depends on query shape |
| Storage efficiency | Higher — data duplicated | Lower — normalized | Mixed |
| Data redundancy | High | Low | Medium |
| Dimension table size | Large | Smaller (normalized) | Shared dimensions can be large |
| Maintenance | Easier — simpler model | Harder — referential integrity | Most complex |
| Best for | Simple analytics, BI dashboards | Regulated industries, complex hierarchies | Enterprise with multiple business processes |
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| SCD Type 2 misimplementation | Historical facts attributed to wrong dimension value | Test with known historical scenarios, audit trail on dimension changes |
| Late-arriving facts | New transactions assigned to wrong time period | Use surrogate keys, implement late-arrival handling in ETL |
| Fact table partition key causing hot partitions | Write bottleneck on single partition | Choose high-cardinality partition keys, spread writes across partitions |
| Dimension table referential integrity violations | Query returns phantom or missing rows | Enforce FK constraints in loading, validate before publish |
| ETL job failures causing missing data | Reports show incomplete periods | Implement monitoring with data freshness checks, rebuild procedures |
| Columnar storage query predicates on non-projected columns | Full column scan, slow performance | Always project only needed columns, use columnar-optimized formats |
Capacity Estimation: Fact Table Sizing and Dimension Table Growth
Data warehouse sizing requires estimating fact table growth and dimension table evolution over time.
Fact table sizing formula:
bytes_per_fact_row = sum(width_of_each_column_in_bytes)
total_fact_storage = bytes_per_fact_row × rows_per_year × years_of_history
total_with_indexes = total_fact_storage × (1 + index_overhead_factor) # factor typically 0.3-0.5, i.e. 1.3-1.5× total
For a retail sales fact table with these columns: sale_id (8), date_id (4), customer_id (4), product_id (4), store_id (4), quantity (4), unit_price (4), total (8) = 40 bytes per row. At 10M daily sales with 3 years of history:
- Rows per year: 10M × 365 = 3.65B; over 3 years: ~11B rows
- Raw fact storage: 40 bytes × 11B ≈ 438GB
- With columnar compression (10:1 typical for columnar): ~44GB compressed
- With indexes (2 secondary indexes at 30% overhead): ~57GB total
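The sizing formula can be wrapped in a small Python calculator for the full 3 years of history; the compression ratio and index overhead are assumptions to adjust per workload:

```python
# Fact table sizing calculator. Compression ratio and index overhead are
# assumed defaults; measure both on your own workload.

def fact_table_storage_gb(bytes_per_row, rows_per_day, years,
                          compression_ratio=10, index_overhead=1.3):
    rows = rows_per_day * 365 * years
    raw_gb = bytes_per_row * rows / 1e9
    compressed_gb = raw_gb / compression_ratio
    return raw_gb, compressed_gb * index_overhead

raw, total = fact_table_storage_gb(bytes_per_row=40,
                                   rows_per_day=10_000_000, years=3)
print(round(raw), round(total, 1))  # 438 56.9
```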
Dimension table growth:
Dimension tables grow by adding rows (new customers, products, time periods) and sometimes adding columns (SCD Type 2 tracking).
new_rows_per_year = current_rows × annual_growth_rate
scd_type2_storage = base_rows × avg_versions_per_key × bytes_per_row
For a customer dimension starting at 10M customers, growing 20% per year, with SCD Type 2 averaging 3 versions per customer (some never change, some change often): at year 3: 10M × 1.2^3 ≈ 17.3M base rows, but SCD Type 2 multiplies this by ~3 versions = 52M rows. At 500 bytes per row (name, address history, contact history), that is 26GB — the same dimension table that was 5GB in year 1 is 26GB in year 3 purely from SCD Type 2 versioning.
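The same arithmetic as a small Python sketch, with the growth rate and version multiplier as assumed inputs:

```python
# Dimension growth calculator. Growth rate and versions-per-key are assumed
# inputs; both vary widely by business.

def dimension_storage_gb(base_rows, annual_growth, years,
                         versions_per_key, bytes_per_row):
    base = base_rows * (1 + annual_growth) ** years
    total_rows = base * versions_per_key
    return total_rows, total_rows * bytes_per_row / 1e9

rows, gb = dimension_storage_gb(base_rows=10_000_000, annual_growth=0.20,
                                years=3, versions_per_key=3,
                                bytes_per_row=500)
print(round(rows / 1e6, 1), round(gb, 1))  # 51.8 25.9
```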
Partition planning: Partition fact tables on the time dimension (date_id or month_id). Target 1-10 partitions per query, 100M-1B rows per partition maximum. With daily partitions on 3.65B annual rows: 365 partitions per year. If you query a single month, that is 30 partitions — manageable. If you query a single day, that is 1 partition — fast.
Observability Hooks: ETL Job Monitoring and Data Freshness
Key warehouse metrics: data freshness (time since last ETL load), ETL job duration trends, and fact table storage growth rate.
-- Check data freshness per table
SELECT
table_name,
last_loaded_at,
EXTRACT(EPOCH FROM (now() - last_loaded_at)) / 3600 AS hours_since_load,
rows_loaded,
load_status
FROM etl_job_history
WHERE job_name LIKE '%daily_load%'
ORDER BY last_loaded_at DESC
LIMIT 20;
-- Monitor fact table row count growth over time (PostgreSQL catalog)
SELECT
relname AS table_name,
reltuples / 1e6 AS estimated_millions_of_rows,
pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relnamespace = 'analytics'::regnamespace
AND relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC;
Critical alerts:
# Data freshness breach
- alert: WarehouseDataStale
expr: time() - max(etl_last_success_timestamp) > 86400 # 24 hours
for: 15m
labels:
severity: critical
annotations:
summary: "Warehouse table {{ $labels.table }} has not been loaded in over 24 hours"
# ETL job duration spike
- alert: EtlJobDurationSpike
expr: etl_job_duration_seconds > 1.5 * avg_over_time(etl_job_duration_seconds[7d])
for: 10m
labels:
severity: warning
annotations:
summary: "ETL job {{ $labels.job }} running {{ $value }}s vs typical {{ $labels.typical }}s"
# Dimension table unexpectedly large
- alert: DimensionTableBloat
expr: dimension_table_row_count > 2 * dimension_table_baseline_row_count
for: 1h
labels:
severity: warning
annotations:
summary: "Dimension {{ $labels.dimension }} has grown to {{ $value }} rows vs baseline of {{ $labels.baseline }}"
Real-World Case Study: Snowflake at Databricks Scale
Databricks runs a lakehouse platform where customers process analytical workloads at massive scale. Their own internal data platform uses Snowflake for operational analytics over their customer base.
At their scale, fact table queries routinely scan billions of rows. Their optimization strategy: partition fact tables by date and cluster by the most common filter columns (customer_id, product_id). Snowflake’s micro-partitions compress data automatically, and clustering keys reduce the data scanned per query.
The challenge that bit them: SCD Type 2 on their customer_events dimension was consuming 40% of total warehouse storage. Every customer event created a new dimension row version, and with billions of events, dimension table size exploded. The fix: move to SCD Type 1 for low-importance attributes (never track history), keep SCD Type 2 only for high-importance attributes (customer tier, account status), and use a separate history table for detailed audit requirements rather than baking it into the dimension.
The lesson: SCD Type 2 multiplies dimension storage by the average number of versions per key. For large fact tables with many events per key, the dimension storage tax is significant. Design SCD Type 2 scope deliberately — apply it only where point-in-time historical accuracy is genuinely needed.
Interview Questions
Q: Your fact table has 10 years of history but queries only ever use the last 2 years. How do you reduce storage without losing historical data?
Partition by time period and archive old partitions to cheaper storage (for example, external tables over object storage or a cloud warehouse’s cold storage tier). Hot data remains in the active warehouse tier for fast queries; cold data remains queryable on demand at higher latency and lower cost. Alternatively, use columnar storage with tiering: older partitions get more aggressive compression and cheaper storage hardware. Most cloud data warehouses handle this automatically with tiered storage policies.
Q: Your dimension table is 5× larger than the fact table. What went wrong?
SCD Type 2 over-application is the likely cause. If your dimension has 100,000 base keys but SCD Type 2 tracking has created 5M rows, the dimension is bloated. Each change to any attribute of any customer creates a new row version. The fix: audit which attributes actually need SCD Type 2 tracking versus SCD Type 1 (overwrite in place). Move infrequently-changing attributes to Type 1. For detailed change history that does not need to be in the dimension, use a separate history or audit table.
Q: Late-arriving facts arrive days after a transaction. How do you handle this in a star schema?
Late-arriving facts require a mechanism to update the dimension join without corrupting historical aggregations. The standard approach: use surrogate keys for all dimension joins. When a late fact arrives, look up the surrogate key for the dimension as-of the fact date — not the current dimension value. If the customer had tier “gold” 5 days ago but is “platinum” now, late-arriving fact from 5 days ago should join on the “gold” surrogate. This requires maintaining a time-series of dimension surrogates. For systems without full surrogate key tracking, implement late-arrival handling in the ETL: insert a corrective record into the fact table with the correct surrogate key and a flag indicating correction.
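An as-of surrogate key lookup can be sketched in Python against Type 2 validity windows (table shape and names are illustrative, following the SCD Type 2 convention described earlier):

```python
# As-of surrogate lookup for late-arriving facts: find the dimension version
# whose validity window contains the fact's event date. Data is illustrative.
from datetime import date

dim_customer_versions = [
    {"surrogate_key": 101, "customer_id": 1, "tier": "gold",
     "valid_from": date(2025, 1, 1), "valid_to": date(2025, 6, 1)},
    {"surrogate_key": 102, "customer_id": 1, "tier": "platinum",
     "valid_from": date(2025, 6, 1), "valid_to": None},
]

def surrogate_as_of(versions, customer_id, event_date):
    """Return the surrogate key valid on event_date for this natural key."""
    for v in versions:
        if (v["customer_id"] == customer_id
                and v["valid_from"] <= event_date
                and (v["valid_to"] is None or event_date < v["valid_to"])):
            return v["surrogate_key"]
    raise LookupError("no dimension version covers this date")

# A fact from May joins to the "gold" version, not the current "platinum" one.
print(surrogate_as_of(dim_customer_versions, 1, date(2025, 5, 20)))  # 101
```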
Q: How do you estimate columnar storage compression ratio for a new fact table?
Columnar storage compression depends on cardinality and data patterns. Measure actual compression on a sample: load 1% of your data into the columnar format and measure the compression ratio. General rules of thumb: low-cardinality strings (status codes, category IDs) compress 10:1 or better. High-cardinality strings (UUIDs, free text) compress 2:1 or worse. Numeric types compress 5:1 to 10:1 depending on value distribution. In practice, expect 5:1 to 15:1 for well-designed star schemas. If your compression is less than 3:1, investigate high-cardinality string columns that should be numeric IDs, or reconsider the schema design.
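A rough way to run that sample measurement in Python, using zlib as a stand-in for the warehouse's columnar codec (so the ratios are only indicative, not what a real engine achieves):

```python
# Sample-based compression measurement. zlib stands in for the warehouse's
# columnar codec, so treat the ratios as relative indicators only.
import zlib

def compression_ratio(values):
    """Ratio of raw bytes to compressed bytes for a column sample."""
    raw = "\n".join(values).encode()
    return len(raw) / len(zlib.compress(raw))

# Low-cardinality column (status codes) vs high-cardinality column (IDs).
statuses = ["shipped", "pending", "returned"] * 10_000
ids = [f"id-{i:08d}" for i in range(30_000)]
print(compression_ratio(statuses) > compression_ratio(ids))  # True
```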
Related Posts
- Data Engineering Roadmap - ETL pipelines, stream processing, and data transformation patterns that complement data warehousing architecture
Conclusion
Data warehouses serve analytical workloads against large historical datasets. Star schema organizes data into fact and dimension tables optimized for aggregation. Slowly changing dimensions track attribute history. Columnar storage accelerates queries by reading only needed columns.
OLTP and OLAP systems serve different purposes. Operational systems handle transactions; analytical systems handle queries. Data flows from operational systems into warehouses via ETL pipelines.
Modern tools blur the lines—data lakes and cloud data warehouses handle both raw and structured data. The underlying principles of star schema, columnar storage, and fact/dimension separation remain relevant.
See Also
- Materialized Views — Precomputed aggregations in relational databases
- Object Storage — Storing unstructured data at scale
- Relational Databases — OLTP foundations