Lakehouse: The Best of Both Worlds in Modern Data Architecture
For years, data engineers chose between two architectures: data warehouses for clean, reliable, fast queries, or data lakes for cheap, flexible, scalable storage. The warehouse-lake divide created complexity—engineers wrote pipelines to move data from lake to warehouse, dealt with synchronization issues, and maintained two systems instead of one.
Lakehouse architecture closes this gap. It provides data lake storage economics and flexibility with data warehouse reliability and query performance.
Why the Separation Existed
Understanding why warehouses and lakes were separate helps explain how lakehouse bridges them.
Data warehouses emerged to serve business intelligence. Queries are predictable (aggregations, joins across known dimensions), data is structured and cleaned, and response times need to be fast for interactive analysis. Warehouses achieved this by enforcing schema on write, storing data in columnar formats optimized for analytical scans, and running on expensive, specialized hardware.
Data lakes emerged to handle the stuff warehouses couldn’t. Machine learning needed raw features. Data science needed diverse, unprocessed data. Companies had terabytes of logs and clickstream data that nobody knew how to analyze yet but didn’t want to delete. Lakes stored everything cheaply in object storage without requiring upfront schema decisions.
The costs were real. Data lakes became swamps of untracked, unreliable data. Schema-on-read sounded flexible but meant every consumer reinvented parsing logic. ACID transactions didn’t exist, so partial writes corrupted data. Query performance was poor compared to columnar warehouses because files were scattered and metadata was absent.
The Lakehouse Solution
Lakehouse architecture adds warehouse capabilities to data lake storage:
ACID transactions ensure that writes either complete fully or roll back cleanly. No more corrupted Parquet files from interrupted jobs. Multiple concurrent readers and writers don’t see partial states. This alone solves the most painful operational problem with data lakes.
Time travel lets you query historical versions of data. When a transformation bug poisons your silver layer, you can instantly revert to yesterday’s gold while debugging. This changes how you handle data quality—you can always reprocess from the base layer.
Schema enforcement and evolution prevent bad data from entering the system. New columns can be added safely, and incompatible changes can be rejected or handled gracefully.
Production Failure Scenarios
ACID transaction timeout causing log corruption
A long-running compaction job holds a table lock while rewriting thousands of files. The job exceeds the transaction timeout. The log records a partial commit. On restart, reads see an inconsistent state between the log and data files.
Mitigation: configure transaction timeouts explicitly (Delta Lake: delta.logStore.timeout). Break large compactions into smaller units with checkpoint commits. Monitor pending_commits metric.
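One way to break a large compaction into smaller units is to issue one OPTIMIZE per partition so each commit stays short-lived. A minimal sketch, assuming a date-partitioned table (the `compaction_statements` helper and table name are hypothetical):

```python
from datetime import date, timedelta

def compaction_statements(table, start, days):
    """Generate one OPTIMIZE per day so each transaction commits quickly."""
    return [
        f"OPTIMIZE {table} WHERE event_date = '{start + timedelta(d)}'"
        for d in range(days)
    ]

for stmt in compaction_statements("events", date(2025, 3, 26), 3):
    print(stmt)  # in production: spark.sql(stmt)
```

Each statement commits independently, so a failure mid-run loses at most one partition's worth of work.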
Time travel reading stale data after compaction
A user queries VERSION AS OF 5 expecting historical state. Meanwhile, a background compaction merged files from versions 1-5 into a new file at version 6. The old files are removed. The historical query returns empty results because the referenced files no longer exist.
Mitigation: Retain at least N versions in the transaction log for time travel. Set delta.deletedFileRetentionDuration appropriately for your compliance requirements. Document that time travel depends on file retention, not just log entries.
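These retention windows are table properties in Delta Lake. A sketch, with a 30-day window as an illustrative value—the intervals must cover your longest expected time travel query:

```sql
-- Illustrative retention settings; both must cover your time travel window
ALTER TABLE gold.clickstream_events SET TBLPROPERTIES (
    'delta.deletedFileRetentionDuration' = 'interval 30 days',
    'delta.logRetentionDuration' = 'interval 30 days'
);
```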
Hidden partitioning bug causing severe slowdowns
Iceberg’s hidden partitioning means queries don’t explicitly filter partition directories. A bug in partition value computation causes all records to land in the same partition. Queries that should scan one partition scan the entire table, returning correct results but with 100x the expected latency. The problem isn’t visible in small datasets.
Mitigation: Validate partition distributions in QA. Alert when any partition exceeds expected size thresholds. Profile query plans before deploying to production.
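A partition-distribution check can be as simple as comparing each partition's row count against the median. A minimal sketch (the `skewed_partitions` helper is hypothetical; in practice the counts would come from a GROUP BY over the partition column):

```python
import statistics

def skewed_partitions(counts, factor=10):
    """Flag partitions whose row count exceeds factor x the median."""
    median = statistics.median(counts.values())
    return {p: c for p, c in counts.items() if c > factor * median}

counts = {"2025-03-25": 1_000, "2025-03-26": 1_100, "2025-03-27": 55_000}
print(skewed_partitions(counts))  # → {'2025-03-27': 55000}
```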
Z-order with high-cardinality columns causing OOM
Z-ordering by user_id (millions of distinct values) on a table with billions of rows creates file groups with extremely small per-group row counts. The Z-order rewrite reads and rewrites huge amounts of data, exceeding executor memory and causing OOM errors mid-job.
Mitigation: Z-order only on columns with moderate cardinality (thousands to low millions of distinct values). For very high cardinality columns, use range partitioning instead. Test Z-order operations on sample data before running on full table.
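Before committing to a Z-order layout, a quick cardinality check tells you whether a candidate column is in the workable range (column names here are illustrative):

```sql
-- Check distinct-value counts before choosing Z-order columns
SELECT
    approx_count_distinct(customer_id) AS customer_cardinality,
    approx_count_distinct(user_id) AS user_cardinality
FROM events;
```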
Open table formats like Delta Lake and Apache Iceberg deliver these capabilities on commodity object storage (S3, ADLS, GCS) rather than requiring proprietary warehouse infrastructure. You get portability across engines—Spark, Trino, Presto, Snowflake, and BigQuery can all read the same tables.
Lakehouse Architecture
```mermaid
flowchart TD
    subgraph ObjectStorage["Object Storage (S3 / ADLS / GCS)"]
        Bronze["s3://lake/bronze/"]
        Silver["s3://lake/silver/"]
        Gold["s3://lake/gold/"]
    end
    subgraph QueryEngines["Query Engines"]
        Spark["Apache Spark"]
        Trino["Trino"]
        Snowflake["Snowflake"]
        BigQuery["BigQuery"]
    end
    subgraph TransactionLog["Transaction Log"]
        Log["delta/_delta_log/"]
    end
    Bronze -->|ingest| Silver
    Silver -->|transform| Gold
    Gold -->|query| Spark
    Gold -->|query| Trino
    Gold -->|query| Snowflake
    Gold -->|query| BigQuery
    Spark -->|writes| Log
    Trino -->|writes| Log
    Log -->|validates| Bronze
    Log -->|validates| Silver
    Log -->|validates| Gold
```
All query engines read and write the same table files in object storage. The transaction log coordinates concurrent access and provides ACID guarantees.
Delta Lake in Practice
Delta Lake, originally developed by Databricks and now open source, is the most widely deployed lakehouse table format. It stores transaction logs alongside Parquet data files, enabling ACID semantics and time travel.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("LakehousePipeline") \
    .getOrCreate()

# Append to a Delta table; mergeSchema lets compatible new columns through
# while incompatible type changes are still rejected
data = spark.read.json("s3://raw-events/clickstream/2025/03/")
data.write \
    .format("delta") \
    .mode("append") \
    .partitionBy("event_date") \
    .option("mergeSchema", "true") \
    .saveAsTable("gold.clickstream_events")

# Time travel query: read yesterday's version of the table
yesterday_df = spark.read \
    .format("delta") \
    .option("timestampAsOf", "2025-03-26") \
    .load("s3://warehouse-bucket/gold/clickstream_events/")
```
Delta Lake’s transaction log is an ordered sequence of actions—adds, removes, metadata changes—stored as JSON files, with periodic Parquet checkpoints that summarize the log for fast reads. When a query reads a Delta table, it first reads the log to determine which data files are valid, then reads only those files. The log provides the ACID guarantee and the historical versioning.
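To make the log-replay idea concrete, here is a hypothetical minimal reader that replays add/remove actions from JSON log lines to compute the set of live data files. This is a sketch of the concept only—real Delta readers also apply checkpoints, protocol actions, and partition metadata:

```python
import json

def live_files(commit_lines):
    """Replay add/remove actions to find the files a reader should scan."""
    files = set()
    for line in commit_lines:
        action = json.loads(line)
        if "add" in action:
            files.add(action["add"]["path"])
        elif "remove" in action:
            files.discard(action["remove"]["path"])
    return files

log = [
    '{"add": {"path": "part-0001.parquet"}}',
    '{"add": {"path": "part-0002.parquet"}}',
    '{"remove": {"path": "part-0001.parquet"}}',
]
print(live_files(log))  # → {'part-0002.parquet'}
```

Because readers consult the log rather than listing the directory, a half-written file that was never committed is simply invisible—that is where the ACID guarantee comes from.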
Apache Iceberg: The Open Alternative
Apache Iceberg offers similar capabilities with a different implementation approach and governance model. Originally developed at Netflix, with early contributions from Apple, it became an Apache top-level project in 2020.
Iceberg’s design emphasizes specification stability—the on-disk format won’t change in breaking ways—and engine neutrality. Unlike Delta Lake, which is closely tied to Databricks’ ecosystem, Iceberg is designed to be implemented by any engine. Snowflake, AWS Athena, Trino, Impala, and others support Iceberg natively.
```sql
-- Iceberg time travel query (works in Trino, Athena, Spark)
SELECT * FROM production.analytics_events
FOR VERSION AS OF 1234567890
WHERE customer_id = 'CUST-001';

-- Or using a timestamp
SELECT * FROM production.analytics_events
FOR TIMESTAMP AS OF TIMESTAMP '2025-03-26 00:00:00';
```
Iceberg’s hidden partitioning addresses a pet peeve of data engineers: partition filters that depend on physical layout rather than logical content. In older systems, you had to know that dt = '2025-03-27' corresponded to a specific directory. Iceberg abstracts partitioning completely—queries filter on event_date, and the engine automatically discovers which data files match.
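A sketch of what this looks like at table-creation time (schema and table names are illustrative). The partition transform lives in the table definition, so queries never mention it:

```sql
-- Partitioning is declared once, as a transform on a logical column
CREATE TABLE production.analytics_events (
    event_id    BIGINT,
    customer_id STRING,
    event_time  TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(event_time));

-- Queries filter on the logical column; the engine prunes files itself
SELECT count(*) FROM production.analytics_events
WHERE event_time >= TIMESTAMP '2025-03-26 00:00:00';
```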
Query Performance: Closing the Gap
Lakehouse systems achieve warehouse-like query performance through several mechanisms.
Data skipping uses column statistics stored in table metadata. A query filtering on country = 'Germany' skips any file where the country column’s min/max statistics show no German records. This works even when files contain millions of rows.
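The mechanism is easy to sketch: keep per-file min/max statistics and scan only files whose range could contain the filter value. A toy illustration (the `files_to_scan` helper and the statistics layout are hypothetical—real formats store this in table metadata):

```python
def files_to_scan(file_stats, column, value):
    """Keep only files whose min/max range could contain the value."""
    return [
        f["path"] for f in file_stats
        if f["min"][column] <= value <= f["max"][column]
    ]

stats = [
    {"path": "a.parquet", "min": {"country": "Argentina"}, "max": {"country": "France"}},
    {"path": "b.parquet", "min": {"country": "Finland"}, "max": {"country": "Italy"}},
]
print(files_to_scan(stats, "country", "Germany"))  # → ['b.parquet']
```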
File caching keeps frequently accessed data in local storage or memory. Managed engines like Databricks, Snowflake, and BigQuery automatically cache hot data, eliminating object storage latency for common queries.
Compaction consolidates small files into larger ones. Remember that Parquet files under 128MB cause metadata overhead. Background compaction jobs merge small files into optimally-sized ones without disrupting concurrent queries.
Z-order clustering co-locates related data within the same files based on column values. If you frequently filter by customer_id, Z-ordering by customer_id groups each customer’s transactions into the same files, maximizing data skipping for customer-scoped queries.
```sql
-- Delta Lake: optimize and z-order for common filter patterns
OPTIMIZE events
WHERE event_date >= '2025-01-01'
ZORDER BY (customer_id, session_id);
```
Lakehouse Trade-Offs
| Aspect | Delta Lake | Apache Iceberg | Raw Parquet |
|---|---|---|---|
| ACID transactions | Yes | Yes | No |
| Time travel | Yes | Yes | No |
| Schema evolution | Yes | Yes | No |
| Hidden partitioning | No | Yes | No |
| Engine portability | Spark, Trino, Snowflake, BigQuery | Snowflake, Trino, Athena, Spark, Flink | Any |
| Open specification | Linux Foundation project (open) | Apache TLP (fully open) | N/A |
| Managed cloud option | Databricks, AWS, Azure | AWS, Snowflake, Dremio | None |
| Maturity | Production-hardened (2019+) | Maturing (2022+) | Proven but limited |
| Partition evolution | No (drop and recreate) | Yes | N/A |
Choose Delta Lake for Databricks-heavy environments. Choose Iceberg for multi-engine portability or when partition evolution is needed.
Capacity Estimation
The transaction log grows with write frequency, not data volume. Each commit creates a JSON action file (typically 1-50KB). At high write throughput, log files accumulate quickly.
```python
# Estimate transaction log storage
commits_per_day = 1000        # e.g. a streaming micro-batch commit every ~90 seconds
avg_log_file_kb = 10
retention_days = 30
log_size_mb = commits_per_day * avg_log_file_kb * retention_days / 1024
print(f"Daily log files: {commits_per_day}")
print(f"Monthly log storage: {log_size_mb:.1f} MB")

# Data file compaction planning
total_data_gb = 1_000_000     # 1 PB table
target_file_size_gb = 1
current_file_size_gb = 0.128  # 128 MB default
file_count_before = total_data_gb / current_file_size_gb
file_count_after = total_data_gb / target_file_size_gb
print(f"Files before compaction: {file_count_before:,.0f}")
print(f"Files after compaction: {file_count_after:,.0f}")
print(f"Compaction reduces file count by: {(1 - file_count_after/file_count_before)*100:.0f}%")
# Files before compaction: 7,812,500
# Files after compaction: 1,000,000
# Compaction reduces file count by: 87%
```
The transaction log is tiny compared to data storage, but at millions of commits per day, log retention for time travel can reach tens of gigabytes. Set deletedFileRetentionDuration based on your time travel window requirements rather than retaining files indefinitely.
The Modern Lakehouse Stack
A typical lakehouse architecture looks like this:
Bronze layer in object storage stores raw ingested data in open formats (Parquet, ORC). This is immutable history. Schema-on-read applies here—consumers handle their own parsing.
Silver layer contains cleaned, deduplicated, and conformed data. Types are standardized, nulls are handled, and business rules are applied. This is the staging ground for analytics and ML.
Gold layer provides business-level aggregates and analytics-ready datasets. The gold layer is what BI tools and analysts query directly. It’s curated, documented, and reliable.
```python
# Lakehouse medallion pipeline with Databricks
# Bronze: raw ingestion
raw_events = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "clickstream") \
    .load()

raw_events.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://chkpts/bronze/events/") \
    .table("bronze.clickstream_raw")

# Silver: cleaning and deduplication (keep only the latest row per event_id)
spark.sql("""
    MERGE INTO silver.clickstream_events AS target
    USING (
        SELECT * EXCEPT (rn) FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time DESC) AS rn
            FROM bronze.clickstream_raw
        ) WHERE rn = 1
    ) AS source
    ON target.event_id = source.event_id
    WHEN NOT MATCHED THEN INSERT *
""")

# Gold: business aggregations
spark.sql("""
    CREATE OR REPLACE TABLE gold.daily_customer_activity AS
    SELECT
        customer_id,
        DATE(event_time) AS activity_date,
        COUNT(*) AS event_count,
        COUNT(DISTINCT session_id) AS sessions
    FROM silver.clickstream_events
    GROUP BY customer_id, DATE(event_time)
""")
```
When Lakehouse Makes Sense
Lakehouse isn’t always the right choice. For straightforward BI workloads where data is well-understood and query latency is critical, a native cloud data warehouse (Snowflake, BigQuery, Redshift) remains simpler to operate. The lakehouse’s flexibility comes with complexity that not every team needs.
Lakehouse shines when you have diverse consumers with different needs: BI analysts querying curated datasets, data scientists exploring raw features, ML pipelines building feature stores, and streaming applications needing low-latency access. When you need to preserve raw data for reprocessing or compliance. When your team has the engineering sophistication to manage the additional complexity.
For many organizations, the pragmatic approach is a hybrid: use lakehouse table formats (Delta Lake or Iceberg) to get ACID transactions and time travel even if the primary query engine is a managed warehouse. The table format abstraction means you can move workloads between engines without migrating data.
Lakehouse Observability
Track these metrics to keep lakehouse tables healthy:
```python
# Key lakehouse metrics to monitor
lakehouse_metrics = {
    # Table health
    "file_count": "Alert if > 100,000 per table (compaction needed)",
    "avg_file_size_mb": "Alert if < 128MB (compaction needed)",
    "pending_commits": "Alert if > 10 (transaction congestion)",
    "last_commit_timestamp": "Alert if stale > expected batch interval",
    # Storage
    "total_table_size_gb": "Track growth over time",
    "removed_files_size_gb": "Monitor vacuum effectiveness",
    # Query performance
    "p99_query_duration_sec": "Baseline and alert on regression",
    "bytes_scanned_per_query": "Detect full table scans on large tables",
}

# Example: Delta Lake table metrics query
delta_stats = spark.sql("""
    DESCRIBE DETAIL delta.`s3://lake/gold/clickstream_events/`
""").collect()[0]
print(f"File count: {delta_stats.numFiles}")
print(f"Table size: {delta_stats.sizeInBytes / 1024**3:.1f} GB")
print(f"Partition columns: {len(delta_stats.partitionColumns)}")
```
Alert on: file count growing 10x between compaction runs, average file size dropping below 64MB, pending commits stale for more than 5 minutes.
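Those three alert conditions can be encoded directly. A minimal sketch—the `table_health_alerts` helper and its metric names are hypothetical, standing in for whatever your monitoring system collects:

```python
def table_health_alerts(metrics):
    """Evaluate the three alert conditions from the text."""
    alerts = []
    # File count grew 10x since the last compaction run
    if metrics["file_count"] > 10 * metrics["file_count_after_last_compaction"]:
        alerts.append("file count grew 10x since last compaction")
    # Average file size dropped below 64 MB
    if metrics["avg_file_size_mb"] < 64:
        alerts.append("average file size below 64 MB")
    # Pending commits stale for more than 5 minutes
    if metrics["pending_commits"] > 0 and metrics["minutes_since_last_commit"] > 5:
        alerts.append("pending commits stale for over 5 minutes")
    return alerts

print(table_health_alerts({
    "file_count": 120_000,
    "file_count_after_last_compaction": 10_000,
    "avg_file_size_mb": 48,
    "minutes_since_last_commit": 2,
    "pending_commits": 0,
}))
```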
Lakehouse Anti-Patterns
Using lakehouse as a data swamp in disguise. Adding ACID transactions on top of a disorganized data dump doesn’t fix the underlying problem. Without naming conventions, partitioning discipline, and catalog entries, you have a transactional swamp. The format change is the easy part; the organizational discipline is the real work.
Running compaction too aggressively. Frequent compactions consume compute resources that could go toward useful work. If your workload is append-only (log data, events), small files are less of a problem because queries tend to be recent-time-scoped. Compact less frequently and monitor actual query patterns before tuning.
Ignoring partition evolution in Iceberg. Iceberg allows partition specs to evolve, but existing data isn’t retroactively repartitioned. Teams add a new partition column, write new data with the new spec, but historical queries still scan the old partition layout. Plan partition evolution carefully before deploying to production with historical data.
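For reference, evolving a partition spec in Iceberg is a one-line DDL change (table and transform are illustrative)—which is exactly why it is easy to do without thinking through the consequences for historical data:

```sql
-- New spec applies only to newly written files; old files keep the old layout
ALTER TABLE production.analytics_events
ADD PARTITION FIELD bucket(16, customer_id);
```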
Conclusion
Lakehouse architecture represents a practical convergence of data lake flexibility and data warehouse reliability. Open table formats like Delta Lake and Iceberg have matured to the point where they are the default choice for new data platform investments.
The key insight is that storage and compute can be separated. Object storage holds the data in open formats. Compute engines—whether Spark, Trino, Snowflake, or BigQuery—query the same tables. You are no longer locked into a single vendor’s proprietary format.
If you are building a new data platform or migrating an existing one, lakehouse architecture with Delta Lake or Iceberg should be the starting point. The tooling has matured, the performance is competitive, and the flexibility to run diverse workloads on the same data is worth the architectural investment.
Quick Recap
- Lakehouse combines data lake storage economics with warehouse reliability—ACID transactions, time travel, and schema enforcement on open table formats.
- Delta Lake is the production default for Databricks environments; Iceberg is better for multi-engine portability across Snowflake, Trino, Athena, and Spark.
- Use Z-ordering on moderate-cardinality columns to co-locate related data for better data skipping.
- Monitor file count, average file size, and pending commits—compaction is the operational lever for keeping tables healthy.
- Time travel depends on file retention, not just the transaction log—configure deletedFileRetentionDuration for your compliance window.
- Avoid the “transactional swamp” trap: format changes don’t fix organizational discipline problems.
For foundational data warehouse concepts, see data warehouse architecture. To understand the raw storage layer, read about data lakes.