Data Lake Architecture: Raw Data Storage at Scale

Learn how data lakes store raw data at scale for machine learning and analytics, and the patterns that prevent data swamps.


Data Lake Architecture: Storing Everything for Future Discovery

Data warehouses excel at answering known questions fast. You know you need revenue by region by quarter, so you build the schema to make that query efficient. But machine learning teams need to experiment with raw features that haven’t been aggregated yet. Data science projects need to explore datasets that don’t fit neatly into predefined schemas. Discovery questions—“what patterns in user behavior predict churn?”—require access to everything, not just the metrics someone thought to warehouse.

Data lakes exist to bridge this gap. They store data in its raw form at scale, without the upfront schema enforcement that warehouses require. You can dump terabytes of logs, clickstream events, IoT sensor readings, and third-party data feeds into a data lake and figure out what to do with it later.

The problem is that data lakes have a tendency to become data swamps—formless masses of unstructured files where nothing is findable and nothing is trustworthy. Getting the architecture right from the beginning prevents this.

When to Use a Data Lake

A data lake is the right choice when:

  • You need to store raw data before schema requirements are known (discovery analytics, ML feature engineering)
  • Data comes from multiple sources with different schemas that will evolve over time
  • You need time travel—reading older versions of data for auditing, model reprocessing, or error correction
  • Your team runs ML workloads that require flexible access to denormalized feature stores
  • Storage cost matters and you need cheap, durable object storage for rarely-accessed data

A data lake is not the right choice when:

  • Your primary workload is pre-aggregated business intelligence dashboards (use a warehouse)
  • You need sub-second query response times on large datasets (use a specialized query engine)
  • Your team lacks the data engineering discipline to maintain catalog metadata and partitioning standards
  • Strict ACID transactions across multiple tables are required for every operation (use a warehouse or lakehouse)

The Core Idea: Schema-on-Read

Traditional databases and data warehouses use schema-on-write: you define the structure before data goes in, and the system enforces it. Data lakes flip this to schema-on-read. You load data in its native format—CSV, JSON, Parquet, Avro, raw logs—and define the schema when you query it.

This flexibility is the data lake’s main advantage. A data engineer can ingest streaming events into a data lake without first deciding how marketing wants to segment customers or how finance wants to aggregate revenue.
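The schema-on-read idea can be shown in miniature: the stored bytes are schemaless JSON lines, and structure is imposed only when a reader parses them against the projection it needs. A pure-Python sketch (the field names are hypothetical):

```python
import json

# Raw events are stored as-is: newline-delimited JSON with no enforced schema
raw_lines = [
    '{"user_id": 1, "event": "click", "url": "/home"}',
    '{"user_id": 2, "event": "purchase", "amount": 19.99}',
]

def read_with_schema(lines, columns):
    """Apply a schema at read time: project only the columns the query needs."""
    for line in lines:
        record = json.loads(line)
        # Fields absent from a record surface as None, not as write-time errors
        yield {col: record.get(col) for col in columns}

# Two readers, two schemas, one copy of the raw data
clicks = list(read_with_schema(raw_lines, ["user_id", "url"]))
revenue = list(read_with_schema(raw_lines, ["user_id", "amount"]))
print(clicks[0])   # {'user_id': 1, 'url': '/home'}
print(revenue[0])  # {'user_id': 1, 'amount': None}
```

The same raw lines serve both readers; neither projection required deciding a schema at ingestion time.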

# Example: Writing raw events to a data lake in Parquet format
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("RawEventsToDataLake") \
    .getOrCreate()

# Events land as raw JSON, stored in bronze/raw/ partition
events_df = spark.read.json("s3://raw-events-bucket/clickstream/2025/03/")

# Write to data lake in Parquet, partitioned by date
events_df.write \
    .mode("append") \
    .partitionBy("event_date") \
    .parquet("s3://datalake-bucket/bronze/clickstream_events/")

The bronze/silver/gold pattern (also called raw/refined/curated) has become standard practice. Bronze holds the original ingested data, bit-for-bit identical to what sources sent. Silver contains cleaned and conformed data, ready for analysis. Gold provides business-level aggregates and analytics-ready datasets.
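The bronze-to-silver step usually amounts to deduplication, dropping malformed records, and conforming field names. A minimal pure-Python sketch of that logic (in production this would run in Spark over the bronze files; the record fields here are hypothetical):

```python
# Sketch of a bronze -> silver cleaning step. In production this logic
# would run in Spark over bronze Parquet files; fields are illustrative.

def bronze_to_silver(raw_events):
    """Deduplicate, drop malformed records, and conform field names."""
    seen = set()
    cleaned = []
    for event in raw_events:
        # Drop records missing required fields (malformed bronze rows)
        if not event.get("event_id") or not event.get("ts"):
            continue
        # Deduplicate on event_id (sources often deliver at-least-once)
        if event["event_id"] in seen:
            continue
        seen.add(event["event_id"])
        # Conform names: bronze keeps source-native keys, silver standardizes
        cleaned.append({
            "event_id": event["event_id"],
            "event_timestamp": event["ts"],
            "event_type": event.get("type", "unknown"),
        })
    return cleaned

raw = [
    {"event_id": "a1", "ts": "2025-03-01T00:00:00Z", "type": "click"},
    {"event_id": "a1", "ts": "2025-03-01T00:00:00Z", "type": "click"},  # duplicate
    {"event_id": None, "ts": "2025-03-01T00:00:01Z"},                   # malformed
]
silver = bronze_to_silver(raw)
print(len(silver))  # 1 record survives
```

Bronze stays untouched, so if a cleaning rule turns out to be wrong, silver can be rebuilt from the raw copy.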

Storage Technologies

Modern data lakes sit on object storage: AWS S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS). Object storage provides virtually unlimited capacity, pay-per-use pricing, and durability at a fraction of the cost of traditional storage area networks.

The data lake formats you choose affect everything downstream. The dominant formats are:

Parquet is a columnar format that stores data efficiently for analytical queries. Reading a subset of columns from a Parquet file skips the columns you don’t need, dramatically reducing I/O. Parquet supports complex nested structures and offers good compression. It’s the default choice for analytical workloads.

ORC (Optimized Row Columnar) is similar to Parquet, originally developed for Hive. If you’re in a Hadoop ecosystem, ORC is a reasonable default. For cross-platform environments, Parquet has broader tool support.

Delta Lake and Apache Iceberg add ACID transactions to data lakes, solving the problem of partial writes corrupting data. They also support time travel queries (reading older versions of data) and schema evolution. Delta Lake (from Databricks) runs on S3, ADLS, GCS, and HDFS. Iceberg (now an Apache project) offers similar capabilities with different implementation details. If you’re building a new data lake today, use one of these table formats rather than raw Parquet files.

Storage Format Trade-Offs

| Format | Strengths | Weaknesses | Best For |
| --- | --- | --- | --- |
| Parquet | Columnar, excellent compression, broad tool support | No ACID transactions, no time travel | Analytical reads on immutable data |
| ORC | Hive-optimized columnar, good compression | Narrower ecosystem than Parquet | Hadoop/Hive environments |
| Delta Lake | ACID transactions, time travel, schema evolution | Databricks lock-in risk without open Delta | Teams needing transactional guarantees on lake storage |
| Iceberg | ACID transactions, portable across engines, hidden partitioning | Newer, smaller community | Multi-engine environments (Spark, Flink, Trino, Snowflake) |

-- Delta Lake example: reading historical data with time travel
-- VERSION AS OF takes a Delta commit version number, not a date
SELECT * FROM events
VERSION AS OF 42
WHERE event_type = 'purchase';

-- Reading data as of a specific point in time
SELECT * FROM events
TIMESTAMP AS OF '2025-03-01 00:00:00';

Partitioning Strategy

How you partition data determines whether queries run in seconds or scan for hours. A partition is a directory structure that groups related records together—usually by date, sometimes by a low-cardinality attribute such as source system or region. Partitioning on high-cardinality attributes like individual entity IDs creates millions of tiny partitions and should generally be avoided.

Partitioning by event date makes sense for time-series data. A query for “all events in March 2025” only reads the March 2025 partitions, skipping everything else. But if your queries almost always filter on customer ID rather than date, date partitioning forces full table scans.

# Poor partitioning: hourly partitions multiply the partition count 24x
# and shrink each file well below the efficient size range
events_df.write.partitionBy("event_date", "event_hour").parquet(path)

# Better: repartition by date first so each daily partition gets fewer, larger files
events_df.repartition(100, "event_date").write.partitionBy("event_date").parquet(path)
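The pruning that makes these partition choices pay off can be sketched without Spark: the engine lists partition directories, keeps only those whose partition value satisfies the filter, and reads nothing else. A simplified illustration (a real engine prunes against catalog metadata, not path strings; the paths are hypothetical):

```python
# Simplified sketch of partition pruning over date-partitioned paths.
# A real engine does this against catalog metadata, not string parsing.

def prune_partitions(partition_paths, start_date, end_date):
    """Keep only partitions whose event_date falls in [start_date, end_date]."""
    kept = []
    for path in partition_paths:
        # Hive-style layout: .../event_date=YYYY-MM-DD/
        value = path.rstrip("/").split("event_date=")[-1]
        if start_date <= value <= end_date:  # ISO dates sort lexicographically
            kept.append(path)
    return kept

paths = [
    "s3://datalake/bronze/events/event_date=2025-02-28/",
    "s3://datalake/bronze/events/event_date=2025-03-01/",
    "s3://datalake/bronze/events/event_date=2025-03-15/",
    "s3://datalake/bronze/events/event_date=2025-04-01/",
]
march = prune_partitions(paths, "2025-03-01", "2025-03-31")
print(len(march))  # 2 partitions scanned instead of 4
```

This is why partitioning must match query patterns: a filter on a non-partition column prunes nothing and every partition gets scanned.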

File size matters as much as partition structure. Hundreds of tiny files (under 128MB each) kill performance because the query engine spends more time managing file handles than reading data. Target file sizes of 128MB to 1GB after compression for analytical workloads.

Capacity Estimation

Partition count grows with your data volume and date range. A table with 3 years of daily partitions creates over 1,000 partitions—each one a directory with metadata overhead.

# Estimate partition count and storage
years_of_data = 3
events_per_day = 10_000_000
avg_event_size_bytes = 512
compression_factor = 0.25  # Parquet typically compresses 4-10x
days_per_year = 365

total_events = events_per_day * days_per_year * years_of_data
raw_size_gb = (total_events * avg_event_size_bytes) / (1024**3)
compressed_size_gb = raw_size_gb * compression_factor
partition_count = days_per_year * years_of_data

# Target: files of 256MB compressed
target_file_size_mb = 256
events_per_file = (target_file_size_mb * 1024**2) / (avg_event_size_bytes * compression_factor)
files_per_day = events_per_day / events_per_file

print(f"Total events: {total_events:,}")
print(f"Raw size: {raw_size_gb:.1f} GB")
print(f"Compressed size: {compressed_size_gb:.1f} GB")
print(f"Partition count: {partition_count}")
print(f"Files per day: {files_per_day:.0f}")
# Total events: 10,950,000,000
# Raw size: 5221.4 GB
# Compressed size: 1305.3 GB
# Partition count: 1095
# Files per day: 5

At 10M events/day with 512-byte events, you get roughly 5 files per day at a 256MB target size. Partitioning by date only (not date + hour) gives you manageable partition counts while keeping file sizes reasonable. If files are too small, aggregate micro-batches before writing.

Data Catalog: Finding What Exists

Without a data catalog, your data lake is a black hole. Engineers don’t know what datasets exist, analysts reuse the same extract rather than finding authoritative sources, and data quality degrades because nobody knows who owns what.

AWS Glue Data Catalog, Azure Purview, Google Dataplex, and open-source alternatives like Apache Atlas provide centralized metadata repositories. They track table schemas, partition information, sample data, and ownership. They also integrate with query engines so that Spark, Presto, and other tools can discover tables without manual configuration.

# AWS Glue: registering a table with schema inference
import boto3

glue = boto3.client('glue')

glue.create_table(
    DatabaseName='marketing_analytics',
    TableInput={
        'Name': 'clickstream_events',
        'StorageDescriptor': {
            'Location': 's3://datalake-bucket/gold/clickstream_events/',
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
            }
        },
        'PartitionKeys': [
            {'Name': 'event_date', 'Type': 'string'},
            {'Name': 'source', 'Type': 'string'}
        ],
        'TableType': 'EXTERNAL_TABLE'
    }
)

A catalog entry should include business description, column-level descriptions, sample queries showing common use cases, data freshness metrics, and contact information for the owning team.
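Those requirements can be enforced mechanically in CI or as a catalog hook. A sketch of a completeness check (the field names follow this article's suggested minimum; they are not a Glue API):

```python
# Sketch: validate that a catalog entry carries the minimum metadata
# described above. Field names are illustrative, not a Glue schema.

REQUIRED_FIELDS = {
    "description",          # business description
    "column_descriptions",  # per-column documentation
    "sample_queries",       # common use cases
    "freshness_sla_hours",  # data freshness expectation
    "owner_team",           # contact for the owning team
}

def missing_metadata(entry):
    """Return the required fields that are absent or empty in a catalog entry."""
    return sorted(
        field for field in REQUIRED_FIELDS
        if not entry.get(field)
    )

entry = {
    "description": "Clickstream events, one row per user interaction",
    "owner_team": "data-platform@example.com",
    "sample_queries": ["SELECT event_type, COUNT(*) FROM clickstream_events GROUP BY 1"],
}
print(missing_metadata(entry))  # ['column_descriptions', 'freshness_sla_hours']
```

Rejecting registrations that fail this check keeps ownership and freshness metadata from silently decaying.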

Observability for Data Lakes

Track these metrics to catch problems before they become swamps:

# Partition discovery monitoring
def check_partition_health():
    """Return the key data lake observability queries to run on a schedule."""
    queries = [
        # Count partitions per table (alert if > threshold)
        """
        SELECT table_name, COUNT(DISTINCT partition_key) AS partition_count
        FROM information_schema.partitions
        GROUP BY table_name
        HAVING partition_count > 5000
        """,
        # Find tables with no recent data (stale catalog)
        """
        SELECT table_name, MAX(partition_timestamp) AS last_data
        FROM information_schema.partitions
        GROUP BY table_name
        HAVING DATEDIFF(now(), MAX(partition_timestamp)) > 7
        """,
        # Detect schema drift: column count changed since last check
        """
        SELECT table_name, column_count, prev_count
        FROM (
            SELECT table_name, column_count, checked_at,
                   LAG(column_count) OVER (
                       PARTITION BY table_name ORDER BY checked_at
                   ) AS prev_count
            FROM schema_version_history
        ) versions
        WHERE column_count != prev_count
        """,
    ]
    return queries

# Alert thresholds:
# - Partition count > 10,000: investigate partitioning strategy
# - Stale data > 7 days: check pipeline health
# - Schema drift: block downstream pipelines until resolved

What to log per pipeline run: rows_read, rows_written, files_created, partition_count_delta, bytes_written, duration_seconds, schema_version_hash. Alert on: zero rows written when source had rows, partition count growing 10x expected, schema hash mismatch vs expected.
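The alert conditions above can be expressed directly. A sketch, assuming the per-run metrics are collected into a dict whose keys follow the list above:

```python
def run_alerts(run, expected_schema_hash, expected_partition_delta=1):
    """Evaluate the three alert conditions for one pipeline run's metrics."""
    alerts = []
    # Zero rows written when the source had rows: likely a silent failure
    if run["rows_read"] > 0 and run["rows_written"] == 0:
        alerts.append("zero_rows_written")
    # Partition count growing 10x expected: probable partitioning bug
    if run["partition_count_delta"] > 10 * expected_partition_delta:
        alerts.append("partition_explosion")
    # Schema hash mismatch: upstream schema changed without coordination
    if run["schema_version_hash"] != expected_schema_hash:
        alerts.append("schema_drift")
    return alerts

run = {
    "rows_read": 1_000_000,
    "rows_written": 0,
    "partition_count_delta": 1,
    "schema_version_hash": "abc123",
}
print(run_alerts(run, expected_schema_hash="abc123"))  # ['zero_rows_written']
```

Wiring the returned alert names into whatever paging system the team uses is left out here; the point is that each condition is a one-line check once the metrics are logged.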

Handling Schema Evolution

Source systems change. A web application starts capturing a new field. A database adds a column. An API modifies a response format. Schema-on-read handles this gracefully in theory, but in practice you need processes for managing evolution.

Apache Spark and other engines support schema merging when reading multiple Parquet files with different schemas. New columns appear as nulls in older files, which is correct behavior but can cause subtle bugs if queries don’t handle nulls properly.

Better practice: explicitly manage schema changes through the ingestion pipeline. When a source schema changes, update the bronze layer’s metadata, propagate the change through silver processing, and notify downstream consumers. Delta Lake and Iceberg handle some of this automatically through schema evolution features.
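The merge behavior described above can be sketched in plain Python: union the column sets across files, then read older records with nulls for columns they predate. (In Spark this is the `mergeSchema` read option; the records here are hypothetical.)

```python
# Sketch of schema merging across files with different schemas.
# Spark's mergeSchema option does this at scale; this shows the semantics.

def merged_columns(file_schemas):
    """Union of all column names seen across files, in first-seen order."""
    columns = []
    for schema in file_schemas:
        for col in schema:
            if col not in columns:
                columns.append(col)
    return columns

def read_with_merged_schema(files):
    """Yield each record with every merged column; missing ones become None."""
    columns = merged_columns([list(records[0]) for records in files if records])
    for records in files:
        for record in records:
            yield {col: record.get(col) for col in columns}

old_file = [{"event_id": "a1", "event_type": "click"}]
new_file = [{"event_id": "b2", "event_type": "click", "campaign_id": "c9"}]
rows = list(read_with_merged_schema([old_file, new_file]))
print(rows[0]["campaign_id"])  # None -- the new column backfills as null
```

Those backfilled nulls are exactly the subtle-bug surface the text warns about: any downstream aggregate over `campaign_id` must decide what null means.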

Data Lake Architecture

The bronze/silver/gold medallion pattern organizes the data lake into quality tiers:

flowchart TD
    subgraph Sources[Source Systems]
        ERP[ERP System]
        CRM[CRM Platform]
        Logs[App Logs]
        IoT[IoT Sensors]
    end
    subgraph Bronze["Bronze Layer (Raw)"]
        B_ERP[s3://datalake/bronze/erp/]
        B_CRM[s3://datalake/bronze/crm/]
        B_Logs[s3://datalake/bronze/logs/]
        B_IoT[s3://datalake/bronze/iot/]
    end
    subgraph Silver["Silver Layer (Cleaned)"]
        S_Events[s3://datalake/silver/events/]
        S_Customers[s3://datalake/silver/customers/]
        S_Transactions[s3://datalake/silver/transactions/]
    end
    subgraph Gold["Gold Layer (Curated)"]
        G_Marts[Analytics Marts]
        G_Features[ML Feature Store]
        G_Reports[BI Datasets]
    end
    ERP --> B_ERP
    CRM --> B_CRM
    Logs --> B_Logs
    IoT --> B_IoT
    B_ERP --> S_Transactions
    B_CRM --> S_Customers
    B_Logs --> S_Events
    B_IoT --> S_Events
    S_Events --> G_Marts
    S_Customers --> G_Features
    S_Transactions --> G_Reports
    S_Events --> G_Features

Bronze holds bit-for-bit copies of source data. Silver cleans and conforms data into common schemas. Gold serves business-level aggregates and ML-ready features.

When Data Lakes Go Wrong

The data swamp is the most common failure mode. Without discipline around naming conventions, partitioning, and cataloging, lakes become indistinguishable from cheap storage buckets with extra steps.

Another failure mode: treating the data lake as a replacement for the data warehouse. Data lakes are excellent for raw storage, ML feature engineering, and discovery analytics. They’re poor at serving pre-aggregated metrics, handling complex joins across unrelated domains, and providing the query predictability that business intelligence requires. The warehouse and lake serve different purposes and should be architected together.

Security and governance also get neglected. Object storage permissions can be surprisingly coarse-grained, and without proper access controls, you end up with either data silos (everyone gets their own bucket) or security risks (too many users have too much access). Integration with a data catalog’s governance features helps track who accesses what and for what purpose.

Quick Recap

  • Data lakes store raw data in schema-on-read mode — flexibility for discovery, discipline required to avoid swamps.
  • Use Delta Lake or Iceberg table formats instead of raw Parquet — ACID transactions and time travel matter in production.
  • Partition on the attributes your queries actually filter by — date partitioning is the default for time-series data; avoid high-cardinality partition keys.
  • Target 128MB–1GB compressed file sizes — smaller files cause query engine overhead, larger files reduce parallelism.
  • Catalog every dataset with schema, ownership, and freshness metrics — without metadata, your lake is a black hole.
  • Monitor partition counts, file sizes, and schema drift — catch data swamp formation early.

Conclusion

A data lake is infrastructure for discovery as much as for reporting. It stores everything inexpensively, preserves raw data for future reprocessing, and enables workloads that traditional warehouses cannot support.

The architecture choices that matter most: the table format (Delta Lake or Iceberg), partitioning strategy (match your query patterns), and metadata management (catalog everything). Get these right and your lake remains useful for years. Get them wrong and you will spend cycles on data archaeology instead of building features.

For most teams, the right architecture is a data lake for raw storage and exploration, a processing layer (Spark, dbt, or Flink) for transformations, and a data warehouse for curated, business-level datasets. Each component has its role.

To unify lake storage with warehouse reliability, consider the lakehouse architecture. Understanding data warehouse architecture helps clarify when to use lake versus warehouse.
