Data Engineering Roadmap: From Pipelines to Data Warehouse Architecture
A practical learning path for building reliable data pipelines, choosing between batch and stream processing, and designing analytics infrastructure that actually works in production.
Data Engineering Roadmap
Data engineering is the craft of building and maintaining the infrastructure that collects, processes, and stores data for analysis. Data scientists build models and analysts build dashboards. Data engineers build the pipelines that move terabytes of data daily, transform raw events into analytics-ready datasets, and keep the systems running that data teams depend on.
This roadmap teaches you the full data engineering stack — from moving data between systems to building analytical models in a data warehouse. You’ll learn how to design pipelines that don’t fall apart at 2am, when to use batch versus stream processing, how to build warehouses that stay fast as they grow, and how to catch data quality problems before they reach your analysts.
Whether you’re at a startup building your first analytics setup or modernizing enterprise ETL, these skills translate directly to the problems you’ll actually face.
Before You Start
You should know basic SQL (SELECT, JOIN, GROUP BY), at least one programming language — Python is the most common choice — and understand how applications produce and consume data. If you’ve worked with databases before, you’re in a good place to start.
The Roadmap
🔗 Data Integration Basics
🔄 ETL and Data Pipelines
📬 Stream Processing
🏢 Data Warehouse
🔍 Data Processing Engines
🗺️ Data Modeling
🔐 Data Quality & Governance
⚙️ Pipeline Operations
☁️ Cloud Data Services
🎯 Next Steps
Timeline & Milestones
📅 Estimated Timeline
🎓 Capstone Track
- Design a dimensional model for a business domain (e-commerce, logistics, or finance)
- Implement CDC-based incremental data ingestion from a source database
- Build both batch ETL and real-time stream processing paths
- Load data into a star schema data warehouse with slowly changing dimensions
- Set up dbt transformations with data quality tests and documentation
- Configure Airflow orchestration with monitoring and alerting
- Implement data lineage tracking from source to consumption
- Add PII handling and de-identification for sensitive fields
- Create data contracts between producer and consumer teams
- Build audit trail logging for compliance requirements
- Migrate pipeline to AWS/GCP/Azure cloud services
- Implement serverless ETL components for cost optimization
- Set up cost monitoring and optimize resource utilization
- Configure disaster recovery and data migration procedures
Milestone Markers
| Milestone | When | What you can do |
|---|---|---|
| Data Foundations | Week 2 | Move data between databases using CDC and replication. Work with object storage (S3) and understand data formats (Parquet, Avro, JSON). Handle schema evolution without breaking downstream consumers. |
| Pipeline Engineering | Week 4 | Build robust ETL pipelines with proper data quality checks. Implement incremental loads using watermarks and CDC. Orchestrate complex workflows with Airflow, Prefect, or Dagster. Handle backfills and historical data reprocessing. |
| Stream Processing | Week 6 | Design event streaming architectures with Kafka. Process streams with Flink or Spark Streaming. Implement exactly-once semantics and handle backpressure in slow consumer scenarios. |
| Storage & Schema | Week 9 | Choose between star and snowflake schemas for analytics workloads. Distinguish between data lake, lakehouse, and data warehouse patterns. Apply dimensional modeling using Kimball methodology. |
| Quality & Operations | Week 12 | Implement data validation and anomaly detection. Track data lineage from source to consumer. Handle PII with proper de-identification. Configure alerting for pipeline failures with SLAs. |
| Capstone Complete | Week 14 | Design and deploy a complete cloud-native data architecture — from source ingestion through stream processing, warehouse storage, dbt transformations, and governance — with production-grade monitoring and cost optimization. |
Core Topics: When to Use / When Not to Use
ETL vs ELT — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Complex transformations that need dedicated transformation infrastructure | Cloud warehouses with powerful native SQL engines (use ELT instead) |
| Data coming from diverse sources needing heavy cleansing and normalization | When your transformation layer (dbt, Spark) lives inside the warehouse |
| Strict compliance requirements needing isolated transformation environments | Scenarios where speed of loading matters more than transformation complexity |
| Legacy systems where you need to reshape data before loading | When your team is more comfortable with SQL than external pipeline tools |
Trade-off Summary: ETL (Extract-Transform-Load) transforms data before loading into the warehouse, keeping raw data separate and enabling heavy transformation infrastructure. ELT (Extract-Load-Transform) loads raw data first then transforms in-database using warehouse compute — faster to set up and leverages warehouse power but requires careful permission management. Modern cloud data stacks increasingly favor ELT with dbt for transformations, but ETL remains valuable when transformation logic is complex or compliance requires separation.
Batch vs Stream Processing — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Analytics workloads with acceptable data freshness of minutes to hours | Sub-second freshness requirements (use stream processing) |
| Large dataset aggregations and complex joins across historical data | Low-latency event-driven applications (Kafka, Flink are better) |
| Reproducible pipeline runs with idempotent semantics | When you need real-time dashboards or alerting on event streams |
| Cost-sensitive workloads where you can schedule during off-peak | Spiky event volumes that require elastic scaling (stream handles this better) |
Trade-off Summary: Batch processing handles large volumes efficiently with lower infrastructure cost per record and natural fault-tolerance through recomputation. Stream processing delivers sub-second latency but requires managing state, exactly-once semantics, and handling backpressure. Lambda architecture (both batch and stream) provides freshness at the cost of doubled complexity — most practical systems pick one based on their freshness requirements and add the other only when genuinely needed.
Star Schema vs Snowflake Schema — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Simple analytics queries with fast BI dashboard performance | Highly normalized operational data models (normalize instead) |
| Ad-hoc queries from business users who need intuitive table relationships | Scenarios with many normalized hierarchies (snowflake handles this) |
| Low-latency query requirements (denormalized = fewer joins) | When storage space is critical (snowflake uses less space) |
| Teams without experienced modelers (star is easier to understand) | Slowly changing dimension complexity requiring Type 2 tracking across many hierarchies |
Trade-off Summary: Star schema denormalizes dimension tables into the fact table for query simplicity and performance — business users can navigate it easily and BI tools generate faster SQL. Snowflake schema normalizes dimensions further into sub-dimensions, reducing storage and handles complex hierarchical relationships better, but increases join complexity. Most data warehouses start with star for speed and normalize to snowflake only when dimension complexity genuinely demands it.
Data Lake vs Lakehouse — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Storing large volumes of raw data for future exploration without defined schema | Directly serving analytics queries without transformation layer |
| Machine learning training data requiring massive dataset access | Scenarios needing strong ACID guarantees on frequent updates |
| Cost-effective storage for rarely-accessed historical data | When you need reliable concurrent reads and writes (use lakehouse instead) |
| Multi-format data storage (images, logs, structured data) without predefined schema | Unstructured data with consistent query performance requirements |
Trade-off Summary: Data lakes provide cheap, scalable object storage for any data format, making them ideal for raw data and ML workloads. However, they lack ACID transactions and reliable concurrent access, leading to “data swamp” problems. Lakehouse architecture (Delta Lake, Iceberg) adds transactional guarantees and reliable query performance while maintaining lake economics — worth the added complexity when you need both raw data storage and performant analytics on the same platform.
Apache Kafka vs Cloud-native Streaming Services — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Maximum control over streaming infrastructure and tuning | Teams wanting minimal operational overhead |
| Multi-cloud or hybrid cloud architectures needing consistent streaming layer | When your workload fits entirely within AWS/GCP/Azure ecosystem |
| Complex event processing with Exactly-Once semantics required | Simple queue-based workloads without stream processing needs |
| Open-source requirement or avoiding vendor lock-in | When Kafka’s operational complexity outweighs your team’s capabilities |
Trade-off Summary: Apache Kafka is the industry standard for distributed streaming with mature ecosystem, exactly-once delivery, and maximum flexibility. Cloud services (Kinesis, Pub/Sub, Event Hubs) offer operational simplicity with pay-per-use pricing but create vendor lock-in and limit tuning options. For startups or teams with limited ops capacity, cloud-native is pragmatic. For enterprises with multi-cloud strategy or complex streaming requirements, Kafka provides the control and consistency needed across environments.
dbt vs Stored Procedures — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| SQL-first teams wanting version-controlled transformations | Complex procedural logic beyond SQL (use Python/Spark instead) |
| Analytics engineers owning transformation logic without strong dev skills | Very large-scale transformations requiring distributed compute (Spark better) |
| Cross-database transformations with centralized documentation | Scenarios requiring heavy procedural loops and complex state management |
| Testing and CI/CD for transformation logic | When transformation logic needs to run outside the warehouse |
Trade-off Summary: dbt brings software engineering practices (version control, testing, documentation, CI/CD) to SQL transformations with analytics engineer accessibility. Stored procedures keep logic in-database but lack versioning, testing, and documentation standards — leading to transformation logic that becomes incomprehensible over time. dbt’s modular approach (models, seeds, macros) makes transformation logic maintainable and auditable, making it the clear choice for teams wanting sustainable transformation architecture.
Data Vault vs Kimball Dimensional Modeling — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Enterprise environments with complex source system integrations | Small teams needing quick analytics delivery (Kimball is faster) |
| Auditing and regulatory compliance requiring complete data traceability | Business user-facing analytics requiring intuitive table names (Kimball wins) |
| Rapid integration of new source systems without modifying existing structures | Scenarios where query performance is critical (denormalized star is faster) |
| Parallel development across multiple teams adding data independently | When you need simple dimensional hierarchies (snowflake handles this) |
Trade-off Summary: Data Vault provides auditability, scalability, and decoupled integration at the cost of query complexity — business users find it harder to navigate. Kimball delivers intuitive, performant dimensional models that business users love but requires careful upfront design and becomes fragile when sources change. Data Vault is enterprise-scale for audit-heavy industries; Kimball is analytics-focused for BI teams. Many modern architectures use Data Vault for raw integration layers then denormalize to dimensional models for consumption.
Resources
- Data Engineering with Python and AWS Lambda — Louis-Etienne Dorn. Practical if you’re already on AWS.
- The Data Warehouse Toolkit — Ralph Kimball. The classic on dimensional modeling, still relevant.
- Designing Data-Intensive Applications — Martin Kleppmann. Not strictly a data engineering book, but it will make you understand databases and distributed systems properly.
Official Documentation
- Apache Kafka Documentation
- Apache Spark Documentation
- Apache Airflow Documentation
- dbt Documentation
Reference Architecture
- AWS Big Data Blog
- GCP Data Analytics Blog
- Patterns of Distributed Systems — Martin Fowler’s collection, useful when you’re dealing with consistency and coordination problems.
Category
Related Posts
Data Warehouse Architecture: Building the Foundation for Analytics
Learn the core architectural patterns of data warehouses, from ETL pipelines to dimensional modeling, and how they enable business intelligence at scale.
Data Warehousing
OLAP vs OLTP comparison. Star and snowflake schemas, fact and dimension tables, slowly changing dimensions, and columnar storage in data warehouses.
Operating Systems Roadmap
A comprehensive learning path from computing fundamentals to advanced operating system concepts. Master process management, memory allocation, file systems, and concurrency.