Apache Beam: Portable Batch and Streaming Pipelines

Discover how Apache Beam's unified programming model lets you write batch and streaming pipelines once and run them on Spark, Flink, or cloud runners.

Reading time: 11 min read

Apache Beam: The Portable Framework for Batch and Streaming

Data pipelines come in two flavors: batch jobs process accumulated data, while streaming jobs handle data as it arrives. Historically, you would write different code using different frameworks for each: batch might use Spark, streaming might use Flink or Kafka Streams. When requirements changed or infrastructure evolved, you would port everything.

Apache Beam proposed a different approach: write your pipeline once using a unified model, then execute it on the runner that fits your needs. The same code runs on Spark, Flink, Google Cloud Dataflow, or portable runners. You choose the execution environment based on operational requirements, not lock-in.

The Core Abstraction: PCollections and Transforms

Beam pipelines center on PCollections (parallel collections) and Transforms (operations). A PCollection represents a distributed dataset—bounded for batch, unbounded for streaming. A Transform is a processing step that reads from one or more PCollections and outputs to another.

// Java SDK example: Count words in a collection of text
public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    PCollection<String> lines = pipeline
        .apply("ReadLines", TextIO.read().from("gs://data-bucket/input/*.txt"));

    PCollection<String> words = lines
        .apply("ExtractWords", FlatMapElements.into(TypeDescriptors.strings())
            .via(line -> Arrays.asList(line.split("\\s+"))));

    PCollection<KV<String, Long>> wordCounts = words
        .apply("CountWords", Count.perElement());

    wordCounts.apply("WriteResults", TextIO.write().to("gs://data-bucket/output/wordcounts")
        .withNumShards(1));

    pipeline.run().waitUntilFinish();
}

The Python SDK reads similarly, trading Java’s verbosity for Python’s readability:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    (p
     | 'ReadLines' >> beam.io.ReadFromText('gs://data-bucket/input/*.txt')
     | 'ExtractWords' >> beam.FlatMap(lambda x: x.split())
     | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
     | 'GroupAndSum' >> beam.CombinePerKey(sum)
     | 'WriteResults' >> beam.io.WriteToText('gs://data-bucket/output/wordcounts'))

The key insight is that the pipeline definition is independent of execution. pipeline.run() targets whatever runner you’ve configured—Flink, Spark, Dataflow, or a local runner for testing.

Windowing: Handling Unbounded Data

Unbounded PCollections (streaming data) require windowing to divide time into finite chunks for aggregation. Beam provides several windowing strategies.

flowchart TD
    subgraph Fixed["Fixed Windows (5-min tumbling)"]
        F1["00:00-00:05"]
        F2["00:05-00:10"]
        F3["00:10-00:15"]
    end
    subgraph Sliding["Sliding Windows (1-hour, slide 5-min)"]
        S1["00:00-01:00"]
        S2["00:05-01:05"]
        S3["00:10-01:10"]
    end
    subgraph Session["Session Windows (30-min gap)"]
        E1(["event A"])
        E2(["event B"])
        E3(["event C"])
        GAP1["30-min gap"]
        E4(["event D"])
        E5(["event E"])
        W1["Session 1: A, B, C"]
        W2["Session 2: D, E"]
    end
    E1 --> W1
    E2 --> W1
    E3 --> W1
    GAP1 ---|no events| W2
    E4 --> W2
    E5 --> W2

Fixed windows create non-overlapping intervals of fixed duration. “Tumbling windows of 5 minutes” means every 5 minutes you emit the aggregate for that window. Fixed windows are the simplest model and match most alerting and monitoring use cases.
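
The assignment is simple modular arithmetic. A plain-Python sketch (hypothetical helper, timestamps in seconds):

```python
def fixed_window(ts, size=300):
    """Assign an event timestamp to its tumbling window [start, start + size)."""
    start = ts - (ts % size)  # round down to the nearest window boundary
    return (start, start + size)
```

An event at second 437 lands in the (300, 600) window; every event maps to exactly one window.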

Sliding windows overlap. “Windows of 1 hour, sliding every 5 minutes” produces outputs every 5 minutes, each covering the preceding hour. Sliding windows smooth out fluctuations and are useful for moving averages.
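
Each event therefore belongs to size / period windows at once. A plain-Python sketch of the assignment (hypothetical helper, seconds):

```python
def sliding_windows(ts, size=3600, period=300):
    """Return every sliding window [start, start + size) containing ts;
    window starts fall on multiples of period."""
    latest = ts - (ts % period)  # newest window that contains ts
    return [(s, s + size) for s in range(latest, latest - size, -period)]
```

With a 1-hour size and 5-minute period, each event is counted in 12 overlapping windows.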

Session windows group events by activity gaps. If no events arrive for a customer for 30 minutes, a session window closes. Session windows adapt to data patterns rather than fixed time boundaries.
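
A plain-Python sketch of the gap logic (hypothetical helper; timestamps in seconds, 30-minute gap by default):

```python
def sessionize(timestamps, gap=1800):
    """Group sorted event timestamps into sessions; a new session starts
    whenever the quiet period since the previous event reaches `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)  # continue the current session
        else:
            sessions.append([ts])    # gap exceeded: open a new session
    return sessions
```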

# Sliding window: 1-hour windows updated every 5 minutes
(p
 | 'WindowIntoSliding' >> beam.WindowInto(
     beam.window.SlidingWindows(3600, 300))  # 1-hour size, 5-min period (seconds)
 | 'Aggregate' >> beam.CombinePerKey(average_revenue))  # average_revenue: your CombineFn

Watermarks handle late-arriving data. When a window closes, you might still receive events that belong to that window from before it closed. The watermark estimates when all data for a given window should have arrived. Events after the watermark but before garbage collection are considered late and can trigger a late-fire output or be dropped.
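
That lifecycle can be sketched in plain Python (hypothetical helper; times in seconds, allowed lateness is an assumed 10 minutes):

```python
def classify_event(window_end, watermark, allowed_lateness=600):
    """Where does an arriving event land in its window's lifecycle?"""
    if watermark < window_end:
        return 'on-time'   # window still open; counted in the on-time pane
    if watermark < window_end + allowed_lateness:
        return 'late'      # may produce a late firing
    return 'dropped'       # window garbage-collected; event is discarded
```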

The Runner Landscape

Beam’s portability means you can run the same pipeline on different runners. Each runner has different strengths.

Apache Flink is a full-featured streaming engine with exactly-once semantics, sophisticated windowing, and checkpoint-based fault tolerance. Running Beam on Flink gives you Flink’s operational features with Beam’s programming model.

Apache Spark is the dominant batch processing runner. Beam on Spark handles batch workloads and structured streaming through Spark’s micro-batch engine.

Google Cloud Dataflow is a fully managed runner on Google Cloud. You write Beam code, submit it to Dataflow, and Google handles provisioning, scaling, and monitoring. No cluster management. Dataflow’s dynamic rebalancing automatically distributes work across workers, even when data is skewed.

Runner Comparison

| Aspect | DirectRunner | FlinkRunner | SparkRunner | DataflowRunner |
| --- | --- | --- | --- | --- |
| Use case | Local testing | Production streaming | Batch + micro-batch | Managed cloud |
| Exactly-once | No | Yes | Yes (with checkpointing) | Yes |
| Latency | N/A (local) | Sub-second | Seconds (micro-batch) | Sub-second |
| Scaling | None (single node) | Manual cluster config | Manual cluster config | Auto-scaling, managed |
| State management | In-memory | RocksDB state backend | Checkpointed RDDs | Streaming engine |
| Windowing support | Full | Full | Limited (structured streaming) | Full |
| Side inputs | Yes | Yes | Yes | Yes |
| Operational burden | Low | High | High | Low (managed) |
| Cost model | Local only | Your infrastructure | Your infrastructure | Pay-per-use (GCP) |
| Best for | CI/CD, local dev | On-prem streaming | Spark shops migrating | GCP-native teams |
# Run same pipeline on different runners
# Local (for testing)
python my_pipeline.py --runner=DirectRunner

# Flink cluster
python my_pipeline.py --runner=FlinkRunner --flink_master=flink:8081

# Google Cloud Dataflow
python my_pipeline.py --runner=DataflowRunner \
    --project=my-project \
    --region=us-central1 \
    --temp_location=gs://my-temp-bucket/tmp

Side Inputs and Side Outputs

Beyond the main data flow, Beam supports side inputs (additional data available at runtime) and side outputs (routing data to different sinks based on content).

Side inputs let a DoFn access data from another PCollection at runtime. If you have a lookup table of customer attributes and you need to enrich events with those attributes, side input provides the lookup:

# Side input for customer enrichment
customer_lookup = (p
    | 'ReadCustomers' >> beam.io.ReadFromJdbc(
        driver_class_name='org.postgresql.Driver',
        jdbc_url='jdbc:postgresql://db:5432/warehouse',
        table_name='dim_customer')
    | 'AsDict' >> beam.Map(lambda row: (row['customer_id'], row)))

enriched_events = (
    events
    | 'EnrichWithCustomer' >> beam.Map(
        lambda event, cust: {**event,
                             'customer_name': cust.get(event['customer_id'], {}).get('name')},
        cust=beam.pvalue.AsDict(customer_lookup)))  # side input passed as an extra argument

Side outputs route records to different destinations based on logic:

# Side output: separate clean and dirty records
clean, dirty = (
    records
    | 'ParseRecords' >> beam.ParDo(ParseRecords()).with_outputs('clean', 'dirty')
)

clean | 'WriteClean' >> beam.io.WriteToBigQuery('project:dataset.clean')
dirty | 'WriteDirty' >> beam.io.WriteToText('gs://dirty-records-bucket/dirty')
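
The ParseRecords DoFn above is assumed rather than shown; its routing logic might look like this plain-Python sketch, where the first tuple element would become the TaggedOutput tag:

```python
import json

def route_record(raw):
    """Hypothetical routing logic a ParseRecords DoFn could wrap:
    parseable JSON goes to 'clean', everything else to 'dirty'."""
    try:
        return ('clean', json.loads(raw))
    except (json.JSONDecodeError, TypeError):
        return ('dirty', raw)
```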

When Beam Makes Sense

Beam is valuable when portability across runners matters. If you’re running on-premises today but planning a cloud migration, you can write pipelines in Beam and avoid rewrites. If you’re comparing runners for performance or operational reasons, Beam lets you test without rewriting.

Beam’s unified model also clarifies thinking. Batch and streaming often have the same logical structure—read, transform, write—with different windowing semantics. Beam forces you to be explicit about windowing, watermarks, and triggering, which prevents subtle batch-versus-streaming bugs.

The cost is abstraction overhead. Beam’s model is more constrained than native Flink or Spark. If you need advanced features that Beam doesn’t expose, you end up fighting the abstraction or dropping to runner-specific APIs.

For most teams building new pipelines, Beam is worth considering if you value portability or if you’re already on Google Cloud and want managed execution through Dataflow. For teams deeply invested in Spark or Flink ecosystems, native APIs sometimes offer more capabilities.

Production Failure Scenarios

Watermark stalls causing window emission delays

A pipeline reads from Kafka with a watermark that estimates when all events for a time window should have arrived. A network partition causes a 10-minute delay in one Kafka partition. The watermark stalls, delaying window triggering for all partitions. On-time aggregations wait; late data accumulates. Dashboard queries show gaps until the watermark advances.

Mitigation: Set withAllowedLateness() explicitly to hold windows open longer. Use AfterWatermark.pastEndOfWindow().withEarlyFirings(...) to emit speculative partial results before the watermark passes the end of the window. Monitor watermark lag per key with custom metrics.

Late-fire trigger firing multiple times

A session window uses AfterWatermark to fire on-time and AfterPane.elementCountAtLeast(1) for late data. When late elements arrive after watermark, the trigger fires once per element instead of once per window. Side effect outputs (BigQuery writes, Pub/Sub sends) accumulate duplicate actions.

Mitigation: Use AfterWatermark.pastEndOfWindow().withLateFirings(AfterPane.elementCountAtLeast(N)) so late panes fire after batches of N late elements rather than once per element. Design side effects to be idempotent or use exactly-once delivery guarantees at the sink.

Checkpoint state loss on DirectRunner

A test pipeline uses DirectRunner for production (a mistake). Checkpoint state is stored in memory and lost on restart. The pipeline reprocesses from the beginning, producing duplicate outputs at non-idempotent sinks and incorrect results for aggregations that assume each record is seen once.

Mitigation: Use DirectRunner only for local development and CI, never in production. On production runners, configure checkpointing explicitly (for example, Flink's checkpointing_interval). Add pipeline start timestamps to outputs for duplicate detection.

Dataflow autoscaling causing memory pressure

Dataflow’s autoscaler adds workers mid-pipeline when CPU utilization exceeds 75%. A heavy GroupByKey operation then shuffles data across the new workers, exceeding memory limits on individual workers. The job fails with OOM rather than gracefully redistributing state.

Mitigation: Estimate peak memory per worker before scaling events. Set machine_type explicitly rather than letting Dataflow choose. Monitor system_pass_progress_wait_time in Dataflow metrics—if it’s high, the pipeline is waiting on resources, not just scheduling.

Apache Beam Observability

Track these metrics to keep Beam pipelines healthy:

# Key Beam metrics to monitor via runner-specific monitoring
beam_metrics = {
    # Latency
    "element_count": "Input elements per bundle",
    "wall_time_ms": "Bundle processing time",
    "system_pass_progress_wait_time": "Time waiting for resources",

    # Watermark progress
    "watermark_delay_sec": "How far behind the event time watermark is",
    "unprocessed_records": "Records waiting in buffers",

    # Output freshness
    "output_watermark_delay_sec": "Latency between processing and output",
    "late_data_dropped_count": "Records dropped after watermark close",

    # Dataflow-specific
    "DataflowJob_currentCPI": "CPU utilization (alert if > 75%)",
    "DataflowJob_pendingness": "Bundles waiting for workers"
}

# Python: access Beam metrics in a DoFn
class MetricsDoFn(beam.DoFn):
    def __init__(self):
        super().__init__()
        # Create the counter once, not on every element
        self.processed = beam.metrics.Metrics.counter(self.__class__, 'processed_elements')

    def process(self, element):
        self.processed.inc()
        yield element

Log per bundle: bundle ID, start/end time, element count, watermark at start. Alert on: watermark lag > 5 minutes, late data rate > 1%, Dataflow pendingness sustained > 100 bundles.
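
Those alert conditions can be encoded as a trivial check (hypothetical helper; threshold values taken from the text above):

```python
def should_page(watermark_lag_sec, late_data_rate, pending_bundles):
    """Alert when any threshold is breached: watermark lag > 5 minutes,
    late-data rate > 1%, or sustained pendingness > 100 bundles."""
    return (watermark_lag_sec > 300
            or late_data_rate > 0.01
            or pending_bundles > 100)
```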

Apache Beam Anti-Patterns

Using DirectRunner in production. DirectRunner is single-threaded, stores state in memory, and has no fault tolerance. It is designed only for local development and testing. Production DirectRunner pipelines will lose state on failure and cannot scale horizontally.

Ignoring watermark configuration. Default watermark behavior is conservative. If your data source has known latencies (e.g., mobile events batched overnight), set withAllowedLateness() explicitly, or windows will close before late-arriving data lands.

Non-idempotent sinks with at-least-once runners. FlinkRunner with checkpointing gives exactly-once guarantees within the pipeline, but side effects (database writes, API calls) are at-least-once. Design sinks to handle duplicates, or use exactly-once sinks (BigQuery streaming inserts with deduplication).

GroupByKey before windowing. GroupByKey materializes all elements for a key in memory before passing to downstream transforms. For high-cardinality keys or unbounded groups, this causes memory pressure. Use CombinePerKey with incremental aggregation instead, or window before grouping.
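
The memory difference is easy to see in plain Python: incremental combining keeps one accumulator per key, the way CombinePerKey does, instead of materializing every value for a key the way GroupByKey does. A hypothetical sketch:

```python
def combine_per_key(pairs, combine=lambda acc, v: acc + v, init=0):
    """Incremental per-key aggregation: memory is O(distinct keys),
    not O(total elements), because values are folded in as they arrive."""
    acc = {}
    for k, v in pairs:
        acc[k] = combine(acc.get(k, init), v)
    return acc
```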

Quick Recap

  • Beam’s unified model expresses batch and streaming with the same API—write once, run on Flink, Spark, Dataflow, or DirectRunner.
  • Windowing divides unbounded streams into finite chunks: fixed (non-overlapping), sliding (overlapping), or session (gap-based).
  • Watermarks estimate when all data for a window has arrived; withAllowedLateness() controls how late data is handled.
  • Use DirectRunner for local testing only; FlinkRunner for on-prem streaming; DataflowRunner for managed GCP execution.
  • Design all side effects to be idempotent since pipeline delivery guarantees differ from sink guarantees.

For distributed batch processing at scale, see Apache Spark. For real-time stream processing, Apache Flink provides native streaming with advanced state management.

Conclusion

Apache Beam landed as a response to a real problem: writing separate code for batch and streaming, then being stuck with whatever runner you chose. Its unified model means the same pipeline handles both, and the runner is a deployment choice rather than a design constraint.

The tradeoff is real. Beam cannot express everything that native Flink or Spark can do, and when you hit those edges, you end up either fighting the abstraction or reaching for runner-specific APIs. For GCP teams who want managed execution, DataflowRunner removes a lot of operational overhead. For teams already running Flink or Spark, the portability benefit shrinks.

What Beam did well was articulate what batch and streaming share. That framing helped teams reason about their pipelines more clearly, even if they ultimately used runner-specific APIs instead.

For single-node analytical queries, DuckDB offers excellent performance without cluster overhead.
