Azure Data Services: Data Factory, Synapse, and Event Hubs

Build data pipelines on Azure with Data Factory, Synapse Analytics, and Event Hubs. Learn integration patterns, streaming setup, and data architecture.


Azure Data Services: Data Factory, Synapse, Event Hubs, and Blob Storage

Azure has a comprehensive data platform that integrates well with the Microsoft ecosystem. If your organization already uses Microsoft products, Azure data services connect naturally with Power BI, Excel, and the rest of the stack.

This guide covers the key Azure data services and shows how to build modern data architectures on Azure.

Service Overview

| Service | Type | Use Case |
| --- | --- | --- |
| Data Factory | Orchestration | Pipeline orchestration, ETL |
| Synapse Analytics | Data Warehouse | Enterprise analytics, BI |
| Event Hubs | Streaming | High-throughput event ingestion |
| Blob Storage | Object Storage | Data lake, unstructured storage |
| Databricks | Spark Analytics | ML, advanced analytics |
| Stream Analytics | Stream Processing | Real-time analytics on streams |

Azure Event Hubs

Event Hubs handles durable, ordered event streaming. It also speaks the Kafka protocol, which is useful if you have Kafka tooling already.

Event Hubs Namespace and Hubs

from azure.eventhub import EventData, EventHubProducerClient, EventHubConsumerClient
import json

# Create producer
producer = EventHubProducerClient.from_connection_string(
    conn_str='Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=...',
    eventhub_name='click-events'
)

# Send events
def send_events(events):
    with producer:
        event_batch = producer.create_batch()
        for event in events:
            event_batch.add(EventData(json.dumps(event)))
        producer.send_batch(event_batch)

# Consumer
consumer = EventHubConsumerClient.from_connection_string(
    conn_str='Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=...',
    consumer_group='$Default',
    eventhub_name='click-events'
)

def on_event(partition_context, event):
    data = json.loads(event.body_as_str())
    process_event(data)
    partition_context.update_checkpoint(event)

with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")

Event Hubs Kafka Compatibility

import json

from kafka import KafkaProducer, KafkaConsumer

# Use Kafka protocol to interact with Event Hubs
producer = KafkaProducer(
    bootstrap_servers='namespace.servicebus.windows.net:9093',
    security_protocol='SASL_SSL',
    sasl_mechanism='PLAIN',
    sasl_plain_username='$ConnectionString',
    sasl_plain_password='Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=...',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

producer.send('click-events', {'user_id': '123', 'action': 'page_view'})

Azure Data Factory

Data Factory provides visual pipeline orchestration with 90+ connectors.

Data Factory Pipeline

{
  "name": "ingest_and_transform",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToADLS",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "BlobSourceDataset",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "ADLSSinkDataset",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      },
      {
        "name": "RunDatabricksNotebook",
        "type": "DatabricksNotebook",
        "linkedServiceName": {
          "referenceName": "DatabricksLinkedService",
          "type": "LinkedServiceReference"
        },
        "parameters": {
          "sourcePath": "events.parquet",
          "outputPath": "processed/events"
        }
      }
    ]
  }
}

Mapping Data Flows

# Data Flow transformation (equivalent in ADF visual)
# Source: Blob Storage
# Derive: Add computed columns
# Select: Choose output columns
# Sink: Azure Synapse

# Mapping Data Flow script
dataflow = """
source(
    allowSchemaDrift: true,
    validateSchema: false,
    ignoreNoFilesFound: true
)
~> BlobSource

BlobSource derive(
    event_date = toDate(timestamp),
    event_hour = hour(timestamp),
    user_segment = iif(total_purchases > 10, 'high_value', 'regular')
)
~> DerivedColumns

DerivedColumns select(
    allowSchemaDrift: true,
    validateSchema: false,
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true
)
(
    output(
        event_id,
        user_id,
        event_type,
        timestamp,
        event_date,
        event_hour,
        user_segment
    )
)
~> SelectColumns

SelectColumns sink(
    type: 'synapse',
    LinkedServiceName: 'SynapseWorkspace',
    Table: 'dbo.events'
)
"""

Triggering Pipelines

import time

from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
client = DataFactoryManagementClient(credential, 'subscription-id')

# Trigger pipeline run
run_response = client.pipelines.create_run(
    resource_group_name='data-rg',
    factory_name='datafactory',
    pipeline_name='ingest_and_transform',
    parameters={
        'inputPath': 'raw/events/2026/03/27',
        'outputPath': 'processed/events/2026/03/27'
    }
)

# Wait for completion
while True:
    run = client.pipeline_runs.get(
        resource_group_name='data-rg',
        factory_name='datafactory',
        run_id=run_response.run_id
    )
    if run.status in ['Succeeded', 'Failed', 'Cancelled']:
        break
    time.sleep(30)

Azure Synapse Analytics

Synapse combines data warehousing and big data analytics in one platform.

Dedicated SQL Pool

import pyodbc

# Placeholder credentials; in production prefer Azure AD authentication
# or pull these from Key Vault
username = 'sqladmin'
password = '<secret-from-key-vault>'

# Connect to Synapse
server = 'synapse-workspace.sql.azuresynapse.net'
database = 'analytics'
conn = pyodbc.connect(
    f'DRIVER={{ODBC Driver 17 for SQL Server}};'
    f'SERVER={server};'
    f'DATABASE={database};'
    f'UID={username};'
    f'PWD={password}'
)

# Create external table
cursor = conn.cursor()
cursor.execute("""
    CREATE EXTERNAL TABLE dbo.user_events
    (
        event_id BIGINT,
        user_id VARCHAR(50),
        event_type VARCHAR(50),
        timestamp DATETIME2,
        properties VARCHAR(MAX)
    )
    WITH (
        LOCATION = 'processed/events/',
        DATA_SOURCE = 'datalake',
        FILE_FORMAT = 'parquet_format'
    )
""")

Serverless SQL Pool

# Query data directly in storage without moving it
query = """
SELECT
    user_id,
    event_type,
    COUNT(*) as event_count,
    MIN(timestamp) as first_seen,
    MAX(timestamp) as last_seen
FROM OPENROWSET(
    BULK 'processed/events/year=2026/month=03/*.parquet',
    DATA_SOURCE = 'datalake',
    FORMAT = 'PARQUET'
) AS events
GROUP BY user_id, event_type
ORDER BY event_count DESC
"""

import pandas as pd

df = pd.read_sql(query, conn)

Spark Integration

from pyspark.sql import SparkSession

# Inside a Synapse Spark pool a session already exists; building one
# explicitly is only needed when running outside the workspace
spark = SparkSession.builder \
    .appName("SynapseSpark") \
    .getOrCreate()

# Read from ADLS Gen2 (abfss = the secure ABFS driver)
df = spark.read \
    .parquet("abfss://datalake@storageaccount.dfs.core.windows.net/processed/events/")

# Process and write to a dedicated SQL pool table via the built-in
# Synapse Spark connector (<database>.<schema>.<table>)
df.write \
    .mode("append") \
    .synapsesql("analytics.dbo.events")

Blob Storage and Data Lake

Azure Data Lake Storage (ADLS) provides tiered blob storage with hierarchical namespace.

ADLS Setup

from azure.storage.filedatalake import DataLakeServiceClient

# Create Data Lake service client
service_client = DataLakeServiceClient.from_connection_string(
    conn_str='DefaultEndpointsProtocol=https;AccountName=storageaccount;AccountKey=...'
)

# Create filesystem (container)
filesystem_client = service_client.create_file_system('datalake')

# Create directory
directory_client = filesystem_client.create_directory('raw/events')

# Upload file
file_client = directory_client.create_file('2026/03/27/events.parquet')
with open('events.parquet', 'rb') as f:
    data = f.read()
file_client.append_data(data, offset=0, length=len(data))
file_client.flush_data(len(data))

Data Lake Access Patterns

| Access Pattern | Storage Tier | Use Case |
| --- | --- | --- |
| Hot | Blob Hot | Active processing |
| Cool | Blob Cool | Recent data, infrequent access |
| Archive | Blob Archive | Historical, rarely accessed |
| Premium | Block Blob | High IOPS requirements |
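
The tier decision can be encoded as a simple rule. In this sketch the day-count cutoffs are illustrative assumptions, not Azure-mandated boundaries:

```python
def choose_blob_tier(days_since_last_access: int, needs_high_iops: bool = False) -> str:
    """Pick a storage tier from access recency and IOPS needs.

    Thresholds (30/180 days) are example policy values, not Azure limits.
    """
    if needs_high_iops:
        return 'Premium'
    if days_since_last_access <= 30:
        return 'Hot'
    if days_since_last_access <= 180:
        return 'Cool'
    return 'Archive'
```

A lifecycle policy (next section) is the usual way to apply such a rule automatically.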

Lifecycle Management

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Lifecycle rules live on the management plane (azure-mgmt-storage),
# not on the blob data-plane SDK
client = StorageManagementClient(DefaultAzureCredential(), 'subscription-id')

# Tier raw data to Cool after 30 days and Archive after 90 days
client.management_policies.create_or_update(
    resource_group_name='data-rg',
    account_name='storageaccount',
    management_policy_name='default',
    properties={
        'policy': {
            'rules': [{
                'name': 'tier-old-raw-data',
                'enabled': True,
                'type': 'Lifecycle',
                'definition': {
                    'filters': {
                        'blob_types': ['blockBlob'],
                        'prefix_match': ['datalake/raw/']
                    },
                    'actions': {
                        'base_blob': {
                            'tier_to_cool': {'days_after_modification_greater_than': 30},
                            'tier_to_archive': {'days_after_modification_greater_than': 90}
                        }
                    }
                }
            }]
        }
    }
)

Azure Stream Analytics

Real-time stream processing with SQL-like queries.

Stream Analytics Job

-- Input: Event Hubs
-- Output: Blob Storage, SQL, Power BI

SELECT
    System.Timestamp() AS event_time,
    user_id,
    event_type,
    properties.device_type,
    properties.browser,
    CASE
        WHEN event_type = 'purchase' THEN CAST(properties.amount AS FLOAT)
        ELSE 0
    END AS purchase_amount
INTO
    [output-blob]
FROM
    [event-hubs-input]
WHERE
    event_type IN ('page_view', 'purchase', 'signup')
    AND timestamp IS NOT NULL

SELECT
    System.Timestamp() AS window_end,
    COUNT(*) AS event_count,
    COUNT(DISTINCT user_id) AS unique_users,
    SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases
INTO
    [real-time-dashboard]
FROM
    [event-hubs-input]
GROUP BY TumblingWindow(minute, 5)
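
Tumbling windows are fixed, non-overlapping buckets. The bucketing rule the query above relies on can be sketched in plain Python (the event shape here — a dict with an epoch-seconds `timestamp` — is an assumption for illustration):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=300):
    """Assign each event to a fixed, non-overlapping window and count.

    `events` is an iterable of dicts with an epoch-seconds 'timestamp';
    returns {window_start_epoch: event_count}.
    """
    counts = defaultdict(int)
    for event in events:
        # Floor the timestamp to the start of its window
        window_start = (event['timestamp'] // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)
```

Stream Analytics does the same bucketing, but on event time with watermark tracking, which is what makes late arrivals (discussed later) tricky.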

Machine Learning in Stream Analytics

-- Call an Azure ML function registered with the job (here named 'scoreModel')
SELECT
    event.*,
    scoreModel(event.features) AS prediction
INTO
    [scored-output]
FROM
    [event-hubs-input] AS event

Architecture Patterns

Modern Data Warehouse on Azure

flowchart TD
    A[Apps] -->|Event Hubs| B[Stream Analytics]
    A -->|ADF| C[Blob Storage Raw]
    B --> D[Synapse]
    C -->|ADF| D
    D --> E[Power BI]
    D --> F[Databricks ML]

Lambda Architecture on Azure

# Speed Layer - Stream Analytics
speed_query = """
SELECT
    System.Timestamp() AS window_end,
    count(*) as event_count,
    avg(response_time) as avg_response_time
INTO
    [speed-output]
FROM
    [events-input]
GROUP BY TumblingWindow(second, 5)
"""

# Batch Layer - Data Factory + Synapse
batch_pipeline = {
    "activities": [
        {
            "name": "IngestToRaw",
            "type": "Copy",
            "source": {"type": "BlobSource"},
            "sink": {"type": "AzureSynapseSink"}
        },
        {
            "name": "AggregateBatch",
            "type": "SqlServerStoredProcedure",
            "linkedServiceName": {"type": "AzureSqlDatabase"},
            "storedProcedureName": "sp_aggregate_daily"
        }
    ]
}

Integration with Microsoft Ecosystem

Power BI Integration

# Direct query from Synapse to Power BI
# In Power BI Desktop:
# Get Data > Azure > Azure Synapse Analytics
# Enter server: synapse-workspace.sql.azuresynapse.net
# Select database and tables

# Or connect any SQL client to the same endpoint on port 1433

Dynamics 365 Integration

# Connect to Dynamics 365 data
datafactory = DataFactoryManagementClient(
    credential,
    'subscription-id'
)

# Linked service for Dynamics 365
linked_service = {
    "name": "dynamics365",
    "properties": {
        "type": "Dynamics365",
        "typeProperties": {
            "deploymentType": "Online",
            "organizationServiceUrl": "https://org.api.crm.dynamics.com",
            "authenticationType": "Office365",
            "username": "user@company.com",
            "password": "password"
        }
    }
}

Cost Optimization

Key Azure-specific cost optimization:

  • Use Synapse serverless for unpredictable workloads (pay per TB scanned)
  • Switch to dedicated SQL pools for consistent, heavy workloads
  • Enable Auto-pause on Synapse serverless when not in use
  • Use Event Hubs capture to batch process instead of real-time when possible
  • Implement Blob Storage lifecycle policies to auto-tier old data
  • Consider Reserved Capacity for predictable workloads
  • Use Data Factory integration runtimes efficiently (Azure IR vs self-hosted)
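
The serverless-vs-dedicated bullets imply a break-even calculation. This sketch reuses the illustrative rates from the capacity-estimation section below (~$5/TB scanned, ~$3/hour per 100 DWU) — assumptions for comparison, not quoted Azure prices:

```python
def monthly_cost_comparison(tb_scanned_per_month: float,
                            dwu: int = 100,
                            hours_running: int = 730) -> dict:
    """Compare serverless vs dedicated monthly cost at illustrative rates.

    ~$5 per TB scanned (serverless); ~$3/hour per 100 DWU (dedicated).
    """
    serverless = tb_scanned_per_month * 5
    dedicated = (dwu / 100) * 3 * hours_running
    return {
        'serverless': round(serverless, 2),
        'dedicated': round(dedicated, 2),
        'cheaper': 'serverless' if serverless < dedicated else 'dedicated',
    }
```

At low scan volumes serverless wins easily; a dedicated pool only pays off once query volume is steady enough to keep it busy (or paused on a schedule).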

Azure Services Production Failure Scenarios

Event Hubs throttle from traffic spike

Event Hubs allocates capacity in throughput units (TUs), each capped at 1 MB/s ingress and 2 MB/s egress. A product launch doubles traffic and the provisioned TUs are exceeded. Incoming sends are throttled with ServerBusyException errors, and producers see server-busy (503) responses. Data is buffered client-side until capacity frees up, creating a backlog that takes hours to drain.

Mitigation: Monitor incoming_messages, throttled_requests, and server_busy_errors in Azure Monitor. Set autoscale rules on Event Hubs namespaces to add TUs based on traffic. Always implement retry logic with exponential backoff in producers.
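
A minimal retry-with-backoff wrapper might look like the following. Here `send_fn` stands in for whatever actually sends (e.g. `producer.send_batch`); a production version should catch the SDK's throttling exception specifically rather than the broad `Exception` used in this sketch:

```python
import random
import time

def send_with_backoff(send_fn, batch, max_retries=5, base_delay=0.5):
    """Retry a send with exponential backoff plus jitter.

    Doubles the delay each attempt and adds random jitter so many
    throttled producers don't retry in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return send_fn(batch)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Backoff protects the service from retry storms but does not add capacity — autoscaling TUs is still the fix for sustained overload.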

Data Factory pipeline runs for days and fails at the last step

A complex Data Factory pipeline with 40 activities runs for 48 hours. The final Databricks notebook step fails due to a workspace configuration change made 3 days earlier. The entire pipeline must be re-run from the beginning. No intermediate checkpoints exist. Business reports are delayed 2 days.

Mitigation: Use Data Factory's rerun-from-failed-activity option instead of restarting from scratch. Break long pipelines into smaller pipelines with trigger-based chaining. Store intermediate outputs in Blob Storage. Use validation activities before critical downstream steps.

Synapse DWU exhaustion during month-end batch window

Month-end batch jobs run in parallel with BI dashboards, consuming all Data Warehouse Units (DWUs). Analytical queries start timing out. Engineers do not discover the issue until batch window completes and DWU utilization drops. Meanwhile, data freshness SLA is missed.

Mitigation: Use Synapse workload importance to prioritize critical queries over batch jobs. Classify workloads with workload groups. Monitor SQLPoolUsage metrics. Consider dedicated capacity for batch processing separate from interactive workloads.

Blob Storage lifecycle rule deletes data prematurely

A lifecycle rule is configured to delete raw data after 30 days. The rule runs on schedule and deletes files from the raw/events/ prefix. However, a compliance requirement is discovered 3 months later that mandates 7-year retention. The deleted data cannot be recovered. Audit findings and regulatory penalties follow.

Mitigation: Always use a cool or cold tier before deletion rules. Enable Blob soft delete (minimum 7 days). Test lifecycle rules in a staging environment before applying to production. Get sign-off on retention policies from legal before configuring deletion.

Stream Analytics job falls behind due to watermark on empty partition

A Stream Analytics job processes events from multiple Event Hubs partitions. One partition goes silent for 2 hours (no events from a device category). The watermark for that partition does not advance. All windowed aggregates hold partial data and do not output. When the silent partition resumes, late events from 2 hours ago arrive and are dropped by default.

Mitigation: Use MAX(outOfOrderDuration) tolerance settings. Set lateInputHandling to adjust instead of drop. Monitor watermark delay metric in Azure Monitor. When partitions go permanently inactive, update the partition count in the Stream Analytics job.
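
The drop-vs-adjust choice for late events can be illustrated with a small function; the 5-second tolerance is an arbitrary example value:

```python
def handle_late_event(event_time: float, watermark: float,
                      tolerance_seconds: float = 5.0,
                      policy: str = 'adjust'):
    """Mimic a late-arrival policy relative to the current watermark.

    Events older than (watermark - tolerance) are either dropped or
    clamped forward. Returns the effective event time, or None if dropped.
    """
    cutoff = watermark - tolerance_seconds
    if event_time >= cutoff:
        return event_time  # on time, or within tolerance
    if policy == 'drop':
        return None        # silently discarded
    return cutoff          # 'adjust': clamp to the tolerated edge
```

'Adjust' keeps the data but shifts it into the wrong window; 'drop' keeps windows accurate but loses events. Neither is free, which is why watermark delay deserves its own alert.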

Azure Services Trade-Offs

| Service | When to Use | Key Risk |
| --- | --- | --- |
| Event Hubs | High-throughput Kafka-compatible streaming | TU limits; throttling under surprise load |
| Data Factory | Visual ETL with 90+ connectors | Complex pipelines hard to debug; verbose JSON |
| Synapse Dedicated | Predictable DW at scale, complex joins | Cost; cold resume after pause |
| Synapse Serverless | Unpredictable ad-hoc queries | Per-TB scanning cost surprises |
| Blob Storage | Data lake with hierarchical namespace | Lifecycle rules irreversible without soft delete |
| Databricks | ML workloads, advanced Spark | Cost; requires specific expertise |
| Stream Analytics | Simple real-time SQL on streams | Limited windowing flexibility vs Dataflow |

Azure vs AWS vs GCP Data Services

| Capability | Azure | AWS | GCP |
| --- | --- | --- | --- |
| Streaming | Event Hubs | Kinesis Data Streams | Pub/Sub |
| ETL/Orchestration | Data Factory | Glue | Dataflow (templates) |
| Data Warehouse | Synapse | Redshift | BigQuery |
| Object Storage | Blob Storage | S3 | Cloud Storage |
| Stream Processing | Stream Analytics | Kinesis Data Analytics | Dataflow |
| Spark Analytics | Databricks | EMR | Dataproc |

Azure Services Capacity Estimation

Event Hubs Throughput Units

import math

def estimate_event_hubs_tu(
    messages_per_second: int,
    avg_message_size_kb: float
) -> dict:
    """
    Estimate Event Hubs throughput units (TUs).
    1 TU = 1 MB/s ingress, 2 MB/s egress.
    """
    ingress_mbps = (messages_per_second * avg_message_size_kb) / 1024
    tus_needed = max(1, math.ceil(ingress_mbps))  # 1 MB/s ingress per TU

    return {
        'messages_per_second': messages_per_second,
        'ingress_mbps': round(ingress_mbps, 2),
        'tus_recommended': math.ceil(tus_needed * 1.5),  # 50% headroom
        'autoscale_max': tus_needed * 3,  # burst to 3x
        'estimated_monthly_cost': math.ceil(tus_needed * 1.5) * 11,  # ~$11 per TU per month
        'note': 'Use autoscale to handle traffic spikes'
    }

# Example:
result = estimate_event_hubs_tu(5000, 2)
# ~9.8 MB/s ingress, 15 TUs recommended (with headroom), ~$165/month

Synapse DWU Planning

def estimate_synapse_dwu(
    data_size_tb: float,
    concurrent_queries: int = 5,
    warehouse_type: str = 'dedicated'
) -> dict:
    """
    Estimate Synapse DWU (Data Warehouse Unit) requirements.
    """
    if warehouse_type == 'serverless':
        return {
            'type': 'serverless',
            'cost_model': 'per TB scanned',
            'estimated_cost_per_tb': 5,  # ~$5/TB scanned
            'note': 'Pay per query; no upfront capacity planning needed'
        }

    # Dedicated: assume ~$3/hr per 100 DWU (illustrative rate)
    dwu_needed = max(100, concurrent_queries * 100)  # ~100 DWU per concurrent query
    monthly_cost = (dwu_needed / 100) * 3 * 730  # ~$2,190 per 100 DWU per month

    return {
        'data_size_tb': data_size_tb,
        'concurrent_queries': concurrent_queries,
        'dwu_recommended': dwu_needed,
        'estimated_monthly_cost': round(monthly_cost, 0),
        'note': 'Pause when idle to save costs (auto-pause recommended)'
    }

# Example:
result = estimate_synapse_dwu(10, 8, 'dedicated')
# ~800 DWU, ~$17,520/month if running 24/7 — pausing matters

Azure Services Observability Hooks

| Service | Metric | Alert Threshold |
| --- | --- | --- |
| Event Hubs | throttled_requests | > 0 for 5m |
| Event Hubs | incoming_messages | sudden drop to 0 |
| Data Factory | ActivityRunsPending | > 100 runs queued |
| Data Factory | PipelineRunDuration | > expected duration × 2 |
| Synapse | SQLPoolUsage | > 90% for 15m |
| Synapse | WarehouseLiveUsage | > available DWU |
| Blob Storage | BlobCount | unexpected growth > 20% |
| Stream Analytics | watermarkDelay | > 30s |
| Stream Analytics | InputEventCount | sudden drop to 0 |
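
These thresholds can be codified as simple rules. The helper below is a hypothetical sketch — real alerting belongs in Azure Monitor — but encoding the rules makes them reviewable and testable:

```python
def breaches_threshold(metric: str, value: float, baseline: float = None) -> bool:
    """Evaluate the alert thresholds from the table above.

    Static rules are hard-coded; relative rules ('expected duration x 2',
    'growth > 20%') need a baseline value.
    """
    static_rules = {
        'throttled_requests': lambda v: v > 0,
        'ActivityRunsPending': lambda v: v > 100,
        'SQLPoolUsage': lambda v: v > 90,   # percent
        'watermarkDelay': lambda v: v > 30, # seconds
    }
    if metric in static_rules:
        return static_rules[metric](value)
    if metric == 'PipelineRunDuration' and baseline is not None:
        return value > baseline * 2
    if metric == 'BlobCount' and baseline is not None:
        return value > baseline * 1.2
    return False
```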

Azure Services Security Checklist

  • Azure AD (Entra ID) used for authentication — no local credentials
  • Event Hubs uses Microsoft Entra ID for authorization, not SAS policies
  • Blob Storage containers with hierarchical namespace block public access; use RBAC
  • Synapse workspace uses managed private endpoints for Azure Storage
  • Data Factory managed Virtual Network enabled for ADF integration runtime
  • Azure Key Vault used for secrets (connection strings, keys); not in code or ADF linked services
  • Azure Defender for Storage enabled on production storage accounts
  • Azure Monitor diagnostic settings route to Log Analytics with 90-day retention
  • Synapse audit logs enabled and exported to Azure Sentinel
  • No firewall rules set to “Allow all” on Event Hubs or Synapse endpoints

Azure Services Anti-Patterns

Using Data Factory for complex transformations. Data Factory is an orchestrator, not a transformation engine. Complex multi-step logic in Data Flow scripts is hard to debug, version, and test. Use Data Factory to orchestrate; push transformations to Spark (Databricks, Synapse Spark) or SQL.

Running Synapse dedicated pools 24/7. Dedicated SQL pools can be paused when not in use. A DWU-1000 pool running continuously costs thousands of dollars per month; pausing it outside business hours can cut that bill by more than half.

Not using Event Hubs Capture. Without Capture configured, raw events are lost after retention period. Capture to Blob Storage provides a free, durable backup that also enables replay and batch processing.

Over-provisioning Synapse DWUs. Starting at high DWU levels for small datasets is wasteful. Scale down and measure actual utilization before scaling up.

Ignoring Stream Analytics ordering guarantees. Stream Analytics provides at-least-once processing. Designing pipelines that assume exactly-once leads to duplicate records in sinks. Always design for idempotent writes.
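
An idempotent sink keyed by a stable event ID makes redelivery harmless. A minimal in-memory sketch (a real sink would be an upsert/MERGE against the target table):

```python
def idempotent_write(sink: dict, event: dict) -> bool:
    """Upsert keyed by event_id so a redelivered event overwrites
    rather than duplicates. Returns True if the event was new.
    """
    key = event['event_id']
    is_new = key not in sink
    sink[key] = event
    return is_new
```

With this pattern, at-least-once delivery degrades to "last write wins" for a given event ID instead of producing duplicate rows.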

Quick Recap

  • Event Hubs provides Kafka-compatible high-throughput streaming; monitor throttled requests and set autoscale rules
  • Data Factory orchestrates pipelines with 90+ connectors; break long pipelines into smaller units with checkpointing
  • Synapse Analytics offers dedicated pools for predictable heavy workloads and serverless for unpredictable ad-hoc queries
  • Blob Storage with hierarchical namespace is the foundation; always enable soft delete before configuring lifecycle deletion
  • Monitor Event Hubs TU utilization, Data Factory queue depth, Synapse DWU usage, and Stream Analytics watermark delay
  • Use Azure AD for all authentication; Key Vault for all secrets; never hardcode credentials

For more on data pipeline patterns, see our Event-Driven Architecture and Pub-Sub Patterns guides.

Key Takeaways:

  • Event Hubs for high-throughput Kafka-compatible event streaming
  • Data Factory for visual pipeline orchestration with 90+ connectors
  • Synapse for unified analytics (SQL warehouse + Spark + streaming)
  • Blob Storage with hierarchical namespace for modern data lake

Architecture Decision Guide:

  • Real-time ingestion: Event Hubs with Kafka protocol
  • ETL/orchestration: Data Factory with Mapping Data Flows
  • Enterprise DW: Synapse dedicated pools for heavy workloads
  • Ad-hoc analytics: Synapse serverless on Data Lake
  • Real-time analytics: Stream Analytics with SQL interface

Integration Benefits:

  • Native Power BI integration for reporting
  • Dynamics 365 connectors for CRM data
  • Azure Active Directory for authentication
  • Microsoft 365 data integration
