Azure Data Services: Data Factory, Synapse, and Event Hubs

Build data pipelines on Azure with Data Factory, Synapse Analytics, and Event Hubs. Learn integration patterns, streaming setup, and data architecture.


Azure Data Services: Data Factory, Synapse, Event Hubs, and Blob Storage

Azure has a comprehensive data platform that integrates well with the Microsoft ecosystem. If your organization already uses Microsoft products, Azure data services connect naturally with Power BI, Excel, and the rest of the stack.

This guide covers the key Azure data services and shows how to build modern data architectures on Azure.

Service Overview

| Service | Type | Use Case |
| --- | --- | --- |
| Data Factory | Orchestration | Pipeline orchestration, ETL |
| Synapse Analytics | Data Warehouse | Enterprise analytics, BI |
| Event Hubs | Streaming | High-throughput event ingestion |
| Blob Storage | Object Storage | Data lake, unstructured storage |
| Databricks | Spark Analytics | ML, advanced analytics |
| Stream Analytics | Stream Processing | Real-time analytics on streams |

Azure Event Hubs

Event Hubs handles durable, ordered event streaming. It also speaks the Kafka protocol, which is useful if you have Kafka tooling already.

Event Hubs Namespace and Hubs

from azure.eventhub import EventData, EventHubProducerClient, EventHubConsumerClient
import json

# Create producer
producer = EventHubProducerClient.from_connection_string(
    conn_str='Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=...',
    eventhub_name='click-events'
)

# Send events
def send_events(events):
    with producer:
        event_batch = producer.create_batch()
        for event in events:
            event_batch.add(EventData(json.dumps(event)))
        producer.send_batch(event_batch)

# Consumer
consumer = EventHubConsumerClient.from_connection_string(
    conn_str='Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=...',
    consumer_group='$Default',
    eventhub_name='click-events'
)

def on_event(partition_context, event):
    data = json.loads(event.body_as_str())
    process_event(data)
    partition_context.update_checkpoint(event)

with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")

Event Hubs Kafka Compatibility

import json

from kafka import KafkaProducer, KafkaConsumer

# Use Kafka protocol to interact with Event Hubs
producer = KafkaProducer(
    bootstrap_servers='namespace.servicebus.windows.net:9093',
    security_protocol='SASL_SSL',
    sasl_mechanism='PLAIN',
    sasl_plain_username='$ConnectionString',
    sasl_plain_password='Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=...',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

producer.send('click-events', {'user_id': '123', 'action': 'page_view'})

Azure Data Factory

Data Factory provides visual pipeline orchestration with 90+ connectors.

Data Factory Pipeline

{
  "name": "ingest_and_transform",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToADLS",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "BlobSourceDataset",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "ADLSSinkDataset",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      },
      {
        "name": "RunDatabricksNotebook",
        "type": "DatabricksNotebook",
        "linkedServiceName": {
          "referenceName": "DatabricksLinkedService",
          "type": "LinkedServiceReference"
        },
        "parameters": {
          "sourcePath": "events.parquet",
          "outputPath": "processed/events"
        }
      }
    ]
  }
}

Mapping Data Flows

# Data Flow transformation (equivalent in ADF visual)
# Source: Blob Storage
# Derive: Add computed columns
# Select: Choose output columns
# Sink: Azure Synapse

# Mapping Data Flow script
dataflow = """
source(
    allowSchemaDrift: true,
    validateSchema: false,
    ignoreNoFilesFound: true
)
~> BlobSource

BlobSource derive(
    event_date = toDate(timestamp),
    event_hour = hour(timestamp),
    user_segment = iif(total_purchases > 10, 'high_value', 'regular')
)
~> DerivedColumns

DerivedColumns select(
    allowSchemaDrift: true,
    validateSchema: false,
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true
)
(
    output(
        event_id,
        user_id,
        event_type,
        timestamp,
        event_date,
        event_hour,
        user_segment
    )
)
~> SelectColumns

SelectColumns sink(
    type: 'synapse',
    LinkedServiceName: 'SynapseWorkspace',
    Table: 'dbo.events'
)
"""

Triggering Pipelines

import time

from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
client = DataFactoryManagementClient(credential, 'subscription-id')

# Trigger pipeline run
run_response = client.pipelines.create_run(
    resource_group_name='data-rg',
    factory_name='datafactory',
    pipeline_name='ingest_and_transform',
    parameters={
        'inputPath': 'raw/events/2026/03/27',
        'outputPath': 'processed/events/2026/03/27'
    }
)

# Wait for completion
while True:
    run = client.pipeline_runs.get(
        resource_group_name='data-rg',
        factory_name='datafactory',
        run_id=run_response.run_id
    )
    if run.status in ['Succeeded', 'Failed', 'Cancelled']:
        break
    time.sleep(30)

Azure Synapse Analytics

Synapse combines data warehousing and big data analytics in one platform.

Dedicated SQL Pool

import pyodbc

# Placeholder credentials; in production prefer Azure AD authentication
# or pull these from Key Vault
username = 'sqladmin'
password = '<secret-from-key-vault>'

# Connect to Synapse
server = 'synapse-workspace.sql.azuresynapse.net'
database = 'analytics'
conn = pyodbc.connect(
    f'DRIVER={{ODBC Driver 17 for SQL Server}};'
    f'SERVER={server};'
    f'DATABASE={database};'
    f'UID={username};'
    f'PWD={password}'
)

# Create external table
cursor = conn.cursor()
cursor.execute("""
    CREATE EXTERNAL TABLE dbo.user_events
    (
        event_id BIGINT,
        user_id VARCHAR(50),
        event_type VARCHAR(50),
        timestamp DATETIME2,
        properties VARCHAR(MAX)
    )
    WITH (
        LOCATION = 'processed/events/',
        DATA_SOURCE = 'datalake',
        FILE_FORMAT = 'parquet_format'
    )
""")

Serverless SQL Pool

# Query data directly in storage without moving it
query = """
SELECT
    user_id,
    event_type,
    COUNT(*) as event_count,
    MIN(timestamp) as first_seen,
    MAX(timestamp) as last_seen
FROM OPENROWSET(
    BULK 'processed/events/year=2026/month=03/*.parquet',
    DATA_SOURCE = 'datalake',
    FORMAT = 'PARQUET'
) AS events
GROUP BY user_id, event_type
ORDER BY event_count DESC
"""

import pandas as pd

df = pd.read_sql(query, conn)

Spark Integration

from pyspark.sql import SparkSession

# Inside a Synapse Spark pool a session already exists; building one
# explicitly is only needed when running outside the workspace
spark = SparkSession.builder \
    .appName("SynapseSpark") \
    .getOrCreate()

# Read from ADLS Gen2 (abfss = the secure ABFS driver)
df = spark.read \
    .parquet("abfss://datalake@storageaccount.dfs.core.windows.net/processed/events/")

# Process and write to a dedicated SQL pool table via the built-in
# Synapse Spark connector (<database>.<schema>.<table>)
df.write \
    .mode("append") \
    .synapsesql("analytics.dbo.events")

Blob Storage and Data Lake

Azure Data Lake Storage (ADLS) provides tiered blob storage with hierarchical namespace.

ADLS Setup

from azure.storage.filedatalake import DataLakeServiceClient

# Create Data Lake service client
service_client = DataLakeServiceClient.from_connection_string(
    conn_str='DefaultEndpointsProtocol=https;AccountName=storageaccount;AccountKey=...'
)

# Create filesystem (container)
filesystem_client = service_client.create_file_system('datalake')

# Create directory
directory_client = filesystem_client.create_directory('raw/events')

# Upload file
file_client = directory_client.create_file('2026/03/27/events.parquet')
with open('events.parquet', 'rb') as f:
    data = f.read()
file_client.append_data(data, offset=0, length=len(data))
file_client.flush_data(len(data))

Data Lake Access Patterns

| Access Pattern | Storage Tier | Use Case |
| --- | --- | --- |
| Hot | Blob Hot | Active processing |
| Cool | Blob Cool | Recent data, infrequent access |
| Archive | Blob Archive | Historical, rarely accessed |
| Premium | Block Blob | High IOPS requirements |
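
The tier decision can be encoded as a simple rule. In this sketch the day-count cutoffs are illustrative assumptions, not Azure-mandated boundaries:

```python
def choose_blob_tier(days_since_last_access: int, needs_high_iops: bool = False) -> str:
    """Pick a storage tier from access recency and IOPS needs.

    Thresholds (30/180 days) are example policy values, not Azure limits.
    """
    if needs_high_iops:
        return 'Premium'
    if days_since_last_access <= 30:
        return 'Hot'
    if days_since_last_access <= 180:
        return 'Cool'
    return 'Archive'
```

A lifecycle policy (next section) is the usual way to apply such a rule automatically.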

Lifecycle Management

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Lifecycle rules live on the management plane (azure-mgmt-storage),
# not on the blob data-plane SDK
client = StorageManagementClient(DefaultAzureCredential(), 'subscription-id')

# Tier raw data to Cool after 30 days and Archive after 90 days
client.management_policies.create_or_update(
    resource_group_name='data-rg',
    account_name='storageaccount',
    management_policy_name='default',
    properties={
        'policy': {
            'rules': [{
                'name': 'tier-old-raw-data',
                'enabled': True,
                'type': 'Lifecycle',
                'definition': {
                    'filters': {
                        'blob_types': ['blockBlob'],
                        'prefix_match': ['datalake/raw/']
                    },
                    'actions': {
                        'base_blob': {
                            'tier_to_cool': {'days_after_modification_greater_than': 30},
                            'tier_to_archive': {'days_after_modification_greater_than': 90}
                        }
                    }
                }
            }]
        }
    }
)

Azure Stream Analytics

Real-time stream processing with SQL-like queries.

Stream Analytics Job

-- Input: Event Hubs
-- Output: Blob Storage, SQL, Power BI

SELECT
    System.Timestamp() AS event_time,
    user_id,
    event_type,
    properties.device_type,
    properties.browser,
    CASE
        WHEN event_type = 'purchase' THEN CAST(properties.amount AS FLOAT)
        ELSE 0
    END AS purchase_amount
INTO
    [output-blob]
FROM
    [event-hubs-input]
WHERE
    event_type IN ('page_view', 'purchase', 'signup')
    AND timestamp IS NOT NULL

SELECT
    System.Timestamp() AS window_end,
    COUNT(*) AS event_count,
    COUNT(DISTINCT user_id) AS unique_users,
    SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases
INTO
    [real-time-dashboard]
FROM
    [event-hubs-input]
GROUP BY TumblingWindow(minute, 5)
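
Tumbling windows are fixed, non-overlapping buckets. The bucketing rule the query above relies on can be sketched in plain Python (the event shape here — a dict with an epoch-seconds `timestamp` — is an assumption for illustration):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=300):
    """Assign each event to a fixed, non-overlapping window and count.

    `events` is an iterable of dicts with an epoch-seconds 'timestamp';
    returns {window_start_epoch: event_count}.
    """
    counts = defaultdict(int)
    for event in events:
        # Floor the timestamp to the start of its window
        window_start = (event['timestamp'] // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)
```

Stream Analytics does the same bucketing, but on event time with watermark tracking, which is what makes late arrivals (discussed later) tricky.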

Machine Learning in Stream Analytics

-- Call an Azure ML function registered with the job (here named 'scoreModel')
SELECT
    event.*,
    scoreModel(event.features) AS prediction
INTO
    [scored-output]
FROM
    [event-hubs-input] AS event

Architecture Patterns

Modern Data Warehouse on Azure

flowchart TD
    A[Apps] -->|Event Hubs| B[Stream Analytics]
    A -->|ADF| C[Blob Storage Raw]
    B --> D[Synapse]
    C -->|ADF| D
    D --> E[Power BI]
    D --> F[Databricks ML]

Lambda Architecture on Azure

# Speed Layer - Stream Analytics
speed_query = """
SELECT
    System.Timestamp() AS window_end,
    count(*) as event_count,
    avg(response_time) as avg_response_time
INTO
    [speed-output]
FROM
    [events-input]
GROUP BY TumblingWindow(second, 5)
"""

# Batch Layer - Data Factory + Synapse
batch_pipeline = {
    "activities": [
        {
            "name": "IngestToRaw",
            "type": "Copy",
            "source": {"type": "BlobSource"},
            "sink": {"type": "AzureSynapseSink"}
        },
        {
            "name": "AggregateBatch",
            "type": "SqlServerStoredProcedure",
            "linkedServiceName": {"type": "AzureSqlDatabase"},
            "storedProcedureName": "sp_aggregate_daily"
        }
    ]
}

Integration with Microsoft Ecosystem

Power BI Integration

# Direct query from Synapse to Power BI
# In Power BI Desktop:
# Get Data > Azure > Azure Synapse Analytics
# Enter server: synapse-workspace.sql.azuresynapse.net
# Select database and tables

# Or connect any SQL client to the same endpoint on port 1433

Dynamics 365 Integration

# Connect to Dynamics 365 data
datafactory = DataFactoryManagementClient(
    credential,
    'subscription-id'
)

# Linked service for Dynamics 365
linked_service = {
    "name": "dynamics365",
    "properties": {
        "type": "Dynamics365",
        "typeProperties": {
            "deploymentType": "Online",
            "organizationServiceUrl": "https://org.api.crm.dynamics.com",
            "authenticationType": "Office365",
            "username": "user@company.com",
            "password": "password"
        }
    }
}

Cost Optimization

Key Azure-specific cost optimization:

  • Use Synapse serverless for unpredictable workloads (pay per TB scanned)
  • Switch to dedicated SQL pools for consistent, heavy workloads
  • Enable Auto-pause on Synapse serverless when not in use
  • Use Event Hubs capture to batch process instead of real-time when possible
  • Implement Blob Storage lifecycle policies to auto-tier old data
  • Consider Reserved Capacity for predictable workloads
  • Use Data Factory integration runtimes efficiently (Azure IR vs self-hosted)
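
The serverless-vs-dedicated bullets imply a break-even calculation. This sketch reuses the illustrative rates from the capacity-estimation section below (~$5/TB scanned, ~$3/hour per 100 DWU) — assumptions for comparison, not quoted Azure prices:

```python
def monthly_cost_comparison(tb_scanned_per_month: float,
                            dwu: int = 100,
                            hours_running: int = 730) -> dict:
    """Compare serverless vs dedicated monthly cost at illustrative rates.

    ~$5 per TB scanned (serverless); ~$3/hour per 100 DWU (dedicated).
    """
    serverless = tb_scanned_per_month * 5
    dedicated = (dwu / 100) * 3 * hours_running
    return {
        'serverless': round(serverless, 2),
        'dedicated': round(dedicated, 2),
        'cheaper': 'serverless' if serverless < dedicated else 'dedicated',
    }
```

At low scan volumes serverless wins easily; a dedicated pool only pays off once query volume is steady enough to keep it busy (or paused on a schedule).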

Azure Services Production Failure Scenarios

Event Hubs throttle from traffic spike

Event Hubs allocates capacity in throughput units (TUs), each capped at 1 MB/s ingress and 2 MB/s egress. A product launch doubles traffic and the provisioned TUs are exceeded. Incoming sends are throttled with ServerBusyException errors, and producers see server-busy (503) responses. Data is buffered client-side until capacity frees up, creating a backlog that takes hours to drain.

Mitigation: Monitor incoming_messages, throttled_requests, and server_busy_errors in Azure Monitor. Set autoscale rules on Event Hubs namespaces to add TUs based on traffic. Always implement retry logic with exponential backoff in producers.
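
A minimal retry-with-backoff wrapper might look like the following. Here `send_fn` stands in for whatever actually sends (e.g. `producer.send_batch`); a production version should catch the SDK's throttling exception specifically rather than the broad `Exception` used in this sketch:

```python
import random
import time

def send_with_backoff(send_fn, batch, max_retries=5, base_delay=0.5):
    """Retry a send with exponential backoff plus jitter.

    Doubles the delay each attempt and adds random jitter so many
    throttled producers don't retry in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return send_fn(batch)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Backoff protects the service from retry storms but does not add capacity — autoscaling TUs is still the fix for sustained overload.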

Data Factory pipeline runs for days and fails at the last step

A complex Data Factory pipeline with 40 activities runs for 48 hours. The final Databricks notebook step fails due to a workspace configuration change made 3 days earlier. The entire pipeline must be re-run from the beginning. No intermediate checkpoints exist. Business reports are delayed 2 days.

Mitigation: Use Data Factory's rerun-from-failed-activity option instead of restarting from scratch. Break long pipelines into smaller pipelines with trigger-based chaining. Store intermediate outputs in Blob Storage. Use validation activities before critical downstream steps.

Synapse DWU exhaustion during month-end batch window

Month-end batch jobs run in parallel with BI dashboards, consuming all Data Warehouse Units (DWUs). Analytical queries start timing out. Engineers do not discover the issue until batch window completes and DWU utilization drops. Meanwhile, data freshness SLA is missed.

Mitigation: Use Synapse workload importance to prioritize critical queries over batch jobs. Classify workloads with workload groups. Monitor SQLPoolUsage metrics. Consider dedicated capacity for batch processing separate from interactive workloads.

Blob Storage lifecycle rule deletes data prematurely

A lifecycle rule is configured to delete raw data after 30 days. The rule runs on schedule and deletes files from the raw/events/ prefix. However, a compliance requirement is discovered 3 months later that mandates 7-year retention. The deleted data cannot be recovered. Audit findings and regulatory penalties follow.

Mitigation: Always use a cool or cold tier before deletion rules. Enable Blob soft delete (minimum 7 days). Test lifecycle rules in a staging environment before applying to production. Get sign-off on retention policies from legal before configuring deletion.

Stream Analytics job falls behind due to watermark on empty partition

A Stream Analytics job processes events from multiple Event Hubs partitions. One partition goes silent for 2 hours (no events from a device category). The watermark for that partition does not advance. All windowed aggregates hold partial data and do not output. When the silent partition resumes, late events from 2 hours ago arrive and are dropped by default.

Mitigation: Use MAX(outOfOrderDuration) tolerance settings. Set lateInputHandling to adjust instead of drop. Monitor watermark delay metric in Azure Monitor. When partitions go permanently inactive, update the partition count in the Stream Analytics job.
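
The drop-vs-adjust choice for late events can be illustrated with a small function; the 5-second tolerance is an arbitrary example value:

```python
def handle_late_event(event_time: float, watermark: float,
                      tolerance_seconds: float = 5.0,
                      policy: str = 'adjust'):
    """Mimic a late-arrival policy relative to the current watermark.

    Events older than (watermark - tolerance) are either dropped or
    clamped forward. Returns the effective event time, or None if dropped.
    """
    cutoff = watermark - tolerance_seconds
    if event_time >= cutoff:
        return event_time  # on time, or within tolerance
    if policy == 'drop':
        return None        # silently discarded
    return cutoff          # 'adjust': clamp to the tolerated edge
```

'Adjust' keeps the data but shifts it into the wrong window; 'drop' keeps windows accurate but loses events. Neither is free, which is why watermark delay deserves its own alert.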

Azure Services Trade-Offs

| Service | When to Use | Key Risk |
| --- | --- | --- |
| Event Hubs | High-throughput Kafka-compatible streaming | TU limits; throttling under surprise load |
| Data Factory | Visual ETL with 90+ connectors | Complex pipelines hard to debug; verbose JSON |
| Synapse Dedicated | Predictable DW at scale, complex joins | Cost; cold resume after pause |
| Synapse Serverless | Unpredictable ad-hoc queries | Per-TB scanning cost surprises |
| Blob Storage | Data lake with hierarchical namespace | Lifecycle rules irreversible without soft delete |
| Databricks | ML workloads, advanced Spark | Cost; requires specific expertise |
| Stream Analytics | Simple real-time SQL on streams | Limited windowing flexibility vs Dataflow |

Azure vs AWS vs GCP Data Services

| Capability | Azure | AWS | GCP |
| --- | --- | --- | --- |
| Streaming | Event Hubs | Kinesis Data Streams | Pub/Sub |
| ETL/Orchestration | Data Factory | Glue | Dataflow (templates) |
| Data Warehouse | Synapse | Redshift | BigQuery |
| Object Storage | Blob Storage | S3 | Cloud Storage |
| Stream Processing | Stream Analytics | Kinesis Data Analytics | Dataflow |
| Spark Analytics | Databricks | EMR | Dataproc |

Azure Services Capacity Estimation

Event Hubs Throughput Units

import math

def estimate_event_hubs_tu(
    messages_per_second: int,
    avg_message_size_kb: float
) -> dict:
    """
    Estimate Event Hubs throughput units (TUs).
    1 TU = 1 MB/s ingress, 2 MB/s egress.
    """
    ingress_mbps = (messages_per_second * avg_message_size_kb) / 1024
    tus_needed = max(1, math.ceil(ingress_mbps))  # 1 MB/s ingress per TU

    return {
        'messages_per_second': messages_per_second,
        'ingress_mbps': round(ingress_mbps, 2),
        'tus_recommended': math.ceil(tus_needed * 1.5),  # 50% headroom
        'autoscale_max': tus_needed * 3,  # burst to 3x
        'estimated_monthly_cost': math.ceil(tus_needed * 1.5) * 11,  # ~$11 per TU per month
        'note': 'Use autoscale to handle traffic spikes'
    }

# Example:
result = estimate_event_hubs_tu(5000, 2)
# ~9.8 MB/s ingress, 15 TUs recommended (with headroom), ~$165/month

Synapse DWU Planning

def estimate_synapse_dwu(
    data_size_tb: float,
    concurrent_queries: int = 5,
    warehouse_type: str = 'dedicated'
) -> dict:
    """
    Estimate Synapse DWU (Data Warehouse Unit) requirements.
    """
    if warehouse_type == 'serverless':
        return {
            'type': 'serverless',
            'cost_model': 'per TB scanned',
            'estimated_cost_per_tb': 5,  # ~$5/TB scanned
            'note': 'Pay per query; no upfront capacity planning needed'
        }

    # Dedicated: assume ~$3/hr per 100 DWU (illustrative rate)
    dwu_needed = max(100, concurrent_queries * 100)  # ~100 DWU per concurrent query
    monthly_cost = (dwu_needed / 100) * 3 * 730  # ~$2,190 per 100 DWU per month

    return {
        'data_size_tb': data_size_tb,
        'concurrent_queries': concurrent_queries,
        'dwu_recommended': dwu_needed,
        'estimated_monthly_cost': round(monthly_cost, 0),
        'note': 'Pause when idle to save costs (auto-pause recommended)'
    }

# Example:
result = estimate_synapse_dwu(10, 8, 'dedicated')
# ~800 DWU, ~$17,520/month if running 24/7 — pausing matters

Azure Services Observability Hooks

| Service | Metric | Alert Threshold |
| --- | --- | --- |
| Event Hubs | throttled_requests | > 0 for 5m |
| Event Hubs | incoming_messages | sudden drop to 0 |
| Data Factory | ActivityRunsPending | > 100 runs queued |
| Data Factory | PipelineRunDuration | > expected duration × 2 |
| Synapse | SQLPoolUsage | > 90% for 15m |
| Synapse | WarehouseLiveUsage | > available DWU |
| Blob Storage | BlobCount | unexpected growth > 20% |
| Stream Analytics | watermarkDelay | > 30s |
| Stream Analytics | InputEventCount | sudden drop to 0 |
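
These thresholds can be codified as simple rules. The helper below is a hypothetical sketch — real alerting belongs in Azure Monitor — but encoding the rules makes them reviewable and testable:

```python
def breaches_threshold(metric: str, value: float, baseline: float = None) -> bool:
    """Evaluate the alert thresholds from the table above.

    Static rules are hard-coded; relative rules ('expected duration x 2',
    'growth > 20%') need a baseline value.
    """
    static_rules = {
        'throttled_requests': lambda v: v > 0,
        'ActivityRunsPending': lambda v: v > 100,
        'SQLPoolUsage': lambda v: v > 90,   # percent
        'watermarkDelay': lambda v: v > 30, # seconds
    }
    if metric in static_rules:
        return static_rules[metric](value)
    if metric == 'PipelineRunDuration' and baseline is not None:
        return value > baseline * 2
    if metric == 'BlobCount' and baseline is not None:
        return value > baseline * 1.2
    return False
```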

Azure Services Security Checklist

  • Azure AD (Entra ID) used for authentication — no local credentials
  • Event Hubs uses Microsoft Entra ID for authorization, not SAS policies
  • Blob Storage containers with hierarchical namespace block public access; use RBAC
  • Synapse workspace uses managed private endpoints for Azure Storage
  • Data Factory managed Virtual Network enabled for ADF integration runtime
  • Azure Key Vault used for secrets (connection strings, keys); not in code or ADF linked services
  • Azure Defender for Storage enabled on production storage accounts
  • Azure Monitor diagnostic settings route to Log Analytics with 90-day retention
  • Synapse audit logs enabled and exported to Azure Sentinel
  • No firewall rules set to “Allow all” on Event Hubs or Synapse endpoints

Azure Services Anti-Patterns

Using Data Factory for complex transformations. Data Factory is an orchestrator, not a transformation engine. Complex multi-step logic in Data Flow scripts is hard to debug, version, and test. Use Data Factory to orchestrate; push transformations to Spark (Databricks, Synapse Spark) or SQL.

Running Synapse dedicated pools 24/7. Dedicated SQL pools can be paused when not in use. A DWU-1000 pool running continuously costs thousands of dollars per month; pausing it outside business hours can cut that bill by more than half.

Not using Event Hubs Capture. Without Capture configured, raw events are lost after retention period. Capture to Blob Storage provides a free, durable backup that also enables replay and batch processing.

Over-provisioning Synapse DWUs. Starting at high DWU levels for small datasets is wasteful. Scale down and measure actual utilization before scaling up.

Ignoring Stream Analytics ordering guarantees. Stream Analytics provides at-least-once processing. Designing pipelines that assume exactly-once leads to duplicate records in sinks. Always design for idempotent writes.
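
An idempotent sink keyed by a stable event ID makes redelivery harmless. A minimal in-memory sketch (a real sink would be an upsert/MERGE against the target table):

```python
def idempotent_write(sink: dict, event: dict) -> bool:
    """Upsert keyed by event_id so a redelivered event overwrites
    rather than duplicates. Returns True if the event was new.
    """
    key = event['event_id']
    is_new = key not in sink
    sink[key] = event
    return is_new
```

With this pattern, at-least-once delivery degrades to "last write wins" for a given event ID instead of producing duplicate rows.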

Quick Recap

  • Event Hubs provides Kafka-compatible high-throughput streaming; monitor throttled requests and set autoscale rules
  • Data Factory orchestrates pipelines with 90+ connectors; break long pipelines into smaller units with checkpointing
  • Synapse Analytics offers dedicated pools for predictable heavy workloads and serverless for unpredictable ad-hoc queries
  • Blob Storage with hierarchical namespace is the foundation; always enable soft delete before configuring lifecycle deletion
  • Monitor Event Hubs TU utilization, Data Factory queue depth, Synapse DWU usage, and Stream Analytics watermark delay
  • Use Azure AD for all authentication; Key Vault for all secrets; never hardcode credentials

For more on data pipeline patterns, see our Event-Driven Architecture and Pub-Sub Patterns guides.

Key Takeaways:

  • Event Hubs for high-throughput Kafka-compatible event streaming
  • Data Factory for visual pipeline orchestration with 90+ connectors
  • Synapse for unified analytics (SQL warehouse + Spark + streaming)
  • Blob Storage with hierarchical namespace for modern data lake

Architecture Decision Guide:

  • Real-time ingestion: Event Hubs with Kafka protocol
  • ETL/orchestration: Data Factory with Mapping Data Flows
  • Enterprise DW: Synapse dedicated pools for heavy workloads
  • Ad-hoc analytics: Synapse serverless on Data Lake
  • Real-time analytics: Stream Analytics with SQL interface

Integration Benefits:

  • Native Power BI integration for reporting
  • Dynamics 365 connectors for CRM data
  • Azure Active Directory for authentication
  • Microsoft 365 data integration
