Azure Data Services: Data Factory, Synapse, Event Hubs, and Blob Storage
Azure has a comprehensive data platform that integrates well with the Microsoft ecosystem. If your organization already uses Microsoft products, Azure data services connect naturally with Power BI, Excel, and the rest of the stack.
This guide covers the key Azure data services and shows how to build modern data architectures on Azure.
Service Overview
| Service | Type | Use Case |
|---|---|---|
| Data Factory | Orchestration | Pipeline orchestration, ETL |
| Synapse Analytics | Data Warehouse | Enterprise analytics, BI |
| Event Hubs | Streaming | High-throughput event ingestion |
| Blob Storage | Object Storage | Data lake, unstructured storage |
| Databricks | Spark Analytics | ML, advanced analytics |
| Stream Analytics | Stream Processing | Real-time analytics on streams |
Azure Event Hubs
Event Hubs provides durable, partition-ordered event streaming. It also speaks the Kafka protocol, which is useful if you already have Kafka tooling.
Event Hubs Namespace and Hubs
from azure.eventhub import EventData, EventHubProducerClient, EventHubConsumerClient
import json

# Create producer
producer = EventHubProducerClient.from_connection_string(
    conn_str='Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=...',
    eventhub_name='click-events'
)

# Send events
def send_events(events):
    with producer:
        event_batch = producer.create_batch()
        for event in events:
            event_batch.add(EventData(json.dumps(event)))
        producer.send_batch(event_batch)

# Consumer
consumer = EventHubConsumerClient.from_connection_string(
    conn_str='Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=...',
    consumer_group='$Default',
    eventhub_name='click-events'
)

def on_event(partition_context, event):
    data = json.loads(event.body_as_str())
    process_event(data)  # your handler
    partition_context.update_checkpoint(event)

with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")
Event Hubs Kafka Compatibility
import json
from kafka import KafkaProducer

# Use the Kafka protocol to interact with Event Hubs (port 9093)
producer = KafkaProducer(
    bootstrap_servers='namespace.servicebus.windows.net:9093',
    security_protocol='SASL_SSL',
    sasl_mechanism='PLAIN',
    sasl_plain_username='$ConnectionString',
    sasl_plain_password='Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=...',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
producer.send('click-events', {'user_id': '123', 'action': 'page_view'})
Azure Data Factory
Data Factory provides visual pipeline orchestration with 90+ connectors.
Data Factory Pipeline
{
    "name": "ingest_and_transform",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToADLS",
                "type": "Copy",
                "inputs": [
                    { "referenceName": "BlobSourceDataset", "type": "DatasetReference" }
                ],
                "outputs": [
                    { "referenceName": "ADLSSinkDataset", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "DelimitedTextSink" }
                }
            },
            {
                "name": "RunDatabricksNotebook",
                "type": "DatabricksNotebook",
                "linkedServiceName": {
                    "referenceName": "DatabricksLinkedService",
                    "type": "LinkedServiceReference"
                },
                "parameters": {
                    "sourcePath": "events.parquet",
                    "outputPath": "processed/events"
                }
            }
        ]
    }
}
Mapping Data Flows
# Data Flow transformation (equivalent in ADF visual)
# Source: Blob Storage
# Derive: Add computed columns
# Select: Choose output columns
# Sink: Azure Synapse
# Mapping Data Flow script
dataflow = """
source(
    allowSchemaDrift: true,
    validateSchema: false,
    ignoreNoFilesFound: true
) ~> BlobSource

BlobSource derive(
    event_date = toDate(timestamp),
    event_hour = hour(timestamp),
    user_segment = iif(total_purchases > 10, 'high_value', 'regular')
) ~> DerivedColumns

DerivedColumns select(
    allowSchemaDrift: true,
    validateSchema: false,
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true
)
(
    output(
        event_id,
        user_id,
        event_type,
        timestamp,
        event_date,
        event_hour,
        user_segment
    )
) ~> SelectColumns

SelectColumns sink(
    type: 'synapse',
    LinkedServiceName: 'SynapseWorkspace',
    Table: 'dbo.events'
)
"""
Triggering Pipelines
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
client = DataFactoryManagementClient(credential, 'subscription-id')

# Trigger pipeline run
run_response = client.pipelines.create_run(
    resource_group_name='data-rg',
    factory_name='datafactory',
    pipeline_name='ingest_and_transform',
    parameters={
        'inputPath': 'raw/events/2026/03/27',
        'outputPath': 'processed/events/2026/03/27'
    }
)

# Poll until the run reaches a terminal state
while True:
    run = client.pipeline_runs.get(
        resource_group_name='data-rg',
        factory_name='datafactory',
        run_id=run_response.run_id
    )
    if run.status in ['Succeeded', 'Failed', 'Cancelled']:
        break
    time.sleep(30)
Azure Synapse Analytics
Synapse combines data warehousing and big data analytics in one platform.
Dedicated SQL Pool
import pyodbc

# Connect to Synapse (dedicated SQL pool)
server = 'synapse-workspace.sql.azuresynapse.net'
database = 'analytics'
username = '...'  # placeholders; load credentials from Key Vault, never hardcode
password = '...'

conn = pyodbc.connect(
    f'DRIVER={{ODBC Driver 17 for SQL Server}};'
    f'SERVER={server};'
    f'DATABASE={database};'
    f'UID={username};'
    f'PWD={password}'
)

# Create external table over files in the data lake
cursor = conn.cursor()
cursor.execute("""
    CREATE EXTERNAL TABLE dbo.user_events
    (
        event_id BIGINT,
        user_id VARCHAR(50),
        event_type VARCHAR(50),
        timestamp DATETIME2,
        properties VARCHAR(MAX)
    )
    WITH (
        LOCATION = 'processed/events/',
        DATA_SOURCE = 'datalake',
        FILE_FORMAT = 'parquet_format'
    )
""")
Serverless SQL Pool
# Query data directly in storage without moving it
import pandas as pd
query = """
SELECT
user_id,
event_type,
COUNT(*) as event_count,
MIN(timestamp) as first_seen,
MAX(timestamp) as last_seen
FROM OPENROWSET(
BULK 'processed/events/year=2026/month=03/*.parquet',
DATA_SOURCE = 'datalake',
FORMAT = 'PARQUET'
) AS events
GROUP BY user_id, event_type
ORDER BY event_count DESC
"""
df = pd.read_sql(query, conn)
Spark Integration
from pyspark.sql import SparkSession

# Inside a Synapse Spark pool the session is preconfigured with storage access
spark = SparkSession.builder \
    .appName("SynapseSpark") \
    .getOrCreate()

# Read from storage
df = spark.read \
    .parquet("abfss://datalake@storageaccount.dfs.core.windows.net/processed/events/")

# Process and write to a dedicated SQL pool via the Synapse connector
df.write \
    .mode("append") \
    .synapsesql("analytics.dbo.events")
Blob Storage and Data Lake
Azure Data Lake Storage (ADLS) provides tiered blob storage with hierarchical namespace.
ADLS Setup
from azure.storage.filedatalake import DataLakeServiceClient

# Create Data Lake service client
service_client = DataLakeServiceClient.from_connection_string(
    conn_str='DefaultEndpointsProtocol=https;AccountName=storageaccount;AccountKey=...'
)

# Create filesystem (container)
filesystem_client = service_client.create_file_system('datalake')

# Create directory
directory_client = filesystem_client.create_directory('raw/events')

# Upload file: read once, append, then flush with the byte count
file_client = directory_client.create_file('2026/03/27/events.parquet')
with open('events.parquet', 'rb') as f:
    data = f.read()
file_client.append_data(data, offset=0)
file_client.flush_data(len(data))
Data Lake Access Patterns
| Access Pattern | Storage Tier | Use Case |
|---|---|---|
| Hot | Blob Hot | Active processing |
| Cool | Blob Cool | Recent data, infrequent access |
| Archive | Blob Archive | Historical, rarely accessed |
| Premium | Block Blob | High IOPS requirements |
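As a rough illustration, the table above can be turned into a tier-selection policy. The day thresholds below are assumptions; tune them against your own access logs:

```python
def choose_blob_tier(days_since_last_access: int, high_iops: bool = False) -> str:
    """Map access recency to a storage tier (illustrative thresholds only)."""
    if high_iops:
        return 'Premium'      # high-IOPS workloads regardless of recency
    if days_since_last_access <= 30:
        return 'Hot'          # active processing
    if days_since_last_access <= 180:
        return 'Cool'         # recent, infrequently accessed
    return 'Archive'          # historical, rarely accessed

# Example:
print(choose_blob_tier(7))    # Hot
print(choose_blob_tier(90))   # Cool
print(choose_blob_tier(400))  # Archive
```

A lifecycle policy (next section) automates exactly this kind of mapping server-side.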
Lifecycle Management
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Lifecycle (tiering/deletion) rules are a management-plane operation,
# set on the storage account via azure-mgmt-storage - not a data-plane
# call on BlobServiceClient.
storage_client = StorageManagementClient(DefaultAzureCredential(), 'subscription-id')

storage_client.management_policies.create_or_update(
    resource_group_name='data-rg',
    account_name='storageaccount',
    management_policy_name='default',
    properties={
        'policy': {
            'rules': [{
                'name': 'tier-and-expire-raw',
                'enabled': True,
                'type': 'Lifecycle',
                'definition': {
                    'filters': {'blob_types': ['blockBlob'], 'prefix_match': ['raw/']},
                    'actions': {
                        'base_blob': {
                            'tier_to_cool': {'days_after_modification_greater_than': 30},
                            'tier_to_archive': {'days_after_modification_greater_than': 90},
                            'delete': {'days_after_modification_greater_than': 365}
                        }
                    }
                }
            }]
        }
    }
)
Azure Stream Analytics
Real-time stream processing with SQL-like queries.
Stream Analytics Job
-- Input: Event Hubs
-- Outputs: Blob Storage, SQL, Power BI
SELECT
    System.Timestamp() AS event_time,
    user_id,
    event_type,
    properties.device_type,
    properties.browser,
    CASE
        WHEN event_type = 'purchase' THEN CAST(properties.amount AS FLOAT)
        ELSE 0
    END AS purchase_amount
INTO
    [output-blob]
FROM
    [event-hubs-input]
WHERE
    event_type IN ('page_view', 'purchase', 'signup')
    AND timestamp IS NOT NULL

SELECT
    System.Timestamp() AS window_end,
    event_type,
    COUNT(*) AS event_count,
    COUNT(DISTINCT user_id) AS unique_users,
    SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases
INTO
    [real-time-dashboard]
FROM
    [event-hubs-input]
GROUP BY TumblingWindow(minute, 5), event_type
Machine Learning in Stream Analytics
-- Score events in real time with a registered ML function
-- (here MLmodel1 is an Azure ML function added to the Stream Analytics job)
SELECT
    event.*,
    MLmodel1(event.features) AS prediction
INTO
    [scored-output]
FROM
    [event-hubs-input] AS event
Architecture Patterns
Modern Data Warehouse on Azure
flowchart TD
A[Apps] -->|Event Hubs| B[Stream Analytics]
A -->|ADF| C[Blob Storage Raw]
B --> D[Synapse]
C -->|ADF| D
D --> E[Power BI]
D --> F[Databricks ML]
Lambda Architecture on Azure
# Speed Layer - Stream Analytics
speed_query = """
SELECT
    System.Timestamp() AS window_end,
    COUNT(*) AS event_count,
    AVG(response_time) AS avg_response_time
INTO
    [speed-output]
FROM
    [events-input]
GROUP BY TumblingWindow(second, 5)
"""

# Batch Layer - Data Factory + Synapse
batch_pipeline = {
    "activities": [
        {
            "name": "IngestToRaw",
            "type": "Copy",
            "source": {"type": "BlobSource"},
            "sink": {"type": "AzureSynapseSink"}
        },
        {
            "name": "AggregateBatch",
            "type": "SqlServerStoredProcedure",
            "linkedServiceName": {"type": "AzureSqlDatabase"},
            "storedProcedureName": "sp_aggregate_daily"
        }
    ]
}
Integration with Microsoft Ecosystem
Power BI Integration
# Direct query from Synapse to Power BI
# In Power BI Desktop:
# Get Data > Azure > Azure Synapse Analytics
# Enter server: synapse-workspace.sql.azuresynapse.net
# Select database and tables
# Or use Power BI XMLA endpoint
# Server: synapse-workspace.sql.azuresynapse.net:1433
Dynamics 365 Integration
# Connect to Dynamics 365 data
datafactory = DataFactoryManagementClient(
    credential,
    'subscription-id'
)

# Linked service for Dynamics 365; the password comes from Key Vault,
# never inline in the linked service definition
linked_service = {
    "name": "dynamics365",
    "properties": {
        "type": "Dynamics365",
        "typeProperties": {
            "deploymentType": "Online",
            "organizationServiceUrl": "https://org.api.crm.dynamics.com",
            "authenticationType": "Office365",
            "username": "user@company.com",
            "password": {
                "type": "AzureKeyVaultSecretReference",
                "store": {
                    "referenceName": "KeyVaultLinkedService",
                    "type": "LinkedServiceReference"
                },
                "secretName": "dynamics365-password"
            }
        }
    }
}
Cost Optimization
Key Azure-specific cost optimization:
- Use Synapse serverless for unpredictable workloads (pay per TB scanned)
- Switch to dedicated SQL pools for consistent, heavy workloads
- Enable Auto-pause on Synapse serverless when not in use
- Use Event Hubs capture to batch process instead of real-time when possible
- Implement Blob Storage lifecycle policies to auto-tier old data
- Consider Reserved Capacity for predictable workloads
- Use Data Factory integration runtimes efficiently (Azure IR vs self-hosted)
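The serverless-vs-dedicated guidance above boils down to a break-even comparison. A sketch using the rough list prices assumed elsewhere in this guide ($5/TB scanned, ~$3/hr per 100 DWU):

```python
def cheaper_synapse_model(tb_scanned_per_month: float, dwu: int,
                          price_per_tb: float = 5.0,
                          price_per_100dwu_hour: float = 3.0) -> str:
    """Compare serverless (per TB scanned) vs dedicated (per provisioned hour).

    Prices are rough assumptions; plug in current list prices for real planning.
    """
    serverless_cost = tb_scanned_per_month * price_per_tb
    dedicated_cost = (dwu / 100) * price_per_100dwu_hour * 730  # 730 hours/month
    return 'serverless' if serverless_cost < dedicated_cost else 'dedicated'

# 50 TB scanned/month vs a DW500c running 24/7:
print(cheaper_synapse_model(50, 500))  # serverless ($250 vs ~$10,950)
```

The crossover moves sharply if the dedicated pool is paused off-hours, which is why auto-pause appears in the list above.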
Azure Services Production Failure Scenarios
Event Hubs throttle from traffic spike
Event Hubs throughput units (TUs) cap ingress at 1 MB/s and egress at 2 MB/s per TU. A product launch doubles traffic and the provisioned TUs are exceeded. Incoming requests are throttled with ServerBusyException errors, and producers receive server-busy responses. Data is buffered client-side until capacity frees up, creating a backlog that takes hours to drain.
Mitigation: Monitor incoming_messages, throttled_requests, and server_busy_errors in Azure Monitor. Set autoscale rules on Event Hubs namespaces to add TUs based on traffic. Always implement retry logic with exponential backoff in producers.
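The retry-with-backoff advice can be sketched as follows. The `send_fn` parameter is a hypothetical stand-in for the SDK's send call, and the generic `Exception` catch stands in for the SDK's server-busy/quota errors; note the real azure-eventhub client also ships configurable built-in retries:

```python
import random
import time

def send_with_backoff(send_fn, batch, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a send on throttling errors with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return send_fn(batch)
        except Exception:  # in practice, catch the SDK's throttling exceptions
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids retry storms

# Example with a flaky fake sender: fails twice, then succeeds
calls = {'n': 0}
def flaky_send(batch):
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('ServerBusy')
    return 'ok'

print(send_with_backoff(flaky_send, ['event'], base_delay=0.05))  # ok
```

The jitter matters: without it, every throttled producer retries on the same schedule and the spike repeats.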
Data Factory pipeline runs for days and fails at the last step
A complex Data Factory pipeline with 40 activities runs for 48 hours. The final Databricks notebook step fails due to a workspace configuration change made 3 days earlier. The entire pipeline must be re-run from the beginning. No intermediate checkpoints exist. Business reports are delayed 2 days.
Mitigation: Re-run from the failed activity instead of the whole pipeline (Data Factory supports this from the monitoring view). Break long pipelines into smaller pipelines with trigger-based chaining. Store intermediate outputs in Blob Storage. Use validation activities before critical downstream steps.
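The skip-completed-stages idea can be sketched in a few lines. This is an in-memory stand-in: a real pipeline would persist the checkpoint markers to Blob Storage so a fresh run can read them:

```python
def run_pipeline(stages, completed):
    """Run stages in order, skipping any whose checkpoint is already recorded."""
    executed = []
    for name, fn in stages:
        if name in completed:
            continue            # checkpoint hit: skip work finished in a prior run
        fn()
        executed.append(name)
        completed.add(name)     # in real life, persist this marker durably
    return executed

def broken_notebook():
    raise RuntimeError('workspace config changed')

completed = set()
stages = [('ingest', lambda: None),
          ('transform', lambda: None),
          ('notebook', broken_notebook)]

try:
    run_pipeline(stages, completed)
except RuntimeError:
    pass                        # first run dies at the notebook step

stages[2] = ('notebook', lambda: None)  # config fixed
print(run_pipeline(stages, completed))  # ['notebook'] - earlier stages not repeated
```

With 48-hour pipelines, the difference between re-running one stage and re-running forty is the whole incident.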
Synapse DWU exhaustion during month-end batch window
Month-end batch jobs run in parallel with BI dashboards, consuming all Data Warehouse Units (DWUs). Analytical queries start timing out. Engineers do not discover the issue until batch window completes and DWU utilization drops. Meanwhile, data freshness SLA is missed.
Mitigation: Use Synapse workload importance to prioritize critical queries over batch jobs. Classify workloads with workload groups. Monitor DWU usage and queued-query metrics. Consider dedicated capacity for batch processing separate from interactive workloads.
Blob Storage lifecycle rule deletes data prematurely
A lifecycle rule is configured to delete raw data after 30 days. The rule runs on schedule and deletes files from the raw/events/ prefix. However, a compliance requirement is discovered 3 months later that mandates 7-year retention. The deleted data cannot be recovered. Audit findings and regulatory penalties follow.
Mitigation: Always use a cool or cold tier before deletion rules. Enable Blob soft delete (minimum 7 days). Test lifecycle rules in a staging environment before applying to production. Get sign-off on retention policies from legal before configuring deletion.
Stream Analytics job falls behind due to watermark on empty partition
A Stream Analytics job processes events from multiple Event Hubs partitions. One partition goes silent for 2 hours (no events from a device category). The watermark for that partition does not advance. All windowed aggregates hold partial data and do not output. When the silent partition resumes, late events from 2 hours ago arrive and are dropped by default.
Mitigation: Configure the out-of-order tolerance window, and set late-arrival handling to adjust events rather than drop them. Monitor the watermark delay metric in Azure Monitor. When partitions go permanently inactive, adjust the partition count the Stream Analytics job reads so no partition stays silent.
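A toy model makes the stall visible: if the job watermark can only advance to the slowest partition's latest event time, one silent partition pins every window open. This is pure Python, not the actual Stream Analytics internals:

```python
def job_watermark(partition_max_event_times: dict) -> int:
    """A job-level watermark can only advance to the slowest partition's time."""
    return min(partition_max_event_times.values())

# Three partitions: two are current, partition 2 went silent at t=100
partitions = {0: 7200, 1: 7195, 2: 100}
print(job_watermark(partitions))  # 100 -> windows ending after t=100 cannot close

# Once the silent partition resumes (or is treated as idle), the watermark jumps
partitions[2] = 7190
print(job_watermark(partitions))  # 7190
```

This is why a single quiet device category can freeze aggregates for the entire job.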
Azure Services Trade-Offs
| Service | When to Use | Key Risk |
|---|---|---|
| Event Hubs | High-throughput Kafka-compatible streaming | TU limits; throttling under surprise load |
| Data Factory | Visual ETL with 90+ connectors | Complex pipelines hard to debug; verbose JSON |
| Synapse Dedicated | Predictable DW at scale, complex joins | Cost; cold resume after pause |
| Synapse Serverless | Unpredictable ad-hoc queries | Per-TB scanning cost surprises |
| Blob Storage | Data lake with hierarchical namespace | Lifecycle rules irreversible without soft delete |
| Databricks | ML workloads, advanced Spark | Cost; requires specific expertise |
| Stream Analytics | Simple real-time SQL on streams | Limited windowing flexibility vs Dataflow |
Azure vs AWS vs GCP Data Services
| Capability | Azure | AWS | GCP |
|---|---|---|---|
| Streaming | Event Hubs | Kinesis Data Streams | Pub/Sub |
| ETL/Orchestration | Data Factory | Glue | Dataflow (templates) |
| Data Warehouse | Synapse | Redshift | BigQuery |
| Object Storage | Blob Storage | S3 | Cloud Storage |
| Stream Processing | Stream Analytics | Kinesis Data Analytics | Dataflow |
| Spark Analytics | Databricks | EMR | Dataproc |
Azure Services Capacity Estimation
Event Hubs Throughput Units
import math

def estimate_event_hubs_tu(
    messages_per_second: int,
    avg_message_size_kb: float
) -> dict:
    """
    Estimate Event Hubs throughput units (TUs).
    1 TU = 1 MB/s ingress, 2 MB/s egress.
    """
    ingress_mbps = (messages_per_second * avg_message_size_kb) / 1024
    tus_needed = max(1, math.ceil(ingress_mbps))  # round up: 1 MB/s per TU
    return {
        'messages_per_second': messages_per_second,
        'ingress_mbps': round(ingress_mbps, 2),
        'tus_recommended': math.ceil(tus_needed * 1.5),  # 50% headroom
        'autoscale_max': tus_needed * 3,  # burst to 3x
        'estimated_monthly_cost': math.ceil(tus_needed * 1.5) * 11,  # ~$11/TU per month
        'note': 'Use autoscale to handle traffic spikes'
    }

# Example:
result = estimate_event_hubs_tu(5000, 2)
# ~9.8 MB/s ingress, 15 TUs recommended, ~$165/month
Synapse DWU Planning
def estimate_synapse_dwu(
    data_size_tb: float,
    concurrent_queries: int = 5,
    warehouse_type: str = 'dedicated'
) -> dict:
    """
    Estimate Synapse DWU (Data Warehouse Unit) requirements.
    Prices below are rough list-price assumptions.
    """
    if warehouse_type == 'serverless':
        return {
            'type': 'serverless',
            'cost_model': 'per TB scanned',
            'estimated_cost_per_tb': 5,  # ~$5/TB scanned
            'note': 'Pay per query; no upfront capacity planning needed'
        }
    # Dedicated: ~$3/hr per 100 DWU, i.e. ~$2,190 per 100 DWU per month
    dwu_needed = max(100, concurrent_queries * 100)  # ~100 DWU per concurrent query
    monthly_cost = (dwu_needed / 100) * 3 * 730  # 730 hours/month
    return {
        'data_size_tb': data_size_tb,
        'concurrent_queries': concurrent_queries,
        'dwu_recommended': dwu_needed,
        'estimated_monthly_cost': round(monthly_cost, 0),
        'note': 'Pause when idle to save costs (auto-pause recommended)'
    }

# Example:
result = estimate_synapse_dwu(10, 8, 'dedicated')
# ~800 DWU, ~$17,520/month if run 24/7 - pausing off-hours cuts this sharply
Azure Services Observability Hooks
| Service | Metric | Alert Threshold |
|---|---|---|
| Event Hubs | throttled_requests | > 0 for 5m |
| Event Hubs | incoming_messages | sudden drop to 0 |
| Data Factory | ActivityRunsPending | > 100 runs queued |
| Data Factory | PipelineRunDuration | > expected duration × 2 |
| Synapse | DWU used percent | > 90% for 15m |
| Synapse | Queued queries | > 0 sustained |
| Blob Storage | BlobCount | unexpected growth > 20% |
| Stream Analytics | watermarkDelay | > 30s |
| Stream Analytics | InputEventCount | sudden drop to 0 |
Azure Services Security Checklist
- Azure AD (Entra ID) used for authentication — no local credentials
- Event Hubs uses Microsoft Entra ID for authorization, not SAS policies
- Blob Storage containers with hierarchical namespace block public access; use RBAC
- Synapse workspace uses managed private endpoints for Azure Storage
- Data Factory managed Virtual Network enabled for ADF integration runtime
- Azure Key Vault used for secrets (connection strings, keys); not in code or ADF linked services
- Azure Defender for Storage enabled on production storage accounts
- Azure Monitor diagnostic settings route to Log Analytics with 90-day retention
- Synapse audit logs enabled and exported to Azure Sentinel
- No firewall rules set to “Allow all” on Event Hubs or Synapse endpoints
Azure Services Anti-Patterns
Using Data Factory for complex transformations. Data Factory is an orchestrator, not a transformation engine. Complex multi-step logic in Data Flow scripts is hard to debug, version, and test. Use Data Factory to orchestrate; push transformations to Spark (Databricks, Synapse Spark) or SQL.
Running Synapse dedicated pools 24/7. Dedicated SQL pools can be paused when not in use. A DW1000c pool left running around the clock costs on the order of $10,000+/month at list price; pausing it outside business hours can cut the bill by more than half.
Not using Event Hubs Capture. Without Capture configured, raw events are lost after retention period. Capture to Blob Storage provides a free, durable backup that also enables replay and batch processing.
Over-provisioning Synapse DWUs. Starting at high DWU levels for small datasets is wasteful. Scale down and measure actual utilization before scaling up.
Ignoring Stream Analytics ordering guarantees. Stream Analytics provides at-least-once processing. Designing pipelines that assume exactly-once leads to duplicate records in sinks. Always design for idempotent writes.
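The idempotent-write advice can be sketched with an in-memory sink keyed on event_id; a real sink would use a database UPSERT or MERGE on the same key:

```python
def idempotent_write(sink: dict, event: dict) -> None:
    """Upsert by event_id: a redelivered event overwrites instead of duplicating."""
    sink[event['event_id']] = event

sink = {}
events = [
    {'event_id': 'e1', 'action': 'purchase'},
    {'event_id': 'e2', 'action': 'page_view'},
    {'event_id': 'e1', 'action': 'purchase'},  # at-least-once redelivery
]
for e in events:
    idempotent_write(sink, e)

print(len(sink))  # 2 rows, not 3
```

With an append-only sink the same input would yield three rows, and downstream counts would silently inflate.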
Quick Recap
- Event Hubs provides Kafka-compatible high-throughput streaming; monitor throttled requests and set autoscale rules
- Data Factory orchestrates pipelines with 90+ connectors; break long pipelines into smaller units with checkpointing
- Synapse Analytics offers dedicated pools for predictable heavy workloads and serverless for unpredictable ad-hoc queries
- Blob Storage with hierarchical namespace is the foundation; always enable soft delete before configuring lifecycle deletion
- Monitor Event Hubs TU utilization, Data Factory queue depth, Synapse DWU usage, and Stream Analytics watermark delay
- Use Azure AD for all authentication; Key Vault for all secrets; never hardcode credentials
For more on data pipeline patterns, see our Event-Driven Architecture and Pub-Sub Patterns guides.
Key Takeaways:
- Event Hubs for high-throughput Kafka-compatible event streaming
- Data Factory for visual pipeline orchestration with 90+ connectors
- Synapse for unified analytics (SQL warehouse + Spark + streaming)
- Blob Storage with hierarchical namespace for modern data lake
Architecture Decision Guide:
- Real-time ingestion: Event Hubs with Kafka protocol
- ETL/orchestration: Data Factory with Mapping Data Flows
- Enterprise DW: Synapse dedicated pools for heavy workloads
- Ad-hoc analytics: Synapse serverless on Data Lake
- Real-time analytics: Stream Analytics with SQL interface
Integration Benefits:
- Native Power BI integration for reporting
- Dynamics 365 connectors for CRM data
- Azure Active Directory for authentication
- Microsoft 365 data integration