AWS Data Services: Kinesis, Glue, Redshift, Athena, and S3 for Data Engineering
AWS has a sprawling set of data services: streaming ingestion, batch processing, data warehousing, and ad-hoc querying. Picking the right combination for your use case can be overwhelming.
This guide breaks down each service, explains when to use which, and shows how to connect them into a coherent data architecture.
Service Overview
| Service | Type | Use Case |
|---|---|---|
| Kinesis | Streaming | Real-time data ingestion |
| Glue | ETL | Data transformation and cataloging |
| Redshift | Data Warehouse | Analytical queries, BI |
| Athena | Query Service | Ad-hoc SQL on S3 |
| S3 | Object Storage | Data lake, raw storage |
Amazon Kinesis
Kinesis is AWS’s managed streaming platform. If you need real-time processing of data streams, this is where you start.
Kinesis Data Streams vs Kafka
Kinesis Data Streams is often compared to Apache Kafka. They solve similar problems but have different trade-offs:
| Feature | Kinesis | Kafka (MSK) |
|---|---|---|
| Pricing | Shard-hour + data | Instance-hour + storage |
| Scaling | Split/merge shards | Add partitions |
| Retention | Up to 365 days | Configurable, unlimited |
| Throughput | 1MB/s write, 2MB/s read per shard | Hardware-dependent per partition |
| Encryption | Server-side (KMS) | KMS at rest, TLS in transit |
Kinesis Data Streams
import boto3
import json

kinesis = boto3.client('kinesis')

def produce_to_kinesis(stream_name, data):
    """Produce a single record to Kinesis"""
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(data),
        PartitionKey=data.get('id', 'default')
    )

def produce_batch(stream_name, records):
    """Produce multiple records"""
    kinesis.put_records(
        StreamName=stream_name,
        Records=[
            {
                'Data': json.dumps(r['data']),
                'PartitionKey': r.get('partition_key', 'default')
            }
            for r in records
        ]
    )
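Note that `put_records` is not all-or-nothing: it reports per-record failures in its response instead of raising, so a production producer should re-send only the rejected subset. A minimal sketch (the helper names are my own, not an AWS API):

```python
def failed_subset(entries, response):
    """Return the request entries Kinesis rejected, matched positionally
    against the per-record results in the PutRecords response."""
    if response.get('FailedRecordCount', 0) == 0:
        return []
    return [entry for entry, result in zip(entries, response['Records'])
            if 'ErrorCode' in result]

def put_records_with_retry(kinesis_client, stream_name, entries, attempts=3):
    """Re-send only the failed entries, up to `attempts` rounds.
    kinesis_client is a boto3 Kinesis client."""
    for _ in range(attempts):
        if not entries:
            return
        response = kinesis_client.put_records(StreamName=stream_name,
                                              Records=entries)
        entries = failed_subset(entries, response)
    if entries:
        raise RuntimeError(
            f"{len(entries)} records still failing after {attempts} attempts")
```

In practice you would also add backoff between rounds; the fixed retry count here is an illustrative simplification.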
# Consumer with the KCL for Python (amazon_kclpy, run under the MultiLangDaemon)
from amazon_kclpy import kcl

class MyProcessor(kcl.RecordProcessorBase):
    def initialize(self, initialize_input):
        pass
    def process_records(self, process_records_input):
        for record in process_records_input.records:
            process_data(json.loads(record.binary_data))
    def lease_lost(self, lease_lost_input):
        pass
    def shard_ended(self, shard_ended_input):
        shard_ended_input.checkpointer.checkpoint()
    def shutdown_requested(self, shutdown_requested_input):
        shutdown_requested_input.checkpointer.checkpoint()
Kinesis Data Firehose
Firehose automatically loads streaming data to destinations:
import boto3

firehose = boto3.client('firehose')

# Create delivery stream
firehose.create_delivery_stream(
    DeliveryStreamName='click-events',
    DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={
        'BucketARN': 'arn:aws:s3:::data-lake-raw',
        'Prefix': 'click-events/year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/',
        'BufferingHints': {
            'SizeInMBs': 128,
            'IntervalInSeconds': 900
        },
        'CompressionFormat': 'GZIP',
        'EncryptionConfiguration': {
            'NoEncryptionConfig': 'NoEncryption'  # or supply KMSEncryptionConfig instead
        },
        'RoleARN': 'arn:aws:iam::123456789:role/firehose-role'
    }
)
When to Use Kinesis
Use Kinesis when you need real-time data ingestion (click streams, IoT, logs), want built-in consumers with KCL, or want AWS to handle shard management. It integrates naturally with other AWS services.
Consider Kafka when you need exactly-once semantics (Kinesis is at-least-once), schema registry integration, existing Kafka expertise on your team, or access to the broader Kafka ecosystem and tooling.
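Because Kinesis delivers at-least-once, consumers must tolerate duplicates. One lightweight approach is deduplicating on the record sequence number with a bounded LRU window. A sketch, not library code (the class name and window size are illustrative; size the window to your redelivery horizon):

```python
from collections import OrderedDict

class SequenceDeduper:
    """Drop records already seen, keyed by Kinesis sequence number.
    Memory is bounded by evicting the oldest entries past `window`."""

    def __init__(self, window=100_000):
        self.window = window
        self._seen = OrderedDict()

    def is_new(self, sequence_number):
        if sequence_number in self._seen:
            self._seen.move_to_end(sequence_number)  # refresh recency
            return False
        self._seen[sequence_number] = True
        if len(self._seen) > self.window:
            self._seen.popitem(last=False)  # evict oldest
        return True
```

Inside a consumer loop you would process a record only when `is_new(record.sequence_number)` returns True; for stronger guarantees, persist the dedup state (e.g., in DynamoDB) instead of keeping it in memory.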
AWS Glue
Glue provides serverless ETL with a managed Spark environment.
Glue Data Catalog
The Glue Data Catalog is a central metadata repository:
import boto3

glue = boto3.client('glue')

# Create database
glue.create_database(
    DatabaseInput={
        'Name': 'analytics',
        'Description': 'Analytics data warehouse'
    }
)

# Create table
glue.create_table(
    DatabaseName='analytics',
    TableInput={
        'Name': 'page_views',
        'StorageDescriptor': {
            'Location': 's3://data-lake/processed/page_views/',
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
            },
            'Columns': [
                {'Name': 'user_id', 'Type': 'string'},
                {'Name': 'page_url', 'Type': 'string'},
                {'Name': 'timestamp', 'Type': 'timestamp'},
                {'Name': 'duration_ms', 'Type': 'int'}
            ]
        },
        'PartitionKeys': [
            {'Name': 'year', 'Type': 'string'},
            {'Name': 'month', 'Type': 'string'},
            {'Name': 'day', 'Type': 'string'}
        ]
    }
)
Glue ETL Jobs
# Glue job script (PySpark)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="raw",
    table_name="events",
    transformation_ctx="datasource0"
)

# Transform
transformed = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("user_id", "string", "user_id", "string"),
        ("event_type", "string", "event_type", "string"),
        ("timestamp", "string", "timestamp", "timestamp"),
        ("properties", "map<string,string>", "properties", "map<string,string>")
    ],
    transformation_ctx="transformed"
)

# Write to S3 (the S3 connection option is "partitionKeys", not "partitionBy")
datasink = glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={
        "path": "s3://data-lake/processed/events/",
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet",
    transformation_ctx="datasink"
)

job.commit()
Glue Crawlers
Crawlers automatically discover schema:
glue.create_crawler(
    Name='events-crawler',
    Role='arn:aws:iam::123456789:role/glue-crawler-role',
    DatabaseName='raw',
    Targets={
        'S3Targets': [
            {
                'Path': 's3://data-lake/raw/events/',
                'Exclusions': ['**/*.tmp']
            }
        ]
    },
    Schedule='cron(0 2 * * ? *)',
    SchemaChangePolicy={
        'UpdateBehavior': 'LOG',
        'DeleteBehavior': 'LOG'
    }
)

# Run crawler
glue.start_crawler(Name='events-crawler')
Amazon Redshift
Redshift is a petabyte-scale data warehouse.
Redshift Architecture
import boto3

redshift = boto3.client('redshift')

# Create Redshift cluster
cluster = redshift.create_cluster(
    ClusterIdentifier='analytics-cluster',
    NodeType='dc2.large',
    NumberOfNodes=3,
    MasterUsername='admin',
    MasterUserPassword='SecurePassword123',  # in production, pull this from Secrets Manager
    DBName='analytics',
    ClusterSubnetGroupName='private-subnet',
    VpcSecurityGroupIds=['sg-12345678'],
    EnhancedVpcRouting=True,
    MaintenanceTrackName='current',
    Tags=[
        {'Key': 'Environment', 'Value': 'production'}
    ]
)
Redshift Serverless
For simpler workloads, Redshift Serverless manages capacity automatically:
# Create Redshift Serverless namespace
redshift_serverless = boto3.client('redshift-serverless')

namespace = redshift_serverless.create_namespace(
    namespaceName='analytics',
    adminUsername='admin',
    adminUserPassword='SecurePassword123',
    iamRoles=['arn:aws:iam::123456789:role/redshift-role'],
    tags=[
        {'key': 'Environment', 'value': 'production'}
    ]
)

# Create workgroup
workgroup = redshift_serverless.create_workgroup(
    workgroupName='analytics-wg',
    namespaceName='analytics',
    baseCapacity=32,  # RPUs
    enhancedVpcRouting=True
)
Loading Data to Redshift
# boto3's `redshift` client manages clusters, not SQL; run COPY/UNLOAD
# through the Redshift Data API instead
redshift_data = boto3.client('redshift-data')

# COPY from S3
redshift_data.execute_statement(
    ClusterIdentifier='analytics-cluster',
    Database='analytics',
    DbUser='admin',
    Sql="""
        COPY page_views
        FROM 's3://data-lake/processed/page_views/'
        IAM_ROLE 'arn:aws:iam::123456789:role/redshift-role'
        FORMAT AS PARQUET;
    """
)

# UNLOAD to S3
redshift_data.execute_statement(
    ClusterIdentifier='analytics-cluster',
    Database='analytics',
    DbUser='admin',
    Sql="""
        UNLOAD ('SELECT * FROM page_views WHERE year = ''2026''')
        TO 's3://data-lake-archive/page_views/'
        IAM_ROLE 'arn:aws:iam::123456789:role/redshift-role'
        FORMAT AS PARQUET
        MANIFEST;
    """
)
Amazon Athena
Athena lets you query data directly in S3 using SQL.
Querying S3 Data
import time
import boto3

athena = boto3.client('athena')

def query_athena(sql, database='analytics'):
    """Execute a query and return the raw result set"""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={
            'OutputLocation': 's3://athena-results/'
        }
    )
    query_execution_id = response['QueryExecutionId']

    # Poll for completion (sleep between polls instead of spinning)
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_execution_id)
        state = status['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(1)

    if state == 'SUCCEEDED':
        # get_query_results is paginated; follow NextToken for large result sets
        return athena.get_query_results(QueryExecutionId=query_execution_id)
    raise Exception(
        f"Query failed: {status['QueryExecution']['Status']['StateChangeReason']}"
    )
Athena with Glue Data Catalog
-- Query using Glue catalog
SELECT
date_trunc('hour', timestamp) as hour,
count(*) as page_views,
count(distinct user_id) as unique_users
FROM analytics.page_views
WHERE timestamp >= current_date - interval '7' day
GROUP BY date_trunc('hour', timestamp)
ORDER BY hour DESC
LIMIT 100
Performance Tuning
-- Partition pruning
SELECT * FROM page_views
WHERE year = '2026' AND month = '03' AND day = '27'
-- Use columnar formats (Parquet)
CREATE TABLE page_views_parquet WITH (
format = 'PARQUET',
external_location = 's3://data-lake/processed/page_views/'
) AS
SELECT * FROM page_views
-- Optimize joins
SELECT a.user_id, b.purchase_count
FROM users a
JOIN (
SELECT user_id, count(*) as purchase_count
FROM purchases
GROUP BY user_id
) b ON a.user_id = b.user_id
Amazon S3 for Data Lakes
S3 is the foundation of most AWS data architectures.
S3 Data Lake Structure
import boto3

s3 = boto3.client('s3')

# Create data lake bucket structure
buckets = [
    'data-lake-raw',
    'data-lake-processed',
    'data-lake-archive',
    'data-lake-analytics'
]
for bucket in buckets:
    # outside us-east-1, also pass CreateBucketConfiguration={'LocationConstraint': region}
    s3.create_bucket(Bucket=bucket)

# Configure lifecycle rules
s3.put_bucket_lifecycle_configuration(
    Bucket='data-lake-raw',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'Raw data lifecycle',
                'Status': 'Enabled',
                'Filter': {'Prefix': ''},
                'Transitions': [
                    {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                    {'Days': 90, 'StorageClass': 'GLACIER'},
                    {'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'}
                ]
            }
        ]
    }
)
S3 Access Patterns
| Access Pattern | Recommended Storage | Service |
|---|---|---|
| Raw ingestion | S3 Standard | Kinesis Firehose |
| Immediate processing | S3 Standard | Glue, EMR |
| Processed data | S3 Standard-IA | Athena, Redshift |
| Historical archive | S3 Glacier | Athena (via Glacier) |
| Analytics output | S3 Standard | BI tools |
Architecture Patterns
Lambda-Kinesis-Athena Pattern
flowchart TD
A[Application] -->|Kinesis| B[Kinesis Data Streams]
B --> C[Lambda Consumer]
C -->|Raw Data| D[S3 Raw Bucket]
D --> E[Glue Crawler]
E --> F[Glue Data Catalog]
F --> G[Athena Queries]
G --> H[Redshift]
Data Lake with Redshift Spectrum
# Create an external schema for Redshift Spectrum (via the Redshift Data API)
redshift_data = boto3.client('redshift-data')
redshift_data.execute_statement(
    ClusterIdentifier='analytics-cluster',
    Database='analytics',
    DbUser='admin',
    Sql="""
        CREATE EXTERNAL SCHEMA spectrum
        FROM DATA CATALOG
        DATABASE 'analytics'
        IAM_ROLE 'arn:aws:iam::123456789:role/redshift-role'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """
)
# Query S3 data directly from Redshift
query = """
SELECT
a.user_id,
a.total_purchases,
b.page_view_count
FROM spectrum.user_purchases a
JOIN spectrum.user_page_views b ON a.user_id = b.user_id
WHERE a.signup_date >= '2026-01-01'
"""
Cost Optimization
See our Cost Optimization guide for detailed strategies, but key AWS-specific tips:
- Use Kinesis Data Streams shard hours efficiently (consolidate low-traffic streams)
- Enable S3 Intelligent Tiering for unpredictable access patterns
- Use Glacier for data that is rarely accessed but must be retained
- Consider Redshift Serverless for variable workloads
- Use Athena workgroups to isolate query costs by team
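The last tip can be enforced in code: Athena workgroups accept a `BytesScannedCutoffPerQuery` limit, so a runaway query fails fast instead of scanning (and billing for) an entire table. A sketch; the helper and workgroup names are illustrative:

```python
def athena_workgroup_config(output_location, scan_limit_gb):
    """Build an Athena WorkGroupConfiguration that caps the data
    scanned by any single query in this workgroup."""
    return {
        'ResultConfiguration': {'OutputLocation': output_location},
        'BytesScannedCutoffPerQuery': scan_limit_gb * 1024 ** 3,  # GB -> bytes
        'EnforceWorkGroupConfiguration': True,   # clients cannot override
        'PublishCloudWatchMetricsEnabled': True  # per-workgroup cost metrics
    }

# athena = boto3.client('athena')
# athena.create_work_group(
#     Name='team-analytics',
#     Configuration=athena_workgroup_config('s3://athena-results/team/', 100)
# )
```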
AWS Services Production Failure Scenarios
Kinesis shard limit causes data loss during traffic spike
A marketing campaign doubles traffic overnight. A Kinesis stream with 10 shards hits its 10MB/s aggregate write limit, and producers start receiving ProvisionedThroughputExceededException. Records without retry logic are dropped, while the records that do get through pile up faster than consumers can drain them: consumer lag grows from 0 to 50 million records. By the time engineers split shards manually, 3 hours of data has either queued or been lost, and downstream dashboards show stale numbers.
Mitigation: Monitor the PutRecords.SuccessfulRecords and PutRecords.FailedRecords metrics. Split shards proactively before hitting limits. Set CloudWatch alarms on ReadProvisionedThroughputExceeded and WriteProvisionedThroughputExceeded. Consider enhanced fan-out for high-throughput consumers.
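Those throttling alarms can be created with CloudWatch's `put_metric_alarm`. The helper below builds the keyword arguments; the helper name, thresholds, and SNS topic ARN are assumptions to adapt:

```python
def throttle_alarm(stream_name, metric, sns_topic_arn):
    """Kwargs for cloudwatch.put_metric_alarm: fire when any throttling
    is recorded on the stream over five consecutive 60s periods."""
    return {
        'AlarmName': f'{stream_name}-{metric}',
        'Namespace': 'AWS/Kinesis',
        'MetricName': metric,
        'Dimensions': [{'Name': 'StreamName', 'Value': stream_name}],
        'Statistic': 'Sum',
        'Period': 60,
        'EvaluationPeriods': 5,
        'Threshold': 0,
        'ComparisonOperator': 'GreaterThanThreshold',
        'AlarmActions': [sns_topic_arn]
    }

# cloudwatch = boto3.client('cloudwatch')
# for metric in ('ReadProvisionedThroughputExceeded',
#                'WriteProvisionedThroughputExceeded'):
#     cloudwatch.put_metric_alarm(
#         **throttle_alarm('click-events', metric, data_alerts_topic_arn))
```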
Redshift cluster fails during peak BI query window
A 3-node dc2.large Redshift cluster loses one node to a hardware issue. The cluster enters recovery mode, remapping slices and rebuilding data. During the 15-minute recovery, BI queries queue up and users see timeouts. After recovery, table statistics are stale and query performance stays degraded for hours until ANALYZE runs.
Mitigation: Enable Multi-AZ deployment (RA3 node types) for production clusters. Run ANALYZE as part of your pipeline after data loads. Monitor HealthStatus and MaintenanceMode in CloudWatch. Use Redshift Serverless if the operational overhead of managing clusters is a concern.
Glue job timeout leaves data in inconsistent state
A Glue job running a 6-hour transformation hits the 12-hour timeout limit and terminates mid-write. The target S3 prefix now contains partial data — 70% of expected records with no indicator of incompleteness. Downstream Athena queries return incomplete results with no error. Engineers do not discover the issue until a business user notices Q3 numbers are wrong.
Mitigation: Implement checkpointing in Glue jobs. Write to a staging prefix first, then use an S3 event trigger to only promote data after a validation step confirms completeness. Add row count validation after every Glue write.
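The staging-then-promote pattern can be as simple as: write to a staging prefix, validate row counts, then copy objects to the published prefix that downstream tables point at. A sketch with hypothetical helper names (`validate_row_count`, `promote_staging`):

```python
def validate_row_count(expected, actual, tolerance=0.0):
    """True when the actual row count is within `tolerance`
    (a fraction) of the expected count."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / expected <= tolerance

def promote_staging(s3_client, bucket, staging_prefix, final_prefix):
    """Copy every object from the staging prefix to the final prefix.
    A job that dies mid-write leaves staging dirty but never corrupts
    the published prefix, because promotion only runs after validation."""
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=staging_prefix):
        for obj in page.get('Contents', []):
            dest = final_prefix + obj['Key'][len(staging_prefix):]
            s3_client.copy_object(
                Bucket=bucket,
                Key=dest,
                CopySource={'Bucket': bucket, 'Key': obj['Key']}
            )
```

The Glue job would call `validate_row_count(source_count, written_count)` after its write and only then call `promote_staging`; how you obtain the counts (Spark `count()`, S3 Select, etc.) is up to the pipeline.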
S3 object versions proliferate from versioning misconfiguration
S3 versioning was enabled on a data lake bucket. A daily Glue job overwrites files in place instead of writing to date-partitioned paths. Each overwrite creates a new version. Within 3 months, 2 million versions exist. S3 LIST operations slow to 30+ seconds. Storage costs are 15x expected. Engineers cannot identify which versions are active without scanning all 2 million.
Mitigation: Use S3 Intelligent Tiering or lifecycle rules to clean up noncurrent versions automatically. Prefer write-to-new-path patterns over overwrite-in-place when using versioning. Monitor NumberOfObjects and BucketSizeBytes per prefix.
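A lifecycle rule that expires noncurrent versions stops the version explosion automatically. The rule below is a sketch (the helper name and retention values are illustrative defaults); apply it via `put_bucket_lifecycle_configuration` as shown earlier:

```python
def noncurrent_cleanup_rule(days=30, keep_versions=3):
    """Lifecycle rule: keep the newest `keep_versions` noncurrent
    copies of each object, expire the rest after `days` days, and
    purge delete markers left with no versions behind them."""
    return {
        'ID': 'expire-noncurrent-versions',
        'Status': 'Enabled',
        'Filter': {'Prefix': ''},
        'NoncurrentVersionExpiration': {
            'NoncurrentDays': days,
            'NewerNoncurrentVersions': keep_versions
        },
        'Expiration': {'ExpiredObjectDeleteMarker': True}
    }

# s3 = boto3.client('s3')
# s3.put_bucket_lifecycle_configuration(
#     Bucket='data-lake-raw',
#     LifecycleConfiguration={'Rules': [noncurrent_cleanup_rule()]}
# )
```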
Athena query times out on large unpartitioned table
An analyst runs a SELECT * on a 500GB unpartitioned table expecting to filter results. Athena scans all 500GB, hits the 30-minute query timeout, and returns no results. The same query run against a properly partitioned table (by date) completes in 8 seconds.
Mitigation: Enforce partition usage via Lake Formation permissions or table design standards. Set per-query data-scanned limits (BytesScannedCutoffPerQuery) on Athena workgroups. Create CloudWatch alarms on query failures. Run MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION after adding new partitions.
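For tables with many partitions, registering each new partition explicitly is much cheaper than a full MSCK REPAIR TABLE scan. A hypothetical helper that builds the DDL, which you would then run through the Athena query API:

```python
def add_partition_sql(table, partition, location):
    """Build an Athena/Hive ALTER TABLE statement that registers one
    partition (e.g. {'year': '2026', 'month': '03'}) at an S3 location."""
    spec = ', '.join(f"{key} = '{value}'" for key, value in partition.items())
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({spec}) LOCATION '{location}'")

# athena = boto3.client('athena')
# athena.start_query_execution(
#     QueryString=add_partition_sql(
#         'page_views', {'year': '2026', 'month': '03', 'day': '27'},
#         's3://data-lake/processed/page_views/year=2026/month=03/day=27/'),
#     QueryExecutionContext={'Database': 'analytics'},
#     ResultConfiguration={'OutputLocation': 's3://athena-results/'}
# )
```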
AWS Services Capacity Estimation
Kinesis Shard Calculations
import math

def estimate_kinesis_shards(
    records_per_second: int,
    avg_record_size_kb: float,
    consumers: int = 1
) -> dict:
    """
    Estimate required Kinesis shards.
    Per-shard limits: 1MB/s or 1,000 records/s write; 2MB/s read,
    shared across all consumers (enhanced fan-out lifts the sharing).
    """
    write_throughput_mbps = (records_per_second * avg_record_size_kb) / 1024
    shards_for_write = max(
        math.ceil(write_throughput_mbps),      # 1MB/s per shard
        math.ceil(records_per_second / 1000)   # 1,000 records/s per shard
    )
    # Shared-throughput reads: all consumers split 2MB/s per shard
    shards_for_read = math.ceil(write_throughput_mbps * consumers / 2)
    total_shards = max(shards_for_write, shards_for_read)
    recommended = math.ceil(total_shards * 1.2)  # 20% headroom
    return {
        'records_per_second': records_per_second,
        'write_throughput_mbps': round(write_throughput_mbps, 2),
        'shards_for_write': shards_for_write,
        'shards_for_read': shards_for_read,
        'recommended_shards': recommended,
        # ~$0.015/shard-hour on-demand (us-east-1), ~730 hours/month
        'estimated_monthly_cost': round(recommended * 0.015 * 730, 2)
    }

# Example: 5,000 records/sec, 2KB each, 3 shared-throughput consumers
result = estimate_kinesis_shards(5000, 2, 3)
# Recommended shards: 18
# Estimated monthly cost: ~$197
Redshift Node Sizing
import math

def estimate_redshift_nodes(
    data_size_tb: float,
    query_concurrency: int = 5,
    compressed: bool = True
) -> dict:
    """
    Estimate Redshift node count from storage needs.
    dc2.large: ~0.16TB SSD per node; ra3 scales compute and managed
    storage independently (~48TB usable per node assumed here).
    query_concurrency drives WLM/concurrency-scaling setup, not node count.
    """
    compression_factor = 4 if compressed else 1  # typical columnar compression ratio
    effective_size_tb = data_size_tb / compression_factor  # compression shrinks on-disk size
    dc2_nodes = max(2, math.ceil(effective_size_tb / 0.16))
    ra3_nodes = max(1, math.ceil(effective_size_tb / 48))
    return {
        'raw_data_tb': data_size_tb,
        'effective_size_tb': effective_size_tb,
        'dc2_large_nodes': dc2_nodes,
        'ra3_nodes': ra3_nodes,
        'note': 'DC2 for small hot datasets; RA3 to scale storage and compute separately'
    }

# Example: 10TB raw data, columnar compression expected
result = estimate_redshift_nodes(10, 5, True)
# dc2.large: 16 nodes, RA3: 1 node
AWS Services Observability Hooks
| Service | Metric | Alert Threshold |
|---|---|---|
| Kinesis | ReadProvisionedThroughputExceeded | > 0 for 5m |
| Kinesis | WriteProvisionedThroughputExceeded | > 0 for 5m |
| Kinesis | GetRecords.IteratorAgeMilliseconds | > 60,000 ms (consumer lag) |
| Glue | glue.driver.ExecutorMetrics.jvmHeap | > 80% for 10m |
| Glue | glue.job.running | job running > 12h |
| Redshift | CPUUtilization | > 90% for 15m |
| Redshift | NetworkReceiveThroughput | sudden drop = node failure |
| Redshift | HealthStatus | != 1 (healthy) |
| Athena | QueryFailure | any failure |
| S3 | BucketSizeBytes | unexpected growth > 20% |
| S3 | NumberOfObjects | unexpected growth > 10% |
AWS Services Security Checklist
- IAM roles follow least-privilege (no `*` actions on production resources)
- S3 buckets block public access; use bucket policies for cross-account access
- S3 data encrypted at rest (SSE-S3 or SSE-KMS); TLS in transit for all API calls
- Redshift cluster in private subnet via VPC; no public IPs
- Glue jobs run in VPC with PrivateSubnets (not default VPC)
- Athena query results in dedicated S3 bucket with encryption
- CloudTrail enabled on all AWS accounts for audit trail
- Kinesis streams encrypted at rest (SSE-KMS)
- Cost Explorer alerts set for > 20% spend increase
- Resource Access Manager (RAM) used for cross-account sharing with explicit trust boundaries
AWS Services Anti-Patterns
Over-sharding Kinesis. Creating 100 shards for a workload that genuinely needs 10 wastes money and increases operational complexity. Right-size shards based on measured throughput.
Running long Glue jobs that cannot be retried. A 4-hour Glue job that fails at hour 3 wastes significant compute. Break long jobs into smaller units with checkpointing.
Using Redshift for small data. If your dataset is under 100GB, Redshift is overkill and expensive. Use Athena on S3 or a managed OLTP database.
Leaving S3 versioning on for write-in-place patterns. Versioning + overwrite-in-place = version explosion and cost surprise. Use versioning only with append-or-replace patterns.
Not partitioning Athena tables. Unpartitioned large tables mean full table scans on every query. Partition by date or relevant business key.
Quick Recap
- Kinesis handles real-time streaming; size shards by throughput (1MB/s write per shard) and monitor `ProvisionedThroughputExceeded`
- Glue runs serverless Spark ETL; use crawlers for schema discovery and implement checkpointing for long jobs
- Redshift suits analytical workloads at scale; enable Multi-AZ for production and run `ANALYZE` after data loads
- Athena queries S3 directly with SQL; always partition data and use columnar formats (Parquet) for performance
- S3 is the foundation; use lifecycle rules to manage storage costs, enable versioning only with structured overwrite patterns
- Monitor per-service metrics in CloudWatch; set alarms on throughput exceeded, job failures, cluster health, and query failures
For more on distributed systems patterns that underpin these services, see our Distributed Tracing and Event-Driven Architecture guides.
Key Takeaways:
- Kinesis for real-time streaming ingestion, Kafka for complex event streaming needs
- Glue ETL jobs run serverless Spark; use crawlers to auto-discover schema
- Redshift for analytical workloads needing complex joins and aggregations
- Athena for ad-hoc queries on data already in S3
- S3 forms the foundation; structure your data lake with clear layer separation
Architecture Decision Guide:
- Streaming + SQL analysis: Kinesis -> Firehose -> S3 -> Athena/Redshift
- ETL + catalog: Glue with Data Catalog
- Heavy BI + complex queries: Redshift or Redshift Spectrum
- Ad-hoc exploration: Athena on S3 data lake
- Mixed workloads: Redshift + Spectrum for unified querying