AWS Data Services: Kinesis, Glue, Redshift, Athena, and S3 for Data Engineering
AWS has a sprawling set of data services: streaming ingestion, batch processing, data warehousing, and ad-hoc querying. Picking the right combination for your use case can be overwhelming.
This guide breaks down each service, explains when to use which, and shows how to connect them into a coherent data architecture.
Service Overview
| Service | Type | Use Case |
|---|---|---|
| Kinesis | Streaming | Real-time data ingestion |
| Glue | ETL | Data transformation and cataloging |
| Redshift | Data Warehouse | Analytical queries, BI |
| Athena | Query Service | Ad-hoc SQL on S3 |
| S3 | Object Storage | Data lake, raw storage |
Amazon Kinesis
Kinesis is AWS’s managed streaming platform. If you need real-time processing of data streams, this is where you start.
Kinesis Data Streams vs Kafka
Kinesis Data Streams is often compared to Apache Kafka. They solve similar problems but have different trade-offs:
| Feature | Kinesis | Kafka (MSK) |
|---|---|---|
| Pricing | Shard-hour + data | Instance-hour + storage |
| Scaling | Split/merge shards | Add partitions |
| Retention | Up to 365 days | Configurable, unlimited |
| Throughput | 1MB/s write, 2MB/s read per shard | Hardware-dependent per partition |
| Encryption | Server-side (KMS) | KMS at rest, TLS in transit |
Kinesis Data Streams
import boto3
import json

kinesis = boto3.client('kinesis')

def produce_to_kinesis(stream_name, data):
    """Produce a single record to Kinesis"""
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(data),
        PartitionKey=data.get('id', 'default')
    )

def produce_batch(stream_name, records):
    """Produce multiple records"""
    kinesis.put_records(
        StreamName=stream_name,
        Records=[
            {
                'Data': json.dumps(r['data']),
                'PartitionKey': r.get('partition_key', 'default')
            }
            for r in records
        ]
    )
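Note that `put_records` is not all-or-nothing: it reports per-record failures in its response instead of raising, so a production producer should re-send only the rejected subset. A minimal sketch (the helper names are my own, not an AWS API):

```python
def failed_subset(entries, response):
    """Return the request entries Kinesis rejected, matched positionally
    against the per-record results in the PutRecords response."""
    if response.get('FailedRecordCount', 0) == 0:
        return []
    return [entry for entry, result in zip(entries, response['Records'])
            if 'ErrorCode' in result]

def put_records_with_retry(kinesis_client, stream_name, entries, attempts=3):
    """Re-send only the failed entries, up to `attempts` rounds.
    kinesis_client is a boto3 Kinesis client."""
    for _ in range(attempts):
        if not entries:
            return
        response = kinesis_client.put_records(StreamName=stream_name,
                                              Records=entries)
        entries = failed_subset(entries, response)
    if entries:
        raise RuntimeError(
            f"{len(entries)} records still failing after {attempts} attempts")
```

In practice you would also add backoff between rounds; the fixed retry count here is an illustrative simplification.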
# Consumer with the KCL for Python (amazon_kclpy, run under the MultiLangDaemon)
from amazon_kclpy import kcl

class MyProcessor(kcl.RecordProcessorBase):
    def initialize(self, initialize_input):
        pass
    def process_records(self, process_records_input):
        for record in process_records_input.records:
            process_data(json.loads(record.binary_data))
    def lease_lost(self, lease_lost_input):
        pass
    def shard_ended(self, shard_ended_input):
        shard_ended_input.checkpointer.checkpoint()
    def shutdown_requested(self, shutdown_requested_input):
        shutdown_requested_input.checkpointer.checkpoint()
Kinesis Data Firehose
Firehose automatically loads streaming data to destinations:
import boto3

firehose = boto3.client('firehose')

# Create delivery stream
firehose.create_delivery_stream(
    DeliveryStreamName='click-events',
    DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={
        'BucketARN': 'arn:aws:s3:::data-lake-raw',
        'Prefix': 'click-events/year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/',
        'BufferingHints': {
            'SizeInMBs': 128,
            'IntervalInSeconds': 900
        },
        'CompressionFormat': 'GZIP',
        'EncryptionConfiguration': {
            'NoEncryptionConfig': 'NoEncryption'  # or supply KMSEncryptionConfig instead
        },
        'RoleARN': 'arn:aws:iam::123456789:role/firehose-role'
    }
)
When to Use Kinesis
Use Kinesis when you need real-time data ingestion (click streams, IoT, logs), want built-in consumers with KCL, or want AWS to handle shard management. It integrates naturally with other AWS services.
Consider Kafka when you need exactly-once semantics (Kinesis is at-least-once), schema registry integration, existing Kafka expertise on your team, or access to the broader Kafka ecosystem and tooling.
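Because Kinesis delivers at-least-once, consumers must tolerate duplicates. One lightweight approach is deduplicating on the record sequence number with a bounded LRU window. A sketch, not library code (the class name and window size are illustrative; size the window to your redelivery horizon):

```python
from collections import OrderedDict

class SequenceDeduper:
    """Drop records already seen, keyed by Kinesis sequence number.
    Memory is bounded by evicting the oldest entries past `window`."""

    def __init__(self, window=100_000):
        self.window = window
        self._seen = OrderedDict()

    def is_new(self, sequence_number):
        if sequence_number in self._seen:
            self._seen.move_to_end(sequence_number)  # refresh recency
            return False
        self._seen[sequence_number] = True
        if len(self._seen) > self.window:
            self._seen.popitem(last=False)  # evict oldest
        return True
```

Inside a consumer loop you would process a record only when `is_new(record.sequence_number)` returns True; for stronger guarantees, persist the dedup state (e.g., in DynamoDB) instead of keeping it in memory.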
AWS Glue
Glue provides serverless ETL with a managed Spark environment.
Glue Data Catalog
The Glue Data Catalog is a central metadata repository:
import boto3

glue = boto3.client('glue')

# Create database
glue.create_database(
    DatabaseInput={
        'Name': 'analytics',
        'Description': 'Analytics data warehouse'
    }
)

# Create table
glue.create_table(
    DatabaseName='analytics',
    TableInput={
        'Name': 'page_views',
        'StorageDescriptor': {
            'Location': 's3://data-lake/processed/page_views/',
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
            },
            'Columns': [
                {'Name': 'user_id', 'Type': 'string'},
                {'Name': 'page_url', 'Type': 'string'},
                {'Name': 'timestamp', 'Type': 'timestamp'},
                {'Name': 'duration_ms', 'Type': 'int'}
            ]
        },
        'PartitionKeys': [
            {'Name': 'year', 'Type': 'string'},
            {'Name': 'month', 'Type': 'string'},
            {'Name': 'day', 'Type': 'string'}
        ]
    }
)
Glue ETL Jobs
# Glue job script (PySpark)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="raw",
    table_name="events",
    transformation_ctx="datasource0"
)

# Transform
transformed = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("user_id", "string", "user_id", "string"),
        ("event_type", "string", "event_type", "string"),
        ("timestamp", "string", "timestamp", "timestamp"),
        ("properties", "map<string,string>", "properties", "map<string,string>")
    ],
    transformation_ctx="transformed"
)

# Write to S3 (the S3 connection option is "partitionKeys", not "partitionBy")
datasink = glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={
        "path": "s3://data-lake/processed/events/",
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet",
    transformation_ctx="datasink"
)

job.commit()
Glue Crawlers
Crawlers automatically discover schema:
glue.create_crawler(
    Name='events-crawler',
    Role='arn:aws:iam::123456789:role/glue-crawler-role',
    DatabaseName='raw',
    Targets={
        'S3Targets': [
            {
                'Path': 's3://data-lake/raw/events/',
                'Exclusions': ['**/*.tmp']
            }
        ]
    },
    Schedule='cron(0 2 * * ? *)',
    SchemaChangePolicy={
        'UpdateBehavior': 'LOG',
        'DeleteBehavior': 'LOG'
    }
)

# Run crawler
glue.start_crawler(Name='events-crawler')
Amazon Redshift
Redshift is a petabyte-scale data warehouse.
Redshift Architecture
import boto3

redshift = boto3.client('redshift')

# Create Redshift cluster
cluster = redshift.create_cluster(
    ClusterIdentifier='analytics-cluster',
    NodeType='dc2.large',
    NumberOfNodes=3,
    MasterUsername='admin',
    MasterUserPassword='SecurePassword123',  # in production, pull this from Secrets Manager
    DBName='analytics',
    ClusterSubnetGroupName='private-subnet',
    VpcSecurityGroupIds=['sg-12345678'],
    EnhancedVpcRouting=True,
    MaintenanceTrackName='current',
    Tags=[
        {'Key': 'Environment', 'Value': 'production'}
    ]
)
Redshift Serverless
For simpler workloads, Redshift Serverless manages capacity automatically:
# Create Redshift Serverless namespace
redshift_serverless = boto3.client('redshift-serverless')

namespace = redshift_serverless.create_namespace(
    namespaceName='analytics',
    adminUsername='admin',
    adminUserPassword='SecurePassword123',
    iamRoles=['arn:aws:iam::123456789:role/redshift-role'],
    tags=[
        {'key': 'Environment', 'value': 'production'}
    ]
)

# Create workgroup
workgroup = redshift_serverless.create_workgroup(
    workgroupName='analytics-wg',
    namespaceName='analytics',
    baseCapacity=32,  # RPUs
    enhancedVpcRouting=True
)
Loading Data to Redshift
# boto3's `redshift` client manages clusters, not SQL; run COPY/UNLOAD
# through the Redshift Data API instead
redshift_data = boto3.client('redshift-data')

# COPY from S3
redshift_data.execute_statement(
    ClusterIdentifier='analytics-cluster',
    Database='analytics',
    DbUser='admin',
    Sql="""
        COPY page_views
        FROM 's3://data-lake/processed/page_views/'
        IAM_ROLE 'arn:aws:iam::123456789:role/redshift-role'
        FORMAT AS PARQUET;
    """
)

# UNLOAD to S3
redshift_data.execute_statement(
    ClusterIdentifier='analytics-cluster',
    Database='analytics',
    DbUser='admin',
    Sql="""
        UNLOAD ('SELECT * FROM page_views WHERE year = ''2026''')
        TO 's3://data-lake-archive/page_views/'
        IAM_ROLE 'arn:aws:iam::123456789:role/redshift-role'
        FORMAT AS PARQUET
        MANIFEST;
    """
)
Amazon Athena
Athena lets you query data directly in S3 using SQL.
Querying S3 Data
import time
import boto3

athena = boto3.client('athena')

def query_athena(sql, database='analytics'):
    """Execute a query and return the raw result set"""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={
            'OutputLocation': 's3://athena-results/'
        }
    )
    query_execution_id = response['QueryExecutionId']

    # Poll for completion (sleep between polls instead of spinning)
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_execution_id)
        state = status['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(1)

    if state == 'SUCCEEDED':
        # get_query_results is paginated; follow NextToken for large result sets
        return athena.get_query_results(QueryExecutionId=query_execution_id)
    raise Exception(
        f"Query failed: {status['QueryExecution']['Status']['StateChangeReason']}"
    )
Athena with Glue Data Catalog
-- Query using Glue catalog
SELECT
date_trunc('hour', timestamp) as hour,
count(*) as page_views,
count(distinct user_id) as unique_users
FROM analytics.page_views
WHERE timestamp >= current_date - interval '7' day
GROUP BY date_trunc('hour', timestamp)
ORDER BY hour DESC
LIMIT 100
Performance Tuning
-- Partition pruning
SELECT * FROM page_views
WHERE year = '2026' AND month = '03' AND day = '27'
-- Use columnar formats (Parquet)
CREATE TABLE page_views_parquet WITH (
format = 'PARQUET',
external_location = 's3://data-lake/processed/page_views/'
) AS
SELECT * FROM page_views
-- Optimize joins
SELECT a.user_id, b.purchase_count
FROM users a
JOIN (
SELECT user_id, count(*) as purchase_count
FROM purchases
GROUP BY user_id
) b ON a.user_id = b.user_id
Amazon S3 for Data Lakes
S3 is the foundation of most AWS data architectures.
S3 Data Lake Structure
import boto3

s3 = boto3.client('s3')

# Create data lake bucket structure
buckets = [
    'data-lake-raw',
    'data-lake-processed',
    'data-lake-archive',
    'data-lake-analytics'
]
for bucket in buckets:
    # outside us-east-1, also pass CreateBucketConfiguration={'LocationConstraint': region}
    s3.create_bucket(Bucket=bucket)

# Configure lifecycle rules
s3.put_bucket_lifecycle_configuration(
    Bucket='data-lake-raw',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'Raw data lifecycle',
                'Status': 'Enabled',
                'Filter': {'Prefix': ''},
                'Transitions': [
                    {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                    {'Days': 90, 'StorageClass': 'GLACIER'},
                    {'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'}
                ]
            }
        ]
    }
)
S3 Access Patterns
| Access Pattern | Recommended Storage | Service |
|---|---|---|
| Raw ingestion | S3 Standard | Kinesis Firehose |
| Immediate processing | S3 Standard | Glue, EMR |
| Processed data | S3 Standard-IA | Athena, Redshift |
| Historical archive | S3 Glacier | Athena (via Glacier) |
| Analytics output | S3 Standard | BI tools |
Architecture Patterns
Lambda-Kinesis-Athena Pattern
flowchart TD
A[Application] -->|Kinesis| B[Kinesis Data Streams]
B --> C[Lambda Consumer]
C -->|Raw Data| D[S3 Raw Bucket]
D --> E[Glue Crawler]
E --> F[Glue Data Catalog]
F --> G[Athena Queries]
G --> H[Redshift]
Data Lake with Redshift Spectrum
# Create an external schema for Redshift Spectrum (via the Redshift Data API)
redshift_data = boto3.client('redshift-data')
redshift_data.execute_statement(
    ClusterIdentifier='analytics-cluster',
    Database='analytics',
    DbUser='admin',
    Sql="""
        CREATE EXTERNAL SCHEMA spectrum
        FROM DATA CATALOG
        DATABASE 'analytics'
        IAM_ROLE 'arn:aws:iam::123456789:role/redshift-role'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """
)
# Query S3 data directly from Redshift
query = """
SELECT
a.user_id,
a.total_purchases,
b.page_view_count
FROM spectrum.user_purchases a
JOIN spectrum.user_page_views b ON a.user_id = b.user_id
WHERE a.signup_date >= '2026-01-01'
"""
Cost Optimization
See our Cost Optimization guide for detailed strategies, but key AWS-specific tips:
- Use Kinesis Data Streams shard hours efficiently (consolidate low-traffic streams)
- Enable S3 Intelligent Tiering for unpredictable access patterns
- Use Glacier for data that is rarely accessed but must be retained
- Consider Redshift Serverless for variable workloads
- Use Athena workgroups to isolate query costs by team
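The last tip can be enforced in code: Athena workgroups accept a `BytesScannedCutoffPerQuery` limit, so a runaway query fails fast instead of scanning (and billing for) an entire table. A sketch; the helper and workgroup names are illustrative:

```python
def athena_workgroup_config(output_location, scan_limit_gb):
    """Build an Athena WorkGroupConfiguration that caps the data
    scanned by any single query in this workgroup."""
    return {
        'ResultConfiguration': {'OutputLocation': output_location},
        'BytesScannedCutoffPerQuery': scan_limit_gb * 1024 ** 3,  # GB -> bytes
        'EnforceWorkGroupConfiguration': True,   # clients cannot override
        'PublishCloudWatchMetricsEnabled': True  # per-workgroup cost metrics
    }

# athena = boto3.client('athena')
# athena.create_work_group(
#     Name='team-analytics',
#     Configuration=athena_workgroup_config('s3://athena-results/team/', 100)
# )
```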
AWS Services Production Failure Scenarios
Kinesis shard limit causes data loss during traffic spike
A marketing campaign doubles traffic overnight. A Kinesis stream with 10 shards hits its 10MB/s aggregate write limit, and producers start receiving ProvisionedThroughputExceededException. Records without retry logic are dropped, while the records that do get through pile up faster than consumers can drain them: consumer lag grows from 0 to 50 million records. By the time engineers split shards manually, 3 hours of data has either queued or been lost, and downstream dashboards show stale numbers.
Mitigation: Monitor the PutRecords.SuccessfulRecords and PutRecords.FailedRecords metrics. Split shards proactively before hitting limits. Set CloudWatch alarms on ReadProvisionedThroughputExceeded and WriteProvisionedThroughputExceeded. Consider enhanced fan-out for high-throughput consumers.
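Those throttling alarms can be created with CloudWatch's `put_metric_alarm`. The helper below builds the keyword arguments; the helper name, thresholds, and SNS topic ARN are assumptions to adapt:

```python
def throttle_alarm(stream_name, metric, sns_topic_arn):
    """Kwargs for cloudwatch.put_metric_alarm: fire when any throttling
    is recorded on the stream over five consecutive 60s periods."""
    return {
        'AlarmName': f'{stream_name}-{metric}',
        'Namespace': 'AWS/Kinesis',
        'MetricName': metric,
        'Dimensions': [{'Name': 'StreamName', 'Value': stream_name}],
        'Statistic': 'Sum',
        'Period': 60,
        'EvaluationPeriods': 5,
        'Threshold': 0,
        'ComparisonOperator': 'GreaterThanThreshold',
        'AlarmActions': [sns_topic_arn]
    }

# cloudwatch = boto3.client('cloudwatch')
# for metric in ('ReadProvisionedThroughputExceeded',
#                'WriteProvisionedThroughputExceeded'):
#     cloudwatch.put_metric_alarm(
#         **throttle_alarm('click-events', metric, data_alerts_topic_arn))
```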
Redshift cluster fails during peak BI query window
A 3-node dc2.large Redshift cluster loses one node to a hardware issue. The cluster enters recovery mode, remapping slices and rebuilding data. During the 15-minute recovery, BI queries queue up and users see timeouts. After recovery, table statistics are stale and query performance stays degraded for hours until ANALYZE runs.
Mitigation: Enable Multi-AZ deployment (RA3 node types) for production clusters. Run ANALYZE as part of your pipeline after data loads. Monitor HealthStatus and MaintenanceMode in CloudWatch. Use Redshift Serverless if the operational overhead of managing clusters is a concern.
Glue job timeout leaves data in inconsistent state
A Glue job running a 6-hour transformation hits the 12-hour timeout limit and terminates mid-write. The target S3 prefix now contains partial data — 70% of expected records with no indicator of incompleteness. Downstream Athena queries return incomplete results with no error. Engineers do not discover the issue until a business user notices Q3 numbers are wrong.
Mitigation: Implement checkpointing in Glue jobs. Write to a staging prefix first, then use an S3 event trigger to only promote data after a validation step confirms completeness. Add row count validation after every Glue write.
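The staging-then-promote pattern can be as simple as: write to a staging prefix, validate row counts, then copy objects to the published prefix that downstream tables point at. A sketch with hypothetical helper names (`validate_row_count`, `promote_staging`):

```python
def validate_row_count(expected, actual, tolerance=0.0):
    """True when the actual row count is within `tolerance`
    (a fraction) of the expected count."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / expected <= tolerance

def promote_staging(s3_client, bucket, staging_prefix, final_prefix):
    """Copy every object from the staging prefix to the final prefix.
    A job that dies mid-write leaves staging dirty but never corrupts
    the published prefix, because promotion only runs after validation."""
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=staging_prefix):
        for obj in page.get('Contents', []):
            dest = final_prefix + obj['Key'][len(staging_prefix):]
            s3_client.copy_object(
                Bucket=bucket,
                Key=dest,
                CopySource={'Bucket': bucket, 'Key': obj['Key']}
            )
```

The Glue job would call `validate_row_count(source_count, written_count)` after its write and only then call `promote_staging`; how you obtain the counts (Spark `count()`, S3 Select, etc.) is up to the pipeline.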
S3 object versions proliferate from versioning misconfiguration
S3 versioning was enabled on a data lake bucket. A daily Glue job overwrites files in place instead of writing to date-partitioned paths. Each overwrite creates a new version. Within 3 months, 2 million versions exist. S3 LIST operations slow to 30+ seconds. Storage costs are 15x expected. Engineers cannot identify which versions are active without scanning all 2 million.
Mitigation: Use S3 Intelligent Tiering or lifecycle rules to clean up noncurrent versions automatically. Prefer write-to-new-path patterns over overwrite-in-place when using versioning. Monitor NumberOfObjects and BucketSizeBytes per prefix.
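A lifecycle rule that expires noncurrent versions stops the version explosion automatically. The rule below is a sketch (the helper name and retention values are illustrative defaults); apply it via `put_bucket_lifecycle_configuration` as shown earlier:

```python
def noncurrent_cleanup_rule(days=30, keep_versions=3):
    """Lifecycle rule: keep the newest `keep_versions` noncurrent
    copies of each object, expire the rest after `days` days, and
    purge delete markers left with no versions behind them."""
    return {
        'ID': 'expire-noncurrent-versions',
        'Status': 'Enabled',
        'Filter': {'Prefix': ''},
        'NoncurrentVersionExpiration': {
            'NoncurrentDays': days,
            'NewerNoncurrentVersions': keep_versions
        },
        'Expiration': {'ExpiredObjectDeleteMarker': True}
    }

# s3 = boto3.client('s3')
# s3.put_bucket_lifecycle_configuration(
#     Bucket='data-lake-raw',
#     LifecycleConfiguration={'Rules': [noncurrent_cleanup_rule()]}
# )
```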
Athena query times out on large unpartitioned table
An analyst runs a SELECT * on a 500GB unpartitioned table expecting to filter results. Athena scans all 500GB, hits the 30-minute query timeout, and returns no results. The same query run against a properly partitioned table (by date) completes in 8 seconds.
Mitigation: Enforce partition usage via Lake Formation permissions or table design standards. Set per-query data-scanned limits (BytesScannedCutoffPerQuery) on Athena workgroups. Create CloudWatch alarms on query failures. Run MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION after adding new partitions.
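For tables with many partitions, registering each new partition explicitly is much cheaper than a full MSCK REPAIR TABLE scan. A hypothetical helper that builds the DDL, which you would then run through the Athena query API:

```python
def add_partition_sql(table, partition, location):
    """Build an Athena/Hive ALTER TABLE statement that registers one
    partition (e.g. {'year': '2026', 'month': '03'}) at an S3 location."""
    spec = ', '.join(f"{key} = '{value}'" for key, value in partition.items())
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({spec}) LOCATION '{location}'")

# athena = boto3.client('athena')
# athena.start_query_execution(
#     QueryString=add_partition_sql(
#         'page_views', {'year': '2026', 'month': '03', 'day': '27'},
#         's3://data-lake/processed/page_views/year=2026/month=03/day=27/'),
#     QueryExecutionContext={'Database': 'analytics'},
#     ResultConfiguration={'OutputLocation': 's3://athena-results/'}
# )
```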
AWS Services Capacity Estimation
Kinesis Shard Calculations
import math

def estimate_kinesis_shards(
    records_per_second: int,
    avg_record_size_kb: float,
    consumers: int = 1
) -> dict:
    """
    Estimate required Kinesis shards.
    Per-shard limits: 1MB/s or 1,000 records/s write; 2MB/s read,
    shared across all consumers (enhanced fan-out lifts the sharing).
    """
    write_throughput_mbps = (records_per_second * avg_record_size_kb) / 1024
    shards_for_write = max(
        math.ceil(write_throughput_mbps),      # 1MB/s per shard
        math.ceil(records_per_second / 1000)   # 1,000 records/s per shard
    )
    # Shared-throughput reads: all consumers split 2MB/s per shard
    shards_for_read = math.ceil(write_throughput_mbps * consumers / 2)
    total_shards = max(shards_for_write, shards_for_read)
    recommended = math.ceil(total_shards * 1.2)  # 20% headroom
    return {
        'records_per_second': records_per_second,
        'write_throughput_mbps': round(write_throughput_mbps, 2),
        'shards_for_write': shards_for_write,
        'shards_for_read': shards_for_read,
        'recommended_shards': recommended,
        # ~$0.015/shard-hour on-demand (us-east-1), ~730 hours/month
        'estimated_monthly_cost': round(recommended * 0.015 * 730, 2)
    }

# Example: 5,000 records/sec, 2KB each, 3 shared-throughput consumers
result = estimate_kinesis_shards(5000, 2, 3)
# Recommended shards: 18
# Estimated monthly cost: ~$197
Redshift Node Sizing
import math

def estimate_redshift_nodes(
    data_size_tb: float,
    query_concurrency: int = 5,
    compressed: bool = True
) -> dict:
    """
    Estimate Redshift node count from storage needs.
    dc2.large: ~0.16TB SSD per node; ra3 scales compute and managed
    storage independently (~48TB usable per node assumed here).
    query_concurrency drives WLM/concurrency-scaling setup, not node count.
    """
    compression_factor = 4 if compressed else 1  # typical columnar compression ratio
    effective_size_tb = data_size_tb / compression_factor  # compression shrinks on-disk size
    dc2_nodes = max(2, math.ceil(effective_size_tb / 0.16))
    ra3_nodes = max(1, math.ceil(effective_size_tb / 48))
    return {
        'raw_data_tb': data_size_tb,
        'effective_size_tb': effective_size_tb,
        'dc2_large_nodes': dc2_nodes,
        'ra3_nodes': ra3_nodes,
        'note': 'DC2 for small hot datasets; RA3 to scale storage and compute separately'
    }

# Example: 10TB raw data, columnar compression expected
result = estimate_redshift_nodes(10, 5, True)
# dc2.large: 16 nodes, RA3: 1 node
AWS Services Observability Hooks
| Service | Metric | Alert Threshold |
|---|---|---|
| Kinesis | ReadProvisionedThroughputExceeded | > 0 for 5m |
| Kinesis | WriteProvisionedThroughputExceeded | > 0 for 5m |
| Kinesis | GetRecords.IteratorAgeMilliseconds | > 60,000 ms (consumer lag) |
| Glue | glue.driver.ExecutorMetrics.jvmHeap | > 80% for 10m |
| Glue | glue.job.running | job running > 12h |
| Redshift | CPUUtilization | > 90% for 15m |
| Redshift | NetworkReceiveThroughput | sudden drop = node failure |
| Redshift | HealthStatus | != 1 (healthy) |
| Athena | QueryFailure | any failure |
| S3 | BucketSizeBytes | unexpected growth > 20% |
| S3 | NumberOfObjects | unexpected growth > 10% |
AWS Services Security Checklist
- IAM roles follow least-privilege (no `*` actions on production resources)
- S3 buckets block public access; use bucket policies for cross-account access
- S3 data encrypted at rest (SSE-S3 or SSE-KMS); TLS in transit for all API calls
- Redshift cluster in private subnet via VPC; no public IPs
- Glue jobs run in VPC with PrivateSubnets (not default VPC)
- Athena query results in dedicated S3 bucket with encryption
- CloudTrail enabled on all AWS accounts for audit trail
- Kinesis streams encrypted at rest (SSE-KMS)
- Cost Explorer alerts set for > 20% spend increase
- Resource Access Manager (RAM) used for cross-account sharing with explicit trust boundaries
AWS Services Anti-Patterns
Over-sharding Kinesis. Creating 100 shards for a workload that genuinely needs 10 wastes money and increases operational complexity. Right-size shards based on measured throughput.
Running long Glue jobs that cannot be retried. A 4-hour Glue job that fails at hour 3 wastes significant compute. Break long jobs into smaller units with checkpointing.
Using Redshift for small data. If your dataset is under 100GB, Redshift is overkill and expensive. Use Athena on S3 or a managed OLTP database.
Leaving S3 versioning on for write-in-place patterns. Versioning + overwrite-in-place = version explosion and cost surprise. Use versioning only with append-or-replace patterns.
Not partitioning Athena tables. Unpartitioned large tables mean full table scans on every query. Partition by date or relevant business key.
Quick Recap
- Kinesis handles real-time streaming; size shards by throughput (1MB/s write per shard) and monitor `ProvisionedThroughputExceeded`
- Glue runs serverless Spark ETL; use crawlers for schema discovery and implement checkpointing for long jobs
- Redshift suits analytical workloads at scale; enable Multi-AZ for production and run `ANALYZE` after data loads
- Athena queries S3 directly with SQL; always partition data and use columnar formats (Parquet) for performance
- S3 is the foundation; use lifecycle rules to manage storage costs, enable versioning only with structured overwrite patterns
- Monitor per-service metrics in CloudWatch; set alarms on throughput exceeded, job failures, cluster health, and query failures
For more on distributed systems patterns that underpin these services, see our Distributed Tracing and Event-Driven Architecture guides.
Key Takeaways:
- Kinesis for real-time streaming ingestion, Kafka for complex event streaming needs
- Glue ETL jobs run serverless Spark; use crawlers to auto-discover schema
- Redshift for analytical workloads needing complex joins and aggregations
- Athena for ad-hoc queries on data already in S3
- S3 forms the foundation; structure your data lake with clear layer separation
Architecture Decision Guide:
- Streaming + SQL analysis: Kinesis -> Firehose -> S3 -> Athena/Redshift
- ETL + catalog: Glue with Data Catalog
- Heavy BI + complex queries: Redshift or Redshift Spectrum
- Ad-hoc exploration: Athena on S3 data lake
- Mixed workloads: Redshift + Spectrum for unified querying