AWS SQS and SNS: Cloud Messaging Services

Learn AWS SQS for point-to-point queues and SNS for pub/sub notifications, including FIFO ordering, message filtering, and common use cases.

published: March 22, 2026 reading time: 36 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

AWS SQS and SNS are managed messaging services that handle asynchronous communication in cloud applications. SQS provides point-to-point queues with pull-based delivery, while SNS offers pub/sub with push-based fan-out to multiple subscribers. SQS Standard gives unlimited throughput with at-least-once delivery; FIFO guarantees exactly-once processing and ordering within message groups. Long polling reduces empty API responses and costs, and the SNS fan-out to SQS pattern combines pub/sub flexibility with queuing durability. You need SQS for work distribution where each message goes to one consumer; you need SNS when multiple independent systems must process the same events.

SQS and SNS are managed messaging services that eliminate the operational burden of running your own broker. SQS gives you point-to-point queues; SNS gives you pub/sub topics. AWS handles the infrastructure — automatic scaling, high availability, pay-per-use pricing. No capacity planning, no clusters to maintain. This makes them useful for production workloads where you need durability without dedicated ops effort. The Message Queue Types post covers the underlying patterns.

Introduction

AWS offers two managed messaging services that handle most asynchronous communication needs in cloud applications: SQS for point-to-point queues and SNS for pub/sub notifications. Both are fully managed — no servers to provision, no clusters to maintain — and scale automatically from a single message per day to millions per second.

Core Concepts

AWS SQS: Point-to-Point Queues

SQS gives you managed message queues without running your own broker. You create a queue, send messages, and consume them at your own pace.

The queue sits between your producer and consumer, decoupling them. The producer sends a message and moves on. The consumer picks it up when it is ready. If the consumer goes down, messages pile up in the queue safely. When it comes back, they get processed.

SQS uses a pull-based delivery model: consumers poll for messages instead of receiving them pushed. This lets each consumer control its own pace. Common uses include:

Background job queues (image processing, report generation)
Work distribution across multiple workers
Buffer layers between services with different throughput rates
Decoupling microservices in event-driven architectures

The durability model is straightforward: SQS replicates messages across multiple Availability Zones, so a single AZ failure does not lose data. Standard queues deliver at least once. FIFO queues guarantee exactly-once processing within ordered groups.

SQS: Simple Queue Service

Queue Types

SQS has two queue types: Standard and FIFO. Pick wrong and you lose ordering or choke your throughput.

Standard queues handle unlimited throughput with at-least-once delivery. Duplicates happen because SQS stores copies of each message on multiple servers across AZs. A consumer picks up a message, processes it, deletes it. But if the delete does not reach every copy, or the visibility timeout expires before the delete arrives, the message reappears. Your consumer needs idempotent processing; a dedup key in the payload usually does it.

Standard queues fit:

Image resizing or video transcoding where a duplicate wastes compute but does not corrupt state
Work distribution: multiple consumers pull from one queue, each picking up independent messages
Load smoothing: buffer requests during traffic spikes, drain at a controlled rate
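The idempotency requirement for Standard queues can be sketched roughly as follows. This is an illustration under stated assumptions, not a production design: the `dedup_key` field, the `IdempotentHandler` class, and the in-memory set are all hypothetical, and a real consumer would keep processed keys in a durable store shared by every worker.

```python
import json


class IdempotentHandler:
    """Skips messages whose dedup key has already been processed."""

    def __init__(self, process):
        self._process = process
        self._seen = set()  # assumption: in-memory only; use a durable store in production

    def handle(self, body):
        """Return True if the message was processed, False if it was a duplicate."""
        payload = json.loads(body)
        key = payload["dedup_key"]  # hypothetical field set by the producer
        if key in self._seen:
            return False
        self._process(payload)
        self._seen.add(key)  # record the key only after processing succeeds
        return True


results = []
handler = IdempotentHandler(results.append)
message = json.dumps({"dedup_key": "resize-42", "image": "cat.png"})
first = handler.handle(message)
second = handler.handle(message)  # simulated redelivery of the same message
```

The second call returns without invoking the processing function, so a redelivered message costs a lookup rather than a repeated side effect.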

FIFO queues guarantee exactly-once processing and preserve message order within a group. They enforce a 5-minute deduplication window: a message with the same dedup ID inside that window is accepted but not delivered again. The tradeoff: each message group processes sequentially, and the queue as a whole caps at 300 API calls per second per action (3,000 messages per second with batches of 10) unless you enable high-throughput mode.

FIFO queues fit:

Order processing: payment, inventory, and shipment for the same order must sequence
Audit log ingestion: events for the same entity must hit the database in order
Chat or notification: messages to the same user must appear in sequence
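The 5-minute deduplication window behaves roughly like the toy model below. This is a sketch of the idea only, not how SQS implements it; the `DedupWindow` class and the explicit `now` timestamps are assumptions made so the example runs without AWS.

```python
DEDUP_WINDOW_SECONDS = 5 * 60


class DedupWindow:
    """Toy model: remembers when each dedup ID was last accepted."""

    def __init__(self, window=DEDUP_WINDOW_SECONDS):
        self._window = window
        self._accepted_at = {}

    def accept(self, dedup_id, now):
        """Return True if the message would be delivered, False if deduplicated."""
        last = self._accepted_at.get(dedup_id)
        if last is not None and now - last < self._window:
            return False  # same ID seen inside the window: dropped
        self._accepted_at[dedup_id] = now
        return True


window = DedupWindow()
decisions = [
    window.accept("order-1001", now=0),    # first send
    window.accept("order-1001", now=120),  # retry two minutes later
    window.accept("order-1001", now=301),  # after the window has passed
]
```

A producer that retries a send inside the window is safe; a retry after the window is treated as a new message, so long retry loops still need application-level idempotency.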

Feature	Standard	FIFO
Ordering	Best-effort	Strict per message group
Delivery	At-least-once	Exactly-once
Throughput	Unlimited	300/s (3000/s with batching)
Message groups	Not supported	Groups for ordered processing
Duplicates	Possible	Eliminated (5-min window)

The trick to scaling FIFO is message groups. Each group ID acts as an independent ordering stream. Messages for order abc process in order; messages for order def process in parallel. Group by an entity ID — order ID, user ID, session ID — and you get ordering within the entity with horizontal scaling across entities.
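A small sketch of grouping by entity ID, assuming a hypothetical event shape with `order_id` and `step` fields. It only models how group IDs split traffic into independent ordered streams; the real ordering is enforced by SQS once you pass the ID as `MessageGroupId`.

```python
from collections import defaultdict


def group_id_for(event):
    """Derive the message group ID from the entity the event belongs to."""
    return f"order-{event['order_id']}"


def partition_by_group(events):
    """Split an event sequence into per-group streams, preserving arrival order."""
    groups = defaultdict(list)
    for event in events:
        groups[group_id_for(event)].append(event["step"])
    return dict(groups)


events = [
    {"order_id": "abc", "step": "payment"},
    {"order_id": "def", "step": "payment"},
    {"order_id": "abc", "step": "inventory"},
    {"order_id": "def", "step": "inventory"},
    {"order_id": "abc", "step": "shipment"},
]
streams = partition_by_group(events)
```

Each stream keeps its own order, and nothing ties the two streams together, which is what lets separate consumers work on `order-abc` and `order-def` at the same time.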

Message Lifecycle

A producer sends a message to the queue. A consumer polls for messages. After processing, the consumer deletes the message from the queue. SQS holds messages until deletion — failing to delete means the message reappears after the visibility timeout.

Working with SQS

import json

import boto3

sqs = boto3.client('sqs')

# Create a FIFO queue (the name must end in .fifo)
queue_url = sqs.create_queue(
    QueueName='tasks.fifo',
    Attributes={'FifoQueue': 'true', 'ContentBasedDeduplication': 'true'}
)['QueueUrl']

# Send a message; the payload value is a placeholder
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps({'task': 'process', 'data': 'example-value'}),
    MessageGroupId='task-group'
)

# Receive messages with long polling
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20
)

# 'Messages' is absent from the response when the poll returns nothing
for msg in response.get('Messages', []):
    process(json.loads(msg['Body']))  # process() is your own handler
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])

Key SQS Features

Visibility timeout is how long a message stays invisible after your consumer picks it up. If your consumer crashes mid-processing, the message reappears once the timeout expires. The catch: your consumers need to handle duplicates.
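The visibility timeout mechanics can be pictured with a toy in-memory queue. The `ToyQueue` class and its explicit `now` arguments are assumptions for illustration; it models the hide-and-reappear behavior only and leaves out everything else SQS does.

```python
class ToyQueue:
    """Toy model of visibility timeout: received messages are hidden, not removed."""

    def __init__(self, visibility_timeout):
        self._timeout = visibility_timeout
        self._visible_at = {}  # message id -> time at which it becomes visible

    def send(self, message_id, now):
        self._visible_at[message_id] = now

    def receive(self, now):
        """Return one visible message and hide it for the timeout, or None."""
        for message_id, visible_at in self._visible_at.items():
            if visible_at <= now:
                self._visible_at[message_id] = now + self._timeout
                return message_id
        return None

    def delete(self, message_id):
        self._visible_at.pop(message_id, None)


queue = ToyQueue(visibility_timeout=30)
queue.send("m1", now=0)
first = queue.receive(now=0)         # a consumer picks up m1
hidden = queue.receive(now=10)       # m1 is in flight, nothing to receive
redelivered = queue.receive(now=30)  # the consumer never deleted it
queue.delete("m1")
after_delete = queue.receive(now=100)
```

The redelivery at `now=30` is the duplicate the paragraph above warns about: the message was received twice because nothing deleted it in time.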

Dead letter queues catch messages that fail repeatedly. Set a redrive policy and messages go to your DLQ after N failed attempts. You can inspect what went wrong without blocking the queue.
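A redrive policy is a JSON string set as a queue attribute. The helper below is a minimal sketch that builds it; the function name, the example ARN, and the default of 5 attempts are assumptions, while `deadLetterTargetArn` and `maxReceiveCount` are the keys SQS expects.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the Attributes dict that points a source queue at its DLQ."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical DLQ ARN for illustration
attributes = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:tasks-dlq", 3)

# With a boto3 client this would be applied as:
# sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=attributes)
```

For a FIFO source queue the DLQ must also be a FIFO queue.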

Message retention defaults to 4 days and can be raised to 14 days. With a longer retention period, your consumers can be down for a weekend and messages survive. This buffers for downtime without losing work.

Long polling cuts down on empty responses. SQS waits up to 20 seconds for messages to arrive before replying. Fewer API calls, lower costs, less waiting.

SNS is a managed pub/sub service. You create a topic, subscribe endpoints to it, then publish messages. SNS fans each message out to every subscriber on the topic.

This is push-based delivery: SNS sends the message to subscribers as soon as it arrives. Subscribers do not need to poll. Supported subscriber types include:

SQS queues for durable, queue-based processing after fan-out
Lambda functions for serverless event processing
HTTP/HTTPS endpoints for webhook-style delivery
Email (JSON or plain text) for notifications
SMS for text message alerts
Mobile push for app notifications (APNs, FCM)

The fan-out model means one event can trigger multiple downstream actions. A single order event published to an SNS topic could simultaneously update analytics, send a notification to the user, and trigger an audit log entry. Each subscriber handles the message independently, so one slow subscriber does not affect the others.

How Topics Work

You create an SNS topic. Subscribe endpoints to it (email, SMS, HTTP, Lambda, SQS, mobile push). Publish a message, SNS fans it out to all subscribers. That’s the whole model.

Message Filtering

Subscribers can use filter policies to receive only messages they care about:

sns.subscribe(
    TopicArn=topic_arn,
    Protocol='sqs',
    Endpoint=queue_arn,  # ARN of the subscribing SQS queue
    # By default the policy is matched against message attributes
    Attributes={'FilterPolicy': json.dumps({
        'event': ['order.placed', 'order.cancelled'],
        'region': ['us-west', 'us-east']
    })}
)

Messages not matching the filter policy are not delivered to that subscriber.
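To make the matching rule concrete, here is a deliberately simplified matcher. It covers exact string matching only, which is an assumption of this sketch; the real SNS filter grammar also supports prefix, numeric, and anything-but operators that are not modeled here.

```python
def matches(filter_policy, message_attributes):
    """Every policy key must be present, and its value must be one of the allowed values."""
    for name, allowed_values in filter_policy.items():
        if message_attributes.get(name) not in allowed_values:
            return False
    return True


policy = {
    "event": ["order.placed", "order.cancelled"],
    "region": ["us-west", "us-east"],
}

delivered = matches(policy, {"event": "order.placed", "region": "us-east"})
wrong_event = matches(policy, {"event": "order.shipped", "region": "us-east"})
missing_key = matches(policy, {"event": "order.placed"})
```

Keys combine with AND and the values inside a key combine with OR, so a message that lacks the `region` attribute is not delivered even though its `event` matches.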

SNS supports message batching to lower costs and handle more throughput. The PublishBatch API lets you send up to 10 messages at once.

# Send batch of messages (up to 10 per batch)
entries = [
    {'Id': '1', 'Message': json.dumps({'event': 'order.placed', 'order_id': '1001'})},
    {'Id': '2', 'Message': json.dumps({'event': 'order.placed', 'order_id': '1002'})},
    {'Id': '3', 'Message': json.dumps({'event': 'order.placed', 'order_id': '1003'})},
]
sns.publish_batch(TopicArn=topic_arn, PublishBatchRequestEntries=entries)

Batching reduces costs at scale: 100 messages means 10 API calls instead of 100.
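Splitting a larger list into batches of 10 can be sketched as below. The `batch_entries` helper and the sample order payloads are assumptions for illustration; the entry shape (`Id`, `Message`) follows the PublishBatch request format used above.

```python
import json


def batch_entries(messages, batch_size=10):
    """Split messages into PublishBatch-sized lists of entries."""
    if not 1 <= batch_size <= 10:
        raise ValueError("PublishBatch accepts between 1 and 10 entries")
    batches = []
    for start in range(0, len(messages), batch_size):
        chunk = messages[start:start + batch_size]
        batches.append([
            # Ids only need to be unique within a single batch
            {"Id": str(start + offset), "Message": json.dumps(message)}
            for offset, message in enumerate(chunk)
        ])
    return batches


messages = [{"event": "order.placed", "order_id": str(1000 + i)} for i in range(25)]
batches = batch_entries(messages)

# Each batch would then be sent with a boto3 client:
# sns.publish_batch(TopicArn=topic_arn, PublishBatchRequestEntries=batch)
```

A batch call can partially fail, so a real publisher should check the `Failed` list in each response and resend those entries.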

SNS supports FIFO (First-In-First-Out) topics that provide strict ordering and exactly-once delivery. FIFO topics are designed for scenarios where message order matters, such as financial transactions or inventory updates.

# Create FIFO topic
fifo_topic_arn = sns.create_topic(
    Name='order-events.fifo',
    Attributes={'FifoTopic': 'true', 'ContentBasedDeduplication': 'true'}
)['TopicArn']

# Publish with message group ID for ordering
sns.publish(
    TopicArn=fifo_topic_arn,
    Message=json.dumps({'event': 'order.placed', 'order_id': '12345'}),
    MessageGroupId='order-processing'  # Ensures ordering within group
)

Feature	SNS Standard	SNS FIFO
Ordering	No guarantee	Per message group
Deduplication	None	5-minute window
Throughput	Unlimited	300 messages/sec per topic
Message group	N/A	Groups messages for ordering
Cost	Per message + delivery	Higher (per message)

SNS FIFO is a good fit when you need messages for the same entity (same order, same user) processed in order.

Capacity and Scaling

SQS and SNS both scale without provisioning: no clusters to size, no partitions to configure. But “automatic” does not mean unbounded. Each service has architectural limits that matter when you push serious volume.

SQS partitions internally. Every queue is spread across servers and Availability Zones. Standard queues have no throughput ceiling because of this distributed design: more partitions handle more traffic. The practical limit to watch is in-flight messages: 120,000 for standard queues and 20,000 for FIFO. Each consumed-but-not-deleted message occupies an in-flight slot. Exceed the limit and ReceiveMessage returns fewer messages until the count drops. This usually means consumers need to keep up or scale out.

FIFO throughput depends on message groups. To guarantee ordering, FIFO queues process one message at a time per group. One group means sequential processing regardless of how many consumers you attach. Throughput caps at 300 messages/second without batching and 3,000 messages/second with batching. Adding more message groups is how you scale FIFO. Each group becomes an independent ordering stream that consumers can process in parallel.

SNS separates ingest from delivery. Publishing to a topic is near-instant and scales to any rate. Delivery is the bottleneck: each subscriber protocol has its own concurrency model. Lambda subscribers hit the account-level concurrency limit (1,000 by default). HTTP endpoints have a 30,000 delivery/second per-endpoint cap. SQS subscribers inherit SQS throughput. A slow or failing subscriber does not block deliveries to other subscribers on the same topic. Each delivery path is independent.

Monitor the right signals. ApproximateNumberOfMessagesNotVisible shows in-flight SQS messages. NumberOfNotificationsFailed shows SNS delivery failures. ApproximateAgeOfOldestMessage is the early warning that consumers are falling behind, whether on SQS or after an SNS fan-out.

The sections below cover volume estimation, backpressure handling, and scaling patterns in more detail.

Message Volume Estimation

For SQS, figure out your peak messages per second and pick Standard (unlimited throughput) or FIFO (3000 messages/sec with batching). FIFO message groups let you parallelize while keeping order within each group.

For SNS, throughput is effectively unlimited, but delivery costs scale with subscriber count. If you have 100K subscribers and publish 1M messages/day, delivery costs dominate your bill.
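The arithmetic behind that claim is simple enough to sketch. The function below is hypothetical, and the rates are passed in as parameters; the example values are the per-million figures quoted in this article's SNS pricing table, which you should replace with the current rates for your region and protocol.

```python
def daily_sns_cost(messages_per_day, subscribers, publish_per_million, delivery_per_million):
    """Estimate one day of SNS cost from publish volume and subscriber count."""
    deliveries = messages_per_day * subscribers  # every subscriber gets every message
    return {
        "deliveries": deliveries,
        "publish_cost": messages_per_day / 1_000_000 * publish_per_million,
        "delivery_cost": deliveries / 1_000_000 * delivery_per_million,
    }


estimate = daily_sns_cost(
    messages_per_day=1_000_000,
    subscribers=100_000,
    publish_per_million=0.50,   # assumed rate, taken from the article's table
    delivery_per_million=0.50,  # assumed rate, taken from the article's table
)
```

Publishing costs scale with message count, while delivery costs scale with message count times subscriber count, which is why filter policies matter so much at this scale.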

Backpressure Handling

SQS consumers control their own polling rate. Long polling with WaitTimeSeconds=20 naturally throttles when the queue is empty. Set MaxNumberOfMessages based on your processing capacity.

For burst traffic, SQS buffers automatically. But if your consumers fall behind, messages pile up and ApproximateAgeOfOldestMessage climbs. Watch this metric and scale consumers out.

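# Adaptive polling based on queue depth
attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=['ApproximateNumberOfMessages']
)['Attributes']
queue_depth = int(attrs['ApproximateNumberOfMessages'])

response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10 if queue_depth > 100 else 1,
    WaitTimeSeconds=5 if queue_depth > 100 else 20
)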

SNS limits deliveries per subscriber to 30K/second by default. For Lambda, that maps to concurrent invocations. If you need higher rates, file a service quota increase.

Cross-region SNS subscriptions add data transfer charges. Keep publishers and subscribers in the same region when you can.

Scaling Patterns

Horizontal consumer scaling: Each SQS queue supports multiple consumers across different machines. The visibility timeout lets failed processing recover by making the message available to another consumer, which is also why consumers must tolerate duplicates.

SNS fan-out scaling: Add more SQS queues rather than more subscribers to one queue. This avoids head-of-line blocking where one slow consumer throttles the whole queue.

FIFO scaling: FIFO queues with message group IDs let you order messages within groups while processing groups in parallel. Keep message groups around independent entities (order IDs, user IDs).

A common pattern is SNS fan-out to multiple SQS queues. One event, multiple consumers, each with its own queue.

graph LR
    Publisher -->|publish| SNS[SNS Topic]
    SNS -->|deliver| Q1[SQS Queue: Analytics]
    SNS -->|deliver| Q2[SQS Queue: Notifications]
    SNS -->|deliver| Q3[SQS Queue: Audit]

This combines SNS’s pub/sub with SQS’s queuing. Each consumer gets its own queue, so retry logic and parallel processing work independently.

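# SNS publishes to multiple SQS queues (configured via topic subscription)
sns.publish(TopicArn=topic_arn, Message=json.dumps(event))

# Each consumer group has its own queue
# Analytics queue consumer
response = sqs.receive_message(QueueUrl=analytics_queue_url, WaitTimeSeconds=20)
for msg in response.get('Messages', []):
    run_analytics(msg)
    sqs.delete_message(QueueUrl=analytics_queue_url, ReceiptHandle=msg['ReceiptHandle'])

# Notifications queue consumer
response = sqs.receive_message(QueueUrl=notifications_queue_url, WaitTimeSeconds=20)
for msg in response.get('Messages', []):
    send_notification(msg)
    sqs.delete_message(QueueUrl=notifications_queue_url, ReceiptHandle=msg['ReceiptHandle'])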

This gives you SNS topic-based routing, SQS per-consumer queuing and retry, and independent scaling per consumer group.

Fan-Out to SQS

The fan-out pattern uses SNS topics to publish a message once, then routes copies to multiple SQS queues. Each consumer processes from its own queue independently.

Setting up SNS-to-SQS fan-out needs an access policy (a resource-based policy) on each destination queue that allows the SNS topic to send messages. Without this, SNS cannot deliver to the queue and messages never arrive. Scope the policy with aws:SourceArn to lock it to the specific topic:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "sns.amazonaws.com" },
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:us-east-1:123456789:analytics-queue",
      "Condition": {
        "ArnEquals": {
          "aws:SourceArn": "arn:aws:sns:us-east-1:123456789:order-events"
        }
      }
    }
  ]
}
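When a topic fans out to several queues, generating that policy per queue avoids copy-paste drift. A minimal sketch follows; the helper name and the example ARNs are assumptions, and the statement mirrors the JSON shown above.

```python
import json


def sns_to_sqs_policy(queue_arn, topic_arn):
    """Return a queue access policy that lets one SNS topic send to one queue."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    })


# Hypothetical ARNs for illustration
topic_arn = "arn:aws:sns:us-east-1:123456789012:order-events"
queue_arns = [
    f"arn:aws:sqs:us-east-1:123456789012:{name}-queue"
    for name in ("analytics", "notifications", "audit")
]
policies = {arn: sns_to_sqs_policy(arn, topic_arn) for arn in queue_arns}

# Each policy would be applied with a boto3 client:
# sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"Policy": policies[queue_arn]})
```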

Filter policies make fan-out efficient. Without filters every queue receives every message from the topic. With filter policies on each subscription you route only relevant events to each queue. The analytics queue gets order.placed and order.cancelled. The notifications queue gets order.shipped and order.delivered. Each queue stays focused and SNS does not charge for filtered-out messages.

Ordered fan-out combines FIFO topics with FIFO queues. A FIFO topic fans out to multiple FIFO queues preserving message order based on message group IDs across all subscribers. This guarantees downstream systems see events in the same sequence per entity — useful when consumers need to reconcile state across services.

Cross-account fan-out needs a resource policy on the SNS topic granting the subscriber account subscription access, plus the access policy on the subscriber’s queue. Data transfer charges apply between regions rather than between accounts; keep publishers and subscribers in the same region to minimize costs.

Monitor per-queue NumberOfMessagesSent (messages added to the queue) across the fan-out to catch imbalances. A sudden drop in one queue while others stay stable usually means a filter policy was changed or a subscription was deleted.

Cost Optimization

SQS and SNS costs scale with API calls and message delivery. Optimizing both brings down your bill.

SQS Cost Factors

SQS charges per API request:

Request Type	Standard Queue	FIFO Queue
Send, Receive, Delete	$0.40 per million	$0.50 per million
Other operations	$0.40 per million	$0.50 per million

Reducing SQS Costs

Long polling is the main way to reduce SQS costs. Short polling (the default) returns immediately and bills per request regardless of whether a message arrives. Long polling holds the request open for up to 20 seconds, so one billable request replaces the many empty responses a tight short-polling loop would generate.

# Enable long polling to reduce costs
sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20  # Long polling - wait up to 20s
)

For high-throughput queues, batching with ReceiveMessage (up to 10 messages per call) cuts down billable requests.
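A back-of-the-envelope comparison shows the scale of the saving on an idle queue. The sketch assumes a single consumer and a short-polling loop that issues one request per second; both are assumptions chosen for illustration, not measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def empty_receives_per_day(seconds_between_requests):
    """Requests one consumer issues against an idle queue in a day."""
    return SECONDS_PER_DAY // seconds_between_requests


short_polling = empty_receives_per_day(1)   # assumed loop: one request per second
long_polling = empty_receives_per_day(20)   # WaitTimeSeconds=20, each request waits it out
reduction_factor = short_polling // long_polling
```

The reduction applies per consumer, so a fleet of idle workers multiplies the difference.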

SNS charges per message published plus per message delivered:

Operation	Cost per Million
Publish	$0.50
Subscribe/Confirm	$0.40
Delivery to SQS/HTTP/Lambda	$0.50
Delivery to Mobile Push	$1.50 - $6.00

Use message batching with PublishBatch to cut publish costs. For delivery, use filter policies so you do not send messages to subscribers who will just discard them.

# Batch publish to reduce costs
entries = [
    {'Id': str(i), 'Message': json.dumps({'event': f'event-{i}'})}
    for i in range(10)
]
sns.publish_batch(TopicArn=topic_arn, PublishBatchRequestEntries=entries)
# 10 messages for the price of 1 publish call + 10 deliveries

Cross-region SNS subscriptions add data transfer costs. Keep subscribers in the same region when you can.

Aspect	SQS	SNS
Pattern	Point-to-point queue	Pub/sub
Delivery	Pull (consumers poll)	Push (subscribers receive)
Multiple consumers	Single consumer per message	All subscribers receive
Ordering	FIFO queues available	FIFO topics available
Throughput	Unlimited (standard), 3000/s (FIFO)	Unlimited (standard), 300/s (FIFO)
Cost	Per API call	Per message published + per delivery

Pick SQS when you need work distribution across consumers, each processing a message once. It handles burst traffic well, with visibility timeout and redrive built in.

Pick SNS when multiple consumers need the same message, when you want push-based delivery, or when broadcasting events to many subscribers. Simpler than running your own pub/sub.

When Not to Use SQS and SNS

When Not to Use SQS

When you need push-based delivery: SQS is pull-based; consumers must poll
When you need message ordering across queues: Standard queues do not guarantee ordering
When you need exactly-once delivery without deduplication logic: Standard queues deliver at-least-once
When you need multiple consumers on the same stream: Each message is processed by one consumer only

When Not to Use SNS

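When you need message persistence: SNS does not store messages for later retrieval; subscribers must be available, or a queue must sit behind the subscription
When you need strict ordering: Standard topics do not guarantee ordering; only FIFO topics order messages, and only within a message group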
When you need exactly-once without client deduplication: SNS delivers at-least-once
When you have many small subscribers: Each subscription incurs delivery costs

SNS and EventBridge both deliver one event to many targets, which is why the two services look similar at first glance. The architectural difference is what matters: SNS is a pub/sub topic with filter policies. EventBridge is an event bus with rule-based routing, a built-in schema registry, and a curated catalog of SaaS integrations. When you publish to an EventBridge event bus, the service evaluates your event against routing rules and delivers it to the targets of every matching rule. When you publish to SNS, the service fans the message out to all subscribers whose filter policy matches. The extra layer EventBridge adds is useful when you need to validate message structure, route events across account boundaries, or ingest from third-party SaaS tools without writing custom integration code.

Aspect	SNS	EventBridge
Architecture	Pub/sub topic	Event bus with routing rules
Schema registry	None	Built-in schema registry
SaaS integrations	None	200+ SaaS sources
Event routing	Topic-based	Rule-based with filtering
Archive and replay	No	Yes (configurable retention)
Cost	Per message + delivery	Per event + processing
Dead letter handling	DLQ per subscription	DLQ per target

EventBridge shines when you need SaaS integrations (Salesforce, Zendesk, third-party webhooks) or schema validation. SNS is simpler and cheaper for plain pub/sub fan-out within AWS.

# EventBridge rule-based routing example
import boto3

events = boto3.client('events')

# Create a rule that matches EC2 instance state changes
events.put_rule(
    Name='ec2-state-changes',
    EventPattern='{"source": ["aws.ec2"], "detail-type": ["EC2 Instance State-change Notification"]}',
    State='ENABLED'
)

# Add targets; Lambda and SQS targets are authorized through
# resource-based policies on the function and the queue
events.put_targets(
    Rule='ec2-state-changes',
    Targets=[
        {'Id': '1', 'Arn': 'lambda-arn'},
        {'Id': '2', 'Arn': 'sqs-arn'}
    ]
)

Within your application stack, SNS handles most messaging well. EventBridge costs more but adds value when you need event routing, schema management, or SaaS ingestion.

Comparison to Self-Managed Solutions

Managed services like SQS and SNS remove operational burden. You do not provision servers, manage replication, or tune performance. AWS handles availability and durability. The tradeoff is vendor lock-in — your code depends on AWS APIs, and migrating to another platform means rewriting the messaging layer.

Kafka targets high-throughput streaming. It persists messages to disk in ordered segments, keeping them for as long as you set. SQS and SNS are ephemeral by comparison: an SQS message disappears once it is consumed and deleted, or when the retention period (up to 14 days) expires, and SNS does not store messages at all beyond its delivery retries. Kafka separates ingest from processing in a way the AWS services do not: you control partition counts, replication factors, and consumer group offsets. The ops burden is larger. You manage brokers, ZooKeeper or KRaft for controller consensus, and partition rebalancing under load.

RabbitMQ speaks AMQP, STOMP, and MQTT — protocols SQS and SNS do not expose. If your stack already runs RabbitMQ internally, moving to SNS means rewriting the protocol layer. RabbitMQ also supports routing topologies SNS cannot match: exchanges with bindings, alternate exchanges, dead letter exchanges. These give you routing flexibility that goes well beyond SNS filter policies. The clustering model differs too. RabbitMQ builds on the Erlang VM’s distributed model. SQS and SNS are fully managed with no cluster concept.

The real migration question is not just porting API calls. It is the assumptions your code makes. SQS visibility timeout has no direct Kafka equivalent. SNS fan-out to SQS queues maps to multiple Kafka consumer groups reading the same topic, each tracking its own offsets, but the retry semantics differ. FIFO message groups map to partitions keyed by entity ID, but a Kafka topic's partition count can only be increased after creation, and increasing it changes which partition a key lands on, so repartitioning is expensive. Evaluate these gaps before committing to a platform.

For understanding messaging patterns that apply regardless of platform, see message queue types and pub/sub patterns.

AWS PrivateLink/VPC Endpoint Configuration

PrivateLink keeps SQS and SNS traffic inside the AWS network, avoiding the public internet and giving you private connectivity from within a VPC.

SQS VPC Endpoints

# Create VPC endpoint for SQS
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-012345678 \
    --service-name com.amazonaws.us-east-1.sqs \
    --vpc-endpoint-type Interface \
    --subnet-ids subnet-012345678 subnet-876543210 \
    --security-group-ids sg-012345678

VPC endpoints use ENIs in your subnets. Set up the security group to allow traffic on port 443 from your application servers.

# Create VPC endpoint for SNS
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-012345678 \
    --service-name com.amazonaws.us-east-1.sns \
    --vpc-endpoint-type Interface \
    --subnet-ids subnet-012345678 subnet-876543210 \
    --security-group-ids sg-012345678

IAM Policies for VPC Endpoints

To tie queue access to the VPC endpoint, add a queue policy with an aws:sourceVpce condition:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage"],
      "Resource": "arn:aws:sqs:us-east-1:123456789:my-queue",
      "Condition": {
        "StringEquals": {
          "aws:sourceVpce": "vpce-012345678"
        }
      }
    }
  ]
}

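This policy grants receive and delete access only to requests that arrive through your VPC endpoint. An Allow statement does not block other access paths on its own; to enforce the restriction, pair it with an explicit Deny that uses StringNotEquals on aws:sourceVpce.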

Trade-off Analysis

Scenario	SQS Standard	SQS FIFO	SNS Standard	SNS FIFO
Message ordering	Best-effort	Per message group	None	Per message group
Deduplication	At-least-once	Exactly-once (5-min window)	At-least-once	Exactly-once (5-min window)
Throughput	Unlimited	3000/sec with batching	Unlimited	300/sec
Delivery model	Pull (consumer polls)	Pull (consumer polls)	Push (fan-out)	Push (fan-out)
Multiple consumers per message	Single	Single	All subscribers	All subscribers
Cost efficiency at scale	Low (API calls)	Higher (per message)	Higher (per delivery)	Highest (per message + delivery)
Complexity	Low	Medium	Low	Medium
Best for	Background jobs, task queues	Order-critical processing	Event broadcasting, notifications	Order-critical fan-out

Choosing Between SQS and SNS

Prefer SQS when:

Work items must be processed exactly once and in order within an entity
Consumer needs exclusive access to messages
You need visibility timeout and automatic retry per message
Buffering for burst traffic is important

Prefer SNS when:

Multiple independent systems need the same event
Push-based notification delivery is required
Broadcasting to many subscribers at once
Decoupling producer from consumer processing time

Prefer SNS + SQS (fan-out) when:

You want broadcast semantics with queue-based processing
Different consumer groups need independent retry and throttle handling
You want SNS simplicity with SQS durability guarantees

Production Failure Scenarios

Failure	Impact	Mitigation
SQS broker failure	Queue temporarily unavailable; messages not sent or received	SQS manages replication; multi-AZ deployment is automatic
SNS broker failure	Messages not delivered to subscribers	Rely on SNS delivery retries; attach a DLQ to each subscription
Consumer crash mid-processing	Message becomes visible again after visibility timeout	Use visibility timeout appropriately; implement idempotent processing
SNS subscription deleted	Messages silently dropped for that subscriber	Use CloudWatch to monitor subscription status
SQS queue deletion	All messages permanently lost	Restrict sqs:DeleteQueue with IAM; persist critical messages elsewhere
Throughput limit exceeded	Messages rejected or throttled	Request service quota increase; use exponential backoff
FIFO ordering violation	Messages processed out of order	Use message group IDs correctly; single consumer per group
SNS filter policy misconfiguration	Subscribers receive no messages or wrong messages	Test filter policies; monitor filtered-out message counts
Lambda throttling	SNS retries with exponential backoff	Request more concurrent executions
Lambda timeout	Treated as invocation failure	Set appropriate timeout + DLQ
Lambda crash	Message not processed	DLQ for failed messages
Invalid payload to Lambda	Lambda throws on unmarshal	Validate the payload in the handler; route failures to a DLQ
Lambda permission denied	SNS retries with exponential backoff up to max retries, then DLQ	Verify IAM role has lambda:InvokeFunction permission

When SNS delivers to Lambda, attach a DLQ to the subscription and handle processing failures on the Lambda side. Custom delivery (retry) policies apply only to HTTP/S subscriptions; for Lambda, SNS uses its own fixed retry schedule:

# SNS-to-Lambda DLQ configuration
sns.subscribe(
    TopicArn=topic_arn,
    Protocol='lambda',
    Endpoint=lambda_arn,
    Attributes={
        # Messages SNS cannot deliver land in this SQS queue
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': dlq_queue_arn
        })
    }
)

# In Lambda, record failures before re-raising
def handler(event, context):
    try:
        process_event(event)
    except Exception as e:
        # Publish failure details to a separate SNS topic for inspection
        sns.publish(
            TopicArn=dlq_arn,
            Message=json.dumps({'original': event, 'error': str(e)})
        )
        raise  # Re-raise so Lambda's async retries and failure destination apply

Configure Lambda async invocation settings to align with SNS retry behavior:

# Configure Lambda async settings via boto3
lambda_client.put_function_event_invoke_config(
    FunctionName='my-function',
    MaximumRetryAttempts=2,
    MaximumEventAgeInSeconds=3600,
    DestinationConfig={
        'OnFailure': {
            'Destination': 'arn:aws:sqs:us-east-1:123456789:my-dlq'
        }
    }
)

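SNS invokes Lambda asynchronously, so two retry layers stack: SNS retries when it cannot hand the event to Lambda (throttling, permission errors), and Lambda’s own async retries apply when the function itself fails. Configure both to avoid duplicate processing or message loss.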

Common Pitfalls / Anti-Patterns

Pitfall 1: Not Setting Visibility Timeout Correctly

If visibility timeout is too short, messages are reprocessed before the consumer finishes. If too long, a message whose consumer crashed stays hidden for the full timeout before anyone can retry it. Set it based on expected processing time plus a buffer.

The right value depends on what your consumer does. Estimate your typical message processing time including worst-case scenarios, then set the timeout to two to three times that figure. For a task that takes 30 seconds normally, set the visibility timeout to 60-90 seconds.

Scenario	Recommended Timeout	Reasoning
Quick processing (API call)	30 seconds	Short tasks need minimal buffer
Medium task (file processing)	5-10 minutes	Network delays and retries add time
Heavy task (video transcoding)	15-30 minutes	Long processing needs large headroom
Lambda-triggered SQS	6x Lambda timeout	Lambda timeout max is 15 min; visibility timeout should cover Lambda retries

For SQS queues that trigger Lambda functions, set the visibility timeout higher than the Lambda function timeout plus the function’s retry budget. Lambda polls the queue on your behalf. If the function takes 5 minutes and the visibility timeout is 4 minutes, the message becomes visible again before processing finishes, and another Lambda invocation picks it up, causing duplicate processing.
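The sizing rules above reduce to a couple of multiplications. The helper names and default multipliers below are assumptions drawn from the 30-second example and the 6x Lambda guideline in the table; the 12-hour cap is the SQS maximum for visibility timeout.

```python
MAX_VISIBILITY_TIMEOUT = 12 * 60 * 60  # SQS upper bound: 12 hours, in seconds


def timeout_for_worker(processing_seconds, multiplier=2):
    """Visibility timeout for a self-managed consumer: processing time times a margin."""
    return min(int(processing_seconds * multiplier), MAX_VISIBILITY_TIMEOUT)


def timeout_for_lambda(function_timeout_seconds, multiplier=6):
    """Visibility timeout for a Lambda-triggered queue, based on the function timeout."""
    return min(function_timeout_seconds * multiplier, MAX_VISIBILITY_TIMEOUT)


worker_low = timeout_for_worker(30)                  # 2x margin
worker_high = timeout_for_worker(30, multiplier=3)   # 3x margin
lambda_queue = timeout_for_lambda(300)               # function timeout of 5 minutes
```

For work whose duration varies widely, a consumer can also extend the timeout on a single message while it is still processing, using ChangeMessageVisibility, instead of raising the queue-wide value.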

Monitor the ApproximateNumberOfMessagesNotVisible metric. A high count combined with rising message age is a sign your visibility timeout is too short for actual processing time.

Queue Configuration Pitfalls

Using Standard Queues When FIFO Is Needed

Standard queues offer best-effort ordering. If your business requires ordering, use FIFO queues with message group IDs.

Not Polling Efficiently

Short polling (default) wastes API calls. Use long polling (WaitTimeSeconds > 0) to reduce costs and latency.

Forgetting to Delete Messages After Processing

SQS does not auto-delete. Always call DeleteMessage after successful processing or messages will be reprocessed.
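One way to structure the receive, process, delete cycle is sketched below. `drain_once` takes the client as a parameter, and `FakeSqs` is a hypothetical stand-in that exists only so the example runs without AWS; with boto3 you would pass `boto3.client('sqs')` instead.

```python
import json


def drain_once(sqs, queue_url, handler):
    """Receive one batch and delete only the messages the handler processed."""
    response = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    deleted = 0
    for message in response.get("Messages", []):
        try:
            handler(json.loads(message["Body"]))
        except Exception:
            continue  # not deleted: it reappears after the visibility timeout
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        deleted += 1
    return deleted


class FakeSqs:
    """Hypothetical stand-in for a boto3 SQS client."""

    def __init__(self, messages):
        self.messages = messages
        self.deleted = []

    def receive_message(self, **kwargs):
        return {"Messages": self.messages} if self.messages else {}

    def delete_message(self, QueueUrl, ReceiptHandle):
        self.deleted.append(ReceiptHandle)


def handler(payload):
    if payload.get("fail"):
        raise ValueError("simulated processing failure")


fake = FakeSqs([
    {"Body": json.dumps({"task": "ok"}), "ReceiptHandle": "rh-1"},
    {"Body": json.dumps({"fail": True}), "ReceiptHandle": "rh-2"},
])
deleted_count = drain_once(fake, "https://example.invalid/queue", handler)
```

Deleting after the handler returns, never before, is what keeps a crash from losing the message; the failed one stays in the queue and eventually reaches the DLQ if a redrive policy is set.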

Mixing Message Types in One Queue

Different consumers processing different message types in one queue leads to coupling and processing errors. Use separate queues per message type.

If a Lambda subscriber throws or an HTTP endpoint goes down, SNS retries based on its delivery policy. Without a DLQ, once those retries run out, the message is gone. No trace of what failed, no way to get it back.

The default SNS delivery policy for HTTP/S endpoints gives you 3 retries, 20 seconds apart. And no DLQ. That means a 10-minute endpoint outage exhausts the retry budget in about a minute and silently drops the message. You will not know unless you monitor NumberOfNotificationsFailed.

Always attach a DLQ to every SNS subscription. The DLQ catches messages after retries are exhausted. You inspect the payload, fix the issue, and replay. Without it, you are working blind.

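import json

import boto3

sns = boto3.client('sns')

# topic_arn, lambda_arn, and dlq_queue_arn are assumed to be defined elsewhere.
# Note: AWS documents custom delivery policies for HTTP/S subscriptions only;
# other protocols, including lambda, follow the SNS-defined retry schedule.

# Attach delivery policy and DLQ on subscribe
sns.subscribe(
    TopicArn=topic_arn,
    Protocol='lambda',
    Endpoint=lambda_arn,
    Attributes={
        'DeliveryPolicy': json.dumps({
            'healthyRetryPolicy': {
                'minDelayTarget': 60,    # shortest delay between retries, in seconds
                'maxDelayTarget': 600,   # longest delay between retries, in seconds
                'numRetries': 5,
                'numNoDelayRetries': 0,  # no immediate retries
                'backoffFunction': 'exponential'
            }
        }),
        # Messages that exhaust the retries go to this SQS queue
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': dlq_queue_arn
        })
    }
)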

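The retry chain between SNS and Lambda trips people up. SNS invokes Lambda asynchronously, so it retries only when the invocation itself is not accepted (throttling, missing permissions, service errors). Once Lambda accepts the event, SNS counts the message as delivered, and any error the function throws is handled by Lambda's own async invocation retries. Configure both sides so they do not step on each other. The SNS to Lambda: DLQ Configuration section above has the full picture.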

Set a CloudWatch alarm on NumberOfNotificationsFailed. Non-zero means deliveries are failing. Also watch DLQ depth via ApproximateNumberOfMessagesVisible — messages piling up there mean something downstream is persistently broken.
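One way to express that alarm is to build the arguments for CloudWatch's `put_metric_alarm` call. This is a sketch, not a prescribed configuration: the function name, alarm name format, 60-second period, and `alarm_topic_arn` (an SNS topic that receives the alarm notification) are assumptions to adjust.

```python
def failed_delivery_alarm(topic_name: str, alarm_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments that fire on any failed SNS delivery."""
    return {
        "AlarmName": f"sns-{topic_name}-delivery-failures",  # hypothetical naming scheme
        "Namespace": "AWS/SNS",
        "MetricName": "NumberOfNotificationsFailed",
        "Dimensions": [{"Name": "TopicName", "Value": topic_name}],
        "Statistic": "Sum",
        "Period": 60,                 # evaluate one-minute buckets
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # any non-zero count alarms
        "TreatMissingData": "notBreaching",            # no data means no failures
        "AlarmActions": [alarm_topic_arn],
    }


# Usage (requires AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **failed_delivery_alarm("orders", "arn:aws:sns:us-east-1:123456789012:ops-alerts")
#   )
```

The same shape works for DLQ depth by switching the namespace to AWS/SQS, the metric to ApproximateNumberOfMessagesVisible, and the dimension to QueueName.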

Interview Questions

1. What is the difference between SQS Standard and FIFO queues in terms of delivery guarantees and ordering?

Expected answer points:

Standard queues provide at-least-once delivery (messages may be delivered more than once); FIFO provides exactly-once processing
Standard queues offer best-effort ordering; FIFO preserves message order within message groups
FIFO throughput is lower (300 messages/sec without batching, 3000 with); Standard has unlimited throughput
FIFO uses message group IDs to keep ordering within each group while different groups are processed in parallel

2. How does SQS visibility timeout work and why is it important for message processing reliability?

Expected answer points:

After a consumer receives a message, it becomes invisible to other consumers for the visibility timeout duration
If the consumer crashes before deleting the message, it reappears after the visibility timeout expires
Setting too short: the message is redelivered before the consumer finishes; setting too long: a failed message stays hidden for the full timeout before it can be retried
Best practice: set timeout to longer than expected processing time plus buffer

3. What is a dead letter queue (DLQ) and when would you use one with SQS or SNS?

Expected answer points:

A DLQ captures messages that fail processing after a configured number of attempts
For SQS: configure a redrive policy to send messages to DLQ after maxReceiveCount failures
For SNS: attach a redrive policy to the subscription so messages that exhaust delivery retries go to an SQS DLQ
DLQs enable failure analysis without blocking the main queue or losing messages
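For the SQS side, the redrive policy is a JSON string set as a queue attribute. A minimal sketch, assuming a hypothetical helper name and a default of 5 receives before a message is moved:

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the Attributes argument that routes repeatedly failing messages to a DLQ."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        })
    }


# Usage with a boto3 SQS client (requires AWS credentials):
#   sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=redrive_attributes(dlq_arn))
```

A low count moves poison messages out quickly; a higher count tolerates transient failures at the price of more wasted receives.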

4. Explain the SNS fan-out to SQS pattern. What problem does it solve?

Expected answer points:

One SNS topic fans out to multiple SQS queues, one per consumer group
Solves the pub/sub to point-to-point conversion: SNS broadcasts, SQS queues deliver
Each consumer group processes independently with its own retry and visibility timeout
Example: analytics, notifications, and audit each get their own queue processing the same events

5. What is long polling and how does it reduce SQS costs?

Expected answer points:

Short polling (default): SQS responds immediately even if queue is empty, billing per request
Long polling: SQS waits up to 20 seconds for messages to arrive before responding
One long-poll request replaces many empty short-poll requests, so fewer billable requests are made
Set WaitTimeSeconds > 0 to enable long polling

6. How does SNS message filtering work and what are its benefits?

Expected answer points:

Subscribers define filter policies in JSON; SNS only delivers messages matching the policy
Reduces unnecessary processing: subscribers receive only relevant messages
Messages not matching filter are not delivered (not charged)
Example: an SQS queue only interested in order.placed events can filter out order.cancelled
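The matching rule can be illustrated with a deliberately simplified model. This sketch covers exact string matches only (real filter policies also support prefix, numeric, and anything-but operators), and the `event_type` attribute name is an assumption taken from the example above.

```python
import json

# Filter policy for a subscriber that only wants order.placed events.
ORDER_PLACED_POLICY = {"event_type": ["order.placed"]}


def matches(policy: dict, attributes: dict) -> bool:
    """Simplified model: every policy key must match one of its allowed values."""
    return all(attributes.get(key) in allowed for key, allowed in policy.items())


# Attaching the policy with a boto3 SNS client (requires AWS credentials):
#   sns.subscribe(
#       TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn,
#       Attributes={"FilterPolicy": json.dumps(ORDER_PLACED_POLICY)},
#   )
```

Keys in a policy are combined with AND and the values listed under one key with OR, which is what the `all(... in allowed ...)` expression mirrors.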

7. What are the key differences between SNS FIFO topics and SQS FIFO queues?

Expected answer points:

SNS FIFO is a pub/sub topic; SQS FIFO is a point-to-point queue
SNS FIFO delivers to all matching subscribers; SQS FIFO delivers to one consumer
Both order messages per message group, not across the whole topic or queue
Both provide exactly-once processing with a 5-minute deduplication window

8. When would you choose EventBridge over SNS for event routing?

Expected answer points:

EventBridge has built-in schema registry for event validation and discovery
EventBridge supports 200+ SaaS integrations (Salesforce, Zendesk, etc.)
EventBridge has rule-based routing with filtering (not just topic-based)
EventBridge supports archive and replay with configurable retention; standard SNS topics do not
EventBridge is more expensive: per event + processing vs SNS per message + delivery

9. What happens when SNS delivers to Lambda and the Lambda function throws an error?

Expected answer points:

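SNS invokes Lambda asynchronously; once Lambda accepts the event, SNS counts the message as delivered
A function error is retried by Lambda's async invocation settings (MaximumRetryAttempts, 2 by default), not by SNS
SNS retries only when the invocation itself fails (throttling, permissions, service errors); when those retries run out, the message goes to the subscription DLQ if one is configured
Chain: SNS delivery retry → Lambda async retry; configure a DLQ or failure destination on both sides to avoid message loss, and keep the handler idempotent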

10. What are the main cost optimization strategies for SQS and SNS at scale?

Expected answer points:

SQS: Enable long polling to batch empty responses; batch receives with MaxNumberOfMessages=10
SNS: Use PublishBatch API (up to 10 messages per call); use filter policies to avoid unnecessary deliveries
Both: Keep publishers and subscribers in the same region to avoid cross-region data transfer costs
SNS FIFO costs more than standard (per message + delivery)

11. How does SQS ContentBasedDeduplication work and when should you use it instead of manual message deduplication?

Expected answer points:

ContentBasedDeduplication uses SHA-256 hash of the message body as the deduplication ID automatically
Eliminates the need to generate and pass MessageDeduplicationId on every publish
Use when message body is unique per intended message (same content means duplicate)
Use manual MessageDeduplicationId when body may repeat but intent is unique (e.g., same action for different entities)
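The difference between the two modes shows up in a toy model of the deduplication window. This simulates the behavior described above and is not how SQS is implemented internally; the class and function names are hypothetical.

```python
import hashlib

DEDUP_WINDOW_S = 5 * 60  # FIFO deduplication interval: 5 minutes


def content_dedup_id(body: str) -> str:
    """Derive a deduplication ID from the body, as content-based deduplication does."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class DedupWindow:
    """Toy model: reject an ID seen again within the deduplication window."""

    def __init__(self) -> None:
        self._first_seen = {}  # dedup id -> time (seconds) it was first accepted

    def accept(self, dedup_id: str, now: float) -> bool:
        first = self._first_seen.get(dedup_id)
        if first is not None and now - first < DEDUP_WINDOW_S:
            return False  # duplicate inside the window: dropped
        self._first_seen[dedup_id] = now
        return True
```

Two sends with the identical body "refresh-cache" hash to the same ID, so the second is dropped even if it was meant for a different entity. Passing an explicit ID such as "refresh-cache:user-42" keeps both.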

12. What is the maximum message size in SQS and how do you handle payloads larger than that limit?

Expected answer points:

SQS maximum message size is 256KB (262,144 bytes)
For larger payloads: store payload in S3 or DynamoDB, send reference URL/ID in SQS message
Extended client library (Java) handles this pattern automatically with S3
Tradeoff: adds latency (extra S3/DynamoDB call) but keeps SQS for coordination
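The S3 reference pattern can be sketched without the extended client library. Treat this as an illustration under assumptions: the function name, the `payload_location` attribute, the key layout, and the default inline threshold are all hypothetical, and `sqs` and `s3` are assumed to be boto3 clients.

```python
import json
import uuid


def send_with_claim_check(sqs, s3, queue_url: str, bucket: str, payload: str,
                          max_inline_bytes: int = 256 * 1024) -> dict:
    """Send `payload` inline if it fits, otherwise store it in S3 and send a pointer."""
    if len(payload.encode("utf-8")) <= max_inline_bytes:
        body, location = payload, "inline"
    else:
        key = f"sqs-payloads/{uuid.uuid4()}"  # hypothetical key layout
        s3.put_object(Bucket=bucket, Key=key, Body=payload.encode("utf-8"))
        body, location = json.dumps({"bucket": bucket, "key": key}), "s3"
    return sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        # Tells the consumer whether the body is the payload or a pointer to it.
        MessageAttributes={
            "payload_location": {"DataType": "String", "StringValue": location}
        },
    )
```

Message attributes count toward the message size limit, so a production version would leave some headroom below the threshold. The consumer reads `payload_location` and fetches the object from S3 only when the value is "s3".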

13. Explain how SNS message filtering can lead to silent message loss if not configured carefully.

Expected answer points:

If a subscriber has a filter policy but receives a message that does not match, the message is silently discarded
No notification to publisher or subscriber that message was filtered out
Monitor the NumberOfNotificationsFilteredOut metric in CloudWatch to detect silent drops
Always test filter policies before deployment; consider default subscription without filters

14. What are the differences between SNS HTTP/HTTPS subscription confirmation and Lambda subscription confirmation?

Expected answer points:

HTTP/HTTPS: SNS sends a POST with SubscribeURL that must be visited to confirm; times out in 3 days
Lambda: no confirmation handshake is needed; a subscription created by the function's owner is confirmed automatically
The function never receives a confirmation message and does not call the ConfirmSubscription API; it only needs a resource-based policy that lets SNS invoke it
Both require an endpoint SNS can reach: a URL that can receive the confirmation POST for HTTP/HTTPS, an invoke permission for Lambda

15. How does SQS visibility timeout interact with Lambda function timeout settings?

Expected answer points:

The Lambda service polls SQS through an event source mapping; each received message becomes invisible for the visibility timeout
If Lambda runs longer than visibility timeout, SQS makes the message visible again and another Lambda can pick it up
Set the visibility timeout to at least 6x the Lambda timeout so that retries finish before the message reappears
If Lambda crashes before deleting, message reappears after visibility timeout; idempotent processing is essential

16. What is the purpose of MessageGroupId in SQS FIFO and how does it enable parallel processing?

Expected answer points:

Messages with the same GroupId are processed in order; messages with different GroupIds can be processed in parallel
Enables ordering guarantee within an entity (same order, same user) while scaling horizontally
Each consumer instance processes messages from one or more groups independently
Design groups around independent entities: one group per order ID, user ID, or entity requiring ordering
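Grouping by entity comes down to how the send call is built. A minimal sketch, assuming orders are the ordering unit and that each event carries a unique `event_id`; the function name and the "order-" prefix are hypothetical.

```python
import json


def fifo_message(queue_url: str, order_id: str, event_id: str, event: dict) -> dict:
    """Build send_message arguments that keep one order's events in sequence."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(event, sort_keys=True),
        "MessageGroupId": f"order-{order_id}",  # ordering scope: one group per order
        "MessageDeduplicationId": event_id,     # explicit ID, unique per event
    }


# Usage with a boto3 SQS client (requires AWS credentials):
#   sqs.send_message(**fifo_message(queue_url, "1001", "evt-1", {"type": "order.placed"}))
```

Events for order 1001 share a group and are delivered in order, while events for order 1002 land in a different group and can be processed by another consumer at the same time.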

17. Describe the steps to configure cross-account SNS subscriptions securely.

Expected answer points:

Publisher account creates SNS topic with resource-based policy allowing subscriber account
Policy must grant sns:Subscribe to the subscriber account or to specific IAM roles in that account
Subscriber account creates an SQS queue whose queue policy lets the topic call sqs:SendMessage, then subscribes the queue to the cross-account topic
Data transfer charges apply when the topic and the queue are in different regions
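The topic-side policy for the first step might look like the sketch below. It grants sns:Subscribe, the action a subscriber account needs in order to attach its queue. The function name, the Sid, and the restriction to the sqs protocol are illustrative assumptions.

```python
import json


def cross_account_subscribe_policy(topic_arn: str, subscriber_account_id: str) -> str:
    """Build a topic policy that lets another account subscribe SQS queues."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowCrossAccountSubscribe",  # hypothetical statement ID
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{subscriber_account_id}:root"},
            "Action": "sns:Subscribe",
            "Resource": topic_arn,
            # Least privilege: only SQS subscriptions are allowed.
            "Condition": {"StringEquals": {"sns:Protocol": "sqs"}},
        }],
    }
    return json.dumps(policy)


# Usage with a boto3 SNS client in the publisher account (requires AWS credentials):
#   sns.set_topic_attributes(
#       TopicArn=topic_arn, AttributeName="Policy",
#       AttributeValue=cross_account_subscribe_policy(topic_arn, "210987654321"),
#   )
```

Setting the Policy attribute replaces the topic's existing policy, so in practice this statement would be merged with the statements already in place.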

18. What happens to messages in an SQS FIFO queue if you have multiple message groups and one group has a stuck consumer?

Expected answer points:

Messages in other message groups continue processing normally; they are not blocked
Messages in the stuck group accumulate until the consumer recovers or the message retention period expires
FIFO within a group is preserved; other groups are independent
Monitor ApproximateAgeOfOldestMessage and ApproximateNumberOfMessagesNotVisible to detect stuck messages; these metrics are reported per queue, not per group

19. How does SNS delivery policy work and what are the key configuration parameters?

Expected answer points:

Delivery policy controls retry behavior when delivery fails (non-2xx response or timeout)
minDelayTarget: minimum delay for a retry (default 20 seconds for HTTP/S)
maxDelayTarget: maximum delay between retries
numRetries: maximum retry attempts before the message goes to the DLQ, or is dropped if none is attached
backoffFunction: linear, arithmetic, geometric, or exponential growth of the delay between retries

20. What are the security implications of enabling SQS server-side encryption (SSE) with CMK vs AWS managed key?

Expected answer points:

AWS managed key (alias/aws/sqs): no monthly key fee, automatic key rotation, no management overhead, but the key policy cannot be edited
Customer managed CMK: allows key access policies, audit logging via CloudTrail, configurable key rotation, costs per key
CMK enables stricter access controls: only authorized consumers can decrypt messages
Use CMK when regulatory or compliance requirements mandate customer control over encryption keys

Conclusion

SQS is pull-based point-to-point queuing. SNS is push-based pub/sub. Standard queues give you unlimited throughput with at-least-once delivery; FIFO gives you exactly-once with ordering. SNS fan-out to multiple SQS queues combines pub/sub flexibility with queuing durability. Visibility timeout controls when messages reappear if not acknowledged. Dead letter queues capture failures. Long polling reduces empty responses and costs. Server-side encryption with KMS protects messages at rest.

Pre-Deployment Checklist

Pre-Deployment and Operations

Metrics to Monitor

SQS queue depth: ApproximateNumberOfMessagesVisible
SQS old message age: ApproximateAgeOfOldestMessage (critical for ordering)
SNS delivery rate: NumberOfMessagesPublished, NumberOfNotificationsDelivered
SNS delivery success/failure: NumberOfNotificationsFailed
SNS filtered-out messages: NumberOfNotificationsFilteredOut
SQS empty receives: NumberOfEmptyReceives (long polling effectiveness)
FIFO group ordering lag: custom metric per message group (built-in SQS metrics are per queue)

Logs to Capture

SQS sendMessage, receiveMessage, and deleteMessage events
SNS publish and delivery events
SNS subscription creation and deletion
SQS visibility timeout expirations
Dead letter queue arrivals (via DLQ subscription)
CloudTrail API calls for administrative actions

Alerts to Configure

SQS queue depth exceeds threshold
Oldest message age exceeds SLA threshold
SNS delivery failure rate exceeds threshold
SNS filtered-out message rate is abnormal
SQS long polling not effective (empty responses)
FIFO message group lag for critical groups

Security Checklist

Authentication: Use IAM roles for AWS SDK clients; avoid long-term access keys
Authorization: Use IAM policies for SQS/SNS access; principle of least privilege
Encryption in transit: Enable TLS; use VPC endpoints for private access
Encryption at rest: Enable SQS server-side encryption (SSE) with KMS
VPC endpoints: Use AWS PrivateLink to keep traffic within AWS network
Message content: Do not send sensitive data unencrypted; use SNS message encryption or application-level encryption
Cross-account access: Use resource policies for cross-account SNS subscriptions
Audit logging: Enable CloudTrail for all SQS and SNS API operations
