Cloud Cost Optimization: Right-Sizing, Reserved Capacity, Spot Instances
Control cloud costs without sacrificing reliability. Learn right-sizing compute, reserved capacity planning, spot instance strategies, and cost allocation across teams.
Cloud bills surprise people. A simple application that should cost $500/month balloons to $5,000. Without visibility and control, cloud spending spirals.
Cloud cost optimization is getting the most from your cloud spend. It involves right-sizing resources, using reserved capacity wisely, handling variable workloads efficiently, and allocating costs across teams.
This article covers practical techniques to reduce cloud spend without sacrificing reliability.
The Problem with Cloud Waste
Most cloud waste comes from overprovisioning. Engineers provision for peak load they never see. They forget development environments running at 3am. They provision for scenarios that never materialize.
Industry analyses, including AWS's own optimization guidance, consistently find that customers typically use only 20-30% of their provisioned compute. That means 70-80% of the money goes to idle capacity.
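To make the waste concrete, multiply the bill by the idle fraction. A rough illustration, not an AWS formula:

```python
def idle_spend(monthly_bill, utilization):
    """Dollars paid each month for capacity that sits idle."""
    return monthly_bill * (1 - utilization)

# At 25% utilization, a $5,000 bill includes $3,750 of idle capacity
wasted = idle_spend(5000, 0.25)
```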
Common Sources of Waste
- Overprovisioned instances: Large instances with low CPU utilization
- Unused resources: Test environments left running
- Data transfer: Cross-region transfers that could be avoided
- Idle capacity: Production loads that do not need 24/7 full capacity
- Storage: Backups kept longer than necessary
Right-Sizing Compute
Start with what you actually use. Most instances are too large.
Analyzing Instance Utilization
```python
import boto3
from datetime import datetime, timedelta, timezone

def analyze_instance_utilization(instance_id, days=14):
    cloudwatch = boto3.client('cloudwatch')
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=days)
    metrics = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=start_time,
        EndTime=end_time,
        Period=3600,
        Statistics=['Average', 'Maximum']
    )
    datapoints = metrics['Datapoints']
    if not datapoints:
        return {'instance_id': instance_id, 'recommendation': 'No data'}
    avg_cpu = sum(p['Average'] for p in datapoints) / len(datapoints)
    max_cpu = max(p['Maximum'] for p in datapoints)
    return {
        'instance_id': instance_id,
        'avg_cpu': avg_cpu,
        'max_cpu': max_cpu,
        'recommendation': suggest_instance_type(avg_cpu, max_cpu)
    }

def suggest_instance_type(avg_cpu, max_cpu):
    # Under 20% average CPU: likely overprovisioned
    if avg_cpu < 20:
        return 'Consider downsizing'
    # Under 60% peak CPU: comfortable headroom
    if max_cpu < 60:
        return 'Adequate for current load'
    return 'Appropriately sized'
```
Instance Families by Use Case
| Use Case | Recommended Family | Why |
|---|---|---|
| General web servers | T3, M5 | Balance of cost and performance |
| CPU intensive | C5, C6i | Compute optimized |
| Memory intensive | R5, R6i | Memory optimized |
| Burstable workloads | T3 | Credits handle spikes |
Right-Sizing Formula
A practical formula: size for your 95th percentile load, not average. You need headroom for spikes. But not 10x headroom.
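Computing a P95 from the same hourly samples is straightforward. This nearest-rank sketch (a simplification, not a library call) shows why the average alone misleads:

```python
def p95(samples):
    """Nearest-rank 95th percentile of utilization samples."""
    ordered = sorted(samples)
    return ordered[int(0.95 * (len(ordered) - 1))]

# Spiky workload: 90 quiet hours at 10% CPU, 10 busy hours at 85%
cpu_samples = [10] * 90 + [85] * 10
average = sum(cpu_samples) / len(cpu_samples)  # 17.5 - looks overprovisioned
peak_p95 = p95(cpu_samples)                    # 85  - actually near capacity
```

Sizing on the 17.5% average would starve the busy hours; the P95 is what the instance must actually fit.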
```python
def right_size_instance(peak_cpu, peak_memory, current_type):
    # get_smaller_instance_family: helper returning the next size down
    smaller = get_smaller_instance_family(current_type)
    # Only downsize if the smaller instance still fits peak load
    if (peak_cpu < smaller.max_cpu and
            peak_memory < smaller.max_memory):
        return smaller
    return current_type  # Already the smallest viable size
```
Reserved Capacity
Reserved instances (RIs) offer significant discounts in exchange for commitment. A 1-year reserved instance can save 30-40% compared to on-demand.
When to Use Reserved Instances
- Predictable baseline load
- Steady-state production workloads
- Applications you will run for at least a year
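The arithmetic behind the 30-40% figure, using an illustrative m5.large on-demand rate (the rate and discount are assumptions; check current pricing):

```python
HOURS_PER_MONTH = 730

def monthly_ri_savings(on_demand_rate, ri_discount, instance_count):
    """Dollars saved per month by reserving instead of running on-demand."""
    on_demand_cost = on_demand_rate * HOURS_PER_MONTH * instance_count
    return on_demand_cost * ri_discount

# 10 always-on m5.large at $0.096/hr with a 35% discount: ~$245/month saved
savings = monthly_ri_savings(0.096, 0.35, 10)
```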
Reserved vs On-Demand Mix
Reserve what you know. Keep on-demand for variability.
```python
def calculate_reservation_strategy(baseline_instances, peak_instances,
                                   on_demand_rate, reserved_rate):
    # Reserve the always-on baseline; cover everything above it on-demand
    hours_per_month = 730
    on_demand_for_peaks = peak_instances - baseline_instances
    monthly_savings = (baseline_instances * hours_per_month
                       * (on_demand_rate - reserved_rate))
    return {
        'reserved_instances': baseline_instances,
        'on_demand_for_peaks': on_demand_for_peaks,
        'monthly_savings': monthly_savings
    }
```
Savings Plans
AWS Savings Plans offer flexibility that RIs lack. With a Compute Savings Plan, you commit to a dollar amount per hour rather than to specific instances, and the discount applies across instance families, sizes, and regions.
```json
{
  "savings_plan_type": "compute",
  "commitment": "$50 per hour",
  "term": "1_year",
  "payment": "partial_upfront"
}
```
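To size a commitment, estimate how much discounted usage an hourly dollar commitment buys. The rate and discount here are assumptions for illustration:

```python
def savings_plan_coverage(commitment_per_hour, on_demand_rate, discount=0.40):
    """Instance-hours of usage covered per hour of dollar commitment."""
    discounted_rate = on_demand_rate * (1 - discount)
    return commitment_per_hour / discounted_rate

# A $1/hour commitment at a 40% discount covers ~17 m5.large hours
# (on-demand $0.096/hr becomes $0.0576/hr under the plan)
covered = savings_plan_coverage(1.0, 0.096)
```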
Spot Instances
Spot instances offer 60-90% discounts. The catch: AWS can reclaim them with two minutes' warning. They work for fault-tolerant, flexible workloads.
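Because a reclaim comes with only a short notice, spot workloads should watch for the interruption signal and checkpoint when it appears. A minimal poller against the instance-action metadata path AWS documents for spot interruption notices (it returns 404 until an interruption is scheduled):

```python
import urllib.request
import urllib.error

# EC2 spot interruption notice endpoint: 404 until AWS schedules a
# stop/terminate action, then 200 with a JSON body describing it
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending(url=SPOT_ACTION_URL):
    """True when an interruption notice has been posted for this instance."""
    try:
        with urllib.request.urlopen(url, timeout=1) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # 404 or unreachable: no interruption scheduled
```

A worker loop would call this every few seconds and checkpoint state the moment it returns True.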
Good Fit for Spot
- Batch processing
- CI/CD runners
- Data analysis
- Stateless application servers
- Containerized workloads
Bad Fit for Spot
- Databases with state
- Synchronous API servers
- Workloads needing guaranteed completion
- Strict latency requirements
Spot Instance Strategy
```python
import boto3

ec2 = boto3.client('ec2')

class SpotFleetManager:
    def __init__(self, target_capacity, launch_template_id):
        self.target_capacity = target_capacity
        self.launch_template_id = launch_template_id  # assumed to exist

    def launch_spot_fleet(self):
        # Spread capacity across several instance types so losing one
        # spot pool does not take out the whole fleet
        overrides = [
            {'InstanceType': itype, 'WeightedCapacity': weight}
            for itype, weight in self.diversify_allocation().items()
        ]
        return ec2.create_fleet(
            Type='instant',
            LaunchTemplateConfigs=[{
                'LaunchTemplateSpecification': {
                    'LaunchTemplateId': self.launch_template_id,
                    'Version': '$Latest'
                },
                'Overrides': overrides
            }],
            TargetCapacitySpecification={
                'TotalTargetCapacity': self.target_capacity,
                'DefaultTargetCapacityType': 'spot'
            },
            SpotOptions={'AllocationStrategy': 'price-capacity-optimized'}
        )

    def diversify_allocation(self):
        # Mix sizes and generations across instance families
        return {
            'c5.large': 1,
            'c5.xlarge': 2,
            'c6i.xlarge': 2
        }
```
Cost Allocation
When multiple teams share infrastructure, show each team their costs. Visibility drives optimization.
Tagging Strategy
Tag all resources consistently:
```yaml
# Tags to apply to every resource
- Environment: production, staging, development
- Team: payments, identity, platform
- Application: checkout, auth, api-gateway
- CostCenter: engineering, sales, marketing
- Owner: team@company.com
```
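Tagging only works if it is enforced. A minimal compliance check against a resource's tag dictionary (the required-tag set mirrors the list above):

```python
REQUIRED_TAGS = {"Environment", "Team", "Application", "CostCenter", "Owner"}

def missing_tags(resource_tags):
    """Return the required tag keys absent from a resource's tags."""
    return REQUIRED_TAGS - set(resource_tags)

# A resource tagged only with Team is missing the other four keys
gaps = missing_tags({"Team": "payments"})
```

Run a check like this on a schedule and flag (or auto-stop, in strict setups) any resource with gaps.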
Cost by Team
```python
import boto3

def get_costs_by_team(start_date, end_date):
    cost_explorer = boto3.client('ce')
    results = cost_explorer.get_cost_and_usage(
        TimePeriod={'Start': start_date, 'End': end_date},
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[
            {'Type': 'TAG', 'Key': 'Team'},
            {'Type': 'TAG', 'Key': 'Environment'}
        ]
    )
    # format_cost_report: helper that flattens the grouped response
    return format_cost_report(results)
```
Showback vs Shadow IT
Show teams their costs without gating access. They see spend but can still launch resources. This drives organic optimization without creating bureaucratic bottlenecks.
Storage Optimization
Compute gets attention, but storage costs add up too.
S3 Tiering
Move data to appropriate storage classes automatically:
```python
import boto3

s3 = boto3.client('s3')

def configure_s3_lifecycle(bucket):
    lifecycle = {
        'Rules': [
            {
                'ID': 'Move-to-IA-after-30-days',
                'Status': 'Enabled',
                'Filter': {'Prefix': ''},
                'Transitions': [
                    {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                    {'Days': 90, 'StorageClass': 'GLACIER'}
                ]
            }
        ]
    }
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=lifecycle
    )
```
Database Storage
Monitor database storage growth. Unused indexes and old data accumulate. Regular cleanup reduces storage costs.
```sql
-- Find unused indexes (PostgreSQL)
SELECT schemaname, relname, indexrelname
FROM pg_stat_user_indexes
WHERE idx_scan = 0;

-- Remove an old partition (PostgreSQL declarative partitioning)
ALTER TABLE events DETACH PARTITION events_old;
DROP TABLE events_old;
```
Architectural Choices
Architecture drives long-term costs. Some patterns cost more than others.
Stateful vs Stateless
Stateless services scale horizontally without session affinity issues. They are cheaper to scale. Keep state in databases, not application servers.
Right Database for the Job
- RDS for transactional workloads needing ACID guarantees
- DynamoDB for high-throughput key-value access
- S3 for object storage, backups, static content
- ElastiCache for caching, not persistent storage
Using the wrong database type wastes money. A general-purpose RDS for a simple cache is expensive. DynamoDB for complex queries requiring joins is painful.
Serverless for Variable Load
For highly variable workloads, Lambda or Cloud Functions can be cheaper than always-on servers. You pay per invocation, not per hour.
```python
def lambda_handler(event, context):
    # Pays only for actual invocations; no idle compute cost
    process_event(event)
```
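A back-of-envelope cost model makes the comparison against always-on servers concrete. The rates and free tier here are assumptions for illustration; verify current pricing before deciding:

```python
REQUEST_RATE = 0.20 / 1_000_000   # assumed $ per request
GB_SECOND_RATE = 0.0000166667     # assumed $ per GB-second
FREE_GB_SECONDS = 400_000         # assumed monthly free compute tier

def lambda_monthly_cost(requests, duration_ms, memory_gb=0.128):
    """Rough monthly bill, ignoring provisioned concurrency."""
    gb_seconds = requests * (duration_ms / 1000) * memory_gb
    billable = max(0, gb_seconds - FREE_GB_SECONDS)
    return requests * REQUEST_RATE + billable * GB_SECOND_RATE
```

At low volume the request charge dominates; as volume grows, compute duration and memory take over, and an always-on server may become cheaper again.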
Measuring Optimization
Track cost over time. Set targets. Celebrate wins.
Cost Flow Architecture
Cloud costs flow through a predictable path from resource usage to optimization decisions:
```mermaid
flowchart TD
    subgraph Resources[Resources]
        A1[EC2 Instances]
        A2[S3 Buckets]
        A3[RDS Databases]
        A4[Load Balancers]
    end
    A1 --> B[CloudWatch Metrics]
    A2 --> B
    A3 --> B
    A4 --> B
    B --> C[Cost Explorer]
    C --> D[Cost Reports]
    D --> E{Analyze}
    E -->|Right-size| F[Resize Resources]
    E -->|Reserve| G[Buy Reserved/Savings Plans]
    E -->|Spot| H[Switch to Spot]
    E -->|Idle| I[Terminate Resources]
    F --> J[Monthly Review]
    G --> J
    H --> J
    I --> J
    J --> C
```
The loop: resources generate metrics, metrics feed cost data, cost data drives optimization decisions, decisions get reviewed monthly. Break the loop at any point and costs drift.
Cost Diversification Trade-offs
| Strategy | Savings | Commitment | Flexibility | Best For |
|---|---|---|---|---|
| On-demand | 0% | None | Highest | Unpredictable, early-stage |
| Reserved (1yr) | 30-40% | Upfront or monthly | Low | Steady-state baseline |
| Reserved (3yr) | 50-60% | Full upfront | Very Low | Stable long-term workloads |
| Savings Plans | 40-60% | Hourly commitment | Medium | Compute flexibility |
| Spot Instances | 60-90% | None | Very Low | Fault-tolerant batch |
| Spot Fleet | 60-90% | None | Medium | Diversified fleet |
Diversification across strategies reduces risk. Base load covered by reserved capacity, variable load on savings plans, batch workloads on spot.
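A blended hourly cost for such a mix can be sketched with illustrative rates (all the $/hr figures below are assumptions):

```python
RATES = {"reserved": 0.060, "on_demand": 0.096, "spot": 0.029}  # assumed $/hr

def blended_hourly_cost(mix):
    """Hourly cost of a fleet mixing purchase strategies.

    mix maps strategy name -> instance count.
    """
    return sum(RATES[strategy] * count for strategy, count in mix.items())

# 10 reserved baseline + 4 on-demand for peaks + 6 spot batch workers
hourly = blended_hourly_cost({"reserved": 10, "on_demand": 4, "spot": 6})
```

Running all 20 instances on-demand would cost $1.92/hr; the diversified mix comes to about $1.16/hr for the same capacity.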
```python
def monthly_cost_report():
    # get_monthly_costs / generate_recommendations: reporting helpers
    costs = get_monthly_costs()
    return {
        'total': costs.total,
        'vs_last_month': costs.total - costs.last_month,
        'vs_budget': costs.total - costs.budget,
        'by_service': costs.by_service,
        'by_team': costs.by_team,
        'recommendations': generate_recommendations(costs)
    }
```
Common Mistakes
Buying Reserved Too Early
Before buying reserved instances, understand your baseline. RIs purchased for load that later shrinks lock you into wrong-sized capacity for the rest of the term.
Ignoring Data Transfer
Data transfer costs hide in bills. Cross-region replication, internet egress, and CDN transfers add up. Monitor data transfer separately.
Not Using Cost Explorer
AWS Cost Explorer shows where money goes. Without it, you are guessing. Enable it. Review it monthly.
Optimizing for Cost Alone
Cheapest is not always best. An application that fails costs more than an expensive reliable one. Balance cost with reliability requirements.
When to Use / When Not to Use Cost Optimization
Apply aggressive cost optimization when:
- Cloud spend is a significant percentage of operating costs
- Infrastructure needs are stable and predictable
- Engineering team has capacity for optimization projects
- Business can tolerate trade-offs for cost savings
Delay aggressive optimization when:
- Company is in rapid growth mode
- Infrastructure requirements are still evolving
- Team is small and focused on features
- Reliability is more costly to sacrifice than cloud spend
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Reserved instances bought for wrong size | Locked into expensive over-provisioned capacity | Analyze utilization data before purchase; start with 1-year partial upfront |
| Spot instances reclaimed during critical batch | Job fails; processing delayed | Use diverse instance pools; maintain fallback capacity; checkpoint frequently |
| S3 lifecycle policy misconfigured | Data deleted prematurely or never tiered | Test lifecycle policies on non-critical data first; set up versioning |
| Database downsized too aggressively | Query timeouts; application errors | Monitor query performance; right-size incrementally; maintain safety margins |
| Auto-scaling scaled down during traffic spike | Requests queued or rejected | Set appropriate scale-down thresholds; maintain minimum instance counts |
Observability Checklist
Metrics:
- Cost per active user or per transaction
- Instance utilization P95 (not just average)
- Cost by service, team, and environment
- Reserved vs on-demand vs spot mix
- Storage utilization and growth rate
- Data transfer volume and cost
Logs:
- Cost anomaly alerts (spend spike vs baseline)
- Reserved instance utilization
- Auto-scaling events with context
- Underutilized resource detections
Alerts:
- Daily spend exceeds daily budget threshold
- Cost increases more than 20% week-over-week without explanation
- Reserved instance utilization below 70%
- Storage growth rate exceeds 10% per week
- A resource has been idle for more than 30 days
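The spend-spike alert can be a simple comparison against a trailing baseline. The 20% threshold and seven-day window here are illustrative defaults:

```python
def spend_anomaly(todays_spend, trailing_daily_spend, threshold=1.20):
    """Flag spend more than 20% above the trailing daily average."""
    baseline = sum(trailing_daily_spend) / len(trailing_daily_spend)
    return todays_spend > baseline * threshold
```

Wire this into whatever pages the on-call rotation, or simply into a daily Slack report.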
Security Checklist
- Cost controls prevent unauthorized resource creation (budget alerts, IAM policies)
- Spot instance interruption does not expose sensitive data in process memory
- Lifecycle policies do not delete data before compliance retention period
- Cost optimization does not disable security monitoring or logging
- Shared accounts are not used for cost allocation (proper tagging)
- Reserved instance coverage does not create pressure to keep insecure instances
Common Pitfalls / Anti-Patterns
Right-Sizing Based on Average Utilization
Average CPU at 30% looks like over-provisioning. But if P95 is 80%, the instance is correctly sized. Right-size based on peak, not average.
Maximizing Spot Usage
Spot instances are not suitable for everything. Databases, stateful services, and latency-sensitive operations should stay on on-demand or reserved.
Ignoring Data Transfer Costs
Data transfer is often the second-largest cost category after compute. Do not optimize compute while ignoring egress.
Chasing Small Wins
Reducing waste in development environments saves money. But if production is over-provisioned, fixing one service saves more than eliminating all dev waste.
Optimizing Yesterday’s Architecture
Right-size after understanding current load patterns. But also consider whether the architecture itself needs updating: serverless or managed services might cost less than optimized EC2.
Capacity Estimation and Benchmark Data
Use these numbers for initial capacity planning and cost estimation. They are approximate US-region list prices at the time of writing; verify current rates before committing.
EC2 Spot Price Benchmarks
| Instance Type | On-Demand $/hr | Spot Price Range | Typical Savings |
|---|---|---|---|
| t3.micro | $0.0104 | $0.003 - $0.006 | 50-70% |
| t3.medium | $0.0208 | $0.006 - $0.012 | 50-70% |
| m5.large | $0.096 | $0.029 - $0.058 | 50-70% |
| m5.xlarge | $0.192 | $0.058 - $0.115 | 50-70% |
| c5.large | $0.085 | $0.026 - $0.051 | 50-70% |
| r5.large | $0.126 | $0.038 - $0.076 | 50-70% |
S3 Storage Cost Benchmarks
| Storage Class | $/GB/month | GET/POST/DELETE per 1,000 | Egress per GB |
|---|---|---|---|
| Standard | $0.023 | $0.0004 | $0.090 |
| IA | $0.0125 | $0.001 | $0.090 |
| Glacier | $0.004 | $0.05 (retrieval) | $0.090 |
| Glacier Deep Archive | $0.00099 | $0.10 (retrieval) | $0.090 |
RDS Cost Benchmarks
| Instance Class | $/month (single-AZ) | $/month (multi-AZ) |
|---|---|---|
| db.t3.micro | $14.60 | $28.00 |
| db.t3.medium | $29.20 | $58.40 |
| db.m5.large | $57.60 | $115.20 |
| db.m5.xlarge | $115.20 | $230.40 |
| db.r5.large | $91.20 | $182.40 |
Lambda Cost Benchmarks
| Invocation Pattern | Monthly Cost Estimate |
|---|---|
| 1M requests, 100ms avg | $0.20 |
| 10M requests, 100ms avg | $2.00 |
| 100M requests, 100ms avg | $20.00 |
| With provisioned concurrency (always-on) | ~$0.015/hour per GB of provisioned memory |
Quick Recap
Key Bullets:
- Right-size based on P95 utilization, not average
- Reserve predictable baseline; keep on-demand for variability
- Use spot for fault-tolerant workloads; never for stateful services
- Tag all resources for cost allocation visibility
- Automate storage tiering; review monthly
Copy/Paste Checklist:
Monthly Cost Review:
[ ] Review Cost Explorer dashboard
[ ] Identify top 5 cost drivers
[ ] Check reserved instance utilization
[ ] Verify all resources have tags (Team, Environment, Application)
[ ] Review idle resources for cleanup
[ ] Check data transfer costs
[ ] Verify S3 lifecycle policies are working
[ ] Review spot instance allocation
[ ] Update cost allocation report for stakeholders
[ ] Identify one optimization to implement this month
For more on infrastructure topics, see Load Balancing, Geo-Distribution, and Database Scaling.