Cloud Cost Optimization: Right-Sizing, Reserved Capacity, Spot Instances

Control cloud costs without sacrificing reliability. Learn right-sizing compute, reserved capacity planning, spot instance strategies, and cost allocation across teams.

Reading time: 13 min

Cloud Cost Optimization: Right-Sizing, Reserved Capacity, and Spot Instances

Cloud bills surprise people. A simple application that should cost $500/month balloons to $5,000. Without visibility and control, cloud spending spirals.

Cloud cost optimization is getting the most from your cloud spend. It involves right-sizing resources, using reserved capacity wisely, handling variable workloads efficiently, and allocating costs across teams.

This article covers practical techniques to reduce cloud spend without sacrificing reliability.

The Problem with Cloud Waste

Most cloud waste comes from overprovisioning. Engineers provision for peak load they never see. They forget development environments running at 3am. They provision for scenarios that never materialize.

Industry analyses of cloud usage regularly find that customers use only 20-30% of their provisioned compute. That means 70-80% of the money goes to idle capacity.

Common Sources of Waste

  • Overprovisioned instances: Large instances with low CPU utilization
  • Unused resources: Test environments left running
  • Data transfer: Cross-region transfers that could be avoided
  • Idle capacity: Production loads that do not need 24/7 full capacity
  • Storage: Backups kept longer than necessary
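Idle capacity is easy to find programmatically. As a sketch, unattached EBS volumes can be flagged like this (the $0.08/GB-month gp3 rate is an assumption; `find_unattached_volumes` operates on records already fetched with EC2 `describe_volumes`):

```python
# Assumed gp3 rate for illustration; check current pricing for your region
EBS_GB_MONTH = 0.08

def find_unattached_volumes(volumes):
    """volumes: volume records as returned by EC2 describe_volumes.
    A State of 'available' means the volume is not attached to anything."""
    return [
        {
            "volume_id": v["VolumeId"],
            "size_gb": v["Size"],
            "est_monthly_cost": round(v["Size"] * EBS_GB_MONTH, 2),
        }
        for v in volumes
        if v.get("State") == "available"  # unattached, likely waste
    ]
```

The same pattern extends to stopped instances, idle load balancers, and unused Elastic IPs.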

Right-Sizing Compute

Start with what you actually use. Most instances are too large.

Analyzing Instance Utilization

import boto3
from datetime import datetime, timedelta, timezone

def analyze_instance_utilization(instance_id, days=14):
    cloudwatch = boto3.client('cloudwatch')

    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=days)

    metrics = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=start_time,
        EndTime=end_time,
        Period=3600,  # one datapoint per hour
        Statistics=['Average', 'Maximum']
    )

    datapoints = metrics['Datapoints']
    if not datapoints:
        # New or stopped instances may have no metrics yet
        return {'instance_id': instance_id, 'recommendation': 'No data'}

    avg_cpu = sum(p['Average'] for p in datapoints) / len(datapoints)
    max_cpu = max(p['Maximum'] for p in datapoints)  # true peak, not peak of hourly averages

    return {
        'instance_id': instance_id,
        'avg_cpu': avg_cpu,
        'max_cpu': max_cpu,
        'recommendation': suggest_instance_type(avg_cpu, max_cpu)
    }

def suggest_instance_type(avg_cpu, max_cpu):
    # If avg CPU is under 20%, suggest smaller instance
    if avg_cpu < 20:
        return 'Consider downsizing'
    # If max CPU is under 60%, some headroom exists
    if max_cpu < 60:
        return 'Adequate for current load'
    return 'Appropriately sized'

Instance Families by Use Case

| Use Case            | Recommended Family | Why                             |
| ------------------- | ------------------ | ------------------------------- |
| General web servers | T3, M5             | Balance of cost and performance |
| CPU intensive       | C5, C6i            | Compute optimized               |
| Memory intensive    | R5, R6i            | Memory optimized                |
| Burstable workloads | T3                 | Credits handle spikes           |

Rightsizing Formula

A practical formula: size for your 95th percentile load, not average. You need headroom for spikes. But not 10x headroom.
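A minimal way to compute that P95 target from raw CPU samples, in pure Python with a nearest-rank percentile (the 20% headroom factor is an assumption, tune it to your risk tolerance):

```python
def percentile(values, pct):
    """Nearest-rank percentile of a sample list."""
    ordered = sorted(values)
    k = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[k]

def sizing_target(cpu_samples, headroom=1.2):
    """Size for P95 load plus modest headroom -- not the average, not 10x."""
    p95 = percentile(cpu_samples, 95)
    return {"p95_cpu": p95, "target_capacity": p95 * headroom}
```

Feed it the hourly CPU datapoints pulled from CloudWatch, then pick the smallest instance whose capacity covers the target.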

def right_size_instance(peak_cpu, peak_memory, current_type):
    # get_smaller_instance_family is a placeholder for a lookup table
    # mapping each instance type to the next size down in its family
    smaller = get_smaller_instance_family(current_type)

    # Only downsize if the smaller type still fits peak (P95) load
    if (peak_cpu < smaller.max_cpu and
            peak_memory < smaller.max_memory):
        return smaller

    return current_type  # already at the smallest workable size

Reserved Capacity

Reserved instances (RIs) offer significant discounts in exchange for commitment. A 1-year reserved instance can save 30-40% compared to on-demand.

When to Use Reserved Instances

  • Predictable baseline load
  • Steady-state production workloads
  • Applications you will run for at least a year
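Whether a reservation pays off comes down to hours of actual use: a reservation bills for every hour of the term, while on-demand bills only for hours used. A rough break-even sketch (the rates below are illustrative; the effective reserved rate assumes a ~35% discount):

```python
HOURS_PER_YEAR = 8760

def reservation_worth_it(on_demand_hourly, reserved_effective_hourly,
                         expected_utilization):
    """expected_utilization: fraction of the year the instance runs.
    The reservation wins only when the instance runs enough hours."""
    on_demand_annual = on_demand_hourly * HOURS_PER_YEAR * expected_utilization
    reserved_annual = reserved_effective_hourly * HOURS_PER_YEAR
    return reserved_annual < on_demand_annual

def break_even_utilization(on_demand_hourly, reserved_effective_hourly):
    # Run above this fraction of the year and the reservation is cheaper
    return reserved_effective_hourly / on_demand_hourly
```

At $0.096/hr on-demand versus an effective $0.062/hr reserved, the break-even is roughly 65% utilization: below that, stay on-demand.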

Reserved vs On-Demand Mix

Reserve what you know. Keep on-demand for variability.

def calculate_reservation_strategy(baseline_instances, peak_instances,
                                   reserved_hourly, on_demand_hourly,
                                   hours_per_month=730):
    # baseline_instances: the floor of your usage graph (always running)
    # peak_instances: the instance count at your busiest hour
    on_demand_peak = peak_instances - baseline_instances

    monthly_savings = (baseline_instances * hours_per_month *
                       (on_demand_hourly - reserved_hourly))

    return {
        'reserved_instances': baseline_instances,   # commit to the floor
        'on_demand_for_peaks': on_demand_peak,      # stay flexible above it
        'monthly_savings': monthly_savings
    }

Savings Plans

AWS Savings Plans offer flexibility that RIs lack. With a Compute Savings Plan, you commit to a dollar amount of compute usage per hour, not to specific instances. The discount applies across instance families, sizes, and regions, and even to Fargate and Lambda.

{
  "savings_plan_type": "compute",
  "commitment": "$50 per hour",
  "term": "1_year",
  "payment": "partial_upfront"
}

Spot Instances

Spot instances offer 60-90% discounts. The catch: AWS can reclaim them with only two minutes' warning. They work for fault-tolerant, flexible workloads.
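AWS signals a pending reclaim through the instance metadata service at `/latest/meta-data/spot/instance-action`; a 404 means no interruption is scheduled. A small parser for that response (the HTTP fetch itself is omitted here; this only interprets the body and status code):

```python
import json

# Interruption notices appear at this IMDS path roughly two minutes
# before reclaim; poll it every few seconds from spot workloads
SPOT_ACTION_PATH = "/latest/meta-data/spot/instance-action"

def parse_interruption_notice(body, status_code):
    """Interpret the spot instance-action metadata response."""
    if status_code == 404:
        return None  # no interruption pending
    notice = json.loads(body)
    # body looks like {"action": "terminate", "time": "<ISO-8601 timestamp>"}
    return {"action": notice["action"], "time": notice["time"]}
```

On a non-None result, the workload should checkpoint state and drain connections before the deadline.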

Good Fit for Spot

  • Batch processing
  • CI/CD runners
  • Data analysis
  • Stateless application servers
  • Containerized workloads

Bad Fit for Spot

  • Databases with state
  • Synchronous API servers
  • Workloads needing guaranteed completion
  • Strict latency requirements

Spot Instance Strategy

import boto3

class SpotFleetManager:
    def __init__(self, target_capacity, launch_template_id):
        # Assumes a pre-created launch template holding the AMI,
        # networking, and security configuration
        self.ec2 = boto3.client('ec2')
        self.target_capacity = target_capacity
        self.launch_template_id = launch_template_id

    def launch_spot_fleet(self):
        # Launch capacity across multiple instance types: if one
        # pool is reclaimed, the others keep serving load
        overrides = [
            {'InstanceType': itype, 'WeightedCapacity': weight}
            for itype, weight in self.diversify_allocation().items()
        ]

        return self.ec2.create_fleet(
            Type='instant',
            LaunchTemplateConfigs=[{
                'LaunchTemplateSpecification': {
                    'LaunchTemplateId': self.launch_template_id,
                    'Version': '$Latest'
                },
                'Overrides': overrides
            }],
            TargetCapacitySpecification={
                'TotalTargetCapacity': self.target_capacity,
                'DefaultTargetCapacityType': 'spot'
            },
            SpotOptions={'AllocationStrategy': 'price-capacity-optimized'}
        )

    def diversify_allocation(self):
        # Spread across sizes and generations so a shortage in
        # one pool does not take down the whole fleet
        return {
            'c5.large': 2,
            'c5.xlarge': 4,
            'c6i.xlarge': 4
        }

Cost Allocation

When multiple teams share infrastructure, show each team their costs. Visibility drives optimization.

Tagging Strategy

Tag all resources consistently:

# Tags to apply to every resource
- Environment: production, staging, development
- Team: payments, identity, platform
- Application: checkout, auth, api-gateway
- CostCenter: engineering, sales, marketing
- Owner: team@company.com
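Tag compliance can be audited on a schedule. This sketch flags resources missing required tags (the record shape `{"id": ..., "tags": {...}}` is an assumption; in practice you would build it from `describe_instances` or the Resource Groups Tagging API):

```python
# Required keys mirror the tagging strategy above
REQUIRED_TAGS = {"Environment", "Team", "Application", "CostCenter", "Owner"}

def audit_tags(resources):
    """resources: list of {"id": ..., "tags": {key: value}} records.
    Returns the resources missing any required tag, for follow-up."""
    violations = []
    for r in resources:
        missing = REQUIRED_TAGS - set(r.get("tags", {}))
        if missing:
            violations.append({"id": r["id"], "missing": sorted(missing)})
    return violations
```

Untagged resources become unallocatable cost, so many teams fail CI or page the owner when this list is non-empty.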

Cost by Team

def get_costs_by_team(start_date, end_date):
    cost_explorer = boto3.client('ce')

    results = cost_explorer.get_cost_and_usage(
        TimePeriod={'Start': start_date, 'End': end_date},
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[
            {'Type': 'TAG', 'Key': 'Team'},
            {'Type': 'TAG', 'Key': 'Environment'}
        ]
    )

    return format_cost_report(results)

Showback vs Shadow IT

Show teams their costs without gating access. They see spend but can still launch resources. This drives organic optimization without creating bureaucratic bottlenecks.

Storage Optimization

Compute gets attention, but storage costs add up too.

S3 Tiering

Move data to appropriate storage classes automatically:

def configure_s3_lifecycle(bucket):
    lifecycle = {
        'Rules': [
            {
                'ID': 'Move-to-IA-after-30-days',
                'Status': 'Enabled',
                'Filter': {'Prefix': ''},
                'Transitions': [
                    {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                    {'Days': 90, 'StorageClass': 'GLACIER'}
                ]
            }
        ]
    }

    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=lifecycle
    )

Database Storage

Monitor database storage growth. Unused indexes and old data accumulate. Regular cleanup reduces storage costs.

-- Find unused indexes (PostgreSQL)
SELECT schemaname, relname, indexrelname
FROM pg_stat_user_indexes
WHERE idx_scan = 0;

-- Detach and remove an old partition (PostgreSQL declarative partitioning)
ALTER TABLE events DETACH PARTITION events_old;
DROP TABLE events_old;

Architectural Choices

Architecture drives long-term costs. Some patterns cost more than others.

Stateful vs Stateless

Stateless services scale horizontally without session affinity issues. They are cheaper to scale. Keep state in databases, not application servers.

Right Database for the Job

  • RDS for transactional workloads needing ACID guarantees
  • DynamoDB for high-throughput key-value access
  • S3 for object storage, backups, static content
  • ElastiCache for caching, not persistent storage

Using the wrong database type wastes money. A general-purpose RDS for a simple cache is expensive. DynamoDB for complex queries requiring joins is painful.

Serverless for Variable Load

For highly variable workloads, Lambda or Cloud Functions can be cheaper than always-on servers. You pay per invocation, not per hour.

def lambda_handler(event, context):
    # Only pays for actual invocations
    # No idle compute cost
    process_event(event)
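Whether Lambda beats an always-on server comes down to arithmetic. A rough estimator (the $0.20 per million requests and $0.0000166667 per GB-second rates are assumptions based on published Lambda pricing; verify current rates for your region):

```python
def lambda_monthly_cost(requests, avg_ms, memory_mb,
                        per_million=0.20, per_gb_second=0.0000166667):
    """Estimate monthly Lambda cost: a per-request charge plus a
    duration charge billed in GB-seconds. Rates are illustrative."""
    gb_seconds = requests * (avg_ms / 1000) * (memory_mb / 1024)
    return requests / 1e6 * per_million + gb_seconds * per_gb_second

def cheaper_than_server(requests, avg_ms, memory_mb, server_monthly):
    # Compare against the fixed monthly cost of an always-on instance
    return lambda_monthly_cost(requests, avg_ms, memory_mb) < server_monthly
```

At low or spiky volume Lambda wins easily; at sustained high volume the GB-second charges eventually cross over the fixed server cost.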

Measuring Optimization

Track cost over time. Set targets. Celebrate wins.

Cost Flow Architecture

Cloud costs flow through a predictable path from resource usage to optimization decisions:

flowchart TD
    subgraph Resources[Resources]
        A1[EC2 Instances]
        A2[S3 Buckets]
        A3[RDS Databases]
        A4[Load Balancers]
    end
    A1 --> B[CloudWatch Metrics]
    A2 --> B
    A3 --> B
    A4 --> B
    B --> C[Cost Explorer]
    C --> D[Cost Reports]
    D --> E{Analyze}
    E -->|Right-size| F[Resize Resources]
    E -->|Reserve| G[Buy Reserved/Savings Plans]
    E -->|Spot| H[Switch to Spot]
    E -->|Idle| I[Terminate Resources]
    F --> J[Monthly Review]
    G --> J
    H --> J
    I --> J
    J --> C

The loop: resources generate metrics, metrics feed cost data, cost data drives optimization decisions, decisions get reviewed monthly. Break the loop at any point and costs drift.

Cost Diversification Trade-offs

| Strategy       | Savings | Commitment           | Flexibility | Best For                   |
| -------------- | ------- | -------------------- | ----------- | -------------------------- |
| On-demand      | 0%      | None                 | Highest     | Unpredictable, early-stage |
| Reserved (1yr) | 30-40%  | Upfront or quarterly | Low         | Steady-state baseline      |
| Reserved (3yr) | 50-60%  | Full upfront         | Very Low    | Stable long-term workloads |
| Savings Plans  | 40-60%  | Hourly commitment    | Medium      | Compute flexibility        |
| Spot Instances | 60-90%  | None                 | Very Low    | Fault-tolerant batch       |
| Spot Fleet     | 60-90%  | None                 | Medium      | Diversified fleet          |

Diversification across strategies reduces risk. Base load covered by reserved capacity, variable load on savings plans, batch workloads on spot.

def monthly_cost_report():
    # get_monthly_costs and generate_recommendations are placeholders
    # for a Cost Explorer wrapper and a recommendation rules engine
    costs = get_monthly_costs()

    return {
        'total': costs.total,
        'vs_last_month': costs.total - costs.last_month,
        'vs_budget': costs.total - costs.budget,
        'by_service': costs.by_service,
        'by_team': costs.by_team,
        'recommendations': generate_recommendations(costs)
    }

Common Mistakes

Buying Reserved Too Early

Before buying reserved instances, understand your baseline. Buying RIs for load that you later reduce locks you into wrongly sized capacity.

Ignoring Data Transfer

Data transfer costs hide in bills. Cross-region replication, internet egress, and CDN transfers add up. Monitor data transfer separately.

Not Using Cost Explorer

AWS Cost Explorer shows where money goes. Without it, you are guessing. Enable it. Review it monthly.

Optimizing for Cost Alone

Cheapest is not always best. An application that fails costs more than an expensive reliable one. Balance cost with reliability requirements.

When to Use / When Not to Use Cost Optimization

Apply aggressive cost optimization when:

  • Cloud spend is a significant percentage of operating costs
  • Infrastructure needs are stable and predictable
  • Engineering team has capacity for optimization projects
  • Business can tolerate trade-offs for cost savings

Delay aggressive optimization when:

  • Company is in rapid growth mode
  • Infrastructure requirements are still evolving
  • Team is small and focused on features
  • Reliability is more costly to sacrifice than cloud spend

Production Failure Scenarios

| Failure | Impact | Mitigation |
| ------- | ------ | ---------- |
| Reserved instances bought for wrong size | Locked into expensive over-provisioned capacity | Analyze utilization data before purchase; start with 1-year partial upfront |
| Spot instances reclaimed during critical batch | Job fails; processing delayed | Use diverse instance pools; maintain fallback capacity; checkpoint frequently |
| S3 lifecycle policy misconfigured | Data deleted prematurely or never tiered | Test lifecycle policies on non-critical data first; set up versioning |
| Database downsized too aggressively | Query timeouts; application errors | Monitor query performance; right-size incrementally; maintain safety margins |
| Auto-scaling scaled down during traffic spike | Requests queued or rejected | Set appropriate scale-down thresholds; maintain minimum instance counts |

Observability Checklist

  • Metrics:

    • Cost per active user or per transaction
    • Instance utilization P95 (not just average)
    • Cost by service, team, and environment
    • Reserved vs on-demand vs spot mix
    • Storage utilization and growth rate
    • Data transfer volume and cost
  • Logs:

    • Cost anomaly alerts (spend spike vs baseline)
    • Reserved instance utilization
    • Auto-scaling events with context
    • Underutilized resource detections
  • Alerts:

    • Daily spend exceeds daily budget threshold
    • Cost increases more than 20% week-over-week without explanation
    • Reserved instance utilization below 70%
    • Storage growth rate exceeds 10% per week
    • A resource has been idle for more than 30 days
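The week-over-week alert can be computed directly from a daily spend series. A sketch (the 20% threshold mirrors the alert rule; `daily_costs` would come from Cost Explorer at daily granularity):

```python
def cost_anomalies(daily_costs, threshold=0.20):
    """daily_costs: chronological list of daily spend in dollars.
    Flags points where the trailing 7-day total rose by more than
    `threshold` versus the prior 7-day window."""
    alerts = []
    for i in range(14, len(daily_costs) + 1):
        prev = sum(daily_costs[i - 14:i - 7])   # prior week
        curr = sum(daily_costs[i - 7:i])        # trailing week
        if prev > 0 and (curr - prev) / prev > threshold:
            alerts.append({
                "day_index": i - 1,
                "increase": (curr - prev) / prev,
            })
    return alerts
```

Comparing week against week smooths out weekday/weekend cycles that would make day-over-day comparisons noisy.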

Security Checklist

  • Cost controls prevent unauthorized resource creation (budget alerts, IAM policies)
  • Spot instance interruption does not expose sensitive data in process memory
  • Lifecycle policies do not delete data before compliance retention period
  • Cost optimization does not disable security monitoring or logging
  • Shared accounts are not used for cost allocation (proper tagging)
  • Reserved instance coverage does not create pressure to keep insecure instances

Common Pitfalls / Anti-Patterns

Right-Sizing Based on Average Utilization

Average CPU at 30% looks like over-provisioning. But if P95 is 80%, the instance is correctly sized. Right-size based on peak, not average.

Maximizing Spot Usage

Spot instances are not suitable for everything. Databases, stateful services, and latency-sensitive operations should stay on on-demand or reserved.

Ignoring Data Transfer Costs

Data transfer is often the second-largest cost category after compute. Do not optimize compute while ignoring egress.

Chasing Small Wins

Reducing waste in development environments saves money. But if production is over-provisioned, fixing one service saves more than eliminating all dev waste.

Optimizing Yesterday’s Architecture

Right-size after understanding current load patterns. But also consider whether the architecture itself needs updating—serverless or managed services might cost less than optimized EC2.

Capacity Estimation and Benchmark Data

Use these numbers for initial capacity planning and cost estimation.

EC2 Spot Price Benchmarks

| Instance Type | On-Demand $/hr | Spot Price Range | Typical Savings |
| ------------- | -------------- | ---------------- | --------------- |
| t3.micro      | $0.0104        | $0.003 - $0.006  | 50-70%          |
| t3.medium     | $0.0208        | $0.006 - $0.012  | 50-70%          |
| m5.large      | $0.096         | $0.029 - $0.058  | 50-70%          |
| m5.xlarge     | $0.192         | $0.058 - $0.115  | 50-70%          |
| c5.large      | $0.085         | $0.026 - $0.051  | 50-70%          |
| r5.large      | $0.126         | $0.038 - $0.076  | 50-70%          |

S3 Storage Cost Benchmarks

| Storage Class        | $/GB/month | GET/POST/DELETE per 1,000 | Egress per GB |
| -------------------- | ---------- | ------------------------- | ------------- |
| Standard             | $0.023     | $0.0004                   | $0.090        |
| IA                   | $0.0125    | $0.001                    | $0.090        |
| Glacier              | $0.004     | $0.05 (retrieval)         | $0.090        |
| Glacier Deep Archive | $0.00099   | $0.10 (retrieval)         | $0.090        |
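The per-GB rates above translate directly into tiering savings. A small estimator using those $/GB-month figures (class keys here are informal labels, not API identifiers):

```python
# $/GB-month, matching the S3 storage cost benchmarks above
S3_RATES = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}

def s3_monthly_cost(gb_by_class):
    """Monthly storage cost for a dict of {class: GB stored}."""
    return sum(S3_RATES[c] * gb for c, gb in gb_by_class.items())

def tiering_savings(total_gb, ia_fraction, glacier_fraction):
    """Monthly savings from moving fractions of Standard data down-tier."""
    all_standard = total_gb * S3_RATES["STANDARD"]
    tiered = s3_monthly_cost({
        "STANDARD": total_gb * (1 - ia_fraction - glacier_fraction),
        "STANDARD_IA": total_gb * ia_fraction,
        "GLACIER": total_gb * glacier_fraction,
    })
    return all_standard - tiered
```

Note that retrieval and transition request charges (omitted here) eat into the savings when tiered data is accessed often.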

RDS Cost Benchmarks

| Instance Class | $/month (single-AZ) | $/month (multi-AZ) |
| -------------- | ------------------- | ------------------ |
| db.t3.micro    | $14.60              | $28.00             |
| db.t3.medium   | $29.20              | $58.40             |
| db.m5.large    | $57.60              | $115.20            |
| db.m5.xlarge   | $115.20             | $230.40            |
| db.r5.large    | $91.20              | $182.40            |

Lambda Cost Benchmarks

| Invocation Pattern                       | Monthly Cost Estimate    |
| ---------------------------------------- | ------------------------ |
| 1M requests, 100ms avg                   | $0.20                    |
| 10M requests, 100ms avg                  | $2.00                    |
| 100M requests, 100ms avg                 | $20.00                   |
| With provisioned concurrency (always-on) | ~$0.015/hour per 128MB   |

Quick Recap

Key Bullets:

  • Right-size based on P95 utilization, not average
  • Reserve predictable baseline; keep on-demand for variability
  • Use spot for fault-tolerant workloads; never for stateful services
  • Tag all resources for cost allocation visibility
  • Automate storage tiering; review monthly

Copy/Paste Checklist:

Monthly Cost Review:
[ ] Review Cost Explorer dashboard
[ ] Identify top 5 cost drivers
[ ] Check reserved instance utilization
[ ] Verify all resources have tags (Team, Environment, Application)
[ ] Review idle resources for cleanup
[ ] Check data transfer costs
[ ] Verify S3 lifecycle policies are working
[ ] Review spot instance allocation
[ ] Update cost allocation report for stakeholders
[ ] Identify one optimization to implement this month

For more on infrastructure topics, see Load Balancing, Geo-Distribution, and Database Scaling.
