Google Spanner: Globally Distributed SQL at Scale

An overview of Google Spanner's architecture: combining the relational model with horizontal scalability, the TrueTime API for global consistency, and the F1 database as production validation.


Google Spanner solves a genuinely hard problem: managing petabytes of data across data centers worldwide while keeping the relational model that SQL developers rely on. It launched around 2007 as an internal project, influenced by what engineers called “Bigtable 2”, and became Cloud Spanner in 2017. The selling point was SQL plus horizontal scalability without throwing away ACID guarantees.


Architecture Overview

Spanner arranges data in layers that blend traditional database concepts with distributed systems patterns.

At the center is the directory: a contiguous set of keys sharing a common prefix, which serves as the unit of data placement. Directories can move between servers and zones dynamically, which sounds complicated but happens transparently to running transactions.

The hierarchy: databases contain tables, tables contain directories, directories contain key-value pairs. Each directory falls under a Paxos group controlling replication.

graph TD
    A[Cloud Spanner Instance] --> B[Database 1]
    A --> C[Database 2]
    B --> D[Table: Users]
    B --> E[Table: Orders]
    D --> F[Directory: User Data]
    E --> G[Directory: Order Data]
    F --> H[Paxos Group - 3+ Replicas]
    G --> I[Paxos Group - 3+ Replicas]
    H --> J[Replica 1 - US East]
    H --> K[Replica 2 - US West]
    H --> L[Replica 3 - Europe]

Spanner stores data in tablets (inherited from Bigtable thinking). Each tablet spans a range of primary keys and is managed by a Paxos group. Paxos ensures all replicas agree before writes are acknowledged.
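The routing step can be sketched as a tiny range-partition lookup. This is purely illustrative; the `TabletRouter` name and split points are invented for the sketch, not Spanner internals:

```python
import bisect

class TabletRouter:
    """Minimal sketch of range partitioning: split points divide the
    primary-key space, and each key routes to exactly one tablet."""

    def __init__(self, split_points: list):
        self.splits = sorted(split_points)  # tablet i owns keys below splits[i]

    def tablet_for(self, key: str) -> int:
        # bisect_right: keys equal to a split point go to the next tablet
        return bisect.bisect_right(self.splits, key)

router = TabletRouter(["g", "n"])
# "apple" routes to tablet 0, "hat" to tablet 1, "zebra" to tablet 2
```

In the real system the split points move as tablets split and merge under load; the lookup itself stays this cheap.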


TrueTime: Global Consistency Through Clock Uncertainty

TrueTime is Spanner’s most distinctive feature. It uses GPS receivers and atomic clocks in each data center to produce globally consistent timestamps.

Here’s the thing: TrueTime does not claim clocks are perfectly synchronized. Instead, it makes clock uncertainty explicit as a bounded interval. Call TT.now() and you get back a TTinterval with earliest and latest possible timestamps:

# TrueTime returns a time interval, not a point
tt = TT.now()
# tt.earliest = timestamp that could be in the past
# tt.latest = timestamp that could be in the future
# The actual time is somewhere within this interval

The uncertainty bound sits around 1-7 milliseconds. Spanner leverages these intervals for external consistency: if transaction T1 commits before T2 starts in real-time, then T1’s timestamp is less than T2’s. That is the guarantee that matters in practice.

This means globally consistent reads without a single time source synchronizing everything. Read at a specific timestamp from anywhere on Earth, and Spanner returns results consistent with that moment.
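The core trick, commit wait, can be simulated in a few lines. Everything here is a toy: `TrueTimeSim` and its epsilon are invented for illustration, while real TrueTime derives its bound from GPS and atomic clocks:

```python
import time

class TrueTimeSim:
    """Simulated TrueTime: returns an (earliest, latest) interval
    rather than a single timestamp."""

    def __init__(self, epsilon_ms: float = 4.0):
        self.epsilon = epsilon_ms / 1000.0  # uncertainty bound in seconds

    def now(self) -> tuple:
        t = time.monotonic()
        return (t - self.epsilon, t + self.epsilon)

def commit_wait(tt: TrueTimeSim) -> float:
    """Pick the latest possible current time as the commit timestamp,
    then wait until that timestamp is certainly in the past before
    acknowledging the commit."""
    s = tt.now()[1]
    while tt.now()[0] <= s:  # uncertainty not yet waited out
        time.sleep(0.0005)
    return s
```

After `commit_wait` returns, every later transaction anywhere on Earth is guaranteed a strictly larger timestamp, which is exactly the external-consistency property described above.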

TrueTime needs specialized hardware, which is why direct competitors are rare. But the conceptual model has influenced other systems to adopt “clock uncertainty” thinking.


Schema and SQL Support

Spanner gives you a relational schema with SQL. Unlike NoSQL systems that offer query capabilities as an afterthought, Spanner includes:

  • Standard SQL with JOINs, aggregations, subqueries
  • Inline primary keys and foreign keys
  • Interleaved tables for parent-child relationships
  • Generated columns and indexes
-- Spanner supports standard SQL with relational features
CREATE TABLE Users (
  UserId STRING(64) NOT NULL,
  Email STRING(256),
  CreatedAt TIMESTAMP,
) PRIMARY KEY (UserId);

CREATE TABLE Orders (
  OrderId STRING(64) NOT NULL,
  UserId STRING(64),
  Total NUMERIC,
  Status STRING(32),
) PRIMARY KEY (OrderId);

-- Foreign key support
ALTER TABLE Orders ADD CONSTRAINT FK_UserOrders
  FOREIGN KEY (UserId) REFERENCES Users(UserId);

-- Create index for common query pattern
CREATE INDEX idx_orders_user_status ON Orders(UserId, Status);

The INTERLEAVE IN clause co-locates related rows physically, cutting the need for distributed joins. When you interleave child rows with their parent, Spanner puts them in the same tablet. Queries run without network round-trips.

Partition Key Design

Choosing a good primary key matters more in Spanner than in a single-node database. Since data is range-partitioned by key, a poorly chosen key creates hotspots that degrade performance.

Avoid sequential keys for high-write tables. Auto-incrementing integers cause all writes to hit a single tablet. UUIDs or hashed keys spread load across nodes:

-- Bad: sequential key creates hotspots
-- (illustrative: Spanner has no AUTO_INCREMENT; a timestamp-based
-- or sequence-based key has the same problem)
CREATE TABLE Events (
    event_id INT64 NOT NULL,
    payload STRING(MAX),
) PRIMARY KEY (event_id);

-- Good: uniformly distributed key (e.g., a UUIDv4)
CREATE TABLE Events (
    event_id STRING(36) NOT NULL,
    payload STRING(MAX),
) PRIMARY KEY (event_id);

For read-heavy tables where you query by user ID or customer ID most often, use that ID as the leading key. For write-heavy tables generating many rows per user, consider a composite key with a suffix to distribute writes.
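One common variant of that composite-key idea uses a hash-derived shard component so a single busy user's writes fan out across several key ranges. A minimal sketch (`sharded_event_key` and the shard count are made up for illustration):

```python
import hashlib
import uuid

def sharded_event_key(user_id: str, num_shards: int = 16) -> tuple:
    """Build a composite key (shard, user_id, event_id).

    The shard component is derived deterministically from the user ID,
    so one hot user's rows spread over num_shards key ranges instead
    of piling onto a single tablet."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    shard = int(digest, 16) % num_shards
    return (shard, user_id, str(uuid.uuid4()))
```

The cost of this pattern is that reading all events for one user now requires querying `num_shards` key ranges, so only use it on genuinely write-heavy tables.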

Interleaved Tables

Interleaved tables model parent-child relationships where child rows are co-located with their parent on the same storage node. This improves read performance when you frequently fetch a parent and its children together.

CREATE TABLE Orders (
    order_id STRING(36) NOT NULL,
    customer_id STRING(36),
    total NUMERIC,
    created_at TIMESTAMP,
) PRIMARY KEY (order_id);

CREATE TABLE OrderItems (
    order_id STRING(36) NOT NULL,
    item_id STRING(36) NOT NULL,
    product_id STRING(36),
    quantity INT64,
) PRIMARY KEY (order_id, item_id),
  INTERLEAVE IN PARENT Orders ON DELETE CASCADE;

With INTERLEAVE IN PARENT Orders ON DELETE CASCADE, deleting an order automatically deletes its items. The real win is that fetching an order with its items stays in a single tablet read, avoiding a distributed query.

Trade-offs: Interleaving shines when you almost always fetch parent and child together. But if you often query children independently, it backfires: all items for a given parent land in the same tablet, which becomes a hotspot for popular parents. Deleting the parent also gets pricier since Spanner has to delete all the children first.
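A quick illustration of why co-location works: interleaved child keys sort immediately after their parent key, so each parent owns one contiguous key range. A pure-Python sketch of that ordering:

```python
# Sketch: child keys are prefixed by the parent key, so sorting the
# keyspace groups every order with its items in one contiguous range.
parent_rows = [("order-1",), ("order-2",)]
child_rows = [
    ("order-2", "item-a"),
    ("order-1", "item-b"),
    ("order-1", "item-a"),
]
layout = sorted(parent_rows + child_rows)
# layout: ('order-1',), ('order-1', 'item-a'), ('order-1', 'item-b'),
#         ('order-2',), ('order-2', 'item-a')
```

Because each parent's range is contiguous, a single range scan retrieves the order and all its items, which is why the fetch stays inside one tablet.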


Transaction Handling

Spanner handles two transaction types: read-write and read-only.

Read-write transactions that span multiple Paxos groups run two-phase commit (2PC) layered on top of Paxos. Participants prepare first, promising to commit if asked. Then the coordinator tells them to commit or abort. This gives atomicity across distributed nodes, but adds latency.

Read-only transactions skip locks. They take a timestamp bound and read from consistent snapshots. No blocking, no contention.
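The lock-free snapshot-read idea can be sketched with a toy multi-versioned store. `VersionedStore` is invented for illustration; Spanner's actual storage engine differs, but the principle is the same:

```python
class VersionedStore:
    """Toy MVCC store: every write is timestamped, and a snapshot read
    at timestamp T sees the newest version at or before T, with no
    locks and no interference from concurrent writers."""

    def __init__(self):
        self.versions = {}  # key -> sorted list of (timestamp, value)

    def write(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))
        self.versions[key].sort()

    def snapshot_read(self, key, ts):
        result = None
        for version_ts, value in self.versions.get(key, []):
            if version_ts <= ts:
                result = value  # newest version visible at ts so far
            else:
                break
        return result
```

Because a snapshot read never waits on writers, a read-only transaction can scan millions of rows while read-write traffic continues untouched.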

Spanner also offers Partitioned DML for large-scale updates and deletes, which splits a single statement into pieces that execute independently across the keyspace for better throughput.

sequenceDiagram
    participant C as Coordinator
    participant P1 as Participant 1
    participant P2 as Participant 2

    Note over C,P2: Two-Phase Commit Protocol

    C->>P1: Prepare
    C->>P2: Prepare
    P1-->>C: Promise to Commit
    P2-->>C: Promise to Commit

    Note over C: Coordinator decides COMMIT

    C->>P1: Commit
    C->>P2: Commit
    P1-->>C: Committed
    P2-->>C: Committed
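The prepare/commit flow above can be simulated in a few lines. This is a sketch only: it ignores Paxos replication, timeouts, and coordinator failure:

```python
class Participant:
    """A 2PC participant: prepare() is a promise to commit if asked."""

    def __init__(self, will_commit: bool = True):
        self.will_commit = will_commit
        self.state = "idle"

    def prepare(self) -> bool:
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def finish(self, commit: bool):
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants) -> bool:
    """Coordinator logic: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    for p in participants:
        p.finish(decision)
    return decision
```

In Spanner, each "participant" is itself a Paxos group, which is what lets a prepared participant survive individual machine failures while it waits for the coordinator's decision.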

Replication and Consistency

Spanner uses Paxos-based replication with configurable replica counts (typically 3-5), replica placement across regions for disaster recovery, and witness replicas that participate in Paxos without storing data.

The Paxos leader handles all reads and writes, but read leases let followers serve reads locally for a period. This cuts latency for read-heavy workloads.

By default, Spanner provides external consistency, stronger than serializability. If T1 commits before T2 starts in real-time, T1 happens before T2. Most applications need exactly this guarantee.


Direct Connections and Performance

Google recommends connecting clients directly to the service rather than through a proxy layer. Direct connections use gRPC and are faster, especially for apps running in Google Cloud.

Rough numbers:

  • Write latency: 10-50ms depending on distance and replica setup
  • Read latency: 1-10ms for local replica reads
  • Throughput scales horizontally as you add nodes

Each node handles about 10,000 QPS reads or 2,000 QPS writes. Your mileage varies with workload patterns.
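For rough capacity planning from those per-node figures, a back-of-the-envelope helper can be useful. The assumption here, that read and write load add linearly, is a simplification; real sizing needs benchmarking against your workload:

```python
import math

def nodes_needed(read_qps: float, write_qps: float,
                 reads_per_node: float = 10_000,
                 writes_per_node: float = 2_000) -> int:
    """Estimate node count by treating read and write load as
    additive fractions of one node's capacity."""
    load = read_qps / reads_per_node + write_qps / writes_per_node
    return max(1, math.ceil(load))

# Example: 25k reads/s plus 3k writes/s -> 2.5 + 1.5 = 4.0 -> 4 nodes
```

Add headroom on top of this estimate: running nodes near 100% utilization degrades tail latency well before throughput caps out.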


Use Cases and Trade-offs

Spanner works well for:

  • Globally distributed databases needing strong consistency
  • Financial systems requiring ACID transactions across regions
  • Multi-tenant SaaS needing isolation and scalability
  • Inventory or order management with consistent cross-region updates

Shortcomings:

  • Cost: Not cheap compared to self-managed databases
  • TrueTime dependency: Needs Google’s hardware for full benefit
  • Latency floor: Even local reads have 1-2ms minimum
  • Query optimization: Less mature than PostgreSQL or MySQL

When to Use Spanner vs When NOT To

Choose Spanner When

  • You need global distribution with writes happening in multiple geographic regions simultaneously
  • Your application requires strict serializability or external consistency across regions
  • You want horizontal scalability without abandoning SQL and relational modeling
  • You need managed operations with automatic sharding, replication, and failover
  • Compliance requires data locality controls with regional redundancy
  • Your team lacks database operations expertise and needs a hands-off operational model

Avoid Spanner When

  • Your workload is single-region and does not need global distribution
  • You are cost-sensitive and can self-manage PostgreSQL/MySQL with read replicas
  • You have high-frequency short transactions where the latency floor matters (e.g., real-time bidding)
  • You need advanced query optimization features found in mature relational databases
  • Your data fits on a single node or requires minimal horizontal scaling
  • You require open-source flexibility without vendor lock-in

Decision Framework

flowchart TD
    A[Need Horizontal Scale?] --> |Yes| B[Need Global Distribution?]
    A --> |No| E[Consider PostgreSQL or MySQL]
    B --> |Yes| C[Need Strict Consistency?]
    B --> |No| F[Consider DynamoDB or Cassandra]
    C --> |Yes| D[Can Afford Premium Pricing?]
    C --> |No| F
    D --> |Yes| G[Spanner is a Strong Fit]
    D --> |No| H[Consider CockroachDB or Aurora Global]

Spanner vs CockroachDB vs Aurora: Trade-off Comparison

Dimension           | Spanner                         | CockroachDB                 | Aurora
--------------------|---------------------------------|-----------------------------|------------------------------
Deployment          | Fully managed cloud             | Self-hosted or cloud        | AWS managed
Global Distribution | Native multi-region             | Native multi-region         | Read replicas, Aurora Global
Consistency Model   | External consistency (TrueTime) | Serializability (HLC)       | Session + eventual
SQL Support         | Full standard SQL               | PostgreSQL-compatible       | MySQL/PostgreSQL compatible
Horizontal Scaling  | Automatic sharding              | Automatic sharding          | Storage auto-scaling
Write Latency       | 10-50ms cross-region            | 20-100ms cross-region       | 5-20ms single region
Cost Model          | Pay per node/hour               | Open-source + cloud         | Pay per usage
TrueTime Hardware   | Yes (GPS + atomic clocks)       | No (HLC instead)            | No
Schema Changes      | Online with limitations         | Online with DDL dance       | Online with Aurora DDL
Backup/Restore      | Built-in, cross-region          | Full backup + point-in-time | Continuous backup to S3

Bottom line: Spanner delivers the strongest consistency guarantees with TrueTime but at a premium price. CockroachDB offers similar semantics with an open-source option. Aurora provides cost-effective single-region operation with global read scaling.


F1 Database: Production Validation

Google’s F1 database is the production workload that validates Spanner’s architecture at scale. F1 originally ran on Bigtable, but the team migrated it to Spanner to handle Google’s ad business that required global distribution with strong consistency. This migration proved Spanner could handle a real, mission-critical workload with:

  • Thousands of tables and complex joins
  • High-volume OLTP traffic alongside analytical queries
  • Strict SLAs for latency and availability

The F1 case study demonstrates Spanner is not just a research prototype but production-hardened technology.


Production Failure Scenarios

  • Zone outage (single AZ failure). Impact: Spanner keeps serving from the remaining zones. Mitigation: multi-zone deployments replicate synchronously across zones, so losing one zone loses no committed data.
  • Leader election during high write load. Impact: short unavailability or latency spike while a new Paxos leader takes over. Mitigation: leader leases bound the outage window; stale reads from follower replicas can continue in the meantime.
  • TrueTime hardware degradation. Impact: the uncertainty interval grows, forcing longer commit waits and higher write latency. Mitigation: monitor clock synchronization closely; replace failing time sources quickly.
  • Schema change during heavy load. Impact: online schema changes can take a long time on very large tables. Mitigation: plan schema changes for low-traffic periods; be careful with ALTER TABLE on big datasets.
  • Cross-region replication lag. Impact: non-local replicas see higher read latency. Mitigation: route traffic to regional endpoints; globally consistent reads inherently pay cross-region latency.
  • Network partition. Impact: an isolated region cannot reach a replica majority. Mitigation: the majority partition stays available for reads and writes; isolated replicas can still serve stale snapshot reads, but not strong reads or writes.
  • Transaction conflict rate spike. Impact: more retries and latency when concurrent transactions touch overlapping key ranges. Mitigation: prefer read-only transactions where feasible; design the key schema to reduce write contention.

Operational Guidance

Backup and Restore

Spanner provides built-in backup tooling. On-demand backups are available through the Cloud Console or the gcloud CLI, and point-in-time recovery lets you restore to any moment within the configured version-retention window (up to 7 days).

Cross-region copies need a manual setup: export Spanner data via Dataflow to Avro files in Cloud Storage, then copy the files to your target region. Budget engineering time for this if you have strict data residency requirements.

Schema Changes

Spanner supports online schema changes via ALTER TABLE statements. There is no downtime for schema changes, but they go through an internal schema upgrade process that can take time on very large tables.

# Online schema change with ALTER TABLE
operation = client.update_database_ddl(
    database_id,
    ['ALTER TABLE Users ADD COLUMN last_login TIMESTAMP']
)
operation.result()  # Waits for completion

Schema change best practices:

  • Add one column at a time rather than batching multiple changes
  • Avoid large DEFAULT values that increase write amplification
  • Use NOT NULL on new columns only when truly required; adding a NOT NULL column to an existing table also requires a DEFAULT value
  • Schedule large schema changes during low-traffic windows when possible

Import and Export

Cloud Dataflow handles bulk import and export. Spanner's export templates write Avro (or CSV) files to Cloud Storage. For imports, point Dataflow at your Cloud Storage files and let it perform the distributed write.

# Cloud Dataflow export job configuration (pseudocode)
export_job = {
    "target_project": "my-project",
    "target_instance": "my-instance",
    "target_database": "my-db",
    "output_prefix": "gs://my-bucket/spanner-export/",
    "output_format": "AVRO",  # or CSV
    "batch_size": 10000
}

For small datasets, the Cloud Console provides a no-code import wizard. For production workloads, always use Dataflow with appropriate batch sizing.


Cost Modeling

Spanner pricing has three main components: storage, compute, and networking egress.

Storage

Storage is charged at approximately $0.17 per GB per month for regional storage. Multi-region storage costs more, around $0.25 per GB per month. Storage includes all data, indexes, and backup overhead.

Compute

Compute is priced per vCPU-hour. A regional configuration costs roughly $0.045 per vCPU-hour. Multi-region configurations (necessary for global write distribution) cost approximately $0.09 per vCPU-hour, roughly double the regional rate.

Networking Egress

Networking charges apply when data leaves Google Cloud infrastructure. Cross-region writes and reads within Spanner are generally cheap, but egress to the public internet or to other cloud providers adds up quickly.

Cost Comparison Example

Configuration         | Storage (1 TB) | Compute (16 vCPU, 730 h) | Estimated Monthly Cost
----------------------|----------------|--------------------------|-----------------------
Single-region         | ~$174/month    | ~$526/month              | ~$700/month
Multi-region (US)     | ~$205/month    | ~$818/month              | ~$1,022/month
Multi-region (global) | ~$256/month    | ~$1,051/month            | ~$1,307/month

These numbers assume 24/7 operation at full compute capacity. Actual costs vary with utilization.

Cost Estimation Pseudocode

def estimate_monthly_spanner_cost(
    storage_gb: float,
    vcpus: int,
    region_type: str = "regional",  # "regional" | "multi-region-us" | "multi-region-global"
    egress_gb: float = 0
) -> dict:
    # Storage pricing per GB/month
    storage_prices = {
        "regional": 0.17,
        "multi-region-us": 0.20,
        "multi-region-global": 0.25
    }
    # Compute pricing per vCPU-hour
    compute_prices = {
        "regional": 0.045,
        "multi-region-us": 0.07,
        "multi-region-global": 0.09
    }
    # Egress pricing per GB (approximate)
    egress_rate = 0.12

    hours_per_month = 730  # Average month

    storage_cost = storage_gb * storage_prices[region_type]
    compute_cost = vcpus * compute_prices[region_type] * hours_per_month
    egress_cost = egress_gb * egress_rate

    return {
        "storage": round(storage_cost, 2),
        "compute": round(compute_cost, 2),
        "egress": round(egress_cost, 2),
        "total": round(storage_cost + compute_cost + egress_cost, 2)
    }

# Example: 1TB storage, 16 vCPU, multi-region US, 100GB egress
cost = estimate_monthly_spanner_cost(
    storage_gb=1024,
    vcpus=16,
    region_type="multi-region-us",
    egress_gb=100
)
# cost["total"] ~ $1,034/month (storage 204.80 + compute 817.60 + egress 12.00)

Plug your numbers into the Google Cloud Pricing Calculator for real estimates. One practical note: Spanner costs scale linearly with nodes, so you can start small for dev/test environments and scale up when you go to production.


Summary

Spanner combines SQL with horizontal scalability and strong consistency. It shows you do not have to throw away ACID guarantees to get NoSQL-like scaling. TrueTime hardware is the main barrier to competition, but the ideas have shaped how engineers think about distributed consistency.

For globally distributed apps needing strong consistency, Spanner is worth the cost. For others, studying its architecture teaches you the real trade-offs in distributed database design.
