Google Spanner: Globally Distributed SQL at Scale
Google Spanner solves a genuinely hard problem: managing petabytes of data across data centers worldwide while keeping the relational model that SQL developers rely on. It launched around 2007 as an internal project, influenced by what engineers called “Bigtable 2”, and became Cloud Spanner in 2017. The selling point was SQL plus horizontal scalability without throwing away ACID guarantees.
Architecture Overview
Spanner arranges data in layers that blend traditional database concepts with distributed systems patterns.
At the center is the directory: a bucket of contiguous keys that share a common prefix, serving as the unit of data placement. Directories can move between servers and zones dynamically, which sounds complicated but happens transparently to running transactions.
The hierarchy: databases contain tables, tables contain directories, directories contain key-value pairs. Each directory falls under a Paxos group controlling replication.
graph TD
A[Cloud Spanner Instance] --> B[Database 1]
A --> C[Database 2]
B --> D[Table: Users]
B --> E[Table: Orders]
D --> F[Directory: User Data]
E --> G[Directory: Order Data]
F --> H[Paxos Group - 3+ Replicas]
G --> I[Paxos Group - 3+ Replicas]
H --> J[Replica 1 - US East]
H --> K[Replica 2 - US West]
H --> L[Replica 3 - Europe]
Spanner stores data in tablets (inherited from Bigtable thinking). Each tablet spans a range of primary keys and is managed by a Paxos group. Paxos ensures all replicas agree before writes are acknowledged.
TrueTime: Global Consistency Through Clock Uncertainty
TrueTime is Spanner’s most distinctive feature. It uses GPS receivers and atomic clocks in each data center to produce globally consistent timestamps.
Here’s the thing: TrueTime does not claim clocks are perfectly synchronized. Instead, it makes clock uncertainty explicit as a bounded interval. Call TT.now() and you get back a TTinterval with earliest and latest possible timestamps:
# TrueTime returns a time interval, not a point
tt = TT.now()
# tt.earliest = timestamp that could be in the past
# tt.latest = timestamp that could be in the future
# The actual time is somewhere within this interval
The uncertainty bound typically sits around 1-7 milliseconds. Spanner leverages these intervals for external consistency: if transaction T1 commits before T2 starts in real time, then T1's timestamp is less than T2's. That is the guarantee that matters in practice.
This means globally consistent reads without a single time source synchronizing everything. Read at a specific timestamp from anywhere on Earth, and Spanner returns results consistent with that moment.
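The commit-wait rule behind this guarantee can be sketched in a few lines of Python. This is a toy model under a big assumption: a fixed uncertainty bound (`EPSILON_MS`), where real TrueTime derives the bound from GPS and atomic-clock hardware and it varies over time.

```python
import time

EPSILON_MS = 7  # assumed fixed clock uncertainty, in milliseconds


def tt_now():
    """Return (earliest, latest) bounds on the current time, in ms since the epoch."""
    now_ms = time.time() * 1000
    return (now_ms - EPSILON_MS, now_ms + EPSILON_MS)


def commit_wait(commit_ts_ms):
    """Block until commit_ts_ms is definitely in the past on every clock.

    Once this returns, any transaction starting anywhere will observe
    tt_now()[0] > commit_ts_ms, so its timestamp must be larger -- which
    is exactly the external-consistency ordering described above."""
    while tt_now()[0] <= commit_ts_ms:
        time.sleep(0.001)


# Pick a commit timestamp that cannot be in the future ...
_, commit_ts = tt_now()
# ... then wait out the uncertainty before acknowledging the commit.
commit_wait(commit_ts)
```

The wait costs up to roughly twice the uncertainty bound per commit, which is why shrinking that bound with dedicated hardware matters so much.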
TrueTime needs specialized hardware, which is why direct competitors are rare. But the conceptual model has influenced other systems to adopt “clock uncertainty” thinking.
Schema and SQL Support
Spanner gives you a relational schema with SQL. Unlike NoSQL systems that offer query capabilities as an afterthought, Spanner includes:
- Standard SQL with JOINs, aggregations, subqueries
- Inline primary keys and foreign keys
- Interleaved tables for parent-child relationships
- Generated columns and indexes
-- Spanner supports standard SQL with relational features
CREATE TABLE Users (
  UserId STRING(64) NOT NULL,
  Email STRING(256),
  CreatedAt TIMESTAMP,
) PRIMARY KEY (UserId);

CREATE TABLE Orders (
  OrderId STRING(64) NOT NULL,
  UserId STRING(64),
  Total NUMERIC,
  Status STRING(32),
) PRIMARY KEY (OrderId);

-- Foreign key support
ALTER TABLE Orders ADD CONSTRAINT FK_UserOrders
  FOREIGN KEY (UserId) REFERENCES Users (UserId);

-- Create index for a common query pattern
CREATE INDEX idx_orders_user_status ON Orders (UserId, Status);
The INTERLEAVE IN clause co-locates related rows physically, cutting the need for distributed joins. When you interleave child rows with their parent, Spanner puts them in the same tablet. Queries run without network round-trips.
Partition Key Design
Choosing a good primary key matters more in Spanner than in a single-node database. Since data is range-partitioned by key, a poorly chosen key creates hotspots that degrade performance.
Avoid sequential keys for high-write tables. Auto-incrementing integers cause all writes to hit a single tablet. UUIDs or hashed keys spread load across nodes:
-- Bad: sequential key concentrates all writes on one tablet
CREATE TABLE Events (
  event_id INT64 NOT NULL,  -- monotonically increasing value
  payload STRING(MAX),
) PRIMARY KEY (event_id);

-- Good: uniformly distributed key (e.g., a UUIDv4)
CREATE TABLE Events (
  event_id STRING(36) NOT NULL,
  payload STRING(MAX),
) PRIMARY KEY (event_id);
For read-heavy tables where you query by user ID or customer ID most often, use that ID as the leading key. For write-heavy tables generating many rows per user, consider a composite key with a suffix to distribute writes.
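One common way to build such a composite key is to derive a deterministic shard component from a hash of the natural key. The helper below is a hypothetical sketch, not a Spanner API: `NUM_SHARDS` and the key layout are assumptions you would tune to your write rate and node count.

```python
import hashlib
import uuid

NUM_SHARDS = 16  # assumption: pick based on expected write throughput


def event_key(user_id: str) -> tuple:
    """Composite key sketch for a write-heavy per-user event table.

    The hash-derived shard component spreads one user's heavy write
    stream across NUM_SHARDS key ranges, and the UUID suffix keeps rows
    within a shard from piling up at the tail of the range."""
    shard = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % NUM_SHARDS
    return (shard, user_id, str(uuid.uuid4()))
```

Queries for a single user then fan out over at most `NUM_SHARDS` ranges, a read cost you accept in exchange for hotspot-free writes.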
Interleaved Tables
Interleaved tables model parent-child relationships where child rows are co-located with their parent on the same storage node. This improves read performance when you frequently fetch a parent and its children together.
CREATE TABLE Orders (
  order_id STRING(36) NOT NULL,
  customer_id STRING(36),
  total NUMERIC,
  created_at TIMESTAMP,
) PRIMARY KEY (order_id);

CREATE TABLE OrderItems (
  order_id STRING(36) NOT NULL,
  item_id STRING(36) NOT NULL,
  product_id STRING(36),
  quantity INT64,
) PRIMARY KEY (order_id, item_id),
  INTERLEAVE IN PARENT Orders ON DELETE CASCADE;
With INTERLEAVE IN PARENT Orders ON DELETE CASCADE, deleting an order automatically deletes its items. The real win is that fetching an order with its items stays in a single tablet read, avoiding a distributed query.
Trade-offs: Interleaving shines when you almost always fetch parent and child together. But if you often query children independently, it backfires: all items for a given parent land in the same tablet, which becomes a hotspot for popular parents. Deleting the parent also gets pricier since Spanner has to delete all the children first.
Transaction Handling
Spanner handles two transaction types: read-write and read-only.
Read-write transactions run two-phase commit (2PC) over Paxos. Nodes prepare first, promising to commit if asked. Then the coordinator tells them to commit or abort. This gives atomicity across distributed nodes, but adds latency.
Read-only transactions skip locks. They take a timestamp bound and read from consistent snapshots. No blocking, no contention.
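The mechanics behind lock-free reads can be illustrated with a toy multi-version store: every write carries a timestamp, and a read at timestamp T returns the newest version at or before T, so readers never block writers. This is a simplification; Spanner's actual storage and timestamp management are far richer.

```python
class VersionedStore:
    """Toy multi-version (MVCC) key-value store."""

    def __init__(self):
        # key -> list of (commit_ts, value), in write order
        self.versions = {}

    def write(self, key, value, commit_ts):
        # Writes never overwrite; they append a new version.
        self.versions.setdefault(key, []).append((commit_ts, value))

    def read_at(self, key, ts):
        # Snapshot read: newest version stamped at or before ts.
        latest = None
        for commit_ts, value in self.versions.get(key, []):
            if commit_ts <= ts and (latest is None or commit_ts >= latest[0]):
                latest = (commit_ts, value)
        return latest[1] if latest else None


store = VersionedStore()
store.write("balance", 100, commit_ts=10)
store.write("balance", 80, commit_ts=20)
store.read_at("balance", ts=15)  # -> 100, the snapshot as of ts=15
store.read_at("balance", ts=25)  # -> 80, the latest committed value
```

Two readers at the same timestamp always see the same snapshot, no matter which replica serves them, which is what makes consistent global reads possible.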
Spanner also supports Partitioned DML for large-scale updates and deletes, splitting a single statement into smaller transactions executed across servers for better throughput.
sequenceDiagram
participant C as Coordinator
participant P1 as Participant 1
participant P2 as Participant 2
Note over C,P2: Two-Phase Commit Protocol
C->>P1: Prepare
C->>P2: Prepare
P1-->>C: Promise to Commit
P2-->>C: Promise to Commit
Note over C: Coordinator decides COMMIT
C->>P1: Commit
C->>P2: Commit
P1-->>C: Committed
P2-->>C: Committed
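The protocol in the diagram reduces to a simple decision rule. The sketch below is a minimal model: real Spanner layers 2PC over Paxos groups and durably logs every prepare and decision, none of which is shown here.

```python
class Participant:
    """Toy 2PC participant: votes in phase 1, applies the decision in phase 2."""

    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: promise to commit if asked (or refuse).
        self.state = "prepared" if self.can_commit else "aborting"
        return self.can_commit

    def finish(self, decision):
        # Phase 2: apply the coordinator's decision.
        self.state = decision


def two_phase_commit(participants):
    # Phase 1: collect promises; a single "no" vote aborts everyone.
    votes = [p.prepare() for p in participants]
    decision = "committed" if all(votes) else "aborted"
    # Phase 2: broadcast the unanimous decision.
    for p in participants:
        p.finish(decision)
    return decision


two_phase_commit([Participant(), Participant()])       # -> "committed"
two_phase_commit([Participant(), Participant(False)])  # -> "aborted"
```

The latency cost is visible in the structure: two full rounds of messages before anyone can acknowledge the commit.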
Replication and Consistency
Spanner uses Paxos-based replication with configurable replica counts (typically 3-5), replica placement across regions for disaster recovery, and witness replicas that participate in Paxos without storing data.
The Paxos leader handles all reads and writes, but read leases let followers serve reads locally for a period. This cuts latency for read-heavy workloads.
By default, Spanner provides external consistency, which is stronger than serializability. If T1 commits before T2 starts in real time, T1 happens before T2. Most applications need exactly this guarantee.
Direct Connections and Performance
Google recommends direct gRPC connections over proxy-based access. Direct connections are faster, especially for applications running inside Google Cloud.
Rough numbers:
- Write latency: 10-50ms depending on distance and replica setup
- Read latency: 1-10ms for local replica reads
- Throughput scales horizontally as you add nodes
Each node handles about 10,000 QPS reads or 2,000 QPS writes. Your mileage varies with workload patterns.
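Those per-node figures yield a back-of-the-envelope sizing rule. This is a sketch of the arithmetic only; real sizing should also budget CPU headroom and account for hotspots and row sizes.

```python
import math

# Rule-of-thumb per-node capacity from the guidance above
READS_PER_NODE = 10_000
WRITES_PER_NODE = 2_000


def nodes_needed(read_qps: int, write_qps: int) -> int:
    """Size for whichever dimension is the bottleneck, with a 1-node floor."""
    return max(
        math.ceil(read_qps / READS_PER_NODE),
        math.ceil(write_qps / WRITES_PER_NODE),
        1,
    )


nodes_needed(25_000, 3_000)  # -> 3 (reads are the bottleneck here)
```

In practice you would also multiply by a utilization target (Google advises keeping CPU well below 100%), so treat this as a lower bound.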
Use Cases and Trade-offs
Spanner works well for:
- Globally distributed databases needing strong consistency
- Financial systems requiring ACID transactions across regions
- Multi-tenant SaaS needing isolation and scalability
- Inventory or order management with consistent cross-region updates
Shortcomings:
- Cost: Not cheap compared to self-managed databases
- TrueTime dependency: Needs Google’s hardware for full benefit
- Latency floor: Even local reads have 1-2ms minimum
- Query optimization: Less mature than PostgreSQL or MySQL
When to Use Spanner vs When NOT To
Choose Spanner When
- You need global distribution with writes happening in multiple geographic regions simultaneously
- Your application requires strict serializability or external consistency across regions
- You want horizontal scalability without abandoning SQL and relational modeling
- You need managed operations with automatic sharding, replication, and failover
- Compliance requires data locality controls with regional redundancy
- Your team lacks database operations expertise and needs a hands-off operational model
Avoid Spanner When
- Your workload is single-region and does not need global distribution
- You are cost-sensitive and can self-manage PostgreSQL/MySQL with read replicas
- You have high-frequency short transactions where the latency floor matters (e.g., real-time bidding)
- You need advanced query optimization features found in mature relational databases
- Your data fits on a single node or requires minimal horizontal scaling
- You require open-source flexibility without vendor lock-in
Decision Framework
flowchart TD
A[Need Horizontal Scale?] --> |Yes| B[Need Global Distribution?]
A --> |No| E[Consider PostgreSQL or MySQL]
B --> |Yes| C[Need Strict Consistency?]
B --> |No| F[Consider DynamoDB or Cassandra]
C --> |Yes| D[Can Afford Premium Pricing?]
C --> |No| F
D --> |Yes| G[Spanner is a Strong Fit]
D --> |No| H[Consider CockroachDB or Aurora Global]
Spanner vs CockroachDB vs Aurora: Trade-off Comparison
| Dimension | Spanner | CockroachDB | Aurora |
|---|---|---|---|
| Deployment | Fully managed cloud | Self-hosted or cloud | AWS managed |
| Global Distribution | Native multi-region | Native multi-region | Read replicas, Aurora Global |
| Consistency Model | External consistency (TrueTime) | Serializability (HLC) | Session + eventual |
| SQL Support | Full standard SQL | PostgreSQL-compatible | MySQL/PostgreSQL compatible |
| Horizontal Scaling | Automatic sharding | Automatic sharding | Storage auto-scaling |
| Write Latency | 10-50ms cross-region | 20-100ms cross-region | 5-20ms single region |
| Cost Model | Pay per node/hour | Open-source + cloud | Pay per usage |
| TrueTime Hardware | Yes (GPS + atomic clocks) | No (HLC instead) | No |
| Schema Changes | Online with limitations | Online with DDL dance | Online with Aurora DDL |
| Backup/Restore | Built-in, cross-region | Full backup + point-in-time | Continuous backup to S3 |
Bottom line: Spanner delivers the strongest consistency guarantees with TrueTime but at a premium price. CockroachDB offers similar semantics with an open-source option. Aurora provides cost-effective single-region operation with global read scaling.
F1 Database: Production Validation
Google’s F1 database is the production workload that validates Spanner’s architecture at scale. F1 originally ran on Bigtable, but the team migrated it to Spanner to handle Google’s ad business that required global distribution with strong consistency. This migration proved Spanner could handle a real, mission-critical workload with:
- Thousands of tables and complex joins
- High-volume OLTP traffic alongside analytical queries
- Strict SLAs for latency and availability
The F1 case study demonstrates Spanner is not just a research prototype but production-hardened technology.
Related Concepts
- CAP Theorem explains why consistency across partitions is hard
- Consistency Models covers external consistency and snapshot isolation
- Database Replication discusses Spanner’s replication strategies
- Distributed Transactions explains two-phase commit
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Zone outage (single AZ failure) | Spanner keeps serving from other zones; TrueTime bounds how stale reads can get | Multi-zone deployment; Spanner replicates synchronously across zones |
| Leader election during high write load | Short unavailability or latency spike while a new leader takes over | Paxos leader leases bound the failover window; stale reads can still be served by follower replicas in the meantime |
| TrueTime hardware failure | Clock uncertainty grows until the hardware is replaced, lengthening commit waits and write latency | Monitor clock synchronization closely; replace failed TrueTime hardware quickly when errors surface |
| Schema change during heavy load | Online schema changes hit limits on very large tables | Plan schema changes for low-traffic periods; be careful with ALTER TABLE on big datasets |
| Cross-region replication lag | Non-local replicas see higher read latency | Route read/write traffic to regional endpoints; globally consistent reads inherently need cross-region latency |
| Network partition | Isolated region cannot reach the replica majority | Spanner stays available in the majority partition; the isolated minority cannot commit writes and can only serve stale reads until the partition heals |
| Transaction conflict rate spike | More retries and latency when concurrent transactions touch overlapping key ranges | Prefer read-only transactions where feasible; structure your key schema to reduce write contention |
Operational Guidance
Backup and Restore
Spanner backups are managed by the service. On-demand backups are available through the Cloud Console or the gcloud CLI (gcloud spanner backups create). Point-in-time recovery lets you restore to any moment within the database's version retention period, which is configurable up to 7 days.
Cross-region backups need a manual setup: export Spanner data to Avro or Parquet in Cloud Storage, then copy files to your target region. Budget engineering time for this if you have strict data residency requirements.
Schema Changes
Spanner supports online schema changes via ALTER TABLE statements. There is no downtime for schema changes, but they go through an internal schema upgrade process that can take time on very large tables.
# Online schema change with ALTER TABLE (Python client)
operation = database.update_ddl(
    ["ALTER TABLE Users ADD COLUMN last_login TIMESTAMP"]
)
operation.result()  # Blocks until the schema change completes
Schema change best practices:
- Add one column at a time rather than batching multiple changes
- Avoid large DEFAULT values that increase write amplification
- Use NOT NULL only when truly required, since it forces a full table rewrite
- Schedule large schema changes during low-traffic windows when possible
Import and Export
Cloud Dataflow handles bulk import and export. Spanner exports to Avro or Parquet in Cloud Storage. For imports, point Dataflow at your Cloud Storage files and let Spanner do the distributed write.
# Cloud Dataflow export job configuration (pseudocode)
export_job = {
"target_project": "my-project",
"target_instance": "my-instance",
"target_database": "my-db",
"output_prefix": "gs://my-bucket/spanner-export/",
"output_format": "AVRO", # or PARQUET
"batch_size": 10000
}
For small datasets, the Cloud Console provides a no-code import wizard. For production workloads, always use Dataflow with appropriate batch sizing.
Cost Modeling
Spanner pricing has three main components: storage, compute, and networking egress.
Storage
Storage is charged at approximately $0.17 per GB per month for regional storage. Multi-region storage costs more, around $0.25 per GB per month. Storage includes all data, indexes, and backup overhead.
Compute
Compute is priced per vCPU-hour. A regional configuration costs roughly $0.045 per vCPU-hour. Multi-region configurations (necessary for global write distribution) cost approximately $0.09 per vCPU-hour, roughly double the regional rate.
Networking Egress
Networking charges apply when data leaves Google Cloud infrastructure. Cross-region writes and reads within Spanner are generally cheap, but egress to the public internet or to other cloud providers adds up quickly.
Cost Comparison Example
| Configuration | Storage (1TB) | Compute (16 vCPU) | Estimated Monthly Cost |
|---|---|---|---|
| Single-region | ~$174/month | ~$526/month | ~$700/month |
| Multi-region (US) | ~$205/month | ~$818/month | ~$1,023/month |
| Multi-region (global) | ~$256/month | ~$1,051/month | ~$1,307/month |
These numbers assume 24/7 operation at full compute capacity. Actual costs vary with utilization.
Cost Estimation Pseudocode
def estimate_monthly_spanner_cost(
    storage_gb: float,
    vcpus: int,
    region_type: str = "regional",  # "regional" | "multi-region-us" | "multi-region-global"
    egress_gb: float = 0,
) -> dict:
    # Storage pricing per GB/month
    storage_prices = {
        "regional": 0.17,
        "multi-region-us": 0.20,
        "multi-region-global": 0.25,
    }
    # Compute pricing per vCPU-hour
    compute_prices = {
        "regional": 0.045,
        "multi-region-us": 0.07,
        "multi-region-global": 0.09,
    }
    # Egress pricing per GB (approximate)
    egress_rate = 0.12
    hours_per_month = 730  # Average month

    storage_cost = storage_gb * storage_prices[region_type]
    compute_cost = vcpus * compute_prices[region_type] * hours_per_month
    egress_cost = egress_gb * egress_rate
    return {
        "storage": round(storage_cost, 2),
        "compute": round(compute_cost, 2),
        "egress": round(egress_cost, 2),
        "total": round(storage_cost + compute_cost + egress_cost, 2),
    }

# Example: 1TB storage, 16 vCPU, multi-region US, 100GB egress
cost = estimate_monthly_spanner_cost(
    storage_gb=1024,
    vcpus=16,
    region_type="multi-region-us",
    egress_gb=100,
)
# cost["total"] ~ $1,034/month
Plug your numbers into the Google Cloud Pricing Calculator for real estimates. One practical note: Spanner costs scale linearly with nodes, so you can start small for dev/test environments and scale up when you go to production.
Summary
Spanner combines SQL with horizontal scalability and strong consistency. It shows you do not have to throw away ACID guarantees to get NoSQL-like scaling. TrueTime hardware is the main barrier to competition, but the ideas have shaped how engineers think about distributed consistency.
For globally distributed apps needing strong consistency, Spanner is worth the cost. For others, studying its architecture teaches you the real trade-offs in distributed database design.