Google Spanner: Globally Distributed SQL at Scale
Google Spanner architecture combining relational model with horizontal scalability, TrueTime API for global consistency, and F1 database implementation.
Google Spanner: Globally Distributed SQL at Scale
Google Spanner solves a genuinely hard problem: managing petabytes of data across data centers worldwide while keeping the relational model that SQL developers rely on. It launched around 2007 as an internal project, influenced by what engineers called “Bigtable 2”, and became Cloud Spanner in 2017. The selling point was SQL plus horizontal scalability without throwing away ACID guarantees.
Architecture Overview
Spanner arranges data in layers that blend traditional database concepts with distributed systems patterns.
At the center is the directory, a B-tree-like structure that serves as the unit of data placement. Directories can move between servers and zones dynamically, which sounds complicated but happens transparently to running transactions.
The hierarchy: databases contain tables, tables contain directories, directories contain key-value pairs. Each directory falls under a Paxos group controlling replication.
graph TD
A[Cloud Spanner Instance] --> B[Database 1]
A --> C[Database 2]
B --> D[Table: Users]
B --> E[Table: Orders]
D --> F[Directory: User Data]
E --> G[Directory: Order Data]
F --> H[Paxos Group - 3+ Replicas]
G --> I[Paxos Group - 3+ Replicas]
H --> J[Replica 1 - US East]
H --> K[Replica 2 - US West]
H --> L[Replica 3 - Europe]
Spanner stores data in tablets (inherited from Bigtable thinking). Each tablet spans a range of primary keys and is managed by a Paxos group. Paxos ensures all replicas agree before writes are acknowledged.
TrueTime: Global Consistency Through Clock Uncertainty
TrueTime is Spanner’s most distinctive feature. It uses GPS receivers and atomic clocks in each data center to produce globally consistent timestamps.
TrueTime does not claim clocks are perfectly synchronized. Instead, it makes clock uncertainty explicit as a bounded interval. Call TT.now() and you get back a TTinterval with earliest and latest possible timestamps:
# TrueTime returns a time interval, not a point
tt = TT.now()
# tt.earliest = timestamp that could be in the past
# tt.latest = timestamp that could be in the future
# The actual time is somewhere within this interval
The uncertainty bound sits around 1-7 milliseconds. Spanner leverages these intervals for external consistency: if transaction T1 commits before T2 starts in real-time, then T1’s timestamp is less than T2’s. That is the guarantee that matters in practice.
This means globally consistent reads without a single time source synchronizing everything. Read at a specific timestamp from anywhere on Earth, and Spanner returns results consistent with that moment.
TrueTime needs specialized hardware, which is why direct competitors are rare. But the conceptual model has influenced other systems to adopt “clock uncertainty” thinking.
Schema and SQL Support
Spanner gives you a relational schema with SQL. Unlike NoSQL systems that offer query capabilities as an afterthought, Spanner includes:
- Standard SQL with JOINs, aggregations, subqueries
- Inline primary keys and foreign keys
- Interleaved tables for parent-child relationships
- Generated columns and indexes
-- Spanner supports standard SQL with relational features
CREATE TABLE Users (
UserId STRING(64) NOT NULL,
Email STRING(256),
CreatedAt TIMESTAMP,
) PRIMARY KEY (UserId);
CREATE TABLE Orders (
OrderId STRING(64) NOT NULL,
UserId STRING(64),
Total DECIMAL(10,2),
Status STRING(32),
) PRIMARY KEY (OrderId);
-- Foreign key support
ALTER TABLE Orders ADD CONSTRAINT FK_UserOrders
FOREIGN KEY (UserId) REFERENCES Users(UserId);
-- Create index for common query pattern
CREATE INDEX idx_orders_user_status ON Orders(UserId, Status);
The INTERLEAVE IN clause co-locates related rows physically, cutting the need for distributed joins. When you interleave child rows with their parent, Spanner puts them in the same tablet. Queries run without network round-trips.
Partition Key Design
Choosing a good primary key matters more in Spanner than in a single-node database. Since data is range-partitioned by key, a poorly chosen key creates hotspots that degrade performance.
Avoid sequential keys for high-write tables. Auto-incrementing integers cause all writes to hit a single tablet. UUIDs or hashed keys spread load across nodes:
-- Bad: sequential key creates hotspots
CREATE TABLE Events (
event_id INT64 AUTO_INCREMENT,
payload STRING(MAX),
PRIMARY KEY (event_id)
);
-- Good: uniformly distributed key
CREATE TABLE Events (
event_id STRING(36) NOT NULL,
payload STRING(MAX),
PRIMARY KEY (event_id)
) PRIMARY KEY (event_id);
For read-heavy tables where you query by user ID or customer ID most often, use that ID as the leading key. For write-heavy tables generating many rows per user, consider a composite key with a suffix to distribute writes.
Interleaved Tables
Interleaved tables model parent-child relationships where child rows are co-located with their parent on the same storage node. This improves read performance when you frequently fetch a parent and its children together.
CREATE TABLE Orders (
order_id STRING(36),
customer_id STRING(36),
total DECIMAL,
created_at TIMESTAMP,
PRIMARY KEY (order_id)
) PRIMARY KEY (order_id);
CREATE TABLE OrderItems (
order_id STRING(36),
item_id STRING(36),
product_id STRING(36),
quantity INT64,
PRIMARY KEY (order_id, item_id)
) INTERLEAVE IN PARENT Orders ON DELETE CASCADE;
With INTERLEAVE IN PARENT Orders ON DELETE CASCADE, deleting an order automatically deletes its items. The real win is that fetching an order with its items stays in a single tablet read, avoiding a distributed query.
Trade-offs: Interleaving shines when you almost always fetch parent and child together. But if you often query children independently, it backfires: all items for a given parent land in the same tablet, which becomes a hotspot for popular parents. Deleting the parent also gets pricier since Spanner has to delete all the children first.
Transaction Handling
Spanner handles two transaction types: read-write and read-only.
Read-write transactions run two-phase commit (2PC) over Paxos. Nodes prepare first, promising to commit if asked. Then the coordinator tells them to commit or abort. This gives atomicity across distributed nodes, but adds latency.
Read-only transactions skip locks. They take a timestamp bound and read from consistent snapshots. No blocking, no contention.
Spanner also supports partitioned transactions for high-contention workloads, splitting large writes across servers for better throughput.
sequenceDiagram
participant C as Coordinator
participant P1 as Participant 1
participant P2 as Participant 2
Note over C,P2: Two-Phase Commit Protocol
C->>P1: Prepare
C->>P2: Prepare
P1-->>C: Promise to Commit
P2-->>C: Promise to Commit
Note over C: Coordinator decides COMMIT
C->>P1: Commit
C->>P2: Commit
P1-->>C: Committed
P2-->C: Committed
Replication and Consistency
Spanner uses Paxos-based replication with configurable replica counts (typically 3-5), replica placement across regions for disaster recovery, and witness replicas that participate in Paxos without storing data.
The Paxos leader handles all reads and writes, but read leases let followers serve reads locally for a period. This cuts latency for read-heavy workloads.
By default, Spanner provides external consistency, stronger than serializability. If T1 commits before T2 starts in real-time, T1 happens before T2. Most applications need exactly this guarantee.
Direct Connections and Performance
Google recommends direct connections over the Proxy. Direct connections use gRPC and are faster, especially for apps running in Google Cloud.
Rough numbers:
- Write latency: 10-50ms depending on distance and replica setup
- Read latency: 1-10ms for local replica reads
- Throughput scales horizontally as you add nodes
Each node handles about 10,000 QPS reads or 2,000 QPS writes. Your mileage varies with workload patterns.
Use Cases and Trade-offs
Spanner works well for:
- Globally distributed databases needing strong consistency
- Financial systems requiring ACID transactions across regions
- Multi-tenant SaaS needing isolation and scalability
- Inventory or order management with consistent cross-region updates
Shortcomings:
- Cost: Not cheap compared to self-managed databases
- TrueTime dependency: Needs Google’s hardware for full benefit
- Latency floor: Even local reads have 1-2ms minimum
- Query optimization: Less mature than PostgreSQL or MySQL
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Zone outage (single AZ failure) | Spanner keeps serving from other zones; TrueTime bounds how stale reads can get | Multi-zone deployment; Spanner replicates synchronously across zones |
| Leader election during high write load | Short unavailability or latency spike while a new leader takes over | TrueTime uncertainty caps the maximum outage window; consider allowing leaderless reads |
| TrueTime hardware failure | TrueTime uncertainty grows until you replace the hardware; reads can return slightly stale data | Monitor clock synchronization closely; replace TrueTime hardware quickly when errors surface |
| Schema change during heavy load | Online schema changes hit limits on very large tables | Plan schema changes for low-traffic periods; be careful with ALTER TABLE on big datasets |
| Cross-region replication lag | Non-local replicas see higher read latency | Route read/write traffic to regional endpoints; globally consistent reads inherently need cross-region latency |
| Network partition | Isolated region cannot reach the replica majority | Spanner stays available in the majority partition; isolated region serves cached data until the partition heals |
| Transaction conflict rate spike | More retries and latency when concurrent transactions touch overlapping key ranges | Prefer read-only transactions where feasible; structure your key schema to reduce write contention |
Common Pitfalls / Anti-Patterns
Choose Spanner When
- You need global distribution with writes happening in multiple geographic regions simultaneously
- Your application requires strict serializability or external consistency across regions
- You want horizontal scalability without abandoning SQL and relational modeling
- You need managed operations with automatic sharding, replication, and failover
- Compliance requires data locality controls with regional redundancy
- Your team lacks database operations expertise and needs a hands-off operational model
Avoid Spanner When
- Your workload is single-region and does not need global distribution
- You are cost-sensitive and can self-manage PostgreSQL/MySQL with read replicas
- You have high-frequency short transactions where the latency floor matters (e.g., real-time bidding)
- You need advanced query optimization features found in mature relational databases
- Your data fits on a single node or requires minimal horizontal scaling
- You require open-source flexibility without vendor lock-in
Decision Framework
flowchart TD
A[Need Horizontal Scale?] --> |Yes| B[Need Global Distribution?]
A --> |No| E[Consider PostgreSQL or MySQL]
B --> |Yes| C[Need Strict Consistency?]
B --> |No| F[Consider DynamoDB or Cassandra]
C --> |Yes| D[Can Afford Premium Pricing?]
C --> |No| F
D --> |Yes| G[Spanner is a Strong Fit]
D --> |No| H[Consider CockroachDB or Aurora Global]
Quick Recap Checklist
- Spanner arranges data in directories (B-tree-like structures) grouped into Paxos groups for replication
- TrueTime uses GPS and atomic clocks to produce bounded clock uncertainty intervals, enabling globally consistent reads
- Schema supports standard SQL with foreign keys, interleaved tables, generated columns, and indexes
- Read-write transactions use two-phase commit over Paxos; read-only transactions use timestamp-bound snapshots
- Primary key design matters: avoid sequential keys for write-heavy workloads to prevent hotspots
- Interleaved tables co-locate parent-child rows but can create hotspots for popular parents
- External consistency (stronger than serializable) comes from TrueTime’s commit-wait rule
- Spanner works best for globally distributed OLTP, financial systems, and multi-tenant SaaS
- Cost scales linearly with nodes: storage (
$0.17-0.25/GB/mo), compute ($0.045-0.09/vCPU-hour)
Interview Questions
Expected answer points:
- Spanner solves the problem of globally distributed SQL: petabytes of data across data centers with relational model and ACID guarantees
- Unlike early distributed databases that sacrificed consistency or the relational model, Spanner keeps both
- Key differentiator: TrueTime hardware enables external consistency without a single synchronization point
Expected answer points:
- TrueTime uses GPS receivers and atomic clocks in each data center to generate globally consistent timestamps
- It returns a time interval (TTinterval) with earliest and latest possible timestamps, not a single point
- The uncertainty bound is roughly 1-7 milliseconds
- This enables external consistency: if T1 commits before T2 starts in real time, T1's timestamp is less than T2's
- Without bounded uncertainty, you need a single synchronized time source (which fails across continents)
Expected answer points:
- The commit-wait rule: a transaction must wait until its commit timestamp is in the past (past the TrueTime uncertainty bound)
- This guarantees that any transaction reading at timestamp T sees all transactions that committed before T
- Wait time is bounded by the maximum clock uncertainty (typically under 10ms)
- This is what gives Spanner external consistency without synchronized clocks
Expected answer points:
- Directories are B-tree-like structures that serve as the unit of data placement
- Directories can move between servers and zones dynamically without affecting running transactions
- Each directory falls under a Paxos group controlling replication across 3-5 replicas
- Hierarchy: databases → tables → directories → key-value pairs
Expected answer points:
- Read-write transactions run two-phase commit (2PC) over Paxos: nodes prepare first, then coordinator tells them to commit or abort
- Read-only transactions skip locks entirely and read from consistent snapshots at a timestamp bound
- Read-only transactions have no blocking or contention, making them much faster for consistent reads
- Spanner also supports partitioned transactions for high-contention workloads
Expected answer points:
- Data is range-partitioned by primary key, so poorly chosen keys create hotspots that degrade performance
- Sequential keys (like auto-incrementing integers) cause all writes to hit a single tablet
- UUIDs or hashed keys distribute load across nodes for write-heavy workloads
- For read-heavy tables, use the most frequently queried ID as the leading key
- For write-heavy tables with many rows per user, consider composite keys with a suffix to spread writes
Expected answer points:
- Interleaved tables model parent-child relationships where child rows are co-located with their parent on the same storage node
- Benefit: fetching a parent with its children stays in a single tablet read, avoiding distributed queries
- ON DELETE CASCADE automatically cleans up child rows when the parent is deleted
- Trade-off: if you frequently query children independently, all children of a parent land in the same tablet, creating a hotspot
- Deleting the parent becomes expensive because Spanner must delete all children first
Expected answer points:
- Paxos ensures all replicas agree before writes are acknowledged
- Typically 3-5 replicas are placed across regions for disaster recovery
- The Paxos leader handles all reads and writes
- Read leases let followers serve reads locally for a period, reducing latency for read-heavy workloads
- Witness replicas participate in Paxos without storing data
Expected answer points:
- External consistency: if transaction T1 commits before T2 starts in real-time, then T1 happens before T2
- This is stronger than serializability (which only guarantees equivalent to some serial order) because it preserves real-time ordering
- Most applications need exactly this guarantee: actions that happened before appear before to all observers
- Achieved through TrueTime's commit-wait rule
Expected answer points:
- Cost: premium pricing compared to self-managed databases
- TrueTime dependency: needs Google's hardware for full benefit (competitors use HLC instead)
- Latency floor: even local reads have 1-2ms minimum due to Paxos overhead
- Query optimization less mature than PostgreSQL or MySQL
- Avoid when: workload is single-region, cost-sensitive, has high-frequency short transactions, or data fits on a single node
Expected answer points:
- Coordinator asks all participant nodes to prepare: they promise to commit if asked
- Once a majority (Paxos group) promises, coordinator decides COMMIT
- Coordinator tells all participants to commit, and they acknowledge
- This adds latency compared to single-node commits: round trips between distributed nodes
- Cross-region writes can take 10-50ms depending on distance and replica setup
Expected answer points:
- F1 is Google's ad business database that was migrated from Bigtable to Spanner
- It validates Spanner can handle a real mission-critical workload: thousands of tables, complex joins, OLTP alongside analytical queries
- Strict SLAs for latency and availability had to be met during and after migration
- Proves Spanner is production-hardened, not just a research prototype
Expected answer points:
- Spanner keeps serving from other zones when a single AZ fails; replicas in other zones continue operating
- TrueTime bounds how stale reads can get during outage periods
- Network partition: isolated region cannot reach replica majority, so Spanner stays available in the majority partition
- Isolated region serves cached data until partition heals
Expected answer points:
- Three main components: storage ($0.17-0.25/GB/month), compute ($0.045-0.09/vCPU-hour), and networking egress
- Storage: regional ~$0.17/GB, multi-region ~$0.25/GB
- Compute: regional ~$0.045/vCPU-hour, multi-region ~$0.09/vCPU-hour
- Egress charges apply when data leaves Google Cloud infrastructure
- Costs scale linearly with nodes, so you can start small and scale up
Expected answer points:
- Spanner: external consistency via TrueTime (GPS + atomic clocks), fully managed cloud, native multi-region
- CockroachDB: serializability via HLC (no special hardware), self-hosted or cloud, native multi-region
- Aurora: session + eventual consistency, AWS managed only, read replicas with Aurora Global for cross-region
- Write latency: Spanner 10-50ms cross-region, CockroachDB 20-100ms, Aurora 5-20ms single region
- Spanner has the strongest consistency but highest price; CockroachDB offers similar semantics with open-source option
Expected answer points:
- When TrueTime hardware fails, clock uncertainty grows until the hardware is replaced
- Reads can return slightly stale data while uncertainty is elevated
- Spanner continues operating but consistency guarantees weaken temporarily
- Mitigation: monitor clock synchronization closely and replace TrueTime hardware quickly when errors surface
Expected answer points:
- Add one column at a time rather than batching multiple changes
- Avoid large DEFAULT values that increase write amplification
- Use NOT NULL only when truly required, since it forces a full table rewrite
- Schedule large schema changes during low-traffic windows when possible
- Online schema changes via ALTER TABLE have no downtime but can take time on very large tables
Expected answer points:
- More retries and latency when concurrent transactions touch overlapping key ranges
- Hotspots from poor primary key design or popular parent keys in interleaved tables exacerbate conflicts
- Mitigation: prefer read-only transactions where feasible (no locks, no conflicts)
- Structure key schema to reduce write contention: spread writes across key space
- Consider partitioned transactions for large batch operations
Expected answer points:
- The Paxos leader handles all reads and writes by default
- Read leases allow follower replicas to serve reads locally for a bounded period
- During the lease period, clients can read from the nearest follower without contacting the leader
- This cuts latency significantly for read-heavy workloads where slight staleness is acceptable
- Lease duration is managed by Paxos; followers must renew leases to continue serving local reads
Expected answer points:
- Choose Spanner when: you need writes in multiple geographic regions simultaneously, strict serializability across regions, or automatic horizontal scaling without application-level sharding
- Choose PostgreSQL with read replicas when: workload is single-region, cost-sensitive, or does not need global distribution
- Spanner's managed operations eliminate database administration overhead but at premium pricing
- PostgreSQL offers advanced query optimization and more mature ecosystem
- Decision framework: need horizontal scale + global distribution + strict consistency + budget = Spanner; otherwise PostgreSQL
Further Reading
- TrueTime: Time You Can Trust — How GPS and atomic clocks enable globally consistent timestamps
- CAP Theorem: Beyond the Binary — Why consistency vs availability tradeoffs define distributed systems
- CAP Theorem explains why consistency across partitions is hard
- Consistency Models covers external consistency and snapshot isolation
- Database Replication discusses Spanner’s replication strategies
- Database Replication: From Primary to Multi-Primary — How replication topologies impact consistency and availability
- Distributed Transactions explains two-phase commit
- Distributed Transactions: Two-Phase Commit — The consensus puzzle and how Spanner solves it with Paxos
Operational Guidance
Backup and Restore
Spanner handles backups automatically with no user intervention. On-demand backups are available through the Cloud Console or the spanner_backup.sh script. You can restore to any point within the last 7 days using point-in-time recovery.
Cross-region backups need a manual setup: export Spanner data to Avro or Parquet in Cloud Storage, then copy files to your target region. Budget engineering time for this if you have strict data residency requirements.
Schema Changes
Spanner supports online schema changes via ALTER TABLE statements. There is no downtime for schema changes, but they go through an internal schema upgrade process that can take time on very large tables.
# Online schema change with ALTER TABLE
operation = client.update_database_ddl(
database_id,
['ALTER TABLE Users ADD COLUMN last_login TIMESTAMP']
)
operation.result() # Waits for completion
Schema change best practices:
- Add one column at a time rather than batching multiple changes
- Avoid large DEFAULT values that increase write amplification
- Use
NOT NULLonly when truly required, since it forces a full table rewrite - Schedule large schema changes during low-traffic windows when possible
Import and Export
Cloud Dataflow handles bulk import and export. Spanner exports to Avro or Parquet in Cloud Storage. For imports, point Dataflow at your Cloud Storage files and let Spanner do the distributed write.
# Cloud Dataflow export job configuration (pseudocode)
export_job = {
"target_project": "my-project",
"target_instance": "my-instance",
"target_database": "my-db",
"output_prefix": "gs://my-bucket/spanner-export/",
"output_format": "AVRO", # or PARQUET
"batch_size": 10000
}
For small datasets, the Cloud Console provides a no-code import wizard. For production workloads, always use Dataflow with appropriate batch sizing.
Cost Modeling
Spanner pricing has three main components: storage, compute, and networking egress.
Storage
Storage is charged at approximately $0.17 per GB per month for regional storage. Multi-region storage costs more, around $0.25 per GB per month. Storage includes all data, indexes, and backup overhead.
Compute
Compute is priced per vCPU-hour. A regional configuration costs roughly $0.045 per vCPU-hour. Multi-region configurations (necessary for global write distribution) cost approximately $0.09 per vCPU-hour, roughly double the regional rate.
Networking Egress
Networking charges apply when data leaves Google Cloud infrastructure. Cross-region writes and reads within Spanner are generally cheap, but egress to the public internet or to other cloud providers adds up quickly.
Cost Comparison Example
| Configuration | Storage (1TB) | Compute (16 vCPU) | Estimated Monthly Cost |
|---|---|---|---|
| Single-region | $170/month | $26/month (regional) | ~$196/month |
| Multi-region (US) | $170/month | $52/month (multi-region) | ~$222/month |
| Multi-region (global) | $250/month | $104/month | ~$354/month |
These numbers assume 24/7 operation at full compute capacity. Actual costs vary with utilization.
Cost Estimation Pseudocode
def estimate_monthly_spanner_cost(
storage_gb: float,
vcpus: int,
region_type: str = "regional", # "regional" | "multi-region-us" | "multi-region-global"
egress_gb: float = 0
) -> dict:
# Storage pricing per GB/month
storage_prices = {
"regional": 0.17,
"multi-region-us": 0.20,
"multi-region-global": 0.25
}
# Compute pricing per vCPU-hour
compute_prices = {
"regional": 0.045,
"multi-region-us": 0.07,
"multi-region-global": 0.09
}
# Egress pricing per GB (approximate)
egress_rate = 0.12
hours_per_month = 730 # Average month
storage_cost = storage_gb * storage_prices[region_type]
compute_cost = vcpus * compute_prices[region_type] * hours_per_month
egress_cost = egress_gb * egress_rate
return {
"storage": round(storage_cost, 2),
"compute": round(compute_cost, 2),
"egress": round(egress_cost, 2),
"total": round(storage_cost + compute_cost + egress_cost, 2)
}
# Example: 1TB storage, 16 vCPU, multi-region US, 100GB egress
cost = estimate_monthly_spanner_cost(
storage_gb=1024,
vcpus=16,
region_type="multi-region-us",
egress_gb=100
)
# cost["total"] ~ $270/month
Plug your numbers into the Google Cloud Pricing Calculator for real estimates. One practical note: Spanner costs scale linearly with nodes, so you can start small for dev/test environments and scale up when you go to production.
Conclusion
Spanner combines SQL with horizontal scalability and strong consistency. It shows you do not have to throw away ACID guarantees to get NoSQL-like scaling. TrueTime hardware is the main barrier to competition, but the ideas have shaped how engineers think about distributed consistency.
For globally distributed apps needing strong consistency, Spanner is worth the cost. For others, studying its architecture teaches you the real trade-offs in distributed database design.
Category
Related Posts
Amazon DynamoDB: Scalable NoSQL with Predictable Performance
Deep dive into Amazon DynamoDB architecture, partitioned tables, eventual consistency, on-demand capacity, and the single-digit millisecond SLA.
Google Chubby: The Lock Service That Inspired ZooKeeper
Explore Google Chubby's architecture, lock-based coordination, Paxos integration, cell hierarchy, and its influence on distributed systems design.
Synchronous Replication: Data Consistency Across Nodes
Learn about synchronous replication patterns in distributed databases, including Quorum-based replication and write-ahead log shipping for zero RPO.