Multi-Tenancy: Shared Infrastructure, Isolated Data
Multi-tenancy lets multiple customers share infrastructure while keeping data isolated. Explore schema strategies, tenant isolation, and SaaS architecture.
Multi-tenancy is how most SaaS applications work. You deploy one application that serves thousands of customers, with each customer’s data kept separate. The appeal is obvious: shared infrastructure means lower costs, easier maintenance, fewer deployment pipelines.
But the complexity does not disappear. It moves. Instead of managing many isolated deployments, you manage isolation within a shared environment. Get isolation wrong and you leak data between tenants. Get performance wrong and one loud neighbor drowns out everyone else.
Introduction
A tenant is a group of users who share access to the same data. In a multi-tenant system, one application instance serves multiple tenants. Each tenant cannot see or access other tenants’ data.
Single-tenancy is the alternative: each customer gets their own application and database. Stronger isolation, higher costs.
graph TD
subgraph "Multi-Tenant Architecture"
A[Application] --> B[Shared Database]
B --> C[Tenant A Data]
B --> D[Tenant B Data]
B --> E[Tenant C Data]
end
subgraph "Single-Tenant Architecture"
F[App A] --> G[DB A]
H[App B] --> I[DB B]
end
The shared database approach is the most cost-effective. One database, one application, one deployment pipeline. Compute and storage costs scale sub-linearly with tenants.
Schema Strategies
How you organize tenant data in the database affects isolation, performance, and complexity.
Shared Schema with Tenant ID
All tenants share the same tables. A tenant_id column identifies which row belongs to which tenant.
CREATE TABLE orders (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL,
    user_id UUID,
    total DECIMAL(10,2),
    created_at TIMESTAMP,
    CONSTRAINT tenant_isolation CHECK (tenant_id IS NOT NULL)
);
-- Query always filters by tenant
SELECT * FROM orders WHERE tenant_id = 'tenant-123';
Every query must include the tenant_id filter. Miss it once and you have a data leak. Use row-level security (RLS) in PostgreSQL or similar features to enforce this at the database level.
-- Enable row-level security
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
-- Create policy that filters by current_setting
CREATE POLICY tenant_isolation ON orders
USING (tenant_id::text = current_setting('app.current_tenant'));
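RLS makes it much harder to accidentally query across tenants: the database applies the filter even if application code omits it. Two caveats apply. Superusers and roles with the BYPASSRLS attribute always skip policies, and table owners skip them unless you also run ALTER TABLE orders FORCE ROW LEVEL SECURITY, so the application should connect as an unprivileged role. And current_setting('app.current_tenant') raises an error if the setting was never set, so every request must set it before querying.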
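The policy reads app.current_tenant, so the application has to set that value before each query. A minimal sketch, assuming a DB-API style cursor with %s placeholders (as in psycopg) and a transaction already opened by the caller; `run_as_tenant` and the `Cursor` protocol are hypothetical names for illustration. It relies on PostgreSQL's `set_config(name, value, is_local)` function with `is_local` set to true, which limits the setting to the current transaction.

```python
from typing import Any, Protocol, Sequence


class Cursor(Protocol):
    """Minimal DB-API cursor surface this sketch relies on."""

    def execute(self, sql: str, params: Sequence[Any] = ()) -> Any: ...
    def fetchall(self) -> list: ...


def run_as_tenant(cursor: Cursor, tenant_id: str, sql: str,
                  params: Sequence[Any] = ()) -> list:
    """Run one query with app.current_tenant set for this transaction only.

    Hypothetical helper: the caller is assumed to have opened a transaction
    and to commit or roll back afterwards.
    """
    if not tenant_id:
        raise ValueError("tenant_id is required")
    # is_local=true scopes the setting to the current transaction, so it
    # does not leak to whoever reuses the connection next.
    cursor.execute(
        "SELECT set_config('app.current_tenant', %s, true)", (tenant_id,)
    )
    cursor.execute(sql, params)
    return cursor.fetchall()
```

Scoping the setting to the transaction matters most when connections are reused across requests, because a session-level value could otherwise outlive the request that set it.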
Separate Schemas per Tenant
Each tenant gets their own schema within the same database.
-- Tenant A's schema
CREATE SCHEMA tenant_a;
CREATE TABLE tenant_a.orders (...);
-- Tenant B's schema
CREATE SCHEMA tenant_b;
CREATE TABLE tenant_b.orders (...);
Applications connect with a search_path that includes the tenant’s schema. Queries do not need explicit tenant_id filtering because the schema provides implicit isolation.
The trade-off: schema migrations become more complex. You must run migrations against every tenant schema. With thousands of tenants, this does not scale.
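The per-schema migration loop can be sketched as follows. This is an illustration under stated assumptions, not a replacement for a migration tool: `apply_migration` is a hypothetical callable that runs one versioned migration against one schema, and the `applied` dict stands in for a central registry table that records each tenant's schema version.

```python
from typing import Callable, Dict, Iterable, List


def migrate_all_schemas(
    schemas: Iterable[str],
    target_version: int,
    applied: Dict[str, int],
    apply_migration: Callable[[str, int], None],
) -> List[str]:
    """Bring every tenant schema up to target_version, one step at a time.

    `applied` maps schema name -> last version applied (a stand-in for a
    central migration registry). Returns the schemas that failed, so they
    can be retried or investigated instead of silently drifting.
    """
    failed: List[str] = []
    for schema in schemas:
        current = applied.get(schema, 0)
        try:
            for version in range(current + 1, target_version + 1):
                apply_migration(schema, version)
                applied[schema] = version  # record progress after each step
        except Exception:
            failed.append(schema)  # keep going; other tenants still migrate
    return failed
```

Because progress is recorded per schema, rerunning the function after a failure resumes each tenant from its last applied version rather than starting over.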
Separate Databases per Tenant
Maximum isolation. Each tenant gets their own database instance.
graph TD
A[Load Balancer] --> B[Application Cluster]
B --> C[Tenant A DB]
B --> D[Tenant B DB]
B --> E[Tenant N DB]
This approach suits regulatory requirements where data must be physically separated. It also simplifies per-tenant customization. But the operational overhead is brutal: thousands of databases mean thousands of backups, thousands of patches, thousands of failure points.
Most SaaS companies do not need this level of isolation until they have specific compliance requirements.
Tenant Isolation Patterns
Beyond the database, you need to think about isolation at every layer of your stack.
Application-Layer Isolation
Your application code must be tenant-aware from the start. Middleware extracts the tenant from the request and sets a context variable.
from contextvars import ContextVar
from flask import Flask, request

app = Flask(__name__)
current_tenant: ContextVar[str] = ContextVar('tenant_id')

@app.before_request
def before_request():
    # Extract tenant from JWT or subdomain
    token = request.headers.get('Authorization')
    # extract_tenant_from_token is assumed to verify the token before trusting it
    tenant = extract_tenant_from_token(token)
    current_tenant.set(tenant)

@app.route('/orders')
def get_orders():
    tenant_id = current_tenant.get()
    return query_orders_for_tenant(tenant_id)
Do not rely on user input to determine the tenant without validation. A user should not be able to specify their tenant_id in a request parameter unless your application explicitly maps users to tenants.
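One way to express that rule in code is to treat any tenant named in the request as a hint that must match the authenticated claim. The sketch below assumes `claims` comes from a token whose signature has already been verified and that the claim is named `tenant_id`; `resolve_tenant` and `TenantMismatch` are hypothetical names.

```python
from typing import Mapping, Optional


class TenantMismatch(Exception):
    """Raised when a request names a tenant other than the authenticated one."""


def resolve_tenant(claims: Mapping[str, str],
                   requested_tenant_id: Optional[str] = None) -> str:
    """Return the tenant from verified token claims.

    A tenant named in the URL or query string is accepted only when it
    matches the claim; it never overrides it.
    """
    tenant_id = claims.get("tenant_id")
    if not tenant_id:
        raise TenantMismatch("token carries no tenant_id claim")
    if requested_tenant_id is not None and requested_tenant_id != tenant_id:
        raise TenantMismatch("requested tenant does not match token")
    return tenant_id
```

Rejecting a mismatch outright, rather than quietly falling back to the claim, also gives you a signal worth logging as a possible cross-tenant access attempt.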
Caching Considerations
Redis and similar caches are shared across tenants. You must namespace keys by tenant.
# Bad: cache key could collide between tenants
cache.set(f"user:{user_id}", user_data)
# Good: tenant-scoped cache key
cache.set(f"tenant:{tenant_id}:user:{user_id}", user_data)
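A shared cache also shares its memory budget: one tenant’s traffic spike can evict another tenant’s frequently accessed data. If you use cache-aside caching, also watch for cache stampedes, where many requests miss on the same expired key at once and all fall through to the database.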
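Rather than relying on every call site to build the key correctly, the prefix can be applied in one place. A minimal sketch, using a plain dict as a stand-in for Redis; the `TenantCache` class is hypothetical and the prefix format follows the tenant-scoped key shown above.

```python
from typing import Any, Dict, Optional


class TenantCache:
    """Wraps a key-value store so every key is prefixed with the tenant.

    The backing store here is a plain dict standing in for Redis.
    """

    def __init__(self, tenant_id: str, store: Dict[str, Any]) -> None:
        # Reject ids containing the separator so prefixes cannot overlap.
        if not tenant_id or ":" in tenant_id:
            raise ValueError("invalid tenant_id")
        self._prefix = f"tenant:{tenant_id}:"
        self._store = store

    def set(self, key: str, value: Any) -> None:
        self._store[self._prefix + key] = value

    def get(self, key: str) -> Optional[Any]:
        return self._store.get(self._prefix + key)

    def invalidate_all(self) -> int:
        """Drop every key belonging to this tenant; returns how many."""
        doomed = [k for k in self._store if k.startswith(self._prefix)]
        for k in doomed:
            del self._store[k]
        return len(doomed)
```

With a wrapper like this, two tenants can cache the same logical key (`user:1`) without seeing each other's values, and invalidation stays scoped to one tenant.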
Background Jobs and Queues
Worker processes handle background tasks. These must also be tenant-aware.
# Task includes tenant context
@celery.task
def generate_report(tenant_id, report_type):
    # Use tenant_id throughout
    data = fetch_data_for_tenant(tenant_id)
    report = build_report(data, report_type)
    store_report_for_tenant(tenant_id, report)
Never assume that because a task was queued by one tenant, it only affects that tenant. Cross-tenant bugs in background jobs are particularly nasty because they may not be caught until data is already corrupted.
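Workers are long-lived and reused across jobs, so tenant context should be set at the start of each job and cleared at the end. A minimal sketch of a decorator that does this, independent of any particular queue library; `tenant_task` is a hypothetical name, and `current_tenant` mirrors the context variable used in the request middleware.

```python
from contextvars import ContextVar
from functools import wraps
from typing import Any, Callable

current_tenant: ContextVar[str] = ContextVar("tenant_id")


def tenant_task(func: Callable[..., Any]) -> Callable[..., Any]:
    """Run `func` with current_tenant set from its first argument.

    The context is reset afterwards so a reused worker never carries
    one tenant's identity into the next job.
    """
    @wraps(func)
    def wrapper(tenant_id: str, *args: Any, **kwargs: Any) -> Any:
        if not tenant_id:
            raise ValueError("job enqueued without tenant_id")
        token = current_tenant.set(tenant_id)
        try:
            return func(tenant_id, *args, **kwargs)
        finally:
            current_tenant.reset(token)
    return wrapper


@tenant_task
def generate_report(tenant_id: str, report_type: str) -> str:
    # Downstream helpers can read current_tenant instead of trusting args.
    return f"{report_type} report for {current_tenant.get()}"
```

Calling `generate_report("t1", "sales")` runs with the context set to `t1`, and the variable is unset again once the job returns or raises.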
Performance Isolation
Shared infrastructure means shared resources. Without careful design, one tenant’s workload can degrade performance for everyone.
Resource Quotas
Implement per-tenant limits on CPU, memory, database connections, and API calls. Track usage against quotas and throttle or reject requests that exceed limits.
from dataclasses import dataclass

@dataclass
class TenantQuota:
    tenant_id: str
    monthly_spend_limit: float
    api_rate_limit: int  # requests per minute
    max_db_connections: int
    storage_gb: float

def check_quota(tenant_id: str, operation: str) -> bool:
    # get_quota and get_current_usage are application-specific lookups
    quota = get_quota(tenant_id)
    current_usage = get_current_usage(tenant_id)
    if operation == 'api_request':
        if current_usage.api_requests >= quota.api_rate_limit:
            return False
    elif operation == 'db_connection':
        if current_usage.db_connections >= quota.max_db_connections:
            return False
    return True
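The usage lookup is left abstract above. For the requests-per-minute limit, one simple way to track usage is a fixed-window counter keyed by tenant and minute. The sketch below keeps counts in memory and takes an injectable clock for testing; `TenantRateLimiter` is a hypothetical name, and a production version would more likely keep the counters in a shared store with expiry.

```python
import time
from collections import defaultdict
from typing import Callable, DefaultDict, Tuple


class TenantRateLimiter:
    """Fixed-window request counter keyed by (tenant, minute)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, tenant_id: str, limit_per_minute: int) -> bool:
        """Count the request and return True if the tenant is under its limit."""
        window = int(self._clock() // 60)  # minute bucket since the epoch
        key = (tenant_id, window)
        if self._counts[key] >= limit_per_minute:
            return False
        self._counts[key] += 1
        return True
```

The limit is shared by all of a tenant's users, and each tenant's counter is independent. This sketch never discards old buckets, so a long-running process would need to prune them.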
Compute Isolation with Namespaces
If you use Kubernetes, you can isolate tenants using namespaces and resource quotas. CPU limits prevent one tenant from consuming all available compute.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-123
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
Database Connection Pooling
Database connections are often the scarcest resource. Use PgBouncer or similar connection poolers to multiplex many application connections over fewer database connections.
; pgbouncer.ini
[databases]
app_db = host=db.example.com port=5432 dbname=production
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
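With transaction-mode pooling, a server connection returns to the pool when each transaction ends, whether by commit or rollback. This lets you support far more tenants per database instance. One caveat: session-level state does not reliably carry over between transactions in this mode, so a per-tenant setting used by RLS must be set inside each transaction (for example with SET LOCAL) rather than once per connection.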
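A pooler bounds total connections but does not by itself stop one tenant from holding most of them. A minimal sketch of an application-side cap, assuming a threaded application; `TenantConnectionLimiter` is a hypothetical class, and the code that checks a connection out of the pool would go inside the `with` block.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


class TenantConnectionLimiter:
    """Caps how many pooled connections one tenant may hold at a time."""

    def __init__(self, max_per_tenant: int) -> None:
        self._max = max_per_tenant
        self._lock = threading.Lock()
        self._slots: Dict[str, threading.BoundedSemaphore] = {}

    def _semaphore(self, tenant_id: str) -> threading.BoundedSemaphore:
        # Create each tenant's semaphore lazily, guarded against races.
        with self._lock:
            if tenant_id not in self._slots:
                self._slots[tenant_id] = threading.BoundedSemaphore(self._max)
            return self._slots[tenant_id]

    @contextmanager
    def slot(self, tenant_id: str, timeout: float = 1.0) -> Iterator[None]:
        """Hold one connection slot for the tenant, or raise TimeoutError."""
        sem = self._semaphore(tenant_id)
        if not sem.acquire(timeout=timeout):
            raise TimeoutError(f"tenant {tenant_id} is at its connection cap")
        try:
            yield
        finally:
            sem.release()
```

A tenant that is at its cap waits briefly and then fails, while other tenants keep acquiring slots normally.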
Security Considerations
Multi-tenancy amplifies security risks. A vulnerability affects not just one customer but potentially all customers.
Access Controls
Implement role-based access control that respects tenant boundaries. Users should only be able to access resources within their own tenant.
@require_permission('read:orders')
def get_orders(tenant_id):
    # Permission check happens via decorator
    # It verifies user belongs to tenant_id
    return db.orders.filter(tenant_id=tenant_id)
Audit logging becomes critical. You need to know who accessed what and when. Log tenant_id with every security-relevant event.
Network Isolation
Consider network-level isolation for sensitive tenants. Private networking (VPC peering, private links) keeps tenant traffic off shared networks.
For highly regulated industries, some tenants may require dedicated infrastructure. This moves toward single-tenancy but within a managed environment.
Data Encryption
Encrypt data at rest and in transit. With shared databases, encryption at rest protects against infrastructure-level breaches but not against application-level vulnerabilities.
Use tenant-specific encryption keys if your compliance requirements demand it. AWS KMS and similar services support per-tenant keys with envelope encryption.
Data Residency and Compliance
Where data physically lives matters when you have customers in different jurisdictions. GDPR restricts transferring EU residents’ personal data outside the EU/EEA unless the destination provides adequate protection, for example through an adequacy decision or standard contractual clauses. HIPAA requires healthcare data to meet specific security standards regardless of where it is stored.
Multi-tenancy complicates this because all tenants share infrastructure. You need to know which tenants have residency requirements and route their data accordingly.
Tenant-Region Routing
Route tenants to region-specific database instances based on their residency requirements. A European tenant gets a database in Frankfurt. An Australian tenant gets Sydney.
TENANT_REGION_MAP = {
    'eu': 'db-frankfurt.example.com',
    'us': 'db-virginia.example.com',
    'apac': 'db-sydney.example.com',
}

def get_db_connection(tenant_id: str) -> Connection:
    tenant = get_tenant_metadata(tenant_id)
    # Fail closed: an unknown region raises KeyError instead of silently
    # falling back to another region, which could violate residency rules
    host = TENANT_REGION_MAP[tenant.region]
    return create_connection(host=host)
This hybrid approach - shared application, region-specific databases - keeps multi-tenancy cost benefits while meeting residency requirements.
GDPR Considerations
GDPR compliance requires planning for the full data lifecycle:
- Right to erasure: when a tenant requests deletion, purge data from all systems including backups, logs, and caches. Document the erasure workflow before you need it.
- Data Processing Agreements: your tenants are usually the data controllers for their users’ data, and you act as their processor. Have DPAs in place before onboarding them.
- Audit trails: maintain logs of who accessed what data and when. These logs must themselves comply with retention policies.
HIPAA Considerations
Healthcare tenants typically require Business Associate Agreements with your infrastructure providers, audit logs for all access to protected health information, and encryption at rest with specific key management requirements.
Many SaaS companies offer a separate “HIPAA tier” that moves qualifying tenants to dedicated infrastructure. The operational overhead is real, but it is often the only way to serve healthcare customers without compromising the shared tenant base.
Tenant Lifecycle Management
Tenants are not static. They sign up, grow, change plans, and eventually leave. Your system needs to handle the full lifecycle cleanly.
Tenant Onboarding
New tenant provisioning should be automated and idempotent. If provisioning fails halfway, you should be able to rerun it safely.
def provision_tenant(tenant_id: str, plan: str) -> None:
    """Idempotent tenant provisioning."""
    # Create database schema if not exists
    if not schema_exists(tenant_id):
        create_schema(tenant_id)
    run_migrations(tenant_id)
    # Set initial quotas based on plan
    quotas = PLAN_QUOTAS[plan]
    set_tenant_quotas(tenant_id, quotas)
    # Seed initial data
    if not has_initial_data(tenant_id):
        seed_tenant_data(tenant_id)
    # Enable features for plan
    set_tenant_features(tenant_id, PLAN_FEATURES[plan])
Idempotency matters because provisioning can fail at any step. If you rerun it and it tries to recreate an already-created schema, it should succeed or skip gracefully rather than fail.
Plan Changes and Upgrades
When a tenant upgrades, update quotas immediately. A downgrade requires validating that current usage does not exceed the new limits.
def change_tenant_plan(tenant_id: str, new_plan: str) -> bool:
    current_usage = get_tenant_usage(tenant_id)
    new_quotas = PLAN_QUOTAS[new_plan]
    # Validate downgrade will not violate new limits
    if current_usage.exceeds(new_quotas):
        return False  # Tenant must reduce usage first
    update_tenant_quotas(tenant_id, new_quotas)
    update_tenant_features(tenant_id, PLAN_FEATURES[new_plan])
    return True
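The usage check itself is left undefined above. One possible shape for it is sketched below, with hypothetical `Quotas` and `Usage` types and two illustrative limits (user count and storage); a real plan would carry whatever dimensions you bill on. Returning the list of violations, not just a boolean, lets you tell the tenant what to reduce before the downgrade can proceed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Quotas:
    max_users: int
    storage_gb: float


@dataclass(frozen=True)
class Usage:
    users: int
    storage_gb: float

    def violations(self, quotas: Quotas) -> List[str]:
        """List each limit the current usage would break under `quotas`."""
        problems = []
        if self.users > quotas.max_users:
            problems.append(f"users: {self.users} > {quotas.max_users}")
        if self.storage_gb > quotas.storage_gb:
            problems.append(
                f"storage_gb: {self.storage_gb} > {quotas.storage_gb}"
            )
        return problems

    def exceeds(self, quotas: Quotas) -> bool:
        """True if any limit in `quotas` is already exceeded."""
        return bool(self.violations(quotas))
```

For example, a tenant with 12 users cannot move to a 10-user plan until two seats are removed, and `violations` reports exactly that.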
Tenant Offboarding and Data Deletion
When a tenant cancels, you need a clear data retention and deletion policy. Most SaaS products keep data for 30-90 days after cancellation for potential reactivation, then delete it.
def offboard_tenant(tenant_id: str, hard_delete: bool = False) -> None:
    if hard_delete:
        # Permanent deletion for GDPR erasure requests
        delete_all_tenant_data(tenant_id)
        drop_tenant_schema(tenant_id)
        purge_tenant_from_cache(tenant_id)
        schedule_backup_deletion(tenant_id)
    else:
        # Soft delete - mark for eventual cleanup
        mark_tenant_inactive(tenant_id)
        revoke_tenant_access(tenant_id)
        schedule_deletion(tenant_id, days=90)
Data deletion across all systems is harder than it looks. Logs, analytics pipelines, data warehouses, and backups all potentially contain tenant data. Maintain a data inventory so you know exactly where to look when a deletion request arrives.
When to Use Multi-Tenancy
Multi-tenancy makes sense when:
- Your tenants have similar workloads and resource needs
- Cost efficiency matters more than maximum isolation
- You can build and maintain proper isolation tooling
- Regulatory requirements allow shared infrastructure
Single-tenancy makes sense when:
- Tenants have wildly different resource requirements
- Strong regulatory isolation is required
- Per-tenant customization is extensive
- Tenant count is small (dozens, not thousands)
Most SaaS applications start multi-tenant. The efficiency gains are hard to pass up. Build isolation and observability tooling early, before you have hundreds of tenants and bugs that are hard to fix.
Real-world Failure Scenarios
Multi-tenancy failures often stem from subtle isolation gaps. Here are the most common:
The Tenant ID Injection Attack
A request handler reads the tenant_id from the JWT but never checks that this tenant owns the resource being accessed:
# Vulnerable - tenant_id is fetched but never used in the query
def get_project(project_id):
    tenant_id = get_tenant_from_token()
    return db.query("SELECT * FROM projects WHERE id = ?", project_id)
An attacker with tenant A’s token can guess or brute-force project IDs belonging to tenant B. The fix: filter every query by the authenticated tenant_id as well as the resource ID, so a row owned by another tenant is never returned.
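A minimal sketch of the corrected lookup, using an in-memory SQLite database so it runs on its own. The table layout and sample rows are assumptions for illustration; the point is that the WHERE clause checks both the resource ID and the authenticated tenant.

```python
import sqlite3
from typing import Optional, Tuple


def get_project(conn: sqlite3.Connection, tenant_id: str,
                project_id: int) -> Optional[Tuple]:
    """Fetch a project only if it belongs to the caller's tenant."""
    row = conn.execute(
        "SELECT id, tenant_id, name FROM projects "
        "WHERE id = ? AND tenant_id = ?",
        (project_id, tenant_id),
    ).fetchone()
    return row  # None when the id exists but belongs to another tenant


# Illustrative data: one project per tenant.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects (id INTEGER PRIMARY KEY, tenant_id TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO projects VALUES (?, ?, ?)",
    [(1, "tenant-a", "Alpha"), (2, "tenant-b", "Beta")],
)

print(get_project(conn, "tenant-a", 1))  # (1, 'tenant-a', 'Alpha')
print(get_project(conn, "tenant-a", 2))  # None
```

Returning nothing for another tenant's ID, rather than a distinct "forbidden" error, also avoids confirming to an attacker that the ID exists.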
The Shared Cache Poisoning Incident
A caching layer doesn’t namespace keys by tenant. User A’s data appears in User B’s session:
# Wrong - collision across tenants
cache.set(f"user:{user_id}", data)
# Correct - tenant-scoped
cache.set(f"{tenant_id}:user:{user_id}", data)
The Migration That Took Down All Tenants
A schema migration on a shared table locked the database for 45 minutes during peak traffic. Every tenant experienced timeouts simultaneously.
Prevention: Use online schema change tools (pt-online-schema-change, gh-ost). Test migrations against production-scale datasets. Schedule maintenance windows.
The Noisy Neighbor Resource Exhaustion
One tenant’s batch job consumed 90% of CPU, causing latency spikes for all other tenants on the shared cluster.
Prevention: Enforce compute quotas per tenant. Set cgroup limits. Implement fair queuing for background jobs.
The Backup Restoration Accident
During a recovery drill, an on-call engineer accidentally restored a production backup to the staging environment. Staging used different credentials, but the restore still overwrote the shared tables there with production data from every tenant.
Prevention: Test restores on isolated systems. Use point-in-time recovery to isolate tenants. Maintain separate backup credentials per environment.
Trade-off Analysis
| Factor | Shared Schema | Separate Schema | Separate Database |
|---|---|---|---|
| Isolation | Low - RLS required | Medium - schema separation | High - complete isolation |
| Cost Efficiency | Highest | High | Low |
| Operational Complexity | Medium | High | Highest |
| Schema Migrations | Shared - risky | Per-tenant - complex | Per-tenant - complex |
| Query Performance | Requires tenant_id indexes | Good - implicit isolation | Best per-tenant |
| Customization | Limited | Schema-level customization | Full customization |
| Backup/Restore | All tenants together | Per schema | Per database |
| Regulatory Fit | General data | Segregated data | Strict isolation |
| Tenant Count | Thousands | Hundreds | Dozens |
| Failure Domain | All tenants share | Schema-level failures | Isolated per tenant |
Multi-Tenancy Isolation Architecture
graph TB
subgraph "Shared Infrastructure"
LB[Load Balancer]
App[Application Cluster]
Cache[(Shared Cache<br/>namespaced)]
end
subgraph "Data Layer Options"
direction LR
SharedSchema["Shared Schema<br/>tenant_id + RLS"]
SeparateSchema["Separate Schemas<br/>per tenant"]
SeparateDB["Separate Databases<br/>per tenant"]
end
subgraph "Isolation Boundaries"
direction TB
Network[Network Isolation<br/>VPC/Private Links]
Compute[Compute Quotas<br/>per tenant]
Storage[Storage Limits<br/>per tenant]
end
LB --> App
App --> Cache
App --> SharedSchema
App --> SeparateSchema
App --> SeparateDB
Network -.->|applies to| App
Compute -.->|applies to| App
Storage -.->|applies to| SharedSchema
For more on related topics, see Microservices Architecture, API Gateway Patterns, and Database Scaling.
Cost Optimization
Multi-tenancy’s appeal is cost efficiency. Make sure you are actually achieving it.
Shared Compute
A single application deployment handling thousands of tenants uses resources far more efficiently than thousands of single-tenant deployments.
The numbers: if single-tenancy needs 1GB RAM per tenant, 1000 tenants requires 1TB RAM. With multi-tenancy and proper resource sharing, you might handle the same workload with 64GB RAM.
Database Cost per Tenant
Track your database cost per active tenant. As tenants grow, you need to decide whether to scale up (bigger database) or scale out (more database instances).
def calculate_cost_per_tenant():
    monthly_db_cost = get_monthly_database_bill()
    active_tenants = get_active_tenant_count()
    if active_tenants == 0:
        return 0.0  # avoid dividing by zero when there are no active tenants
    return monthly_db_cost / active_tenants
If cost per tenant exceeds thresholds, investigate. Perhaps some tenants are outliers with unusual workloads. Perhaps your schema design needs optimization.
Metadata Tiering
Keep tenant metadata (billing info, subscription tier, settings) in a lightweight store. Full application data stays in your main database. This separation lets you scale metadata independently.
Quick Recap Checklist
- Multi-tenancy shares infrastructure for cost efficiency but requires strict isolation
- Schema strategies range from shared tables (tenant_id) to separate databases per tenant
- RLS and similar database features enforce isolation at the data layer
- Performance isolation requires quotas, resource limits, and monitoring
- Security isolation requires defense in depth across all layers
Setup checklist:
- Tenant context extraction from auth token
- Tenant ID validation on every request
- Row-level security enabled on databases
- Cache keys namespaced by tenant
- Per-tenant resource quotas defined
- Background jobs include tenant context
- Monitoring per tenant (not just aggregate)
- Quota alerts configured
- Cross-tenant access monitoring enabled
- Regular tenant isolation audit
Observability Checklist
Metrics:
- Queries per tenant (identify noisy neighbors)
- Database connection usage per tenant
- Cache hit rate per tenant
- API latency per tenant
- Quota utilization per tenant
Logs:
- Tenant ID logged on every security-relevant event
- Cross-tenant access attempts (should be zero)
- Quota violations with tenant context
- Migration progress per tenant
Alerts:
- Any cross-tenant data access attempts
- Tenant exceeding quota thresholds
- Database connection pool saturation
- Slow queries from specific tenants
- Cache invalidation failures
Security Checklist
- Row-level security enabled on all tenant-scoped tables
- Tenant ID cannot be user-supplied without validation
- Cache keys namespaced by tenant
- Background jobs include tenant validation
- Audit logging captures tenant context on all data access
- Network isolation for sensitive tenants (VPC/private links)
- Per-tenant encryption keys for sensitive data
- Access control respects tenant boundaries at every layer
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| tenant_id filter missing in query | Cross-tenant data exposure | Enable RLS at database level; audit queries in code review |
| One tenant consumes all connections | Other tenants cannot access database | Per-tenant connection limits; connection pool monitoring |
| Cache key collision between tenants | Tenant A sees Tenant B data | Use tenant-scoped cache keys; implement key prefixing |
| Background job processes wrong tenant | Data cross-contamination | Pass tenant_id explicitly; validate tenant context in every job |
| Schema migration fails on shared table | All tenants affected simultaneously | Test on sample tenants first; maintain backwards compatibility |
| Quota enforcement bug | One tenant monopolizes resources | Implement quota checks at multiple layers; monitor usage |
Common Pitfalls / Anti-Patterns
Query Accidents
Forgetting to filter by tenant_id is the most common bug. Use RLS or similar database-level enforcement to protect against this.
-- With RLS enabled this returns only the current tenant's rows, and fails if app.current_tenant is not set
SELECT * FROM orders;
Cache Invalidation
When tenant data changes, you must invalidate the correct cache keys. Use consistent key naming and always include tenant_id in cache invalidation logic.
Migration Runbook
When you need to alter a shared schema, you must:
- Test the migration against a representative sample of tenants
- Plan for backwards compatibility (old and new code running simultaneously)
- Have a rollback plan
- Execute during low-traffic windows
Schema changes on shared tables are risky. One bad migration affects all tenants simultaneously.
Accepting tenant_id from User Input
The user should never provide their own tenant_id. Extract it from authenticated context (JWT, session, OAuth token).
Shared Cache Without Namespacing
Redis is shared across tenants. Without key namespacing, one tenant’s entries can overwrite another’s or be served in their place.
Global State for Tenant Context
Using global variables for tenant context breaks under async/multithreaded execution. Use context variables or dependency injection.
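The difference shows up as soon as two requests interleave. In the sketch below, two asyncio tasks each set a context variable, yield to the event loop, and then read it back; the names `tenant_ctx` and `handle_request` are illustrative. Each task runs in its own copy of the context, so neither sees the other's value, whereas a module-level global would be overwritten by whichever task ran last.

```python
import asyncio
from contextvars import ContextVar
from typing import List

tenant_ctx: ContextVar[str] = ContextVar("tenant_ctx")


async def handle_request(tenant_id: str, delay: float) -> str:
    tenant_ctx.set(tenant_id)
    # Yield to the event loop so the other request runs in between.
    await asyncio.sleep(delay)
    # Each task has its own context copy, so this is still our tenant.
    return tenant_ctx.get()


async def main() -> List[str]:
    # gather wraps each coroutine in its own task and keeps argument order.
    return await asyncio.gather(
        handle_request("tenant-a", 0.02),
        handle_request("tenant-b", 0.01),
    )


results = asyncio.run(main())
print(results)  # ['tenant-a', 'tenant-b']
```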
Trusting Subdomain for Tenant Identification
A subdomain comes from the client-supplied Host header, so it can be spoofed. Always validate tenant identity from authenticated credentials.
Interview Questions
Multi-tenancy means one application instance serves multiple customers (tenants), with each tenant's data isolated from others. Single-tenancy gives each customer their own application and database instance.
The trade-off is cost versus isolation. Multi-tenancy is cheaper to operate (shared infrastructure, single deployment pipeline) but requires careful isolation to prevent data leakage between tenants. Single-tenancy provides stronger isolation naturally, but costs scale linearly with customers.
Shared schema with tenant_id: All tenants share tables, rows identified by tenant_id. Cheapest, hardest to isolate. Row-level security helps. Risk of noisy neighbor on indexes.
Separate schemas per tenant: Each tenant gets their own schema in the same database. Better isolation, reasonable cost. Schema migrations become more complex: you must run them per tenant or handle schema drift.
Separate databases per tenant: Each tenant gets their own database. Strongest isolation, highest cost. Best for tenants with strict compliance requirements or wildly different schema needs. Operational complexity explodes at scale.
Three common approaches, in order of security:
JWT claims: Extract tenant_id from the authenticated user's JWT token. Most secure: tenant identity comes from authentication, not the request. The application cannot be fooled by a spoofed subdomain or header.
Authenticated credentials: Query the database by authenticated user to find their tenant. Requires a database lookup but still trustworthy.
Subdomain: Extract tenant from tenant.myapp.com. Not trustworthy alone; subdomains can be spoofed. Use only as a hint, always validate against authenticated user.
The noisy neighbor problem: one tenant's workload saturates shared resources, degrading performance for all other tenants.
Prevention strategies: resource quotas per tenant (CPU, memory, IOPS limits), compute isolation for heavy tenants (separate instance or dedicated capacity), connection pooling limits per tenant, and aggressive auto-scaling triggers. At the database level, use connection limits per tenant and consider separate schemas or databases for tenants with consistent high load.
It depends on your schema strategy. For shared schema, migrations work normally: all tenants get the same schema changes. For separate schemas, you have two options: run migrations per tenant sequentially (slow for many tenants, risk of schema drift) or maintain a "gold schema" and apply migrations to each tenant in parallel.
For separate databases per tenant, migrations are isolated but operational complexity explodes. You need automation to run migrations across hundreds of tenant databases; consider tools like Terraform for orchestration or managed services that handle this.
Tenant isolation is the core concern. Ensure queries always include tenant_id filters; accidental omission leaks data. Use row-level security at the database level as a safety net even if application code is correct.
Network isolation matters: tenants on shared infrastructure can potentially sniff traffic meant for others if your VPC setup is wrong. Use private subnets, security groups scoped to tenants where possible, and encrypt data at rest and in transit.
Audit logging becomes critical: you need to know which tenant accessed what data and when, for compliance and debugging.
Separate databases make sense when you have tenants with dramatically different requirements: compliance mandates that demand physical isolation (HIPAA, GDPR), tenants who need custom schema modifications, or enterprise tenants willing to pay premium for dedicated resources and explicit isolation guarantees.
The operational cost is significant: you're now managing hundreds of database instances instead of one. Only pay this cost when the requirement is real, not hypothetical.
For shared schema: one backup covers all tenants. Recovery is all-or-nothing: you cannot restore one tenant's data without restoring everyone. Consider point-in-time recovery carefully; if one tenant needs a specific restore, you affect all tenants during the operation.
For separate schemas or databases: you can backup and restore per tenant. More operational complexity, but finer-grained recovery options. Some tenants may have compliance requirements for independent backup retention periods.
Per-tenant metrics are essential: tenant-level CPU, memory, database connections, API request counts, and error rates. This lets you identify noisy neighbors and spot tenants who are growing toward quota limits.
Aggregate metrics across the system tell you about overall health: total resource utilization, percentage of tenants at or above 80% of quota, number of tenants with elevated error rates. Watch for concentration: if 80% of your traffic comes from 3 tenants, those tenants are effectively your SLA.
Feature flags per tenant let you enable features for specific tenants without deploying separate code. Store tenant_id-to-feature mapping in a configuration store (database table, feature flag service, or config file).
When evaluating a feature flag, check the tenant context first - extract tenant from the authenticated request, then look up which features are enabled for that tenant. Cache aggressively since this runs on every request. Consider using a dedicated feature flag service like LaunchDarkly or Statsig that handles the tenant scoping natively.
Route tenants to region-specific databases based on their residency requirements. A European tenant gets a database in Frankfurt; an Australian tenant gets Sydney. The application layer stays shared while the data layer is regionalized.
Track each tenant's required region in their metadata. Database connection logic reads this and routes to the appropriate instance. This hybrid approach keeps the cost benefits of multi-tenancy while meeting GDPR, HIPAA, or data sovereignty requirements.
Without a connection pooler, a busy tenant can exhaust the PostgreSQL connection limit (typically 100-200 connections), leaving other tenants unable to connect. PgBouncer acts as a multiplexer: it holds a small pool of actual database connections and queues application requests against that pool.
In transaction-mode pooling, connections return to the pool after each commit, letting you serve far more tenants than the raw PostgreSQL limit allows. Pair this with per-tenant connection caps so no single tenant can monopolize the pool even under that system.
RLS lets you define policies that filter rows automatically based on the current session context. You set a session variable before queries run, and the policy filters all rows to only those matching that tenant - even if the application query omits the WHERE clause.
You enable it with ALTER TABLE ... ENABLE ROW LEVEL SECURITY, then define a policy checking tenant_id against a session variable. The tradeoff is slight query plan complexity and the requirement to set the session variable on every connection. The benefit is that isolation holds even when application code has bugs.
Rate limiting must be per-tenant, not per-IP or globally per-user. A tenant with 100 users should have a combined rate limit, not 100 separate individual limits that effectively bypass the quota.
Implement it in middleware before request processing. Use Redis to track request counts with fixed-window counters keyed by tenant_id and time bucket, with a TTL on each key so old buckets expire. When a tenant exceeds their limit, return 429 with a Retry-After header. Log quota violations - repeated ones usually indicate either a runaway client on the tenant side or a tenant who needs a higher plan.
In a shared schema, every index on a tenant-scoped table should include tenant_id as the first column. Without it, a query for one tenant's orders may have to scan rows belonging to every tenant before filtering, which is catastrophic at scale.
Composite indexes should be ordered (tenant_id, other_columns). Partial indexes help for common patterns. Monitor index size as tenant count grows: B-tree indexes on high-cardinality tenant_id columns stay manageable, but index bloat in very large shared tables can become a problem worth tracking.
GDPR requires you to export a tenant's complete data in a machine-readable format on request. In a shared schema this means querying every tenant-scoped table filtered by tenant_id and serializing to a standard format like JSON or CSV.
Run exports asynchronously - they can be slow for large tenants. Deliver via a time-limited secure download link. Include everything: user accounts, audit logs, settings, and custom data. Test exports regularly to verify completeness. The worst time to discover your export is missing a table is when a regulator asks.
Schema drift happens when some tenant schemas diverge from the canonical schema - usually because a migration succeeded for most tenants but failed or was skipped for a few. It's subtle because queries work for most tenants but fail or return wrong results for the affected ones.
Prevent it by tracking migration state per tenant in a central registry, alerting on migration failures, and auditing schema versions periodically. If drift is detected, treat affected tenants as blocked from the next migration until the discrepancy is resolved. Tools like Flyway or Liquibase support per-schema tracking out of the box.
Multi-tenancy complicates capacity planning because tenant activity is unpredictable and uneven. Instead of planning per-tenant, plan for peak concurrent load across your entire tenant base.
Key inputs: the 95th-percentile active tenant count at peak hours, average resource consumption per active tenant, and headroom for outlier tenants. Track resource consumption per tenant over time to identify growth trends. Autoscaling handles compute well, but databases scale more slowly, so pre-provision database capacity and use quotas to prevent any single tenant from hitting the ceiling before you can react.
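As a back-of-the-envelope illustration, the three inputs combine into a single capacity figure. The formula below is one simple way to combine them, not a standard model; the unit (for example database connections) and the headroom value are assumptions you would replace with your own measurements.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def required_capacity(
    active_tenant_samples: Sequence[int],
    avg_units_per_active_tenant: float,
    headroom: float = 0.3,
) -> int:
    """Capacity = p95 active tenants x average usage x (1 + headroom).

    active_tenant_samples: peak-hour counts of concurrently active tenants.
    avg_units_per_active_tenant: e.g. connections or CPU millicores per tenant.
    headroom: extra fraction reserved for outlier tenants (assumed value).
    """
    p95_active = percentile(active_tenant_samples, 95)
    return math.ceil(p95_active * avg_units_per_active_tenant * (1 + headroom))
```

For example, if the 95th-percentile sample is 95 active tenants, each using 2 units on average, a headroom of 0.5 gives 95 × 2 × 1.5 = 285 units.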
When a user authenticates, the auth service issues a JWT containing the user's ID, their tenant_id, and their permissions within that tenant. Application middleware validates the JWT signature and extracts tenant_id from the claims on every request.
This ties tenant identification to authentication cryptographically. The user cannot claim a different tenant without a valid token for that tenant. The flow is: login request, auth service validates credentials, issues JWT with tenant_id in claims, application extracts and trusts that claim. Never accept tenant_id from request parameters - always derive it from authenticated context.
Most search engines index documents globally. You need to ensure queries are always scoped to the current tenant, similar to how you enforce tenant_id in database queries.
Three approaches: per-tenant indexes (cleanest isolation, higher overhead at scale), a shared index with a mandatory tenant_id filter on every query (a leak risk if any query forgets the filter), or one alias per tenant pointing to a shared index (a middle ground, for example using Elasticsearch filtered aliases). Whichever approach you choose, validate that the search query includes a tenant filter before execution. A misconfigured query that leaks search results across tenants is a real risk and hard to detect.
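For the shared-index approach, one way to make the filter hard to forget is to build every request body through a single wrapper and check it before sending. The sketch below only constructs and inspects Elasticsearch-style query bodies as plain dicts. It does not call any client library, the helper names are hypothetical, and it assumes tenant_id is indexed as an exact-match (keyword) field.

```python
def scope_to_tenant(user_query: dict, tenant_id: str) -> dict:
    """Wrap an arbitrary query clause in a bool query with a tenant filter."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every search")
    return {
        "query": {
            "bool": {
                "must": [user_query],
                # Filter context: restricts results without affecting scoring.
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        }
    }


def has_tenant_filter(body: dict, tenant_id: str) -> bool:
    """Pre-execution check: is the expected tenant filter present at the top level?"""
    filters = body.get("query", {}).get("bool", {}).get("filter", [])
    if isinstance(filters, dict):
        filters = [filters]
    return {"term": {"tenant_id": tenant_id}} in filters
```

A search gateway would refuse to execute any body for which `has_tenant_filter` returns false, so a hand-written query that bypasses `scope_to_tenant` fails loudly instead of leaking results.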
Further Reading
- PostgreSQL Row Security Policies - Official documentation for configuring RLS in production
- PgBouncer Documentation - Connection pooling configuration reference, especially transaction-mode pooling
- AWS Multi-Tenant SaaS Storage Strategies - AWS guide comparing silo, bridge, and pool storage models
- GDPR Article 44 on Data Transfers - Legal basis for data residency requirements affecting SaaS tenants
- API Gateway Patterns - How to handle tenant routing and authentication at the API layer
- Database Scaling Strategies - When and how to scale your database layer as tenant count grows
- Microservices Architecture Roadmap - Multi-tenancy patterns in a microservices context, including service-level isolation
Conclusion
Multi-tenancy is the default choice for SaaS because shared infrastructure is genuinely cheaper to run and easier to maintain. The cost is the isolation work you have to do everywhere - in the database, in caches, in background jobs, and at every layer of the stack.
The schema choice drives most of the downstream architecture. Shared schema with tenant_id is the most common starting point and works well for most cases. Separate schemas give better isolation at the cost of migration complexity. Separate databases are for regulated industries or enterprise clients who need physical separation and are willing to pay for it.
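Row-level security at the database is your safety net. Even if application code forgets a tenant filter, RLS policies block cross-tenant reads and writes, provided the application connects as a role that does not bypass them. Enable it early, because retrofitting it later is painful.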
Performance isolation matters at scale. A single noisy tenant can saturate shared resources and degrade everyone else. Per-tenant quotas, connection limits, and monitoring are not optional - they are what makes shared infrastructure viable as tenant count grows.
Build observability into the system from day one. Per-tenant metrics let you spot problems before they escalate into incidents. The first time you catch a data leak before a customer does, you will be glad you instrumented this properly.