API Gateway: Single Entry Point for Microservices
Learn how API gateways work, when to use them, architecture patterns, failure scenarios, and implementation strategies for production microservices.
API Gateway: The Single Entry Point for Microservices Architecture
Introduction
An API gateway sits at the entrance of your backend services and handles everything that would otherwise clutter individual services or repeat across them: request routing, authentication, rate limiting, protocol translation. When a mobile app asks your product catalog for data, the gateway receives that request, checks the user’s session token, applies rate limiting rules, and forwards it to the product service — all before the actual service sees a single byte.
The benefit is simpler client code. Applications talk to one endpoint instead of tracking multiple service addresses, certificates, and authentication mechanisms. You also get a single place to enforce security policies, collect metrics, and aggregate responses from different services. For teams operating microservices at any real scale, a gateway stops being optional pretty quickly.
Core Concepts
Request Flow
sequenceDiagram
participant Client
participant Gateway as API Gateway
participant Auth as Auth Service
participant Catalog as Product Service
participant Order as Order Service
participant Cache as Redis Cache
Client->>Gateway: GET /api/products/123
Gateway->>Gateway: Extract JWT, Rate Limit Check
Gateway->>Auth: Validate Token
Auth-->>Gateway: Token Valid
Gateway->>Cache: Check Cache
Cache-->>Gateway: Cache Hit
Gateway-->>Client: Product JSON
Client->>Gateway: POST /api/orders
Gateway->>Gateway: Extract JWT, Rate Limit Check
Gateway->>Auth: Validate Token
Auth-->>Gateway: Token Valid
Gateway->>Catalog: Check Product Availability
Catalog-->>Gateway: Available
Gateway->>Order: Create Order
Order-->>Gateway: Order Created
Gateway-->>Client: Order Confirmation
Gateway Internal Components
graph TD
A[Client Request] --> B[TLS Termination]
B --> C[Authentication]
C --> D[Authorization]
D --> E[Rate Limiting]
E --> F[Request Routing]
F --> G[Service Discovery]
G --> H[Backend Service]
H --> F
F --> I[Response Aggregation]
I --> J[Metrics Collection]
J --> K[Client Response]
subgraph Security Layer
C
D
E
end
subgraph Routing Layer
F
G
end
Failure Flow
graph TD
A[Client Request] --> B{Gateway Available?}
B -->|No| C[Return 503 Service Unavailable]
B -->|Yes| D{Auth Passed?}
D -->|No| E[Return 401 Unauthorized]
D -->|Yes| F{Rate Limit OK?}
F -->|No| G[Return 429 Too Many Requests]
F -->|Yes| H{Backend Service Available?}
H -->|No| I[Return 502 Bad Gateway]
H -->|Yes| J{Request Valid?}
J -->|No| K[Return 400 Bad Request]
J -->|Yes| L[Forward to Service]
L --> M{Service Timeout?}
M -->|Yes| N[Return 504 Gateway Timeout]
M -->|No| O[Return Service Response]
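The decision order in this flow can be sketched as a small function. This is an illustrative sketch only: the name classifyRequest and the fields on ctx are hypothetical, and the 503 branch is left out because it is decided in front of the gateway (for example by a load balancer) before any gateway code runs.

```javascript
// Mirrors the failure flow: the first failing check decides the status code.
// ctx is a hypothetical object summarizing the outcome of each gateway stage.
function classifyRequest(ctx) {
  if (!ctx.authenticated) return 401; // Unauthorized
  if (!ctx.withinRateLimit) return 429; // Too Many Requests
  if (!ctx.backendAvailable) return 502; // Bad Gateway
  if (!ctx.requestValid) return 400; // Bad Request
  if (ctx.backendTimedOut) return 504; // Gateway Timeout
  return ctx.backendStatus; // pass the service response through
}

const healthy = {
  authenticated: true,
  withinRateLimit: true,
  backendAvailable: true,
  requestValid: true,
  backendTimedOut: false,
  backendStatus: 200,
};

console.log(classifyRequest(healthy)); // 200
console.log(classifyRequest({ ...healthy, withinRateLimit: false })); // 429
```

Because the checks run in order, a request that fails both authentication and rate limiting is reported as 401, which matches the diagram.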
When to Use / When Not to Use
| Scenario | Recommendation |
|---|---|
| Multiple backend services need unified access control | Use API Gateway |
| Mobile, web, and third-party clients consume the same APIs | Use API Gateway |
| You need centralized rate limiting and throttling | Use API Gateway |
| Service aggregation is required for client convenience | Use API Gateway |
| Single monolithic application with no external clients | Do NOT use API Gateway |
| Services are tightly coupled and share a deployment unit | Do NOT use API Gateway |
| Ultra-low latency is critical (gateway adds ~1-3ms) | Consider alternatives |
| Simple CRUD application with one or two services | Consider direct service calls |
When TO Use an API Gateway
- Unified client access: Your mobile app, web app, and third-party integrations all hit different services. Without a gateway, clients need to know about every service endpoint, certificate, and authentication mechanism.
- Shared authentication and authorization: You want a single place to validate JWTs, check permissions, and reject unauthorized requests before they reach your services.
- Rate limiting at the edge: You need to protect your services from traffic spikes, abusive clients, or accidental misconfiguration without adding this logic to every service.
- Protocol translation: Your mobile clients use REST, but your internal services might use gRPC or WebSocket. The gateway translates between them.
- Request aggregation: A mobile screen needs data from three different services. Without aggregation in the gateway, the client makes three separate calls with associated latency and complexity.
When NOT to Use an API Gateway
- Adding unnecessary hops: If your system is a simple monolith or a handful of tightly coordinated services, the gateway introduces latency without meaningful benefit.
- Routing internal traffic through it: Internal services behind the gateway often need to call each other directly. Forcing that service-to-service traffic through the gateway turns it into a bottleneck rather than a helper.
- Single-purpose applications: A data processing pipeline with no external clients does not need a gateway.
- Latency-sensitive paths: Every request going through the gateway adds 1-3ms. For extremely latency-sensitive use cases, this matters.
Rate Limiting Algorithms
Not all rate limiting works the same way. The algorithm you pick affects burst tolerance, memory usage, and how fairly limits get enforced across clients.
| Algorithm | How it works | Burst tolerance | Memory | Best for |
|---|---|---|---|---|
| Fixed Window | Count requests per fixed time window (e.g., 100/min) | High at window boundary | Low | Simple cases, approximate enforcement |
| Sliding Window Log | Store timestamps per request, count within rolling window | Accurate, no boundary burst | High | Exact enforcement, lower QPS APIs |
| Sliding Window Counter | Weighted interpolation between adjacent windows | Low | Low | Balance of accuracy and memory |
| Token Bucket | Tokens added at fixed rate; each request consumes one | Controlled bursts allowed | Low | APIs with bursty clients |
| Leaky Bucket | Requests queue up and process at fixed rate | No burst — queue or drop | Low | Smoothing traffic to backends |
The fixed window edge case
Fixed windows have a known problem: clients can effectively double their rate by sending requests at the end of one window and the start of the next. With a limit of 100 requests per minute:
Window 1 (0-60s): 90 requests at t=59s
Window 2 (60-120s): 90 requests at t=61s
Effective rate: 180 requests in 2 seconds, both windows satisfied
Sliding window approaches fix this. The log variant is exact but stores one entry per request. The counter variant approximates the sliding window using weights between two adjacent fixed windows — much cheaper on memory with acceptable accuracy.
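The weighting used by the counter variant can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, the two counters are passed in directly rather than read from a store, and the previous window's traffic is assumed to be evenly spread (which is what makes this an approximation).

```javascript
// Estimate the request count in the rolling window by weighting the previous
// fixed window by the fraction of it that still overlaps the rolling window.
function slidingWindowEstimate(prevCount, currCount, elapsedMs, windowMs) {
  const remainingMs = Math.max(0, windowMs - elapsedMs);
  return (prevCount * remainingMs) / windowMs + currCount;
}

// Allow the incoming request only if counting it keeps the estimate in limit.
function slidingWindowAllow(prevCount, currCount, elapsedMs, windowMs, limit) {
  const estimate = slidingWindowEstimate(prevCount, currCount, elapsedMs, windowMs);
  return estimate + 1 <= limit;
}

// Boundary burst scenario: 90 requests in the previous window, and we are
// 1 second into the current 60-second window.
console.log(slidingWindowEstimate(90, 0, 1000, 60000)); // 88.5
console.log(slidingWindowAllow(90, 10, 1000, 60000, 100)); // true
console.log(slidingWindowAllow(90, 11, 1000, 60000, 100)); // false
```

With a limit of 100, the client that sent 90 requests at the end of the previous window gets 11 more at the start of the next one, not another 90.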
Token bucket in practice
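Token bucket is the most common choice for API gateways. It allows short bursts up to bucket capacity while enforcing a long-term average rate. Here is a Redis-backed sketch, assuming an ioredis-style client named redis. The read and the write are separate commands, so two concurrent requests can race; for strict enforcement, move the same logic into a Lua script so it runs atomically: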
async function tokenBucketAllow(userId, maxTokens, refillRate) {
  const key = `rate:${userId}`;
  const now = Date.now();
  // hgetall returns an empty object for a missing key, so check the field itself
  const bucket = await redis.hgetall(key);
  const isNew = !bucket || bucket.tokens === undefined;
  const tokens = isNew ? maxTokens : parseFloat(bucket.tokens);
  const lastRefill = isNew ? now : parseFloat(bucket.lastRefill);
  // Refill tokens based on elapsed time (refillRate is tokens per second)
  const elapsed = (now - lastRefill) / 1000;
  const refilled = Math.min(maxTokens, tokens + elapsed * refillRate);
  if (refilled < 1) {
    // Seconds until one full token is available
    const retryAfter = Math.ceil((1 - refilled) / refillRate);
    return { allowed: false, retryAfter };
  }
  await redis.hset(key, { tokens: refilled - 1, lastRefill: now });
  await redis.expire(key, 3600);
  return { allowed: true, remaining: Math.floor(refilled - 1) };
}
The key thing: run this with Redis so all gateway instances share state. Local in-memory rate limiting breaks as soon as you scale past one instance.
Authentication Strategies at the Gateway
The gateway validates credentials so individual services do not have to. Each strategy has a different trade-off between revocation speed, overhead, and operational complexity.
| Strategy | How it works | Revocation | Overhead | Best for |
|---|---|---|---|---|
| API Keys | Static key in header or query string | Immediate (delete) | Very low | Machine-to-machine, third-party devs |
| JWT (stateless) | Signed token decoded locally at gateway | Requires blocklist | Very low | Internal services, short-lived tokens |
| OAuth 2.0 + JWT | Token from auth server, decoded or introspected | Via introspection | Medium | User-facing APIs |
| mTLS | Mutual TLS certificates both sides | CRL / OCSP | High | Service-to-service, regulated envs |
| Session tokens | Opaque token looked up in session store per request | Immediate | Medium | Traditional web apps |
The JWT revocation problem
Stateless JWT validation is fast because the gateway decodes the token locally without calling another service. The problem: you cannot revoke a JWT before it expires.
If a user logs out and you issued a JWT with a one-hour TTL, that token stays valid for up to an hour.
Two practical mitigations:
- Short TTL plus refresh tokens: Issue JWTs with 5-15 minute TTLs. Clients use a longer-lived refresh token to get new JWTs. The revocation window equals the TTL.
- Token blocklist in Redis: Store revoked token IDs (JTI claim) in Redis with a TTL matching the original JWT TTL. The gateway checks the blocklist on every request. Costs about 1ms per check.
For most applications, short TTLs with refresh tokens are the right call. Blocklists are worth adding if you need immediate revocation — compliance requirements, suspected credential compromise, or account suspension flows.
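The blocklist check can be sketched as follows. This is a hedged illustration, not a definitive implementation: an in-memory Map stands in for Redis so the example runs without a server, the class and function names are hypothetical, and the claims object is assumed to come from a token whose signature has already been verified.

```javascript
// In-memory stand-in for a Redis blocklist (assumption for this sketch).
// With Redis, revoke would be a SET with an expiry matching the token's exp.
class TokenBlocklist {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.entries = new Map(); // jti -> expiry time in ms
  }

  // Keep the entry only until the token would have expired anyway.
  revoke(jti, expSeconds) {
    this.entries.set(jti, expSeconds * 1000);
  }

  isRevoked(jti) {
    const expiresAt = this.entries.get(jti);
    if (expiresAt === undefined) return false;
    if (expiresAt <= this.now()) {
      this.entries.delete(jti); // token is expired; the entry is no longer needed
      return false;
    }
    return true;
  }
}

// Gateway-side check on already-verified claims (exp is in seconds).
function isTokenAcceptable(claims, blocklist, nowMs) {
  if (claims.exp * 1000 <= nowMs) return false; // expired
  return !blocklist.isRevoked(claims.jti);
}
```

Expiring blocklist entries alongside the token keeps the store from growing without bound, which is why the entry TTL matches the JWT TTL.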
Backend for Frontend (BFF) Pattern
A Backend for Frontend (BFF) is a specialized gateway instance tailored to a specific client type. Instead of one generic gateway that mobile, web, and partner clients all share, you build separate gateways per client.
graph TD
A[Mobile App] --> B[Mobile BFF]
C[Web App] --> D[Web BFF]
E[Partner API] --> F[Partner Gateway]
B --> G[Product Service]
B --> H[Order Service]
D --> G
D --> H
D --> I[Recommendation Service]
F --> G
BFFs solve the problem of one gateway trying to serve every client’s needs. Mobile apps typically want smaller payloads, fewer fields, and different aggregation than web apps. Without BFF, you end up with a bloated general-purpose gateway that either handles every possible client requirement or pushes aggregation logic into the clients themselves.
| Approach | Complexity | Flexibility | Team ownership |
|---|---|---|---|
| Single gateway | Low | Limited | Centralized platform team |
| BFF per client type | Medium | High | Per-client teams own their BFF |
| BFF per team | High | Very high | Full autonomy, but risk of duplication |
BFF works well when:
- Different client types have significantly different data requirements
- Teams have clear ownership boundaries (mobile team, web team, partner integrations team)
- You have enough traffic to justify separate deployments
It adds complexity when:
- Teams are small and one group would own multiple BFFs
- Clients have mostly overlapping requirements
- Deployment automation is not already mature
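The per-client shaping a BFF performs can be sketched with two projection functions. Everything here is hypothetical and for illustration only: the product record, its field names, and the function names are assumptions, not part of any real service.

```javascript
// Hypothetical product record, as a product service might return it.
const product = {
  id: "p-123",
  name: "Trail Shoe",
  priceCents: 8900,
  description: "Long marketing description...",
  images: ["https://example.com/a.jpg", "https://example.com/b.jpg"],
  specs: { weightGrams: 280, drop: "6mm" },
};

// Mobile BFF: small payload, one thumbnail, no long text.
function toMobileProduct(p) {
  return {
    id: p.id,
    name: p.name,
    priceCents: p.priceCents,
    thumbnail: p.images.length > 0 ? p.images[0] : null,
  };
}

// Web BFF: full record plus data aggregated from another service.
function toWebProduct(p, recommendations) {
  return { ...p, recommendations };
}

console.log(Object.keys(toMobileProduct(product)).length); // 4
console.log(toWebProduct(product, ["p-456"]).recommendations); // [ 'p-456' ]
```

Each projection lives in the BFF owned by the corresponding client team, so the product service itself stays unaware of client-specific payload shapes.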
Production Failure Scenarios
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| Gateway instance crash | All traffic fails | Run multiple gateway instances behind load balancer; health checks detect failures |
| Backend service timeout | Client hangs indefinitely | Set aggressive timeouts (e.g., 5s); circuit breaker returns error immediately |
| Auth service unavailable | No requests can be validated | Cache JWT validation results with short TTL; fail closed by default and fail open only on explicitly low-risk routes, with alerting |
| Rate limiter memory exhaustion | Rate limiting fails open | Use Redis-backed rate limiting; set hard limits on memory per tenant |
| Gateway misconfiguration | All traffic routing incorrectly | Use version-controlled config; canary deployments for config changes |
| SSL/TLS certificate expiry | HTTPS requests fail | Automate certificate renewal (Let’s Encrypt); alert 30 days before expiry |
| Service discovery returns stale IPs | Requests go to dead instances | Use short TTL in service registry; health checks remove unhealthy instances |
| Request payload too large | Memory exhaustion on gateway | Set max request size limits; reject oversized payloads early |
Common Pitfalls / Anti-Patterns
Pitfall 1: Gateway as a Monolith Proxy
Problem: Teams sometimes build the gateway to contain significant business logic, transforming it into another monolith that mirrors the old system.
Solution: Keep the gateway thin. It should handle cross-cutting concerns (auth, routing, rate limiting) but delegate business logic to the appropriate backend services. If you find yourself writing if (user.plan === 'enterprise') { ... } in the gateway, that is a sign business logic is leaking into the gateway.
Pitfall 2: No Circuit Breaker on Backend Calls
Problem: A slow or failing backend service causes requests to pile up at the gateway, eventually exhausting gateway resources and taking down the entire system.
Solution: Always wrap backend service calls with circuit breakers. When a backend error rate exceeds a threshold, the circuit opens and immediately returns an error rather than waiting for timeouts.
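const axios = require("axios");
const CircuitBreaker = require("opossum");

// Never do this: no timeout, no circuit breaker
const unsafeResponse = await axios.get(`${BACKEND_URL}/data`);

// Prefer this: the call is wrapped, timed out, and tripped on a high error rate
const circuit = new CircuitBreaker((url) => axios.get(url, { timeout: 3000 }), {
  timeout: 3000, // fail the call after 3s
  errorThresholdPercentage: 50, // open the circuit at a 50% error rate
});
const response = await circuit.fire(`${BACKEND_URL}/data`);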
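To make the open, half-open, and closed states concrete, here is a minimal hand-rolled breaker. It is a simplified sketch, not how any particular library works: it trips on a count of consecutive failures rather than an error percentage, and the class name, option names, and injected clock are assumptions made for illustration.

```javascript
class SimpleBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
    this.state = "closed";
    this.failures = 0;
    this.openedAt = 0;
  }

  // Should the gateway attempt the backend call right now?
  canRequest() {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open"; // allow trial traffic after the cool-down
    }
    return this.state !== "open";
  }

  recordSuccess() {
    this.state = "closed";
    this.failures = 0;
  }

  recordFailure() {
    this.failures += 1;
    // A failure during a trial, or too many failures in a row, opens the circuit.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

A caller checks canRequest() before each backend call, returns an error immediately when it is false, and reports the outcome with recordSuccess() or recordFailure() otherwise.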
Pitfall 3: Stale Service Discovery
Problem: The gateway caches service endpoints that have changed (scale-down, failures), causing requests to go to dead instances.
Solution: Use short TTLs in service discovery (30 seconds or less), implement health checks that remove unhealthy instances immediately, and have backend services register/deregister dynamically with the service registry.
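A short-TTL endpoint cache with health-based eviction might look like the sketch below. The class name, the synchronous lookup function, and the endpoint strings are all assumptions for illustration; a real registry client would be asynchronous.

```javascript
class EndpointCache {
  // lookup: hypothetical function(serviceName) returning an array of "host:port"
  constructor(lookup, { ttlMs = 30000, now = () => Date.now() } = {}) {
    this.lookup = lookup;
    this.ttlMs = ttlMs;
    this.now = now;
    this.cache = new Map(); // serviceName -> { endpoints, fetchedAt }
  }

  // Return cached endpoints while fresh; otherwise ask the registry again.
  resolve(service) {
    const entry = this.cache.get(service);
    if (entry && this.now() - entry.fetchedAt < this.ttlMs) {
      return entry.endpoints;
    }
    const fresh = { endpoints: [...this.lookup(service)], fetchedAt: this.now() };
    this.cache.set(service, fresh);
    return fresh.endpoints;
  }

  // Called by a health check: drop a dead instance without waiting for the TTL.
  markUnhealthy(service, endpoint) {
    const entry = this.cache.get(service);
    if (entry) {
      entry.endpoints = entry.endpoints.filter((e) => e !== endpoint);
    }
  }
}
```

The TTL bounds how long a stale address can survive, while markUnhealthy covers the window in between refreshes.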
Pitfall 4: Authentication Bypass via Direct Service Access
Problem: Backend services are accessible directly without going through the gateway, bypassing all authentication and rate limiting.
Solution: Use network-level isolation (private VPC/subnet) so backend services are only reachable via the gateway. Never expose backend service ports to the public internet.
Pitfall 5: Rate Limiting Without Global State
Problem: Running multiple gateway instances with local (in-memory) rate limiting allows clients to get max_requests × instance_count by spreading requests across instances.
Solution: Use a shared rate limiting store (Redis) that all gateway instances consult. This ensures consistent enforcement regardless of which instance handles the request.
Real-world Failure Scenarios
This section summarizes incidents and post-mortems from production systems using API gateways, illustrating how failures manifest and how teams responded.
| Incident | What Happened | Root Cause | Resolution |
|---|---|---|---|
| Cloudflare API outage (2022) | Edge API endpoints returned 502 errors for ~30 minutes | A misconfigured authentication module in their API gateway layer rejected valid requests after a rule deployment | Rollback of gateway configuration rules; staged deployment process with canary testing introduced |
| AWS API Gateway throttling cascade (2020) | Downstream services saw traffic spikes as clients retried after hitting rate limits | Clients received 429 errors, retried immediately, and amplified traffic 3-5x | Implemented exponential backoff with jitter on retry logic; added client-side rate limit awareness |
| Stripe gateway timeout chain | Payment processing API returned timeouts during peak traffic | Gateway had 30s default timeouts; a slow downstream auth service caused timeout cascades | Reduced gateway timeouts to 5s; implemented circuit breakers with fallback responses |
| GitHub API gateway misroute | Internal services received requests with wrong routing headers | A gateway configuration deployment caused routes to be incorrectly rewritten | Configuration validation pipeline added before deployments; route testing in staging |
| Netflix API gateway split-brain | Some API requests succeeded, others failed during regional failover | Gateway instances were not synchronized during failover, serving stale routing tables | Implemented consistent hashing for route lookups; session affinity during failover |
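The retry fix named in the throttling row, exponential backoff with jitter, can be sketched on the client side as follows. This is an illustrative sketch using the "full jitter" variant; the function names, default values, and the injectable random source are assumptions.

```javascript
// Full jitter: pick a random delay between 0 and an exponentially growing,
// capped ceiling, so retrying clients do not all come back at the same moment.
function backoffDelayMs(attempt, { baseMs = 100, capMs = 10000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// When a 429 carries a Retry-After value (in seconds), never retry sooner.
function retryDelayMs(attempt, retryAfterSeconds, options) {
  const jittered = backoffDelayMs(attempt, options);
  if (Number.isFinite(retryAfterSeconds)) {
    return Math.max(retryAfterSeconds * 1000, jittered);
  }
  return jittered;
}

// Deterministic demonstration with a fixed "random" value of 0.5.
const half = () => 0.5;
console.log(backoffDelayMs(0, { random: half })); // 50
console.log(backoffDelayMs(3, { random: half })); // 400
console.log(backoffDelayMs(20, { random: half })); // 5000
```

Immediate retries multiply traffic exactly when the gateway is already rejecting it; spreading retries over a growing random interval is what breaks that feedback loop.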
How Incident Response Changes with an API Gateway
When an API gateway is in the critical path, gateway-level incidents affect all downstream services simultaneously. This changes incident response in several ways:
- You cannot isolate the failure at the service level: if the gateway is down or misbehaving, services cannot be reached regardless of their individual health.
- Gateway logs are your first signal: request logs at the gateway show where failures occur, whether at TLS termination, auth validation, rate limiting, routing, or backend timeout.
- Rollback vs. fix: gateway configuration changes should be immediately reversible. If a bad config deployment causes an outage, the fastest recovery is reverting the gateway config, not fixing downstream services.
- Health check endpoints matter: the /health and /ready endpoints on the gateway determine load balancer routing. If these endpoints report healthy during a partial failure, traffic continues routing to failing gateway instances.
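One way to keep a readiness endpoint honest during partial failures is to derive it from dependency checks rather than returning a constant. The sketch below is an assumption-laden illustration: the function name, the shape of the checks object, and the dependency names are hypothetical.

```javascript
// checks: hypothetical map of dependency name -> boolean health result.
// Any failing dependency makes the instance report "not ready" with a 503,
// so the load balancer stops routing to it.
function readiness(checks) {
  const failing = Object.keys(checks).filter((name) => !checks[name]);
  const ready = failing.length === 0;
  return {
    statusCode: ready ? 200 : 503,
    body: { status: ready ? "ready" : "not_ready", failing },
  };
}

console.log(readiness({ redis: true, routesLoaded: true }).statusCode); // 200
console.log(readiness({ redis: false, routesLoaded: true }).statusCode); // 503
```

A liveness endpoint can stay trivial (the process is up), while readiness reflects whether this instance can actually serve traffic.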
Trade-off Analysis
| Factor | With API Gateway | Without API Gateway |
|---|---|---|
| Latency | +1-3ms per request | Baseline |
| Consistency | Centralized auth/rate limiting | Duplicated per service |
| Cost | Gateway instances + operation | No additional cost |
| Complexity | Centralized logic, single config | Distributed logic, multiple configs |
| Operability | Single point to monitor | Monitor each service separately |
| Client complexity | Low (one endpoint) | High (manage multiple endpoints) |
| Debugging | Single point to trace | Trace across multiple services |
| Single point of failure | Yes, unless highly available | No (but more complex clients) |
| Flexibility | Limited by gateway capabilities | Full flexibility per service |
Gateway vs Service Mesh
| Aspect | API Gateway | Service Mesh |
|---|---|---|
| Layer | L7 (Application) | L4/L7 (Transport + Application) |
| Scope | North-South traffic (client to service) | East-West traffic (service to service) |
| Typical Users | Platform teams, API product teams | DevOps, SRE teams |
| Features | Auth, routing, aggregation, protocol translation | mTLS, retries, circuit breaking |
| Deployment | Sits at edge | Sidecar proxies on each service |
For most architectures, you need both. The API gateway handles external client traffic while a service mesh handles internal service-to-service communication. See Service Mesh for a deep dive.
Capacity Estimation
Assumptions
- Average request size: 2 KB
- Average response size: 16 KB
- Peak QPS: 10,000 requests/second
- Average response time target: 50ms (gateway overhead: 3ms)
Gateway Instance Calculation
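Required instances = ceil((Peak QPS × Safety Factor) / Max Throughput per Instance)
Where:
- Peak QPS = 10,000
- Safety Factor = 2x
- Max Throughput per Instance = 2,000 QPS (typical for 2 vCPU instance)
Required = (10,000 × 2) / 2,000 = 10 instances (never fewer than 2, for HA)
Average latency does not change the instance count directly. It sets the number of in-flight requests (Little's law): 10,000 QPS × 0.05s = 500 concurrent requests across the fleet, which connection limits and buffers must absorb.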
Network Bandwidth
Inbound: 10,000 QPS × 2 KB = 20 MB/s = 160 Mbps
Outbound: 10,000 QPS × 16 KB = 160 MB/s = 1.28 Gbps
Total network required: ~1.5 Gbps
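The bandwidth arithmetic can be reproduced with a short helper, which makes it easy to rerun with different assumptions. This is a sketch: the function and field names are hypothetical, decimal units are assumed (1 KB = 1,000 bytes), and the in-flight figure is an extra derived from the stated QPS and latency via Little's law.

```javascript
function estimateCapacity({ peakQps, requestKb, responseKb, avgLatencyMs }) {
  // KB per second * 8 bits = kilobits per second; divide by 1000 for Mbps.
  const inboundMbps = (peakQps * requestKb * 8) / 1000;
  const outboundMbps = (peakQps * responseKb * 8) / 1000;
  return {
    inboundMbps,
    outboundMbps,
    totalGbps: (inboundMbps + outboundMbps) / 1000,
    // Little's law: concurrent requests = arrival rate * time in system.
    inFlightRequests: (peakQps * avgLatencyMs) / 1000,
  };
}

console.log(
  estimateCapacity({ peakQps: 10000, requestKb: 2, responseKb: 16, avgLatencyMs: 50 }),
);
// { inboundMbps: 160, outboundMbps: 1280, totalGbps: 1.44, inFlightRequests: 500 }
```

The 1.44 Gbps result is what the text rounds up to roughly 1.5 Gbps.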
Memory (per instance with 2 vCPU)
Connection buffers: 256 MB
Rate limiting state (Redis): Shared across instances
Application heap: 512 MB
Operating system: 256 MB
Total per instance: ~1 GB RAM
Operational Checklists
Quick Recap Checklist
- An API gateway provides a single entry point for all client requests, handling auth, routing, rate limiting, and protocol translation.
- Use API gateways when you have multiple services, diverse clients, or need centralized security policy enforcement.
- Avoid gateways when latency is critical, for simple single-service applications, or when the overhead outweighs benefits.
- Always implement circuit breakers, proper timeouts, and health checks when calling backend services.
- Run multiple gateway instances behind a load balancer to avoid single points of failure.
- Log structured data (request ID, latency, status) for debugging; emit metrics for alerting.
- Pick your rate limiting algorithm based on burst tolerance requirements — token bucket works for most cases.
- JWT revocation requires either short TTLs with refresh tokens or a Redis-backed blocklist.
- BFF pattern is worth adding when different client types have significantly different data needs.
Observability Checklist
Metrics to Capture
- gateway_requests_total (counter): Total requests by route, status code
- gateway_request_duration_seconds (histogram): Latency by route, percentile bands
- gateway_active_connections (gauge): Current concurrent connections
- gateway_rate_limit_exceeded_total (counter): Rate limit violations by client
- gateway_backend_errors_total (counter): Backend service errors by service
- gateway_circuit_breaker_state (gauge): Circuit breaker state by backend
Logs to Emit
Each request should emit structured JSON logs:
{
  "timestamp": "2026-03-23T10:15:30.123Z",
  "requestId": "550e8400-e29b-41d4-a716-446655440000",
  "method": "GET",
  "path": "/api/products/123",
  "statusCode": 200,
  "latencyMs": 12,
  "clientIp": "203.0.113.42",
  "userAgent": "MobileApp/2.1",
  "userId": "usr_abc123",
  "rateLimitRemaining": 87,
  "backendService": "product-service",
  "backendLatencyMs": 8
}
Alerts to Configure
| Alert | Threshold | Severity |
|---|---|---|
| P99 latency > 100ms | 100ms for 5 minutes | Warning |
| P99 latency > 500ms | 500ms for 1 minute | Critical |
| Error rate > 1% | 1% for 5 minutes | Warning |
| Error rate > 5% | 5% for 1 minute | Critical |
| Rate limit violations spike | > 1000/min from single IP | Warning |
| Backend service unavailable | Any backend down > 30s | Critical |
| Certificate expiry < 30 days | Any cert expiring soon | Warning |
Distributed Tracing
The gateway must propagate trace context to backend services:
// Propagate trace headers to backend services
const traceHeaders = {
  "X-Request-ID": req.id,
  "X-B3-TraceId": req.headers["x-b3-traceid"],
  "X-B3-SpanId": req.headers["x-b3-spanid"],
  "X-B3-Sampled": req.headers["x-b3-sampled"],
};
await axios.get(`${SERVICE_URL}/products/${id}`, {
  headers: { ...traceHeaders, Authorization: req.headers.authorization },
});
Security Checklist
- TLS 1.2+ termination with modern cipher suites
- JWT validation with proper signature verification
- Rate limiting configured per-client (IP, API key, user ID)
- Request size limits to prevent payload amplification
- Input validation on all request parameters
- Output encoding to prevent XSS in responses
- CORS policy properly configured
- Security headers (HSTS, CSP, X-Frame-Options)
- Audit logging for all authentication/authorization failures
- API key rotation mechanism
- Deprecation notices for older API versions
- Penetration testing performed annually
- DDoS protection at edge (Cloudflare, AWS Shield)
- Backend services unreachable directly (only via gateway)
Implementation Example (Node.js)
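Here is a minimal API gateway sketch using Express. It shows the moving parts, but the rate limiter uses the default in-memory store, so swap in a Redis-backed store before running more than one instance: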
const crypto = require("crypto"); // for crypto.randomUUID()
const express = require("express");
const axios = require("axios");
const rateLimit = require("express-rate-limit");
const jwt = require("jsonwebtoken");

const app = express();

// Configuration
const PORT = process.env.PORT || 3000;
const AUTH_SERVICE_URL =
  process.env.AUTH_SERVICE_URL || "http://auth-service:8080";
const PRODUCT_SERVICE_URL =
  process.env.PRODUCT_SERVICE_URL || "http://product-service:8080";
const ORDER_SERVICE_URL =
  process.env.ORDER_SERVICE_URL || "http://order-service:8080";
// Middleware: Parse JSON with size limit
app.use(express.json({ limit: "1mb" }));

// Middleware: Request ID for tracing
app.use((req, res, next) => {
  req.id = crypto.randomUUID();
  res.setHeader("X-Request-ID", req.id);
  next();
});

// Middleware: Rate limiting (in-memory store here; Redis-backed in production)
const limiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 100, // 100 requests per minute per IP
  // A function is needed here so the body can include the request ID
  message: (req) => ({ error: "Too many requests", requestId: req.id }),
  standardHeaders: true,
  legacyHeaders: false,
});
app.use("/api/", limiter);
// Middleware: Authentication
async function authenticate(req, res, next) {
  const token = req.headers.authorization?.replace("Bearer ", "");
  if (!token) {
    return res
      .status(401)
      .json({ error: "Missing authorization token", requestId: req.id });
  }
  try {
    // In production, use a distributed cache for validation results.
    // Pin the accepted algorithm (HS256 assumed here for a shared secret).
    const decoded = jwt.verify(token, process.env.JWT_SECRET, {
      algorithms: ["HS256"],
    });
    req.user = decoded;
    next();
  } catch (error) {
    return res.status(401).json({ error: "Invalid token", requestId: req.id });
  }
}
// Middleware: Authorization
function authorize(...allowedRoles) {
  return (req, res, next) => {
    if (!req.user || !allowedRoles.includes(req.user.role)) {
      return res
        .status(403)
        .json({ error: "Insufficient permissions", requestId: req.id });
    }
    next();
  };
}

// Health check endpoint
app.get("/health", (req, res) => {
  res.json({ status: "healthy", timestamp: new Date().toISOString() });
});
// Route: Product catalog with circuit breaker
const CircuitBreaker = require("opossum");
const productCircuit = new CircuitBreaker(
  // The request ID is passed in explicitly; req is not in scope here
  async (productId, requestId) => {
    const response = await axios.get(
      `${PRODUCT_SERVICE_URL}/products/${productId}`,
      {
        timeout: 5000,
        headers: { "X-Request-ID": requestId },
      },
    );
    return response.data;
  },
  {
    timeout: 5000,
    errorThresholdPercentage: 50,
    resetTimeout: 30000,
  },
);
// Events are notifications only; they do not change what fire() returns
productCircuit.on("open", () => console.warn("product-service circuit opened"));
productCircuit.on("timeout", () => console.warn("product-service call timed out"));

app.get("/api/products/:id", authenticate, async (req, res) => {
  try {
    const product = await productCircuit.fire(req.params.id, req.id);
    res.json(product);
  } catch (error) {
    // Reached on backend errors, timeouts, and while the circuit is open
    res.status(502).json({ error: "Bad gateway", requestId: req.id });
  }
});
// Route: Create order (aggregates product and order services)
app.post(
  "/api/orders",
  authenticate,
  authorize("user", "admin"),
  async (req, res) => {
    const { productId, quantity } = req.body;
    try {
      // Check product availability
      const productResponse = await axios.get(
        `${PRODUCT_SERVICE_URL}/products/${productId}`,
        { timeout: 3000 },
      );
      if (!productResponse.data.available) {
        return res.status(400).json({ error: "Product not available" });
      }
      // Create order
      const orderResponse = await axios.post(
        `${ORDER_SERVICE_URL}/orders`,
        { productId, quantity, userId: req.user.id },
        { timeout: 5000 },
      );
      res.status(201).json(orderResponse.data);
    } catch (error) {
      if (error.code === "ECONNABORTED") {
        return res.status(504).json({ error: "Gateway timeout" });
      }
      res.status(502).json({ error: "Failed to create order" });
    }
  },
);
// Error handling middleware
app.use((err, req, res, next) => {
  console.error(`[${req.id}] Unhandled error:`, err);
  res.status(500).json({ error: "Internal server error", requestId: req.id });
});

app.listen(PORT, () => {
  console.log(`API Gateway listening on port ${PORT}`);
});
Docker Compose for Local Development
version: "3.8"
services:
  api-gateway:
    build: ./api-gateway
    ports:
      - "3000:3000"
    environment:
      - JWT_SECRET=your-secret-key
      - AUTH_SERVICE_URL=http://auth-service:8080
      - PRODUCT_SERVICE_URL=http://product-service:8080
      - ORDER_SERVICE_URL=http://order-service:8080
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  auth-service:
    image: your-auth-service-image
    ports:
      - "8080:8080"
  product-service:
    image: your-product-service-image
    ports:
      - "8081:8080"
  order-service:
    image: your-order-service-image
    ports:
      - "8082:8080"
Interview Questions
What is an API gateway, and why use one?
An API gateway provides a single entry point for all client requests to backend services. It handles cross-cutting concerns that would otherwise be duplicated across services: authentication, authorization, rate limiting, request routing, protocol translation, and observability.
Without a gateway, clients must know about every service endpoint, manage authentication for each, and handle the complexity of calling multiple services. The gateway simplifies client code and gives you a central place to enforce policies.
Why use Redis-backed rate limiting instead of in-memory rate limiting?
In-memory rate limiting lets each gateway instance enforce its own limit independently. A user could make N requests per server. With 10 servers, they effectively get 10N requests. This defeats the purpose of rate limiting.
Redis-backed rate limiting uses shared global state. All gateway instances consult the same Redis counter, ensuring consistent enforcement regardless of which instance handles the request.
Redis also provides atomic primitives: INCR updates a counter without race conditions, and running it together with EXPIRE inside a Lua script or a MULTI/EXEC transaction ensures a counter is never left without an expiry. The latency cost (1-2ms) is acceptable for most gateway use cases.
What are the common gateway failure scenarios, and how do you mitigate them?
Gateway instance crash: all traffic fails. Mitigate by running multiple instances behind a load balancer with health checks. Gateway instances should be stateless — store session state in Redis, not local memory.
Backend service timeout: requests pile up waiting. Mitigate with aggressive timeouts (5 seconds or less) and circuit breakers. Backend service unavailable: return 502 Bad Gateway immediately rather than waiting. Auth service unavailable: cache JWT validation with short TTL and fail closed by default; reserve fail-open (with alerting) for routes where availability outweighs the security risk.
SSL/TLS certificate expiry: all HTTPS requests fail. Automate certificate renewal with Let's Encrypt or similar. Alert 30 days before expiry.
How much latency does an API gateway add, and when does it matter?
Typical overhead is 1-3ms per request. For most web applications, where backend services respond in tens to hundreds of milliseconds, this is negligible. The gateway's TLS termination, authentication checks, and routing add up to a small fraction of total latency.
For ultra-low-latency applications (high-frequency trading, real-time gaming), 1-3ms matters. In these cases, consider whether a gateway is necessary or if clients can call services directly with appropriate SDKs.
The gateway can actually reduce latency in some cases: response caching eliminates backend calls, and connection pooling to backends amortizes connection setup costs.
How do you secure an API gateway?
TLS termination at the gateway with modern cipher suites only. JWT validation with signature verification before forwarding requests. Network isolation: backend services must be unreachable directly from the internet, accessible only through the gateway.
DDoS protection at the edge — Cloudflare, AWS Shield, or similar. Rate limiting prevents abuse. Request size limits prevent payload amplification attacks. Input validation prevents injection attacks. Audit logging for authentication and authorization failures for compliance.
How does a gateway handle protocol translation?
The gateway can translate between client-facing protocols (REST, GraphQL) and internal protocols (gRPC, WebSocket). It handles content type negotiation via Accept headers, translating between XML and JSON if needed.
Protocol translation lives at the gateway, not in backend services. Backend services speak their native protocol; the gateway translates. This keeps backend services simple while supporting diverse client needs.
How do you estimate gateway capacity?
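Calculate required instances: (Peak QPS × Safety Factor) / Maximum Throughput per Instance, rounded up. A 2 vCPU instance handles roughly 2,000 QPS for a typical gateway workload, so 10,000 peak QPS with a 2x safety factor needs 10 instances. Average latency determines concurrency rather than instance count: 10,000 QPS × 50ms is about 500 in-flight requests. Always deploy a minimum of 2 instances for high availability.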
Network bandwidth matters: with 2 KB requests and 16 KB responses, outbound traffic is 8x inbound. Memory sizing: ~1 GB per instance covers buffers, application heap, and OS overhead. Plan for failover capacity: during an instance failure, the remaining instances must handle full traffic.
What is the difference between a generic API gateway and a BFF?
A generic API gateway serves all client types through a single instance with unified routing and aggregation logic. A BFF is a specialized gateway instance tailored to a specific client type — mobile, web, or partner API — each with its own data requirements, payload shapes, and aggregation patterns.
Use BFF when mobile apps need smaller payloads with different fields than web apps, when teams have clear ownership boundaries (mobile team vs. web team), and when you have enough traffic to justify separate deployments. It adds complexity but gives per-client teams full autonomy over their gateway logic.
What is the difference between rate limiting and throttling?
Rate limiting enforces a hard cap on the number of requests a client can make in a time window — excess requests get rejected with 429. Throttling smooths out traffic by queuing or slowing requests rather than dropping them outright.
At the gateway layer, rate limiting is the primary mechanism — it is simple to implement and gives clear pass/fail signals. Throttling is less common at the gateway because queued requests still hold gateway resources. Some gateways implement "delayed rejection" throttling where requests wait briefly before being rejected.
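A hard-cap rate limiter of the kind described above can be sketched as a fixed-window counter. This is an in-process illustration only: the class name and the injectable `clock` parameter are assumptions for the example, and a multi-instance gateway would keep the counters in a shared store so every instance enforces the same limit.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allows at most `limit` requests per client in each time window."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # client id -> (window index, requests counted in that window)
        self._counts: Dict[str, Tuple[int, int]] = {}

    def check(self, client_id: str) -> int:
        """Return the HTTP status the gateway would send: 200 or 429."""
        current = int(self.clock() // self.window)
        index, count = self._counts.get(client_id, (current, 0))
        if index != current:
            # A new window has started, so the counter resets.
            index, count = current, 0
        if count >= self.limit:
            return 429  # over the cap: reject instead of queuing
        self._counts[client_id] = (index, count + 1)
        return 200
```

Keying the limiter by tenant ID or API key instead of a client IP gives per-tenant or per-key limits without changing the logic.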
TLS termination at the gateway (edge termination) is the standard approach: the client's TLS session ends at the gateway, and the gateway communicates with backend services over internal plaintext or mTLS. This reduces cryptographic overhead at scale and centralizes certificate management.
Re-encrypting for backend calls (mTLS between gateway and services) adds security for sensitive traffic but increases CPU overhead. For low-security internal networks, plaintext backend communication is acceptable as long as network isolation prevents direct access to services.
Full end-to-end TLS (client to backend, gateway as pass-through) adds maximum security but eliminates the gateway's ability to inspect, transform, or log request/response content.
Request aggregation — the gateway calling multiple backend services and combining responses — is powerful for client convenience but can cause memory pressure when responses are large or many services are called in parallel.
Memory issues arise when: a single aggregated response exceeds the gateway's memory limits, slow backend services cause the gateway to hold many in-flight responses simultaneously, or aggregation timeouts allow partial responses to accumulate.
Mitigations: set per-request memory limits, use streaming aggregation where possible, apply aggressive timeouts to individual backend calls, and cap the number of parallel backend calls the gateway will make for a single client request.
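Two of these mitigations, per-call timeouts and a cap on parallel backend calls, can be sketched with `asyncio`. The function name `aggregate`, the default limits, and the stand-in backend coroutines are hypothetical; a real gateway would issue HTTP or gRPC calls where the stubs are.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


async def aggregate(calls: Dict[str, Callable[[], Awaitable[object]]],
                    per_call_timeout: float = 0.5,
                    max_parallel: int = 4) -> Dict[str, Optional[object]]:
    """Run backend calls with bounded concurrency; timed-out calls yield None."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(call: Callable[[], Awaitable[object]]) -> Optional[object]:
        async with semaphore:  # at most `max_parallel` calls in flight
            try:
                return await asyncio.wait_for(call(), timeout=per_call_timeout)
            except asyncio.TimeoutError:
                return None  # slow backend: give up rather than hold memory

    results = await asyncio.gather(*(guarded(c) for c in calls.values()))
    return dict(zip(calls.keys(), results))


async def fast() -> dict:
    return {"id": 123}


async def slow() -> dict:
    await asyncio.sleep(1.0)  # stands in for a degraded backend
    return {"stock": 5}


print(asyncio.run(aggregate({"product": fast, "inventory": slow},
                            per_call_timeout=0.05)))
```

The slow call is cancelled at its timeout and reported as `None`, so the gateway can return a partial response instead of waiting on the slowest backend.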
A service registry (e.g., Consul, etcd, Kubernetes endpoints) maintains the current list of healthy instances for each backend service. The gateway queries the registry to route requests, rather than using static configuration.
Health checks keep the registry accurate: the gateway or a separate process periodically calls each service instance's health endpoint and deregisters instances that fail. This ensures routing stops going to instances that are starting up, overloaded, or crashed.
Stale registry data is a common failure mode — instances can be dead but still in the registry if health checks are infrequent or the deregistration signal is missed. Use short TTLs (30 seconds or less) and ensure deregistration is event-driven, not just TTL-based.
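The combination of TTL expiry and event-driven deregistration might look like the following in-memory sketch. `ServiceRegistry` and its methods are hypothetical names for illustration and do not mirror the Consul, etcd, or Kubernetes APIs.

```python
import time
from typing import Callable, Dict, List


class ServiceRegistry:
    """Tracks instances by last heartbeat; stale entries drop out after the TTL."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        # service name -> {instance address: time of last heartbeat}
        self._last_seen: Dict[str, Dict[str, float]] = {}

    def heartbeat(self, service: str, address: str) -> None:
        """Called when a health check for this instance succeeds."""
        self._last_seen.setdefault(service, {})[address] = self.clock()

    def deregister(self, service: str, address: str) -> None:
        """Event-driven removal, e.g. on shutdown or a failed health check."""
        self._last_seen.get(service, {}).pop(address, None)

    def healthy(self, service: str) -> List[str]:
        """Instances whose last heartbeat is within the TTL."""
        cutoff = self.clock() - self.ttl
        instances = self._last_seen.get(service, {})
        return sorted(a for a, seen in instances.items() if seen >= cutoff)
```

The TTL acts as the safety net: an instance that crashes without sending a deregistration signal still disappears from `healthy()` once its heartbeat ages out.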
To isolate gateway latency, compare the total request duration measured at the gateway with the time the gateway spent waiting on the backend. The gateway's internal processing time (auth, rate limiting, routing) should be recorded as its own histogram rather than folded into the total.
Key metrics: gateway_request_duration_seconds (with backend_service and route labels), gateway_backend_latency_seconds (time spent waiting for backend), and gateway_overhead_seconds (calculated as total minus backend time).
Percentiles matter more than averages: P50 can look fine while P99 reveals latency tails caused by connection pool exhaustion, GC pauses, or slow rate limiting stores. Always look at P95 and P99 when diagnosing latency issues.
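A small worked example shows how a healthy median can hide a bad tail. The sample data is invented for illustration, durations are in milliseconds for readability, and the nearest-rank `percentile` helper is an assumption; monitoring systems usually estimate percentiles from histogram buckets instead.

```python
import math
from typing import List, Tuple


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# (total duration, time waiting on backend) per request, in milliseconds.
# 98 normal requests plus 2 that stalled inside the gateway itself.
requests: List[Tuple[int, int]] = [(20, 15)] * 98 + [(500, 15)] * 2

# Gateway overhead is total time minus backend time.
overhead_ms = [total - backend for total, backend in requests]

print(percentile(overhead_ms, 50))  # prints 5
print(percentile(overhead_ms, 99))  # prints 485
```

The median overhead is 5 ms, while P99 is 485 ms: the backend was equally fast in every request, so the tail is attributable to the gateway.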
A gateway can be both a DDoS target and a DDoS shield. As a target, attackers aim traffic at the gateway to exhaust its resources. As a shield, the gateway's rate limiting and connection management can absorb or deflect attack traffic before it reaches backend services.
Protective measures at the gateway: aggressive rate limiting by IP and API key, connection limits per client, request size limits to prevent amplification, and IP blocklists for known bad actors. For volumetric DDoS (Gbps+ attacks), these are insufficient — edge DDoS protection (Cloudflare, AWS Shield) is needed before traffic reaches the gateway.
The gateway should also emit rate limit violation metrics so security teams can detect and respond to attack patterns in real time.
Configuration drift — the gateway behaving differently across environments due to subtle config differences — is a common operational problem. Rate limiting thresholds, routing rules, and feature flags often vary between environments in ways that cause prod-only bugs.
Best practices: store gateway config in version control with environment-specific overrides; use canary deployments for config changes (roll out to 5% of traffic first); treat config as code with code review requirements; and have automated config validation that runs before applying changes.
Secrets management is separate from config: use a secrets manager (Vault, AWS Secrets Manager) for API keys and credentials, not the gateway config file itself.
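Environment overrides, pre-apply validation, and drift detection can be sketched as plain dictionary operations. Every key, value, and function name below is hypothetical; real gateways have their own config schemas and validation tooling.

```python
from typing import Any, Dict, List

# Base config shared by all environments, plus per-environment overrides.
BASE: Dict[str, Any] = {
    "rate_limit_per_minute": 600,
    "upstream_timeout_seconds": 10,
    "routes": ["/api/products", "/api/orders"],
}
OVERRIDES: Dict[str, Dict[str, Any]] = {
    "staging": {"rate_limit_per_minute": 6000},
    "prod": {},
}


def effective_config(env: str) -> Dict[str, Any]:
    """Base config with the environment's overrides applied on top."""
    return {**BASE, **OVERRIDES.get(env, {})}


def validate(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config may be applied."""
    errors = []
    if config.get("rate_limit_per_minute", 0) <= 0:
        errors.append("rate_limit_per_minute must be positive")
    if config.get("upstream_timeout_seconds", 0) <= 0:
        errors.append("upstream_timeout_seconds must be positive")
    if not config.get("routes"):
        errors.append("at least one route is required")
    if any("secret" in key.lower() or "password" in key.lower() for key in config):
        errors.append("credentials belong in a secrets manager, not in config")
    return errors


def drift(env_a: str, env_b: str) -> Dict[str, tuple]:
    """Keys whose effective values differ between two environments."""
    a, b = effective_config(env_a), effective_config(env_b)
    return {k: (a.get(k), b.get(k)) for k in sorted(set(a) | set(b))
            if a.get(k) != b.get(k)}


print(drift("staging", "prod"))  # prints {'rate_limit_per_minute': (6000, 600)}
```

Running a check like `drift` in CI makes intentional differences visible in review and flags accidental ones before they become prod-only bugs.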
The gateway must identify tenants from incoming requests — via API key header, JWT claim, or subdomain — and ensure requests route only to that tenant's backend services. Tenant isolation is enforced at the routing layer, not left to backend services alone.
For shared backend services serving multiple tenants, the gateway should inject tenant context into request headers (X-Tenant-ID) so backend services can scope data queries. The gateway itself must never cache responses across tenants — a cached response for one tenant must not be served to another.
Rate limiting must be per-tenant, not global. A single misbehaving tenant should not consume budget that affects other tenants on the same gateway.
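Tenant context injection and tenant-scoped caching can be sketched as two small helpers. The function names are hypothetical, and the sketch assumes `tenant_id` has already been taken from a verified credential (a validated JWT claim or an API key lookup), never from a client-supplied header.

```python
from typing import Dict


def forward_headers(client_headers: Dict[str, str], tenant_id: str) -> Dict[str, str]:
    """Headers to send to the backend, with the tenant set by the gateway."""
    # Drop any client-supplied X-Tenant-ID so one tenant cannot pose as another.
    headers = {k: v for k, v in client_headers.items()
               if k.lower() != "x-tenant-id"}
    headers["X-Tenant-ID"] = tenant_id
    return headers


def cache_key(tenant_id: str, method: str, path: str) -> str:
    """Cache key that includes the tenant, so entries are never shared."""
    return f"{tenant_id}:{method.upper()}:{path}"


spoofed = {"Accept": "application/json", "x-tenant-id": "tenant-b"}
print(forward_headers(spoofed, "tenant-a"))
# prints {'Accept': 'application/json', 'X-Tenant-ID': 'tenant-a'}
```

Because the tenant ID is part of every cache key, two tenants requesting the same path always get separate cache entries.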
The gateway is the central enforcement point for API lifecycle management. It should add Deprecation and Sunset headers to responses for deprecated endpoints, track usage of deprecated API versions, and eventually block requests to sunset endpoints with clear migration guidance.
Deprecation workflow: announce deprecation 6+ months before sunset, add Deprecation: true and Sunset:
At sunset date, the gateway should return 410 Gone for deleted endpoints rather than generic 404, with a response body explaining the replacement version and migration steps.
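The lifecycle rule can be sketched as a single decision function. The function name, the tuple it returns, and the example paths are hypothetical. `Deprecation: true` follows the form used in the text; the standardized Deprecation header (RFC 9745) carries a timestamp instead, so check which form your clients expect. The Sunset value is an HTTP-date.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Optional, Tuple


def lifecycle(sunset_at: datetime, now: datetime,
              successor_path: str) -> Tuple[Optional[int], Dict[str, str], str]:
    """Return (status override, extra headers, body) for a deprecated endpoint.

    A status of None means: forward to the backend and attach the headers.
    """
    if now >= sunset_at:
        body = f"This endpoint has been removed. Use {successor_path} instead."
        return 410, {}, body
    headers = {
        "Deprecation": "true",
        # format_datetime with usegmt=True requires a UTC-aware datetime.
        "Sunset": format_datetime(sunset_at, usegmt=True),
    }
    return None, headers, ""


sunset = datetime(2025, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
print(lifecycle(sunset, datetime(2025, 6, 1, tzinfo=timezone.utc), "/api/v2/orders"))
print(lifecycle(sunset, datetime(2026, 1, 1, tzinfo=timezone.utc), "/api/v2/orders"))
```

Before the sunset date the request is forwarded with both headers attached; afterwards the gateway answers with 410 and a pointer to the replacement.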
Without graceful shutdown, a gateway instance being terminated loses in-flight requests — clients see connection errors mid-request. For a gateway handling hundreds or thousands of concurrent requests, this causes a spike of failed requests at every deployment.
Graceful shutdown involves: stopping new connections (draining the load balancer target), waiting for in-flight requests to complete (with a timeout), then exiting. Typical configuration: SIGTERM triggers graceful shutdown, 30-second timeout for in-flight requests, then force-kill if needed.
Health checks at /health can report unhealthy during drain, causing the load balancer to stop routing new traffic while existing requests complete.
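The drain sequence can be sketched as a small tracker that request handlers and the shutdown path share. `DrainTracker` is a hypothetical helper, not part of any gateway product; in a real process a SIGTERM handler would call `drain()` and exit once it returns.

```python
import threading


class DrainTracker:
    """Counts in-flight requests and refuses new ones once draining starts."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False
        self._idle = threading.Event()
        self._idle.set()  # no requests in flight yet

    def try_begin(self) -> bool:
        """Call at request start; False means the instance is shutting down."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            self._idle.clear()
            return True

    def end(self) -> None:
        """Call when a request finishes, successfully or not."""
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def health_status(self) -> int:
        """Status for /health: 503 during drain so the load balancer backs off."""
        with self._lock:
            return 503 if self._draining else 200

    def drain(self, timeout: float = 30.0) -> bool:
        """Stop accepting work and wait for in-flight requests.

        Returns True if everything finished in time, False if the timeout hit
        and the caller should force-exit.
        """
        with self._lock:
            self._draining = True
        return self._idle.wait(timeout)
```

A `False` return from `drain()` corresponds to the force-kill case: some requests were still running when the timeout expired.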
The gateway timeout should be slightly longer than the backend service's own timeout. If the gateway gives up first, it returns an error for a request that might still have succeeded, and the backend wastes resources finishing work that nobody is waiting for.
Best practice: gateway timeout = backend timeout plus headroom for network transit and gateway processing (e.g., the backend enforces an 8s deadline and the gateway waits 10s). The backend then fails fast on its own deadline, and the gateway's 504 acts as a backstop for hung connections rather than racing a response it would have to drop.
Different routes can have different timeout values based on the backend service's characteristics. Slow endpoints (report generation) get longer timeouts; fast endpoints (health checks) get shorter ones.
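Per-route timeouts usually reduce to a lookup table with a default. The sketch below uses longest-prefix matching; the route prefixes and timeout values are hypothetical and would come from the gateway's routing config in practice.

```python
from typing import Dict

# Timeout in seconds per route prefix (illustrative values only).
ROUTE_TIMEOUTS: Dict[str, float] = {
    "/api/reports": 60.0,  # slow: report generation
    "/api/orders": 10.0,
    "/health": 1.0,        # fast: health checks
}
DEFAULT_TIMEOUT = 5.0


def timeout_for(path: str) -> float:
    """Timeout for the most specific route prefix matching `path`."""
    matches = [prefix for prefix in ROUTE_TIMEOUTS
               if path == prefix or path.startswith(prefix + "/")]
    if not matches:
        return DEFAULT_TIMEOUT
    return ROUTE_TIMEOUTS[max(matches, key=len)]


print(timeout_for("/api/reports/2024/q4"))  # prints 60.0
print(timeout_for("/api/products/123"))     # prints 5.0
```

Matching on `prefix + "/"` keeps `/api/ordersummary` from inheriting the `/api/orders` timeout by accident.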
The gateway can implement retries for idempotent GET requests or those with explicit idempotency keys. Retries should use exponential backoff with jitter to avoid thundering herd problems. The gateway should add X-Request-ID to track retry chains.
Retries must be avoided for non-idempotent mutations (POST or PATCH without idempotency keys) as they can cause duplicate operations. POST to create an order should not be retried automatically: if it times out, the client should check order status before retrying.
Retries amplify failures: when a backend is already struggling, every failed request that gets retried adds load, which pushes it toward full saturation and even more failures. Circuit breakers should trip before retries turn a degraded backend into a cascading failure.
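The retry rules can be sketched as two helpers: one that decides whether a request is eligible and one that computes the delay. The `Idempotency-Key` header name, the base and cap values, and the "full jitter" variant (a random delay between zero and the exponential bound) are assumptions for the example.

```python
import random
from typing import Callable, Mapping


def is_retryable(method: str, headers: Mapping[str, str]) -> bool:
    """Retry only GET requests or requests carrying an idempotency key."""
    if method.upper() == "GET":
        return True
    return any(name.lower() == "idempotency-key" for name in headers)


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The bound doubles each attempt up to `cap`; the actual delay is drawn
    uniformly below that bound so clients do not retry in lockstep.
    """
    return rng() * min(cap, base * (2 ** attempt))


print(is_retryable("POST", {}))                           # prints False
print(is_retryable("POST", {"Idempotency-Key": "abc"}))   # prints True
```

A retry loop would call `is_retryable` once per request, sleep for `backoff_delay(attempt)` between attempts, and stop as soon as the circuit breaker for that backend is open.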
Further Reading
Internal Resources
- Service Mesh — Managing internal service-to-service communication
- Load Balancing — Distributing traffic across multiple gateway instances
- RESTful API Design — Best practices for API contract design
- Circuit Breaker Pattern — Preventing cascade failures
- System Design Roadmap — Complete learning path for system design
External Resources
- NGINX API Gateway documentation — Production-grade reverse proxy and gateway setup
- Kong Gateway docs — Open-source API gateway with plugin ecosystem
- AWS API Gateway developer guide — Managed gateway on AWS with Lambda integration
- OAuth 2.0 RFC 6749 — The specification behind modern API authentication
- Rate Limiting Algorithms — Cloudflare blog — Deep dive on sliding window and token bucket at scale
Conclusion
An API gateway is the foundational piece that ties together client requests, backend services, and operational concerns like authentication, rate limiting, and observability. It simplifies client code by providing a single entry point, centralizes cross-cutting concerns so individual services stay thin, and gives you a central vantage point for monitoring, security, and traffic management.
The key decisions when adopting an API gateway are: choosing between a managed service or self-hosted solution, implementing Redis-backed rate limiting for consistent enforcement across instances, adding circuit breakers to prevent cascade failures, and evaluating whether a BFF pattern is needed for multi-client architectures.
Most production deployments require at least two gateway instances behind a load balancer, TLS termination at the edge, short JWT TTLs with refresh token rotation, and automated certificate renewal. Treat the gateway as a stateless proxy — keep business logic in backend services and store session state externally in Redis.
For most microservices architectures, an API gateway handles external client traffic while a service mesh handles internal service-to-service communication. Together they provide comprehensive coverage for north-south and east-west traffic patterns.