API Gateway: The Single Entry Point for Microservices Architecture
Learn how API gateways work, when to use them, architecture patterns, failure scenarios, and implementation strategies for production microservices.
An API gateway sits at the entrance of your backend services and handles everything that would otherwise clutter individual services or repeat across them: request routing, authentication, rate limiting, protocol translation. When a mobile app asks your product catalog for data, the gateway receives that request, checks the user’s session token, applies rate limiting rules, and forwards it to the product service—all in a single hop.
This centralized layer makes client code simpler since applications talk to one endpoint instead of keeping track of multiple service addresses. It also gives you a natural chokepoint for enforcing security policies, merging responses from different services, and gathering metrics about how your API gets used. For teams building microservices at scale, an API gateway is not optional—it is foundational infrastructure.
When to Use / When Not to Use
| Scenario | Recommendation |
|---|---|
| Multiple backend services need unified access control | Use API Gateway |
| Mobile, web, and third-party clients consume the same APIs | Use API Gateway |
| You need centralized rate limiting and throttling | Use API Gateway |
| Service aggregation is required for client convenience | Use API Gateway |
| Single monolithic application with no external clients | Do NOT use API Gateway |
| Services are tightly coupled and share a deployment unit | Do NOT use API Gateway |
| Ultra-low latency is critical (gateway adds ~1-3ms) | Consider alternatives |
| Simple CRUD application with one or two services | Consider direct service calls |
When TO Use an API Gateway
- Unified client access: Your mobile app, web app, and third-party integrations all hit different services. Without a gateway, clients need to know about every service endpoint, certificate, and authentication mechanism.
- Shared authentication and authorization: You want a single place to validate JWTs, check permissions, and reject unauthorized requests before they reach your services.
- Rate limiting at the edge: You need to protect your services from traffic spikes, abusive clients, or accidental misconfiguration without adding this logic to every service.
- Protocol translation: Your mobile clients use REST, but your internal services might use gRPC or WebSocket. The gateway translates between them.
- Request aggregation: A mobile screen needs data from three different services. Without aggregation in the gateway, the client makes three separate calls with associated latency and complexity.
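The aggregation case above can be sketched as a gateway handler that fans out to the backends in parallel and merges the responses; the fetcher names and payload shapes here are illustrative, not part of any real API:

```javascript
// Sketch of gateway-side aggregation: one client call fans out to three
// backend lookups in parallel and merges the results into a single payload.
// The fetchers are injected so any HTTP client (axios, fetch) can back them.
async function aggregateProductScreen(fetchers, productId) {
  const [product, reviews, inventory] = await Promise.all([
    fetchers.product(productId),
    fetchers.reviews(productId),
    fetchers.inventory(productId),
  ]);
  // One merged response instead of three client round-trips
  return { product, reviews, inventory };
}
```

In a real gateway each fetcher would carry its own timeout and circuit breaker, so one slow backend cannot stall the whole aggregate.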
When NOT to Use an API Gateway
- Adding unnecessary hops: If your system is a simple monolith or a handful of tightly coordinated services, the gateway introduces latency without meaningful benefit.
- Internal service-to-service traffic: in some architectures, services behind the gateway are made to call each other through it as well. Routing east-west traffic through the gateway turns it into a bottleneck; let internal services call each other directly, or use a service mesh.
- Single-purpose applications: A data processing pipeline with no external clients does not need a gateway.
- Latency-sensitive paths: Every request going through the gateway adds 1-3ms. For extremely latency-sensitive use cases, this matters.
Architecture Diagram
Request Flow
sequenceDiagram
participant Client
participant Gateway as API Gateway
participant Auth as Auth Service
participant Catalog as Product Service
participant Order as Order Service
participant Cache as Redis Cache
Client->>Gateway: GET /api/products/123
Gateway->>Gateway: Extract JWT, Rate Limit Check
Gateway->>Auth: Validate Token
Auth-->>Gateway: Token Valid
Gateway->>Cache: Check Cache
Cache-->>Gateway: Cache Hit
Gateway-->>Client: Product JSON
Client->>Gateway: POST /api/orders
Gateway->>Gateway: Extract JWT, Rate Limit Check
Gateway->>Auth: Validate Token
Auth-->>Gateway: Token Valid
Gateway->>Catalog: Check Product Availability
Catalog-->>Gateway: Available
Gateway->>Order: Create Order
Order-->>Gateway: Order Created
Gateway-->>Client: Order Confirmation
Gateway Internal Components
graph TD
A[Client Request] --> B[TLS Termination]
B --> C[Authentication]
C --> D[Authorization]
D --> E[Rate Limiting]
E --> F[Request Routing]
F --> G[Service Discovery]
G --> H[Backend Service]
H --> F
F --> I[Response Aggregation]
I --> J[Metrics Collection]
J --> K[Client Response]
subgraph Security Layer
C
D
E
end
subgraph Routing Layer
F
G
end
Failure Flow
graph TD
A[Client Request] --> B{Gateway Available?}
B -->|No| C[Return 503 Service Unavailable]
B -->|Yes| D{Auth Passed?}
D -->|No| E[Return 401 Unauthorized]
D -->|Yes| F{Rate Limit OK?}
F -->|No| G[Return 429 Too Many Requests]
F -->|Yes| H{Backend Service Available?}
H -->|No| I[Return 502 Bad Gateway]
H -->|Yes| J{Request Valid?}
J -->|No| K[Return 400 Bad Request]
J -->|Yes| L[Forward to Service]
L --> M{Service Timeout?}
M -->|Yes| N[Return 504 Gateway Timeout]
M -->|No| O[Return Service Response]
Production Failure Scenarios
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| Gateway instance crash | All traffic fails | Run multiple gateway instances behind load balancer; health checks detect failures |
| Backend service timeout | Client hangs indefinitely | Set aggressive timeouts (e.g., 5s); circuit breaker returns error immediately |
| Auth service unavailable | No requests can be validated | Cache JWT validation results with a short TTL; choose an explicit fail-open or fail-closed policy per route |
| Rate limiter memory exhaustion | Rate limiting fails open | Use Redis-backed rate limiting; set hard limits on memory per tenant |
| Gateway misconfiguration | All traffic routing incorrectly | Use version-controlled config; canary deployments for config changes |
| SSL/TLS certificate expiry | HTTPS requests fail | Automate certificate renewal (Let’s Encrypt); alert 30 days before expiry |
| Service discovery returns stale IPs | Requests go to dead instances | Use short TTL in service registry; health checks remove unhealthy instances |
| Request payload too large | Memory exhaustion on gateway | Set max request size limits; reject oversized payloads early |
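The auth-service mitigation in the table above (caching token validation results with a short TTL) might look like this sketch; the TTL and the in-memory store are assumptions, and a multi-instance deployment would use a shared cache such as Redis instead:

```javascript
// Sketch: remember recent token verdicts for a short TTL so a slow or
// briefly unavailable auth service does not stall every request.
// In-memory for illustration only; use a shared cache across instances.
class TokenVerdictCache {
  constructor(ttlMs = 30_000) {
    this.ttlMs = ttlMs;
    this.entries = new Map(); // token -> { verdict, expiresAt }
  }

  get(token, now = Date.now()) {
    const entry = this.entries.get(token);
    if (!entry || entry.expiresAt <= now) {
      this.entries.delete(token);
      return undefined; // cache miss: caller must ask the auth service
    }
    return entry.verdict;
  }

  set(token, verdict, now = Date.now()) {
    this.entries.set(token, { verdict, expiresAt: now + this.ttlMs });
  }
}
```

Keep the TTL short (tens of seconds): it bounds how long a revoked token keeps working after revocation.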
Trade-off Analysis
| Factor | With API Gateway | Without API Gateway |
|---|---|---|
| Latency | +1-3ms per request | Baseline |
| Consistency | Centralized auth/rate limiting | Duplicated per service |
| Cost | Gateway instances + operation | No additional cost |
| Complexity | Centralized logic, single config | Distributed logic, multiple configs |
| Operability | Single point to monitor | Monitor each service separately |
| Client complexity | Low (one endpoint) | High (manage multiple endpoints) |
| Debugging | Single point to trace | Trace across multiple services |
| Single point of failure | Yes, unless highly available | No (but more complex clients) |
| Flexibility | Limited by gateway capabilities | Full flexibility per service |
API Gateway vs Service Mesh
| Aspect | API Gateway | Service Mesh |
|---|---|---|
| Layer | L7 (Application) | L4/L7 (Transport + Application) |
| Scope | North-South traffic (client to service) | East-West traffic (service to service) |
| Typical Users | Platform teams, API product teams | DevOps, SRE teams |
| Features | Auth, routing, aggregation, protocol translation | mTLS, retries, circuit breaking |
| Deployment | Sits at edge | Sidecar proxies on each service |
For most architectures, you need both. The API gateway handles external client traffic while a service mesh handles internal service-to-service communication. See Service Mesh for a deep dive.
Implementation Example (Node.js)
Here is a minimal API gateway built with Express that covers the core production concerns (authentication, rate limiting, circuit breaking, request IDs):
const express = require("express");
const axios = require("axios");
const crypto = require("crypto"); // for randomUUID()
const rateLimit = require("express-rate-limit");
const jwt = require("jsonwebtoken");
const app = express();
// Configuration
const PORT = process.env.PORT || 3000;
const AUTH_SERVICE_URL =
process.env.AUTH_SERVICE_URL || "http://auth-service:8080";
const PRODUCT_SERVICE_URL =
process.env.PRODUCT_SERVICE_URL || "http://product-service:8080";
const ORDER_SERVICE_URL =
process.env.ORDER_SERVICE_URL || "http://order-service:8080";
// Middleware: Parse JSON with size limit
app.use(express.json({ limit: "1mb" }));
// Middleware: Request ID for tracing
app.use((req, res, next) => {
req.id = crypto.randomUUID();
res.setHeader("X-Request-ID", req.id);
next();
});
// Middleware: Rate limiting (Redis-backed in production)
const limiter = rateLimit({
windowMs: 60 * 1000, // 1 minute
max: 100, // 100 requests per minute per IP
// message may be a value or a function of (req, res)
message: (req, res) => ({ error: "Too many requests", requestId: req.id }),
standardHeaders: true,
legacyHeaders: false,
});
app.use("/api/", limiter);
// Middleware: Authentication
async function authenticate(req, res, next) {
const token = req.headers.authorization?.replace("Bearer ", "");
if (!token) {
return res
.status(401)
.json({ error: "Missing authorization token", requestId: req.id });
}
try {
// In production, use a distributed cache for validation results
const decoded = jwt.verify(token, process.env.JWT_SECRET);
req.user = decoded;
next();
} catch (error) {
return res.status(401).json({ error: "Invalid token", requestId: req.id });
}
}
// Middleware: Authorization
function authorize(...allowedRoles) {
return (req, res, next) => {
if (!req.user || !allowedRoles.includes(req.user.role)) {
return res
.status(403)
.json({ error: "Insufficient permissions", requestId: req.id });
}
next();
};
}
// Health check endpoint
app.get("/health", (req, res) => {
res.json({ status: "healthy", timestamp: new Date().toISOString() });
});
// Route: Product catalog with circuit breaker
const CircuitBreaker = require("opossum");
const productCircuit = new CircuitBreaker(
async (productId, requestId) => {
const response = await axios.get(
`${PRODUCT_SERVICE_URL}/products/${productId}`,
{
timeout: 5000,
headers: { "X-Request-ID": requestId },
},
);
return response.data;
},
{
timeout: 5000,
errorThresholdPercentage: 50,
resetTimeout: 30000,
},
);
// Log state changes; while the circuit is open, fire() rejects immediately,
// so the route handler below answers 502 without waiting on the backend
productCircuit.on("open", () => console.warn("product circuit opened"));
productCircuit.on("timeout", () => console.warn("product service timed out"));
app.get("/api/products/:id", authenticate, async (req, res) => {
try {
const product = await productCircuit.fire(req.params.id, req.id);
res.json(product);
} catch (error) {
res.status(502).json({ error: "Bad gateway", requestId: req.id });
}
});
// Route: Create order (aggregates product and order services)
app.post(
"/api/orders",
authenticate,
authorize("user", "admin"),
async (req, res) => {
const { productId, quantity } = req.body;
try {
// Check product availability
const productResponse = await axios.get(
`${PRODUCT_SERVICE_URL}/products/${productId}`,
{ timeout: 3000 },
);
if (!productResponse.data.available) {
return res.status(400).json({ error: "Product not available" });
}
// Create order
const orderResponse = await axios.post(
`${ORDER_SERVICE_URL}/orders`,
{ productId, quantity, userId: req.user.id },
{ timeout: 5000 },
);
res.status(201).json(orderResponse.data);
} catch (error) {
if (error.code === "ECONNABORTED") {
return res.status(504).json({ error: "Gateway timeout" });
}
res.status(502).json({ error: "Failed to create order" });
}
},
);
// Error handling middleware
app.use((err, req, res, next) => {
console.error(`[${req.id}] Unhandled error:`, err);
res.status(500).json({ error: "Internal server error", requestId: req.id });
});
app.listen(PORT, () => {
console.log(`API Gateway listening on port ${PORT}`);
});
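The copy/paste checklist at the end of this post calls for graceful shutdown. One way to add it to the listing above is to drain in-flight requests on SIGTERM; the 10-second grace period is an assumption, and the process handle is injectable purely to keep the sketch testable:

```javascript
// Sketch: stop accepting new connections on SIGTERM, finish in-flight
// requests, then exit, so a rolling deploy does not drop traffic.
function installGracefulShutdown(server, { proc = process, graceMs = 10_000 } = {}) {
  proc.on("SIGTERM", () => {
    console.log("SIGTERM received, draining connections");
    server.close(() => proc.exit(0)); // exit cleanly once drained
    const timer = setTimeout(() => proc.exit(1), graceMs); // hard stop fallback
    if (timer.unref) timer.unref(); // don't keep the event loop alive
  });
}

// Usage with the listing above:
// const server = app.listen(PORT, () => { ... });
// installGracefulShutdown(server);
```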
Docker Compose for Local Development
version: "3.8"
services:
api-gateway:
build: ./api-gateway
ports:
- "3000:3000"
environment:
- JWT_SECRET=your-secret-key
- AUTH_SERVICE_URL=http://auth-service:8080
- PRODUCT_SERVICE_URL=http://product-service:8080
- ORDER_SERVICE_URL=http://order-service:8080
- REDIS_URL=redis://redis:6379
depends_on:
- redis
redis:
image: redis:7-alpine
ports:
- "6379:6379"
auth-service:
image: your-auth-service-image
ports:
- "8080:8080"
product-service:
image: your-product-service-image
ports:
- "8081:8080"
order-service:
image: your-order-service-image
ports:
- "8082:8080"
Capacity Estimation
Assumptions
- Average request size: 2 KB
- Average response size: 16 KB
- Peak QPS: 10,000 requests/second
- Average response time target: 50ms (gateway overhead: 3ms)
Gateway Instance Calculation
Required instances = (Peak QPS × Safety Factor) / (Max Throughput per Instance)
Where:
- Peak QPS = 10,000
- Safety Factor = 2x
- Max Throughput per Instance = 2,000 QPS (typical for a 2 vCPU instance)
Required = (10,000 × 2) / 2,000 = 10 instances (comfortably above the 2-instance minimum for HA)
Concurrency check: 10,000 QPS × 0.05s average latency ≈ 500 in-flight requests, or about 50 per instance.
Network Bandwidth
Inbound: 10,000 QPS × 2 KB = 20 MB/s = 160 Mbps
Outbound: 10,000 QPS × 16 KB = 160 MB/s = 1.28 Gbps
Total network required: ~1.5 Gbps
Memory (per instance with 2 vCPU)
Connection buffers: 256 MB
Rate limiting state (Redis): Shared across instances
Application heap: 512 MB
Operating system: 256 MB
Total per instance: ~1 GB RAM
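These estimates can be re-run for other workloads with a small helper; it applies the throughput-based sizing rule with a high-availability floor, and reports bandwidth in decimal MB/s as above:

```javascript
// Sketch of the capacity math in this section as a reusable helper.
// Inputs mirror the stated assumptions; adjust for your own workload.
function estimateGateway({ peakQps, reqKb, respKb, perInstanceQps, safetyFactor = 2, haMinimum = 2 }) {
  const instances = Math.max(
    Math.ceil((peakQps * safetyFactor) / perInstanceQps),
    haMinimum, // never fewer than the HA minimum
  );
  return {
    instances,
    inboundMBps: (peakQps * reqKb) / 1000,  // decimal MB/s
    outboundMBps: (peakQps * respKb) / 1000,
  };
}
```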
Observability Checklist
Metrics to Capture
- gateway_requests_total (counter) - Total requests by route and status code
- gateway_request_duration_seconds (histogram) - Latency by route, percentile bands
- gateway_active_connections (gauge) - Current concurrent connections
- gateway_rate_limit_exceeded_total (counter) - Rate limit violations by client
- gateway_backend_errors_total (counter) - Backend service errors by service
- gateway_circuit_breaker_state (gauge) - Circuit breaker state by backend
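Any metrics client can track these (prom-client is the usual choice in Node). As a dependency-free sketch, a labeled counter like gateway_requests_total reduces to:

```javascript
// Minimal sketch of a labeled counter; a real gateway would use a metrics
// library (e.g. prom-client) that also handles exposition and histograms.
class LabeledCounter {
  constructor(name) {
    this.name = name;
    this.series = new Map(); // normalized label set -> count
  }

  // Sort label entries so {route, status} and {status, route} hit one series
  static key(labels) {
    return JSON.stringify(Object.entries(labels).sort());
  }

  inc(labels = {}) {
    const k = LabeledCounter.key(labels);
    this.series.set(k, (this.series.get(k) || 0) + 1);
  }

  value(labels = {}) {
    return this.series.get(LabeledCounter.key(labels)) || 0;
  }
}

// e.g. requests.inc({ route: "/api/products/:id", status: 200 })
```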
Logs to Emit
Each request should emit structured JSON logs:
{
"timestamp": "2026-03-23T10:15:30.123Z",
"requestId": "550e8400-e29b-41d4-a716-446655440000",
"method": "GET",
"path": "/api/products/123",
"statusCode": 200,
"latencyMs": 12,
"clientIp": "203.0.113.42",
"userAgent": "MobileApp/2.1",
"userId": "usr_abc123",
"rateLimitRemaining": 87,
"backendService": "product-service",
"backendLatencyMs": 8
}
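A small helper can assemble that entry from request context. Field names match the example above; the inputs (status code, timings, backend info) are assumed to be collected by surrounding middleware:

```javascript
// Sketch: build one structured log entry per request. Emit it as a single
// JSON line, e.g. console.log(JSON.stringify(buildLogEntry(...))).
function buildLogEntry({ req, statusCode, latencyMs, backend }) {
  return {
    timestamp: new Date().toISOString(),
    requestId: req.id,
    method: req.method,
    path: req.path,
    statusCode,
    latencyMs,
    clientIp: req.ip,
    userAgent: req.headers["user-agent"],
    userId: req.user ? req.user.id : null, // null for unauthenticated routes
    backendService: backend ? backend.service : null,
    backendLatencyMs: backend ? backend.latencyMs : null,
  };
}
```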
Alerts to Configure
| Alert | Threshold | Severity |
|---|---|---|
| P99 latency > 100ms | 100ms for 5 minutes | Warning |
| P99 latency > 500ms | 500ms for 1 minute | Critical |
| Error rate > 1% | 1% for 5 minutes | Warning |
| Error rate > 5% | 5% for 1 minute | Critical |
| Rate limit violations spike | > 1000/min from single IP | Warning |
| Backend service unavailable | Any backend down > 30s | Critical |
| Certificate expiry < 30 days | Any cert expiring soon | Warning |
Distributed Tracing
Ensure the gateway propagates trace context to backend services:
// Propagate trace headers to backend services
const traceHeaders = {
"X-Request-ID": req.id,
"X-B3-TraceId": req.headers["x-b3-traceid"],
"X-B3-SpanId": req.headers["x-b3-spanid"],
"X-B3-Sampled": req.headers["x-b3-sampled"],
};
await axios.get(`${SERVICE_URL}/products/${id}`, {
headers: { ...traceHeaders, Authorization: req.headers.authorization },
});
Security Checklist
- TLS 1.2+ termination with modern cipher suites
- JWT validation with proper signature verification
- Rate limiting configured per-client (IP, API key, user ID)
- Request size limits to prevent payload amplification
- Input validation on all request parameters
- Output encoding to prevent XSS in responses
- CORS policy properly configured
- Security headers (HSTS, CSP, X-Frame-Options)
- Audit logging for all authentication/authorization failures
- API key rotation mechanism
- Deprecation notices for older API versions
- Penetration testing performed annually
- DDoS protection at edge (Cloudflare, AWS Shield)
- Backend services unreachable directly (only via gateway)
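Several checklist items (HSTS, CSP, X-Frame-Options) can be set from one piece of Express-style middleware. The header values below are illustrative defaults, not a recommended policy; the helmet package is the more common production choice:

```javascript
// Sketch: security headers as gateway middleware. Tune the CSP to what
// your clients actually load; 'self'-only is just a starting point.
function securityHeaders(req, res, next) {
  res.setHeader("Strict-Transport-Security", "max-age=31536000; includeSubDomains");
  res.setHeader("Content-Security-Policy", "default-src 'self'");
  res.setHeader("X-Frame-Options", "DENY");
  res.setHeader("X-Content-Type-Options", "nosniff");
  next();
}

// app.use(securityHeaders);
```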
Common Pitfalls / Anti-Patterns
Pitfall 1: Gateway as a Monolith Proxy
Problem: Teams sometimes build the gateway to contain significant business logic, transforming it into another monolith that mirrors the old system.
Solution: Keep the gateway thin. It should handle cross-cutting concerns (auth, routing, rate limiting) but delegate business logic to the appropriate backend services. If you find yourself writing if (user.plan === 'enterprise') { ... } in the gateway, that is a sign business logic is leaking into the gateway.
Pitfall 2: No Circuit Breaker on Backend Calls
Problem: A slow or failing backend service causes requests to pile up at the gateway, eventually exhausting gateway resources and taking down the entire system.
Solution: Always wrap backend service calls with circuit breakers. When a backend error rate exceeds a threshold, the circuit opens and immediately returns an error rather than waiting for timeouts.
// Never do this - no timeout, no circuit breaker
const response = await axios.get(`${BACKEND_URL}/data`);
// Always do this - timeout on the call, circuit breaker around it
const CircuitBreaker = require("opossum");
const circuit = new CircuitBreaker((url) => axios.get(url, { timeout: 3000 }), {
timeout: 3000,
errorThresholdPercentage: 50,
});
const response = await circuit.fire(`${BACKEND_URL}/data`);
Pitfall 3: Stale Service Discovery
Problem: The gateway caches service endpoints that have changed (scale-down, failures), causing requests to go to dead instances.
Solution: Use short TTLs in service discovery (30 seconds or less), implement health checks that remove unhealthy instances immediately, and have backend services register/deregister dynamically with the service registry.
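A minimal version of that registry behavior, with heartbeats and TTL-based expiry (the 30-second TTL matches the guidance above; the data structure is illustrative):

```javascript
// Sketch: instances register with a heartbeat timestamp, and lookups drop
// entries older than the TTL so traffic stops flowing to dead instances.
class ServiceRegistry {
  constructor(ttlMs = 30_000) {
    this.ttlMs = ttlMs;
    this.instances = new Map(); // service -> Map(address -> lastSeen)
  }

  heartbeat(service, address, now = Date.now()) {
    if (!this.instances.has(service)) this.instances.set(service, new Map());
    this.instances.get(service).set(address, now);
  }

  lookup(service, now = Date.now()) {
    const seen = this.instances.get(service) || new Map();
    return [...seen.entries()]
      .filter(([, lastSeen]) => now - lastSeen < this.ttlMs) // drop stale
      .map(([address]) => address);
  }
}
```

In production this role is usually played by Consul, etcd, or the platform's built-in discovery (e.g. Kubernetes endpoints); the TTL principle is the same.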
Pitfall 4: Authentication Bypass via Direct Service Access
Problem: Backend services are accessible directly without going through the gateway, bypassing all authentication and rate limiting.
Solution: Use network-level isolation (private VPC/subnet) so backend services are only reachable via the gateway. Never expose backend service ports to the public internet.
Pitfall 5: Rate Limiting Without Global State
Problem: Running multiple gateway instances with local (in-memory) rate limiting allows clients to get max_requests × instance_count by spreading requests across instances.
Solution: Use a shared rate limiting store (Redis) that all gateway instances consult. This ensures consistent enforcement regardless of which instance handles the request.
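A fixed-window version of that shared limiter needs only INCR and EXPIRE from the store, so any Redis client can back it; the store interface and key scheme here are assumptions for the sketch:

```javascript
// Sketch: fixed-window rate limiting against a shared store. Because every
// gateway instance increments the same key, the limit holds globally no
// matter which instance serves the request.
async function checkRateLimit(store, clientKey, limit, windowSec, now = Date.now()) {
  const windowId = Math.floor(now / (windowSec * 1000));
  const key = `rl:${clientKey}:${windowId}`;
  const count = await store.incr(key); // atomic across all instances
  if (count === 1) await store.expire(key, windowSec); // first hit sets TTL
  return { allowed: count <= limit, remaining: Math.max(0, limit - count) };
}
```

Fixed windows allow brief bursts at window boundaries; sliding-window or token-bucket variants smooth that out at the cost of more store operations.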
Quick Recap
- An API gateway provides a single entry point for all client requests, handling auth, routing, rate limiting, and protocol translation.
- Use API gateways when you have multiple services, diverse clients, or need centralized security policy enforcement.
- Avoid gateways when latency is critical, for simple single-service applications, or when the overhead outweighs benefits.
- Always implement circuit breakers, proper timeouts, and health checks when calling backend services.
- Run multiple gateway instances behind a load balancer to avoid single points of failure.
- Log structured data (request ID, latency, status) for debugging; emit metrics for alerting.
Copy/Paste Checklist
- [ ] Gateway deployed with TLS termination
- [ ] Multiple instances behind load balancer (minimum 2 for HA)
- [ ] JWT validation implemented
- [ ] Rate limiting configured with Redis backend
- [ ] Circuit breakers on all backend service calls
- [ ] Request/response logging with correlation IDs
- [ ] Metrics exported (latency, errors, rate limits)
- [ ] Alerts configured for latency and error thresholds
- [ ] Backend services only reachable via gateway (network isolation)
- [ ] Security headers configured (HSTS, CSP, etc.)
- [ ] Health check endpoint at /health
- [ ] Graceful shutdown configured
- [ ] DDoS protection at edge
- [ ] Certificate renewal automated
See Also
- Service Mesh — Managing internal service-to-service communication
- Load Balancing — Distributing traffic across multiple gateway instances
- RESTful API Design — Best practices for API contract design
- Circuit Breaker Pattern — Preventing cascade failures
- System Design Roadmap — Complete learning path for system design