Rate limiting: protecting the server from overload

So far, this series on resilience has focused on one side of the equation: protecting the client. Retry re-attempts a failing call, Timeout bounds a wait, and Circuit Breaker cuts short the domino effect. All three patterns share the same point of view, that of the application consuming an external service.

But a distributed system has two ends. And the day your service ends up on the other end of the line, swamped with requests, none of these patterns will save you. To protect the server, you have to shift perspective and adopt the pattern that acts upstream: Rate Limiting.

What is Rate Limiting?

The principle fits in one sentence: limit the number of requests a client can send within a given time window. Beyond the threshold, the server politely refuses to handle additional requests rather than letting itself be overwhelmed.

The primary goal is stability. Server resources (CPU, memory, connections, thread pool) are never infinite. By capping incoming throughput, you keep the system within a controlled operating range, even when facing a traffic spike or a poorly coded client stuck in a loop. Better to serve 95% of requests correctly than to collapse trying to serve 100%.
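
To make the principle concrete, here is a minimal sketch of a fixed-window counter in TypeScript. The class name FixedWindowLimiter and its allow method are illustrative assumptions, not taken from any library, and the in-memory Map only makes sense for a single process.

```typescript
// Minimal fixed-window rate limiter (illustrative sketch, single process only).
class FixedWindowLimiter {
  private readonly windows = new Map<string, { count: number; windowStart: number }>();
  private readonly max: number;
  private readonly windowMs: number;

  constructor(max: number, windowMs: number) {
    this.max = max;
    this.windowMs = windowMs;
  }

  // Returns true if the request is accepted, false if it must be refused.
  allow(clientId: string, nowMs: number = Date.now()): boolean {
    let current = this.windows.get(clientId);
    // Open a fresh window on the first request or once the previous one has elapsed.
    if (!current || nowMs - current.windowStart >= this.windowMs) {
      current = { count: 0, windowStart: nowMs };
      this.windows.set(clientId, current);
    }
    if (current.count >= this.max) {
      return false; // threshold reached: refuse instead of overloading the server
    }
    current.count += 1;
    return true;
  }
}
```

With new FixedWindowLimiter(2, 1000), a client gets two accepted calls per second; the third call within the same second is refused, and the counter starts over once the window has elapsed.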

A security dimension

The second use case is security. Rate Limiting is a first line of defense against abuse: brute-force attempts on a login page, aggressive scraping, denial-of-service (DoS) attacks. By restricting requests coming from an identified source (typically an IP address), you stop the attacker before they consume resources that should be serving legitimate users.

A business dimension

Finally, the pattern often carries a business dimension. Quota-based pricing is the most telling example. It’s the model of nearly every system that exposes LLMs, such as ChatGPT, Cursor, Claude, and many others. The plan you pay for determines how many requests you can issue. Here, Rate Limiting is no longer just a technical safeguard; it becomes the mechanism that materializes a commercial offering.

The link with the other resilience patterns

Rate Limiting doesn’t live in a vacuum. It interacts directly with two of the patterns covered earlier, and understanding these links is what reveals its real place in a resilient architecture.

With Retry (episode 1)

A client-side Retry is by nature unilateral: the application decides on its own to try again, without knowing anything about the server’s actual state. Rate Limiting flips this logic by introducing genuine collaboration between the two ends.

In an HTTP context, the server has a standardized vocabulary for this. When a client exceeds its limit, the server can respond with:

a 429 Too Many Requests status code, which explicitly signals the overage;
a Retry-After header, which tells the client how long to wait before retrying.

It’s now the server, and no longer the client, that dictates the Retry strategy. This inversion is valuable: it lets you regulate the system’s real load at the source, instead of letting each client retry according to its own logic, often at the worst possible moment.
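
On the client side, honoring this contract could look like the sketch below. It assumes a hypothetical send function that returns the status code and the raw Retry-After value, so it doesn’t depend on a particular HTTP client; the function names, the sleep parameter, and the one-second fallback delay are illustrative choices.

```typescript
// Converts a Retry-After value into a delay in milliseconds.
// The header may carry a number of seconds or an HTTP date.
// Returns null when the header is absent or cannot be parsed.
function parseRetryAfterMs(value: string | null | undefined, nowMs: number): number | null {
  if (value == null) return null;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) {
    return Number(trimmed) * 1000;
  }
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs);
}

// Minimal response shape assumed by this sketch.
interface LimitedResponse {
  status: number;
  retryAfter: string | null;
}

// Retries only on 429, waiting for the delay the server asked for.
async function callRespectingRetryAfter<T extends LimitedResponse>(
  send: () => Promise<T>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 3,
  fallbackDelayMs = 1000, // assumption: used when the server gives no usable hint
): Promise<T> {
  let response = await send();
  for (let attempt = 1; attempt < maxAttempts && response.status === 429; attempt++) {
    const delayMs = parseRetryAfterMs(response.retryAfter, Date.now()) ?? fallbackDelayMs;
    await sleep(delayMs);
    response = await send();
  }
  return response;
}
```

The client keeps no backoff policy of its own for this case: the wait comes from the server, and the local default only applies when the header is missing.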

With the Circuit Breaker (episode 3)

Rate Limiting and the Circuit Breaker look remarkably alike in their implementation: both work like an on/off breaker that either lets traffic through or cuts it. But their roles are diametrically opposed.

Pattern	Protects	Acts
Circuit Breaker	the client facing an unstable service	in reaction to failures downstream
Rate Limiting	the server facing overload	in prevention of saturation upstream

The first reacts after the fact, when the service on the other end is already failing; the second anticipates, by refusing excess traffic before it does any harm. Together, they let a distributed system stay under control on both sides: by calling cautiously, and by receiving in moderation.

Two levels of implementation

In practice, Rate Limiting is set up in two complementary places, each addressing a distinct concern.

At the infrastructure level

For the security dimension, that of the first line of defense, it’s better to delegate Rate Limiting to a reverse proxy like NGINX or HAProxy. Three reasons for this: responsibility (filtering abusive traffic shouldn’t pollute your business code), isolation (the malicious request is rejected before it even reaches the application), and performance (a proxy is built for this job and does it at minimal cost).

Here’s an NGINX configuration that caps traffic per IP address before forwarding it to the application:

worker_processes auto;

events {
    worker_connections 4096;
}

http {
    # 50 requests per second per IP address
    limit_req_zone $binary_remote_addr zone=req_limit_per_ip:10m rate=50r/s;

    # HTTP 429 (Too Many Requests) instead of the default 503 to be explicit
    limit_req_status 429;

    # Retry-After value, set only for refused requests:
    # add_header skips a header whose value is empty
    map $status $retry_after {
        429     120;
        default "";
    }

    # (Optional) Log refused requests
    error_log /var/log/nginx/ratelimit.log notice;

    server {
        listen 80;

        location /api/ {
            # Applies the limit defined above:
            # - sustains up to 50 requests per second per IP
            # - tolerates a "burst" of 20 extra requests above that rate
            # - "nodelay" serves burst requests immediately instead of
            #   queueing them; anything beyond the burst is rejected
            limit_req zone=req_limit_per_ip burst=20 nodelay;

            # Application target, within which complementary Rate Limiting
            # logic can be applied based on business logic
            proxy_pass http://localhost:3000;

            # (Optional) HTTP header to help the client know when to retry
            # Caution: fixed indicative value, not tied to NGINX's internal logic
            add_header Retry-After $retry_after always;
        }
    }
}
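
To build intuition for how a sustained rate and a burst allowance combine, here is a small token bucket sketch in TypeScript. It is an approximation for illustration only: NGINX documents its own mechanism as a leaky bucket, and the TokenBucket class and its parameters are assumptions, not NGINX internals.

```typescript
// Token bucket: tokens refill at a steady rate, up to a maximum capacity.
// Each accepted request consumes one token; an empty bucket means refusal.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly ratePerSec: number;
  private readonly capacity: number;

  constructor(ratePerSec: number, capacity: number, nowMs: number = Date.now()) {
    this.ratePerSec = ratePerSec;
    this.capacity = capacity;
    this.tokens = capacity; // start full so an initial burst is accepted
    this.lastRefillMs = nowMs;
  }

  tryConsume(nowMs: number = Date.now()): boolean {
    // Refill in proportion to the elapsed time, never above capacity.
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) {
      return false;
    }
    this.tokens -= 1;
    return true;
  }
}
```

A bucket created with new TokenBucket(50, 21) roughly mirrors the configuration above: about 21 requests can pass at once (one regular request plus a burst of 20), after which acceptance falls back to the sustained rate of 50 per second.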

At the application level

For the business dimension, however, infrastructure is no longer enough. Managing per-user quotas requires knowing who is issuing the request (their API Key, their identifier, their plan), information only the application possesses. So this logic finds its place in the code, most often backed by a fast store like Redis or Memcached to share counters across instances.
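
The idea of sharing counters across instances can be sketched as follows. The CounterStore interface, the in-memory stand-in, and the SharedQuotaLimiter class are hypothetical names used for illustration; with Redis, the increment would typically rely on an atomic INCR combined with an expiry set on the first hit.

```typescript
// Hypothetical abstraction over a shared store such as Redis or Memcached.
interface CounterStore {
  // Increments the counter for `key` and returns its new value.
  // The counter expires `ttlMs` after the first increment of a window.
  increment(key: string, ttlMs: number, nowMs: number): Promise<number>;
}

// In-memory stand-in so the sketch runs without any external service.
class InMemoryCounterStore implements CounterStore {
  private readonly entries = new Map<string, { count: number; expiresAt: number }>();

  async increment(key: string, ttlMs: number, nowMs: number): Promise<number> {
    let entry = this.entries.get(key);
    if (!entry || nowMs >= entry.expiresAt) {
      entry = { count: 0, expiresAt: nowMs + ttlMs };
      this.entries.set(key, entry);
    }
    entry.count += 1;
    return entry.count;
  }
}

// One limiter per application instance; all of them point at the same store.
class SharedQuotaLimiter {
  private readonly store: CounterStore;
  private readonly max: number;
  private readonly windowMs: number;

  constructor(store: CounterStore, max: number, windowMs: number) {
    this.store = store;
    this.max = max;
    this.windowMs = windowMs;
  }

  async allow(clientKey: string, nowMs: number = Date.now()): Promise<boolean> {
    const count = await this.store.increment(`quota:${clientKey}`, this.windowMs, nowMs);
    return count <= this.max;
  }
}
```

Because both instances increment the same key, a user who alternates between them still consumes a single quota, which is the whole point of externalizing the counters.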

The example below, with Fastify and its @fastify/rate-limit plugin, applies a differentiated quota depending on the user’s plan (free, pro, or premium), identified by their API Key, with a fallback to the IP address for anonymous requests:

import Fastify from "fastify";

const fastify = Fastify({ logger: true });

const users = {
  "apikey-123abc": { plan: "free" },
  "apikey-236abc": { plan: "pro" },
  "apikey-23610202abc": { plan: "premium" },
} as const;

type ApiKey = keyof typeof users;

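// Own-property check: `in` would also match inherited names such as "constructor"
const isPaidUser = (value: string | undefined): value is ApiKey =>
  !!value && Object.prototype.hasOwnProperty.call(users, value);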

const quotaPerPlan = {
  free: { max: 5, timeWindow: 60 * 1000 }, // 5 requests per minute
  pro: { max: 100, timeWindow: 60 * 1000 }, // 100 requests per minute
  premium: { max: 1000, timeWindow: 60 * 1000 }, // 1000 requests per minute
} as const;

function getUserQuota(key: string) {
  if (isPaidUser(key)) {
    const { plan } = users[key];
    return quotaPerPlan[plan];
  }
  return quotaPerPlan.free;
}

await fastify.register(import("@fastify/rate-limit"), {
  keyGenerator: (req) => {
    const apiKey = req.headers["x-api-key"] as string | undefined;
    // If the user provides a valid API Key, we use it as the
    // identifying basis for the Rate Limiting strategy.
    if (isPaidUser(apiKey)) {
      return apiKey;
    }
    // Otherwise, we use the user's IP address.
    return req.ip;
  },
  // Number of requests allowed per timeWindow depending on the user type.
  max: (_request, key) => getUserQuota(key).max,
  // Window duration depending on the user type.
  timeWindow: (_request, key) => getUserQuota(key).timeWindow,
  errorResponseBuilder: (_request, context) => ({
    statusCode: 429,
    error: "Too Many Requests",
    message: `Quota exceeded. Retry after ${context.after}`,
    retryAfter: context.after,
  }),
});

// Illustrative route so the server has something to rate-limit
// (the path is an assumption; the port matches the NGINX proxy_pass target)
fastify.get("/api/ping", async () => ({ message: "OK" }));

await fastify.listen({ port: 3000 });

The result seen from the client side

As long as the user stays under their quota, the request goes through and the server returns a 200 OK, along with headers that expose the counter’s state: how many requests they have left and how soon the window resets.

< HTTP/1.1 200 OK
< x-ratelimit-limit: 5
< x-ratelimit-remaining: 4
< x-ratelimit-reset: 60
< content-type: application/json; charset=utf-8
<
{"message":"OK"}

The moment they cross the limit, the response switches to 429 Too Many Requests. The retry-after header tells them precisely when to come back, and the response body details the reason for the refusal:

< HTTP/1.1 429 Too Many Requests
< x-ratelimit-limit: 5
< x-ratelimit-remaining: 0
< x-ratelimit-reset: 5
< retry-after: 5
< content-type: application/json; charset=utf-8
<
{"statusCode":429,"error":"Too Many Requests","message":"Quota exceeded. Retry after 5 seconds","retryAfter":"5 seconds"}

Here, concretely, we find the collaboration mentioned earlier: the server doesn’t just slam the door, it gives the client all the information needed to retry intelligently.
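
A client can also use these headers to pace itself before it ever hits a 429. The sketch below assumes lower-cased header names and a reset value expressed in seconds, as in the sample responses above; the function and type names are illustrative.

```typescript
// Snapshot of the quota as advertised by the server.
interface RateLimitState {
  limit: number;
  remaining: number;
  resetSeconds: number;
}

// Reads the x-ratelimit-* headers; returns null if any of them is missing or invalid.
function readRateLimitState(headers: Record<string, string | undefined>): RateLimitState | null {
  const limit = Number(headers["x-ratelimit-limit"]);
  const remaining = Number(headers["x-ratelimit-remaining"]);
  const resetSeconds = Number(headers["x-ratelimit-reset"]);
  if (![limit, remaining, resetSeconds].every(Number.isFinite)) {
    return null;
  }
  return { limit, remaining, resetSeconds };
}

// How long the client should wait before sending its next request.
function delayBeforeNextRequestMs(state: RateLimitState): number {
  // Requests left in the window: no need to wait.
  if (state.remaining > 0) return 0;
  // Quota exhausted: wait until the window resets.
  return state.resetSeconds * 1000;
}
```

Fed with the headers of the 200 response above, delayBeforeNextRequestMs returns 0; with those of the 429 response, it returns 5000, matching the retry-after value.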

Conclusion

Resilience isn’t built in one direction only. We spend a lot of time protecting our outgoing calls (Retry, Timeout, Circuit Breaker), but a system is only solid if it also knows how to defend against what comes in. That’s exactly the role of Rate Limiting: controlling the load at the source, for reasons of stability, security, and sometimes business model.

By deploying it at both levels (a reverse proxy on the front line to filter abuse, application logic to manage business quotas), you get a server that stays in command of its throughput, whatever shows up on the other end. Protect your outgoing calls, but protect just as much what comes in: that’s the condition under which a distributed system stays under control end to end.