The Timeout pattern: stopping cascading failure in prod

A call that never answers hasn’t really failed: it just hangs, and that’s exactly what makes it dangerous. Sooner or later, an external service will slow down, hiccup, or stop responding altogether. Without a time limit, your application will wait for it indefinitely, with its resources tied up the whole time. The question isn’t whether this will happen, but how your application will react the day it does.

This second installment in the resilience series extends the Retry pattern covered previously with a mechanism that is both simple and devastatingly effective: the Timeout. On its own, it protects your system against one of the most insidious failures in production: cascading degradation.

The problem: waiting is never free

The moment you interact with an external service (database, HTTP API, gRPC, message broker), you depend on a system you have no control over. That service might take a considerable amount of time to respond, or never respond at all. The interaction is, by nature, non-deterministic: nothing guarantees that a call answering in 50 ms today won’t take 30 seconds tomorrow.

And that wait is never free. As long as a call hasn’t completed, resources stay tied up on both sides:

on the client side, your application holds on to threads, sockets, and connections;
on the server side, the external service keeps consuming resources to process a request whose client is still waiting for a response.

Cascading degradation

The real danger isn’t the slowness of a single isolated call; it’s what that slowness triggers. By passively absorbing an external service’s instability, you import its failure into your own system. Pending requests pile up, resources grow scarce, memory climbs, and bit by bit the whole application slows down. A single sluggish dependency can thus tip an entire application over.

It’s precisely to cut this domino effect short that you have to take the initiative and impose a strict limit on the duration of every interaction: a timeout.
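To put rough numbers on that pile-up, Little’s law says that the average number of requests in flight equals the arrival rate multiplied by the time each request stays in the system. The sketch below applies it to the 50 ms and 30 s latencies mentioned earlier. The traffic of 100 requests per second and the 1 s timeout are assumptions chosen purely for illustration, and requestsInFlight is a hypothetical helper.

```typescript
// Little's law: average requests in flight = arrival rate × time in system.
// A timeout caps the time any single request can stay in the system.
function requestsInFlight(
  arrivalRatePerSecond: number,
  latencyMs: number,
  timeoutMs: number = Infinity,
): number {
  const boundedLatencyMs = Math.min(latencyMs, timeoutMs);
  return (arrivalRatePerSecond * boundedLatencyMs) / 1000;
}

// Assumed traffic: 100 requests per second (illustrative value).
const healthy = requestsInFlight(100, 50); // dependency answers in 50 ms
const degraded = requestsInFlight(100, 30_000); // dependency answers in 30 s
const bounded = requestsInFlight(100, 30_000, 1_000); // same, with a 1 s timeout

console.log({ healthy, degraded, bounded });
// → { healthy: 5, degraded: 3000, bounded: 100 }
```

Under those assumptions, the same traffic goes from 5 pending calls to 3,000 once the dependency degrades, while a 1-second timeout keeps it at 100.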

Every external interaction deserves its timeout

The rule is simple and admits no exception: any communication with the outside world must be bounded in time.

a database query = a timeout;
an HTTP or gRPC request = a timeout;
a request to a broker = a timeout.

Beware the false sense of security: some protocols do impose a default timeout, but it’s often far too long or completely disconnected from your business context. A default timeout of several minutes protects you from nothing in an API that’s supposed to respond in under a second. Better, then, to configure it explicitly rather than rely on some arbitrary value.
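When a client library exposes no timeout option at all, the rule can still be enforced by hand. Here is a minimal sketch, assuming the wrapped operation accepts an AbortSignal. The names withTimeout and TimeoutError are hypothetical helpers written for this example, not part of any library.

```typescript
// Hypothetical error type used to tell a timeout apart from other failures.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms} ms`);
    this.name = "TimeoutError";
  }
}

// Runs `operation` and rejects with TimeoutError if it takes longer than `ms`.
async function withTimeout<T>(
  operation: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // lets a cooperative operation stop its work
      reject(new TimeoutError(ms));
    }, ms);
  });

  try {
    // Whichever settles first wins: the operation or the deadline.
    return await Promise.race([operation(controller.signal), deadline]);
  } finally {
    clearTimeout(timer); // never leave the timer pending after the race
  }
}
```

One caveat with this approach: the race only stops the caller from waiting. The underlying work is interrupted only if the operation actually listens to the signal it receives.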

In practice

The good news is that the timeout is a first-class mechanism in most environments. On the web platform, for example, AbortSignal.timeout() lets you cleanly cancel a fetch. Most database and cache clients also expose their own timeout options; you just need to set them explicitly.

/**
 * Simple but effective, the Timeout pattern lets you limit
 * the duration of an operation.
 * You can use AbortSignal#timeout for APIs that support it,
 * such as the Web fetch API.
 */
function networkCallWithTimeout() {
  const timeout = AbortSignal.timeout(10_000);

  return fetch("https://jsonplaceholder.typicode.com/todos/1", {
    signal: timeout,
    // ^^^^^^ timeout expired => "[TimeoutError]: The operation was aborted due to timeout"
  });
}

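/**
 * Otherwise, don't forget to configure timeouts with the libs
 * that allow it, and to adjust the durations to fit your needs.
 * Assumed here: `knex` is an already-configured Knex instance and
 * `Redis` is the ioredis client.
 */
import Redis from "ioredis";

// Fails the query after 10 s; `cancel: true` also asks the database
// to cancel it (supported on MySQL and PostgreSQL).
knex
  .select()
  .from("books")
  .timeout(10_000, { cancel: true });

const redis = new Redis({
  host: "localhost",
  port: 6379,
  connectTimeout: 2000, // ms allowed to establish the connection
  commandTimeout: 1000, // ms allowed for each command to get its reply
});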

/**
 * You can then combine a Timeout with the Retry pattern.
 * You bound the duration of an operation with the Timeout, while
 * retrying on failure, whether or not the failure is related to
 * the timeout.
 */
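import { Duration, Effect, pipe, Schedule } from "effect";

const networkCallWithTimeoutAndRetry = pipe(
  // Each attempt calls networkCallWithTimeout again, so each one
  // gets a fresh 10 s budget.
  Effect.tryPromise(networkCallWithTimeout),
  // Exponential backoff: starts at 1 s and doubles after each failure.
  // As written, this schedule has no attempt limit; add one if
  // unbounded retries are not what you want.
  Effect.retry(Schedule.exponential(Duration.seconds(1), 2))
);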

The winning combo: Retry + Timeout

The Timeout doesn’t conflict with the Retry covered in the previous episode: the two patterns complement each other beautifully. By bounding each attempt in time and then retrying, you get the best of both worlds: every attempt is guaranteed to finish quickly, and a transient failure doesn’t doom the operation. Conversely, retrying without a timeout would amount to stacking up requests that never complete and saturating the system even faster.
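For readers who don’t use Effect, the same combination can be sketched without any dependency. This is one possible shape under stated assumptions, not a reference implementation: retryWithTimeout and its options are hypothetical names, and the durations are arbitrary.

```typescript
// Retries `attempt` up to `attempts` times. Every try gets its own
// time budget through a fresh AbortSignal.
async function retryWithTimeout<T>(
  attempt: (signal: AbortSignal) => Promise<T>,
  options: { attempts: number; timeoutMs: number; baseDelayMs: number },
): Promise<T> {
  let lastError: unknown;

  for (let i = 0; i < options.attempts; i++) {
    try {
      return await attempt(AbortSignal.timeout(options.timeoutMs));
    } catch (error) {
      lastError = error; // timeout or any other failure: both are retried
      if (i < options.attempts - 1) {
        // Exponential backoff between attempts: base, 2 × base, 4 × base...
        const delayMs = options.baseDelayMs * 2 ** i;
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }

  throw lastError;
}

// Example usage (not invoked here): three tries of 2 s each,
// waiting 200 ms and then 400 ms between them.
function fetchTodo(): Promise<Response> {
  return retryWithTimeout(
    (signal) => fetch("https://jsonplaceholder.typicode.com/todos/1", { signal }),
    { attempts: 3, timeoutMs: 2_000, baseDelayMs: 200 },
  );
}
```

The per-attempt timeout also makes the worst case easy to compute. With the values above, the caller waits at most 3 × 2,000 ms plus 200 ms and 400 ms of backoff, or 6.6 seconds in total. Without a timeout, no such bound exists.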

A cancellation that must be honored on both sides

One often-overlooked point is worth dwelling on: a timeout on the client side doesn’t magically free up resources on the server side. For the benefit to be complete, the external service has to react to the cancellation as well. The scenario to aim for looks like this:

1. the client initiates an HTTP request with a defined timeout;
2. the server starts processing and subscribes to the TCP socket’s events;
3. the client-side timeout fires and the socket is closed;
4. the server receives the close event and can, and must, free the resources associated with the request.

Without this fourth step, only the client benefits from the timeout: the server, for its part, keeps working in a vacuum for a response no one is waiting for anymore. The cutoff is only truly clean when both ends play along.
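On the server side, that scenario could look roughly like the sketch below, written with Node’s built-in http module. Everything specific in it is an assumption made for the example: slowWork stands in for a real query or downstream call, the 5-second duration is arbitrary, and cancelledRequests exists only to make the cancellation observable.

```typescript
import { createServer } from "node:http";
import { setTimeout as sleep } from "node:timers/promises";

// Hypothetical slow job. It stops early, rejecting with an AbortError,
// as soon as `signal` is aborted.
async function slowWork(signal: AbortSignal): Promise<string> {
  await sleep(5_000, undefined, { signal });
  return "done\n";
}

// Illustrative counter so the cancellation can be observed.
let cancelledRequests = 0;

const server = createServer(async (_req, res) => {
  const controller = new AbortController();

  // "close" fires when the response is complete or when the connection
  // was terminated early; only the second case is a cancellation.
  res.on("close", () => {
    if (!res.writableFinished) {
      cancelledRequests++;
      controller.abort();
    }
  });

  try {
    res.end(await slowWork(controller.signal));
  } catch {
    if (controller.signal.aborted) return; // client is gone, nothing to send
    res.statusCode = 500;
    res.end("internal error\n");
  }
});

// server.listen(3000); // uncomment to run; the port is arbitrary
```

The important part is that the signal is handed down to whatever does the actual work. A handler that notices the disconnect but lets its query run to completion frees nothing.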

Choosing the right value depends on context

There’s no universal timeout. For some services, a latency of 30 seconds is perfectly normal; for others, 5 seconds is already far too much. The right value depends entirely on the nature of the call and the expectations on the business side.

That’s why setting a timeout shouldn’t be a purely technical decision made in isolation. Discuss it with your domain experts: knowing how long an operation can reasonably take before it’s considered lost is as much a business question as an engineering one. Resilience is built at that level too.
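One way to keep those decisions visible is to gather the agreed durations in a single, named place instead of scattering magic numbers across the codebase. The sketch below is only an illustration: the operation names and every duration are placeholders, to be replaced by whatever your domain experts settle on.

```typescript
// Placeholder budgets, in milliseconds. The values are illustrative only.
const TIMEOUTS_MS = {
  searchAutocomplete: 300, // interactive: a late answer is useless
  paymentAuthorization: 10_000, // slow provider, giving up early is costly
  nightlyReportExport: 30_000, // batch job: nobody is watching a spinner
} as const;

type Operation = keyof typeof TIMEOUTS_MS;

// Returns a signal that aborts once the operation's budget is spent.
function timeoutSignalFor(operation: Operation): AbortSignal {
  return AbortSignal.timeout(TIMEOUTS_MS[operation]);
}

// Example usage (not invoked here); the URL is a placeholder.
function searchSuggestions(query: string): Promise<Response> {
  return fetch(`https://example.com/suggest?q=${encodeURIComponent(query)}`, {
    signal: timeoutSignalFor("searchAutocomplete"),
  });
}
```

Because the operation names are typed, adding a new external call without deciding on its budget becomes a compile-time error rather than a forgotten default.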

Conclusion

The Timeout is one of the simplest patterns to put in place, and yet one of the most rewarding. By bounding each external interaction in time, you prevent a dependency’s slowness from contaminating your entire system, you free your resources as fast as possible, and you cut cascading degradation off cleanly before it takes hold.

Paired with Retry, it forms a solid foundation of resilience on which the series’ subsequent patterns will build. The rule to remember fits in a single sentence: no external interaction should ever be able to wait indefinitely.