Pattern: Retry with Exponential Backoff

Beginner

One Liner

When an operation fails, retry it with progressively longer delays plus random jitter to avoid thundering herd.

Real-World Analogy

Calling a busy restaurant for a reservation. You try once, get a busy signal, wait a minute, try again. Still busy? Wait two minutes. Then four. You also vary the timing slightly so that everyone who got a busy signal isn't calling back at the exact same moment.

Core Idea

Instead of retrying immediately (which overloads the failing service) or giving up (which loses the request), exponential backoff doubles the wait time on each retry. Adding jitter randomizes the delay so thousands of clients don't retry simultaneously.

text

  Time ────────────────────────────────────────────────►

  Attempt 1  ✗ ├─┤ 1s
  Attempt 2  ✗ ├───┤ 2s
  Attempt 3  ✗ ├───────┤ 4s
  Attempt 4  ✗ ├───────────────┤ 8s
  Attempt 5  ✗ ├───────────────────────────────┤ 16s (cap)
  Attempt 6  ✓

  Each bar = wait before next retry (doubles each time)
  + jitter: randomize within each bar to avoid thundering herd

The formula: delay = min(base * 2^attempt + random(0, jitter), maxDelay)

Property	Value
Delay growth	Exponential — doubles each attempt
Max delay	Capped (typically 30–60 s) to bound worst-case wait
Jitter	Randomized to prevent thundering herd
Total attempts	Bounded (typically 3–10) to avoid infinite loops

Try it yourself — send a request and watch exponential backoff with jitter in action:

Production Proof

Project	Source	Usage
Kubernetes	backoff.go#L30-L50	`Backoff` struct defines `Duration`, `Factor`, `Jitter`, `Steps`, `Cap`. `ExponentialBackoff` (line 475) retries with this config. Used for pod restart backoff, API server retries, controller reconciliation.
gRPC-Go	backoff.go#L56-L75	`Exponential.Backoff()` — computes exponential delay with jitter. Base delay doubles per retry, capped at `MaxDelay`. `RunF` (L86-L109) is the retry orchestration loop with context cancellation and `ErrResetBackoff` support.

Implementation

TypeScriptRustGoPython

typescript

interface BackoffConfig {
  maxRetries: number;
  baseDelay: number;
  maxDelay: number;
  jitter: number; // 0-1
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  config: BackoffConfig = { maxRetries: 5, baseDelay: 1000, maxDelay: 30000, jitter: 0.5 },
): Promise<T> {
  let lastError: Error | undefined;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err as Error;
      if (attempt === config.maxRetries) break;

      const exponential = config.baseDelay * Math.pow(2, attempt);
      const jitter = exponential * config.jitter * Math.random();
      const delay = Math.min(exponential + jitter, config.maxDelay);

      await new Promise((r) => setTimeout(r, delay));
    }
  }

  throw lastError;
}

rust

use std::time::Duration;

pub struct Backoff {
    pub max_retries: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

impl Backoff {
    pub fn delay_for(&self, attempt: u32) -> Duration {
        let exponential = self.base_delay.as_millis() as u64 * 2u64.pow(attempt);
        let capped = exponential.min(self.max_delay.as_millis() as u64);
        Duration::from_millis(capped)
    }
}

package backoff

import (
	"math"
	"math/rand"
	"time"
)

type Config struct {
	MaxRetries int
	BaseDelay  time.Duration
	MaxDelay   time.Duration
	Jitter     float64
}

func Retry(fn func() error, cfg Config) error {
	var lastErr error
	for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
		lastErr = fn()
		if lastErr == nil {
			return nil
		}
		if attempt == cfg.MaxRetries {
			break
		}
		exp := float64(cfg.BaseDelay) * math.Pow(2, float64(attempt))
		jitter := exp * cfg.Jitter * rand.Float64()
		delay := time.Duration(math.Min(exp+jitter, float64(cfg.MaxDelay)))
		time.Sleep(delay)
	}
	return lastErr
}

python

import time
import random

def retry_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0, jitter=0.5):
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as e:
            last_error = e
            if attempt == max_retries:
                break
            exponential = base_delay * (2 ** attempt)
            delay = min(exponential + exponential * jitter * random.random(), max_delay)
            time.sleep(delay)
    raise last_error

Exercises

Level	Exercise	File
Basic	Implement retry with configurable backoff	`exercises/typescript/retry-backoff/01-basic.test.ts`
Intermediate	Retry with circuit breaker integration	`exercises/typescript/retry-backoff/02-intermediate.test.ts`

Run exercises: pnpm test:exercises (TypeScript) · cargo test (Rust) · go test ./... (Go) · pytest (Python)

Exercise files: Rust exercises/rust/src/retry_backoff/mod.rs · Go exercises/go/retry_backoff/retry_backoff_test.go · Python exercises/python/retry_backoff/test_retry_backoff.py

When to Use

Network requests — HTTP calls, database connections, RPC
Distributed systems — service-to-service calls that may transiently fail
Rate-limited APIs — back off when hitting rate limits (often 429 responses)
Queue consumers — retry failed message processing

When NOT to Use

Non-transient errors — 400 Bad Request won't succeed on retry; validate input instead
Idempotency not guaranteed — retrying a non-idempotent POST could create duplicates
User-facing latency — exponential backoff means 30+ second waits; show an error instead
Local operations — file not found, parse error — these won't fix themselves on retry

More Production Uses

AWS SDK
Azure SDK
Google Cloud
Envoy — proxy
Celery — Python task queue

Pattern	Relationship
Circuit Breaker	Circuit breaker tells you when to stop retrying entirely
Batch Processing	Failed batch items can be retried with backoff independently
Rate Limiter (Token Bucket)	Jittered backoff prevents retry storms, similar to rate limiting's goal

Challenge Questions

Q1: You remove jitter from your retry logic to make delays "predictable." Under a thundering herd scenario, what happens?

Answer: All clients that failed at the same time retry at exactly the same intervals, repeatedly overloading the recovering service in synchronized waves.

Without jitter, 10,000 clients that got a 503 at t=0 all retry at t=1s, then t=2s, then t=4s — creating periodic traffic spikes that prevent recovery. Jitter spreads retries across the delay window so the recovering service sees a smooth trickle instead of synchronized bursts. This is why every production retry library includes jitter.

Q2: Your service retries a POST /create-order endpoint that is NOT idempotent. The first attempt times out but actually succeeded on the server. What happens on retry?

Answer: The retry creates a duplicate order. The customer gets charged twice.

A timeout does not mean the request failed — it means you don't know if it succeeded. Retrying a non-idempotent operation risks duplication. The fix is to make the operation idempotent using an idempotency key: the client generates a unique ID and the server deduplicates. Without idempotency, you should not retry write operations.

Q3: A downstream service returns HTTP 400 Bad Request. Should you retry with exponential backoff?

Answer: No. A 400 is a client error indicating bad input. Retrying the same request will produce the same error every time.

Retry with backoff is designed for transient failures — 503 Service Unavailable, 429 Too Many Requests, network timeouts, connection resets. A 400 means "your request is malformed," which won't fix itself with time. Retrying it wastes resources and delays the real fix (correcting the input). Always classify errors before deciding to retry.

Q4: Your retry config uses baseDelay=1s, maxDelay=30s, maxRetries=10. A junior engineer asks: "Why not set maxRetries=1000 so we never lose a request?" What's wrong with that?

Answer: With exponential backoff capped at 30s and 1000 retries, the client would spend up to 8+ hours retrying a single request, holding resources the entire time.

High retry counts consume connection pool slots, memory, goroutines/threads, and often hold database transactions or locks open. If the downstream service is truly down, those retries won't help — you need a circuit breaker to fail fast and shed load. In practice, 3-5 retries with backoff is enough to handle transient blips; anything longer should be handled by a persistent queue with dead-letter semantics.

Pattern: Retry with Exponential Backoff ​

One Liner ​

Real-World Analogy ​

Core Idea ​

Production Proof ​

Implementation ​

Exercises ​

When to Use ​

When NOT to Use ​

More Production Uses ​

Related Patterns ​

Challenge Questions ​