Pattern: Retry with Exponential Backoff

Beginner

One Liner

When an operation fails, retry it with progressively longer delays plus random jitter to avoid a thundering herd.

Real-World Analogy

Calling a busy restaurant for a reservation. You try once, get a busy signal, wait a minute, try again. Still busy? Wait two minutes. Then four. You also vary the timing slightly so that everyone who got a busy signal isn't calling back at the exact same moment.

Core Idea

Instead of retrying immediately (which overloads the failing service) or giving up (which loses the request), exponential backoff doubles the wait time on each retry. Adding jitter randomizes the delay so thousands of clients don't retry simultaneously.

text
  Time ────────────────────────────────────────────────►

  Attempt 1  ✗ ├─┤ 1s
  Attempt 2  ✗ ├───┤ 2s
  Attempt 3  ✗ ├───────┤ 4s
  Attempt 4  ✗ ├───────────────┤ 8s
  Attempt 5  ✗ ├───────────────────────────────┤ 16s (cap)
  Attempt 6  ✓

  Each bar = wait before next retry (doubles each time)
  + jitter: randomize within each bar to avoid thundering herd

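The formula: delay = min(base * 2^attempt * (1 + jitter * random(0, 1)), maxDelay), where attempt starts at 0 and jitter is a fraction between 0 and 1, matching the implementations below.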

| Property | Value |
| --- | --- |
| Delay growth | Exponential: doubles each attempt |
| Max delay | Capped (typically 30–60 s) to bound worst-case wait |
| Jitter | Randomized to prevent thundering herd |
| Total attempts | Bounded (typically 3–10) to avoid infinite loops |
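
To make the formula concrete, here is a minimal sketch that computes the delay for each attempt. The function name `backoff_delay` and its parameters are illustrative assumptions, not part of any library, and jitter is modeled as a fraction of the exponential term, as in the implementations below. With jitter set to 0 it reproduces the bars in the diagram above.

```python
import random

def backoff_delay(attempt, base=1.0, max_delay=16.0, jitter=0.0, rng=random.random):
    """Delay in seconds before the retry that follows failed attempt `attempt` (0-based)."""
    exponential = base * (2 ** attempt)
    # Jitter adds between 0 and `jitter` * exponential on top of the exponential term.
    jittered = exponential * (1 + jitter * rng())
    # The cap bounds the worst-case wait.
    return min(jittered, max_delay)

# Without jitter: the delay doubles until it reaches the 16 s cap.
schedule = [backoff_delay(n) for n in range(6)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 16.0]
```

With `jitter=0.5`, the delay for attempt 2 would fall somewhere between 4 s and 6 s instead of being exactly 4 s.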

Production Proof

| Project | Source | Usage |
| --- | --- | --- |
| Kubernetes | backoff.go#L30-L50 | Backoff struct defines Duration, Factor, Jitter, Steps, Cap. ExponentialBackoff (line 475) retries with this config. Used for pod restart backoff, API server retries, controller reconciliation. |
| gRPC-Go | backoff.go#L56-L75 | Exponential.Backoff() computes an exponential delay with jitter. The delay grows by the configured Multiplier on each retry (not necessarily doubling) and is capped at MaxDelay. RunF (L86-L109) is the retry orchestration loop with context cancellation and ErrResetBackoff support. |

Implementation

typescript
interface BackoffConfig {
  maxRetries: number;
  baseDelay: number;
  maxDelay: number;
  jitter: number; // 0-1
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  config: BackoffConfig = { maxRetries: 5, baseDelay: 1000, maxDelay: 30000, jitter: 0.5 },
): Promise<T> {
  let lastError: Error | undefined;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err as Error;
      if (attempt === config.maxRetries) break;

      const exponential = config.baseDelay * Math.pow(2, attempt);
      const jitter = exponential * config.jitter * Math.random();
      const delay = Math.min(exponential + jitter, config.maxDelay);

      await new Promise((r) => setTimeout(r, delay));
    }
  }

  throw lastError;
}
rust
use std::time::Duration;

pub struct Backoff {
    pub max_retries: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

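impl Backoff {
    /// Delay before the retry that follows failed attempt `attempt` (0-based).
    /// Note: this variant applies only the capped exponential delay, no jitter.
    pub fn delay_for(&self, attempt: u32) -> Duration {
        // Saturate instead of overflowing when `attempt` is large.
        let factor = 2u64.saturating_pow(attempt);
        let base_ms = self.base_delay.as_millis() as u64;
        let capped = base_ms
            .saturating_mul(factor)
            .min(self.max_delay.as_millis() as u64);
        Duration::from_millis(capped)
    }
}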
go
package backoff

import (
	"math"
	"math/rand"
	"time"
)

type Config struct {
	MaxRetries int
	BaseDelay  time.Duration
	MaxDelay   time.Duration
	Jitter     float64
}

func Retry(fn func() error, cfg Config) error {
	var lastErr error
	for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
		lastErr = fn()
		if lastErr == nil {
			return nil
		}
		if attempt == cfg.MaxRetries {
			break
		}
		exp := float64(cfg.BaseDelay) * math.Pow(2, float64(attempt))
		jitter := exp * cfg.Jitter * rand.Float64()
		delay := time.Duration(math.Min(exp+jitter, float64(cfg.MaxDelay)))
		time.Sleep(delay)
	}
	return lastErr
}
python
import time
import random

def retry_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0, jitter=0.5):
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as e:
            last_error = e
            if attempt == max_retries:
                break
            exponential = base_delay * (2 ** attempt)
            delay = min(exponential + exponential * jitter * random.random(), max_delay)
            time.sleep(delay)
    raise last_error

Exercises

| Level | Exercise | File |
| --- | --- | --- |
| Basic | Implement retry with configurable backoff | exercises/typescript/retry-backoff/01-basic.test.ts |
| Intermediate | Retry with circuit breaker integration | exercises/typescript/retry-backoff/02-intermediate.test.ts |

Run exercises: pnpm test:exercises (TypeScript) · cargo test (Rust) · go test ./... (Go) · pytest (Python)

Exercise files: Rust exercises/rust/src/retry_backoff/mod.rs · Go exercises/go/retry_backoff/retry_backoff_test.go · Python exercises/python/retry_backoff/test_retry_backoff.py

When to Use

  • Network requests — HTTP calls, database connections, RPC
  • Distributed systems — service-to-service calls that may transiently fail
  • Rate-limited APIs — back off when hitting rate limits (often 429 responses)
  • Queue consumers — retry failed message processing

When NOT to Use

  • Non-transient errors — 400 Bad Request won't succeed on retry; validate input instead
  • Idempotency not guaranteed — retrying a non-idempotent POST could create duplicates
  • User-facing latency — exponential backoff means 30+ second waits; show an error instead
  • Local operations — file not found, parse error — these won't fix themselves on retry
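
For the user-facing case, one option is to bound the total time spent retrying rather than the number of attempts. The sketch below is an assumption-laden illustration: `retry_within_budget`, its `budget` parameter, and the injectable `clock` and `sleep` arguments are hypothetical names chosen here, not part of the implementations above. It gives up and re-raises the last error as soon as the next wait would overrun the budget, so the caller can show an error promptly.

```python
import random
import time

def retry_within_budget(fn, budget=2.0, base_delay=0.1, jitter=0.5,
                        clock=time.monotonic, sleep=time.sleep):
    """Retry `fn` with backoff, but never sleep past `budget` seconds in total."""
    deadline = clock() + budget
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            exponential = base_delay * (2 ** attempt)
            delay = exponential * (1 + jitter * random.random())
            # Give up (re-raising the last error) if the next wait would overrun the budget.
            if clock() + delay > deadline:
                raise
            sleep(delay)
            attempt += 1
```

Passing `clock` and `sleep` as parameters is a design choice that keeps the function testable without real waiting.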

More Production Uses

| Pattern | Relationship |
| --- | --- |
| Circuit Breaker | Circuit breaker tells you when to stop retrying entirely |
| Batch Processing | Failed batch items can be retried with backoff independently |
| Rate Limiter (Token Bucket) | Jittered backoff prevents retry storms, similar to rate limiting's goal |

Challenge Questions

Q1: You remove jitter from your retry logic to make delays "predictable." Under a thundering herd scenario, what happens?

Answer: All clients that failed at the same time retry at exactly the same intervals, repeatedly overloading the recovering service in synchronized waves.

Without jitter, 10,000 clients that got a 503 at t=0 all retry at t=1s, then t=3s, then t=7s (after waits of 1s, 2s, and 4s), creating periodic traffic spikes that prevent recovery. Jitter spreads retries across the delay window so the recovering service sees a smooth trickle instead of synchronized bursts. This is why production retry libraries, including the Kubernetes and gRPC-Go backoff implementations listed above, include jitter.
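
The effect can be checked with a small simulation. This is a sketch under simplifying assumptions: every client fails at t=0, each retry fails instantly, and load is measured as the number of retries landing in the same 100 ms bucket. The helper names `retry_times` and `peak_load` are made up for this example.

```python
import random
from collections import Counter

def retry_times(num_retries, base=1.0, jitter=0.0, rng=None):
    """Times (seconds after the initial failure at t=0) at which each retry fires."""
    if rng is None:
        rng = random.Random()
    t, times = 0.0, []
    for attempt in range(num_retries):
        exponential = base * (2 ** attempt)
        t += exponential * (1 + jitter * rng.random())
        times.append(t)
    return times

def peak_load(num_clients, jitter, seed=42):
    """Largest number of retries that land in the same 100 ms bucket."""
    rng = random.Random(seed)
    buckets = Counter()
    for _ in range(num_clients):
        for t in retry_times(3, jitter=jitter, rng=rng):
            buckets[int(t * 10)] += 1
    return max(buckets.values())

print(peak_load(10_000, jitter=0.0))  # 10000: every client retries at the same instant
print(peak_load(10_000, jitter=0.5))  # much smaller: retries are spread over the window
```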

Q2: Your service retries a POST /create-order endpoint that is NOT idempotent. The first attempt times out but actually succeeded on the server. What happens on retry?

Answer: The retry creates a duplicate order. The customer gets charged twice.

A timeout does not mean the request failed — it means you don't know if it succeeded. Retrying a non-idempotent operation risks duplication. The fix is to make the operation idempotent using an idempotency key: the client generates a unique ID and the server deduplicates. Without idempotency, you should not retry write operations.
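
A minimal sketch of server-side deduplication with an idempotency key follows. `OrderService` is a hypothetical in-memory stand-in for the real server, and the key format (a UUID) is an assumption; a real service would persist keys and expire them.

```python
import uuid

class OrderService:
    """Hypothetical in-memory server that deduplicates by idempotency key."""

    def __init__(self):
        self.orders = []    # every order actually created
        self._by_key = {}   # idempotency key -> order returned for that key

    def create_order(self, idempotency_key, item):
        # A repeated key means the client is retrying: return the stored result
        # instead of creating (and charging for) a second order.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        order = {"id": len(self.orders) + 1, "item": item}
        self.orders.append(order)
        self._by_key[idempotency_key] = order
        return order

service = OrderService()
key = str(uuid.uuid4())  # generated once by the client, reused on every retry

first = service.create_order(key, "book")  # succeeds, but suppose the reply is lost
retry = service.create_order(key, "book")  # the client retries with the same key

print(len(service.orders))  # 1
print(first == retry)       # True
```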

Q3: A downstream service returns HTTP 400 Bad Request. Should you retry with exponential backoff?

Answer: No. A 400 is a client error indicating bad input. Retrying the same request will produce the same error every time.

Retry with backoff is designed for transient failures — 503 Service Unavailable, 429 Too Many Requests, network timeouts, connection resets. A 400 means "your request is malformed," which won't fix itself with time. Retrying it wastes resources and delays the real fix (correcting the input). Always classify errors before deciding to retry.
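
One way to classify errors before retrying is sketched below. `HttpError`, `is_retryable`, and `retry_if_transient` are hypothetical names introduced for this example, and the set of retryable statuses is limited to the ones named above; a real client would map its own library's exceptions.

```python
import random
import time

RETRYABLE_STATUS = {429, 503}  # transient statuses named above; extend as needed

class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def is_retryable(err):
    if isinstance(err, HttpError):
        return err.status in RETRYABLE_STATUS
    # Timeouts and connection resets are treated as transient.
    return isinstance(err, (TimeoutError, ConnectionError))

def retry_if_transient(fn, max_retries=5, base_delay=1.0, max_delay=30.0,
                       jitter=0.5, sleep=time.sleep):
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as err:
            # Fail fast on permanent errors and once the retry budget is spent.
            if not is_retryable(err) or attempt == max_retries:
                raise
            exponential = base_delay * (2 ** attempt)
            sleep(min(exponential * (1 + jitter * random.random()), max_delay))
```

With this shape, a 400 surfaces to the caller after a single attempt, while a 503 is retried with backoff.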

Q4: Your retry config uses baseDelay=1s, maxDelay=30s, maxRetries=10. A junior engineer asks: "Why not set maxRetries=1000 so we never lose a request?" What's wrong with that?

Answer: With delays capped at 30s and 1000 retries, the client could spend more than 8 hours (roughly 1000 × 30s) retrying a single request, holding resources the entire time.

High retry counts consume connection pool slots, memory, goroutines/threads, and often hold database transactions or locks open. If the downstream service is truly down, those retries won't help — you need a circuit breaker to fail fast and shed load. In practice, 3-5 retries with backoff is enough to handle transient blips; anything longer should be handled by a persistent queue with dead-letter semantics.
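
The arithmetic behind that answer can be checked with a short sketch. The helper `worst_case_wait` is an illustrative name; it ignores jitter and the time each attempt itself takes, so it only approximates the total time spent sleeping when every attempt fails.

```python
def worst_case_wait(max_retries, base_delay=1.0, max_delay=30.0):
    """Total seconds spent sleeping if every attempt fails (jitter ignored)."""
    total = 0.0
    delay = base_delay
    for _ in range(max_retries):
        total += min(delay, max_delay)
        # Double the delay, but never let it grow past the cap.
        delay = min(delay * 2, max_delay)
    return total

print(worst_case_wait(10))           # 181.0 seconds, about 3 minutes
print(worst_case_wait(1000) / 3600)  # about 8.3 hours
```

The first five waits (1 + 2 + 4 + 8 + 16 = 31 s) are below the cap; every wait after that costs the full 30 s, which is why the total grows linearly with the retry count.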

Released under the MIT License.