Pattern: Work Stealing

Advanced

One Liner

Idle threads steal tasks from busy threads' queues — balancing load dynamically without central coordination.

Real-World Analogy

A team of cashiers at a supermarket. When one cashier finishes their line, they walk to the busiest cashier and take customers from the back of that line. Work naturally flows from overloaded lanes to idle ones.

Core Idea

Each worker owns a local deque (double-ended queue). Workers push/pop tasks from their own deque's top (LIFO for cache locality). When a worker's deque is empty, it steals from another worker's deque bottom (FIFO for fairness). This achieves automatic load balancing without a central scheduler bottleneck.

text

  Worker 0 (busy)         Worker 1 (idle)        Worker 2 (busy)
  ┌──────────────┐        ┌──────────────┐       ┌──────────────┐
  │ Task D ← pop │        │   (empty)    │       │ Task G ← pop │
  │ Task C       │        │              │       │ Task F       │
  │ Task B       │◄───────│  STEAL ────► │       │              │
  │ Task A       │  steal │              │       │              │
  └──────────────┘  from  └──────────────┘       └──────────────┘
        ↑ bottom                                        ↑ bottom

Property	Value
Push/pop own	O(1) — no synchronization needed
Steal	O(1) — CAS on victim's deque bottom
Load balance	Automatic, decentralized
Cache locality	High — LIFO on own work, FIFO on stolen

Try it yourself — add tasks to one worker and start processing to see idle workers steal tasks:

Production Proof

Project	Source	Usage
Go runtime	proc.go#L3836-L3903	`stealWork` — the goroutine scheduler's steal loop. Iterates 4× over all P's in random order, calling `runqsteal` (L7774-L7791) to CAS-grab half the runnable goroutines from a victim P's local run queue. Low-level `runqgrab` (L7706-L7769) uses atomic CAS on `runqhead`.
Tokio (Rust)	worker.rs#L1136-L1175	`Core::steal_work` — iterates over remote workers from a random index, calls `steal_into` on each worker's steal queue. Only attempts stealing if fewer than half the workers are already searching. Falls back to the global inject queue.

Implementation

TypeScriptRustGoPython

typescript

class WorkStealingScheduler {
  private queues: number[][];

  constructor(workerCount: number) {
    this.queues = Array.from({ length: workerCount }, () => []);
  }

  submit(task: number, workerIdx: number): void {
    this.queues[workerIdx]!.push(task);
  }

  run(process: (task: number) => number): number[] {
    const results: number[] = [];
    let anyWork = true;
    while (anyWork) {
      anyWork = false;
      for (let w = 0; w < this.queues.length; w++) {
        if (this.queues[w]!.length > 0) {
          anyWork = true;
          const task = this.queues[w]!.pop()!;
          results.push(process(task));
        } else {
          for (let other = 0; other < this.queues.length; other++) {
            if (other !== w && this.queues[other]!.length > 1) {
              anyWork = true;
              const stolen = this.queues[other]!.shift()!;
              results.push(process(stolen));
              break;
            }
          }
        }
      }
    }
    return results;
  }
}

rust

use std::collections::VecDeque;

pub struct WorkStealingScheduler {
    queues: Vec<VecDeque<i32>>,
}

impl WorkStealingScheduler {
    pub fn new(worker_count: usize) -> Self {
        WorkStealingScheduler {
            queues: (0..worker_count).map(|_| VecDeque::new()).collect(),
        }
    }

    pub fn submit(&mut self, task: i32, worker_idx: usize) {
        self.queues[worker_idx].push_back(task);
    }

    pub fn run(&mut self, process: fn(i32) -> i32) -> Vec<i32> {
        let mut results = Vec::new();
        loop {
            let mut any_work = false;
            for w in 0..self.queues.len() {
                if !self.queues[w].is_empty() {
                    any_work = true;
                    let task = self.queues[w].pop_back().unwrap();
                    results.push(process(task));
                } else {
                    let len = self.queues.len();
                    for other in 0..len {
                        if other != w && self.queues[other].len() > 1 {
                            any_work = true;
                            let stolen = self.queues[other].pop_front().unwrap();
                            results.push(process(stolen));
                            break;
                        }
                    }
                }
            }
            if !any_work { break; }
        }
        results
    }
}

type WorkStealingScheduler struct {
	queues [][]int
}

func NewScheduler(workerCount int) *WorkStealingScheduler {
	queues := make([][]int, workerCount)
	for i := range queues {
		queues[i] = []int{}
	}
	return &WorkStealingScheduler{queues: queues}
}

func (s *WorkStealingScheduler) Submit(task, workerIdx int) {
	s.queues[workerIdx] = append(s.queues[workerIdx], task)
}

func (s *WorkStealingScheduler) Run(process func(int) int) []int {
	var results []int
	for {
		anyWork := false
		for w := 0; w < len(s.queues); w++ {
			if len(s.queues[w]) > 0 {
				anyWork = true
				last := len(s.queues[w]) - 1
				task := s.queues[w][last]
				s.queues[w] = s.queues[w][:last]
				results = append(results, process(task))
			} else {
				for other := 0; other < len(s.queues); other++ {
					if other != w && len(s.queues[other]) > 1 {
						anyWork = true
						stolen := s.queues[other][0]
						s.queues[other] = s.queues[other][1:]
						results = append(results, process(stolen))
						break
					}
				}
			}
		}
		if !anyWork {
			break
		}
	}
	return results
}

python

from collections import deque

class WorkStealingScheduler:
    def __init__(self, worker_count: int):
        self.queues: list[deque[int]] = [deque() for _ in range(worker_count)]

    def submit(self, task: int, worker_idx: int) -> None:
        self.queues[worker_idx].append(task)

    def run(self, process) -> list[int]:
        results: list[int] = []
        while True:
            any_work = False
            for w in range(len(self.queues)):
                if self.queues[w]:
                    any_work = True
                    task = self.queues[w].pop()
                    results.append(process(task))
                else:
                    for other in range(len(self.queues)):
                        if other != w and len(self.queues[other]) > 1:
                            any_work = True
                            stolen = self.queues[other].popleft()
                            results.append(process(stolen))
                            break
            if not any_work:
                break
        return results

Exercises

Level	Exercise	File
Basic	Implement a work-stealing scheduler with local deques	`exercises/typescript/work-stealing/01-basic.test.ts`
Intermediate	Priority work stealing — high-priority tasks first	`exercises/typescript/work-stealing/02-intermediate.test.ts`

Run exercises: pnpm test:exercises (TypeScript) · cargo test (Rust) · go test ./... (Go) · pytest (Python)

Exercise files: Rust exercises/rust/src/work_stealing/mod.rs · Go exercises/go/work_stealing/work_stealing_test.go · Python exercises/python/work_stealing/test_work_stealing.py

When to Use

Parallel runtimes — goroutine scheduler (Go), task scheduler (Tokio, Java ForkJoinPool)
Divide-and-conquer — recursive task decomposition where subtasks vary in size
Irregular workloads — when task durations are unpredictable
NUMA-aware scheduling — steal from far only when local work is depleted

When NOT to Use

Single-threaded — no other workers to steal from
Uniform tasks — static partitioning is simpler and equally effective
Very short tasks — steal overhead dominates task execution time
Strict ordering — work stealing disrupts FIFO order by design

More Production Uses

Java ForkJoinPool — scan method with randomized work stealing
Rayon (Rust) — data-parallelism library with work-stealing thread pool
Intel TBB — task_arena with work-stealing scheduler
Cilk — pioneered work stealing for fork-join parallelism

Pattern	Relationship
Cooperative Scheduling	Work stealing distributes tasks across threads; cooperative scheduling yields within a thread
Object Pool	Worker threads use per-thread object pools to avoid contention
Free List	Per-thread free lists complement work stealing by providing lock-free allocation

Challenge Questions

Q1: Workers pop from their own deque using LIFO (top), but steal from others using FIFO (bottom). Why not use FIFO for both?

Answer: LIFO on your own deque gives cache locality — the most recently pushed task is likely still in CPU cache. FIFO stealing takes the oldest (largest) task from the victim, giving the thief more work to do before needing to steal again.

In divide-and-conquer workloads, the bottom of the deque holds the earliest-spawned (coarsest-grained) tasks. Stealing one large task is better than stealing many small ones because it amortizes the steal overhead and gives the thief a chunk of work it can subdivide locally. LIFO for local pops also naturally implements depth-first execution, which uses less stack space.

Q2: Go's runtime steals half the victim's run queue instead of just one task. Why is "steal half" better than "steal one"?

Answer: Stealing one task means the thief may finish quickly and immediately need to steal again, causing repeated contention on the victim's deque. Stealing half amortizes the synchronization cost.

Each steal operation requires atomic CAS on the victim's deque, which is expensive. If you steal only one task, a worker with an empty queue may steal dozens of times per millisecond. Stealing half the queue in one operation means the thief has enough local work to stay busy, reducing total steal attempts and contention. The Go runtime's runqgrab does exactly this with a single atomic operation.

Q3: What is the ABA problem in the context of a lock-free work-stealing deque, and why does it matter?

Answer: The ABA problem occurs when a CAS succeeds because the value matches, but the underlying state has changed between the read and the CAS — another thread modified and restored the original value.

In a lock-free deque, a thief reads the bottom index as value A, gets preempted, the owner pops and pushes (bottom goes A -> B -> A), and the thief's CAS on the bottom index succeeds even though the deque content is different. This can cause a task to be executed twice or skipped. The fix is to use a tagged pointer or generation counter so CAS detects the intermediate changes. This is why Tokio and Go use epoch/version counters alongside deque indices.

Q4: You have 8 workers and 8 identical long-running tasks, one per worker. Is work stealing helping here?

Answer: No. If every worker has exactly one task of equal duration, no worker finishes early, so no stealing ever occurs. Work stealing adds zero benefit and a small overhead from the idle-check logic.

Work stealing shines when workloads are irregular — some tasks finish quickly and the worker can help others. With perfectly balanced, uniform tasks, static partitioning (assign one task per worker) is simpler and equally effective. Work stealing's overhead (deque management, random victim selection, CAS operations) is wasted when there's nothing to steal.

Pattern: Work Stealing ​

One Liner ​

Real-World Analogy ​

Core Idea ​

Production Proof ​

Implementation ​

Exercises ​

When to Use ​

When NOT to Use ​

More Production Uses ​

Related Patterns ​

Challenge Questions ​