Pattern: Tombstone / Deferred Deletion
BeginnerOne Liner
Mark deleted entries with a tombstone marker instead of removing them — a background process reclaims space later.
Interactive Demo ↓Real-World Analogy
A library book marked 'withdrawn' with a sticker but left on the shelf. Patrons see it's no longer available, and the librarian collects withdrawn books in batch during the monthly shelf cleanup.
Core Idea
Instead of immediately removing data, write a special "tombstone" record that shadows the original. Reads check for tombstones and treat marked entries as deleted. A background compaction process later reclaims space by physically removing both the tombstone and the shadowed data. This decouples the fast path (mark deleted) from the slow path (reclaim space).
Write path: Read path:
delete("B") get("B")
│ │
▼ ▼
┌──────────┐ ┌───────────┐
│ Log/SST │ │ Lookup │
├──────────┤ ├───────────┤
│ A = "v1" │ │ Found: │
│ B = tomb │ ◄── tombstone │ B = tomb │──► return NOT FOUND
│ C = "v3" │ │ │
└──────────┘ └───────────┘
Compaction (background):
┌──────────┐ ┌──────────┐
│ A = "v1" │ │ A = "v1" │
│ B = "v2" │ ──► │ C = "v3" │ B removed (tombstone + original)
│ B = tomb │ └──────────┘
│ C = "v3" │
└──────────┘| Property | Value |
|---|---|
| Delete | O(1) — just append a tombstone marker |
| Space reclaim | Deferred — background compaction |
| Read overhead | Must check for tombstones |
| Consistency | Tombstone must propagate to all replicas before removal |
Try it yourself — write entries, delete them with tombstones, and compact to reclaim space:
Production Proof
| Project | Source | Usage |
|---|---|---|
| LevelDB | dbformat.h#L39-L43 | kTypeDeletion (value 0x0) marks a key as deleted in the write-ahead log and SSTables. During compaction (DoCompactionWork in db_impl.cc), tombstones are dropped once no older snapshot references the key. |
| Apache Cassandra | DeletionTime.java#L37-L99 | DeletionTime class represents tombstones with markedForDeleteAt timestamp. isLive() (L99) checks tombstone status on read. Tombstones propagate across replicas during gc_grace_seconds (default 10 days, referenced at L89) before compaction purges them. |
Implementation
interface Entry<V> {
value: V | null;
deleted: boolean;
timestamp: number;
}
class TombstoneStore<V> {
private store = new Map<string, Entry<V>>();
private tombstoneCount = 0;
put(key: string, value: V): void {
this.store.set(key, {
value,
deleted: false,
timestamp: Date.now(),
});
}
get(key: string): V | undefined {
const entry = this.store.get(key);
if (!entry || entry.deleted) return undefined;
return entry.value!;
}
delete(key: string): boolean {
const entry = this.store.get(key);
if (!entry || entry.deleted) return false;
entry.deleted = true;
entry.value = null;
entry.timestamp = Date.now();
this.tombstoneCount++;
return true;
}
/** Compact: remove tombstones older than maxAge ms. */
compact(maxAge: number): number {
const cutoff = Date.now() - maxAge;
let removed = 0;
for (const [key, entry] of this.store) {
if (entry.deleted && entry.timestamp < cutoff) {
this.store.delete(key);
removed++;
this.tombstoneCount--;
}
}
return removed;
}
get size(): number {
let count = 0;
for (const entry of this.store.values()) {
if (!entry.deleted) count++;
}
return count;
}
get pendingTombstones(): number {
return this.tombstoneCount;
}
}use std::collections::HashMap;
use std::time::{SystemTime, UNIX_EPOCH};
struct Entry {
value: Option<String>,
deleted: bool,
timestamp: u128,
}
pub struct TombstoneStore {
store: HashMap<String, Entry>,
tombstone_count: usize,
}
impl TombstoneStore {
pub fn new() -> Self {
TombstoneStore { store: HashMap::new(), tombstone_count: 0 }
}
fn now_ms() -> u128 {
SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis()
}
pub fn put(&mut self, key: &str, value: &str) {
self.store.insert(key.to_string(), Entry {
value: Some(value.to_string()),
deleted: false,
timestamp: Self::now_ms(),
});
}
pub fn get(&self, key: &str) -> Option<&str> {
self.store.get(key)
.filter(|e| !e.deleted)
.and_then(|e| e.value.as_deref())
}
pub fn delete(&mut self, key: &str) -> bool {
if let Some(entry) = self.store.get_mut(key) {
if !entry.deleted {
entry.deleted = true;
entry.value = None;
entry.timestamp = Self::now_ms();
self.tombstone_count += 1;
return true;
}
}
false
}
pub fn compact(&mut self, max_age_ms: u128) -> usize {
let cutoff = Self::now_ms().saturating_sub(max_age_ms);
let to_remove: Vec<String> = self.store.iter()
.filter(|(_, e)| e.deleted && e.timestamp < cutoff)
.map(|(k, _)| k.clone())
.collect();
let count = to_remove.len();
for key in to_remove {
self.store.remove(&key);
}
self.tombstone_count -= count;
count
}
pub fn size(&self) -> usize {
self.store.values().filter(|e| !e.deleted).count()
}
}type Entry struct {
Value string
Deleted bool
Timestamp int64
}
type TombstoneStore struct {
store map[string]*Entry
tombstoneCount int
}
func NewTombstoneStore() *TombstoneStore {
return &TombstoneStore{store: make(map[string]*Entry)}
}
func (s *TombstoneStore) Put(key, value string) {
s.store[key] = &Entry{Value: value, Deleted: false, Timestamp: time.Now().UnixMilli()}
}
func (s *TombstoneStore) Get(key string) (string, bool) {
entry, ok := s.store[key]
if !ok || entry.Deleted {
return "", false
}
return entry.Value, true
}
func (s *TombstoneStore) Delete(key string) bool {
entry, ok := s.store[key]
if !ok || entry.Deleted {
return false
}
entry.Deleted = true
entry.Value = ""
entry.Timestamp = time.Now().UnixMilli()
s.tombstoneCount++
return true
}
func (s *TombstoneStore) Compact(maxAgeMs int64) int {
cutoff := time.Now().UnixMilli() - maxAgeMs
removed := 0
for key, entry := range s.store {
if entry.Deleted && entry.Timestamp < cutoff {
delete(s.store, key)
removed++
s.tombstoneCount--
}
}
return removed
}
func (s *TombstoneStore) Size() int {
count := 0
for _, entry := range s.store {
if !entry.Deleted {
count++
}
}
return count
}import time
class TombstoneStore:
def __init__(self):
self._store: dict[str, dict] = {}
self._tombstone_count = 0
def put(self, key: str, value: str) -> None:
self._store[key] = {
"value": value,
"deleted": False,
"timestamp": time.time() * 1000,
}
def get(self, key: str) -> str | None:
entry = self._store.get(key)
if entry is None or entry["deleted"]:
return None
return entry["value"]
def delete(self, key: str) -> bool:
entry = self._store.get(key)
if entry is None or entry["deleted"]:
return False
entry["deleted"] = True
entry["value"] = None
entry["timestamp"] = time.time() * 1000
self._tombstone_count += 1
return True
def compact(self, max_age_ms: float) -> int:
cutoff = time.time() * 1000 - max_age_ms
to_remove = [
k for k, e in self._store.items()
if e["deleted"] and e["timestamp"] < cutoff
]
for k in to_remove:
del self._store[k]
self._tombstone_count -= len(to_remove)
return len(to_remove)
@property
def size(self) -> int:
return sum(1 for e in self._store.values() if not e["deleted"])
@property
def pending_tombstones(self) -> int:
return self._tombstone_countExercises
| Level | Exercise | File |
|---|---|---|
| Basic | Implement a key-value store with tombstone deletion | exercises/typescript/tombstone/01-basic.test.ts |
| Intermediate | Add time-based compaction and tombstone metrics | exercises/typescript/tombstone/02-intermediate.test.ts |
Run exercises: pnpm test:exercises (TypeScript) · cargo test (Rust) · go test ./... (Go) · pytest (Python)
Exercise files: Rust exercises/rust/src/tombstone/mod.rs · Go exercises/go/tombstone/tombstone_test.go · Python exercises/python/tombstone/test_tombstone.py
When to Use
- LSM-tree storage engines — LevelDB, RocksDB, Cassandra append tombstones; compaction cleans up
- Distributed databases — tombstones propagate deletion intent across replicas before physical removal
- Soft delete in applications — mark records as deleted but keep audit trail; purge after retention period
- Immutable/append-only logs — cannot modify existing entries, so deletion requires a shadow record
- Concurrent data structures — mark nodes deleted to avoid unsafe pointer manipulation during concurrent reads
When NOT to Use
- Mutable in-place storage — if you can directly remove the entry (hash table, mutable array), just remove it
- Memory-constrained systems — tombstones consume space until compaction; if space is tight, immediate deletion is better
- No background processing — compaction requires a background thread/process; if unavailable, tombstones accumulate forever
More Production Uses
- RocksDB —
kTypeDeletionandkTypeSingleDeletiontombstones with configurable compaction triggers - Apache HBase — delete markers propagate to all store files during major compaction
- CockroachDB — MVCC tombstones for range deletions, GC-ed by background job
- Elasticsearch — soft-deleted docs marked with
_deletedflag, purged on segment merge
Related Patterns
| Pattern | Relationship |
|---|---|
| LSM Tree (Log-Structured Merge Tree) | LSM trees use tombstones extensively — they're cleaned up during compaction |
| MVCC (Multi-Version Concurrency Control) | MVCC marks old versions with tombstones for garbage collection |
| Free List | After tombstone cleanup, freed slots can be managed by a free list |
| LRU Cache | LRU caches use tombstones to mark deleted entries in distributed scenarios |
| Reference Counting | Reference counting determines when tombstoned objects can be safely reclaimed |
Challenge Questions
Q1: A Cassandra cluster with gc_grace_seconds=10 days. Node C goes down for 15 days. What happens when C comes back online?
Answer: Node C may resurrect deleted data.
While C was down, other nodes deleted some keys and their tombstones expired (gc_grace_seconds=10 days). When C comes back, it still has the original data without tombstones. During anti-entropy repair, C's "live" data wins because there's no tombstone to contradict it. The deleted data reappears across the cluster.
Fix: Run nodetool repair before gc_grace_seconds expires, or increase gc_grace_seconds to exceed the maximum expected downtime.
Q2: Your LSM-tree database has a "tombstone accumulation" problem — reads are getting slower. Why?
Answer: Tombstones must be checked during reads.
When you read a key, the database must scan from the newest SSTable to the oldest. If it finds a tombstone, it knows the key is deleted — but it still had to read through all the levels to find it. Worse, range scans must check every tombstone in the range to filter deleted keys.
If compaction falls behind or the delete rate is high, tombstones pile up across levels. Solutions: trigger compaction more aggressively on tombstone-heavy SSTables, or use "single delete" (RocksDB) which cancels exactly one put, avoiding tombstone persistence.
Q3: Why can't you just immediately delete the tombstone after all replicas acknowledge the deletion?
Answer: Because of read-repair and anti-entropy.
Even if all currently-live replicas acknowledge the deletion, a temporarily-offline replica might still hold the original data. When it comes back, it would re-introduce the data. The tombstone must persist long enough to "win" conflict resolution against stale data from any replica that was down.
This is why Cassandra uses gc_grace_seconds — it's the maximum expected time for a node to be offline. The tombstone lives at least that long to guarantee it outlives any stale replica.
Q4: Your application performs a bulk delete of 10 million rows using tombstones. Immediately after, a range scan over the deleted range takes 30 seconds instead of the expected 0 seconds. Explain why the range scan isn't instant, even though all rows are "deleted."
Answer: The tombstones themselves are data that must be read and evaluated during the scan.
A range scan doesn't know which keys are deleted until it reads each entry and checks for the tombstone marker. With 10 million tombstones, the scan reads 10 million entries, evaluates each one, and returns zero results. This is the "tombstone scan" problem — the work is proportional to the number of tombstones, not the number of live results. Solutions include: range tombstones (RocksDB's DeleteRange marks an entire key range deleted with a single marker instead of per-key tombstones), immediate compaction of the affected range, or using a separate index that tracks only live keys.