Flaky tests are one of the most corrosive problems a QA team can face: they erode trust in the entire suite, eat hours in reruns, and tend to hide real bugs behind a shrug of "just rerun it." Deterministic simulation testing (DST) offers a different way to handle the nondeterminism that causes most of this flakiness, by putting time, randomness, and I/O fully under the test's control instead of leaving them to chance. This post explains how DST works, builds a small working example in Python, and covers how to start using it in a QA workflow.
You've seen this ticket before. A test fails in CI once every few hundred runs. The stack trace points somewhere into a retry loop. You rerun it locally ten times and it passes every single time. You add some logging, rerun it in CI, and now it passes there too. Three weeks later the same bug surfaces in production, during a deploy, and now it's an incident instead of a ticket.
This is the signature failure mode of systems with concurrency, networking, or timing baked into their behavior, which is most systems worth testing. Traditional tests struggle here because they rely on the same nondeterminism that's causing the bug: real clocks, real thread scheduling, real (or "real-ish") network calls. You can't reproduce the failure because you can't control the conditions that caused it.
Deterministic simulation testing (DST) is a response to exactly this problem. Instead of running your system against real time, real randomness, and real I/O, you run it inside a simulated environment where you control all three, driven entirely by a single seed. Same seed, same execution, same bug, every time. If a run fails, you don't need to "try to reproduce it." You just rerun the seed.
The core idea is simple to state: identify every source of nondeterminism in your system, and replace it with a version you control, all driven by one seed.
The usual suspects are time (clocks, sleeps, and timeouts), randomness, and I/O (network and disk).
Once those three are under your control, your system's behavior becomes a pure function of the seed. That unlocks three things that ordinary tests can't give you:
DST isn't a replacement for unit tests. It's a layer above them, aimed specifically at the bugs that only show up from the interaction of timing, faults, and concurrency.
Start with the most common offender: time.sleep() and time.time() scattered through your code. The fix is to never call them directly. Instead, depend on a clock object you can substitute, built up here piece by piece.
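One way to read "a clock object you can substitute" is as a small interface that both a production clock and a simulated clock can satisfy. The sketch below is an assumption, not something from the original codebase: the names Clock, RealClock, and remind are hypothetical, and the production side uses threading.Timer purely for illustration.

```python
import threading
import time
from typing import Callable, Protocol


class Clock(Protocol):
    """Hypothetical interface: everything the code under test may ask of time."""

    def now(self) -> float: ...

    def call_later(self, delay: float, callback: Callable[[], None]) -> None: ...


class RealClock:
    """Illustrative production clock (an assumption): real time, real timers."""

    def now(self) -> float:
        return time.monotonic()

    def call_later(self, delay: float, callback: Callable[[], None]) -> None:
        timer = threading.Timer(delay, callback)
        timer.daemon = True  # a pending timer should not keep the process alive
        timer.start()


def remind(clock: Clock, delay: float, on_fire: Callable[[], None]) -> None:
    """Code under test only sees the Clock interface, never time.sleep()."""
    clock.call_later(delay, on_fire)
```

Any object with the same two methods can be passed to remind, which is what lets a test hand in a simulated clock without touching the code under test.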
A virtual clock needs to track two things: what time it currently thinks it is, and what's scheduled to happen next. The "what's scheduled" part is just a list of pending callbacks ordered by when they should fire, which a heap handles well.
import heapq

class VirtualClock:
    """A clock that only advances when something asks it to."""

    def __init__(self):
        self._now = 0.0
        self._events = []  # heap of (fire_time, seq, callback)
        self._seq = 0

    def now(self) -> float:
        return self._now
Notice that now() never advances on its own. Time only moves forward when something explicitly schedules or processes an event, which is the entire point: nothing in the code under test can sneak in real wall-clock behavior through the back door.
Scheduling is handled by call_later, which is what replaces every time.sleep() in the code under test. Instead of blocking for a delay, it records "fire this callback at now + delay" and returns immediately. The _seq counter is just a tie-breaker: if two events land on the exact same simulated time, the heap needs some way to order them, since the callbacks themselves aren't directly comparable.
    def call_later(self, delay: float, callback):
        self._seq += 1
        heapq.heappush(self._events, (self._now + delay, self._seq, callback))
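To see why the tie-breaker matters, here is a small standalone sketch (the variable names are made up for illustration). Without a sequence number, two events at the same fire time force the heap to compare the callbacks themselves, which Python functions don't support.

```python
import heapq

# Without a tie-breaker: (fire_time, callback)
events = []
heapq.heappush(events, (1.0, lambda: "first"))
try:
    # Same fire time, so tuple comparison falls through to the callbacks.
    heapq.heappush(events, (1.0, lambda: "second"))
    tie_error = None
except TypeError as exc:
    tie_error = exc  # functions cannot be ordered with "<"

# With a tie-breaker: (fire_time, seq, callback)
safe_events = []
heapq.heappush(safe_events, (1.0, 1, lambda: "first"))
heapq.heappush(safe_events, (1.0, 2, lambda: "second"))

# Ties are settled by seq, so events fire in the order they were scheduled.
order = [heapq.heappop(safe_events)[2]() for _ in range(2)]
```

A side benefit is that same-time events fire in scheduling order, so the ordering itself is deterministic rather than an accident of heap layout.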
The last piece, run_until_idle, is where simulated time actually advances. Rather than waiting in real time, it repeatedly pops the next scheduled event off the heap, jumps _now straight to that event's fire time, and runs its callback. If that callback schedules further events, as a retry loop would, those simply get pushed onto the same heap and picked up in the same loop, until nothing is left to fire.
    def run_until_idle(self):
        """Advance time by jumping straight to the next scheduled event."""
        while self._events:
            fire_time, _, callback = heapq.heappop(self._events)
            self._now = fire_time
            callback()
Put together, the full class looks like this, and it's what gets imported in the examples below:
import heapq

class VirtualClock:
    """A clock that only advances when something asks it to."""

    def __init__(self):
        self._now = 0.0
        self._events = []  # heap of (fire_time, seq, callback)
        self._seq = 0

    def now(self) -> float:
        return self._now

    def call_later(self, delay: float, callback):
        self._seq += 1
        heapq.heappush(self._events, (self._now + delay, self._seq, callback))

    def run_until_idle(self):
        """Advance time by jumping straight to the next scheduled event."""
        while self._events:
            fire_time, _, callback = heapq.heappop(self._events)
            self._now = fire_time
            callback()
Instead of time.sleep(2), code calls clock.call_later(2, on_done). No wall-clock time passes at all. The clock jumps directly to the next scheduled event. A test involving a 30-second timeout and a 6-hour retry window can now run in a few milliseconds, deterministically, because "time passing" is just popping items off a heap.
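As a rough illustration of that claim, the snippet below schedules a 30-second timeout and a 6-hour retry window on the virtual clock and measures how much real time the run takes. The class is repeated so the snippet runs on its own; the scenario and the log labels are invented for the example.

```python
import heapq
import time


class VirtualClock:
    """Same behavior as the class above, repeated so this snippet is self-contained."""

    def __init__(self):
        self._now = 0.0
        self._events = []  # heap of (fire_time, seq, callback)
        self._seq = 0

    def now(self) -> float:
        return self._now

    def call_later(self, delay: float, callback):
        self._seq += 1
        heapq.heappush(self._events, (self._now + delay, self._seq, callback))

    def run_until_idle(self):
        while self._events:
            fire_time, _, callback = heapq.heappop(self._events)
            self._now = fire_time
            callback()


clock = VirtualClock()
log = []

# Scheduled out of order on purpose: the heap sorts them by fire time.
clock.call_later(6 * 60 * 60, lambda: log.append(("retry window closed", clock.now())))
clock.call_later(30, lambda: log.append(("timeout fired", clock.now())))

started = time.perf_counter()
clock.run_until_idle()
elapsed_real = time.perf_counter() - started  # real seconds spent simulating 6 hours
```

After the run, clock.now() reports 21600.0 simulated seconds, while elapsed_real is only the cost of two heap pops and two callbacks.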
For randomness, the fix is more about discipline than cleverness: never call the global random module directly. Instead, pass a random.Random(seed) instance into anything that needs randomness, and thread it through your whole call graph.
import random

class FlakyService:
    def __init__(self, rng: random.Random, failure_rate: float = 0.1):
        self.rng = rng
        self.failure_rate = failure_rate

    def handle_request(self) -> bool:
        return self.rng.random() >= self.failure_rate
It looks trivial, but this one change is what makes everything downstream reproducible. If FlakyService reached into the global random module, two runs with "the same seed" could still diverge the moment something else in the process also drew from that global generator. Isolating the RNG per-component (or passing one shared, seeded instance everywhere) is what keeps the whole simulation's behavior pinned to a single seed.
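The divergence is easy to demonstrate. In the sketch below, outcomes uses an injected random.Random and is unaffected by other code drawing from the global generator, while global_draws (a hypothetical stand-in for code that leans on the global module) shifts by one position as soon as something else takes a single draw.

```python
import random


class FlakyService:
    """Repeated from above so the snippet is self-contained."""

    def __init__(self, rng: random.Random, failure_rate: float = 0.1):
        self.rng = rng
        self.failure_rate = failure_rate

    def handle_request(self) -> bool:
        return self.rng.random() >= self.failure_rate


def outcomes(seed: int, n: int = 10) -> list:
    """Injected RNG: results depend only on the seed."""
    service = FlakyService(random.Random(seed), failure_rate=0.5)
    return [service.handle_request() for _ in range(n)]


def global_draws(seed: int, n: int = 10, interference: int = 0) -> list:
    """Global RNG: any other consumer of the stream shifts every later draw."""
    random.seed(seed)
    for _ in range(interference):
        random.random()  # stand-in for unrelated code using the global generator
    return [random.random() for _ in range(n)]


first = outcomes(seed=42)
random.random()              # unrelated global draw between the two runs
second = outcomes(seed=42)   # still identical to the first run

clean = global_draws(seed=42)
disturbed = global_draws(seed=42, interference=1)  # same seed, shifted stream
```

Here first and second match exactly, while clean and disturbed differ even though both were "seeded with 42": disturbed is simply clean shifted by one draw.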
The simulated network is the piece that does the most work in real-world DST setups. Instead of your client and server talking over real sockets, they talk through an in-process "network" that you control, one that can delay, drop, duplicate, or reorder messages, all driven by the seeded RNG and the virtual clock.
class SimulatedNetwork:
    def __init__(self, clock: VirtualClock, rng: random.Random,
                 drop_rate: float = 0.0, max_latency: float = 0.5):
        self.clock = clock
        self.rng = rng
        self.drop_rate = drop_rate
        self.max_latency = max_latency

    def send(self, message, deliver_callback):
        if self.rng.random() < self.drop_rate:
            return  # message vanishes, as if lost in transit
        latency = self.rng.uniform(0, self.max_latency)
        self.clock.call_later(latency, lambda: deliver_callback(message))
Notice that nothing here is "random" in the uncontrolled sense: drop_rate and max_latency are knobs you choose, and the actual decisions (drop or not, how much latency) come from the seeded RNG. Run the same seed twice and you get the exact same drops, at the exact same simulated moments.
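A quick way to check that claim is to record every delivery with its simulated timestamp and compare two runs. The helper delivery_log below is a hypothetical name introduced for this sketch, and the rates are arbitrary; the two classes are repeated so the snippet runs standalone.

```python
import heapq
import random


class VirtualClock:
    """Repeated from above so the snippet is self-contained."""

    def __init__(self):
        self._now = 0.0
        self._events = []
        self._seq = 0

    def now(self) -> float:
        return self._now

    def call_later(self, delay: float, callback):
        self._seq += 1
        heapq.heappush(self._events, (self._now + delay, self._seq, callback))

    def run_until_idle(self):
        while self._events:
            fire_time, _, callback = heapq.heappop(self._events)
            self._now = fire_time
            callback()


class SimulatedNetwork:
    def __init__(self, clock, rng, drop_rate=0.0, max_latency=0.5):
        self.clock = clock
        self.rng = rng
        self.drop_rate = drop_rate
        self.max_latency = max_latency

    def send(self, message, deliver_callback):
        if self.rng.random() < self.drop_rate:
            return  # dropped
        latency = self.rng.uniform(0, self.max_latency)
        self.clock.call_later(latency, lambda: deliver_callback(message))


def delivery_log(seed: int, n_messages: int = 20) -> list:
    """Send n messages at t=0 and record (arrival_time, message) for each delivery."""
    clock = VirtualClock()
    net = SimulatedNetwork(clock, random.Random(seed), drop_rate=0.3, max_latency=0.5)
    log = []
    for i in range(n_messages):
        net.send(i, lambda msg: log.append((clock.now(), msg)))
    clock.run_until_idle()
    return log


run_a = delivery_log(seed=1)
run_b = delivery_log(seed=1)  # identical drops, latencies, and arrival order
```

Because each message gets its own latency, arrival order can differ from send order, so reordering falls out of this design without any extra code.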
Let's put this together on something every QA engineer has tested before: a client that retries a request with jittered backoff.
The naive version looks reasonable, but is genuinely hard to test:
import time
import random

def fetch_with_retry(service, max_attempts=5):
    for attempt in range(max_attempts):
        if service.handle_request():
            return "success"
        backoff = (2 ** attempt) + random.uniform(0, 1)
        time.sleep(backoff)  # real wall-clock sleep
    return "failed"
To test this properly you'd want to know: does it actually give up after 5 attempts? Does the backoff growth match what you expect? Does it behave correctly when the service fails on attempts 1, 3, and 4 specifically? With real time.sleep and global random, you can't pin any of that down. You're either mocking so much of the function that you're not really testing it, or you're waiting through real backoff delays and hoping the random failures line up the way you want.
The deterministic version swaps in the virtual clock and the seeded RNG we built above:
def fetch_with_retry_sim(clock, rng, service, max_attempts=5):
    result = {"status": None}

    def attempt(n):
        if n >= max_attempts:
            result["status"] = "failed"
            return
        if service.handle_request():
            result["status"] = "success"
            return
        backoff = (2 ** n) + rng.uniform(0, 1)
        clock.call_later(backoff, lambda: attempt(n + 1))

    attempt(0)
    clock.run_until_idle()
    return result["status"]
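With time and randomness injected, the questions raised earlier become plain assertions. The sketch below uses a hypothetical test double, ScriptedService, that replays a fixed list of outcomes; it is not part of the original example, and the bounds follow from the backoff formula (2 ** n plus up to 1 second of jitter per retry).

```python
import heapq
import random


class VirtualClock:
    """Repeated from above so the snippet is self-contained."""

    def __init__(self):
        self._now = 0.0
        self._events = []
        self._seq = 0

    def now(self) -> float:
        return self._now

    def call_later(self, delay: float, callback):
        self._seq += 1
        heapq.heappush(self._events, (self._now + delay, self._seq, callback))

    def run_until_idle(self):
        while self._events:
            fire_time, _, callback = heapq.heappop(self._events)
            self._now = fire_time
            callback()


def fetch_with_retry_sim(clock, rng, service, max_attempts=5):
    result = {"status": None}

    def attempt(n):
        if n >= max_attempts:
            result["status"] = "failed"
            return
        if service.handle_request():
            result["status"] = "success"
            return
        backoff = (2 ** n) + rng.uniform(0, 1)
        clock.call_later(backoff, lambda: attempt(n + 1))

    attempt(0)
    clock.run_until_idle()
    return result["status"]


class ScriptedService:
    """Hypothetical test double: replays a fixed list of outcomes, one per request."""

    def __init__(self, outcomes):
        self._outcomes = list(outcomes)
        self.calls = 0

    def handle_request(self) -> bool:
        outcome = self._outcomes[self.calls]
        self.calls += 1
        return outcome


# Case 1: fail three times, then succeed on the fourth request.
clock_ok = VirtualClock()
svc_ok = ScriptedService([False, False, False, True])
status_ok = fetch_with_retry_sim(clock_ok, random.Random(0), svc_ok)

# Case 2: fail every time; the client should stop after exactly 5 requests.
clock_fail = VirtualClock()
svc_fail = ScriptedService([False] * 5)
status_fail = fetch_with_retry_sim(clock_fail, random.Random(0), svc_fail)
```

In the first case the simulated elapsed time must fall between 7 and 10 seconds (1 + 2 + 4 plus jitter); in the second, between 31 and 36 seconds, since this version still waits out the final backoff before reporting failure, just as the naive loop does.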
Now we can build a tiny simulation harness that runs this end-to-end from a single seed:
def run_simulation(seed: int, failure_rate: float = 0.6) -> str:
    rng = random.Random(seed)
    clock = VirtualClock()
    service = FlakyService(rng, failure_rate=failure_rate)
    return fetch_with_retry_sim(clock, rng, service)

# Sweep seeds the way you'd run this in CI
for seed in range(20):
    outcome = run_simulation(seed)
    print(f"seed={seed:<3} -> {outcome}")
Running this prints a clean, fully reproducible table, something like seed=7 -> failed, seed=8 -> success, and so on. If seed 7 fails in a way you didn't expect, you don't need to "try to reproduce it." You call run_simulation(7) again, locally, right now, and get the identical failure. Add a print statement, step through it in a debugger, fix the bug, and keep seed 7 around permanently as a regression test.
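A regression suite built on this can be very small. The sketch below is one possible shape, not the suite from the repository: it repeats the harness so it runs on its own, checks that every seed replays identically, and collects the failing seeds. In a real suite those seeds would be hardcoded from an earlier sweep; here they are computed, to avoid guessing at specific values.

```python
import heapq
import random


class VirtualClock:
    def __init__(self):
        self._now = 0.0
        self._events = []
        self._seq = 0

    def call_later(self, delay: float, callback):
        self._seq += 1
        heapq.heappush(self._events, (self._now + delay, self._seq, callback))

    def run_until_idle(self):
        while self._events:
            fire_time, _, callback = heapq.heappop(self._events)
            self._now = fire_time
            callback()


class FlakyService:
    def __init__(self, rng: random.Random, failure_rate: float = 0.1):
        self.rng = rng
        self.failure_rate = failure_rate

    def handle_request(self) -> bool:
        return self.rng.random() >= self.failure_rate


def fetch_with_retry_sim(clock, rng, service, max_attempts=5):
    result = {"status": None}

    def attempt(n):
        if n >= max_attempts:
            result["status"] = "failed"
            return
        if service.handle_request():
            result["status"] = "success"
            return
        clock.call_later((2 ** n) + rng.uniform(0, 1), lambda: attempt(n + 1))

    attempt(0)
    clock.run_until_idle()
    return result["status"]


def run_simulation(seed: int, failure_rate: float = 0.6) -> str:
    rng = random.Random(seed)
    return fetch_with_retry_sim(VirtualClock(), rng, FlakyService(rng, failure_rate))


def test_every_seed_replays_identically(n_seeds: int = 200):
    for seed in range(n_seeds):
        assert run_simulation(seed) == run_simulation(seed), f"seed {seed} diverged"


# Stand-in for a hardcoded list of seeds that a previous sweep found failing.
failing_seeds = [s for s in range(200) if run_simulation(s) == "failed"]


def test_pinned_seeds_still_reproduce():
    # Before a fix these reproduce the failure; after a fix, flip the expectation.
    for seed in failing_seeds:
        assert run_simulation(seed) == "failed"
```

Both functions follow the pytest naming convention, but they are plain functions and can be called directly.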
This is the entire DST loop in miniature: seed in, deterministic behavior out, failing seeds become permanent tests. Real-world DST setups scale this up with more components, richer fault injection (crashes, disk corruption, partitions), and thousands of seeds run nightly, but the mechanism doesn't change.
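As a small taste of "richer fault injection," the sketch below extends the network idea with message duplication. FaultyNetwork and its duplicate_rate parameter are hypothetical names invented for this example; the point is only that a new fault is one more seeded decision, scheduled on the same virtual clock.

```python
import heapq
import random


class VirtualClock:
    def __init__(self):
        self._now = 0.0
        self._events = []
        self._seq = 0

    def now(self) -> float:
        return self._now

    def call_later(self, delay: float, callback):
        self._seq += 1
        heapq.heappush(self._events, (self._now + delay, self._seq, callback))

    def run_until_idle(self):
        while self._events:
            fire_time, _, callback = heapq.heappop(self._events)
            self._now = fire_time
            callback()


class FaultyNetwork:
    """Hypothetical extension: drops, delays, and occasionally duplicates messages."""

    def __init__(self, clock, rng, drop_rate=0.1, duplicate_rate=0.1, max_latency=0.5):
        self.clock = clock
        self.rng = rng
        self.drop_rate = drop_rate
        self.duplicate_rate = duplicate_rate
        self.max_latency = max_latency

    def send(self, message, deliver_callback):
        if self.rng.random() < self.drop_rate:
            return  # lost in transit
        copies = 2 if self.rng.random() < self.duplicate_rate else 1
        for _ in range(copies):
            # Each copy gets its own latency, so duplicates can arrive far apart.
            latency = self.rng.uniform(0, self.max_latency)
            self.clock.call_later(latency, lambda: deliver_callback(message))


def received_messages(seed: int, **net_kwargs) -> list:
    clock = VirtualClock()
    net = FaultyNetwork(clock, random.Random(seed), **net_kwargs)
    received = []
    for i in range(10):
        net.send(i, received.append)
    clock.run_until_idle()
    return received


always_twice = received_messages(seed=5, drop_rate=0.0, duplicate_rate=1.0)
mixed = received_messages(seed=5, drop_rate=0.2, duplicate_rate=0.2)
```

A receiver that isn't idempotent will misbehave under always_twice, and the seed that exposes it can be replayed like any other.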
These three get mentioned in the same breath a lot, and it's worth being precise about how they differ:
They're complementary, not competing. A mature test suite typically has unit tests for logic, property-based tests for pure functions, fuzzing at input boundaries, and DST for the temporal and distributed-systems behavior that the others can't reach.
A few things that make the difference between DST being a nice idea and DST actually catching bugs:
You don't need to invent this from scratch. A few teams have published in detail about how they built and use DST in production:
All three converge on the same underlying recipe described in this post (virtual time, seeded randomness, and simulated I/O), just at a much larger scale.
Deterministic simulation testing isn't a single tool you install; it's a discipline of refusing to let your code touch real clocks, real randomness, or real I/O directly, and instead routing everything through versions you control from a seed. The payoff is concrete and very QA-relevant: bugs that used to be "couldn't reproduce it" tickets become "here's the exact seed, here's the exact failure, here's the fix" tickets.
Start small: wrap your clock, run one flaky test through a simulation harness, and see what falls out. The retry-client example above is a complete, runnable starting point; the only real engineering work from here is widening it to cover the parts of your system that currently rely on hope instead of seeds. The full code, including a small regression-test suite that pins the failing seeds found above, is available on our GitHub page.