The Green Report | Deterministic Simulation Testing: A Practical Guide for QA Engineers

Jun 27th 2026 12 min read

Flaky tests are one of the most corrosive problems a QA team can face: they erode trust in the entire suite, eat hours in re-runs, and tend to hide real bugs behind a shrug of "just rerun it." Deterministic simulation testing offers a different way to handle the nondeterminism that causes most of this flakiness, by putting time, randomness, and I/O fully under the test's control instead of leaving them to chance. This post explains how DST works, builds a small working example in Python, and covers how to start using it in a QA workflow.

The Bug You Can't Reproduce

You've seen this ticket before. A test fails in CI once every few hundred runs. The stack trace points somewhere into a retry loop. You rerun it locally ten times and it passes every single time. You add some logging, rerun it in CI, and now it passes there too. Three weeks later it fails again, in production, during a deploy, and now it's an incident instead of a ticket.

This is the signature failure mode of systems with concurrency, networking, or timing baked into their behavior, which is most systems worth testing. Traditional tests struggle here because they rely on the same nondeterminism that's causing the bug: real clocks, real thread scheduling, real (or "real-ish") network calls. You can't reproduce the failure because you can't control the conditions that caused it.

Deterministic simulation testing (DST) is a response to exactly this problem. Instead of running your system against real time, real randomness, and real I/O, you run it inside a simulated environment where you control all three, driven entirely by a single seed. Same seed, same execution, same bug, every time. If a run fails, you don't need to "try to reproduce it." You just rerun the seed.

What DST Actually Is

The core idea is simple to state: identify every source of nondeterminism in your system, and replace it with a version you control, all driven by one seed.

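The usual suspects are:

Time. Wall-clock reads, sleeps, and timeouts.
Randomness. Anything drawn from an unseeded or global random number generator.
I/O. Network calls, disk access, and the ordering and delays that come with them.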

Once those three are under your control, your system's behavior becomes a pure function of the seed. That unlocks three things that ordinary tests can't give you:

Exact reproducibility. A failing seed can be replayed forever, on any machine, and will fail the exact same way.
Time compression. A simulated clock can advance instantly between events. A test that "waits" 6 hours for a timeout can run in milliseconds.
Exhaustive, automatable exploration. You can run thousands of seeds overnight, each exercising a different combination of delays, drops, and reorderings, and treat every failing seed as a permanent regression test.

DST isn't a replacement for unit tests. It's a layer above them, aimed specifically at the bugs that only show up from the interaction of timing, faults, and concurrency.

Taming Time: A Virtual Clock

Start with the most common offender: time.sleep() and time.time() scattered through your code. The fix is to never call them directly. Instead, depend on a clock object you can substitute, built up here piece by piece.

A virtual clock needs to track two things: what time it currently thinks it is, and what's scheduled to happen next. The "what's scheduled" part is just a list of pending callbacks ordered by when they should fire, which a heap handles well.

                
import heapq

class VirtualClock:
    """A clock that only advances when something asks it to."""

    def __init__(self):
        self._now = 0.0
        self._events = []  # heap of (fire_time, seq, callback)
        self._seq = 0

    def now(self) -> float:
        return self._now

Notice that now() never advances on its own. Time only moves forward when something explicitly schedules or processes an event, which is the entire point: nothing in the code under test can sneak in real wall-clock behavior through the back door.

Scheduling is handled by call_later, which is what replaces every time.sleep() in the code under test. Instead of blocking for a delay, it records "fire this callback at now + delay" and returns immediately. The _seq counter is just a tie-breaker: if two events land on the exact same simulated time, the heap needs some way to order them, since the callbacks themselves aren't directly comparable.

                
    def call_later(self, delay: float, callback):
        self._seq += 1
        heapq.heappush(self._events, (self._now + delay, self._seq, callback))

The last piece, run_until_idle, is where simulated time actually advances. Rather than waiting in real time, it repeatedly pops the next scheduled event off the heap, jumps _now straight to that event's fire time, and runs its callback. If that callback schedules further events, as a retry loop would, those simply get pushed onto the same heap and picked up in the same loop, until nothing is left to fire.

                
    def run_until_idle(self):
        """Advance time by jumping straight to the next scheduled event."""
        while self._events:
            fire_time, _, callback = heapq.heappop(self._events)
            self._now = fire_time
            callback()

Put together, the full class looks like this, and it's what gets imported in the examples below:

                
import heapq

class VirtualClock:
    """A clock that only advances when something asks it to."""

    def __init__(self):
        self._now = 0.0
        self._events = []  # heap of (fire_time, seq, callback)
        self._seq = 0

    def now(self) -> float:
        return self._now

    def call_later(self, delay: float, callback):
        self._seq += 1
        heapq.heappush(self._events, (self._now + delay, self._seq, callback))

    def run_until_idle(self):
        """Advance time by jumping straight to the next scheduled event."""
        while self._events:
            fire_time, _, callback = heapq.heappop(self._events)
            self._now = fire_time
            callback()

Instead of time.sleep(2), code calls clock.call_later(2, on_done). No wall-clock time passes at all. The clock jumps directly to the next scheduled event. A test involving a 30-second timeout and a 6-hour retry window can now run in a few milliseconds, deterministically, because "time passing" is just popping items off a heap.
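To make the time compression concrete, here is a minimal sketch that schedules a 30-second timeout and a 6-hour retry window on the clock and runs both to completion. The class is repeated so the snippet runs on its own; the two event labels are made up for illustration.

```python
import heapq

class VirtualClock:
    """Same clock as above, repeated so this snippet runs on its own."""

    def __init__(self):
        self._now = 0.0
        self._events = []  # heap of (fire_time, seq, callback)
        self._seq = 0

    def now(self) -> float:
        return self._now

    def call_later(self, delay: float, callback):
        self._seq += 1
        heapq.heappush(self._events, (self._now + delay, self._seq, callback))

    def run_until_idle(self):
        while self._events:
            fire_time, _, callback = heapq.heappop(self._events)
            self._now = fire_time
            callback()

clock = VirtualClock()
log = []  # (simulated_time, label) pairs, in firing order

# Hypothetical events, scheduled out of order on purpose.
clock.call_later(6 * 3600, lambda: log.append((clock.now(), "retry window closed")))
clock.call_later(30, lambda: log.append((clock.now(), "request timed out")))

clock.run_until_idle()  # returns almost immediately in real time
for fired_at, label in log:
    print(f"t={fired_at:>8.1f}s  {label}")
```

The timeout fires first even though it was scheduled second, because the heap orders events by simulated fire time rather than by insertion order.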

Taming Randomness: A Seeded RNG, Threaded Through Everything

The fix here is more about discipline than cleverness: never call the global random module directly. Instead, pass a random.Random(seed) instance into anything that needs randomness, and thread it through your whole call graph.

                
import random

class FlakyService:
    def __init__(self, rng: random.Random, failure_rate: float = 0.1):
        self.rng = rng
        self.failure_rate = failure_rate

    def handle_request(self) -> bool:
        return self.rng.random() >= self.failure_rate

It looks trivial, but this one change is what makes everything downstream reproducible. If FlakyService reached into the global random module, two runs with "the same seed" could still diverge the moment something else in the process also drew from that global generator. Isolating the RNG per-component (or passing one shared, seeded instance everywhere) is what keeps the whole simulation's behavior pinned to a single seed.
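A quick way to check this property is to run the service twice from the same seed, with an unrelated draw from the global generator in between. This is a small sketch; the `outcomes` helper and the 0.5 failure rate are assumptions chosen for the demonstration.

```python
import random

class FlakyService:
    """Same service as above, repeated so this snippet runs on its own."""

    def __init__(self, rng: random.Random, failure_rate: float = 0.1):
        self.rng = rng
        self.failure_rate = failure_rate

    def handle_request(self) -> bool:
        return self.rng.random() >= self.failure_rate

def outcomes(seed: int, n: int = 10) -> list:
    """Hypothetical helper: record n request outcomes for one seed."""
    service = FlakyService(random.Random(seed), failure_rate=0.5)
    return [service.handle_request() for _ in range(n)]

first = outcomes(seed=42)
random.random()  # unrelated code drawing from the global generator
second = outcomes(seed=42)

print(first == second)  # the injected RNG is unaffected by the global draw
```

Because each run owns its own `random.Random` instance, nothing outside the simulation can shift the sequence it sees.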

Taming I/O: A Simulated Network

This is the piece that does the most work in real-world DST setups. Instead of your client and server talking over real sockets, they talk through an in-process "network" that you control, one that can delay, drop, duplicate, or reorder messages, all driven by the seeded RNG and the virtual clock.

                
class SimulatedNetwork:
    def __init__(self, clock: VirtualClock, rng: random.Random,
                 drop_rate: float = 0.0, max_latency: float = 0.5):
        self.clock = clock
        self.rng = rng
        self.drop_rate = drop_rate
        self.max_latency = max_latency

    def send(self, message, deliver_callback):
        if self.rng.random() < self.drop_rate:
            return  # message vanishes, as if lost in transit
        latency = self.rng.uniform(0, self.max_latency)
        self.clock.call_later(latency, lambda: deliver_callback(message))

Notice that nothing here is "random" in the uncontrolled sense: drop_rate and max_latency are knobs you choose, and the actual decisions (drop or not, how much latency) come from the seeded RNG. Run the same seed twice and you get the exact same drops, at the exact same simulated moments.
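The sketch below sends a handful of messages through the network and records when each one arrives, then repeats the run with the same seed. The `delivery_trace` helper, the message names, and the 0.3 drop rate are assumptions for illustration; the two classes are repeated so the snippet is self-contained.

```python
import heapq
import random

class VirtualClock:
    def __init__(self):
        self._now = 0.0
        self._events = []  # heap of (fire_time, seq, callback)
        self._seq = 0

    def now(self) -> float:
        return self._now

    def call_later(self, delay: float, callback):
        self._seq += 1
        heapq.heappush(self._events, (self._now + delay, self._seq, callback))

    def run_until_idle(self):
        while self._events:
            fire_time, _, callback = heapq.heappop(self._events)
            self._now = fire_time
            callback()

class SimulatedNetwork:
    def __init__(self, clock, rng, drop_rate=0.0, max_latency=0.5):
        self.clock = clock
        self.rng = rng
        self.drop_rate = drop_rate
        self.max_latency = max_latency

    def send(self, message, deliver_callback):
        if self.rng.random() < self.drop_rate:
            return  # message vanishes, as if lost in transit
        latency = self.rng.uniform(0, self.max_latency)
        self.clock.call_later(latency, lambda: deliver_callback(message))

def delivery_trace(seed: int, n_messages: int = 5) -> list:
    """Hypothetical helper: send n messages, return (arrival_time, message) pairs."""
    clock = VirtualClock()
    net = SimulatedNetwork(clock, random.Random(seed), drop_rate=0.3)
    trace = []
    for i in range(n_messages):
        net.send(f"msg-{i}", lambda m: trace.append((clock.now(), m)))
    clock.run_until_idle()
    return trace

trace = delivery_trace(seed=1)
print(trace)
print(trace == delivery_trace(seed=1))  # same seed, same drops and arrival times
```

Since every message gets its own random latency, messages can arrive in a different order than they were sent, which is exactly the kind of reordering a real network produces.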

Worked Example: A Flaky Retry Client

Let's put this together on something every QA engineer has tested before: a client that retries a request with jittered backoff.

The naive version looks reasonable, but is genuinely hard to test:

                
import time
import random

def fetch_with_retry(service, max_attempts=5):
    for attempt in range(max_attempts):
        if service.handle_request():
            return "success"
        backoff = (2 ** attempt) + random.uniform(0, 1)
        time.sleep(backoff)   # real wall-clock sleep
    return "failed"

To test this properly you'd want to know: does it actually give up after 5 attempts? Does the backoff growth match what you expect? Does it behave correctly when the service fails on attempts 1, 3, and 4 specifically? With real time.sleep and global random, you can't pin any of that down. You're either mocking so much of the function that you're not really testing it, or you're waiting through real backoff delays and hoping the random failures line up the way you want.

The deterministic version swaps in the virtual clock and seeded RNG we built above (the simulated network is left out here to keep the example small):

                
def fetch_with_retry_sim(clock, rng, service, max_attempts=5):
    result = {"status": None}

    def attempt(n):
        if n >= max_attempts:
            result["status"] = "failed"
            return
        if service.handle_request():
            result["status"] = "success"
            return
        backoff = (2 ** n) + rng.uniform(0, 1)
        clock.call_later(backoff, lambda: attempt(n + 1))

    attempt(0)
    clock.run_until_idle()
    return result["status"]

Now we can build a tiny simulation harness that runs this end-to-end from a single seed:

                
def run_simulation(seed: int, failure_rate: float = 0.6) -> str:
    rng = random.Random(seed)
    clock = VirtualClock()
    service = FlakyService(rng, failure_rate=failure_rate)
    return fetch_with_retry_sim(clock, rng, service)

# Sweep seeds the way you'd run this in CI
for seed in range(20):
    outcome = run_simulation(seed)
    print(f"seed={seed:<3} -> {outcome}")

Running this prints a clean, fully reproducible table, something like seed=7 -> failed, seed=8 -> success, and so on. If seed 7 fails in a way you didn't expect, you don't need to "try to reproduce it." You call run_simulation(7) again, locally, right now, and get the identical failure. Add a print statement, step through it in a debugger, fix the bug, and keep seed 7 around permanently as a regression test.

This is the entire DST loop in miniature: seed in, deterministic behavior out, failing seeds become permanent tests. Real-world DST setups scale this up with more components, richer fault injection (crashes, disk corruption, partitions), and thousands of seeds run nightly, but the mechanism doesn't change.
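The questions raised earlier (does the client really give up after 5 attempts, and does the backoff grow as expected?) can now be answered with plain assertions. The sketch below uses a `ScriptedService`, a hypothetical test double that is not part of the original example, to force an exact failure pattern instead of relying on the RNG for outcomes.

```python
import heapq
import random

class VirtualClock:
    def __init__(self):
        self._now = 0.0
        self._events = []  # heap of (fire_time, seq, callback)
        self._seq = 0

    def now(self) -> float:
        return self._now

    def call_later(self, delay: float, callback):
        self._seq += 1
        heapq.heappush(self._events, (self._now + delay, self._seq, callback))

    def run_until_idle(self):
        while self._events:
            fire_time, _, callback = heapq.heappop(self._events)
            self._now = fire_time
            callback()

class ScriptedService:
    """Hypothetical test double: replays a fixed list of outcomes."""

    def __init__(self, outcomes):
        self._outcomes = list(outcomes)
        self.calls = 0

    def handle_request(self) -> bool:
        outcome = self._outcomes[self.calls]
        self.calls += 1
        return outcome

def fetch_with_retry_sim(clock, rng, service, max_attempts=5):
    result = {"status": None}

    def attempt(n):
        if n >= max_attempts:
            result["status"] = "failed"
            return
        if service.handle_request():
            result["status"] = "success"
            return
        backoff = (2 ** n) + rng.uniform(0, 1)
        clock.call_later(backoff, lambda: attempt(n + 1))

    attempt(0)
    clock.run_until_idle()
    return result["status"]

# Scenario 1: every attempt fails, so the client should give up after 5 calls.
clock = VirtualClock()
always_down = ScriptedService([False] * 5)
status = fetch_with_retry_sim(clock, random.Random(0), always_down)
assert status == "failed" and always_down.calls == 5
# Base backoff is 1 + 2 + 4 + 8 + 16 = 31s, plus up to 1s of jitter per wait.
assert 31 <= clock.now() <= 36

# Scenario 2: two failures, then success on the third attempt.
clock2 = VirtualClock()
recovers = ScriptedService([False, False, True])
status2 = fetch_with_retry_sim(clock2, random.Random(0), recovers)
assert status2 == "success" and recovers.calls == 3
assert 3 <= clock2.now() <= 5  # waits of (1 + jitter) and (2 + jitter)
print("all retry checks passed")
```

Both scenarios cover more than 30 seconds of simulated backoff in total and still finish instantly, because the waits only exist on the virtual clock.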

DST vs. Property-Based Testing and Fuzzing

These three get mentioned in the same breath a lot, and it's worth being precise about how they differ:

Property-based testing (e.g. Hypothesis) generates randomized inputs to a function and checks that some property holds across all of them. It's great for pure functions and parsers, but it doesn't model time or concurrency.
Fuzzing throws malformed or random inputs at a system, usually looking for crashes or memory-safety issues, often at the boundary of a parser or API.
DST generates randomized conditions (delays, drops, reorderings, crashes) over the course of an extended, simulated run, looking for bugs that only emerge from the interaction of timing and faults over time.

They're complementary, not competing. A mature test suite typically has unit tests for logic, property-based tests for pure functions, fuzzing at input boundaries, and DST for the temporal and distributed-systems behavior that the others can't reach.

Practical Adoption Tips for QA Teams

A few things that make the difference between DST being a nice idea and DST actually catching bugs:

Start with one source of nondeterminism, not three. Injecting a virtual clock alone is often enough to make a whole class of timeout/retry bugs testable. Add the RNG and fake network once that's paying off.
Always log the seed on failure. A DST failure with no seed attached is just a flaky test again. The seed is the bug report.
Run a few seeds often and many seeds rarely. A handful of seeds on every pull request, thousands of seeds in a nightly job. This mirrors how teams running this in production actually structure it.
Treat every failing seed as a permanent regression test. Once you've found and fixed the bug behind seed 4821, keep run_simulation(4821) in the suite forever.
Model realistic faults, not just delay. A network that only adds latency will miss a lot. Drops, duplication, reordering, and partial writes are where the interesting bugs live.
Don't let simulated time replace real-world load testing. DST is excellent at correctness under faults; it doesn't replace performance testing against real infrastructure.
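Several of these tips (log the seed, pin failing seeds, split small and large sweeps) fit into one small harness. The following is a sketch under assumptions: the invariants in `check_seed`, the `PINNED_SEEDS` values, and the sweep sizes are illustrative choices, and the retry client is inlined into `run_simulation` so the snippet runs on its own.

```python
import heapq
import random

class VirtualClock:
    """Compact version of the clock from this post."""

    def __init__(self):
        self._now, self._events, self._seq = 0.0, [], 0

    def now(self):
        return self._now

    def call_later(self, delay, callback):
        self._seq += 1
        heapq.heappush(self._events, (self._now + delay, self._seq, callback))

    def run_until_idle(self):
        while self._events:
            self._now, _, callback = heapq.heappop(self._events)
            callback()

def run_simulation(seed, failure_rate=0.6, max_attempts=5):
    """Retry client from the post; returns (status, simulated_elapsed_seconds)."""
    rng = random.Random(seed)
    clock = VirtualClock()
    result = {"status": None}

    def attempt(n):
        if n >= max_attempts:
            result["status"] = "failed"
            return
        if rng.random() >= failure_rate:  # inlined FlakyService.handle_request
            result["status"] = "success"
            return
        clock.call_later((2 ** n) + rng.uniform(0, 1), lambda: attempt(n + 1))

    attempt(0)
    clock.run_until_idle()
    return result["status"], clock.now()

def check_seed(seed):
    """Return a list of invariant violations for one seed (empty means OK)."""
    status, elapsed = run_simulation(seed)
    problems = []
    if status not in ("success", "failed"):
        problems.append(f"unexpected status {status!r}")
    if status == "failed" and elapsed < 31:  # 1 + 2 + 4 + 8 + 16s of base backoff
        problems.append(f"gave up too early at t={elapsed:.2f}s")
    return problems

def sweep(seeds):
    """Check every seed and report failures with the seed attached."""
    failures = {}
    for seed in seeds:
        problems = check_seed(seed)
        if problems:
            print(f"FAIL seed={seed}: {problems}")  # the seed is the bug report
            failures[seed] = problems
    return failures

PINNED_SEEDS = [7, 4821]  # hypothetical seeds kept after past fixes
pr_failures = sweep(PINNED_SEEDS + list(range(10)))  # small sweep, every pull request
nightly_failures = sweep(range(2000))                # large sweep, nightly job
print(f"PR failures: {len(pr_failures)}, nightly failures: {len(nightly_failures)}")
```

In a real suite the two sweeps would live in separate CI jobs, and any seed reported by the nightly run would be appended to the pinned list once the underlying bug is fixed.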

Real-World Examples Worth Studying

You don't need to invent this from scratch. A few teams have published in detail about how they built and use DST in production:

FoundationDB is the project most credited with popularizing this approach. Its team built a single-threaded simulator capable of running an entire simulated cluster, including simulated disks, networks, and machine crashes, deterministically, and estimates that it has run the equivalent of roughly a trillion CPU-hours of simulated testing on the database over the years.
TigerBeetle, a financial transactions database, built a simulator nicknamed the VOPR (Viewstamped Operation Replicator), which runs an entire cluster on a single thread while injecting network, disk, and process faults, and can compress months of simulated operation into minutes of wall-clock time. Any failure it finds can be replayed exactly using the seed and the commit hash.
Antithesis turned this idea into a commercial product: a deterministic hypervisor that lets teams run their existing, unmodified systems inside a fully controlled, faulty, and reproducible environment without rewriting the system's I/O layer themselves.

All three converge on the same underlying recipe described in this post (virtual time, seeded randomness, and simulated I/O), just at a much larger scale.

Conclusion

Deterministic simulation testing isn't a single tool you install; it's a discipline of refusing to let your code touch real clocks, real randomness, or real I/O directly, and instead routing everything through versions you control from a seed. The payoff is concrete and very QA-relevant: bugs that used to be "couldn't reproduce it" tickets become "here's the exact seed, here's the exact failure, here's the fix" tickets.

Start small: wrap your clock, run one flaky test through a simulation harness, and see what falls out. The retry-client example above is a complete, runnable starting point; the only real engineering work from here is widening it to cover the parts of your system that currently rely on hope instead of seeds. The full code, including a small regression-test suite that pins the failing seeds found above, is available on our GitHub page.