
Chaos Engineering for ML: What Happens When Your Model Lies?

Mar 1, 2026 · 5 min read

Tags: medium, qa, ai/ml, strategy

If you've worked in QA long enough, you've probably run chaos experiments: killing a service, saturating a network, corrupting a message queue, and watching how your system responds. The goal isn't to break things for fun. It's to find out whether your system's failure modes are the ones you designed for.

Now machine learning is showing up in more of the systems you test. And it brings a new kind of failure mode: one that doesn't crash, doesn't throw an exception, and doesn't trip any of your existing monitors. The model just quietly returns the wrong answer, often with high confidence.

This is the ML equivalent of a lying dependency. And most QA teams aren't testing for it yet.

What "Model Fault Injection" Actually Means

In traditional chaos engineering, you inject faults at the infrastructure layer: latency, packet loss, node failures. With ML systems, you need to add a new layer: the model output itself.

Model fault injection means deliberately corrupting or degrading what the model returns, then observing how the rest of the system behaves. Concretely, this looks like:

- Returning a plausible but wrong prediction in place of the real one
- Pairing a wrong prediction with an artificially high confidence score
- Degrading prediction quality gradually across a sequence of requests

The point is to treat the model as an untrusted dependency, because in production, that's exactly what it is.
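A minimal sketch of what such an injection layer might look like in Python. The `Prediction` shape, the `predict` method, and the labels here are assumptions standing in for whatever interface your real model client exposes:

```python
import random
from dataclasses import dataclass

@dataclass
class Prediction:
    label: str
    confidence: float

class FaultInjectingModel:
    """Wraps a model client and corrupts a fraction of its answers.

    Meant as a drop-in replacement for the real client in a test
    environment: downstream code can't tell it apart from the real thing.
    """
    def __init__(self, model, wrong_label, fault_rate=1.0, seed=0):
        self.model = model
        self.wrong_label = wrong_label
        self.fault_rate = fault_rate
        self.rng = random.Random(seed)

    def predict(self, features):
        pred = self.model.predict(features)
        if self.rng.random() < self.fault_rate:
            # Swap in a plausible wrong label but keep the original
            # confidence score, so the failure looks normal downstream.
            return Prediction(label=self.wrong_label, confidence=pred.confidence)
        return pred
```

Because the wrapper preserves the client's interface, you can slot it in wherever your tests already construct the model dependency and run the same end-to-end scenarios against a model that lies.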

Why This Is Different From Model Evaluation

Your data science team already evaluates the model. They have accuracy metrics, confusion matrices, benchmark datasets. So why do you need to do this too?

Because model evaluation answers a different question. It asks: how well does the model perform on representative data? Fault injection asks: what happens to my system when the model gets it wrong?

These are not the same question. A model can have 95% accuracy and still cause catastrophic system behavior on the 5% it gets wrong, if that 5% triggers a dangerous downstream action, creates a feedback loop, or fails silently in a way that accumulates over time.

Your job as a QA engineer is the second question. The data science team owns the first one.

A Simple Framework to Get Started

You don't need to rebuild your test infrastructure from scratch. Start with three questions:

1. What does the system do when the model is wrong? Pick your model's most common failure mode, say a misclassification, and simulate it. Mock the model client to return that wrong answer. Does the system handle it, degrade gracefully, or fail silently?

2. What does the system do when the model is confidently wrong? This is the harder case. A low-confidence wrong answer might trigger a fallback or a human review. A high-confidence wrong answer often doesn't. Inject a wrong prediction paired with a high confidence score and see whether any of your safety nets fire.
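To make the second question concrete, here is a hedged sketch in Python. `ClaimRouter`, its thresholds, and the claim fields are all hypothetical stand-ins for your system under test; the point is the shape of the experiment: mock the model, inject a confident wrong answer, and assert that a safety net independent of confidence still fires:

```python
from dataclasses import dataclass
from unittest.mock import Mock

@dataclass
class Prediction:
    label: str
    confidence: float

# Hypothetical system under test: routes claims using the model's output,
# with a rule-based safety net that does not trust confidence alone.
class ClaimRouter:
    def __init__(self, model, review_threshold=0.7, max_fraud_indicators=3):
        self.model = model
        self.review_threshold = review_threshold
        self.max_fraud_indicators = max_fraud_indicators

    def route(self, claim):
        pred = self.model.predict(claim)
        if pred.confidence < self.review_threshold:
            return "human_review"
        # The safety net under test: escalate on independent signals
        # even when the model is confident.
        if claim.get("fraud_indicators", 0) > self.max_fraud_indicators:
            return "human_review"
        return "auto_approve" if pred.label == "claim_valid" else "auto_deny"

# The experiment: inject a wrong answer paired with high confidence.
model = Mock()
model.predict.return_value = Prediction(label="claim_valid", confidence=0.98)
router = ClaimRouter(model=model)
result = router.route({"id": 123, "fraud_indicators": 5})
assert result == "human_review", "confidently wrong answer bypassed the safety net"
```

If your real system has no check equivalent to the `fraud_indicators` rule, this experiment is exactly how you find out: the assertion fails, and the confident wrong answer sails straight through to an automated action.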

3. What does the system do when the model degrades over time? This simulates drift. Write a test that progressively worsens model accuracy across a sequence of requests, starting at 90% correct and dropping to 60%, and measure when (or whether) your system's monitoring catches it.
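One way to sketch the drift experiment in Python. The linear accuracy schedule and the rolling-window monitor below are illustrative assumptions, not a real monitoring stack; the experiment is simply to feed a degrading stream of outcomes through whatever monitoring you actually have and record when it first alerts:

```python
import random

def degrading_model(n_requests, start_acc=0.9, end_acc=0.6, seed=42):
    """Yield (request_index, was_correct) with accuracy falling linearly."""
    rng = random.Random(seed)
    for i in range(n_requests):
        acc = start_acc + (end_acc - start_acc) * i / max(n_requests - 1, 1)
        yield i, rng.random() < acc

class RollingAccuracyMonitor:
    """Toy stand-in for production monitoring: alerts when the rolling
    accuracy over the last `window` requests dips below a threshold."""
    def __init__(self, window=50, alert_below=0.75):
        self.window = window
        self.alert_below = alert_below
        self.outcomes = []

    def record(self, correct):
        self.outcomes.append(correct)
        recent = self.outcomes[-self.window:]
        rate = sum(recent) / len(recent)
        return len(recent) == self.window and rate < self.alert_below

alert_at = None
monitor = RollingAccuracyMonitor()
for i, correct in degrading_model(500):
    if monitor.record(correct) and alert_at is None:
        alert_at = i
print(f"monitoring first alerted at request {alert_at}")
```

The useful output is the gap between when degradation starts and when the alert fires; if `alert_at` comes back as `None`, your monitoring never noticed a 30-point accuracy drop, and that's the finding.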

What You're Looking For

When you run these experiments, you're not measuring the model. You're measuring the system's resilience to model failure. The things worth looking for:

- Wrong outputs that flow straight into downstream actions with no validation
- Fallbacks and review queues gated only on confidence, so a confident wrong answer skips them
- Monitoring that tracks uptime and latency but never output quality, so drift goes unnoticed
- Feedback loops where a wrong output influences future inputs, or errors that accumulate silently over time

Any of these is a finding worth filing.

The Mindset Shift

The hardest part of this isn't technical; it's conceptual. Traditional software has a contract: given input X, function F returns output Y, deterministically, every time. You can write assertions against that.

ML models don't have that contract. They return probably Y, with some confidence, most of the time. Your test suite needs to account for that uncertainty, not by accepting it, but by actively probing what the system does when the model falls short of probably.

Chaos engineering gave us a vocabulary for asking "what if this dependency fails?" Model fault injection extends that vocabulary to ask "what if this dependency lies?"

Start asking that question before your users find the answer for you.