The Green Report | Test-Retest Reliability: Is Your Rubric Consistent Run-to-Run?

May 24th 2026 4 min read

QA engineers spend a lot of energy hunting down flaky tests in their code pipelines, but almost no one checks whether their LLM rubric gives the same verdict twice on the same input. This is a problem. LLM judges are non-deterministic by nature, which means even a carefully written rubric can silently flip its verdict between runs with no code change, no prompt change, and no obvious reason why. If your rubric is inconsistent with itself, every eval score becomes suspect and you lose the ability to tell a real regression from noise.

What Test-Retest Reliability Means (and Why It Matters Here)

Test-retest reliability is a concept borrowed from psychometrics: the same measurement instrument, applied to the same input, should produce the same result. For a unit test, this is a given. For an LLM rubric, it is not.

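LLM judges are probabilistic. Even with temperature set to 0, hosted models are not guaranteed to return identical outputs: serving-side effects such as batching and floating-point rounding can change which token wins a near-tie, and an unpinned model alias can be updated without notice. A rubric whose wording leaves the verdict close to a decision boundary amplifies those small differences. A rubric that scores an output as "passing" on Monday might score the exact same output as "failing" on Wednesday, not because anything in your code or prompts changed, but because the judge is not a deterministic function.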

Most QA teams never surface this because they run their eval suite once, read the results, and move on. But if your rubric is the measurement tool, and the tool itself is unreliable, then everything built on top of it inherits that unreliability. You cannot confidently say a model regressed if you cannot first confirm that your rubric would have passed it consistently in the first place.

How to Measure It with promptfoo

The good news is that measuring rubric consistency does not require any special tooling beyond what you already have. The approach is straightforward: run the same test suite multiple times against a fixed input and compare results across runs.

Start by picking a rubric you already use in production and running it 3 to 5 times on the same promptfooconfig.yaml without changing anything. Because the rubric's input is the model output being graded, hold that output fixed as well; otherwise you are measuring the model's variance along with the judge's. Then track two things across runs: the pass/fail verdict per test case, and the score if you are using a numeric rubric. What you are looking for is how often a test case flips its verdict between runs.
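A repeated run of this kind could be set up roughly as follows. This is a minimal sketch, not a recommended layout: it assumes promptfoo's built-in echo provider (which returns the prompt text as the output) to keep the graded output identical across runs. The variable name candidate_output, the sample reply, and the rubric wording are made up for illustration.

```yaml
# promptfooconfig.yaml (illustrative sketch)
description: Rubric test-retest check

# The "prompt" is just the stored output we want graded.
prompts:
  - "{{candidate_output}}"

# echo returns the prompt unchanged, so only the judge can vary between runs.
providers:
  - echo

tests:
  - vars:
      # Hypothetical frozen output captured from an earlier run of the real system.
      candidate_output: "Thanks for reaching out. Your refund was issued today and should arrive within 5 business days."
    assert:
      - type: llm-rubric
        value: "The response states how many business days the refund will take"
```

Running this with promptfoo eval --repeat 5 --no-cache -o results.json should grade the same output five times and write the per-run verdicts to a file. The --no-cache flag matters here because promptfoo caches provider responses by default, and a cached judge response would make every run look identical.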

A simple metric to track is the flip rate: the percentage of test cases whose verdict is not the same on every run. For example, if 4 of 50 test cases change verdict at least once across five runs, the flip rate is 8%. As a rule of thumb, a rubric where more than 10% of cases flip is worth treating as flaky. Here is a minimal example of an assertion that is prone to this kind of variance:

                
assert:
  - type: llm-rubric
    value: "The response sounds professional and helpful to the user"

That single criterion bundles two subjective dimensions with no behavioral anchors, giving the judge too much interpretive room. It is unlikely to score consistently. Once you have your flip rate baseline, you have something concrete to improve against, and a threshold you can eventually enforce in CI the same way you would a code coverage floor.

Common Causes of Rubric Flakiness

Once you measure your flip rate, the next question is why it is happening. There are a few patterns that come up repeatedly.

Vague criteria are the most common culprit. Words like "professional", "clear", or "helpful" seem descriptive, but they give the judge no behavioral anchor to work from. Two runs of the same judge can reasonably disagree on whether something sounds "helpful" because the word itself is underspecified. The rubric is not wrong; it is just not precise enough to constrain the judge's interpretation.

Wide scoring scales compound the problem. A 1 to 10 range gives the judge too much room to land in different spots on repeated runs even when its overall assessment is roughly the same. A score of 6 one run and 7 the next can push a test case across your pass threshold without anything actually changing.

Criteria bundling is another common issue. When a single rubric string asks the judge to evaluate multiple dimensions at once, the dimensions interfere with each other. A strong result on tone can bleed into a weaker result on accuracy and produce an overall score that neither criterion would have earned alone.

Judge model choice matters more than most teams realize. Smaller or cheaper models tend to be more variance-prone, which raises your flip rate directly. Using the same model family as the system under test introduces a different problem, self-preference bias, where the judge subtly favors outputs that resemble its own style. That skews verdicts in a consistent direction rather than making them flip, so it will not show up in a flip rate at all. Both problems are hard to diagnose without knowing to look for them.

How to Fix It

Most rubric flakiness is fixable with a few targeted changes, and you do not need to redesign your entire eval suite to see improvement.

Tighten your criterion wording. Replace vague adjectives with observable behaviors. Instead of "the response sounds friendly", write "the response uses first-person address, avoids passive voice, and contains no technical jargon." The judge now has something concrete to check rather than something subjective to interpret, and two runs of the same judge are far more likely to agree.

Narrow your scoring scale. Use binary or 3-point scales by default. A 3-point scale with clearly defined levels ("does not meet", "partially meets", "fully meets") is typically more stable than a 1 to 10 range, because there are fewer adjacent scores for the judge to drift between. Only go wider if you have a specific reason and have validated that the wider scale does not increase your flip rate.
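A 3-point scale can be spelled out directly in the rubric text. The sketch below assumes the judge reports a score between 0 and 1 and that the llm-rubric threshold property sets the minimum passing score. The level definitions and the refund scenario are illustrative, not taken from a real suite.

```yaml
assert:
  - type: llm-rubric
    # Only "fully meets" passes; lower this to 0.5 to accept partial answers.
    threshold: 1
    value: |
      Grade whether the response tells the user when the refund will arrive.
      Use exactly one of these three scores:
        0   = does not meet: no timeline is mentioned
        0.5 = partially meets: a timeline is implied but no number of days is given
        1   = fully meets: a specific number of days is stated
```

Naming each level in observable terms gives the judge a small, fixed set of outcomes to choose from.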

Split bundled criteria into separate assertions. Instead of one rubric string that covers tone, accuracy, and relevance, write three separate llm-rubric assertions in your promptfoo config. Each call evaluates one thing, which prevents criteria from bleeding into each other and makes it easier to pinpoint which dimension is actually failing.
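Split into separate assertions, the tone, accuracy, and relevance checks might look like the following sketch. The criterion wording and the metric labels are hypothetical. The metric field is used here on the assumption that it lets promptfoo report each dimension under its own name.

```yaml
assert:
  # One judge call per dimension, so a failure points at a single criterion.
  - type: llm-rubric
    metric: tone
    value: "The response addresses the user directly as 'you' and contains no sarcasm"
  - type: llm-rubric
    metric: accuracy
    value: "Every refund amount and date in the response matches the order details in the prompt"
  - type: llm-rubric
    metric: relevance
    value: "The response answers the refund question the user asked"
```

The trade-off is more judge calls per test case, in exchange for verdicts that can be tracked and debugged one dimension at a time.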

Add few-shot examples to your rubric prompt. Using the rubricPrompt property in promptfoo, you can provide the judge with a concrete passing example and a concrete failing example. This anchors the judge's interpretation and tends to reduce run-to-run variance on borderline cases.
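A few-shot rubric prompt might be sketched like this. It assumes rubricPrompt is set under defaultTest.options and accepts a plain-text template, that promptfoo fills in the template variables {{ output }} and {{ rubric }}, and that the judge is expected to reply with a JSON object containing pass, score, and reason. Check these against the promptfoo version in use. Both example responses are invented.

```yaml
defaultTest:
  options:
    rubricPrompt: |
      You are grading a response against a rubric.
      Reply only with JSON of the form
      {"pass": true or false, "score": number from 0 to 1, "reason": "short explanation"}.

      Example rubric: The response states how many business days the refund will take.

      Example of a passing response:
      "Your refund was issued today and should arrive within 5 business days."

      Example of a failing response:
      "Your refund is on its way and should arrive soon."

      Now grade the following.
      Rubric: {{ rubric }}
      Response: {{ output }}
```

The failing example is deliberately close to the passing one, since the borderline cases are where an anchor helps most.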

Pin your judge model. Treat the judge model string in your promptfoo config the same way you treat a dependency version. A silent upgrade to the judge model can shift your entire eval baseline overnight, making it impossible to distinguish a model regression from a judge regression.
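Pinning could look like the sketch below. It assumes the judge is set through defaultTest.options.provider, and it uses a dated OpenAI snapshot ID purely as an example; substitute whichever provider and snapshot your team actually uses.

```yaml
defaultTest:
  options:
    provider:
      # Dated snapshot instead of a floating alias such as gpt-4o,
      # so the judge only changes when this line changes in review.
      id: openai:chat:gpt-4o-2024-08-06
      config:
        temperature: 0
```

With the ID in version control, a judge upgrade becomes a reviewed diff, and you can re-measure your baseline when it lands.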

Making Reliability a Gate, Not an Afterthought

Measuring flip rate once is useful. Catching it automatically before it reaches production is better.

The simplest approach is a rubric health check that runs as part of your CI pipeline. Take a fixed golden set of inputs, ones where you know what the correct verdict should be, and run each rubric against them 3 times per build. If the flip rate across those runs exceeds your threshold, fail the pipeline. This does not replace your main eval suite. It runs alongside it as a check on the measurement tool itself rather than the model being measured.
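As one possible shape for such a gate, here is a sketch of a CI job. Everything specific in it is an assumption: GitHub Actions as the CI system, the config file name rubric-health.yaml, the secret name, and above all scripts/check-flip-rate.js, a hypothetical script you would write yourself to read the results file and compare the flip rate to a threshold.

```yaml
# .github/workflows/rubric-health.yml (illustrative sketch)
name: rubric-health-check
on: [pull_request]

jobs:
  rubric-health:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Grade the golden set 3 times with caching off so each run is a fresh judge call.
      - name: Run rubrics against the golden set
        run: npx promptfoo eval -c rubric-health.yaml --repeat 3 --no-cache -o results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # Hypothetical script: reads results.json, computes the flip rate,
      # and exits non-zero when it is above the given threshold (10% here).
      - name: Enforce flip-rate threshold
        run: node scripts/check-flip-rate.js results.json 0.10
```

Note that promptfoo eval typically exits non-zero when a test fails, so with this layout a failing golden case stops the job before the flip-rate step runs. Whether that is the behavior you want depends on whether you treat a wrong verdict and an unstable verdict as the same kind of failure.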

This pattern will feel familiar to QA engineers because it mirrors practices already common in other testing disciplines. Mutation testing checks whether your tests can detect real changes. Property-based testing checks whether your assertions hold across a range of inputs. A rubric health check does the same thing for your eval infrastructure: it asks whether your grading instrument is stable enough to be trusted.

The goal is to get to a place where rubric reliability is tracked over time alongside model performance metrics. When a flip rate spikes, that is a signal worth investigating before you draw any conclusions from that eval run. A rubric that is quietly becoming less stable is just as much of a regression as a model that is quietly becoming less accurate.

Treat Your Rubric Like Production Code

Rubrics are not just prompts. They are measurement instruments, and like any instrument they can have bugs, regressions, and flakiness that silently corrupt everything downstream. The difference is that a flaky unit test usually announces itself with an intermittent red build. A flaky rubric can go undetected for weeks while your eval results quietly mislead you.

The fix is not complicated. Pick one rubric you are using today, run it five times on the same input, and check whether it agrees with itself. If it does not, you now know where to start. Tighten the criteria, narrow the scale, split the assertions, and add a few-shot anchor. Then make consistency a gate in CI so it stays fixed.

The teams getting the most value out of LLM evaluation are the ones who apply the same rigor to their eval infrastructure that they apply to their product code. Your rubric deserves a test suite too.