The Green Report | Position Bias: The Silent Test Order Dependency

Position Bias: The Silent Test Order Dependency

Jul 5th 2026 8 min read

Picture a typical LLM-as-judge setup: one model generates a response, and a second, different model grades it. That separation is a common safeguard against the bias that creeps in when a model evaluates its own output. It sounds solid on paper. Run a promptfoo suite comparing two model outputs, and the judge returns a clear verdict. Then, almost as an afterthought, swap the order in which the two outputs appear in the judge prompt. Same outputs. Same judge. Same everything, except which one came first. The verdict flips.

Why QA Engineers Should Care

QA engineers already have a name for this kind of problem; they just haven't applied it here yet. Test order dependency is a classic smell: a suite that passes when tests run in one sequence and fails in another, usually because of shared state, leftover side effects, or an assumption that got baked in without anyone noticing. The fix is almost never "run the tests in the right order." It's "find out why order matters at all, then remove the dependency."

An LLM judge prompt deserves the exact same suspicion. If a grading prompt hands two outputs to a model and asks which one is better, the order those outputs appear in is an input, whether anyone intended it to be or not. Nobody labels "output A goes first" as a variable in the test plan, but it is one, and it's currently untested. A judge that flips its verdict based on position isn't a subtle edge case. It's the eval equivalent of a test suite that only passes on Tuesdays.

The uncomfortable part is that this bias hides well precisely because most people never think to look for it. Standard test hygiene (run it once, get a result, move on) will never catch an order dependency, because the bug only reveals itself when something changes that nobody thought was a variable in the first place.

What's Actually Happening

The mechanism itself isn't mysterious. When a language model is asked to compare two things in a single prompt, it doesn't evaluate them in true isolation the way a human might with two documents side by side on a desk. It reads through the prompt sequentially, so where an option sits in that sequence can sway its judgment. Depending on the model and the phrasing of the prompt, that shows up as a preference for the first option presented or for the second. The two are sometimes called primacy bias and recency bias, terms borrowed from the psychology of how people remember lists.

Researchers studying LLM-as-judge setups have documented this consistently enough that it has a name: position bias. It's not universal across every model or every prompt structure, and the strength of the effect varies, but it shows up often enough that assuming your judge is immune to it is itself the risky part. The practical takeaway for a test suite isn't the underlying psychology or the model architecture. It's simpler: if a grading prompt presents two things for comparison, the order of presentation is a variable in the outcome, and until proven otherwise, it should be treated as one.

The Experiment

Talking about position bias in the abstract is one thing. Seeing it flip a verdict in an actual test run is more convincing. Here's a minimal setup that reproduces it using promptfoo.

Start with two candidate responses to the same prompt, ideally ones that are close in quality so the judge doesn't have an easy, obvious call to make. Something like:

                
# prompts/candidates.yaml
responseA: "Paris is the capital of France, known for the Eiffel Tower and the Louvre."
responseB: "The capital of France is Paris, home to landmarks like the Eiffel Tower and the Louvre museum."

Then write a judge prompt that asks a model to pick the better of the two:

                
# promptfooconfig.yaml
# The judge prompt: both candidates appear in one prompt, slot A first.
prompts:
  - "Compare these two responses and say which is better, A or B, and why.\n\nResponse A: {{responseA}}\n\nResponse B: {{responseB}}"

# The model acting as the judge.
providers:
  - id: openai:gpt-4o-mini

tests:
  # Case 1: original order.
  - vars:
      responseA: "Paris is the capital of France, known for the Eiffel Tower and the Louvre."
      responseB: "The capital of France is Paris, home to landmarks like the Eiffel Tower and the Louvre museum."
  # Case 2: the same two responses with their slots swapped.
  - vars:
      responseA: "The capital of France is Paris, home to landmarks like the Eiffel Tower and the Louvre museum."
      responseB: "Paris is the capital of France, known for the Eiffel Tower and the Louvre."

The two test cases contain identical content. Only the labels A and B have swapped which response they point to. Run the suite and compare the two verdicts. A content-driven judge names the same underlying response both times, which means the winning label changes between the two cases. A position-biased judge does the opposite: its stated preference tracks the slot label, not the content, so whichever response lands in the favored slot (slot A, for a judge that prefers the first position) wins regardless of which actual response it happens to be.

To make this more than a one-off anecdote, the same pair can be run through several rounds, alternating which response occupies which slot, and the verdicts logged each time. A judge with no position bias would pick the same underlying response as better across every round, since nothing about the actual content changed. A judge with position bias will show a pattern where the win rate correlates with slot position instead of content quality, which is exactly the kind of pattern a QA mindset is already trained to notice in a flaky test log.
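The multi-round tally described above can be sketched in a few lines of Python. This is a rough sketch under stated assumptions, not promptfoo's API: `judge` is a hypothetical callable that returns the winning slot label ("A" or "B"), and `first_slot_judge` is a deliberately biased stand-in so the example runs without a model call.

```python
from collections import Counter
from typing import Callable, Tuple

# Hypothetical judge signature: (slot A text, slot B text) -> "A" or "B".
Judge = Callable[[str, str], str]


def first_slot_judge(slot_a: str, slot_b: str) -> str:
    """Stand-in for a real model call: a maximally position-biased judge."""
    return "A"


def tally_rounds(judge: Judge, resp_1: str, resp_2: str,
                 rounds: int = 6) -> Tuple[Counter, Counter]:
    """Alternate which response sits in slot A and count wins two ways."""
    by_slot = Counter()      # wins per slot label
    by_response = Counter()  # wins per underlying response
    for i in range(rounds):
        swapped = i % 2 == 1
        slot_a, slot_b = (resp_2, resp_1) if swapped else (resp_1, resp_2)
        verdict = judge(slot_a, slot_b)
        by_slot[verdict] += 1
        # Map the slot label back to the response that occupied it this round.
        resp_1_slot = "B" if swapped else "A"
        by_response["resp_1" if verdict == resp_1_slot else "resp_2"] += 1
    return by_slot, by_response


by_slot, by_response = tally_rounds(
    first_slot_judge,
    "Paris is the capital of France, known for the Eiffel Tower and the Louvre.",
    "The capital of France is Paris, home to landmarks like the Eiffel Tower and the Louvre museum.",
)
print(dict(by_slot))      # {'A': 6}
print(dict(by_response))  # {'resp_1': 3, 'resp_2': 3}
```

With the biased stand-in, every win goes to slot A while the two responses split the wins evenly, which is the signature of a verdict driven by position. A content-driven judge would show the reverse: one response taking every round and the slot labels splitting.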

The Results

Running that swapped-order experiment a handful of times tends to produce a pattern that's hard to explain away as noise. Instead of the judge landing on the same underlying response as the better one regardless of which slot it occupies, the win count clusters around a position instead of a response. In many published tests on this topic, the first slot wins somewhere in the range of 60 to 80 percent of the time across many models and prompt styles, even when the two responses are nearly identical in quality. Some models lean the other way and favor the second slot instead. The exact number varies by model and prompt phrasing, but the effect is rarely zero.

What makes this worth sitting with is how confident the judge sounds while doing it. The stated reasoning in the judge's response usually reads as coherent and specific: it might praise response A's phrasing or structure in a way that sounds like a genuine, considered evaluation. Swap the labels, and the judge produces equally confident, equally specific reasoning in favor of whichever response now sits in that same first slot. The justification isn't fabricated nonsense. It's plausible-sounding, which is exactly what makes it risky. A verdict that came with a shrug would be easy to distrust. A verdict that comes with a clear, articulate explanation is much easier to mistake for a real, content-based judgment, even when the content had nothing to do with the outcome.

For a QA engineer, this should land as a familiar shape of problem: a test that passes with a clean, readable log message, giving every appearance of a legitimate result, while the actual cause of the pass or fail lives somewhere else entirely.

Mitigations You Can Actually Ship

Once position bias is confirmed rather than assumed, the fix doesn't need to be complicated. A few practical options, roughly in order of how easy they are to add to an existing suite:

Run it twice and combine. For any pairwise comparison, run the judge once with the original order and once with the responses swapped, then combine the results. If the judge picks the same response both times, that's a real signal. If the verdict flips depending on order, treat it as a tie or flag it for review rather than trusting whichever run happened to execute first. This is the eval equivalent of retrying a flaky test and checking whether the result is consistent, except here the "retry" is a deliberate swap rather than a blind rerun.
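As a rough illustration of the two-run check, the sketch below assumes a hypothetical `judge` callable that returns the winning slot label ("A" or "B"). The two stand-in judges exist only so the example runs without a model call; neither is part of promptfoo.

```python
from typing import Callable

# Hypothetical judge signature: (slot A text, slot B text) -> "A" or "B".
Judge = Callable[[str, str], str]


def swap_and_compare(judge: Judge, resp_1: str, resp_2: str) -> str:
    """Return "resp_1" or "resp_2" if both orderings agree, else "tie"."""
    first = judge(resp_1, resp_2)    # resp_1 sits in slot A
    second = judge(resp_2, resp_1)   # resp_1 sits in slot B
    winner_first = "resp_1" if first == "A" else "resp_2"
    winner_second = "resp_1" if second == "B" else "resp_2"
    return winner_first if winner_first == winner_second else "tie"


def always_first(slot_a: str, slot_b: str) -> str:
    """Stand-in: a judge whose verdict depends only on position."""
    return "A"


def prefers_longer(slot_a: str, slot_b: str) -> str:
    """Stand-in: a judge whose verdict depends only on content (length)."""
    return "A" if len(slot_a) >= len(slot_b) else "B"


short = "Paris is the capital of France."
longer = "The capital of France is Paris, home to the Eiffel Tower."

print(swap_and_compare(always_first, short, longer))    # tie
print(swap_and_compare(prefers_longer, short, longer))  # resp_2
```

Returning "tie" on a flip keeps an order-dependent verdict from being reported as a win. Whether a tie should be counted, dropped, or escalated for human review is a policy choice for the team running the eval.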

Randomize the order per test run. Instead of always presenting responses in the same fixed order, randomize which one lands in the first slot each time the suite runs. This won't eliminate the bias, but it prevents it from being consistently invisible in one direction, and it turns the win rate into something that can be tracked and audited over time rather than trusted at face value.
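One way to sketch per-run randomization is shown below. The `judge` callable and the record fields are hypothetical names chosen for illustration. The important detail is that the chosen order is logged alongside the verdict, since a randomized order that isn't recorded can't be audited later.

```python
import random
from typing import Callable, Dict

# Hypothetical judge signature: (slot A text, slot B text) -> "A" or "B".
Judge = Callable[[str, str], str]


def judge_randomized(judge: Judge, resp_1: str, resp_2: str,
                     rng: random.Random) -> Dict[str, str]:
    """Pick the slot order at random and record it with the verdict."""
    swapped = rng.random() < 0.5
    slot_a, slot_b = (resp_2, resp_1) if swapped else (resp_1, resp_2)
    verdict = judge(slot_a, slot_b)
    resp_1_slot = "B" if swapped else "A"
    return {
        "first_slot": "resp_2" if swapped else "resp_1",  # who went first
        "slot_verdict": verdict,                          # raw label from the judge
        "winner": "resp_1" if verdict == resp_1_slot else "resp_2",
    }


# A seeded generator makes a given run reproducible when debugging.
rng = random.Random(2026)
record = judge_randomized(lambda a, b: "A", "first answer", "second answer", rng)
print(record)
```

Over many runs, comparing `first_slot` against `winner` in the logged records shows how often the first position wins, which is the number worth tracking over time.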

Grade independently instead of comparatively. Rather than asking a judge to pick the better of two responses in a single prompt, have it score each response separately against a rubric, with no visibility into the other candidate. This sidesteps position bias entirely, since there's no "first" or "second" slot to anchor on, though it trades away the direct, head-to-head comparison some evals are specifically trying to capture.
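Independent grading might look something like the following sketch. `Scorer` is a hypothetical callable that sees a single response and returns a rubric score. `keyword_scorer` is a toy rubric standing in for a real judge call, and `margin` is an assumed tolerance below which two scores count as a tie.

```python
from typing import Callable

# Hypothetical scorer signature: one response in, one rubric score out.
Scorer = Callable[[str], float]


def keyword_scorer(response: str) -> float:
    """Toy rubric for illustration: one point per required fact mentioned."""
    required = ("Paris", "Eiffel Tower", "Louvre")
    return float(sum(fact in response for fact in required))


def grade_independently(scorer: Scorer, resp_1: str, resp_2: str,
                        margin: float = 0.0) -> str:
    """Score each response on its own, then compare the two scores."""
    score_1 = scorer(resp_1)  # the scorer never sees resp_2 here
    score_2 = scorer(resp_2)  # and never sees resp_1 here
    if abs(score_1 - score_2) <= margin:
        return "tie"
    return "resp_1" if score_1 > score_2 else "resp_2"


a = "Paris is the capital of France, known for the Eiffel Tower and the Louvre."
b = "The capital of France is Paris, home to landmarks like the Eiffel Tower and the Louvre museum."

print(grade_independently(keyword_scorer, a, b))  # tie
print(grade_independently(keyword_scorer, b, a))  # tie
```

Because each score is computed without reference to the other candidate, swapping the arguments can only swap the labels on the result. Any noise in the scorer itself still applies, so this removes the slot-order variable without making the judge deterministic.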

Track agreement across orderings as a metric. For teams running evals regularly, the swap-and-compare approach can become a standing check, similar to a coverage report. If the judge's agreement rate between the two orderings drops below some threshold, that's worth investigating on its own, independent of whatever specific comparison is being run that day.
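A standing agreement check could be sketched as below. The `judge` callable is again a hypothetical stand-in that returns "A" or "B", and `MIN_AGREEMENT` is an assumed threshold picked for illustration, not a recommended value.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (slot A text, slot B text) -> "A" or "B".
Judge = Callable[[str, str], str]

MIN_AGREEMENT = 0.8  # assumed threshold; tune it for the suite in question


def agreement_rate(judge: Judge, pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of pairs where both orderings pick the same underlying response."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair to compute agreement")
    agreed = 0
    for resp_1, resp_2 in pairs:
        first = judge(resp_1, resp_2)
        second = judge(resp_2, resp_1)
        # Same response winning twice means the slot label must flip with the order.
        if first != second:
            agreed += 1
    return agreed / len(pairs)


pairs = [
    ("short answer", "a noticeably longer answer"),
    ("another fairly long candidate", "brief"),
]

biased_rate = agreement_rate(lambda a, b: "A", pairs)
content_rate = agreement_rate(lambda a, b: "A" if len(a) > len(b) else "B", pairs)

print(biased_rate)                    # 0.0
print(content_rate)                   # 1.0
print(biased_rate >= MIN_AGREEMENT)   # False
```

Run on a fixed set of pairs each time the judge model or judge prompt changes, a falling rate signals that the judge has become more order-sensitive, independent of any single comparison.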

None of these require replacing the judge model or overhauling the whole eval pipeline. They just require treating prompt order as a variable worth testing, instead of an implementation detail that happens to be invisible until someone swaps it by accident.

Conclusion

None of this means LLM judges are unusable, only that they come with untested assumptions baked in, the same way any new piece of infrastructure does before someone pokes at it. Order dependency is not a reason to throw out comparative evals. It's a reason to test the eval itself before trusting what it tells you about everything else.

That's the shift worth making. A judge prompt is still a piece of the system under test, not a neutral oracle sitting outside it. It has inputs, it has variables that quietly influence the output, and some of those variables, like which response happens to appear first, have nothing to do with the thing actually being measured. Treating an eval suite with the same scrutiny as any other part of the pipeline, complete with regression checks, known failure modes, and a healthy default suspicion, is just extending the same instincts QA engineers already apply everywhere else in the stack. The judge doesn't get a pass just because it sounds confident.