Most test suites are great at catching the obvious failures: broken endpoints, malformed responses, timeouts. But there is a whole class of LLM failure that produces none of those signals. The model responds with a 200, the JSON parses cleanly, and the answer is simply wrong. Context overload is one of the most common culprits. Feed a model too much text and it starts dropping information, quietly, with no indication that anything went wrong. This post walks through a practical automation test you can drop into any CI pipeline to catch it before your users do.
Every LLM has a context window: a hard limit on how many tokens it can process in a single request. A token is roughly three to four characters, so a 200,000 token window sounds enormous. In practice, most real world requests sit well below that ceiling. The problem is not hitting the limit. The problem is what happens to quality as you approach it.
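The three-to-four-characters-per-token rule can be turned into a quick back-of-the-envelope estimate. This is a rough heuristic only; exact counts require the model's actual tokenizer:

```javascript
// Rough heuristic: ~4 characters of English prose per token.
// Only an estimate; use a real tokenizer for exact counts.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}
```

By that rule, a 200,000 token window corresponds to somewhere in the region of 600,000 to 800,000 characters of prose.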
Think of the context window less like RAM and more like human working memory. When you ask someone to remember a short grocery list, they nail it. Give them a 40 item list and something gets dropped, usually the items in the middle. LLMs behave remarkably similarly. Research has repeatedly shown that models are most reliable at recalling information placed near the beginning or end of a long input. Information buried in the middle gets attended to less, sometimes dramatically so. This is known as the "lost in the middle" problem, and it shows up consistently across model families and sizes.
What makes this tricky for QA engineers is that the degradation is gradual and invisible. There is no threshold you cross where the model suddenly starts failing. Instead, recall accuracy erodes as context grows, and the erosion is uneven depending on where in the context the relevant information lives. A fact placed 5% of the way into a 10,000 token input might be recalled perfectly, missed entirely at the 50% mark, and retrieved reliably again at 95%. Position matters more than most engineers expect, and unlike most software bugs, there is no stack trace pointing you to the cause.
This is why standard API testing is not enough. Checking that a response is non-empty, well formatted, or semantically related to the prompt will not surface this class of failure. You need a test that deliberately probes recall at multiple positions across a range of context sizes, which is exactly what the needle in a haystack method does.
The needle in a haystack test is one of the most widely used evaluation techniques in LLM research, and it translates directly into an automation context. The idea is simple: hide a specific, unambiguous piece of information somewhere inside a large block of filler text, then ask the model to retrieve it. If the model answers correctly, it read and retained the needle. If it does not, the context swallowed it.
The filler text matters more than it might seem. You want content that is coherent enough to resemble a real document but generic enough that the needle stands out as the only possible answer. Padding with random characters or repeated sentences makes the task artificially easy. Padding with plausible, topic-adjacent prose is a much more honest test of how the model performs on real inputs.
The secret fact itself should be something the model could not guess or infer. A made up codeword, a random numeric string, or a nonsense identifier works well. Anything semantically meaningful risks the model reconstructing a plausible answer from training data rather than actually recalling what was in the context.
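Put together, the filler pool and size configuration that the later snippets rely on (FILLER and CONTEXT_SIZE) can be as simple as an array of sentences and a chunk count. The sentences and numbers here are illustrative, not the exact values from the script:

```javascript
// Illustrative filler pool; real runs benefit from longer, more varied prose.
const FILLER = [
  "The committee reviewed the quarterly figures before adjourning for the day.",
  "Migration patterns shift gradually as coastal temperatures continue to rise.",
  "The prototype passed its initial stress test after two minor adjustments.",
  "Local historians disagree about the exact founding date of the settlement.",
];

// The chunk count, not a token count, drives the context length here.
const CONTEXT_SIZE = { label: "medium", chunks: 400 };
```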
Where the method gets powerful is in running it across multiple positions in a single test suite. Rather than dropping the needle once and calling it a pass or fail, you inject it at five different offsets: near the beginning, at the 25% mark, at the midpoint, at 75%, and near the end. Each position gets its own API call and its own verdict. The five results together give you a positional recall map: a picture of where in the context window the model is reliable and where it starts to lose the thread.
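The five offsets can be declared once as a table that the runner iterates over; the labels are only there to make the output readable:

```javascript
// Fractional offsets into the filler; 0 is the very start, 1 the very end.
const POSITIONS = [
  { label: "start (5%)", offset: 0.05 },
  { label: "early (25%)", offset: 0.25 },
  { label: "middle (50%)", offset: 0.5 },
  { label: "late (75%)", offset: 0.75 },
  { label: "end (95%)", offset: 0.95 },
];
```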
A perfect score means the model recalled the fact regardless of where it appeared. A failure clustered around the middle positions is the classic lost in the middle signature. Failures spread across all positions suggest the context size itself is the problem. Each pattern points to a different root cause and a different fix, which makes the positional breakdown far more actionable than a single pass or fail result ever could be.
The full script is a single Node.js file with no external dependencies. It runs against the Anthropic API directly using the native fetch API available in Node 18 and above. Here is how each meaningful piece works.
The secret fact is defined in one place as a plain object. The injection field is what gets inserted into the filler text. The question field is what gets sent to the model at the end of the prompt. Keeping them together makes it easy to swap in different secrets for different test runs.
```javascript
const SECRET = {
  value: "HELIOSPHERE-42",
  injection: "THE SECRET CODEWORD IS: HELIOSPHERE-42. Remember this exact value.",
  question: "What is the secret codeword mentioned in the text? Reply with only the exact value, nothing else.",
};
```
This is the core of the test. The function takes a position value between 0 and 1, calculates which chunk index the needle should be inserted at, and builds the full context string by stitching filler sentences together with the needle dropped in at the right spot.
```javascript
function buildContext(position) {
  const total = CONTEXT_SIZE.chunks;
  // Map the fractional position (0 to 1) onto a chunk index.
  const insertAt = Math.floor(total * position);
  const chunks = [];
  for (let i = 0; i < total; i++) {
    // Drop the needle in just before the chunk at the target index.
    if (i === insertAt) chunks.push(SECRET.injection);
    chunks.push(FILLER[i % FILLER.length]);
  }
  return chunks.join(" ");
}
```
A position of 0.05 places the needle near the start. A position of 0.5 places it dead center. A position of 0.95 places it near the end. The same function handles all five test cases just by changing that single argument.
The query function sends the assembled context plus the question to the model as a single user message. The prompt structure is deliberately simple: the context comes first, a horizontal rule separates it from the question, and the question asks for only the exact value with no elaboration. Keeping the prompt tight reduces the chance of the model padding its answer in a way that makes scoring ambiguous.
```javascript
body: JSON.stringify({
  model: MODEL,
  max_tokens: 50,
  messages: [{
    role: "user",
    content: `${context}\n\n---\n\n${SECRET.question}`,
  }],
}),
```
Setting max_tokens to 50 is intentional. The expected answer is short. A generous token budget invites the model to hedge or explain itself, which complicates the pass or fail check.
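For completeness, here is one way the surrounding query() function can look. The endpoint, headers, and response shape follow the Anthropic Messages API; the model id is only an example, and SECRET is repeated so the snippet stands alone:

```javascript
// Repeated here so the sketch is self-contained; matches the object above.
const SECRET = {
  value: "HELIOSPHERE-42",
  question: "What is the secret codeword mentioned in the text? Reply with only the exact value, nothing else.",
};

const MODEL = "claude-sonnet-4-5"; // example model id; use your own target

// Building the body in a separate function keeps it easy to inspect in tests.
function buildRequestBody(context) {
  return {
    model: MODEL,
    max_tokens: 50,
    messages: [{ role: "user", content: `${context}\n\n---\n\n${SECRET.question}` }],
  };
}

async function query(context) {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify(buildRequestBody(context)),
  });
  if (!res.ok) throw new Error(`API error ${res.status}: ${await res.text()}`);
  const data = await res.json();
  // The Messages API returns an array of content blocks; the first is the text.
  return data.content[0].text.trim();
}
```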
The pass check is a single line. It normalises both strings to uppercase and checks whether the model's answer contains the secret value. A contains check rather than strict equality gives the model a little room to include surrounding punctuation without failing.
```javascript
function pass(answer) {
  return answer.toUpperCase().includes(SECRET.value.toUpperCase());
}
```
The runner loops through the five position offsets in sequence, calls buildContext and query for each, scores the result, and prints a line to stdout. A 500 millisecond delay between calls keeps the test polite against the API rate limiter.
```javascript
for (const pos of POSITIONS) {
  const context = buildContext(pos.offset);
  const answer = await query(context);
  const ok = pass(answer);
  results.push({ ...pos, answer, ok });
  await sleep(500);
}
```
At the end of the loop, the results array holds everything needed to compute the final score and print the positional breakdown.
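The summary step can then be a small function over that array. The field names match the runner above; the output format is just one reasonable choice:

```javascript
// results: [{ label, offset, answer, ok }, ...] as built by the runner loop.
function report(results) {
  for (const r of results) {
    console.log(`${r.ok ? "PASS" : "FAIL"}  ${r.label}  ->  "${r.answer}"`);
  }
  const passed = results.filter((r) => r.ok).length;
  console.log(`Score: ${passed}/${results.length}`);
  return passed;
}
```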
The complete code example from this post is available on our GitHub page, where you can clone it and adapt it for your own model and pipeline.
A failing test is only useful if it points you toward a fix. With positional recall tests, the pattern of failures matters as much as the count. Three distinct outcomes come up repeatedly in practice, and each one calls for a different response.
The first outcome is a clean sweep: the model recalled the needle regardless of where it appeared in the context. For the context size you tested, there is no overload problem. If you are seeing correct behaviour in production and ran this test as a precaution, you can move on with confidence. Consider running the next context size up to find where the boundary actually is.
The second outcome is failure clustered in the middle positions: the classic lost in the middle signature. The beginning and end of the context are reliable but the centre is not. This pattern tells you the model is working correctly at a mechanical level, but your content is being arranged in a way that puts important information in its blind spot.
The fix is usually structural rather than a model swap. Reorder your prompt so that the most critical information appears near the top or bottom. If you are injecting retrieved documents into the context, rank them so the highest relevance chunks land at the edges. If you are summarising long conversations, compress the middle turns aggressively rather than passing them verbatim. You do not necessarily need a bigger or different model. You need a smarter prompt architecture.
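One way to implement the edge placement for retrieved documents is to alternate ranked chunks between the front and the back of the context, so relevance decreases toward the middle. A sketch, assuming the documents arrive sorted by descending relevance:

```javascript
// Given docs sorted by descending relevance, place them so the most relevant
// land at the edges of the context and the least relevant end up in the middle.
function arrangeForEdges(rankedDocs) {
  const front = [];
  const back = [];
  rankedDocs.forEach((doc, i) => {
    (i % 2 === 0 ? front : back).push(doc);
  });
  return [...front, ...back.reverse()];
}
```

With five documents ranked 1 to 5, this yields the order 1, 3, 5, 4, 2: the two most relevant sit at the edges and the least relevant lands in the middle, where recall is weakest.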
The third outcome is failures spread across every position, which suggests the context size itself has exceeded what the model handles reliably, regardless of where the needle sits. This is a harder problem because rearranging content will not solve it.
A few options are worth considering here. Chunking the input and making multiple smaller API calls rather than one large one is often the most straightforward fix. Retrieval augmented generation, where you fetch only the most relevant fragments of a document rather than passing the whole thing, directly addresses the root cause by keeping context lean. If neither approach fits your use case, this is also a legitimate signal to evaluate whether a model with a larger effective context window is worth the cost for your application.
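The chunking option can start as simply as fixed-size slices, each sent as its own request. Splitting on sentence or paragraph boundaries is better in practice; this sketch uses raw character offsets for brevity:

```javascript
// Naive fixed-size chunking by character count. Production code should
// split on sentence or paragraph boundaries to avoid cutting facts in half.
function chunkText(text, maxChars) {
  const chunks = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}
```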
Beyond individual test runs, the most valuable thing you can do is track the score across time. A model that scores 5 out of 5 today and 3 out of 5 next month without any changes on your end is a signal that something has shifted on the provider side. Logging results to a simple datastore and charting them gives your team an early warning system that no single test run can provide on its own.
Context overload is not a problem you solve once and close the ticket on. Models update, prompts evolve, and the volume of content flowing through your application grows. The needle in a haystack test is not a one-time audit. It is a recurring check on a failure mode that will keep finding new ways to surface if you are not looking for it.
The script as written is a solid foundation, but there are five areas worth improving as you mature the test.
Measure real tokens. The context size labels are estimates based on chunk counts, not actual tokens. Since model limits are defined in tokens, plugging in a proper tokenizer like tiktoken or gpt-tokenizer and logging the real count removes that ambiguity entirely.
Diversify the filler. All the current filler sentences are AI-related, which means the model can compress them efficiently. Mixing in legal text, sports articles, random JSON, and historical paragraphs creates genuine context noise and makes the test harder to game.
Add multiple needles. Hiding three secrets and asking for one specific one tests attention precision rather than just recall. It better reflects real applications where multiple important facts coexist in a long context.
Tighten the pass condition. The current check uses includes, which passes answers that contain extra surrounding text. Since the prompt explicitly asks for only the exact value, a strict equality check after trimming whitespace is more honest.
Run each position multiple times. LLMs are non-deterministic. A single call per position can mislead. Running each position three to five times and taking a majority result gives you a statistically meaningful signal rather than a snapshot.
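That last improvement can be wrapped around whatever single-call check the script already has. In this sketch, runOnce stands in for one query-plus-pass round for a given position and is passed in as a function so the helper stays testable:

```javascript
// Run one position several times and take the majority verdict.
// runOnce: async () => boolean, one full query + pass check for a position.
async function majorityPass(runOnce, runs = 3) {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    if (await runOnce()) passes++;
  }
  return passes > runs / 2;
}
```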