Running a 50-test eval suite is satisfying right up until your VPN drops on test 31, your CI job times out at the 10-minute mark, or OpenAI decides it's a great time to rate limit you. Promptfoo has a pair of CLI flags that most engineers never notice: --resume and --retry-errors. Together they mean a failed run is never a full restart. This post shows you exactly how they work, with a setup you can run yourself in under five minutes.
The project is four files in a single folder, promptfoo-resume-demo/:

- prompt.txt: a simple customer support system prompt with a single {{message}} variable.
- tests.csv: 15 test cases, each a realistic customer inquiry with a basic contains assertion.
- hooks.js: logs each test result to the terminal as it completes and prints a pass-rate summary at the end.
- promptfooconfig.yaml: ties everything together and sets two options that matter for this demo: maxConcurrency: 2 keeps the run slow enough to interrupt comfortably, and delay: 800 adds an 800ms pause between tests so you have a clear window to hit Ctrl+C.
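For reference, the config might look something like the sketch below. The provider is an assumption (swap in whatever model you actually test against), and extensionHook is an assumed name for the function hooks.js exports:

```yaml
# Illustrative sketch of promptfooconfig.yaml -- provider and hook name are assumptions
prompts:
  - file://prompt.txt
providers:
  - openai:gpt-4o-mini          # any chat provider works for this demo
tests: file://tests.csv
extensions:
  - file://hooks.js:extensionHook
evaluateOptions:
  maxConcurrency: 2             # slow enough to interrupt comfortably
  delay: 800                    # 800ms gap between tests
```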
Before running anything, install promptfoo and export your OpenAI key:
npm install -g promptfoo
export OPENAI_API_KEY=sk-...
Then drop all four files into your project folder. The complete code for each file is available in our GitHub repository. Once the files are in place, you are ready to run your first eval.
Run the eval from your project folder:
promptfoo eval
You will see the hooks logging each test to the terminal as it completes:
[hooks] Suite starting — 15 tests loaded

Once you have seen four or five tests complete, hit Ctrl+C. Promptfoo will stop immediately and you will see something like this:
^C

That last line is the important one. Promptfoo has already written the completed results to disk. Nothing you have run so far is lost, and you will not be billed for those API calls again on resume. The interrupted run is stored locally and identified by an eval ID that --resume will pick up automatically in the next section.
Pick up the run exactly where it stopped:
promptfoo eval --resume
Promptfoo finds the latest incomplete eval automatically and skips every test that already has a result. You will see this reflected immediately in the hook output — the first few tests are silent, and logging only starts from where you interrupted:
[hooks] ✓ PASS | Do you integrate with Slack? | 1103ms

If you need to resume a specific run rather than the latest one, pass the eval ID explicitly:
promptfoo eval --resume <evalId>
You can find the ID by running promptfoo list evals before resuming.
One thing worth knowing: --resume does not just skip completed tests. It also reuses the original run's concurrency, delay, and cache settings automatically. You do not need to remember what flags you passed the first time. This matters in CI where the original run might have been triggered by a pipeline with specific environment variables or concurrency limits that you would otherwise have to reconstruct manually.
Once the run finishes, open the web UI to see the full picture:
promptfoo view
All 15 results appear together in a single eval — the 4 that completed before the interruption and the 11 that finished on resume. From promptfoo's perspective, and from your CI report's perspective, it is one unbroken run.
--retry-errors solves a different problem than --resume. Resume is for runs that never finished. Retry-errors is for runs that completed but left some tests in an ERROR state due to something transient: a rate limit spike, a network timeout, or an API blip that had nothing to do with your prompt or model.
After a run completes, check for errors in the terminal summary or in promptfoo view. If you see any, a single command re-runs just those tests:
promptfoo eval --retry-errors
Promptfoo finds every test that returned an ERROR in the latest eval and re-runs only those, leaving all passing and failing results untouched. The output will look familiar:
[hooks] ✓ PASS | I'm getting a 403 error on the API | 1044ms

There is a data safety guarantee here that is worth understanding before you rely on this in CI. Promptfoo does not overwrite the original ERROR results until the retry succeeds. If the retry itself fails — say the API is still down — your original results are preserved exactly as they were. You can run --retry-errors as many times as you need without risking the results you already have.
Two constraints to be aware of: --retry-errors cannot be combined with --resume or --no-write, and it always operates on the latest eval. If you need to retry errors from a specific older run, export it first with promptfoo export eval <evalId> and re-import it as your working eval.
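Because the original results survive a failed retry, a bounded retry loop is safe to run locally. The sketch below shows the pattern with a hypothetical stub, flaky_eval, standing in for promptfoo eval --retry-errors; it fails twice and then succeeds, mimicking a transient outage clearing up:

```shell
# Stand-in demo of the retry-until-clean pattern.
# flaky_eval is a hypothetical stub for `promptfoo eval --retry-errors`:
# it fails twice, then succeeds, like a transient API outage clearing.
attempts=0
flaky_eval() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]   # the "outage" clears on the third attempt
}

for i in 1 2 3 4 5; do
  if flaky_eval; then
    echo "clean after $attempts attempts"
    break
  fi
  sleep 1                  # give the outage a moment before retrying
done
```

In a real session you would replace flaky_eval with the promptfoo command itself and cap the loop so a genuine outage fails the run instead of spinning forever.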
The natural place for these two commands is a GitHub Actions workflow. The pattern is straightforward: run the eval, and if it exits with an error code, attempt a retry before failing the build. This catches the case where a transient API error is the only thing standing between you and a green pipeline.
name: LLM Eval
on:
  push:
    branches: [main]
  pull_request:
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - name: Cache promptfoo results
        uses: actions/cache@v5
        with:
          path: ~/.cache/promptfoo
          key: promptfoo-${{ hashFiles('prompt.txt', 'tests.csv', 'promptfooconfig.yaml') }}
      - name: Install promptfoo
        run: npm install -g promptfoo
      - name: Run eval
        id: eval
        run: promptfoo eval --no-progress-bar --no-table
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        continue-on-error: true
      - name: Retry transient errors
        if: steps.eval.outcome == 'failure'
        run: promptfoo eval --retry-errors --no-progress-bar --no-table
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
The cache step is important here. It keys on a hash of your prompt, tests, and config files, so identical test cases are served from cache rather than hitting the API again. Combined with --retry-errors, this means the only tests that ever make a live API call on a retry are the ones that genuinely errored — not the ones that were already cached from a previous run.
One small detail: --no-progress-bar and --no-table are added to both eval steps because the progress bar and table use special characters that produce noisy output in CI logs. They have no effect on results.
If you followed the previous post on extension hooks, the hooks.js file in this demo is already doing something useful beyond the summary report. The beforeEach hook stamps a run_id onto every test as it starts, and the afterEach hook logs each result immediately after it completes:
// hooks.js: one extension function handles every hook by name
const testStartTimes = {};

async function extensionHook(hookName, context) {
  if (hookName === 'beforeEach') {
    const desc = context.test.description ?? 'unnamed';
    testStartTimes[desc] = Date.now(); // needed for the elapsed time below
    context.test.vars.run_id = `run_${Date.now()}`;
    return context;
  }
  if (hookName === 'afterEach') {
    const desc = context.test.description ?? 'unnamed';
    const elapsed = Date.now() - (testStartTimes[desc] ?? Date.now());
    const status = context.result.success ? '✓ PASS' : '✗ FAIL';
    console.log(`[hooks] ${status} | ${desc} | ${elapsed}ms`);
  }
}

module.exports = extensionHook;
When you run --resume, this logging makes the skipped tests immediately visible. The terminal stays silent for every completed test and only starts printing once promptfoo reaches the first uncompleted one. In a suite of 50 or 100 tests that is a quick sanity check that resume picked up in the right place, without having to open the web UI or inspect the eval ID.
The run_id variable serves a subtler purpose. Because it is stamped fresh on every test that actually executes, tests that were skipped on resume will have a different run_id prefix than tests that ran in the original session. If you are writing results to an external system in afterEach, that distinction lets you tag which session produced each result without any extra bookkeeping.
A flaky network or a CI timeout used to mean restarting your entire eval suite from scratch and absorbing the full API cost again. With --resume and --retry-errors, neither of those things needs to happen. Interrupted runs pick up exactly where they stopped, transient errors get a second chance without touching results that are already good, and your CI pipeline stays green for the right reasons.
If you want to go deeper on controlling the evaluation lifecycle, the extension hooks post covers how to inject test cases at runtime, enforce quality gates, and push results to external systems. The full code for this demo is on our GitHub page, and the official promptfoo CLI reference documents every flag used in this post.