
Multi-Step Grading Rubrics with LLMs for Answer Evaluation

May 10th 2025 9 min read
medium
python3.13.0
ai/ml
api
gpt

For QA engineers testing AI-powered applications, binary pass/fail testing often falls short. When evaluating user responses, LLMs offer a more nuanced approach that better mirrors human judgment. Rather than settling for "correct" or "incorrect," implementing multi-step grading rubrics with LLMs enables granular, consistent evaluation across multiple dimensions—turning our quality assurance from black-and-white verdicts into rich, actionable insights that better reflect real-world usage patterns.

Why Use a Grading Rubric?

In testing systems that evaluate user-generated content—like quiz answers, form explanations, or chatbot interactions—"correct or not" isn't enough.

To ensure fairness, depth, and reproducibility, we can use a multi-step rubric, asking the AI to evaluate across several dimensions, such as:

- Correctness: is the answer factually accurate?
- Clarity: is it expressed in a way the reader can easily follow?
- Depth of understanding: does it show genuine grasp of the underlying concept?
- Conciseness: does it stay focused without unnecessary padding?

This gives us structured insight into why an answer passed or failed—and allows for automated breakdowns and test logic per dimension.
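For instance, the rubric and the minimum score we expect per dimension can be captured in a small data structure and asserted on individually. The sketch below is purely illustrative; the minimum scores and the assert_rubric_minimums helper are ours, and it assumes the evaluator returns the JSON structure built later in this post.

# Illustrative per-dimension minimums (adjust to your own quality bar).
RUBRIC_MINIMUMS = {
    "correctness": 2,  # scored 0-3 by the evaluator
    "clarity": 2,      # scored 0-3
    "depth": 1,        # scored 0-2
    "conciseness": 1,  # scored 0-2
}

def assert_rubric_minimums(result: dict) -> None:
    """Fail with a per-dimension message if any category is below its minimum.

    Assumes `result` is the parsed JSON dictionary returned by the evaluator.
    """
    for dimension, minimum in RUBRIC_MINIMUMS.items():
        score = result.get(dimension, 0)
        assert score >= minimum, f"{dimension} score too low: {score} < {minimum}"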

Python Example: Rubric-Based Evaluation with GPT

We start by importing the OpenAI SDK and creating a client instance. This client is needed to interact with OpenAI's API and send requests to models like gpt-4o-mini.

                
import openai

client = openai.OpenAI(api_key="your-api-key")
                

Warning: Never hardcode API keys directly into your code. For production environments, always store them securely in environment variables to avoid accidental exposure or leaks.
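For example, the key can be loaded from an environment variable at runtime. The snippet below assumes the key has been exported as OPENAI_API_KEY, which is also the variable the OpenAI SDK looks for by default when no key is passed explicitly.

import os
import openai

# Read the key from the environment instead of hardcoding it in the source.
# If OPENAI_API_KEY is set, openai.OpenAI() will also pick it up automatically.
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])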

Next, we define a function called evaluate_with_rubric. It accepts three arguments: the original question, the expected answer, and the user's actual response. These are used to frame the context for the AI.

                
def evaluate_with_rubric(question, expected_answer, user_answer):
                

Inside the function, we prepare a prompt that gives GPT a specific task: to act as a grader using a structured rubric. The prompt includes the question, the expected answer, and the user's answer. It instructs the model to evaluate across four categories and return the result in a structured JSON format.

                
prompt = f"""
You are a grader using a structured rubric.
Question: {question}
Expected Answer: {expected_answer}
User Answer: {user_answer}
                
Evaluate the answer in four categories:
1. Correctness (0-3)
2. Clarity (0-3)
3. Depth of understanding (0-2)
4. Conciseness (0-2)
                
Respond in JSON format:
{{
    "correctness": ,
    "clarity": ,
    "depth": ,
    "conciseness": ,
    "total": ,
    "comments": ""
}}
"""
                

The prompt is then sent to the model using the client.chat.completions.create method. We specify the model to use, pass the prompt as a single user message, and request a structured JSON object via the response_format parameter. The temperature is set to 0 to keep the results as deterministic and consistent as possible.

                
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
    temperature=0
)
                

Finally, we extract and return the model's output, removing any extra whitespace. This result is a JSON-formatted string containing the rubric scores and a brief explanation.

                
return response.choices[0].message.content.strip()
                

To see this function in action, we define a sample question, the correct expected answer, and a user-provided response. We call the evaluate_with_rubric function and print out the result.

                
# Example
question = "Explain why the sky is blue."
expected = "Due to Rayleigh scattering of sunlight in the atmosphere."
user = "Because of how sunlight bends through air."
print(evaluate_with_rubric(question, expected, user))
                

Here is what an output might look like:

{
  "correctness": 1,
  "clarity": 2,
  "depth": 1,
  "conciseness": 2,
  "total": 6,
  "comments": "The answer mentions sunlight and air but does not accurately explain the   phenomenon of Rayleigh scattering, which is the key reason for the blue sky."
}

Handling LLM Variability

Even with temperature=0, LLM evaluations can vary slightly from run to run. To keep tests stable, consider aggregating repeated evaluations, asserting against score thresholds rather than exact values, wrapping calls in retry and validation logic, and calibrating the evaluator against human-graded samples, as the rest of this post illustrates.
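One simple way to aggregate repeated evaluations, sketched below, is to take the median of several total scores. The median_total_score helper and the number of runs are illustrative choices rather than part of the main example; it reuses the evaluate_with_rubric function defined above.

import json
import statistics

from rubric_based_evaluation import evaluate_with_rubric

def median_total_score(question, expected_answer, user_answer, runs=3):
    """Evaluate the same answer several times and return the median total score."""
    totals = []
    for _ in range(runs):
        result = json.loads(evaluate_with_rubric(question, expected_answer, user_answer))
        totals.append(result["total"])
    return statistics.median(totals)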

Error Handling Example

We start by importing a few standard Python modules. json is used for parsing the AI's response, logging lets us log any issues during the process, and time helps us introduce a delay between retry attempts.

                
import json
import logging
import time
                

Next, we import the evaluate_with_rubric function that we just implemented.

                
from rubric_based_evaluation import evaluate_with_rubric
                

We now define a new function, safe_evaluate, that wraps the grading logic with error handling. It accepts the same inputs as the original evaluator, plus an optional retries parameter (defaulting to 3), which controls how many times the function will attempt a retry if an error occurs.

                
def safe_evaluate(question, expected, user_answer, retries=3):
                

Inside the function, we begin a loop that will run up to the number of allowed retry attempts. Each time through the loop is a chance to try evaluating again if something previously failed.

                
for attempt in range(retries):
                

We then attempt to evaluate the answer. The result is expected to be a JSON-formatted string, so we immediately parse it using json.loads. We also define a list of required keys and verify that all of them are present in the parsed result. If everything checks out, the parsed dictionary is returned.

                
try:
    result = evaluate_with_rubric(question, expected, user_answer)
    parsed = json.loads(result)
    required_keys = ["correctness", "clarity", "depth", "conciseness", "total"]
    if all(key in parsed for key in required_keys):
        return parsed
                

If anything goes wrong—whether during the API call, JSON parsing, or key validation—the except block catches the exception. The error is logged with the current attempt number, and the function waits one second before retrying.

                
except Exception as e:
    logging.error(f"Evaluation attempt {attempt + 1} failed: {e}")
    time.sleep(1)
                

If all retry attempts fail, the function exits the loop and returns a fallback dictionary. This prevents the calling code from crashing and makes it clear that something went wrong.

                
return {"error": "Evaluation failed", "total": 0}
                

Example Test Case

We begin by defining a test function named test_science_quiz_evaluation. This function simulates an automated quality check of a user's answer to a science question using the grading system we've built.

                
def test_science_quiz_evaluation():
                

Inside the function, we set up a quiz question about the water cycle. We also define what would be considered the expected ideal answer and a shorter, more general user-provided answer. These three strings are the inputs to the evaluation logic.

                
question = "Explain the water cycle."
expected = "The water cycle is the continuous movement of water between the Earth's surface, atmosphere, and underground. It involves evaporation, condensation, precipitation, and collection."
user_answer = "Water evaporates, forms clouds, and then rains down."
                

Next, we call safe_evaluate, which wraps our rubric-based evaluator with retry logic. This gives us a reliable, structured evaluation result, even if something goes wrong under the hood.

                
result = safe_evaluate(question, expected, user_answer)
                

We then add two test assertions. The first checks that the overall score meets a minimum threshold—suggesting the answer is at least somewhat acceptable. The second checks that the “correctness” score meets a specific bar. If either condition fails, the test will raise an error with a helpful message.

                
assert result["total"] >= 5, f"Total score too low: {result['total']}"
assert result["correctness"] >= 2, f"Correctness score too low: {result['correctness']}"
                

Finally, whether the test passes or fails, we log the detailed evaluation result. This makes it easier to review what the model scored and why, especially during debugging or result analysis.

                
logging.info(f"Evaluation details: {result}")
                

Calibrating our LLM Evaluator

Before rolling out in production we should:

- Grade a representative sample of answers by hand and compare the human scores against the LLM's rubric scores, as sketched below.
- Adjust the rubric wording or score ranges wherever the two consistently disagree.
- Only then lock in the pass/fail thresholds used in our automated test assertions.
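As a rough sketch of what that comparison could look like, the snippet below scores a hypothetical set of human-graded answers with safe_evaluate and reports the average disagreement with the human totals. The module path, sample data, and acceptable gap are all assumptions for illustration.

# Hypothetical module path; adjust to wherever safe_evaluate lives in your project.
from safe_evaluation import safe_evaluate

# Hypothetical calibration set: (question, expected answer, user answer, human total score)
calibration_samples = [
    (
        "Explain why the sky is blue.",
        "Due to Rayleigh scattering of sunlight in the atmosphere.",
        "Because of how sunlight bends through air.",
        5,  # illustrative human-assigned total out of 10
    ),
    # ...add more human-graded samples...
]

def average_score_gap(samples):
    """Mean absolute difference between LLM totals and human-assigned totals."""
    gaps = [
        abs(safe_evaluate(question, expected, answer)["total"] - human_total)
        for question, expected, answer, human_total in samples
    ]
    return sum(gaps) / len(gaps)

# A large average gap (say, more than 1 point) suggests the rubric wording
# or score ranges need adjusting before the thresholds are locked in.
print(f"Average gap vs. human grading: {average_score_gap(calibration_samples):.2f}")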

Why This Helps in QA

Implementing multi-step rubrics transforms our QA approach in several meaningful ways. With this methodology, our team gains more granular test coverage by asserting specific rubric scores in test cases rather than relying on simple pass/fail conditions.

This granularity enables us to catch subtle issues that traditional testing might miss—for instance, when an answer contains factually correct information but is poorly worded or lacks clarity. Such distinctions are critical for user-facing systems where communication quality matters as much as factual accuracy.

Throughout our development cycles, these rubrics provide powerful regression tracking capabilities. They allow us to quickly identify which specific dimension failed across test runs and address the root cause without losing sight of the bigger picture.

Perhaps most valuable is the customizability. QA teams can tailor rubrics to match their application's specific goals, incorporating dimensions like tone, originality, or domain-specific criteria that matter most to their users.

This comprehensive evaluation approach yields more meaningful, actionable test results that better reflect real-world usage patterns—leading to improved quality, user satisfaction, and confidence in our releases.

Conclusion

Structured grading rubrics unlock a powerful new layer of precision in automated QA testing with LLMs. Instead of treating AI like a black box, rubrics provide transparency, consistency, and actionable insight across multiple evaluation dimensions.

We can set meaningful thresholds—such as requiring a minimum total score (e.g., ≥ 7/10) or ensuring no individual category falls below a 2—to enforce quality standards. These thresholds help guard against answers that are technically correct but fail in areas like clarity or depth. Additionally, low scores in specific dimensions like “clarity” can trigger manual UX reviews, making our QA efforts even more user-focused.
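Translated into code, those example thresholds might look like the small gate functions below. The function names are ours, and the cut-off values simply mirror the examples in this paragraph.

def meets_quality_bar(result):
    """Apply the example thresholds: total >= 7/10 and no category below 2."""
    categories = ["correctness", "clarity", "depth", "conciseness"]
    return result.get("total", 0) >= 7 and all(
        result.get(category, 0) >= 2 for category in categories
    )

def needs_ux_review(result):
    """Flag answers whose clarity score is low enough to warrant a manual UX review."""
    return result.get("clarity", 0) < 2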

This kind of evaluative logic mirrors the nuanced decisions real users and testers make, allowing teams to catch subtle regressions and raise the bar on overall software quality. When LLMs are paired with structured rubrics, they become more than just smart models—they evolve into dependable, auditable tools that align tightly with our product's goals.

The complete code example featured in this post is available on our GitHub page, ready for you to explore, adapt, and integrate into your own test automation workflows.