You're a QA automation engineer who wants to master LLM evaluation, but those API costs add up fast, especially when you're just learning and experimenting. What if you could build a robust evaluation framework, test your logic, and gain hands-on experience without spending a dime on API calls? In this guide, we'll explore practical techniques for evaluating LLMs using mock responses, local models, and clever testing strategies that mirror real-world scenarios. By the end, you'll have a complete Python-based evaluation suite that you can develop and validate for free, then seamlessly adapt when you're ready to test against production APIs.
Before diving into complex evaluation logic, let's build a foundation that mimics real LLM interactions without making actual API calls. This approach lets you develop and test your evaluation framework with complete control over responses, ensuring your tests are both reproducible and free.
The key to effective mock testing is creating responses that closely resemble what real LLMs produce. Start by identifying common response patterns from your target LLM: successful completions, refusals, structured outputs like JSON, and edge cases such as truncated responses or unexpected formats. Document these patterns, as they'll form the basis of your test scenarios.
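For example, you might capture that catalog as a simple dictionary your tests can draw from later; the category names and sample responses below are illustrative placeholders, not an exhaustive taxonomy:
# Hypothetical catalog of response patterns worth covering in tests.
# Category names and sample texts are placeholders for your own.
RESPONSE_PATTERNS = {
    "completion": "The capital of France is Paris.",
    "refusal": "I can't help with that request.",
    "structured_json": '{"answer": "Paris", "confidence": 0.97}',
    "truncated": "The three main causes are: 1) rising temperatures, 2)",
    "unexpected_format": "ANSWER>>Paris<<END",
}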
Python's pytest fixtures provide a powerful testing foundation for creating reusable mock components. Fixtures allow you to define mock responses that can be shared across multiple tests, ensuring consistent test data and reducing duplication. This approach keeps your tests DRY (Don't Repeat Yourself) and maintainable, while giving you complete control over response behavior and test scenarios.
Here's a basic mock LLM class that simulates OpenAI's API structure:
import pytest
class MockLLM:
def __init__(self, responses=None):
self.responses = responses or {}
self.call_count = 0
self.history = []
def complete(self, prompt, **kwargs):
self.history.append({"prompt": prompt, "kwargs": kwargs})
self.call_count += 1
if prompt in self.responses:
return self.responses[prompt]
return {
"choices": [{
"message": {
"content": f"Mock response for: {prompt[:50]}..."
}
}],
"usage": {"total_tokens": 100}
}
@pytest.fixture
def mock_llm():
return MockLLM({
"test prompt": {"choices": [{"message": {"content": "Expected response"}}]},
"json prompt": {"choices": [{"message": {"content": '{"key": "value"}'}}]}
})
def test_llm_evaluation(mock_llm):
response = mock_llm.complete("test prompt")
assert response["choices"][0]["message"]["content"] == "Expected response"
assert mock_llm.call_count == 1
This structure provides everything you need: response control, call tracking, and test isolation. Your evaluations run instantly, cost nothing, and remain completely predictable.
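If you also want to rehearse how your code reacts when the API itself misbehaves, one optional extension is a mock that fails on demand. The sketch below is an assumption layered on top of the class above: the class name, the fail_every argument, and the use of RuntimeError are all illustrative choices.
# Optional extension: a mock that simulates an API failure on every Nth call.
class FlakyMockLLM(MockLLM):
    def __init__(self, responses=None, fail_every=None):
        super().__init__(responses)
        self.fail_every = fail_every

    def complete(self, prompt, **kwargs):
        if self.fail_every and (self.call_count + 1) % self.fail_every == 0:
            self.call_count += 1
            raise RuntimeError("Simulated API failure (e.g., rate limit)")
        return super().complete(prompt, **kwargs)

def test_handles_simulated_failure():
    llm = FlakyMockLLM(fail_every=2)
    llm.complete("first call succeeds")
    with pytest.raises(RuntimeError):
        llm.complete("second call fails")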
Now that we have our mock environment, let's construct a comprehensive evaluation framework that validates LLM outputs against your quality criteria. This framework will work identically whether you're testing mock responses or real API calls, making it a valuable investment in your testing infrastructure.
Start by identifying the core metrics that matter for your use case. Response format validation ensures outputs match expected patterns, whether that's markdown formatting, specific delimiters, or structured templates. Output length constraints verify responses stay within token or character limits, critical for downstream systems with size restrictions. JSON schema compliance validates structured outputs against predefined schemas, essential for LLMs generating API responses or configuration files. Keyword presence and absence checking ensures required information appears while sensitive or prohibited content stays out.
The key to maintainable evaluation code is creating modular, reusable functions that each handle a single metric. Let's build an evaluation class that implements these core metrics step by step.
First, we'll set up our base class with the necessary imports:
import json
import re
from typing import Dict, List, Any
from jsonschema import validate, ValidationError
class LLMEvaluator:
def __init__(self):
self.results = []
Format validation is often your first line of defense against malformed responses. This method uses regex patterns to ensure responses follow expected structures, such as numbered lists, markdown headers, or custom templates:
def evaluate_format(self, response: str, pattern: str) -> Dict[str, Any]:
try:
matches = bool(re.match(pattern, response, re.DOTALL))
return {
"metric": "format_validation",
"passed": matches,
"details": f"Pattern {'matched' if matches else 'failed'}"
}
except re.error as e:
return {
"metric": "format_validation",
"passed": False,
"details": f"Invalid pattern: {e}"
}
Length constraints prevent responses from being too verbose or too brief. This is especially important when integrating with systems that have strict character or token limits:
def evaluate_length(self, response: str, min_length: int = 0,
max_length: int = float('inf')) -> Dict[str, Any]:
length = len(response)
passed = min_length <= length <= max_length
return {
"metric": "length_constraint",
"passed": passed,
"actual": length,
"expected_range": f"{min_length}-{max_length}",
"details": f"Length {length} {'within' if passed else 'outside'} range"
}
For structured data outputs, JSON schema validation ensures your LLM generates properly formatted data that downstream systems can consume. This method handles both JSON parsing errors and schema validation failures:
def evaluate_json_schema(self, response: str, schema: Dict) -> Dict[str, Any]:
try:
data = json.loads(response)
validate(instance=data, schema=schema)
return {
"metric": "json_schema",
"passed": True,
"details": "Valid JSON matching schema"
}
except json.JSONDecodeError as e:
return {
"metric": "json_schema",
"passed": False,
"details": f"Invalid JSON: {e}"
}
except ValidationError as e:
return {
"metric": "json_schema",
"passed": False,
"details": f"Schema validation failed: {e.message}"
}
Keyword checking helps ensure responses contain necessary information while avoiding prohibited content. This is crucial for compliance, safety, and quality assurance:
def evaluate_keywords(self, response: str, required: List[str] = None,
forbidden: List[str] = None) -> Dict[str, Any]:
required = required or []
forbidden = forbidden or []
response_lower = response.lower()
missing = [kw for kw in required if kw.lower() not in response_lower]
found_forbidden = [kw for kw in forbidden if kw.lower() in response_lower]
passed = len(missing) == 0 and len(found_forbidden) == 0
return {
"metric": "keyword_check",
"passed": passed,
"missing_required": missing,
"found_forbidden": found_forbidden,
"details": f"Missing: {missing}, Forbidden found: {found_forbidden}"
}
Finally, we need an orchestrator method that runs multiple evaluations in sequence. This method accepts a configuration list, allowing you to mix and match evaluations based on your specific test requirements:
def run_evaluation(self, response: str, evaluations: List[Dict]) -> Dict[str, Any]:
results = []
all_passed = True
for eval_config in evaluations:
metric_type = eval_config.get("type")
params = eval_config.get("params", {})
if metric_type == "format":
result = self.evaluate_format(response, **params)
elif metric_type == "length":
result = self.evaluate_length(response, **params)
elif metric_type == "json_schema":
result = self.evaluate_json_schema(response, **params)
elif metric_type == "keywords":
result = self.evaluate_keywords(response, **params)
else:
result = {"metric": metric_type, "passed": False,
"details": "Unknown metric type"}
results.append(result)
all_passed = all_passed and result["passed"]
return {
"all_passed": all_passed,
"results": results,
"summary": f"{sum(r['passed'] for r in results)}/{len(results)} passed"
}
This framework provides immediate feedback on response quality without requiring any API calls. Each evaluation returns detailed results, making debugging straightforward when tests fail. The modular design means you can easily add custom metrics specific to your domain while maintaining the same clean interface.
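For example, suppose your domain caps how many sentences an answer may contain. Here's a minimal sketch of a custom metric, written as a method you could add to LLMEvaluator; the sentence-count rule and the naive regex split are illustrations, not part of the framework above.
def evaluate_sentence_count(self, response: str, max_sentences: int = 5) -> Dict[str, Any]:
    # Naive sentence split on terminal punctuation; good enough for a sketch.
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    passed = len(sentences) <= max_sentences
    return {
        "metric": "sentence_count",
        "passed": passed,
        "actual": len(sentences),
        "details": f"{len(sentences)} sentences (limit {max_sentences})"
    }
To make the new metric reachable from run_evaluation, you would also add a matching elif branch for a "sentence_count" metric type.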
Testing against a single happy path won't prepare your evaluation framework for production. Real LLMs exhibit varied behaviors: they might refuse requests, produce inconsistent formatting, hit token limits, or return malformed JSON. By simulating these patterns locally, you can ensure your evaluation logic handles anything an actual LLM might produce.
Start by creating test datasets that mirror your actual use cases. Collect example prompts, categorize them by type, and document expected response characteristics for each category.
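A lightweight way to organize that dataset is a mapping from category to sample prompts and expected characteristics; the categories, prompts, and thresholds below are placeholders for your own use cases:
# Hypothetical prompt catalog; all values are placeholders.
TEST_PROMPTS = {
    "summarization": {
        "prompts": ["Summarize the following article in three sentences: ..."],
        "expectations": {"min_length": 50, "required_keywords": ["main points"]},
    },
    "structured_output": {
        "prompts": ["Return the user record as JSON with keys id and name."],
        "expectations": {"json_schema": {"type": "object", "required": ["id", "name"]}},
    },
}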
Let's build a test data generator that produces different response patterns:
class TestDataGenerator:
@staticmethod
def generate_response(response_type: str) -> str:
responses = {
# Success cases
"summary": "The article discusses three main points. First, temperatures have risen. Second, human activities are the cause. Third, immediate action is required.",
"json": '{"status": "success", "data": {"id": 123}, "message": "Done"}',
"list": "1. Initialize\n2. Configure\n3. Test\n4. Deploy",
# Failure cases
"refusal": "I cannot provide information on that topic.",
"truncated": "The analysis shows that... [Output truncated]",
"malformed_json": '{"status": "success", "data": {"id": 123',
"empty": "",
# Edge cases
"special_chars": "Response with @#$% and émojis 🚀",
"excessive_whitespace": "Text with\n\n\nirregular spacing",
"boundary_length": "x" * 3999
}
return responses.get(response_type, "Default response")
Now let's create tests that use our generator with the evaluation framework:
import pytest
class TestLLMEvaluations:
@pytest.fixture
def evaluator(self):
return LLMEvaluator()
@pytest.fixture
def generator(self):
return TestDataGenerator()
def test_json_validation(self, evaluator, generator):
# Test valid JSON
valid = generator.generate_response("json")
result = evaluator.evaluate_json_schema(
valid,
{"type": "object", "required": ["status"]}
)
assert result["passed"] == True
# Test malformed JSON
invalid = generator.generate_response("malformed_json")
result = evaluator.evaluate_json_schema(
invalid,
{"type": "object"}
)
assert result["passed"] == False
assert "Invalid JSON" in result["details"]
Parametrized tests let you efficiently validate multiple scenarios:
@pytest.mark.parametrize("response_type,min_len,max_len,should_pass", [
("summary", 50, 500, True),
("json", 20, 200, True),
("empty", 1, 100, False),
("boundary_length", 100, 4000, True),
])
def test_length_constraints(self, evaluator, generator,
response_type, min_len, max_len, should_pass):
response = generator.generate_response(response_type)
result = evaluator.evaluate_length(response, min_len, max_len)
assert result["passed"] == should_pass
Test edge cases to ensure robustness:
@pytest.mark.parametrize("edge_type", [
"special_chars", "excessive_whitespace", "boundary_length"
])
def test_edge_case_handling(self, evaluator, generator, edge_type):
response = generator.generate_response(edge_type)
# Should not crash on any input
evaluations = [
{"type": "length", "params": {"min_length": 1}},
{"type": "format", "params": {"pattern": ".*"}}
]
results = evaluator.run_evaluation(response, evaluations)
assert "results" in results
assert len(results["results"]) == 2
Finally, create an integration test that simulates a complete evaluation workflow:
def test_complete_workflow(self, evaluator, generator):
test_cases = [
("summary", [
{"type": "length", "params": {"min_length": 50, "max_length": 500}},
{"type": "keywords", "params": {"required": ["main points"]}}
]),
("json", [
{"type": "json_schema", "params": {"schema": {"type": "object"}}},
{"type": "keywords", "params": {"forbidden": ["error"]}}
])
]
for response_type, evaluations in test_cases:
response = generator.generate_response(response_type)
results = evaluator.run_evaluation(response, evaluations)
# Verify structure
assert "all_passed" in results
assert "summary" in results
assert len(results["results"]) == len(evaluations)
This streamlined test suite covers the essential scenarios without overwhelming complexity. By testing against these patterns before connecting to actual APIs, you'll catch evaluation bugs early and build confidence in your system's production readiness.
While mocking is excellent for development, testing against actual models reveals real-world behavior. Open-source local models offer a middle ground: you get authentic LLM responses without API costs or rate limits.
Ollama provides the simplest path to running local models. Install it from ollama.com, then pull a model:
ollama pull phi4-mini
ollama serve
Ollama exposes an OpenAI-compatible API at localhost:11434, making integration trivial.
Several compact models run efficiently on consumer hardware, including phi4-mini (used throughout this guide's examples), Llama 3.2, Gemma 2, and Mistral 7B.
These models run comfortably on modern laptops with 16GB RAM. Even the larger 14B parameter models respond in seconds rather than the milliseconds of mocks, making them practical for development workflows.
Mocks validate your integration logic; local models validate your prompts. A mock might return "Paris" to "What's the capital of France?" but won't tell you if your prompt engineering actually works. Local models expose issues like ambiguous instructions, prompts that produce inconsistently formatted output from run to run, and responses that drift away from your requested structure.
Use mocks for CI/CD pipelines and rapid iteration. Use local models for prompt refinement and integration testing before deploying to production APIs. The following snippet points the official OpenAI Python client at Ollama's endpoint and batch-tests a few prompts:
from openai import OpenAI
# Point to Ollama's OpenAI-compatible endpoint
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Ollama doesn't require real keys
)
def query_local_model(prompt: str, model: str = "phi4-mini") -> str:
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0.7,
max_tokens=500
)
return response.choices[0].message.content
# Batch testing multiple prompts
test_prompts = [
"What is Python's GIL?",
"Explain async/await briefly.",
"What's the difference between lists and tuples?"
]
for prompt in test_prompts:
print(f"\nQ: {prompt}")
print(f"A: {query_local_model(prompt)}")
This setup lets you test hundreds of prompts locally before committing to expensive API calls, catching issues early while maintaining realistic LLM behavior.
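You can also push those local responses straight through the evaluation framework from earlier, so the same checks run whether an answer came from a mock, a local model, or a paid API. A minimal sketch, reusing LLMEvaluator and query_local_model as defined above:
# Evaluate a local model's answer with the same framework used for mocks.
evaluator = LLMEvaluator()
answer = query_local_model("List three differences between lists and tuples.")
report = evaluator.run_evaluation(answer, [
    {"type": "length", "params": {"min_length": 20, "max_length": 1000}},
    {"type": "keywords", "params": {"required": ["tuple"], "forbidden": ["I cannot"]}},
])
print(report["summary"])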
Your evaluation framework might run perfectly but still produce incorrect results. Before trusting it with real LLM outputs, you need to validate that your evaluation logic actually measures what you intend it to measure.
The most effective validation strategy uses "known good" and "known bad" test cases with predetermined outcomes. Create response pairs where you absolutely know which should pass and which should fail. This acts as your ground truth for validating evaluation logic.
Cross-validation adds another layer of confidence. Run the same response through multiple related evaluations to ensure consistency. If your JSON validator passes but your format validator fails on valid JSON, you've found a logic error.
Start by defining test cases with predetermined outcomes. Each test case includes a response, evaluation configuration, and whether it must pass:
class TestEvaluationValidation:
    @pytest.fixture
    def evaluator(self):
        # Fixtures defined inside TestLLMEvaluations aren't visible to this class,
        # so define an evaluator fixture here as well (or move it to conftest.py).
        return LLMEvaluator()

    def test_known_good_bad_pairs(self, evaluator):
test_pairs = [
# (response, evaluation_config, must_pass)
('{"valid": "json"}',
{"type": "json_schema", "params": {"schema": {"type": "object"}}},
True),
('invalid json',
{"type": "json_schema", "params": {"schema": {"type": "object"}}},
False),
('Contains required keyword',
{"type": "keywords", "params": {"required": ["required"]}},
True),
('Missing keyword',
{"type": "keywords", "params": {"required": ["required"]}},
False),
]
This structure pairs each response with its expected evaluation outcome. Valid JSON must pass JSON validation; invalid JSON must fail. Keywords present must pass keyword checks; missing keywords must fail.
For each test pair, run the evaluation and assert the result matches expectations:
for response, config, must_pass in test_pairs:
result = evaluator.run_evaluation(response, [config])
assert result["all_passed"] == must_pass, \
f"Evaluation logic error: {config['type']} gave wrong result"
If any assertion fails, your evaluation logic has a bug. The error message identifies which evaluation type produced incorrect results, pointing you directly to the faulty logic.
Related evaluations should agree on valid responses. Test this by running the same response through multiple evaluation types:
def test_cross_validation_consistency(self, evaluator):
json_response = '{"name": "test", "count": 5}'
# These should all pass for valid JSON
evaluations = [
{"type": "json_schema", "params": {"schema": {"type": "object"}}},
{"type": "format", "params": {"pattern": r"^\{.*\}$"}},
{"type": "keywords", "params": {"required": ["name", "count"]}}
]
Here we define a valid JSON response and three different ways to validate it: schema validation, format pattern matching, and keyword presence checking.
Run all evaluations and verify they agree the response is valid:
results = evaluator.run_evaluation(json_response, evaluations)
assert results["all_passed"], "Valid JSON failed cross-validation"
# All should agree the response is valid
assert all(r["passed"] for r in results["results"])
If cross-validation fails, one of your evaluators has incorrect logic. Valid JSON should pass JSON schema validation, match the JSON format pattern, and contain the specified keywords. Disagreement indicates a bug.
This validation approach ensures your evaluations produce reliable results before you invest in API testing.
The beauty of this mock-based approach is how seamlessly it transitions to real API usage. When you're ready to test against actual LLMs, you'll only need to modify a single component: your mock LLM class.
Swap your MockLLM for a production class whose complete() method makes real API calls while returning the same response structure:
import openai # or anthropic, cohere, etc.
class ProductionLLM:
def __init__(self, api_key):
self.client = openai.OpenAI(api_key=api_key)
self.call_count = 0
self.history = []
def complete(self, prompt, **kwargs):
self.call_count += 1
self.history.append({"prompt": prompt, "kwargs": kwargs})
response = self.client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": prompt}],
**kwargs
)
return response.model_dump() # Returns same structure as mock
Your evaluation framework remains completely unchanged. All tests, evaluations, and validation logic work identically.
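One convenient way to flip between the two is a pytest fixture keyed off an environment variable; the variable name USE_REAL_LLM and the OPENAI_API_KEY lookup are assumptions you can adjust to your setup:
import os
import pytest

@pytest.fixture
def llm():
    # Assumed convention: set USE_REAL_LLM=1 to hit the real API, otherwise mock.
    if os.getenv("USE_REAL_LLM") == "1":
        return ProductionLLM(api_key=os.environ["OPENAI_API_KEY"])
    return MockLLM()

def test_response_quality(llm):
    # The same test body runs against either backend.
    response = llm.complete("Summarize the benefits of unit testing.")
    content = response["choices"][0]["message"]["content"]
    assert len(content) > 0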
For cost estimation, track token usage during mock testing. Most providers charge $0.001-0.01 per 1K tokens. If your test suite uses 100K tokens, expect $0.10-1.00 per run. Start with a small subset of critical tests, monitor costs, then gradually expand coverage as budget allows.
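A rough sketch of that tracking, reading the usage field the mock already returns (the per-1K-token price is a placeholder to replace with your provider's actual rate):
# Rough cost projection from accumulated mock token counts; price is a placeholder.
PRICE_PER_1K_TOKENS = 0.002

def estimate_cost(llm, prompts):
    total_tokens = 0
    for prompt in prompts:
        response = llm.complete(prompt)
        total_tokens += response.get("usage", {}).get("total_tokens", 0)
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

llm = MockLLM()
cost = estimate_cost(llm, ["prompt one", "prompt two"])
print(f"Estimated cost per run: ${cost:.4f}")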
You now have a complete framework for evaluating LLMs without spending a cent on API calls. By combining mock responses, comprehensive evaluation metrics, and thorough test scenarios, you can develop and validate your evaluation logic with confidence. When you're ready for production, simply swap your mock for real API calls and your entire framework continues working seamlessly.
Remember these key practices: start with mock testing to iterate quickly, validate your evaluation logic with known good/bad cases, and simulate edge cases before they surprise you in production.
All complete code examples from this post are available on our GitHub page, including additional evaluation metrics, test scenarios, and helper utilities. Happy testing!