Every QA engineer has encountered test suites that pass perfectly while testing with nonsensical data: orders delivered before they were placed, three-year-old customers, or $500 shipping for a $1 item. Traditional validators check if your JSON is syntactically correct, but they can't tell if your test data makes business sense. In this post, we'll build a Python semantic validator using Claude AI and LangChain that catches these logical inconsistencies automatically. With just 150 lines of code, you'll add a "common sense" layer to your test data validation pipeline.
Rule-based evaluations apply predefined criteria to assess whether data meets specific business logic requirements, going beyond structural correctness to examine semantic validity. While schema validation asks "Is this valid JSON with the right field types?", rule-based evaluation asks "Does this data make logical sense for our business domain?"
Consider an e-commerce order fixture. Schema validation verifies that order_date is a valid datetime string and customer_age is an integer. It passes perfectly when order_date is "2024-12-01" and customer_age is 3. Rule-based evaluation, however, applies business logic: customers should be at least 13 years old to place orders, and delivery dates must occur after order dates. These rules catch the semantic issues that schema validation misses.
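To make that gap concrete, here is a minimal sketch (the Order model and field names are illustrative, using Pydantic, which we install in the next section) showing schema-level validation happily accepting exactly this kind of nonsense:

from datetime import date
from pydantic import BaseModel

class Order(BaseModel):  # hypothetical schema for the fixture described above
    order_date: date
    delivery_date: date
    customer_age: int

# The types check out, so schema validation passes...
order = Order(order_date="2024-12-01", delivery_date="2024-11-15", customer_age=3)
# ...even though delivery precedes the order and the customer is three years old.
print(order.delivery_date < order.order_date, order.customer_age)  # True 3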
Traditional approaches to rule-based validation require developers to manually code every possible check. This becomes overwhelming quickly. How do you anticipate every edge case? What about relationships between fields that only become apparent with specific data combinations? This is where AI transforms the approach.
Why rule-based evaluations excel at test data validation
The integration of AI, specifically Large Language Models like Claude, creates an intelligent middle ground between rigid hardcoded rules and complete flexibility. Instead of writing hundreds of specific validation functions, you define high-level rules in natural language: "shipping costs should be proportional to order value" or "user ages should be realistic." The AI interprets these rules contextually, understanding that a $10 shipping fee makes sense for a $100 order but not for a $0.01 order.
This approach maintains the consistency and reliability of rule-based systems while adding the contextual understanding that typically requires human review. The AI doesn't replace your validation logic; it enhances it with common sense reasoning that would be impractical to code manually. The result is test data that's not just structurally valid but actually represents realistic scenarios your application might encounter in production.
Getting started with the test data validator requires minimal setup. The implementation uses LangChain to orchestrate the AI calls and the Anthropic library to connect with Claude. You'll have everything running in under five minutes.
pip install langchain langchain-anthropic pydantic
These three packages provide everything needed: LangChain for the AI framework, langchain-anthropic for the Claude integration, and Pydantic for data validation models.
You'll need an Anthropic API key from console.anthropic.com. Once you have it, set it as an environment variable:
# Linux/Mac
export ANTHROPIC_API_KEY='your-api-key-here'
# Windows Command Prompt
set ANTHROPIC_API_KEY=your-api-key-here
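Before moving on, you can optionally confirm the key is visible to Python; the underlying Anthropic client reads it from the environment automatically:

import os

# Quick sanity check that the environment variable is set
assert os.environ.get("ANTHROPIC_API_KEY"), "ANTHROPIC_API_KEY is not set"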
The test data validator consists of a single Python class that orchestrates Claude AI to analyze test fixtures for semantic correctness. Let's build it step by step, understanding how each component contributes to catching those subtle data issues that traditional validators miss.
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from pydantic import BaseModel, Field
from typing import Dict, List, Optional
import json

class ValidationResult(BaseModel):
    """Structure for validation results"""
    valid: bool = Field(description="Whether the test data is valid")
    violations: List[str] = Field(default_factory=list, description="List of rule violations found")
    severity: Dict[str, str] = Field(default_factory=dict, description="Severity level for each violation")
    suggestions: List[str] = Field(default_factory=list, description="Suggestions for fixing issues")
The ValidationResult model defines the structure of our validation output using Pydantic. This ensures type safety and gives us a clean interface for handling results. Each validation returns whether the data is valid, what violations were found, their severity levels, and actionable suggestions for fixes.
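To see the shape we'll be working with, here is a hand-built result (the values are made up for illustration; real ones come back from the validator we build next):

example = ValidationResult(
    valid=False,
    violations=["Delivery date precedes order date"],
    severity={"Delivery date precedes order date": "critical"},
    suggestions=["Set delivery_date on or after order_date"],
)
print(example.model_dump())  # Pydantic v2; use example.dict() on v1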
class TestDataValidator:
    """Validates test data for semantic correctness using Claude AI"""

    def __init__(self, model: str = "claude-3-haiku-20240307"):
        """Initialize validator with Claude AI"""
        self.llm = ChatAnthropic(
            model=model,
            temperature=0,  # Deterministic validation
            max_tokens=1000
        )
        self.json_parser = JsonOutputParser()
The validator initializes with Claude's Haiku model by default, which provides the best balance of speed and cost for validation tasks. Setting temperature to 0 ensures consistent, deterministic results across multiple validation runs. The JsonOutputParser handles the conversion of Claude's responses into structured data we can work with programmatically.
    def create_validation_rules(self, data_type: str) -> List[str]:
        """Generate validation rules based on data type"""
        default_rules = {
            "order": [
                "Order date must be before or equal to current date",
                "Delivery date must be after or equal to order date",
                "Customer age should be realistic (13-120 years)",
                "Total price should be greater than 0",
                "Shipping cost should be proportional to order total",
                "Items count should match items array length",
                "Order status progression should be logical"
            ],
            "user": [
                "Age should be between 0 and 150 years",
                "Created date should not be in the future",
                "Email should be valid format if present",
                "Account balance should not be negative unless overdraft",
                "Last login should be after account creation"
            ]
        }
        return default_rules.get(data_type.lower(), [
            "All dates should be logically consistent",
            "Numeric values should be within reasonable ranges",
            "Required relationships between fields should be valid"
        ])
This method returns appropriate validation rules based on the data type. Notice how these rules express business logic in natural language rather than code. The AI interprets these contextually, understanding that "proportional" shipping costs means something different for a $10 order versus a $1000 order. This flexibility would require complex conditional logic in traditional validators.
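As a quick aside, requesting rules for a data type we haven't defined simply falls back to the generic set:

validator = TestDataValidator()
print(validator.create_validation_rules("order")[:2])  # first two order-specific rules
print(validator.create_validation_rules("invoice"))    # unknown type falls back to the generic rules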
    def validate_fixture(self, fixture: Dict, data_type: Optional[str] = None) -> ValidationResult:
        """Validate test data against semantic rules"""
        rules = self.create_validation_rules(data_type) if data_type else []

        prompt = ChatPromptTemplate.from_template("""
You are a QA expert validating test data for semantic correctness.

Test data to validate:
{fixture}

Validation rules to check:
{rules}

Identify violations that could cause false test results. For each issue:
- Mark as "critical" if it will definitely cause test failures
- Mark as "warning" if it might cause unreliable tests

Return JSON in this format:
{{
    "valid": true/false,
    "violations": ["violation 1", "violation 2"],
    "severity": {{"violation 1": "critical", "violation 2": "warning"}},
    "suggestions": ["fix 1", "fix 2"]
}}
""")
The prompt engineering here is crucial. We frame Claude as a QA expert, provide clear context about what we're validating, and specify exactly what output format we need. The severity classification helps teams prioritize which issues to fix first. Critical issues will definitely cause test failures, while warnings indicate potential reliability problems.
        message = prompt.format_messages(
            fixture=json.dumps(fixture, indent=2),
            rules="\n".join([f"- {rule}" for rule in rules])
        )

        response = self.llm.invoke(message)
        result = self.json_parser.parse(response.content)
        return ValidationResult(**result)
The formatted message combines our test data with the validation rules. Claude analyzes the data, identifies violations, and returns structured JSON that we parse into our ValidationResult model. If Claude's response doesn't match our expected format, Pydantic will raise a validation error, which you can catch and handle appropriately in production code.
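If you want to guard that final step, a minimal sketch might look like the following; it assumes langchain-core's OutputParserException and Pydantic's ValidationError, and simply skips fixtures whose responses can't be parsed:

from langchain_core.exceptions import OutputParserException
from pydantic import ValidationError

def safe_validate(validator: TestDataValidator, fixture: Dict, data_type: str) -> Optional[ValidationResult]:
    """Return None instead of raising when Claude's reply can't be parsed into our model."""
    try:
        return validator.validate_fixture(fixture, data_type)
    except (OutputParserException, ValidationError) as exc:
        print(f"Could not parse validation response: {exc}")
        return None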
    def validate_batch(self, fixtures: List[Dict], data_type: str) -> Dict:
        """Validate multiple fixtures and return summary"""
        results = []
        for i, fixture in enumerate(fixtures):
            print(f"Validating fixture {i + 1}/{len(fixtures)}...")
            result = self.validate_fixture(fixture, data_type)
            results.append({
                "index": i,
                "valid": result.valid,
                "violations": result.violations,
                "result": result
            })

        total_valid = sum(1 for r in results if r["valid"])
        total_critical = sum(
            1 for r in results
            for severity in r["result"].severity.values()
            if severity == "critical"
        )

        return {
            "total": len(fixtures),
            "valid": total_valid,
            "invalid": len(fixtures) - total_valid,
            "critical_issues": total_critical,
            "results": results
        }
Batch validation processes multiple fixtures sequentially, collecting results and generating summary statistics. This is particularly useful for validating entire test suites or fixture directories. The summary provides a quick overview of data quality across your test suite.
While our simplified example doesn't include explicit error handling, production implementations should wrap API calls in try/except blocks to handle network issues, API rate limits, or unexpected response formats. Consider implementing exponential backoff for rate limiting and logging failed validations for debugging.
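Here is one minimal retry sketch under those assumptions; the broad except is intentional for brevity, and production code would narrow it to the SDK's rate-limit and connection errors and use real logging:

import time

def validate_with_retry(validator: TestDataValidator, fixture: Dict,
                        data_type: str, max_attempts: int = 3) -> ValidationResult:
    """Retry with exponential backoff: wait 1s, 2s, 4s... between attempts."""
    for attempt in range(max_attempts):
        try:
            return validator.validate_fixture(fixture, data_type)
        except Exception as exc:  # narrow to rate-limit/network errors in production
            if attempt == max_attempts - 1:
                raise
            wait = 2 ** attempt
            print(f"Validation failed ({exc}); retrying in {wait}s...")
            time.sleep(wait)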
Let's walk through a practical example using the validator on a problematic e-commerce order that represents the kind of test data issues QA teams encounter daily. This fixture looks perfectly valid to traditional validators but contains multiple semantic problems that would cause false positives in testing.
suspicious_order = {
    "order_id": "ORD001",
    "order_date": "2024-12-01",
    "delivery_date": "2024-11-15",  # Delivered before ordered!
    "customer_age": 3,              # 3-year-old customer?
    "order_total": 0.01,
    "items_count": 47,              # Claims 47 items
    "items": [{"id": "item1", "price": 0.01}],  # But only has 1
    "shipping_cost": 500.00         # $500 shipping for $0.01?
}
This single fixture contains five distinct categories of issues that commonly plague test data. Each would pass JSON schema validation, yet all would cause unreliable test results. Let's see what our validator catches.
validator = TestDataValidator()
result = validator.validate_fixture(suspicious_order, data_type="order")

print(f"✅ Valid: {result.valid}")
print(f"\n🚨 Violations found ({len(result.violations)}):")
for violation in result.violations:
    severity = result.severity.get(violation, "unknown")
    emoji = "🔴" if severity == "critical" else "🟡"
    print(f" {emoji} {violation}")
Let's examine each category of issue the validator identifies and why they matter for test reliability.
Temporal inconsistencies: The validator immediately flags that the delivery date (2024-11-15) occurs before the order date (2024-12-01). This temporal impossibility would likely cause your application's business logic to behave unexpectedly. Tests using this data might pass because the code handles the dates, but they wouldn't be testing realistic scenarios. In production, such data would never exist, making these tests effectively worthless for catching real bugs.
Business rule violations: The three-year-old customer represents a clear business rule violation. Most e-commerce platforms require customers to be at least 13 or 18 years old for legal reasons. Tests running with this data might validate that the age field accepts integers, but they fail to test whether your application properly enforces age restrictions. The validator marks this as critical because any test results using this fixture would be misleading about your application's compliance with business rules.
Statistical anomalies: The shipping cost of $500 for a one-cent order is statistically absurd. While technically possible in the real world for special circumstances, using such anomalous data in tests means you're not testing normal application behavior. The validator identifies this disproportionate relationship and marks it as a warning. Tests might pass, but they wouldn't reflect how your application handles typical orders where shipping costs range from 5% to 20% of the order total.
Missing correlations: Although not explicitly flagged in this output, the validator's rules check for correlations between fields. The fixture claims 47 items (items_count: 47) but the items array contains only one element. This mismatch between correlated fields is exactly the kind of issue that causes tests to pass while testing impossible states. Your application might handle the count field and the array separately, leading to tests that verify each in isolation but miss integration bugs.
Edge cases that are too edgy: The one-cent order total represents an edge case pushed to an unrealistic extreme. While testing edge cases is important, test data should still represent plausible scenarios. An order total of $0.01 might technically pass validation, but it's so far outside normal business operations that tests using this data provide little value. Real edge cases might be orders of $1 or $5, not fractions of a cent.
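For contrast, here is one illustrative way the same fixture might look after the issues are addressed (the values are invented but plausible):

realistic_order = {
    "order_id": "ORD001",
    "order_date": "2024-12-01",
    "delivery_date": "2024-12-05",  # delivery follows the order
    "customer_age": 34,             # a plausible adult customer
    "order_total": 42.50,
    "items_count": 2,               # count matches the items array
    "items": [{"id": "item1", "price": 19.99}, {"id": "item2", "price": 22.51}],
    "shipping_cost": 5.99           # roughly 14% of the order total
}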
Batch validation reveals patterns: running the validator across several fixtures at once shows how widespread these problems tend to be:
# test_orders is a list of order fixtures, e.g., suspicious_order plus others
batch_results = validator.validate_batch(test_orders, "order")

print("Results Summary:")
print(f"  Total fixtures: {batch_results['total']}")
print(f"  Invalid: {batch_results['invalid']}")
print(f"  Critical issues: {batch_results['critical_issues']}")
The summary shows that these aren't isolated incidents. Invalid test data tends to cluster, suggesting systematic issues in how test fixtures are created or maintained. Teams often copy and modify existing fixtures, propagating these semantic issues throughout their test suites.
The validator transforms these hidden problems into visible, actionable issues. Instead of discovering these problems through failed deployments or flaky tests, teams can proactively clean their test data. The result is more reliable tests that actually validate realistic application behavior rather than impossible edge cases that would never occur in production.
The power of semantic validation lies in its adaptability to your specific business context. While the default rules catch common issues, real value emerges when you tailor validation to your domain's unique constraints and business logic.
Effective validation rules balance specificity with flexibility. Instead of writing "age must be between 0 and 150," write "user age should reflect realistic demographics for our platform's target audience." This gives Claude context to understand that a gaming platform might expect ages 13 to 65, while a retirement planning app expects 50 to 90. The AI interprets these rules based on the actual data patterns it sees.
Frame rules as business requirements rather than technical constraints. "Payment amounts should align with our pricing tiers" works better than "payment must be 9.99, 19.99, or 29.99" because it allows for promotional pricing, regional variations, and special offers without requiring constant rule updates.
Fintech applications need rules that reflect regulatory compliance and financial logic. Consider validation rules like "transaction timestamps must follow market hours for stock trades," "account balances should reconcile with transaction history," and "risk scores should correlate with portfolio composition." These rules catch test data that might cause compliance violations or trading logic errors.
Healthcare systems require validation that respects medical logic and privacy constraints. Rules might include "prescription dates must fall within valid physician-patient relationship timeframes," "dosage amounts should align with standard medical practices for patient age and weight," and "diagnosis codes should be compatible with prescribed treatments." This prevents tests from validating impossible medical scenarios.
SaaS platforms benefit from rules about user behavior and subscription logic. Examples include "trial end dates should occur after trial start dates," "feature access should match subscription tier," and "usage metrics should reflect realistic engagement patterns." These rules ensure tests validate actual user journeys rather than impossible account states.
Organize your validation rules into reusable modules by domain area. Structure your rules library as a Python dictionary or YAML file that maps data types to their validation criteria:
VALIDATION_RULES = {
    "financial_transaction": [
        "Transaction amounts should match currency precision rules",
        "Timestamps should follow chronological order in transaction chains",
        "Account balances should never violate regulatory minimums"
    ],
    "user_subscription": [
        "Subscription end dates should occur after start dates",
        "Plan features should match the subscription tier",
        "Payment history should align with billing cycles"
    ],
    "medical_record": [
        "Patient age should be consistent across all visit records",
        "Medication dosages should be appropriate for patient demographics",
        "Lab results should fall within medically possible ranges"
    ]
}
This approach allows teams to share validation rules across projects and maintain consistency in how different test suites validate similar data types. New team members can reference the rules library to understand domain constraints without reading through extensive documentation.
Treat validation rules as code and commit them to version control alongside your tests. Create a dedicated directory like test_validation/rules/ in your repository. This practice provides several benefits: you can track how validation requirements evolve over time, review rule changes during code reviews, and roll back problematic rule updates if they generate false positives.
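One possible way to consume such a directory at validation time, assuming one JSON file of rules per data type under test_validation/rules/ (the file format and helper are illustrative, not part of the validator itself):

import json
from pathlib import Path
from typing import List

RULES_DIR = Path("test_validation/rules")  # assumed location of version-controlled rule files

def load_rules(data_type: str, validator: TestDataValidator) -> List[str]:
    """Prefer version-controlled rule files; fall back to the validator's built-in defaults."""
    rule_file = RULES_DIR / f"{data_type}.json"
    if rule_file.exists():
        return json.loads(rule_file.read_text())
    return validator.create_validation_rules(data_type)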
Document why each rule exists in comments. A rule like "order totals should not exceed $50,000" becomes much more valuable when accompanied by "# Exceeds our fraud detection threshold and requires manual review in production." This context helps future maintainers understand whether rules still apply as business requirements change.
Tag rule versions when deploying to different environments. Your staging environment might enforce stricter validation than development, catching edge cases before they reach production tests. Use semantic versioning for your rules library, incrementing versions when you add new rules (minor version) or change existing rule interpretations (major version).
Semantic validation adds intelligence to your test suite, but unchecked API calls can become expensive and slow. Smart optimization strategies keep validation practical for daily development workflows while maintaining data quality.
Implement content-based caching to avoid revalidating unchanged fixtures. Generate a hash of your test data and store validation results keyed by that hash. When the same fixture appears again, return the cached result immediately instead of making another API call.
import hashlib
import json

class CachedValidator(TestDataValidator):
    def __init__(self, model: str = "claude-3-haiku-20240307"):
        super().__init__(model)
        self.cache = {}

    def validate_fixture(self, fixture: Dict, data_type: Optional[str] = None) -> ValidationResult:
        # Key on both the fixture contents and the data type so the same fixture
        # validated under a different rule set doesn't return a stale result
        fixture_hash = hashlib.md5(
            json.dumps(fixture, sort_keys=True).encode()
        ).hexdigest()
        cache_key = (data_type, fixture_hash)

        if cache_key in self.cache:
            return self.cache[cache_key]

        result = super().validate_fixture(fixture, data_type)
        self.cache[cache_key] = result
        return result
This simple addition can reduce API costs by 70% or more in typical test suites where fixtures change infrequently. The cache persists for the validation session, eliminating duplicate calls when the same fixture appears in multiple test files.
Process fixtures in parallel when validating large test suites. The validator already includes a batch method, but you can enhance it with concurrent processing using Python's concurrent.futures:
from concurrent.futures import ThreadPoolExecutor, as_completed

    def validate_batch_parallel(self, fixtures: List[Dict], data_type: str, max_workers: int = 5) -> Dict:
        """Validate fixtures concurrently; returns the same summary shape as validate_batch"""
        results = []
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            future_to_idx = {
                executor.submit(self.validate_fixture, fixture, data_type): i
                for i, fixture in enumerate(fixtures)
            }
            for future in as_completed(future_to_idx):
                idx = future_to_idx[future]
                result = future.result()
                results.append({"index": idx, "valid": result.valid,
                                "violations": result.violations, "result": result})
        results.sort(key=lambda r: r["index"])  # futures complete out of order
        total_valid = sum(1 for r in results if r["valid"])
        total_critical = sum(1 for r in results for s in r["result"].severity.values() if s == "critical")
        return {"total": len(fixtures), "valid": total_valid, "invalid": len(fixtures) - total_valid,
                "critical_issues": total_critical, "results": results}
Parallel processing reduces wall clock time significantly. Validating 100 fixtures that each take 2 seconds drops from over 3 minutes sequentially to roughly 40 seconds with 5 workers. Keep max_workers around 5 to 10 to respect API rate limits while gaining substantial speed improvements.
Not every test run needs full validation. Integrate validation strategically into your development workflow to balance thoroughness with speed.
Run validation during fixture creation or modification. Set up a pre-commit hook that validates only the fixtures changed in the current commit. This catches issues immediately when developers create or update test data, providing instant feedback without slowing down test execution.
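One way to wire that up is a small script run from your pre-commit hook; the fixture directory, JSON-per-fixture layout, and the hard-coded data type here are all assumptions about your project:

import json
import subprocess
import sys
from pathlib import Path

def changed_fixture_paths() -> list:
    """Staged .json fixtures in the current commit (assumes fixtures live under tests/fixtures/)."""
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return [Path(p) for p in staged if p.startswith("tests/fixtures/") and p.endswith(".json")]

def main() -> int:
    validator = TestDataValidator()  # the class built earlier, imported from your project
    failures = 0
    for path in changed_fixture_paths():
        result = validator.validate_fixture(json.loads(path.read_text()), data_type="order")
        if not result.valid:
            print(f"{path}: {result.violations}")
            failures += 1
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())

A non-zero exit code blocks the commit, so invalid fixtures never make it into the repository in the first place.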
Schedule nightly validation of your entire fixture library. This comprehensive check ensures no semantic drift occurs as your application evolves. Business rules change, and fixtures that were valid six months ago might violate new constraints. Nightly runs catch these issues before they cause confusing test failures.
Validate in CI/CD pipelines before deploying to staging environments. This gate prevents invalid test data from reaching shared environments where it might cause flaky tests that block other developers. The validation runs once per deployment rather than on every local test run.
Understanding the economics helps you optimize wisely. Claude Haiku costs approximately $0.25 per million input tokens and $1.25 per million output tokens. A typical validation request uses about 500 input tokens (your fixture plus rules) and receives 300 output tokens (the validation result).
Breaking down the math: validating one fixture costs roughly $0.000125 for input and $0.000375 for output, totaling about $0.0005 per validation. For 1,000 fixtures, you spend around $0.50. Even large test suites with 10,000 fixtures cost only $5 to validate completely.
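The same arithmetic as a small helper, using the prices quoted above (treat them as assumptions and check current pricing before budgeting):

def estimate_validation_cost(num_fixtures: int, input_tokens: int = 500, output_tokens: int = 300,
                             input_price: float = 0.25, output_price: float = 1.25) -> float:
    """Rough cost in dollars; prices are per million tokens, token counts are per fixture."""
    per_fixture = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return num_fixtures * per_fixture

print(estimate_validation_cost(1_000))   # ~0.50
print(estimate_validation_cost(10_000))  # ~5.00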
Compare this to the cost of debugging a production issue caused by invalid test data. A single incident might consume hours of engineering time worth hundreds or thousands of dollars. The validation cost becomes negligible insurance against much larger potential losses.
Switching to Claude Sonnet improves reasoning quality but costs substantially more per validation at current per-token prices (roughly an order of magnitude over Haiku). For most test data validation, Haiku provides the optimal balance. Reserve Sonnet for validating critical production data imports or complex multi-entity relationships where the extra reasoning capability justifies the cost.
Semantic validation transforms test data from a potential liability into a reliable foundation for quality assurance. By adding an AI-powered layer of common sense to your validation pipeline, you catch the subtle logical inconsistencies that traditional schema validators miss entirely.
The 150 lines of code we've built deliver immediate practical value. Orders no longer deliver before they're placed in your test suites. Customer ages stay realistic. Shipping costs make sense. These improvements translate directly into more reliable tests that actually validate production scenarios rather than impossible edge cases.
Start small with a single test suite or fixture directory. Validate your most critical test data first, where semantic issues cause the most pain. As you see results, expand validation to other areas of your testing infrastructure. The cached, parallel approach keeps costs under a few dollars even for large test suites.
The real power emerges when you customize rules for your specific domain. Generic validators can't understand your business logic, but with natural language rules tailored to your constraints, the AI becomes a QA team member that never gets tired of checking whether your test data makes sense.
Your test suite is only as good as the data it tests with. Make sure that data reflects reality, and your tests will catch real bugs instead of passing with nonsense. The complete code example is available on our GitHub page, the setup takes five minutes, and your next deploy will be more confident because of it.