Beyond Tool Correctness: Building Custom Efficiency Metrics for DeepEval

Dec 28th 2025 16 min read
medium
ai/ml
python3.13.5
deepeval3.7.5
pytest9.0.2

Your LLM agent just passed all its tests. It selected a tool, completed the task, and returned the right answer. Success, right? Not quite. What the tests didn't catch is that your agent used a premium API call when a local calculation would've sufficed, or made three tool calls when one would've done the job. In production, these "technically correct" decisions compound into real costs in both dollars and user experience. DeepEval's ToolCorrectnessMetric does an excellent job of validating whether your agent can choose functionally suitable tools, but it doesn't tell us if those choices are optimal. It's time to extend our testing beyond correctness and start measuring the efficiency and quality of our agent's decision making.

What DeepEval's Tool Correctness Gets Right

DeepEval's ToolCorrectnessMetric addresses one of the most critical challenges in LLM agent testing: validating that your agent selects functionally appropriate tools for a given task. At its core, the metric compares the tools your agent actually invokes against a set of expected tools you define in your test cases. When a user asks "What's the weather in Seattle?", you expect your agent to call a weather API, not a calculator or database query tool. The ToolCorrectnessMetric catches these fundamental mismatches.

What makes this metric particularly useful is its flexibility in handling real-world complexity. It doesn't demand perfect one-to-one matching. You can specify multiple acceptable tools for scenarios where different approaches are valid, and it scores based on suitability rather than enforcing rigid expectations. The threshold parameter lets you define how strict you want to be, accommodating cases where partial tool overlap might still indicate acceptable behavior.

This functional validation is absolutely essential. Before worrying about whether your agent is choosing the best tool, you need to know it's choosing tools that can actually accomplish the task. An agent that tries to use a text search tool to perform mathematical calculations is fundamentally broken, regardless of cost or performance considerations. DeepEval's approach correctly prioritizes this baseline: can your agent map task requirements to tool capabilities?

But here's where production reality introduces new requirements. Functional correctness tells you whether your agent can solve a problem, not whether it solves it well. Consider an agent with access to both a lightweight REST API for currency conversion and a full-featured financial data platform that happens to also offer currency conversion among hundreds of other capabilities. Both tools are functionally suitable. Both will return correct exchange rates. The ToolCorrectnessMetric would score either choice as acceptable.

Yet in production, these choices have drastically different implications. The financial platform might cost $0.05 per API call versus $0.001 for the simple converter. It might have 10x the latency because it's designed for complex analytical queries, not simple lookups. It might have rate limits that matter when you're making thousands of requests per hour. The "correct" choice becomes the expensive, slow choice.

This is the gap between correctness and optimization. DeepEval gets you to functional reliability, which is critical. But shipping production agents requires another layer: evaluating not just whether the tool works, but whether it's the right tool given cost constraints, performance requirements, and usage context. That's the layer we're going to build.

Defining Efficiency Dimensions for Tool Selection

To build meaningful efficiency metrics, we first need to define what "efficient" actually means in the context of tool selection. Unlike correctness, which is relatively binary, efficiency operates across multiple dimensions. A tool choice might excel in one area while performing poorly in another.

Cost Efficiency: The Direct Financial Impact

Cost efficiency measures the direct financial burden of tool selection. This includes API pricing, token usage in LLM-based tools, and compute resource consumption. A tool that costs $0.001 per call becomes a $10,000 annual expense at 10 million calls, while a free alternative saves that entire amount.

Token usage deserves special attention. If your agent can accomplish a task with a tool that uses 500 tokens versus one that uses 5,000 tokens, you're paying 10x more in LLM costs for the same outcome. These fractional differences compound at scale.
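
To make the compounding concrete, here's a quick back-of-the-envelope sketch. The call volume matches the figures above, while the per-token LLM price is purely illustrative:

# Per-call pricing compounds at scale (illustrative numbers).
calls_per_year = 10_000_000
paid_tool_annual = 0.001 * calls_per_year   # $10,000 per year at $0.001 per call
free_tool_annual = 0.0 * calls_per_year     # $0 per year

# Token usage follows the same logic: 5,000 vs. 500 tokens per task is a
# 10x difference in LLM spend for identical outcomes.
price_per_1k_tokens = 0.01                  # hypothetical LLM pricing
heavy_task_cost = (5_000 / 1_000) * price_per_1k_tokens   # $0.05 per task
light_task_cost = (500 / 1_000) * price_per_1k_tokens     # $0.005 per task

print(f"Annual API spend: ${paid_tool_annual:,.0f} vs ${free_tool_annual:,.0f}")
print(f"Per-task token cost: ${heavy_task_cost:.3f} vs ${light_task_cost:.3f}")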

Performance Efficiency: Speed and Responsiveness

Performance efficiency captures how quickly a tool completes its task: individual request latency, network overhead, and throughput capacity under load.

Latency differences can be dramatic. A local calculation completes in microseconds, while an external API call takes 100-500ms. A properly indexed database query returns in 50ms, while a full-text search takes 2 seconds. These delays directly impact user experience, especially in conversational interfaces where users expect near-instant responses.

Throughput matters at scale. A tool with a 10 requests/second rate limit becomes a bottleneck when your application serves 100 concurrent users.

Contextual Efficiency: Right-Sizing Tool Capability

Contextual efficiency evaluates whether a tool's capabilities match the task's actual complexity. Using a heavyweight symbolic mathematics engine to add two numbers is contextually inefficient. The tool brings capabilities (solving differential equations, theorem proving) that the task doesn't require, introducing unnecessary cost, latency, and failure modes.

Think of it like transportation. You wouldn't rent a moving truck to buy groceries (over-powered), nor would you try to move furniture in a sedan (under-powered). Contextual efficiency matches tool sophistication to task requirements.

Operational Efficiency: Minimizing Complexity and Risk

Operational efficiency measures how tool choices affect system reliability and maintainability. Tool chains requiring multiple sequential calls introduce latency and fragility. Each additional tool increases the surface area for errors, timeouts, rate limiting, and authentication issues.

Tools with clear error messages and structured logging are more operationally efficient than black-box alternatives. Dependencies on external services introduce risk: a tool with 99.9% uptime that's used unnecessarily adds potential downtime to your system.

Balancing Multiple Dimensions

These dimensions often conflict. The fastest tool might be the most expensive. The cheapest option might lack reliability. Effective efficiency metrics need to capture these tradeoffs and let you weight dimensions based on your specific requirements. A cost-sensitive application might tolerate higher latency, while a real-time interface prioritizes performance above all else. Understanding these dimensions gives us the vocabulary to build metrics that reflect what actually matters for our use case.
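
As a rough illustration of how weighting changes the verdict, the snippet below scores the same hypothetical tool choice under two different weight profiles; all numbers are made up for the example:

# Hypothetical per-dimension scores for a single tool choice (0.0 to 1.0).
scores = {"cost": 0.95, "latency": 0.40}

def weighted_score(scores: dict, weights: dict) -> float:
    """Combine per-dimension efficiency scores using application-specific weights."""
    return sum(weights[dim] * scores[dim] for dim in scores)

# A cost-sensitive batch job tolerates the slow-but-cheap tool...
print(round(weighted_score(scores, {"cost": 0.8, "latency": 0.2}), 2))   # 0.84
# ...while a latency-critical interface rejects the same choice.
print(round(weighted_score(scores, {"cost": 0.2, "latency": 0.8}), 2))   # 0.51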

Building Your First Custom Efficiency Metric

Now that we understand the efficiency dimensions, let's build a practical metric that extends DeepEval's testing capabilities. We'll create a ToolEfficiencyMetric that evaluates cost and performance efficiency alongside functional correctness.

Architecture: Extending DeepEval's Base Metrics

DeepEval's metrics follow a clear pattern: they inherit from the BaseMetric base class and implement a scoring mechanism that returns a value between 0 and 1. Rather than replacing ToolCorrectnessMetric, we'll create a complementary metric that can run alongside it. This approach lets you validate both correctness (does it work?) and efficiency (does it work well?) in the same test suite.

The key architectural decision is how to run both metrics together. While DeepEval supports assert_test() for batch metric execution, we'll use manual measurement for more control and better compatibility with custom metrics. This means calling measure() on each metric individually and then asserting their success.

Class Structure: The Foundation

Our custom metric needs to initialize several properties that DeepEval expects. Here's the essential initialization code:

                
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase
from typing import Dict, List, Optional

class ToolEfficiencyMetric(BaseMetric):
    def __init__(
        self,
        tool_costs: Dict[str, float],
        tool_latencies: Dict[str, float],
        optimal_tool: str,
        acceptable_tools: Optional[List[str]] = None,
        threshold: float = 0.7,
        cost_weight: float = 0.5,
        latency_weight: float = 0.5
    ):
        self.tool_costs = tool_costs
        self.tool_latencies = tool_latencies
        self.optimal_tool = optimal_tool
        self.acceptable_tools = acceptable_tools or []
        self.threshold = threshold
        self.cost_weight = cost_weight
        self.latency_weight = latency_weight
        
        # Initialize required properties for DeepEval
        self.success = False
        self.score = 0.0
        self.reason = None
                

What's happening here: the constructor takes per-tool cost and latency tables, the name of the tool we consider optimal for this task, an optional list of other acceptable tools, and the weights used to combine the cost and latency scores. It also stores the pass/fail threshold and initializes the success, score, and reason properties that DeepEval reads after measurement.

Integrating Cost and Latency Data into Scoring

The core of our metric is the measure() method, which calculates efficiency scores. Here's how we extract tool information and compute the final score:

                
def measure(self, test_case: LLMTestCase) -> float:
    # Get tools from additional_metadata
    if hasattr(test_case, 'additional_metadata') and test_case.additional_metadata:
        actual_tools = test_case.additional_metadata.get("tools_used", [])
    else:
        actual_tools = []
    
    if not actual_tools:
        self.success = False
        self.score = 0.0
        self.reason = "No tools were used"
        return self.score
    
    # Get the primary tool used
    primary_tool = actual_tools[0]
    
    # Calculate efficiency scores
    cost_score = self._calculate_cost_score(primary_tool)
    latency_score = self._calculate_latency_score(primary_tool)
    
    # Weighted combination
    self.score = (
        self.cost_weight * cost_score + 
        self.latency_weight * latency_score
    )
    
    # Determine success
    self.success = self.score >= self.threshold
    self.reason = self._generate_reason(primary_tool, cost_score, latency_score)
    
    return self.score
                

Key points: the method reads the tools the agent actually used from additional_metadata and returns a zero score if none were used. The first tool in the list is treated as the primary tool, its cost and latency scores are combined using the configured weights, and success is determined by comparing the final score against the threshold.

Now let's look at how we calculate the cost efficiency score:

                
def _calculate_cost_score(self, tool: str) -> float:
    actual_cost = self.tool_costs.get(tool, float('inf'))
    optimal_cost = self.tool_costs.get(self.optimal_tool, 0)
    
    # Both free = perfect score
    if actual_cost == 0 and optimal_cost == 0:
        return 1.0
    
    # Unknown tool = worst score
    if actual_cost == float('inf'):
        return 0.0
    
    # Free tool when optimal costs money = wrong tool
    if actual_cost == 0 and optimal_cost > 0:
        return 0.0
    
    # Costly tool when optimal is free = very inefficient
    if optimal_cost == 0 and actual_cost > 0:
        return 0.0
    
    # Both have cost: score inversely proportional to cost ratio
    cost_ratio = actual_cost / optimal_cost
    return max(0.0, min(1.0, 1.0 / cost_ratio))
                

The logic: if both the chosen tool and the optimal tool are free, the score is a perfect 1.0. A tool missing from the cost table scores 0.0, and so does any free-versus-paid mismatch with the optimal tool, since that signals the wrong tool was selected. When both tools have a cost, the score is the inverse of the cost ratio, clamped to the 0-1 range, so a tool that costs twice as much as the optimal choice scores 0.5.

The latency scoring follows the same pattern, just using latency values instead of costs.
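
The complete class lives in the repository; as a minimal sketch, the latency scorer and the remaining methods DeepEval expects from a custom metric might look like the following. The a_measure(), is_successful(), and __name__ members follow DeepEval's custom-metric convention (exact requirements can vary between versions), and the reason string only approximates the output shown in the examples below.

# Remaining methods of ToolEfficiencyMetric (a sketch; see the repository for the full version)
def _calculate_latency_score(self, tool: str) -> float:
    # Mirrors _calculate_cost_score, using milliseconds instead of dollars.
    actual_latency = self.tool_latencies.get(tool, float('inf'))
    optimal_latency = self.tool_latencies.get(self.optimal_tool, 0)

    if actual_latency == float('inf'):
        return 0.0
    if optimal_latency <= 0:
        return 1.0 if actual_latency <= 0 else 0.0

    latency_ratio = actual_latency / optimal_latency
    return max(0.0, min(1.0, 1.0 / latency_ratio))

def _generate_reason(self, tool: str, cost_score: float, latency_score: float) -> str:
    prefix = (f"Optimal tool '{tool}' selected." if tool == self.optimal_tool
              else f"Suboptimal tool '{tool}' selected.")
    return (f"{prefix} Cost efficiency: {cost_score:.2f}, "
            f"Latency efficiency: {latency_score:.2f}")

async def a_measure(self, test_case: LLMTestCase) -> float:
    # No LLM judge is involved, so the async path can simply reuse measure().
    return self.measure(test_case)

def is_successful(self) -> bool:
    return self.success

@property
def __name__(self):
    return "Tool Efficiency"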

Combining with ToolCorrectnessMetric for Comprehensive Evaluation

Here's how you use both metrics together in a test. First, we need to structure our test case properly:

                
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric

test_case = LLMTestCase(
    input="What is 15% of 240?",
    actual_output="36",
    expected_output="36",
    tools_called=[ToolCall(name="basic_calculator")],
    expected_tools=[ToolCall(name="basic_calculator")],
    additional_metadata={"tools_used": ["basic_calculator"]}
)
                

Structure explanation: tools_called and expected_tools use DeepEval's ToolCall objects and are what ToolCorrectnessMetric compares, while additional_metadata carries the plain tool names that our ToolEfficiencyMetric reads. The input, actual_output, and expected_output fields describe the interaction itself.

Now we can measure both metrics:

                
# Initialize metrics
correctness = ToolCorrectnessMetric()

efficiency = ToolEfficiencyMetric(
    tool_costs={"basic_calculator": 0.0, "code_interpreter": 0.002},
    tool_latencies={"basic_calculator": 50, "code_interpreter": 200},
    optimal_tool="basic_calculator",
    threshold=0.9,
    cost_weight=0.8,
    latency_weight=0.2
)

# Measure both
correctness.measure(test_case)
efficiency.measure(test_case)

# Assert both pass
assert correctness.is_successful()
assert efficiency.is_successful()
                

Why manual measurement: calling measure() on each metric directly gives us full control over execution and sidesteps compatibility issues that can arise when passing custom metrics to assert_test(). Each metric populates its own success, score, and reason, so we can assert and inspect them independently, which makes failures easier to diagnose.

The beauty of this approach is that you now have visibility into both dimensions. A test might pass correctness (the tool works) but fail efficiency (it's too expensive), giving you actionable feedback on where to optimize your agent's decision-making logic.

This foundation gives you a working efficiency metric. You can extend it further by adding more dimensions (like the ContextualComplexityMetric example, which validates that tool sophistication matches task complexity), or by creating custom scoring logic for your specific use case.
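
The ContextualComplexityMetric used in the next section isn't reproduced in full here. One way to sketch it, assuming a simple tool-to-complexity mapping (the TOOL_COMPLEXITY table and tier names below are illustrative, not the repository's actual implementation), is to subclass ToolEfficiencyMetric and penalize mismatches:

class ContextualComplexityMetric(ToolEfficiencyMetric):
    """Penalizes tools whose sophistication doesn't match the task's complexity."""

    # Illustrative mapping from tool names to complexity tiers.
    TOOL_COMPLEXITY = {
        "basic_calculator": "simple",
        "simple_filter": "simple",
        "code_interpreter": "moderate",
        "ml_pipeline": "advanced",
    }

    def __init__(self, task_complexity: str, **kwargs):
        super().__init__(**kwargs)
        self.task_complexity = task_complexity

    def measure(self, test_case: LLMTestCase) -> float:
        score = super().measure(test_case)
        tools = (getattr(test_case, "additional_metadata", None) or {}).get("tools_used", [])
        if not tools:
            return score

        tool_complexity = self.TOOL_COMPLEXITY.get(tools[0], "unknown")
        if tool_complexity != self.task_complexity:
            # Over- or under-engineered choices drag the score below the threshold.
            self.score = min(self.score, 0.3)
            self.success = self.score >= self.threshold
            self.reason = (f"Complexity mismatch: {self.task_complexity} task "
                           f"using {tool_complexity} tool. {self.reason}")
        return self.score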

Practical Implementation Examples

With our efficiency metric built, let's see how to apply it across different scenarios. Each example demonstrates how adjusting metric parameters adapts the same foundation to different optimization priorities.

Example 1: Cost-Aware Calculator Selection

When processing thousands of simple calculations daily, cost efficiency matters most. Here's how to test that your agent chooses the cheapest viable tool:

                
import pytest
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric
from metrics.tool_efficiency_metric import ToolEfficiencyMetric
from config.tool_config import CALCULATOR_TOOLS

@pytest.mark.parametrize("calculation,expected_result", [
    ("What is 25 + 37?", "62"),
    ("Calculate 15% of 240", "36"),
])
def test_basic_math_uses_efficient_tool(calculation, expected_result):
    test_case = LLMTestCase(
        input=calculation,
        actual_output=expected_result,
        expected_output=expected_result,
        tools_called=[ToolCall(name="basic_calculator")],
        expected_tools=[ToolCall(name="basic_calculator")],
        additional_metadata={"tools_used": ["basic_calculator"]}
    )
    
    correctness = ToolCorrectnessMetric()
    
    efficiency = ToolEfficiencyMetric(
        tool_costs=CALCULATOR_TOOLS["costs"],
        tool_latencies=CALCULATOR_TOOLS["latencies"],
        optimal_tool="basic_calculator",
        threshold=0.9,
        cost_weight=0.8,  # Prioritize cost
        latency_weight=0.2
    )
    
    correctness.measure(test_case)
    efficiency.measure(test_case)
    
    assert correctness.is_successful()
    assert efficiency.is_successful()
                

What this tests: The agent selects basic_calculator (free, 50ms) over wolfram_alpha ($0.01, 800ms) for simple arithmetic. With 80% cost weighting, expensive tools fail even if they work correctly.
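
The config.tool_config module isn't shown in this article; based on the figures referenced above, a plausible (illustrative) shape for it would be:

# config/tool_config.py -- illustrative values matching the numbers above
CALCULATOR_TOOLS = {
    "costs": {                      # dollars per call
        "basic_calculator": 0.0,
        "code_interpreter": 0.002,
        "wolfram_alpha": 0.01,
    },
    "latencies": {                  # milliseconds per call
        "basic_calculator": 50,
        "code_interpreter": 200,
        "wolfram_alpha": 800,
    },
}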

Example 2: Latency Optimization for User-Facing Features

For FAQ queries where users expect instant responses, speed trumps cost:

                
def test_faq_search_prioritizes_speed():
    test_case = LLMTestCase(
        input="What is your return policy?",
        actual_output="Our return policy allows 30-day returns...",
        expected_output="Our return policy allows 30-day returns...",
        tools_called=[ToolCall(name="local_index")],
        expected_tools=[ToolCall(name="local_index")],
        additional_metadata={"tools_used": ["local_index"]}
    )
    
    efficiency = ToolEfficiencyMetric(
        tool_costs={"local_index": 0.0, "web_search": 0.003},
        tool_latencies={"local_index": 30, "web_search": 400},
        optimal_tool="local_index",
        threshold=0.8,
        cost_weight=0.2,  # Cost matters less
        latency_weight=0.8  # Speed is critical
    )
    
    efficiency.measure(test_case)
    assert efficiency.is_successful()
                

What this tests: The agent uses local_index (30ms) instead of web_search (400ms) for FAQs. The 80% latency weighting means slow tools fail regardless of cost savings.

Detecting inefficiency: Here's what happens when the agent chooses poorly:

                
def test_faq_detects_slow_tool():
    test_case = LLMTestCase(
        input="What are your business hours?",
        actual_output="We're open Monday-Friday 9am-5pm EST",
        tools_called=[ToolCall(name="web_search")],  # Too slow!
        expected_tools=[ToolCall(name="local_index")],
        additional_metadata={"tools_used": ["web_search"]}
    )
    
    efficiency = ToolEfficiencyMetric(
        tool_costs={"local_index": 0.0, "web_search": 0.003},
        tool_latencies={"local_index": 30, "web_search": 400},
        optimal_tool="local_index",
        threshold=0.8,
        cost_weight=0.2,
        latency_weight=0.8
    )
    
    efficiency.measure(test_case)
    assert not efficiency.is_successful()  # Should fail
    
    # Output: Efficiency score: 0.06
    # Reason: Suboptimal tool 'web_search' selected. 
    #         Cost efficiency: 0.00, Latency efficiency: 0.08
                

Example 3: Context-Appropriate Tool Complexity

Our ContextualComplexityMetric extends the base metric to catch both over-engineering and under-engineering:

                
from metrics.tool_efficiency_metric import ContextualComplexityMetric

def test_simple_filtering_avoids_overengineering():
    """Simple tasks shouldn't trigger ML pipelines"""
    test_case = LLMTestCase(
        input="Show me orders from last week",
        actual_output="Here are 47 orders from last week...",
        tools_called=[ToolCall(name="simple_filter")],
        expected_tools=[ToolCall(name="simple_filter")],
        additional_metadata={"tools_used": ["simple_filter"]}
    )
    
    efficiency = ContextualComplexityMetric(
        task_complexity="simple",  # Expected complexity
        tool_costs={"simple_filter": 0.0, "ml_pipeline": 0.10},
        tool_latencies={"simple_filter": 100, "ml_pipeline": 8000},
        optimal_tool="simple_filter",
        threshold=0.85
    )
    
    efficiency.measure(test_case)
    assert efficiency.is_successful()
                

Catching under-engineering:

                
def test_underengineered_solution_detected():
    """Complex tasks need appropriate tools"""
    test_case = LLMTestCase(
        input="Predict next quarter's revenue based on trends",
        actual_output="Estimated revenue: $1.2M",
        tools_called=[ToolCall(name="simple_filter")],  # Too simple!
        expected_tools=[ToolCall(name="ml_pipeline")],
        additional_metadata={"tools_used": ["simple_filter"]}
    )
    
    efficiency = ContextualComplexityMetric(
        task_complexity="advanced",
        tool_costs={"simple_filter": 0.0, "ml_pipeline": 0.10},
        tool_latencies={"simple_filter": 100, "ml_pipeline": 8000},
        optimal_tool="ml_pipeline",
        threshold=0.8
    )
    
    efficiency.measure(test_case)
    assert not efficiency.is_successful()  # Should fail
    
    # Output includes: "Complexity mismatch: advanced task using simple tool"
                

Test Case Structure Summary

Every test follows this pattern:

                
# 1. Create test case with all required fields
test_case = LLMTestCase(
    input="query",
    actual_output="response",
    expected_output="expected",
    tools_called=[ToolCall(name="actual_tool")],
    expected_tools=[ToolCall(name="expected_tool")],
    additional_metadata={"tools_used": ["actual_tool"]}
)

# 2. Initialize metrics
correctness = ToolCorrectnessMetric()
efficiency = ToolEfficiencyMetric(...)

# 3. Measure
correctness.measure(test_case)
efficiency.measure(test_case)

# 4. Assert
assert correctness.is_successful()
assert efficiency.is_successful()
                

Key takeaways: every test case carries both the ToolCall fields that ToolCorrectnessMetric needs and the additional_metadata that the efficiency metric reads; both metrics run against the same test case; and the only things that change between scenarios are the cost and latency tables, the weights, and the threshold.

These examples give you ready-to-use patterns for the most common efficiency testing scenarios. Combine them in your test suite to ensure your agent not only works correctly, but works optimally.

Advanced Patterns: Composite Scoring

As your testing strategy matures, you'll encounter scenarios where simple cost-versus-latency tradeoffs aren't enough. Here are patterns for building more sophisticated efficiency evaluations.

Weighting Multiple Efficiency Dimensions

Don't treat all efficiency dimensions equally across your entire application. A single agent might need different weights for different tasks. Consider creating task-specific weight profiles: your checkout flow might weight latency at 90% because every millisecond affects conversion rates, while your nightly reporting batch jobs weight cost at 90% since speed doesn't matter. Store these profiles in your configuration alongside tool definitions, making it easy to apply the right priorities to each test scenario.

Creating Profiles for Different Scenarios

Codify your efficiency priorities into reusable profiles. Define three standard profiles that cover most use cases: "cost_critical" for batch processing and background jobs where minimizing expenses matters most, "latency_critical" for user-facing features where speed is paramount, and "balanced" for general-purpose operations. Each profile specifies not just weights, but also appropriate thresholds. This prevents the common mistake of using the same strict efficiency standards for a real-time chat interface and a monthly analytics report.
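
A configuration sketch for these profiles might look like the following; the weights and thresholds are illustrative starting points rather than values from the article's repository:

# Reusable efficiency profiles: dimension weights plus an appropriate threshold.
EFFICIENCY_PROFILES = {
    "cost_critical":    {"cost_weight": 0.9, "latency_weight": 0.1, "threshold": 0.85},
    "latency_critical": {"cost_weight": 0.1, "latency_weight": 0.9, "threshold": 0.85},
    "balanced":         {"cost_weight": 0.5, "latency_weight": 0.5, "threshold": 0.75},
}

def efficiency_metric_for(profile_name, tool_costs, tool_latencies, optimal_tool):
    """Build a ToolEfficiencyMetric pre-configured with a named profile."""
    profile = EFFICIENCY_PROFILES[profile_name]
    return ToolEfficiencyMetric(
        tool_costs=tool_costs,
        tool_latencies=tool_latencies,
        optimal_tool=optimal_tool,
        **profile,
    )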

Handling Acceptable Tradeoffs

Real-world optimization rarely has a single "correct" answer. Your efficiency metric should acknowledge this through the acceptable_tools parameter. A tool that's 20% slower but 80% cheaper might be perfectly acceptable for certain tasks. Define these tradeoffs explicitly in your tests rather than treating everything as optimal-or-failure. This prevents false negatives where your agent makes reasonable engineering decisions but fails overly rigid tests.

Similarly, consider graduated thresholds rather than hard pass/fail lines. A score of 0.85 might warrant a warning in development but not fail the build, while 0.65 could fail in staging, and anything below 0.8 fails in production. This creates a progressive tightening of efficiency standards as code moves through your pipeline.

Threshold Strategies for Different Environments

Your efficiency requirements should evolve with your deployment environment. In development, use lenient thresholds (0.6-0.7) to catch only egregious inefficiencies without blocking iteration. In staging, tighten to moderate thresholds (0.75-0.85) to surface optimization opportunities before they reach production. In production, enforce strict thresholds (0.85-0.95) because real money and user experience are at stake.
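
One way to wire environment-specific thresholds into the test suite is to read the deployment environment from an environment variable; the variable name and values below are illustrative:

import os

# Illustrative per-environment thresholds; tune the ranges to your own pipeline.
EFFICIENCY_THRESHOLDS = {
    "development": 0.65,
    "staging": 0.80,
    "production": 0.90,
}

def current_efficiency_threshold() -> float:
    env = os.getenv("DEPLOY_ENV", "development")
    return EFFICIENCY_THRESHOLDS.get(env, EFFICIENCY_THRESHOLDS["development"])

# Usage:
# efficiency = ToolEfficiencyMetric(..., threshold=current_efficiency_threshold())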

Consider also ratcheting thresholds up over time. When you first deploy efficiency testing, set forgiving thresholds to establish a baseline without breaking existing functionality. Then gradually increase them over sprints as you improve your agent's decision-making logic. This prevents the "boil the ocean" problem where you try to fix every inefficiency at once.

The goal isn't to achieve perfect efficiency scores everywhere—it's to make conscious, measured tradeoffs that align with your business priorities. These advanced patterns give you the flexibility to encode those tradeoffs into your testing strategy, ensuring your efficiency metrics guide improvement rather than becoming obstacles to shipping.

Conclusion

We started this article with a simple observation: passing tests doesn't mean your agent is production-ready. DeepEval's ToolCorrectnessMetric gives you the floor of functional reliability. The efficiency metrics we've built give you the ceiling of operational excellence. Correctness ensures your agent can accomplish tasks, but efficiency ensures it does so without wasting money, frustrating users with latency, or introducing unnecessary operational complexity.

Don't try to implement everything at once. Start with the efficiency dimension that hurts most right now. If costs are climbing, focus on cost efficiency; if users complain about speed, prioritize latency. Implement it for your highest-impact workflows first, tune your thresholds, and expand gradually. This incremental approach delivers quick wins and builds momentum without overwhelming your team.

The metrics we've built are starting points for your specific needs. Extend them, adapt them, and if they prove valuable, share them back with the community. All complete code examples, including the ToolEfficiencyMetric implementation, test suites, and configuration templates, are available on our GitHub repository. Start small, measure what matters, and iterate toward excellence. Your users and infrastructure bills will thank you.