
How to Pick the Metrics That Actually Matter for Your AI

Jan 25th 2026 · 18 min read
Tags: easy, qa, ai/ml, strategy

You've just been asked to evaluate an AI model. Someone mentions it has "95% accuracy," and heads nod around the room. Sounds impressive, right? But here's the thing: accuracy is often the most misleading number you can look at. If your model is detecting fraudulent transactions and only 2% of transactions are actually fraudulent, a model that blindly labels everything as "not fraud" would score 98% accuracy while catching exactly zero bad actors.

As a QA engineer, your job isn't to rubber-stamp impressive-sounding numbers. It's to ask the uncomfortable questions: What does this metric actually tell us? What is it hiding? And most importantly, does this model work for the people who will rely on it?

Choosing the right evaluation metrics isn't a data science formality. It's a core QA responsibility. This guide will help you navigate that decision with confidence.

The accuracy trap: why default metrics mislead

Accuracy feels intuitive. It answers a simple question: out of all predictions, how many did the model get right? When someone tells you a model is 94% accurate, your brain registers that as an A grade. The model passes. Ship it.

But accuracy has a dirty secret. It treats all correct predictions as equally valuable and all errors as equally costly. In the real world, that's almost never true.

Imbalanced datasets and the illusion of performance

Most interesting problems in AI involve imbalanced data. In fraud detection, legitimate transactions vastly outnumber fraudulent ones. In medical screening, healthy patients far exceed those with a rare disease. In manufacturing quality control, defective items are a tiny fraction of total output.

When your classes are imbalanced, accuracy becomes a nearly meaningless number. The math works against you. If 99% of your data belongs to one class, a model can achieve 99% accuracy by simply predicting that class every single time. It learns nothing. It detects nothing. But on paper, it looks nearly perfect.

This is the accuracy trap. The metric rewards lazy predictions that follow the majority class while punishing models that actually try to identify the rare but important cases you care about.

A concrete example showing how "high accuracy" can mean a useless model

Let's make this tangible. Imagine you're a QA engineer at a bank, and you're evaluating a model designed to flag fraudulent credit card transactions. Your dataset contains 100,000 transactions. Of these, 500 are confirmed fraud and 99,500 are legitimate.

Your data science team presents Model A with 99.2% accuracy. Impressive at first glance. But when you dig into the predictions, you discover that Model A flagged only 50 of the 500 fraudulent transactions. It missed 450 cases of actual fraud. Those 450 missed cases represent stolen money, damaged customer trust, and regulatory headaches.

How did it still score 99.2%? Because it correctly identified 99,200 legitimate transactions as legitimate. The sheer volume of easy correct predictions drowned out its failure on the hard problem you actually hired it to solve.

Now consider Model B. It has 97% accuracy, which sounds worse. But this model caught 400 of the 500 fraudulent transactions. Yes, it also flagged 2,600 legitimate transactions as suspicious, creating more false alarms for your team to review. But it caught 350 more fraud cases than Model A.

Which model is actually better? That depends entirely on what matters to your business. If you only looked at accuracy, you would have chosen the worse model. And this is exactly why QA engineers need to push past the default metric and ask harder questions.
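
To see the gap in numbers, here is a minimal sketch, assuming Python with NumPy and scikit-learn (tools this article doesn't prescribe), that rebuilds both models' predictions from the counts above and compares accuracy with precision and recall:

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def labels_from_counts(tp, fn, fp, tn):
    # Reconstruct ground truth and predictions from confusion counts (1 = fraud).
    y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
    y_pred = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)
    return y_true, y_pred

models = {
    "Model A": labels_from_counts(tp=50, fn=450, fp=300, tn=99_200),    # misses 450 frauds
    "Model B": labels_from_counts(tp=400, fn=100, fp=2_600, tn=96_900), # far more false alarms
}
for name, (y_true, y_pred) in models.items():
    print(name,
          f"accuracy={accuracy_score(y_true, y_pred):.3f}",
          f"precision={precision_score(y_true, y_pred):.3f}",
          f"recall={recall_score(y_true, y_pred):.3f}")
# Model A wins on accuracy (~0.99 vs ~0.97), yet its recall is only 0.10;
# Model B's recall is 0.80, which is the number the fraud team actually cares about.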

Start with consequences, not calculations

Before you open a spreadsheet or run a single evaluation script, stop. The most important work in metric selection happens away from the data. It happens in conversations with stakeholders, product managers, and the people who will live with the consequences of your model's mistakes.

Every model makes errors. That's not a bug; it's a fundamental reality of probabilistic systems. Your job as a QA engineer is not to demand perfection but to understand which errors are tolerable and which are catastrophic. That understanding should drive every metric decision you make.

False positives vs. false negatives and who pays the price

Every prediction error falls into one of two categories. A false positive occurs when the model says "yes" but the answer is actually "no." A false negative occurs when the model says "no" but the answer is actually "yes." These two error types almost never carry equal weight.

Think about email spam filtering. A false positive means a legitimate email lands in your spam folder. You might miss an important message from a client or a job offer. A false negative means a spam email reaches your inbox. Annoying, but you just delete it and move on. In this context, false positives are more damaging. You'd rather let a few spam messages through than risk losing real communication.

Now flip the scenario to disease screening. A false positive means telling a healthy patient they might have cancer and need further testing. Stressful and costly, but the follow-up tests will reveal the truth. A false negative means telling a sick patient they're healthy. They go home, the disease progresses untreated, and outcomes worsen. Here, false negatives carry life-or-death consequences.

Same error types. Completely different stakes. The math doesn't change, but the priorities absolutely should.
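
One way to keep both error types visible is to read them straight off the confusion matrix instead of letting them vanish into a single score. A minimal sketch, assuming scikit-learn and a toy set of labels:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # ground truth, 1 = positive case
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]  # model output
# ravel() unpacks the 2x2 matrix in a fixed order: tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives (model said yes, truth was no):", fp)
print("false negatives (model said no, truth was yes):", fn)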

Mapping business impact to metric priorities

Once you understand which error type hurts more, you can start mapping that understanding to specific metrics.

If false positives are your primary concern, precision becomes your focus. Precision measures how many of the positive predictions were actually correct. A high precision model rarely cries wolf. When it flags something, you can trust that flag.

If false negatives keep you up at night, recall is your metric. Recall measures how many of the actual positive cases the model managed to catch. A high recall model casts a wide net. It might generate more false alarms, but it rarely lets a true case slip through.

Sometimes both matter, and you need a balanced view. That's where F1 score comes in, giving you a single number that accounts for both precision and recall. But be careful with F1. It assumes equal weighting, which may not reflect your actual priorities. A weighted F1 or a custom threshold might serve you better.
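
In code, all three are one call each. A hedged sketch assuming scikit-learn; the toy labels and the beta=2 value are purely illustrative, for a case where missed positives hurt roughly twice as much as false alarms:

from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # toy labels
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # toy predictions
print("precision:", precision_score(y_true, y_pred))  # of the flags raised, how many were right
print("recall:   ", recall_score(y_true, y_pred))     # of the real positives, how many were caught
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean, equal weighting
# F-beta lets you tilt the balance; beta > 1 weights recall more heavily than precision.
print("F2:       ", fbeta_score(y_true, y_pred, beta=2))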

The key insight is this: metrics are not neutral. Each one encodes assumptions about what matters. Your job is to pick the metric whose assumptions align with your business reality.

Questions to ask stakeholders before you touch the data

Before you evaluate a single prediction, gather your stakeholders and ask these questions. Write down the answers. They will guide every decision that follows.

What happens when the model predicts positive but it's wrong? Walk through the concrete consequences. Does a human review the case? Does the customer get blocked, delayed, or inconvenienced? What's the cost in time, money, or trust?

What happens when the model misses a positive case? Again, trace the real-world impact. Does fraud go undetected? Does a defective product reach a customer? Does a patient miss early treatment? What's the worst-case outcome?

Which error would you rather explain to a customer, regulator, or executive? This question cuts through abstract debates. It forces stakeholders to confront the human reality of model errors.

Is there a threshold where one error type becomes acceptable? Sometimes a few false alarms are fine if the review process is cheap. Sometimes missing even one critical case is unacceptable. Understanding these thresholds helps you set realistic targets.

What does the current process look like without the model? This gives you a baseline. If human reviewers currently catch 60% of fraud cases, a model with 75% recall is a meaningful improvement even if it's not perfect.

These conversations might feel slow compared to jumping into the data. But they're the foundation of meaningful evaluation. A metric without context is just a number. A metric tied to consequences is a decision-making tool.

Metrics decoded: a practical guide

Now that you understand the importance of consequences, let's get practical. This section breaks down the most common evaluation metrics, explains what each one actually measures, and tells you when to reach for it. No formulas here. Just intuition and real-world guidance.

Precision and when false alarms are costly

Precision answers one question: when the model says "yes," how often is it right?

Think of precision as a measure of trust. A high precision model is confident and usually correct. When it raises a flag, you can act on that flag without second-guessing. A low precision model is the colleague who constantly sends "urgent" messages about things that turn out to be nothing. You start ignoring the alerts, and eventually you miss the one that matters.

Prioritize precision when false positives create real harm or friction. Content moderation is a good example. If your model flags user posts as policy violations, each false positive means a legitimate post gets removed and a user gets frustrated. Flag too many innocent posts, and you erode trust in your platform. You need a model that speaks up only when it's confident.

Another example is recommendation systems for high-stakes decisions. If your model recommends candidates for job interviews, a false positive means wasting hiring managers' time on unqualified applicants. Too many bad recommendations, and people stop trusting the system entirely.

The trade-off with precision is that improving it often means being more conservative. The model becomes pickier, which means it might miss some true positives. That's acceptable when the cost of false alarms outweighs the cost of missed detections.

Recall and when missing cases is unacceptable

Recall answers a different question: of all the actual positive cases out there, how many did the model catch?

Think of recall as a measure of coverage. A high recall model is thorough. It might flag some things incorrectly, but it rarely lets a true case slip through the cracks. A low recall model is like a security guard who only checks every tenth bag at the entrance. Sure, the people who get checked are thoroughly screened. But most of the risk walks right past.

Prioritize recall when missing a positive case has severe consequences. Medical diagnostics is the classic example. If your model screens mammograms for signs of cancer, a false negative means a patient with cancer gets told they're healthy. They don't seek treatment. The disease progresses. By the time symptoms appear, outcomes are worse. In this context, you'd rather have some false positives that lead to additional testing than miss actual cases of disease.

Security and safety applications often fall into this category too. If your model detects weapons in luggage scans, recall is paramount. Missing a threat is far worse than stopping a few extra bags for manual inspection.

The trade-off with recall is that improving it often means casting a wider net. The model becomes more aggressive, which means more false alarms. That's acceptable when the cost of missing cases outweighs the burden of additional review.

F1 score for balancing the trade-off

What if both precision and recall matter? What if false positives and false negatives both carry significant costs, and you can't clearly prioritize one over the other?

F1 score offers a compromise. It combines precision and recall into a single number using what's called a harmonic mean. Without getting into the math, just know that F1 punishes extreme imbalances. A model with 95% precision but 10% recall will have a low F1. A model with decent scores on both will have a higher F1 than a model that excels at one while failing at the other.
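
If you do want to peek at the math, here's the back-of-the-envelope version for the 95% precision, 10% recall model just mentioned:

# F1 is the harmonic mean of precision and recall: 2 * p * r / (p + r).
p, r = 0.95, 0.10
print("arithmetic mean:", (p + r) / 2)          # 0.525 -- looks almost respectable
print("F1 (harmonic):  ", 2 * p * r / (p + r))  # ~0.18 -- the weak recall dominates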

F1 is useful when you need a single summary metric for model comparison. If you're evaluating ten different models and need to rank them quickly, F1 gives you a reasonable starting point. It's also helpful in situations where the costs of false positives and false negatives are roughly similar or hard to quantify precisely.

But be cautious. F1 assumes equal weighting between precision and recall. In reality, that balance is rarely perfect. If you know that false negatives are twice as costly as false positives, a standard F1 score won't reflect that priority. You might need a weighted version or separate tracking of both metrics with explicit targets for each.

Use F1 as a useful shorthand, not as a replacement for thinking carefully about your specific trade-offs.

AUC-ROC for evaluating across thresholds

Most classification models don't just output "yes" or "no." They output a probability or confidence score. A transaction might be flagged as 73% likely to be fraud. An email might score 0.4 on the spam scale. You then choose a threshold to convert that score into a binary decision. Everything above 0.5 is positive. Or maybe everything above 0.7. Or 0.3.

The choice of threshold dramatically affects your precision and recall. A high threshold makes the model conservative, boosting precision but lowering recall. A low threshold makes the model aggressive, boosting recall but lowering precision.
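
A small sketch of that effect, assuming scikit-learn and a hypothetical set of confidence scores (the numbers are made up for illustration):

import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.55, 0.70, 0.40, 0.60, 0.80, 0.95])
for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_true, y_pred):.2f} "
          f"recall={recall_score(y_true, y_pred):.2f}")
# Raising the threshold trades recall for precision; lowering it does the opposite.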

AUC-ROC helps you evaluate model performance across all possible thresholds. ROC stands for Receiver Operating Characteristic, and it's a curve that plots the true positive rate against the false positive rate at every threshold. AUC is the Area Under that Curve. A perfect model scores 1.0. A model that guesses randomly scores 0.5.

AUC-ROC is useful when you haven't finalized your threshold yet. It tells you about the model's overall discriminative ability separate from any specific operating point. Two models might have identical precision and recall at threshold 0.5, but very different AUC scores, indicating that one will give you more flexibility to tune performance later.

It's also helpful when comparing models across different use cases where thresholds might vary. A single AUC number lets you say "this model generally separates positive and negative cases better" without committing to a specific precision/recall balance.

The limitation is that AUC-ROC can be misleading on highly imbalanced datasets. When positive cases are rare, the false positive rate can look great even when the model produces many false alarms in absolute terms. In these situations, precision-recall curves and AUC-PR might serve you better.
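
When you suspect imbalance is flattering the ROC view, look at both areas side by side. A hedged sketch on synthetic data, assuming scikit-learn (the dataset and model here are stand-ins, not recommendations):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic data with roughly 1% positives, loosely mimicking a fraud-style imbalance.
X, y = make_classification(n_samples=50_000, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]
print("AUC-ROC:", round(roc_auc_score(y_test, scores), 3))
# average_precision_score summarizes the precision-recall curve (AUC-PR).
print("AUC-PR: ", round(average_precision_score(y_test, scores), 3))
# On rare-positive problems the ROC number often sits much closer to 1.0 than
# the PR number, which is why the PR view is usually the more honest summary.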

Specificity, MCC, and other metrics for edge cases

A few other metrics deserve brief mention for specialized situations.

Specificity measures how well the model identifies negative cases. It's the flip side of recall. While recall asks "of all actual positives, how many did we catch," specificity asks "of all actual negatives, how many did we correctly leave alone." Specificity matters when correctly identifying negatives is just as important as catching positives. In some medical contexts, you need high recall for detecting disease and high specificity for avoiding unnecessary treatment of healthy patients.

Matthews Correlation Coefficient, or MCC, provides a balanced measure that accounts for all four cells of the confusion matrix: true positives, true negatives, false positives, and false negatives. MCC produces a score between negative one and positive one, where one is perfect prediction, zero is random guessing, and negative one is total disagreement. MCC is particularly useful when your classes are imbalanced and you want a single number that won't be skewed by the majority class. Some researchers argue it's more reliable than F1 for imbalanced problems, though it's less intuitive to interpret.

Log loss and Brier score evaluate the quality of probability estimates rather than just final classifications. If your model outputs a 70% confidence and you need to trust that calibration, these metrics help you assess whether the probabilities are meaningful.
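
All of these are a one-liner away in most toolkits. A hedged sketch assuming scikit-learn and toy data (specificity has no dedicated function there, but it falls straight out of the confusion matrix):

from sklearn.metrics import confusion_matrix, matthews_corrcoef, log_loss, brier_score_loss

y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]                       # toy labels
y_pred  = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]                       # thresholded predictions
y_proba = [0.1, 0.2, 0.1, 0.3, 0.6, 0.4, 0.8, 0.7, 0.45, 0.9]  # model confidence for class 1
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("specificity:", tn / (tn + fp))  # of actual negatives, how many were left alone
print("MCC:        ", matthews_corrcoef(y_true, y_pred))
print("log loss:   ", log_loss(y_true, y_proba))          # punishes confident wrong probabilities
print("Brier score:", brier_score_loss(y_true, y_proba))  # mean squared error of the probabilities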

You won't need these every day. But knowing they exist means you can reach for them when standard metrics fall short.

Building your evaluation checklist

Theory is useful, but you need something practical. Something you can pull up before every model evaluation to make sure you're asking the right questions and choosing the right metrics. This section gives you that checklist, along with guidance for common scenarios and advice on when to push back against oversimplified evaluation.

Key questions every QA engineer should ask

What problem is this model solving? Get specific. Not "fraud detection" but "identifying fraudulent credit card transactions at point of sale before authorization." The more precise your problem definition, the clearer your metric choices become.

What does the class distribution look like? Ask for the breakdown of positive and negative cases in both training and test data. If positives represent less than 10% of the dataset, accuracy is almost certainly misleading. You'll need metrics that focus on the minority class performance.

What are the real-world consequences of each error type? Map out false positives and false negatives separately. Who is affected? What actions follow from each error? What's the financial, operational, or human cost? Document these consequences explicitly.

Who reviews or acts on the model's predictions? If every positive prediction triggers an expensive human review, false positives have direct cost implications. If predictions feed directly into automated decisions with no human oversight, the stakes for both error types increase.

What's the current baseline? How is this problem handled today without the model? What's the current detection rate, false alarm rate, or error rate? This gives you a benchmark for whether the model actually improves things.

Are there regulatory or compliance requirements? Some industries have explicit standards for sensitivity, specificity, or error rates. Healthcare, finance, and safety-critical systems often have thresholds that aren't negotiable. Know these before you evaluate.

What threshold will be used in production? If the team has already decided on a confidence threshold, evaluate at that threshold specifically. If the threshold is still undetermined, request AUC curves so you can see performance across the full range.

How will the model be monitored after deployment? Evaluation doesn't end at launch. Understand what metrics will be tracked in production and make sure they align with what you're measuring during QA.

Matching common scenarios to recommended metrics

Different problems call for different metrics. Here's a quick reference guide for common scenarios you'll encounter.

For fraud detection, where missing fraud is costly but some false alarms are acceptable, prioritize recall first and track precision as a secondary constraint. Set a minimum acceptable recall threshold based on business requirements, then optimize precision within that constraint. AUC-PR is often more informative than AUC-ROC given the class imbalance. A sketch of this recall-floor approach follows after these scenarios.

For spam filtering, where false positives annoy users and erode trust, prioritize precision. Users will tolerate occasional spam reaching their inbox far more than they'll tolerate legitimate messages disappearing. Track recall to ensure you're still catching a reasonable proportion of spam, but let precision drive your decisions.

For medical screening, where missing a disease can be life-threatening, prioritize recall aggressively. The downstream process typically involves confirmatory testing, so false positives lead to additional tests rather than immediate harm. Specificity becomes a secondary consideration to manage healthcare system burden.

For content moderation, where both errors damage user experience, use F1 as a starting point but track precision and recall separately. False positives silence legitimate speech. False negatives allow harmful content to spread. Neither is acceptable at high rates, so you need visibility into both.

For recommendation systems, where bad recommendations waste user attention and erode trust, precision at the top of the list matters most. Users typically only see the first few recommendations. Metrics like precision at k, which measures precision in the top k results, often matter more than overall precision across all predictions.

For quality control in manufacturing, where defective products reaching customers creates liability and reputation damage, prioritize recall for defect detection. The cost of pulling a few extra items for manual inspection is trivial compared to shipping defective products. Track the false positive rate to ensure you're not creating unsustainable inspection bottlenecks.

For predictive maintenance, where both missed failures and unnecessary interventions are expensive, balance matters. A missed failure means unplanned downtime, which is costly. A false alarm means unnecessary maintenance, which is also costly. Use F1 or a custom weighted metric that reflects the relative costs in your specific operation.
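
Here is the recall-floor idea from the fraud scenario as a hedged sketch, assuming scikit-learn and a model that exposes confidence scores; the floor value and toy data are illustrative, not recommendations. The idea: scan the candidate thresholds, keep only those that meet the minimum recall, and take the one with the best precision.

import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, scores, min_recall=0.90):
    # Choose the threshold with the best precision among those meeting the recall floor.
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The curve returns one more precision/recall point than thresholds; drop it to align.
    precision, recall = precision[:-1], recall[:-1]
    eligible = recall >= min_recall
    if not eligible.any():
        raise ValueError("No threshold reaches the required recall floor.")
    best = np.argmax(precision[eligible])
    return thresholds[eligible][best], precision[eligible][best], recall[eligible][best]

# Toy example; in practice y_true and scores come from a held-out evaluation set.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.55, 0.70, 0.40, 0.60, 0.80, 0.95])
threshold, p, r = pick_threshold(y_true, scores, min_recall=0.75)
print(f"chosen threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")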

When to push back and request multiple metrics

Here's a truth that will serve you well: anyone who insists on evaluating a model with a single metric is either oversimplifying or hiding something.

Push back when you're only shown accuracy. This is the most common red flag. Accuracy on its own tells you almost nothing about model performance on the problem you actually care about. Ask for precision and recall at minimum. If the team can't provide these, that's a sign the evaluation process is immature.

Push back when metrics don't match stated priorities. If stakeholders say "we absolutely cannot miss fraud cases" but the evaluation focuses on precision, there's a disconnect. The metrics should reflect what the business actually cares about. Surface this inconsistency and resolve it before signing off.

Push back when you don't see performance on subgroups. Overall metrics can hide poor performance on important segments. A model might have great recall overall but terrible recall on transactions from a specific region or user demographic. Ask for breakdowns by relevant categories, especially if fairness or equity is a concern.

Push back when threshold selection isn't justified. If the team picked 0.5 as the classification threshold because that's the default, challenge that assumption. Request precision-recall curves or ROC curves so you can see how performance changes across thresholds. The optimal threshold depends on your cost trade-offs, not software defaults.

Push back when there's no comparison to baseline. A model with 80% recall sounds good until you learn that the existing rule-based system achieves 78% recall with much simpler infrastructure. Always ask what improvement the model provides over the current solution or a naive baseline.

Request a dashboard, not a single number. The best evaluation approach shows multiple metrics together: precision, recall, F1, and AUC at minimum, with breakdowns by relevant segments and performance at multiple thresholds. If the tooling doesn't support this, advocate for building it. Single-metric evaluation is a shortcut that creates blind spots. A small sketch of what such a report can look like follows below.

Document your metric rationale. When you choose to prioritize certain metrics, write down why. This creates accountability and helps future team members understand the reasoning. It also protects you when someone later asks why the model performs poorly on a metric you deliberately deprioritized.
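
As a hedged sketch of what that kind of report can look like, assuming pandas and scikit-learn, with a hypothetical "region" column standing in for whatever segments matter in your domain:

import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

def metric_report(df):
    # Summarize several metrics for one slice of predictions.
    return pd.Series({
        "n": len(df),
        "precision": precision_score(df["y_true"], df["y_pred"], zero_division=0),
        "recall": recall_score(df["y_true"], df["y_pred"], zero_division=0),
        "f1": f1_score(df["y_true"], df["y_pred"], zero_division=0),
    })

# Hypothetical evaluation table: labels, thresholded predictions, raw scores, segment.
results = pd.DataFrame({
    "y_true": [1, 0, 1, 0, 0, 1, 0, 1, 0, 0],
    "y_pred": [1, 0, 0, 0, 1, 1, 0, 1, 0, 0],
    "score":  [0.9, 0.2, 0.4, 0.1, 0.7, 0.8, 0.3, 0.85, 0.15, 0.25],
    "region": ["north"] * 5 + ["south"] * 5,
})
print("overall AUC-ROC:", round(roc_auc_score(results["y_true"], results["score"]), 3))
print(metric_report(results).rename("overall"))
# The same report per segment, which is where hidden gaps usually show up.
print(results.groupby("region").apply(metric_report))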

Conclusion

At its core, QA has always been about advocacy. You're the voice of the user in rooms full of deadlines, technical constraints, and competing priorities. Evaluating AI models is no different. When you push past accuracy and ask harder questions about precision, recall, and real-world consequences, you're not being difficult. You're doing your job. You're ensuring that the model works for the people who will depend on it, not just for the dashboard that reports on it.

Choosing metrics is choosing what matters. Every metric you prioritize encodes a value judgment about which errors are tolerable and which are not. That's not a data science decision. That's a human decision, and it belongs in the hands of someone who understands users, consequences, and quality. That someone is you.
