Classification Metrics for ML Interviews: Precision, Recall, ROC, and PR-AUC
A practical ML interview guide to classification metrics: how precision, recall, F1, ROC-AUC, PR-AUC, calibration, and thresholds work, and how to choose the right one for business tradeoffs.
Classification metrics for ML interviews are really questions about tradeoffs. Precision, recall, ROC, and PR-AUC are not just formulas to memorize; they are ways to describe what happens when a model makes decisions under uncertainty. Strong candidates can build the confusion matrix, compute the basic metrics, explain threshold movement, choose between ROC-AUC and PR-AUC for imbalanced data, and connect the metric to the cost of false positives and false negatives.
Use this guide as an interview playbook. If you can answer classification metric questions with examples, decision rules, and caveats, you will stand out from candidates who only recite definitions.
Start with the confusion matrix
Every binary classification metric starts with four counts:
| | Predicted positive | Predicted negative |
|---|---:|---:|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |
A fraud model flags transactions. A flagged fraud transaction is TP. A legitimate transaction flagged as fraud is FP. A fraud transaction missed by the model is FN. A legitimate transaction allowed through is TN.
From that table:
- Accuracy = (TP + TN) / total
- Precision = TP / (TP + FP)
- Recall or true positive rate = TP / (TP + FN)
- False positive rate = FP / (FP + TN)
- Specificity = TN / (TN + FP)
- F1 = 2 × precision × recall / (precision + recall)
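As a quick self-check, the arithmetic can be verified in a few lines. A minimal sketch, using hypothetical fraud-style counts (the numbers are illustrative only):

```python
# Hypothetical confusion-matrix counts for a fraud model (illustrative only).
tp, fp, fn, tn = 80, 40, 20, 9860

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # true positive rate
fpr = fp / (fp + tn)               # false positive rate
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} fpr={fpr:.4f} f1={f1:.3f}")
```

Note how this toy example already hints at the accuracy trap: accuracy is about 0.99 while precision and recall are far lower, because negatives dominate.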
In an interview, always ground formulas in a use case. “Precision answers: of the items I flagged, how many were truly positive? Recall answers: of all true positives, how many did I catch?” That language is usually more memorable than the algebra.
Precision vs recall: the core tradeoff
Precision and recall usually move against each other as you change the classification threshold. A lower threshold marks more examples positive. That tends to increase recall because you catch more actual positives, but it may decrease precision because you also admit more false positives. A higher threshold does the opposite.
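A minimal sketch of that sweep, assuming scikit-learn is available and using toy labels and scores (illustrative numbers, not from any real model):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels and scores: higher score means "more likely positive".
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.50, 0.60, 0.70, 0.80, 0.90])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold in this toy run trades recall for precision, which is exactly the movement interviewers want you to narrate.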
Use this decision rule:
| Business situation | Metric emphasis | Reason |
|---|---|---|
| Spam folder for personal email | Precision | False positives hide real mail and anger users |
| Cancer screening | Recall | Missing disease is worse than extra follow-up |
| Fraud review queue with limited analysts | Precision at k or precision at capacity | Reviewers can only handle so many alerts |
| Safety-critical detection | Recall plus strict guardrails | Misses can be costly, but false alarms still matter |
| Marketing lead scoring | Precision/recall by budget | Sales capacity defines acceptable alert volume |
Never say “precision is better” or “recall is better” without the cost context. The right metric depends on the action taken after prediction.
Thresholds, scores, and operating points
Most classifiers output a score or probability-like value. The threshold converts that score into a class label. Metrics like precision and recall depend on the chosen threshold. AUC metrics summarize behavior across thresholds.
A strong interview answer separates three layers:
- Ranking quality: Are positives generally scored above negatives? AUC-style metrics help.
- Probability quality: Are predicted probabilities calibrated? Calibration metrics help.
- Decision quality: At the chosen threshold, do business outcomes improve? Precision, recall, cost, and online metrics help.
For example, a model can have good ROC-AUC but poor calibration. It ranks risky users above safe users, but a score of 0.8 may not mean an 80% event probability. That matters if downstream systems interpret scores as probabilities.
ROC curve and ROC-AUC
The ROC curve plots true positive rate against false positive rate across thresholds. ROC-AUC is the area under that curve. Intuitively, ROC-AUC measures how often a randomly chosen positive example receives a higher score than a randomly chosen negative example.
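A small sketch of that pairwise interpretation, assuming scikit-learn and toy data (illustrative numbers only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

# ROC-AUC from scikit-learn.
auc = roc_auc_score(y_true, scores)

# The same number, estimated as the probability that a random positive
# outranks a random negative (ties count as half).
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(f"roc_auc={auc:.3f}  pairwise estimate={np.mean(pairs):.3f}")
```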
ROC-AUC is useful when:
- You care about ranking across many possible thresholds.
- Classes are not extremely imbalanced, or false positive rate is the right x-axis.
- You want a threshold-independent comparison between models.
- You need a common metric for baseline model selection.
But ROC-AUC can look optimistic on heavily imbalanced problems. If only 0.1% of examples are positive, a small false positive rate may still produce a huge number of false alerts. That is why PR-AUC is often more informative for rare-event detection.
Interview caveat: ROC-AUC does not tell you whether the chosen threshold is good, whether the model is calibrated, or whether the review team can handle the alert volume. It is a ranking metric, not a launch decision by itself.
Precision-recall curve and PR-AUC
The precision-recall curve plots precision against recall across thresholds. PR-AUC summarizes that curve. It is especially useful when the positive class is rare and false positives are operationally expensive.
PR-AUC answers a more action-oriented question: as I try to catch more positives, how much precision do I retain? For fraud, abuse, medical screening follow-up, recruiting outreach, or defect detection, that is often closer to the business problem than ROC-AUC.
A key reference point: the no-skill baseline for PR-AUC is the positive class prevalence. If 1% of examples are positive, a random ranking has expected precision around 1%. That makes PR-AUC sensitive to class imbalance and harder to compare across datasets with different prevalence. In interviews, mention this caveat if comparing PR-AUC across time, countries, or product surfaces.
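A sketch of that baseline, assuming scikit-learn (average precision is its usual PR-AUC summary) and a synthetic imbalanced dataset with all numbers illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Synthetic imbalanced labels: roughly 1% positives.
y_true = (rng.random(100_000) < 0.01).astype(int)

# Random scores carry no signal, so average precision should hover near prevalence.
random_scores = rng.random(100_000)

ap_random = average_precision_score(y_true, random_scores)
prevalence = y_true.mean()
print(f"prevalence={prevalence:.4f}  average_precision(random)={ap_random:.4f}")
```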
F1, F-beta, and when they are useful
F1 is the harmonic mean of precision and recall. It is high only when both are high. It is useful when you need one number and you value precision and recall roughly equally.
F-beta generalizes this:
- F2 weights recall more than precision.
- F0.5 weights precision more than recall.
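A minimal sketch with scikit-learn's fbeta_score, using hypothetical labels and predictions where precision is higher than recall:

```python
from sklearn.metrics import f1_score, fbeta_score

# Hypothetical labels and hard predictions (precision 0.67, recall 0.50).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

f1 = f1_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2.0)    # leans toward recall
f05 = fbeta_score(y_true, y_pred, beta=0.5)   # leans toward precision
print(f"F1={f1:.3f}  F2={f2:.3f}  F0.5={f05:.3f}")
```

Because precision exceeds recall here, F0.5 comes out highest and F2 lowest, which is a compact way to show what beta does.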
Use F1 carefully. It ignores true negatives, depends on a threshold, and may not match business cost. If false positives are ten times cheaper than false negatives, F1 is probably not the right final metric. A cost-weighted metric or explicit threshold policy is better.
Good interview line: “F1 is a reasonable offline summary when positive class performance matters and precision/recall are similarly important, but I would still choose a threshold based on business costs or capacity.”
Micro, macro, and weighted averaging
For multiclass classification, metrics can be averaged in different ways.
Micro average aggregates TP, FP, and FN globally before computing the metric. It gives more weight to common classes.
Macro average computes the metric per class and then takes a simple average. It treats rare and common classes equally, which is useful when minority class performance matters.
Weighted macro average computes per-class metrics and weights by class support. It sits between micro and macro.
Example: in a support ticket classifier, the “billing” class may dominate volume while “security incident” is rare but important. Micro F1 can look strong even if security incident recall is poor. Macro recall or per-class recall would reveal the risk.
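A small sketch of that effect on a toy ticket classifier, assuming scikit-learn (class names and counts are hypothetical):

```python
from sklearn.metrics import f1_score, recall_score

# Hypothetical tickets: "billing" dominates volume, "security" is rare but important.
y_true = ["billing"] * 8 + ["security"] * 2
y_pred = ["billing"] * 8 + ["billing", "security"]  # one security ticket missed

for average in ("micro", "macro", "weighted"):
    score = f1_score(y_true, y_pred, average=average)
    print(f"{average:8s} F1 = {score:.3f}")

# Per-class recall exposes the rare-class miss that micro F1 hides.
print(recall_score(y_true, y_pred, average=None, labels=["billing", "security"]))
```

In this toy run, micro F1 is 0.90 while security recall is only 0.50, which is exactly the gap the macro and per-class views reveal.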
Calibration, log loss, and Brier score
Classification metrics are not only about hard labels. If the model outputs probabilities, calibration matters.
Log loss rewards confident correct predictions and heavily penalizes confident wrong predictions. It is common for probabilistic classifiers and training objectives.
Brier score is the mean squared error between predicted probabilities and actual outcomes. It is easier to explain and directly measures probability accuracy.
Calibration curves compare predicted probability buckets against observed frequency. If users scored around 0.7 convert about 70% of the time, that bucket is well calibrated.
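A sketch of these three checks, assuming scikit-learn and toy predicted probabilities (illustrative numbers only):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import log_loss, brier_score_loss

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.3, 0.7, 0.9, 0.4, 0.6, 0.2, 0.8])

print("log loss   :", round(log_loss(y_true, y_prob), 3))
print("Brier score:", round(brier_score_loss(y_true, y_prob), 3))

# Bucketed reliability: mean predicted probability per bin vs observed positive rate.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
for pred, obs in zip(mean_pred, frac_pos):
    print(f"predicted={pred:.2f}  observed={obs:.2f}")
```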
Calibration matters when decisions depend on expected value. For example, if a loan model score is used to price risk, ranking alone is not enough. The score must correspond to actual probability.
Choosing the right metric in interviews
Use this framework:
- Define the positive class. Be explicit. In fraud, positive likely means fraud. In churn, positive means churn.
- Name the action. Flag, block, send to review, recommend, diagnose, route, or rank.
- Compare error costs. Which is worse: FP or FN? By how much?
- Check capacity. Is there a fixed review budget, top-k list, or latency constraint?
- Account for prevalence. Is the positive class rare? If yes, PR metrics often matter.
- Separate offline and online. Offline metrics select candidates; online metrics prove product impact.
- Inspect slices. Overall performance can hide failures by segment, geography, language, or device.
A sample answer: “For a rare fraud detection problem, I would track PR-AUC offline because precision at useful recall is more informative than ROC-AUC. For launch, I would choose a threshold based on analyst capacity and expected fraud loss, then monitor precision at the review-queue size, recall on confirmed fraud, false-positive complaints, and latency.”
Common metric traps
- Using accuracy on imbalanced data. A model that predicts “not fraud” for everything can have high accuracy and zero usefulness.
- Optimizing AUC but launching a bad threshold. AUC is threshold-independent; the product action is not.
- Ignoring calibration. Good ranking does not guarantee meaningful probabilities.
- Comparing PR-AUC across different base rates. Prevalence changes the baseline.
- Only reporting averages. Slice performance can reveal fairness, safety, or data-quality issues.
- Forgetting label delay. Fraud, churn, and medical outcomes may be confirmed later, so evaluation windows matter.
- Treating F1 as business value. F1 is a proxy, not a cost model.
How to practice metric questions
Take three products and define metrics: fraud detection, job recommendation, and medical triage. For each, write the confusion matrix, define the positive class, choose an offline ranking metric, choose a threshold metric, and name two guardrails. Then explain what you would do if precision improves but recall drops.
Strong candidates answer metric questions like product-minded scientists. They know the formulas, but they also know formulas are not the decision. Classification metrics are a language for surfacing tradeoffs; the interview win is showing that you can choose the metric that matches the real cost of being wrong.
Classification metrics for ML interviews: a reusable answer pattern
When an interviewer asks “Which metric would you use?”, do not answer with a single metric first. Use a pattern. Start by defining the positive class and the decision. Then say which error is more expensive. Then choose an offline metric, a threshold metric, and an online or operational metric.
For example: “If the model flags abusive content for human review, positive means abusive. False negatives leave harmful content up; false positives waste reviewer time and may punish good users. Offline I would look at PR-AUC because positives are likely rare, plus recall at a precision floor. For the operating point, I would choose a threshold based on reviewer capacity and monitor precision in the review queue. Online, I would track confirmed abuse removed, appeal rate, reviewer backlog, and latency.”
That answer is stronger than “use F1” because it links the metric to a workflow. It also leaves room for constraints. If legal or safety policy requires a minimum recall, say that the threshold must satisfy that constraint first. If the business has a fixed intervention budget, precision@k or recall@k may be more actionable than global F1.
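A minimal precision@k and recall@k sketch for a fixed review budget, with hypothetical scores and labels (the helper function and numbers are assumptions for illustration):

```python
import numpy as np

def precision_recall_at_k(y_true, scores, k):
    """Precision and recall if only the top-k scored items are acted on."""
    y_true = np.asarray(y_true)
    order = np.argsort(scores)[::-1]          # highest score first
    top_k = y_true[order[:k]]
    return top_k.sum() / k, top_k.sum() / y_true.sum()

# Hypothetical scores for 10 items, 3 truly positive, and a budget of 4 reviews.
y_true = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
scores = [0.2, 0.9, 0.3, 0.1, 0.7, 0.4, 0.6, 0.8, 0.5, 0.05]
print(precision_recall_at_k(y_true, scores, k=4))
```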
Practice this pattern with fraud, medical triage, churn prediction, and search safety. The formulas stay the same, but the right metric changes because the cost of being wrong changes. That is the core idea interviewers are testing.
Related guides
- ML Evaluation Metrics for Interviews: Offline vs Online and Choosing the Right Metric — A senior-level guide to ML evaluation metrics in interviews: how to separate offline validation from online impact, pick metrics by task type, avoid leakage, and defend launch decisions.
- Feature Engineering for ML Interviews — Encoding, Scaling, and Leakage Avoidance — A practical interview guide to feature engineering for ML interviews: how to choose encodings, scale safely, prevent leakage, and explain tradeoffs under whiteboard pressure.
- ML System Design Interview Template — Problem Framing, Metrics, Modeling, and Serving — A reusable ML system design interview template covering problem framing, data, labels, offline and online metrics, model choices, serving architecture, monitoring, and common traps.
- Big-O Complexity Cheatsheet for Coding Interviews 2026 — A no-fluff Big-O reference card covering every complexity class, data structure, and algorithm pattern you'll face in coding interviews.
- Browser Rendering Interview Guide — CRP, Repaint vs Reflow, and Performance Metrics — A browser rendering interview guide covering the critical rendering path, DOM/CSSOM, repaint vs reflow, compositing, Core Web Vitals, DevTools, and practical performance fixes.
