ML Evaluation Metrics for Interviews: Offline vs Online and Choosing the Right Metric

9 min read · April 25, 2026

A senior-level guide to ML evaluation metrics in interviews: how to separate offline validation from online impact, pick metrics by task type, avoid leakage, and defend launch decisions.


ML evaluation metrics for interviews are about proving that a model is useful, not just that it scores well in a notebook. Offline metrics help you compare models before launch. Online metrics show whether the product, user, or business outcome actually improves. Choosing the right metric means understanding the task, the decision being made, the cost of errors, the data-generating process, and the ways a metric can be gamed.

This guide gives you a practical framework for answering evaluation questions in ML interviews. The strongest answers separate offline vs online measurement, name the metric that fits the problem, call out guardrails, and explain how they would debug disagreement between validation results and production behavior.

Start with the evaluation question

Before naming metrics, define the problem. A good interview answer begins with clarifying questions:

  • What is the model predicting or ranking?
  • What action is taken from the prediction?
  • Is the output a class, probability, score, ranking, generated text, or numeric estimate?
  • What is the cost of a false positive, false negative, bad rank, or large error?
  • Is there a capacity limit, such as a review queue or notification budget?
  • How delayed or noisy are labels?
  • Are there fairness, safety, latency, privacy, or compliance guardrails?

Metrics are not universal. A churn model, search ranker, fraud detector, ETA predictor, and support-ticket classifier all need different evaluation plans even if they use similar algorithms.

Offline vs online metrics

Offline metrics are computed on historical validation or test data. They are fast, cheap, repeatable, and useful for model selection. Examples include accuracy, F1, ROC-AUC, PR-AUC, RMSE, MAE, NDCG, log loss, calibration error, and slice metrics.

Online metrics are measured in production, often through an A/B test or staged rollout. Examples include conversion, retention, revenue, fraud loss, time saved, click-through rate, user complaints, latency, manual review load, and downstream task success.

The relationship is simple: offline metrics are evidence that a model might help; online metrics are evidence that it does help. A model can improve offline AUC and hurt product metrics if the data is stale, labels are biased, the threshold is wrong, latency increases, or users react differently than history suggests.

A strong interview phrase: “I would use offline metrics to narrow candidates, but I would not launch solely on offline performance. I’d define an online primary metric and guardrails before the experiment.”

Choosing metrics by task type

Use the model output to choose the first metric family.

| Task | Useful offline metrics | Common online metrics |
|---|---|---|
| Binary classification | PR-AUC, ROC-AUC, F1, log loss, calibration | Conversion, fraud loss, review precision, complaints |
| Multiclass classification | Macro/micro F1, per-class recall, log loss | Routing accuracy, handle time, escalation rate |
| Regression | MAE, RMSE, MAPE, pinball loss | ETA error, user satisfaction, cost impact |
| Ranking/search | NDCG, MAP, MRR, recall@k | CTR, long clicks, reformulations, retention |
| Recommendation | Recall@k, precision@k, NDCG, coverage | Engagement, saves, purchases, diversity guardrails |
| Clustering | Silhouette, purity if labels exist, stability | Analyst usefulness, segment actionability |
| Generative AI | Task success, factuality checks, rubric scores | Resolution rate, CSAT, escalation, safety incidents |

This table is a starting point, not a substitute for product thinking. The same metric can be right or wrong depending on the action. For example, precision@k is natural when only k recommendations are shown; recall across the full catalog may be less relevant to the user experience.

Classification: beyond accuracy

For classification, accuracy is only safe when classes are balanced and error costs are similar. In rare-event problems, accuracy can be misleading. A fraud model that predicts “not fraud” every time may be highly accurate and completely useless.

Use ROC-AUC when you care about ranking positives above negatives across thresholds and class imbalance is not extreme. Use PR-AUC when the positive class is rare and precision at useful recall matters. Use F1 or F-beta when you need a thresholded summary of precision and recall. Use log loss and calibration when predicted probabilities drive expected-value decisions.
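To make the distinction concrete, here is a minimal sketch of how these scores are computed on a toy prediction set, assuming scikit-learn is available. The threshold-free scores (ROC-AUC, PR-AUC, log loss, Brier) take predicted probabilities, while F1 needs an explicit decision threshold.

```python
# Minimal sketch: threshold-free vs thresholded binary-classification metrics.
import numpy as np
from sklearn.metrics import (
    roc_auc_score,            # ranking quality across all thresholds
    average_precision_score,  # PR-AUC; sensitive to rare positives
    f1_score,                 # thresholded precision/recall summary
    log_loss,                 # penalizes miscalibrated probabilities
    brier_score_loss,         # simple calibration-sensitive score
)

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])  # rare positive class
y_prob = np.array([0.10, 0.20, 0.15, 0.30, 0.80, 0.40, 0.55, 0.05, 0.20, 0.35])
y_pred = (y_prob >= 0.5).astype(int)                # explicit decision threshold

print("ROC-AUC :", roc_auc_score(y_true, y_prob))
print("PR-AUC  :", average_precision_score(y_true, y_prob))
print("F1 @0.5 :", f1_score(y_true, y_pred))
print("Log loss:", log_loss(y_true, y_prob))
print("Brier   :", brier_score_loss(y_true, y_prob))
```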

For multiclass tasks, inspect per-class performance. Macro averages treat each class equally; micro averages emphasize common classes. If a rare class is high risk, macro recall or minimum per-class recall may matter more than overall accuracy.
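A small illustration of the gap between the two averages, again assuming scikit-learn and a toy label set in which the rare class is never predicted:

```python
# Minimal sketch: macro vs micro averaging on an imbalanced multiclass problem.
from sklearn.metrics import f1_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]  # rare class 2 is always missed

print("micro F1:", f1_score(y_true, y_pred, average="micro"))  # dominated by class 0
print("macro F1:", f1_score(y_true, y_pred, average="macro"))  # exposes the missed class
print("per-class recall:", recall_score(y_true, y_pred, average=None))
```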

Regression: MAE, RMSE, and business cost

For regression, the main question is how errors should be penalized.

MAE measures average absolute error. It is robust and easy to explain: “We are off by this much on average.”

RMSE penalizes large errors more because it squares them. It is useful when big misses are disproportionately bad, such as delivery ETA promises or demand forecasts where stockouts are costly.

MAPE expresses error as a percentage, but it breaks down near zero and can overemphasize small denominators. Use it carefully.

Pinball loss is useful for quantile prediction. If the product needs a 90th percentile delivery estimate, average error is not enough; the model must estimate a distribution or quantile.
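A short sketch of these regression losses, assuming NumPy and scikit-learn; the `pinball_loss` helper is hand-rolled here for illustration rather than taken from a library.

```python
# Minimal sketch: MAE, RMSE, MAPE, and a pinball (quantile) loss for a p90 estimate.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 12.0, 9.0, 30.0, 11.0])
y_pred = np.array([11.0, 10.0, 9.5, 20.0, 12.0])    # one large miss (30 -> 20)

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # squaring punishes the big miss
mape = np.mean(np.abs((y_true - y_pred) / y_true))  # unstable when y_true is near zero

def pinball_loss(y_true, y_pred, quantile=0.9):
    # Under-predictions cost `quantile` per unit, over-predictions cost `1 - quantile`.
    diff = y_true - y_pred
    return np.mean(np.maximum(quantile * diff, (quantile - 1) * diff))

print(mae, rmse, mape, pinball_loss(y_true, y_pred, quantile=0.9))
```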

Always connect regression metrics to the decision. For ETA, a late estimate and an early estimate may have different user costs. For pricing, underpricing and overpricing may have asymmetric business effects.

Ranking and recommendation metrics

Ranking models are evaluated by order, not just item-level correctness. Common metrics, with a small implementation sketch after the list:

  • Precision@k: Of the top k items shown, how many are relevant?
  • Recall@k: Of all relevant items, how many appeared in the top k?
  • MRR: Mean reciprocal rank of the first relevant item. Useful when one good answer matters.
  • MAP: Mean average precision. Useful when multiple relevant items matter and rank order matters.
  • NDCG: Discounts relevance by position and supports graded relevance. Very common for search and recommendations.
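These metrics are simple enough to compute by hand for a single ranked list. The sketch below, assuming NumPy and graded relevance labels, shows per-query precision@k, recall@k, reciprocal rank, and NDCG; averaging over queries gives MRR and mean NDCG.

```python
# Minimal sketch of per-query ranking metrics. `relevance` holds graded
# relevance labels for items in the order the ranker returned them.
import numpy as np

def precision_at_k(relevance, k):
    return np.mean(np.asarray(relevance[:k]) > 0)

def recall_at_k(relevance, k, total_relevant):
    return np.sum(np.asarray(relevance[:k]) > 0) / total_relevant

def reciprocal_rank(relevance):
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg_at_k(relevance, k):
    rel = np.asarray(relevance[:k], dtype=float)
    return np.sum((2 ** rel - 1) / np.log2(np.arange(2, rel.size + 2)))

def ndcg_at_k(relevance, k):
    ideal_dcg = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal_dcg if ideal_dcg > 0 else 0.0

ranked_relevance = [0, 2, 1, 0, 0]           # graded labels in ranked order
print(precision_at_k(ranked_relevance, 3))   # 2 of the top 3 are relevant
print(reciprocal_rank(ranked_relevance))     # first relevant item at rank 2 -> 0.5
print(ndcg_at_k(ranked_relevance, 5))
```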

Offline ranking metrics can disagree with online metrics because logged data is biased by the old ranking system. Items shown in the past are more likely to have labels, while hidden items have missing feedback. Mention position bias, selection bias, and exploration if the interview is senior-level.

Online ranking guardrails often include latency, diversity, fairness, creator ecosystem health, user hides/reports, and long-term retention. Optimizing click-through alone can reward clickbait or short-term engagement at the expense of trust.

Guardrail and slice metrics

A launch metric is incomplete without guardrails. Guardrails are metrics that must not degrade while the primary metric improves.

Examples:

  • Latency and error rate for any production model
  • False-positive complaint rate for enforcement systems
  • Calibration drift for risk scores
  • Manual review backlog for fraud or abuse queues
  • Per-segment recall for safety-critical classifiers
  • Diversity and novelty for recommendation systems
  • Cost per prediction for expensive models
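One way to operationalize this is a launch gate that refuses to ship if any guardrail degrades past an agreed limit, no matter how good the primary metric looks. The sketch below is purely illustrative; the metric names and thresholds are hypothetical placeholders, not a standard API.

```python
# Minimal sketch of a launch gate: the primary metric must improve and no
# guardrail may degrade beyond its allowed delta vs the control arm.
GUARDRAIL_LIMITS = {                 # metric -> worst allowed change vs control
    "p95_latency_ms": +10.0,
    "false_positive_complaints": 0.0,
    "cost_per_prediction_usd": +0.0005,
}

def launch_ok(primary_delta, guardrail_deltas, min_primary_gain=0.0):
    """True only if the primary metric improves and every guardrail stays
    within its allowed degradation."""
    if primary_delta <= min_primary_gain:
        return False
    return all(
        guardrail_deltas.get(name, 0.0) <= limit
        for name, limit in GUARDRAIL_LIMITS.items()
    )

print(launch_ok(0.8, {"p95_latency_ms": 4.0,
                      "false_positive_complaints": -0.1,
                      "cost_per_prediction_usd": 0.0002}))
```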

Slice metrics are equally important. Overall performance can hide failures for new users, low-resource languages, small sellers, rural locations, mobile devices, or edge-case content. In interviews, say you would inspect slices tied to product risk and data distribution, not every possible slice mechanically.
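In practice a slice analysis is often just a group-by over the evaluation frame. The sketch below assumes pandas and scikit-learn, with hypothetical column names, and shows how an acceptable aggregate recall can hide a weak new-user slice.

```python
# Minimal sketch: overall recall looks fine while one slice regresses.
import pandas as pd
from sklearn.metrics import recall_score

df = pd.DataFrame({
    "segment": ["new_user", "new_user", "tenured", "tenured", "tenured"],
    "y_true":  [1, 1, 1, 0, 1],
    "y_pred":  [0, 1, 1, 0, 1],
})

overall = recall_score(df["y_true"], df["y_pred"])
by_slice = df.groupby("segment").apply(
    lambda g: recall_score(g["y_true"], g["y_pred"])
)
print(overall)    # acceptable in aggregate
print(by_slice)   # new_user recall is much lower
```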

Data leakage, label delay, and validation design

Many evaluation failures come from bad validation, not bad algorithms.

Data leakage happens when training features include information unavailable at prediction time. Examples include using a post-conversion event to predict conversion, randomly splitting time-dependent data, or normalizing with statistics computed from the full dataset.
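The normalization case is worth seeing in code because it looks innocent. The sketch below, assuming scikit-learn, contrasts a leaky pipeline (scaler fit on all rows before splitting) with the correct one (scaler fit on the training split only).

```python
# Minimal sketch of a subtle leakage pattern: computing normalization
# statistics on the full dataset before the train/test split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.randn(1000, 5)
y = (X[:, 0] > 0).astype(int)

# Leaky: test rows influence the mean/std applied at "training" time.
X_scaled = StandardScaler().fit_transform(X)
X_train_bad, X_test_bad, *_ = train_test_split(X_scaled, y, random_state=0)

# Correct: fit preprocessing on the training split only, then apply to test.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_ok, X_test_ok = scaler.transform(X_train), scaler.transform(X_test)
```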

Label delay matters when outcomes arrive later. Fraud chargebacks, churn, loan default, and medical outcomes may be known days or months after prediction. A validation window must match the real decision timeline.

Train/test split design should reflect deployment. For time-series or product behavior, use time-based splits. For user-level models, split by user when interactions from the same user would leak across rows. For marketplace or geography-sensitive models, consider group-aware validation.
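In scikit-learn terms, this usually means choosing a splitter that encodes the deployment constraint rather than a random row split; a minimal sketch:

```python
# Minimal sketch: splits that mirror deployment instead of random rows.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, GroupKFold

X = np.arange(20).reshape(-1, 1)
y = np.zeros(20)
user_ids = np.repeat(np.arange(5), 4)        # 5 users, 4 rows each

# Time-based: always train on the past, validate on the future.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    assert train_idx.max() < test_idx.min()

# Group-aware: a user's rows land entirely in train or in test, never both.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=user_ids):
    assert set(user_ids[train_idx]).isdisjoint(user_ids[test_idx])
```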

A strong answer includes: “I would make sure the offline test set reflects the production prediction point and that no features are computed using future information.”

When offline and online disagree

This is a favorite senior interview topic. If offline metrics improve but online metrics do not, investigate:

  1. Metric mismatch. The offline metric may not reflect the product goal.
  2. Bad threshold or policy. Ranking improved, but the decision cutoff is wrong.
  3. Data distribution shift. Production traffic differs from validation data.
  4. Logging or instrumentation bugs. Online measurement may be wrong.
  5. Latency or reliability cost. A better model may slow the product.
  6. Feedback loops. User behavior changes when the model changes.
  7. Biased labels. Historical labels reflect the old policy, not ground truth.
  8. Segment regressions. Gains in large segments hide losses in important small ones.

Do not immediately assume the model is bad. Debug the measurement system, policy layer, and deployment context.

A metric-selection script for interviews

Use this structure when answering:

  1. “First I’d define the action and error costs.”
  2. “For offline model selection, I’d use [metric] because [reason].”
  3. “I’d also track [secondary metric] to catch [risk].”
  4. “For launch, the primary online metric would be [business/user outcome].”
  5. “Guardrails would include [latency/safety/fairness/cost].”
  6. “I’d inspect slices by [important dimensions].”
  7. “If offline and online disagree, I’d check leakage, distribution shift, thresholding, and instrumentation.”

Example for support-ticket routing: “Offline I’d track macro F1 and per-class recall because rare urgent categories matter. I’d use log loss if probabilities feed routing confidence. Online I’d measure correct routing rate, handle time, escalation rate, and customer satisfaction, with guardrails on urgent-ticket recall and latency.”

Common mistakes candidates make

  1. Picking accuracy by default.
  2. Naming AUC without explaining threshold or action.
  3. Ignoring calibration when probabilities are used as probabilities.
  4. Treating offline improvement as launch proof.
  5. Forgetting latency, cost, and reliability.
  6. Averaging away important segments.
  7. Comparing metrics across datasets with different label definitions.
  8. Missing label delay and leakage.
  9. Optimizing a proxy that users can game or that the model can exploit.
  10. Failing to define the positive class or business objective.

What strong ML metric answers sound like

Strong answers are structured and humble. They acknowledge uncertainty, choose a metric based on the decision, and include a plan to validate the metric itself. They do not worship one number. For offline work, they pick metrics that match the model type and error costs. For online work, they define a primary outcome, guardrails, and slice checks. For launch, they ask whether the model improves the system, not whether it wins a leaderboard.

That is the heart of ML evaluation metrics in interviews: make the metric serve the decision. If you can explain that clearly, with examples and failure modes, you will sound like someone who has evaluated real models rather than someone who memorized a metric glossary.

ML evaluation metrics for interviews: a reusable answer pattern

A reliable interview answer has three layers: model quality, decision quality, and system quality. Model quality is the offline score: AUC, RMSE, NDCG, log loss, or whatever fits the task. Decision quality is whether the threshold, ranking cutoff, or policy creates better outcomes for users or operators. System quality is whether the model can run safely in production with acceptable latency, cost, reliability, and segment behavior.

For any prompt, try this sentence: “Offline I would optimize for the metric that best matches the prediction task, but I would launch on an online metric tied to the product action, with guardrails for safety, latency, cost, and important slices.” Then fill in the specifics. For a ranking model, that might be offline NDCG@k, online successful sessions, and guardrails for latency and diversity. For a risk model, it might be PR-AUC, expected loss reduction, and guardrails for false-positive complaints and calibration drift.

This structure prevents the most common mistake: treating evaluation as a leaderboard. Real ML systems are embedded in products, queues, policies, and user behavior. The right metric is the one that makes the decision better without breaking the system around it.