
ML System Design Interview Template — Problem Framing, Metrics, Modeling, and Serving

10 min read · April 25, 2026

A reusable ML system design interview template covering problem framing, data, labels, offline and online metrics, model choices, serving architecture, monitoring, and common traps.

An ML system design interview template is useful because most machine learning interviews are not really about naming the fanciest model. They test whether you can turn a product problem into a reliable prediction system: define the objective, choose data and labels, select metrics, design training and serving, and explain how the system behaves after launch. The winning answer is practical, not academic.

Use this guide as a reusable structure for prompts like “design a recommendation system,” “detect fraud,” “rank search results,” “predict churn,” or “build a content moderation classifier.”

The interviewer’s hidden scorecard

ML system design interviews usually evaluate five signals:

| Signal | What strong looks like |
|---|---|
| Problem framing | You define the user/business objective before model details |
| Data judgment | You know what labels, features, and leakage risks matter |
| Metric selection | You separate offline metrics from online product metrics |
| Architecture | You can describe training, serving, fallback, and latency constraints |
| Operational maturity | You discuss monitoring, retraining, fairness, abuse, and failure modes |

A senior answer keeps connecting model choices back to the product. If you jump directly to transformers, embeddings, or gradient boosting without defining the target, you look unstructured.

The 8-step ML system design template

Use this order unless the interviewer pulls you elsewhere.

  1. Clarify the product objective. What decision will the model support? Ranking, classification, generation, forecasting, anomaly detection, or matching?
  2. Define users and actions. Who consumes the output, and what action happens because of it?
  3. Choose success metrics. Pick online metrics, offline metrics, and guardrails.
  4. Specify labels. Decide what the model predicts and where labels come from.
  5. Design features and data pipelines. Include freshness, missing data, privacy, and leakage.
  6. Select modeling approach. Start with a baseline, then add complexity only when justified.
  7. Design serving architecture. Cover batch vs online inference, latency, caching, fallback, and experimentation.
  8. Monitor and iterate. Track drift, quality, bias, abuse, cost, and retraining cadence.

A crisp opening might be:

“I’ll first define the product decision and metrics, then work through labels, features, modeling, serving, and monitoring. I’ll start with a simple baseline and only add complexity where it changes the business outcome.”

That tells the interviewer you have a map.

Worked example: job recommendation ranking

Prompt: “Design an ML system to recommend jobs to candidates.”

Objective. Help active job seekers find roles they are likely to apply to and be qualified for. The product decision is ranking jobs in a feed or search results page.

Users. Candidates want relevant opportunities with a realistic chance of response. Employers want qualified applicants, not spray-and-pray volume. The marketplace has a two-sided quality constraint.

Online metrics.

| Metric | Purpose |
|---|---|
| Qualified apply rate | Primary product outcome |
| Interview callback rate | Quality signal beyond clicks |
| Employer rejection/spam rate | Guardrail against low-quality applications |
| Candidate hide/report rate | Guardrail for relevance and trust |
| Feed latency | Serving guardrail |

Offline metrics. Use NDCG or MAP for ranking quality, AUC or log loss for binary apply prediction, and calibration error if the score is shown or used in thresholds. But be clear: offline metrics are proxies. The launch decision depends on online behavior.
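
A minimal sketch of how these offline proxies might be computed with scikit-learn, assuming apply labels and model scores have already been joined per impression (all numbers below are illustrative):

```python
import numpy as np
from sklearn.metrics import ndcg_score, roc_auc_score, log_loss

# One row per impression: true apply label and predicted apply probability.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([0.82, 0.31, 0.45, 0.77, 0.12, 0.60, 0.55, 0.20])

print("AUC:", roc_auc_score(y_true, y_score))
print("Log loss:", log_loss(y_true, y_score))

# NDCG is computed per query, here one candidate session of four ranked jobs.
session_labels = np.array([[1, 0, 0, 1]])
session_scores = np.array([[0.82, 0.31, 0.45, 0.77]])
print("NDCG@4:", ndcg_score(session_labels, session_scores, k=4))
```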

Labels. A naive label is “candidate clicked apply.” Better labels include completed application, employer viewed application, interview callback, or positive employer action. The challenge is delay: callback labels arrive days or weeks later. A practical system uses a short-term label for fast iteration and a delayed quality label for calibration.
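
One hedged way to implement the fast-label/delayed-label split is to train the ranker on the apply signal and recalibrate its scores against callbacks once they arrive, for example with isotonic regression; the data and variable names below are purely illustrative:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Scores from the apply-prediction model on older traffic where the delayed
# callback label is now known.
apply_scores = np.array([0.1, 0.2, 0.35, 0.5, 0.65, 0.8, 0.9])
callback_label = np.array([0, 0, 0, 1, 0, 1, 1])

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(apply_scores, callback_label)

# At serving time, map fresh apply scores onto the delayed-quality scale.
fresh_scores = np.array([0.3, 0.7])
print(calibrator.predict(fresh_scores))
```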

Features. Candidate-side features: title history, skills, seniority, location, remote preference, salary preference, industries, recent searches, saved jobs, and negative signals like hidden companies. Job-side features: title, required skills, seniority, location, salary range if available, company size, industry, remote policy, freshness, and historical response rates. Cross features: skill overlap, seniority match, commute distance, salary compatibility, and similarity to jobs the candidate previously engaged with.

Leakage risks. Do not train on post-ranking exposure signals without controlling for position bias. If the old system showed certain jobs at the top, clicks reflect exposure as much as relevance. Use counterfactual logging, randomized exploration buckets, or position-aware training.
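
A rough sketch of the inverse-propensity-weighting idea, assuming per-position examination propensities have been estimated from a small randomized exploration bucket (the numbers are made up):

```python
# Estimated probability that a relevant job is examined at each position.
position_propensity = {1: 1.0, 2: 0.7, 3: 0.5, 4: 0.35, 5: 0.25}

def ipw_weight(position: int, clicked: int) -> float:
    """Clicked examples are up-weighted by 1 / propensity of their position."""
    if clicked:
        return 1.0 / position_propensity.get(position, 0.1)
    return 1.0  # unclicked examples keep unit weight in this simple variant

# These weights are passed as sample weights into the ranker's training loss.
rows = [(1, 1), (3, 1), (5, 0)]  # (position shown, clicked)
weights = [ipw_weight(p, c) for p, c in rows]
print(weights)
```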

Modeling. Start with a two-stage ranking system. Stage one retrieves candidate jobs using lexical matching, embedding similarity, and hard filters. Stage two ranks the few hundred retrieved jobs with a gradient-boosted tree or neural ranker. A third re-ranking layer can enforce diversity, freshness, salary fit, and employer quality.
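
A compressed sketch of that two-stage layout, with random placeholder embeddings and features standing in for real ones and scikit-learn's gradient boosting as the stage-two ranker:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Stage 1: retrieval by embedding similarity (an ANN index in production).
job_embs = rng.normal(size=(10_000, 32))
query_emb = rng.normal(size=32)
retrieved = np.argsort(-(job_embs @ query_emb))[:300]

# Stage 2: rank the retrieved set with a gradient-boosted model trained on
# richer cross features (random placeholders here).
ranker = GradientBoostingRegressor().fit(rng.normal(size=(500, 8)), rng.normal(size=500))
cross_features = rng.normal(size=(len(retrieved), 8))
order = np.argsort(-ranker.predict(cross_features))
ranked_job_ids = retrieved[order]
print(ranked_job_ids[:10])
```

A third re-ranking pass would then apply diversity, freshness, and quality rules to the top of this list before it reaches the feed.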

Serving. Candidate opens the feed. The system retrieves eligible jobs from a search index, scores them with a ranking service, applies business rules and guardrails, caches results for a short window, and logs impressions, positions, scores, clicks, applies, and downstream employer actions. If the ranker fails, fall back to a rules-based search relevance sort.

Monitoring. Track distribution drift in titles, skills, locations, and salary ranges. Monitor apply rate by segment, callback rate, latency, null features, and the share of jobs repeatedly shown but hidden. Retrain daily or weekly depending on data volume, but recalibrate delayed labels less frequently.
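
One common drift check is the population stability index (PSI); a minimal sketch, assuming you can sample the training and serving distributions of a feature such as salary (alerting thresholds around 0.1 to 0.25 are common heuristics):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between the training (expected) and serving (actual) distributions."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]  # inner cut points
    e = np.bincount(np.digitize(expected, cuts), minlength=bins) / len(expected)
    a = np.bincount(np.digitize(actual, cuts), minlength=bins) / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

train_salary = np.random.normal(90_000, 20_000, 50_000)
live_salary = np.random.normal(100_000, 25_000, 5_000)
print("salary PSI:", psi(train_salary, live_salary))
```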

That answer is broad enough for system design and specific enough to show ML judgment.

Problem framing: the part candidates rush

Before metrics or models, define the actual decision. For example:

  • Fraud detection decides whether to block, challenge, review, or allow a transaction.
  • Ads ranking decides which ad to show and at what position.
  • Churn prediction decides which customers receive outreach.
  • Content moderation decides remove, demote, age-gate, send to human review, or allow.
  • Forecasting decides inventory, staffing, pricing, or alerting.

The action matters because it sets tolerance for false positives and false negatives. A fraud model can tolerate friction for suspicious transactions but cannot block legitimate rent payments casually. A medical triage model needs a different error profile than a playlist recommender. Say this explicitly.

A good framing sentence:

“The model score is not the product. The product decision is whether to show, hide, rank, route, or intervene. I’ll design the model around that decision and the cost of mistakes.”

Metrics: offline, online, and guardrail

For classification, offline metrics might be precision, recall, F1, AUC, PR-AUC, log loss, or calibration. For ranking, use NDCG, MRR, MAP, recall@K, or diversity metrics. For forecasting, use MAE, RMSE, MAPE, quantile loss, or service-level accuracy.
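
For the forecasting case, a small sketch of the point and quantile losses named above, on made-up demand numbers:

```python
import numpy as np

y = np.array([120.0, 80.0, 150.0, 60.0])      # actual demand
yhat = np.array([110.0, 95.0, 140.0, 70.0])   # point forecast

mae = np.mean(np.abs(y - yhat))
rmse = np.sqrt(np.mean((y - yhat) ** 2))
mape = np.mean(np.abs((y - yhat) / y))

def quantile_loss(y, q_pred, q=0.9):
    """Pinball loss: under-forecasting is penalized more heavily when q > 0.5."""
    diff = y - q_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

print(mae, rmse, mape, quantile_loss(y, yhat, q=0.9))
```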

But every ML system design interview needs an online metric layer:

| System | Primary online metric | Guardrails |
|---|---|---|
| Fraud | Fraud loss prevented | False-positive rate, manual review load, customer churn |
| Recommendations | Long-term engagement or conversion | Repetition, complaints, creator/seller concentration |
| Search ranking | Successful search sessions | Latency, reformulation rate, zero-result rate |
| Churn | Retained revenue | Discount overuse, sales workload, customer annoyance |
| Moderation | Harmful content exposure reduced | Appeals, creator churn, reviewer load |

A mature answer explains thresholding. If the false-positive cost is high, use a review queue. If latency is tight, move complex features offline. If labels are imbalanced, use PR-AUC and precision at fixed recall instead of AUC alone.
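
A minimal sketch of picking an operating threshold at a fixed recall floor with scikit-learn, using toy labels and scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0])
scores = np.array([0.05, 0.2, 0.8, 0.3, 0.65, 0.1, 0.4, 0.9, 0.15, 0.55, 0.7])

print("PR-AUC:", average_precision_score(y_true, scores))

precision, recall, thresholds = precision_recall_curve(y_true, scores)
target_recall = 0.75
# precision_recall_curve returns len(thresholds) + 1 points; drop the last one.
meets_floor = recall[:-1] >= target_recall
best = np.argmax(precision[:-1] * meets_floor)  # highest precision meeting the recall floor
print("threshold:", thresholds[best], "precision:", precision[best])
```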

Data and labels: where interviews are won

Modeling often gets attention, but label design is where experienced ML candidates stand out.

Ask:

  • What is the label source: user action, expert review, business outcome, sensor event, or synthetic rule?
  • How delayed is the label?
  • Is the label biased by previous product decisions?
  • Can the label be gamed?
  • Are there multiple labels with different reliability?
  • What data should not be used for privacy, fairness, or leakage reasons?

For a moderation model, user reports are noisy and biased toward visible content. Human review labels are higher quality but expensive and inconsistent. For a loan model, repayment is delayed and shaped by approval policy. For a recommendation model, clicks are cheap but shallow; long-term retention is richer but slower.

Say how you would bootstrap: start with rules or human labels, build a baseline, collect model-assisted review data, and improve the label pipeline over time.

Modeling choices: start simple, then justify complexity

A strong ML system design interview answer rarely starts with the most complex model. It starts with a baseline that makes the failure modes visible.

| Problem | Good baseline | When to add complexity |
|---|---|---|
| Tabular risk scoring | Logistic regression or gradient boosted trees | Nonlinear interactions, large sparse features, sequence behavior |
| Search ranking | BM25 plus rules | Semantic matching, personalization, query intent |
| Recommendations | Collaborative filtering or two-tower retrieval | Cold start, sequence modeling, multi-objective ranking |
| Image classification | Pretrained CNN/ViT fine-tune | Domain shift, small objects, multimodal inputs |
| Text classification | Fine-tuned encoder or compact LLM | Ambiguous intent, multilingual input, reasoning needs |

Explain tradeoffs: interpretability, latency, cost, data requirements, retraining complexity, and debuggability. If you mention deep learning, say what it buys you. If you mention embeddings, say how they are refreshed and monitored.
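
As a sketch of the baseline-first habit for tabular risk scoring, the comparison below trains a logistic regression and a gradient-boosted model on synthetic, imbalanced data and reports only the offline metric gap; real data and online guardrails are assumed to follow:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # coefficients are inspectable
gbt = HistGradientBoostingClassifier().fit(X_tr, y_tr)

for name, model in [("logreg", baseline), ("gbt", gbt)]:
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(name, round(auc, 3))
# Only keep the more complex model if the gap survives latency, cost, and
# online guardrails.
```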

Serving architecture and latency

Serving design depends on the product decision.

Batch inference works when decisions can be precomputed: daily churn scores, weekly lead scoring, nightly recommendations. It is cheaper and easier to monitor.

Online inference is needed when fresh context matters: fraud at transaction time, search ranking with live query text, dynamic pricing, chat safety. It adds latency and reliability pressure.

Hybrid serving is common: precompute candidate embeddings or risk features, then score live context online. For ranking systems, retrieval is often precomputed or index-based, while final ranking is online.
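
A toy sketch of that hybrid split, with an in-memory dict standing in for the low-latency store and invented feature names:

```python
import numpy as np

# Offline job: batch-compute and publish item embeddings (here a plain dict).
precomputed_job_embs = {job_id: np.random.rand(32) for job_id in range(1_000)}

def score_online(candidate_emb: np.ndarray, job_id: int, live_features: dict) -> float:
    """Online path: dot product with the cached embedding plus cheap live signals."""
    sim = float(precomputed_job_embs[job_id] @ candidate_emb)
    freshness_boost = 0.1 if live_features.get("posted_within_24h") else 0.0
    return sim + freshness_boost

print(score_online(np.random.rand(32), job_id=42, live_features={"posted_within_24h": True}))
```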

Cover these components:

  • Feature store or feature pipeline with freshness guarantees.
  • Model registry and versioning.
  • Inference service with latency budget.
  • Cache for stable results.
  • Fallback path if model or features fail.
  • Experimentation framework.
  • Logging for impressions, features, scores, decisions, and outcomes.

A good latency answer is concrete: “For a feed ranking system I would keep p95 scoring under 100 ms by retrieving a few hundred candidates first, precomputing expensive embeddings, and limiting the online ranker to features available in the request or low-latency store.”

Common ML system design traps

Optimizing offline metrics alone. A model with higher AUC can still hurt product quality if it changes exposure, creates feedback loops, or worsens latency.

Ignoring feedback loops. Ranking systems shape their own training data. Fraud systems push attackers to adapt. Moderation systems change what users post.

No fallback. Production systems fail. Always include a rules-based fallback, cached results, human review path, or safe default.

Label leakage. Features that are only known after the decision, such as “customer contacted support after cancellation,” cannot be used at prediction time.
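
One concrete guard is a point-in-time join that drops any feature event recorded after the prediction timestamp; a small pandas sketch with illustrative column names:

```python
import pandas as pd

events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event": ["opened_ticket", "contacted_support_after_cancel", "opened_ticket"],
    "event_time": pd.to_datetime(["2024-01-03", "2024-02-10", "2024-01-05"]),
})
predictions = pd.DataFrame({
    "customer_id": [1, 2],
    "prediction_time": pd.to_datetime(["2024-02-01", "2024-02-01"]),
})

# Keep only feature events that happened before the moment of prediction.
joined = events.merge(predictions, on="customer_id")
usable = joined[joined["event_time"] < joined["prediction_time"]]
print(usable[["customer_id", "event"]])  # the post-cancellation contact is excluded
```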

Overclaiming fairness. Do not say the model is fair because you removed protected attributes. Proxies remain. Discuss measurement by segment, policy constraints, and human review.

Forgetting cost. Inference cost, manual review cost, and annotation cost are part of the design.

Prep checklist

Before your interview, prepare one reusable answer for each system family: ranking, recommendations, fraud/risk, forecasting, classification, and generative AI safety. For each, know the product decision, labels, metrics, baseline model, architecture, monitoring, and failure modes.

Practice drawing the system in layers:

  1. Data sources and logs.
  2. Labeling and training pipeline.
  3. Feature generation.
  4. Model training and evaluation.
  5. Model registry and deployment.
  6. Online serving path.
  7. Monitoring and retraining.

In resumes and interviews, describe ML work in system terms:

  • “Built churn model” is weak.
  • “Designed churn scoring pipeline using billing, usage, and support signals; calibrated thresholds by CSM capacity; reduced preventable churn while keeping outreach volume stable” is strong.

The template is simple: decision, data, model, serving, monitoring. If you can keep returning to those five nouns, you will sound like someone who has shipped ML rather than only trained it.