A/B Testing Interview Questions in 2026 — Power Analysis, Peeking, and SRM
A tactical guide to A/B testing interview questions in 2026, with answer frameworks for power analysis, peeking, sample-ratio mismatch, guardrails, metrics, and experiment trade-offs. Built for product analysts, data scientists, PMs, and growth roles.
A/B testing interview questions in 2026 are less about reciting “randomize users and compare conversion” and more about whether you can protect a company from bad experiment decisions. Expect questions on power analysis, peeking, sample-ratio mismatch, metric design, guardrails, interference, novelty effects, and what you would do when the test result is statistically significant but operationally suspicious.
The best candidates answer A/B testing prompts like operators. They clarify the decision, define the unit of randomization, choose primary and guardrail metrics, size the test before launch, monitor data quality without p-hacking, and explain how the result should or should not change the roadmap.
A/B testing interview questions in 2026: what interviewers are actually testing
Interviewers are looking for three skills at once:
| Skill | What it sounds like in an interview | Weak answer |
|---|---|---|
| Experimental design | “Randomize at user level because exposure persists across sessions.” | “Split traffic 50/50 and compare conversion.” |
| Statistical judgment | “We need enough power to detect the minimum effect that would change the decision.” | “Run it until p < .05.” |
| Product judgment | “A conversion lift is not enough if refund rate or latency gets worse.” | “Ship the winning variant.” |
A clean answer starts with the business decision. “Are we deciding whether to launch, iterate, or stop?” That question shapes sample size, metric sensitivity, and risk tolerance.
Core framework for any A/B testing prompt
Use this structure:
- Decision and hypothesis. What change are we testing and what action follows?
- Population and randomization unit. User, account, device, session, listing, market, or team?
- Exposure definition. Who actually saw the variant?
- Primary metric. The one metric that decides the test.
- Guardrails. Metrics that can block launch even if the primary moves.
- Power and duration. Minimum detectable effect, baseline rate, variance, and seasonality.
- Validity checks. SRM, instrumentation, balance, logging, bot traffic, and assignment stability.
- Decision rule. Ship, hold, segment, rerun, or redesign.
This is what separates a data scientist from a dashboard reader.
Question 1: “How would you design an A/B test for a checkout redesign?”
Strong answer:
“First I would define the decision: launch the redesign if it increases completed purchases without hurting revenue quality or user trust. I would randomize at the user or account level, not the session level, because a returning user might otherwise see both designs. Exposure would be users who reached checkout after assignment. The primary metric might be purchase completion rate among exposed checkout users. Guardrails would include average order value, refund rate, payment failure rate, page latency, support tickets, and downstream cancellation. I would run a power analysis using baseline checkout conversion, the smallest lift worth shipping, desired power, and alpha. Before reading results, I would check sample-ratio mismatch, logging completeness, and whether the variant changed who reaches checkout.”
Notice the nuance: if the redesign affects upstream navigation, “users who reached checkout” may be a post-treatment population. In that case, randomize before the funnel and report both intent-to-treat and exposed-user metrics.
Question 2: “Explain power analysis in plain English.”
Power is the probability that your test detects an effect of a chosen size if that effect is real. The chosen size is the minimum detectable effect, or MDE. In business terms: “If the new experience truly improves conversion by at least the amount we care about, how likely are we to notice?”
In practice you pick baseline conversion, variance, significance threshold, desired power, traffic allocation, and the MDE, then solve for the required sample size. Smaller effects require more users. Noisy metrics require more users. Rare events require many more users. A one-day test on a weekly retention metric is usually underpowered no matter how clean the p-value looks.
A good interview phrasing:
“I would not power the test to detect any non-zero effect. I would power it to detect the smallest effect that would change the product decision. If a 0.1% lift is not worth engineering cost or risk, detecting it is not useful.”
For proportions, candidates do not need to derive formulas from memory, but they should know directionally that sample size rises as variance rises and as MDE shrinks. If asked for a formula, state that for a two-sample proportion test, required sample size is roughly proportional to baseline variance divided by squared MDE. The squared MDE point is the key: cutting the effect size in half roughly quadruples sample needs.
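A minimal sketch of that calculation in Python, using the standard normal-approximation formula for a two-sample proportions test; the baseline rate, MDE, alpha, and power below are illustrative, not a recommendation.

```python
from scipy.stats import norm

def sample_size_per_arm(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided, two-sample proportions test."""
    p_treat = p_baseline + mde_abs
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance threshold
    z_power = norm.ppf(power)           # quantile for the desired power
    variance = p_baseline * (1 - p_baseline) + p_treat * (1 - p_treat)
    return (z_alpha + z_power) ** 2 * variance / mde_abs ** 2

# Illustrative numbers: 5% baseline conversion, 0.5-point absolute lift.
print(round(sample_size_per_arm(0.05, 0.005)))    # roughly 31,000 users per arm
print(round(sample_size_per_arm(0.05, 0.0025)))   # halving the MDE roughly quadruples that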
Question 3: “What is peeking and why is it dangerous?”
Peeking means repeatedly checking the test result and stopping when it looks significant, even though the analysis plan assumed a fixed end date or fixed sample size. It inflates false positives because every look creates another chance for random noise to cross the threshold.
Strong answer:
“If we plan a fixed-horizon test, I would not stop early just because the p-value crosses .05. I would monitor only health and data-quality metrics during the run. If the business requires early stopping, I would use a sequential testing design or alpha-spending approach that accounts for repeated looks.”
Practical nuance: teams can stop a test early for safety. If payment failures spike, latency doubles, or support tickets surge, stop the variant. That is not p-hacking; that is harm prevention. The distinction is between stopping for pre-defined guardrail harm and stopping for a lucky primary-metric readout.
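To make the inflation concrete, here is a small A/A simulation under the null hypothesis; the number of looks and sample sizes are arbitrary, and the exact rate varies by seed, but it lands well above the nominal 5%.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sims, n_per_arm, looks = 2000, 10_000, 10   # 10 interim looks per experiment
false_positives = 0

for _ in range(n_sims):
    # A/A test: both arms drawn from the same distribution, so any "win" is noise.
    a = rng.normal(size=n_per_arm)
    b = rng.normal(size=n_per_arm)
    for k in np.linspace(n_per_arm // looks, n_per_arm, looks, dtype=int):
        z = (b[:k].mean() - a[:k].mean()) / np.sqrt(a[:k].var() / k + b[:k].var() / k)
        if abs(z) > norm.ppf(0.975):          # "significant" at this peek
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_sims:.1%}")  # well above 5%
```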
Question 4: “What is sample-ratio mismatch?”
Sample-ratio mismatch, or SRM, occurs when the observed allocation differs meaningfully from the intended allocation. If the test was supposed to split 50/50 but the data shows 56/44, something may be wrong with randomization, eligibility, logging, caching, bots, experiment bucketing, or exposure filters.
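Before debugging, quantify the mismatch. A minimal check, sketched here as a chi-square goodness-of-fit test on raw assignment counts (the counts are illustrative): with enough traffic, even a split that looks close to 50/50 can be a real SRM.

```python
from scipy.stats import chisquare

# Illustrative assignment counts for an intended 50/50 split.
observed = [50_600, 49_400]          # looks like roughly 50.6 / 49.4
expected = [sum(observed) / 2] * 2

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.2g}")
# A tiny p-value means the deviation is larger than chance allows:
# treat it as a validity alarm and debug assignment and logging before reading metrics.
```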
What to do:
- Check assignment counts before exposure filters.
- Compare exposure counts by platform, browser, geography, new versus returning users, and app version.
- Verify that both variants log the same events.
- Look for redirects, crashes, blocked scripts, or performance differences.
- Pause interpretation until the source is understood.
Interview-ready line: “SRM is a validity alarm, not a metric result. I would not trust a statistically significant lift until I know why the split is off.”
Question 5: “The test is significant, but revenue is flat. What now?”
Do not say “ship it” automatically. Diagnose metric consistency.
Possible explanations:
- Conversion rose but average order value fell.
- Low-intent users converted on smaller baskets.
- Discounts or trial starts increased short-term conversion but hurt paid retention.
- The metric counted duplicated orders.
- The effect is concentrated in a small segment and offset elsewhere.
- The sample is large enough to make a tiny effect significant but not valuable.
A strong answer says: “I would compare the primary metric to the decision metric. If conversion is significant but revenue per eligible user is flat, I would not launch broadly without understanding order value, refunds, and cohort retention.”
Question 6: “How do you choose metrics?”
Use a metric hierarchy.
| Metric type | Purpose | Example |
|---|---|---|
| Primary | Decides the experiment | Revenue per eligible user |
| Secondary | Explains mechanism | Add-to-cart rate, checkout completion |
| Guardrail | Prevents harmful launches | Latency, refunds, unsubscribe, support contacts |
| Diagnostic | Debugs exposure and funnel | Event logging, assignment counts, browser mix |
The most common mistake is choosing a metric too far downstream for a short test. Thirty-day retention may be the real business metric, but a two-week experiment may need an earlier validated proxy plus a longer holdout read. Say this explicitly.
Question 7: “What if users influence each other?”
This is interference. Standard A/B tests assume one user’s treatment does not affect another user’s outcome. Marketplaces, social feeds, collaboration tools, sales teams, classrooms, and ad auctions often violate that assumption.
Options:
- Randomize at cluster level: team, company, school, market, or geography.
- Use switchback designs for time-based marketplace changes.
- Keep treatment and control separated when spillover is likely.
- Model network exposure when full isolation is impossible.
Trade-off: cluster tests need more sample and often have less power because outcomes are correlated within clusters. A good candidate names that cost instead of pretending cluster randomization is free.
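One common way to put a number on that cost is the design effect, 1 + (m − 1) × ICC, where m is the average cluster size and ICC is the intraclass correlation; the cluster size, ICC, and baseline sample below are illustrative.

```python
def design_effect(avg_cluster_size, icc):
    """Variance inflation from randomizing clusters instead of individuals."""
    return 1 + (avg_cluster_size - 1) * icc

n_individual = 31_000                      # users per arm under individual randomization (illustrative)
deff = design_effect(avg_cluster_size=50, icc=0.05)

print(f"design effect: {deff:.2f}")                      # 3.45x variance inflation
print(f"users per arm: {round(n_individual * deff):,}")  # ~107,000 once clustering is priced in
```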
Question 8: “How do you handle multiple metrics and segments?”
Segment analysis is useful for diagnosis and targeting, but it is dangerous as a fishing expedition. Pre-register the segments that matter: new versus returning, platform, geography, plan type, acquisition channel. Treat unplanned segment wins as hypotheses for follow-up, not guaranteed launch rules.
Multiple comparisons increase false positives. You can control this with stricter thresholds, false-discovery methods, hierarchical testing, or a clear primary metric that does not move just because a secondary metric looks interesting.
Interview answer: “I would make the launch decision on the primary metric and guardrails. Segment cuts help me understand heterogeneity, but I would be careful not to overfit the roadmap to noisy slices.”
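If asked to show rather than just name a false-discovery method, a short sketch using Benjamini-Hochberg adjustment works; the p-values are illustrative, and statsmodels is assumed to be available.

```python
from statsmodels.stats.multitest import multipletests

# Illustrative p-values from secondary metrics and pre-registered segment cuts.
p_values = [0.004, 0.021, 0.048, 0.130, 0.410]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for raw, adj, keep in zip(p_values, p_adjusted, reject):
    print(f"raw p = {raw:.3f}   BH-adjusted p = {adj:.3f}   flag = {keep}")
```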
Question 9: “How long should the test run?”
Long enough to hit the powered sample size, cover the relevant behavior cycle, and avoid obvious seasonality traps. For consumer products, that often means at least a full week to cover weekday/weekend patterns. For B2B products, it may require multiple business cycles. For retention, renewals, and churn, you need a longer read or a validated leading indicator.
Do not answer with a universal number. Answer with dependencies: traffic, baseline rate, MDE, metric window, and seasonality.
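A back-of-the-envelope duration check, assuming the per-arm sample size from the power analysis above and rounding up to full weeks so weekday/weekend cycles are covered; all numbers are illustrative.

```python
import math

required_per_arm = 31_000        # per-arm sample from the power analysis (illustrative)
arms = 2
eligible_per_day = 8_000         # daily users who actually reach the experiment
allocation = 0.5                 # share of eligible traffic enrolled in the test

days = math.ceil(required_per_arm * arms / (eligible_per_day * allocation))
weeks = math.ceil(days / 7)
print(f"{days} days of traffic; run for {weeks * 7} days to cover full weeks")
```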
Prep checklist for A/B testing interviews
Before the interview, practice explaining these without notes:
- Difference between assignment, exposure, and analysis population.
- Why power depends on MDE and variance.
- Why peeking inflates false positives.
- What SRM means and how to debug it.
- Intent-to-treat versus treatment-on-treated analysis.
- Primary, secondary, guardrail, and diagnostic metrics.
- Interference and cluster randomization.
- Novelty effects, learning effects, and logging regressions.
- Why statistical significance is not the same as business significance.
How to talk about A/B testing on a resume
Weak bullet: “Analyzed A/B tests for growth team.”
Better bullet: “Designed checkout experiments with user-level randomization, pre-launch power analysis, SRM monitoring, and revenue-quality guardrails.”
Best bullet: “Built an experimentation review process that required MDE-based power checks, sample-ratio validation, and pre-defined launch rules before product teams interpreted A/B test results.”
That bullet tells a hiring manager you can prevent false wins, not just calculate them. In 2026, that is the difference. Experimentation maturity is now a product operating skill, and strong interview answers sound like they came from someone who has shipped, paused, and defended real tests.
Advanced follow-up: variance reduction and CUPED
For senior data science interviews, expect a follow-up on variance reduction. CUPED-style adjustment uses a pre-experiment covariate, such as a user’s prior spending or prior activity, to reduce metric variance. The idea is not to “correct” a bad experiment; it is to make a valid randomized experiment more sensitive by accounting for predictable baseline differences.
A good answer is careful: “I would only use pre-treatment variables, define the adjustment before reading results, and confirm that the covariate is measured for both variants in the same way.” Do not use post-treatment behavior as a covariate, because the treatment may have caused it. Also do not use variance reduction to rescue a test with SRM, broken logging, or a changed eligibility rule. Data quality comes first.
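A minimal sketch of the adjustment on simulated data, assuming the covariate is a pre-experiment version of the metric; theta is the usual covariance-over-variance coefficient.

```python
import numpy as np

def cuped_adjust(y, x_pre):
    """CUPED: remove the part of the metric explained by a pre-experiment covariate."""
    theta = np.cov(x_pre, y)[0, 1] / np.var(x_pre)
    return y - theta * (x_pre - x_pre.mean())

# Simulated users: prior spend predicts in-experiment spend; treatment adds a small lift.
rng = np.random.default_rng(1)
pre = rng.gamma(2.0, 10.0, size=20_000)                      # pre-experiment spend
treated = rng.integers(0, 2, size=20_000)                    # random assignment
y = 0.8 * pre + 1.0 * treated + rng.normal(0, 10, 20_000)    # in-experiment spend

y_adj = cuped_adjust(y, pre)
print("raw metric variance:  ", round(y.var(), 1))
print("CUPED metric variance:", round(y_adj.var(), 1))       # smaller variance -> more sensitive test
```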
This is a strong way to show depth without overcomplicating the main answer: fixed-horizon design, power analysis, no peeking, SRM checks, then optional variance reduction if the experiment platform supports it.
Related guides
- AWS Interview Questions in 2026 — VPC, IAM, and the Services That Always Come Up — A focused AWS interview prep guide for 2026 covering VPC design, IAM reasoning, core services, common architecture prompts, debugging flows, and the mistakes that weaken senior answers.
- Deep Learning Interview Questions in 2026 — Backprop, Optimizers, and Regularization — A 2026-ready deep learning interview guide covering backpropagation, optimizers, regularization, debugging, transformers, evaluation, and sample answers that show practical judgment.
- Docker Interview Questions in 2026 — Layers, Multi-Stage Builds, and Runtime — A practical Docker interview guide for 2026 covering image layers, Dockerfile design, multi-stage builds, runtime isolation, Compose, security, and the debugging questions candidates keep seeing.
- GraphQL Interview Questions in 2026 — Schemas, Resolvers, and N+1 Prevention — A focused GraphQL interview guide for 2026 covering schema design, resolvers, N+1 prevention, DataLoader, pagination, auth, caching, federation, mutations, observability, and production trade-offs. Built for frontend, backend, and platform candidates.
- Kubernetes Interview Questions in 2026 — Controllers, Networking, and Operators — A practical guide to Kubernetes interview questions in 2026, focused on the controller model, service networking, CRDs, operators, and the debugging scenarios senior candidates actually get asked.
