
The Nvidia Machine Learning Interview — GPU Systems, CUDA Optimization, and Applied Research

10 min read · April 25, 2026

Nvidia's ML loop doesn't look like Meta's or OpenAI's. They grade for GPU literacy, kernel-level intuition, and a working mental model of memory bandwidth. Here's the 2026 bar.

Nvidia is the most underestimated ML employer of 2026. With H100 and B200 supply defining who can train at frontier scale, with CUDA as the de facto software layer beneath every major foundation model, and with Nvidia Research publishing across robotics, graphics, LLMs, and physics simulation, the ML interview at Nvidia has become one of the hardest loops in tech — not because the algorithms are harder than Google's, but because the rubric is more specific. Nvidia wants ML engineers who think in warps, registers, and memory hierarchies. If you've shipped a transformer and have opinions about MLPerf submissions, they want to talk. If you've only ever called .fit() on a DataFrame, you will not clear the bar.

This guide is the structure, the rubrics, the questions, and the prep path. Sources are Blind, Levels.fyi, Nvidia GTC sessions, and candidate debriefs from 2024-2026 loops across Applied Deep Learning, Nvidia Research, the TensorRT team, the Megatron team, and cuDNN/cuBLAS.

The loop structure

Nvidia's ML loops vary sharply by team. The four main shapes:

  • Applied ML engineer (the biggest bucket). Recruiter screen, two technical phone screens (one coding, one ML depth), then a 5-round onsite: CUDA/GPU systems, ML coding, ML system design, research/depth, and hiring-manager behavioral.
  • Nvidia Research scientist. Recruiter screen, paper deep-dive call, two research rounds (one on your work, one on a domain of the team's choosing), a coding round closer to algorithms than ML, and a director conversation. The emphasis is on published work and research taste.
  • CUDA/kernel engineer (Megatron, TensorRT, cuDNN). Recruiter screen, a systems phone screen that is mostly C++ and memory model questions, then an onsite heavy on CUDA (two rounds), GPU architecture (one round), systems design (one round), and behavioral. ML depth is lighter here; systems depth is heavier.
  • Robotics and simulation (Isaac, Omniverse). Similar to Applied ML but with a graphics/physics round added — rendering, collision, or differentiable simulation depending on the team.

The applied ML loop is where most 2026 hiring happens and is what this guide centers on. Expect four to six weeks from recruiter to offer if you move briskly. Nvidia does not drag out decisions the way some FAANG companies do — hiring managers have unusual authority and close fast when they see a fit.

What Nvidia actually grades on

The rubric dimensions that cluster into strong hires:

  • GPU literacy. You don't have to write CUDA from scratch in every round, but you must understand the memory hierarchy (registers, shared memory, L1/L2, HBM), occupancy, warp-level primitives, tensor cores, and the difference between compute-bound and memory-bound kernels. A candidate who confidently discusses arithmetic intensity and roofline analysis wins the room. One who says 'the GPU just parallelizes it' does not.
  • Numerical maturity. FLOPs, bytes, bandwidth, and latency. If asked 'how long does a forward pass of a 70B model take on 8xH100 at batch 1,' a strong candidate estimates it in 60 seconds with a back-of-envelope that lands within 2x (a worked sketch follows this list). Weak candidates don't try.
  • Systems thinking for distributed training. Data parallelism vs tensor parallelism vs pipeline parallelism vs sequence parallelism. When does each dominate? What's the communication cost? Where does NCCL become the bottleneck? Nvidia invented half of these primitives, so the bar is higher than at a typical AI lab.
  • Precision awareness. FP32, TF32, FP16, BF16, FP8, INT8, INT4. What does each cost in accuracy? Where does loss scaling matter? When does FP8 training actually work and where does it break down? The B200 launch has made this table-stakes in 2026 loops.
  • Framework fluency. PyTorch, not just at the Module level but at the autograd, custom op, and torch.compile level. Familiarity with Triton is increasingly expected for senior roles.
  • Research-to-production taste. Nvidia's applied teams ship — MLPerf, TensorRT-LLM, NeMo, the Picasso platform. Candidates who romanticize pure research without shipping rarely clear the bar.
  • Communication density. Nvidia interviewers, especially in Research and Applied DL, are impatient. They want you to get to the point, draw the right diagram, and anticipate the follow-up. Slow, hedged explanations read weaker here than at FAANG.
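
To make the numerical-maturity bar concrete, here is a minimal back-of-envelope for the 70B question, written out in Python. The hardware figures are public H100 SXM specs; the 2-FLOPs-per-parameter-per-token rule and the assumption that batch-1 decode is weight-streaming-bound are the standard approximations, not anything Nvidia hands you.

```python
# Back-of-envelope: batch-1 decode for a 70B model on 8x H100 (TP=8).
HBM_BW_PER_GPU = 3.35e12   # bytes/s, H100 SXM HBM3
PEAK_BF16 = 989e12         # FLOP/s, dense tensor-core BF16/FP16

params = 70e9
weight_bytes = 2 * params            # BF16 weights
flops_per_token = 2 * params         # one multiply + one add per weight

# Roofline: arithmetic intensity vs. the ridge point.
intensity = flops_per_token / weight_bytes       # ~1 FLOP/byte
ridge = PEAK_BF16 / HBM_BW_PER_GPU               # ~295 FLOP/byte
print(f"intensity {intensity:.0f} vs ridge {ridge:.0f} FLOP/B -> memory-bound")

# Memory-bound, so time = bytes streamed / bandwidth. With weights
# sharded across 8 GPUs, each GPU streams 1/8 of them per token.
gpus = 8
t_token = (weight_bytes / gpus) / HBM_BW_PER_GPU
print(f"~{t_token * 1e3:.1f} ms/token lower bound")  # ~5 ms, before comms and KV cache
```

Real systems land above this floor once NCCL all-reduces and the KV cache are counted, but getting to '~5 ms lower bound, so a small multiple of that in practice' inside 60 seconds is exactly what the rubric rewards.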

What does not score: naming every paper you've read, claiming you 'built an LLM' when you fine-tuned one, or dodging numerical questions.

Example questions

A sampling from 2024-2026 Nvidia ML loops:

  • Derive the FLOPs and memory for a transformer forward pass. Then do the backward. Then explain where activation checkpointing changes the picture. (A worked sketch of the forward-FLOPs half follows this list.)
  • Given an H100 with 80GB HBM3 at ~3.35 TB/s bandwidth and 989 TFLOPS of dense FP16, is a batch-size-1 matmul for a 7B model compute-bound or memory-bound? Show the roofline.
  • Explain why FlashAttention is faster than naive attention. Draw the memory-access pattern.
  • You have a 405B model you want to train on 512 H100s. Sketch the parallelism strategy. What's your DP / TP / PP split and why?
  • Implement a fused softmax kernel in CUDA or Triton. Walk me through the memory hierarchy use.
  • What is a warp? What's a warp divergence? How does it affect performance?
  • Your training throughput is 40% of the theoretical peak FLOPs. Is that good? What are the usual suspects?
  • What happens numerically when you train in FP8 with E4M3 for forward and E5M2 for backward? When does training diverge?
  • Given a KV-cache-heavy inference workload, you're memory-bound on HBM. What are three ways to raise throughput, ranked by effort?
  • Explain tensor cores. When does a matmul use them and when does it silently fall back?
  • You're profiling an inference server and seeing 30% GPU idle. What's your debugging path?
  • Walk me through one Nvidia paper you've read recently. What would you change?
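
For the first question, here is the standard per-layer FLOPs accounting, sketched in Python. The config is a hypothetical 7B-class model and the 2-FLOPs-per-multiply-accumulate convention is the usual one; treat it as a template rather than the one blessed derivation.

```python
# Forward-pass FLOPs for one transformer layer (hypothetical 7B-ish config).
# Convention: a multiply-accumulate counts as 2 FLOPs.
def layer_forward_flops(batch, seq, d_model, n_heads, d_ff):
    tokens = batch * seq
    head_dim = d_model // n_heads
    qkvo = 4 * (2 * tokens * d_model * d_model)              # Q, K, V, O projections
    scores = 2 * batch * n_heads * seq * seq * head_dim      # Q @ K^T
    weighted = 2 * batch * n_heads * seq * seq * head_dim    # softmax(S) @ V
    mlp = 2 * (2 * tokens * d_model * d_ff)                  # up + down projections
    return qkvo + scores + weighted + mlp

flops = 32 * layer_forward_flops(batch=1, seq=4096, d_model=4096,
                                 n_heads=32, d_ff=16384)
print(f"~{flops/1e12:.0f} TFLOPs at 4K context")
# ~62; the 2*params*tokens rule gives ~53 for the ~6.4B of layer params,
# and the gap is the seq x seq attention term.
```

Backward is roughly 2x forward, which is where the 6 × params × tokens training rule comes from.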

Notice the shape. Numerical estimation in every round. Hardware in every round. The rubric is 'can this person reason about a GPU, not just call .cuda() on a model.'
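
As one more example of that shape, the arithmetic behind the KV-cache question, assuming a hypothetical 70B-class GQA config (80 layers, 8 KV heads, head dim 128, BF16 cache):

```python
# KV-cache footprint for a hypothetical 70B-class GQA model.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2        # BF16
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
print(f"{kv_per_token/1e3:.0f} KB per token per sequence")     # ~328 KB

seq, batch = 8192, 32
cache_bytes = kv_per_token * seq * batch
print(f"~{cache_bytes/1e9:.0f} GB of cache at 8K context, batch 32")  # ~86 GB
# Every decode step re-reads this cache from HBM, which is why the workload
# is memory-bound and why GQA, KV quantization, and paging all help.
```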

Strong vs passing answers

For 'sketch the parallelism strategy for 405B on 512 H100s,' a passing answer picks FSDP or a reasonable TP/PP split and draws boxes. A strong answer does:

  1. Starts from memory. '405B params at BF16 is 810GB for weights. Add BF16 gradients plus FP32 master weights and Adam moments and you're at roughly 16 bytes per parameter, about 6.5TB of model state before activations. Across 512 GPUs with 80GB each we have ~41TB of HBM, so capacity is fine, but we need a plan for the sharding.'
  2. Picks TP inside a node. 'TP=8 across the 8 H100s inside one DGX node, because NVLink (900 GB/s of bandwidth per GPU) makes intra-node all-reduce cheap. TP across nodes would kill us on InfiniBand.'
  3. Picks PP across node groups. 'PP=8 across 8 pipeline stages, with interleaved 1F1B scheduling to keep the bubble under 15%. That's 64 GPUs per replica.'
  4. DP fills the rest. 'DP=8 replicas, ZeRO-1 for optimizer state sharding across DP ranks. That puts us at 512 GPUs.'
  5. Names the bottleneck. 'Pipeline bubbles at batch boundaries will be the biggest loss, plus any all-reduce on DP gradients. I'd target global batch ~4M tokens to amortize the bubble.'
  6. Names precision and sequence-parallel choices. 'BF16 for compute, FP32 master weights. Sequence parallelism for layer norms and dropout to save activation memory. FlashAttention-2 everywhere for attention.'
  7. Names what would go wrong. 'The risk is a straggler in the DP group stalling the step. I'd add gradient-accumulation-aware timing and would set NCCL timeouts accordingly.'

That's senior-IC quality. Hit all seven, and you're in the hire pile.
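
A quick check of step 1's arithmetic, using the standard ~16 bytes per parameter of mixed-precision Adam state (BF16 weights and gradients, FP32 master weights and both moments):

```python
# Model-state arithmetic behind step 1 of the strong answer.
params = 405e9
bytes_per_param = 2 + 2 + 4 + 4 + 4   # bf16 weight + grad, fp32 master + m + v
print(f"model state: {params * bytes_per_param / 1e12:.1f} TB")  # ~6.5 TB

gpus, hbm = 512, 80e9
print(f"cluster HBM: {gpus * hbm / 1e12:.0f} TB")  # ~41 TB, capacity is fine

tp, pp = 8, 8               # TP inside the node, PP across node groups
dp = gpus // (tp * pp)      # DP fills the rest
print(f"TP={tp} x PP={pp} x DP={dp} -> {tp * pp * dp} GPUs")
```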

Common failure modes

Ways ML candidates reliably lose at Nvidia:

  • API-level only. Describing models exclusively in Hugging Face or PyTorch high-level terms without ever dropping to what the GPU is doing.
  • No numbers. Refusing to estimate. Nvidia interviewers will press until you produce a number; candidates who keep hedging fail the round.
  • Weak on FlashAttention. In 2026, not being able to draw the FlashAttention IO pattern is a real gap for an ML systems role. It's the canonical example of arithmetic-intensity optimization (see the sketch after this list).
  • Confused about parallelism types. Not knowing the difference between DP, TP, and PP, or not being able to explain when each wins.
  • Mixing up precisions. Not knowing the difference between TF32 and FP16, or when FP8 loss scaling is needed.
  • Saying 'torch.compile' as if it's magic. Strong candidates know what it actually does (graph capture, kernel fusion, dynamic shapes) and where it regresses.
  • Sloppy backward-pass reasoning. Many candidates can describe forward passes but stumble on where memory actually goes in backward — recomputation, saved tensors, gradient checkpointing.
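
Here is a plain-PyTorch sketch of that IO pattern: online softmax over K/V tiles, so the full seq x seq score matrix is never materialized. It illustrates the algorithm only; the real kernel fuses all of this and keeps each tile in SRAM rather than round-tripping intermediates through HBM. Causal masking is omitted for brevity.

```python
import torch

def flash_attention_sketch(q, k, v, tile=128):
    # q, k, v: (seq, head_dim). Accumulate the softmax numerator and
    # denominator tile by tile, rescaling when the running max changes.
    seq, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    m = torch.full((seq, 1), -float("inf"))   # running row max
    l = torch.zeros(seq, 1)                   # running softmax denominator
    for start in range(0, seq, tile):
        kt, vt = k[start:start + tile], v[start:start + tile]
        s = (q @ kt.T) * scale                # scores for this tile only
        m_new = torch.maximum(m, s.max(dim=1, keepdim=True).values)
        p = torch.exp(s - m_new)
        fix = torch.exp(m - m_new)            # rescale earlier partial sums
        l = l * fix + p.sum(dim=1, keepdim=True)
        out = out * fix + p @ vt
        m = m_new
    return out / l
```

Being able to say why this wins — naive attention reads and writes the full seq x seq matrix to HBM, while this touches Q, K, V, and the output once — is the answer the round is looking for.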

Prep strategy

30-50 hours over four to six weeks, assuming you have ML fundamentals:

  • Read the H100 whitepaper and the B200/GB200 announcement. Memorize the key numbers — HBM size, memory bandwidth, TFLOPs by precision, NVLink bandwidth.
  • Drill numerical estimation. Practice estimating training and inference throughput for common model sizes. Use the Chinchilla-era approximation for training FLOPs (6 × params × tokens) and the 2 × params × tokens approximation for inference.
  • Read FlashAttention (v1 and v2), PagedAttention, and one FP8 training paper. Be able to derive each from scratch. These are canonical Nvidia-adjacent reading.
  • Build one CUDA or Triton kernel. Not production-grade — a fused softmax or a simple reduce is enough (a minimal Triton version follows this list). The experience of profiling it with ncu and watching occupancy numbers move will do more for your interview than 10 papers.
  • Learn one Megatron or TransformerEngine codebase. Read the tensor parallelism implementation. Understand what's happening in the all-reduce.
  • Practice ML system design with a distributed lens. Every design question should end with 'and here's how I'd parallelize it.'
  • Prepare a research pitch. Even for applied roles, senior interviewers will ask what you'd work on if headcount were free. Have a real answer.
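
A minimal Triton fused softmax of the kind that exercise calls for, following the standard Triton tutorial pattern: one HBM read and one HBM write per row, with the max/exp/sum staying on-chip. Assumes a contiguous 2D CUDA tensor.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK: tl.constexpr):
    # One program handles one row: load once, reduce in registers, store once.
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    x = tl.load(in_ptr + row * n_cols + offs, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)              # subtract row max for stability
    num = tl.exp(x)                        # masked lanes become exp(-inf) = 0
    y = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + offs, y, mask=mask)

def fused_softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(n_cols)  # whole row in one block
    softmax_kernel[(n_rows,)](out, x, n_cols, BLOCK=BLOCK)
    return out
```

The kernel itself matters less than what you learn profiling it: run it under ncu, vary BLOCK, and watch occupancy and DRAM throughput move. That is the experience the interview probes for.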

Comp and negotiation anchors

Nvidia comp has surged post-2023 and is now FAANG-competitive at senior levels. 2026 rough bands for applied ML engineers in the Bay Area:

  • IC3 (new-to-mid): $180K-$230K base, $100K-$250K RSU/yr, 10-15% bonus — $320K-$520K TC.
  • IC4 (senior): $220K-$280K base, $300K-$700K RSU/yr, 15% — $580K-$1.1M TC.
  • IC5 (staff): $280K-$350K base, $700K-$1.8M RSU/yr, 20% — $1.05M-$2.4M TC.
  • IC6 (principal): $350K-$430K base, $1.5M-$4M RSU/yr, 20% — $1.9M-$4.9M TC.

The RSU grants are the story. Nvidia's stock run 2023-2025 turned typical grants into generational comp events, and 2026 offers are larger in dollar terms as a result. Negotiate on initial RSU, not base. Ask about sign-on separately — it moves $20-50K at IC3/IC4 and $75-200K at IC5+.

What the hiring manager wants

Across dozens of debriefs, the hiring-manager synthesis at Nvidia resolves to: 'Can this person make a GPU go faster, and do they care?' The 'care' part matters. Nvidia culture is high-ownership, long-tenured, and work-intensive — the median engineer has been there 6+ years and cares deeply about the hardware. Candidates who show up performatively, or who are clearly there to pad a resume en route to OpenAI, get filtered out. Candidates who can geek out for an hour about warp scheduling get hired.

If you can look at an ML system and instantly frame it in terms of arithmetic intensity, memory bandwidth, and communication cost — and if you have real numbers for an H100 in your head — you'll clear the bar. If you're fluent in the Hugging Face API but have never asked why a kernel stalls, Nvidia is not the loop for you in 2026.

Sources and further reading

When evaluating any company's interview process, hiring bar, or compensation, cross-reference what you read here against multiple primary sources before making decisions.

  • Levels.fyi — Crowdsourced compensation data with real recent offers across tech employers
  • Glassdoor — Self-reported interviews, salaries, and employee reviews searchable by company
  • Blind by Teamblind — Anonymous discussions about specific companies, often the freshest signal on layoffs, comp, culture, and team-level reputation
  • LinkedIn People Search — Find current employees by company, role, and location for warm-network outreach and informational interviews

These are starting points, not the last word. Combine multiple sources, weight recent data over older, and treat anonymous reports as signal that needs corroboration.