
RAG System Design Interview — Chunking, Embeddings, Retrieval, and Evals

10 min read · April 25, 2026

A practical RAG system design interview guide for chunking, embeddings, retrieval, reranking, prompts, evals, hallucination control, latency, and production tradeoffs.


A RAG system design interview tests whether you can build more than a demo. You need to explain chunking, embeddings, retrieval, reranking, prompt construction, evals, hallucination control, latency, security, and data freshness as one production system. The strongest answers show how the retrieval layer and generation layer depend on each other, and how you would know whether the system is actually answering from the right evidence.

RAG system design interview: the production mental model

Retrieval-augmented generation combines a language model with an external knowledge source. Instead of asking the model to answer only from its parameters, you retrieve relevant documents and provide them as context. The model then generates an answer grounded in that context.

A practical RAG architecture has two paths:

  1. Indexing path: ingest documents, clean them, chunk them, attach metadata, embed chunks, store vectors and text, and update the index.
  2. Query path: receive a question, rewrite or classify it, retrieve candidate chunks, rerank them, build a prompt, generate an answer, cite evidence, and log results.
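
A minimal sketch of how these two paths might be wired together, with an in-memory store and keyword overlap standing in for real embeddings so it stays runnable; every name here (Chunk, index_documents, answer_query, score) is an illustrative placeholder, not a specific product's API.

```python
# Toy sketch of the two RAG paths. Keyword overlap stands in for real
# embeddings so the example is self-contained and runnable.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)

INDEX: list[Chunk] = []          # stand-in for a vector + text store

def index_documents(docs: dict[str, str]) -> None:
    """Indexing path: split into chunks, attach metadata, store."""
    for doc_id, text in docs.items():
        for i, para in enumerate(p for p in text.split("\n\n") if p.strip()):
            INDEX.append(Chunk(doc_id, para.strip(), {"section": i}))

def score(query: str, chunk: Chunk) -> float:
    """Placeholder relevance score (keyword overlap, not a real embedding)."""
    q, c = set(query.lower().split()), set(chunk.text.lower().split())
    return len(q & c) / (len(q) or 1)

def answer_query(question: str, k: int = 3) -> str:
    """Query path: retrieve candidates, keep top-k, build a grounded prompt."""
    candidates = sorted(INDEX, key=lambda ch: score(question, ch), reverse=True)[:k]
    context = "\n".join(f"[{c.doc_id}#{c.metadata['section']}] {c.text}" for c in candidates)
    # In production this prompt would go to an LLM; here we just return it.
    return f"Answer only from the context below and cite sources.\n{context}\n\nQuestion: {question}"

index_documents({"refund-policy": "Refunds are issued within 14 days.\n\nGift cards are non-refundable."})
print(answer_query("How long do refunds take?"))
```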

Interviewers listen for the boring parts: permissions, stale data, duplicate chunks, chunk boundaries, fallback behavior, evaluation, and monitoring. Anyone can say “use a vector database.” Fewer candidates can explain how to stop the system from confidently answering with the wrong paragraph.

A strong opening: “I would build the RAG system as an ingestion pipeline plus an online retrieval-and-generation path. I would optimize retrieval quality before tuning prompts, because the model cannot reliably answer from evidence it never receives.”

Chunking: where many RAG systems succeed or fail

Chunking splits documents into units that can be embedded and retrieved. Bad chunking creates bad retrieval. If chunks are too small, they lose context. If chunks are too large, retrieval becomes noisy and the prompt fills with irrelevant text.

Useful chunking strategies:

| Strategy | Best for | Risk |
|---|---|---|
| Fixed token windows | Simple baseline | Cuts through sections and tables |
| Sliding window with overlap | Preserves nearby context | Duplicates content and increases index size |
| Structure-aware chunking | Docs with headings, sections, FAQs | Requires parsing quality |
| Semantic chunking | Topic boundaries matter | More complex and harder to debug |
| Hierarchical chunks | Long reports or policies | More moving parts at retrieval time |

For most interviews, propose a structure-aware baseline: split by headings, keep tables intact when possible, include document title and section path in metadata, and use modest overlap for paragraphs that cross boundaries. For code or API docs, chunk by function, class, endpoint, or section instead of arbitrary length.
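
A hedged sketch of structure-aware chunking for Markdown-style docs, splitting on headings and carrying the section path into each chunk's metadata; the regex, size limit, and record shape are illustrative choices, not a standard, and table handling is left out for brevity.

```python
import re

def chunk_by_headings(doc_id: str, text: str, max_chars: int = 1500) -> list[dict]:
    """Split a Markdown-ish document on headings, keeping the section path
    with each chunk so retrieval and citations stay anchored to structure."""
    chunks, path, buf = [], [], []

    def flush():
        if buf:
            body = "\n".join(buf).strip()
            if body:
                chunks.append({"doc_id": doc_id, "section": " > ".join(path) or "(root)", "text": body})
            buf.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()
            level, title = len(m.group(1)), m.group(2).strip()
            path[:] = path[: level - 1] + [title]   # update the heading path at this depth
        else:
            buf.append(line)
            if sum(len(x) for x in buf) > max_chars:   # keep oversized sections bounded
                flush()
    flush()
    return chunks

doc = "# Refunds\nRefunds take 14 days.\n## Gift cards\nGift cards are non-refundable."
for c in chunk_by_headings("policy-01", doc):
    print(c["section"], "->", c["text"])
```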

Chunk metadata is as important as text. Store document ID, title, section heading, source URL or file path, update timestamp, author or owner, permissions, version, and content type. Metadata powers filters, citations, freshness, and access control.
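
One way to model that metadata is a small record attached to every chunk; the exact field set below is a reasonable assumption, not a required schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ChunkRecord:
    # Identity and provenance for citations
    doc_id: str
    chunk_id: str
    title: str
    section_path: str
    source_url: str
    # Governance: powers filters, freshness checks, and access control
    updated_at: datetime
    owner: str
    allowed_groups: frozenset[str]
    version: int
    content_type: str = "text/markdown"
    # The text that actually gets embedded and shown as evidence
    text: str = ""

record = ChunkRecord(
    doc_id="policy-01", chunk_id="policy-01#2", title="Refund policy",
    section_path="Refunds > Gift cards", source_url="https://example.com/policies/refunds",
    updated_at=datetime(2025, 3, 3, tzinfo=timezone.utc), owner="support-team",
    allowed_groups=frozenset({"support", "finance"}), version=4,
    text="Gift cards are non-refundable.",
)
print(record.chunk_id, record.updated_at.date())
```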

Embeddings: semantic similarity with constraints

Embeddings map text into vectors so semantically similar chunks are close together. In a RAG system, you embed chunks offline and embed the user query online, then retrieve nearest chunks.

Explain what embeddings are good at: paraphrase, fuzzy semantic match, and domain vocabulary when trained or adapted well. Also explain what they are bad at: exact constraints, numbers, dates, permissions, negation, and rare identifiers. A query like “policy updated after March 2025” may need metadata filters, not only vector similarity.
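
A small sketch of the online step: embed the query, take cosine-nearest chunks, and apply a metadata filter for the kind of constraint (a date cutoff) that similarity alone cannot enforce. The vectors here are random stand-ins for whatever embedding model you pick.

```python
import numpy as np
from datetime import date

rng = np.random.default_rng(0)

# Stand-in embeddings: in a real system these come from an embedding model,
# computed offline for chunks and online for the query.
chunk_vecs = rng.normal(size=(4, 8)).astype(np.float32)
chunk_meta = [
    {"id": "a", "updated": date(2024, 11, 2)},
    {"id": "b", "updated": date(2025, 4, 9)},
    {"id": "c", "updated": date(2025, 6, 1)},
    {"id": "d", "updated": date(2023, 1, 15)},
]
query_vec = rng.normal(size=8).astype(np.float32)

def cosine_top_k(q, M, k):
    """Cosine similarity against every chunk vector, highest first."""
    sims = (M @ q) / (np.linalg.norm(M, axis=1) * np.linalg.norm(q) + 1e-9)
    return [(int(i), float(sims[i])) for i in np.argsort(-sims)[:k]]

# "policy updated after March 2025": the date constraint is a metadata filter,
# not something the embedding space can be trusted to encode.
cutoff = date(2025, 3, 31)
candidates = cosine_top_k(query_vec, chunk_vecs, k=4)
filtered = [(i, s) for i, s in candidates if chunk_meta[i]["updated"] > cutoff]
print(filtered)
```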

Important embedding choices:

  • model dimension and cost
  • domain fit and language coverage
  • maximum input length
  • whether to embed titles with body text
  • how often to re-embed when content changes
  • whether queries and documents use the same encoder
  • evaluation on real user questions

Do not say “a larger embedding model is always better.” Larger vectors cost more memory and can slow retrieval. The right model is the one that improves recall on your queries within latency and cost constraints.

Retrieval: vector, lexical, hybrid, and filters

Vector retrieval is common in RAG, but hybrid retrieval is often safer. Lexical search catches exact names, IDs, product codes, policy numbers, and rare terms. Vector search catches paraphrases. Metadata filters enforce permissions, tenant boundaries, document type, locale, recency, and product area.

A strong retrieval plan:

  1. Parse the query for filters and intent.
  2. Run vector search for semantic candidates.
  3. Run lexical search for exact-match candidates.
  4. Apply permission and metadata filters.
  5. Merge and deduplicate candidates.
  6. Rerank the top candidates with a cross-encoder or LLM-based scorer if latency allows.
  7. Pass only the best evidence into the prompt.
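
Steps 2 through 5 of that plan can be sketched with reciprocal rank fusion, a common way to merge vector and lexical result lists without comparing their raw scores; the candidate lists and permission set below are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked candidate lists; items ranked high anywhere float up."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative candidate lists from the two retrievers (steps 2 and 3).
vector_hits  = ["chunk-7", "chunk-2", "chunk-9", "chunk-4"]
lexical_hits = ["chunk-2", "chunk-11", "chunk-7"]

# Permission filter before the merge (step 4), then fuse and deduplicate (step 5).
user_allowed = {"chunk-2", "chunk-7", "chunk-9", "chunk-11"}
vector_hits  = [c for c in vector_hits if c in user_allowed]
lexical_hits = [c for c in lexical_hits if c in user_allowed]
print(reciprocal_rank_fusion([vector_hits, lexical_hits]))
```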

Candidate retrieval should optimize recall; reranking should optimize precision. If relevant evidence is not in the candidate set, no prompt can save the answer.

Mention top-k carefully. Retrieving too few chunks misses evidence. Retrieving too many chunks floods the prompt and increases the chance the model uses irrelevant text. A common pattern is retrieve 50 to 100 candidates cheaply, rerank to 5 to 12 evidence chunks, and then generate.

Reranking: improving evidence quality

A reranker scores query-chunk pairs more precisely than pure vector similarity. Cross-encoders can read the query and chunk together, capturing interactions that two independent embeddings miss. LLM rerankers can work too, but they are costlier and require careful prompt design.
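
If the stack already uses the sentence-transformers library, a cross-encoder reranker can look roughly like the sketch below; the checkpoint name is a commonly used public MS MARCO model, but treat it and the score handling as assumptions to verify against your own setup.

```python
# Assumes the sentence-transformers package is installed.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[tuple[str, float]]:
    """Score each (query, chunk) pair jointly and keep the best top_n chunks."""
    pairs = [(query, chunk) for chunk in candidates]
    scores = reranker.predict(pairs)          # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [(chunk, float(score)) for chunk, score in ranked[:top_n]]

evidence = rerank(
    "How long do refunds take?",
    ["Refunds are issued within 14 days.", "Gift cards are non-refundable."],
)
print(evidence)
```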

Use reranking when:

  • many chunks are semantically similar but only a few answer the question
  • exact wording matters
  • the corpus has duplicates or near-duplicates
  • vector retrieval returns broad topical matches
  • answer correctness is more important than raw speed

If latency is tight, rerank fewer candidates, cache common queries, use a smaller reranker, or reserve reranking for high-risk queries. Again, the interview answer should show tradeoffs, not tool worship.

Prompt construction and answer behavior

The prompt should instruct the model to answer from retrieved evidence, cite sources, and say when the evidence is insufficient. It should not bury the question under a giant context dump.

A good RAG prompt includes:

  • user question
  • system instruction about grounding
  • retrieved chunks with titles and source IDs
  • rules for uncertainty and citations
  • output format if needed
  • refusal or escalation behavior for restricted topics

Example instruction: “Answer using only the provided context. If the context does not contain the answer, say what is missing and ask a clarifying question. Cite the source IDs used.”
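
A sketch of the prompt assembly step under those rules; the template wording and chunk fields are illustrative, not a canonical format.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt: instruction, labeled evidence, then the question."""
    evidence = "\n\n".join(
        f"[{c['source_id']}] {c['title']} ({c['section']})\n{c['text']}" for c in chunks
    )
    return (
        "Answer using only the provided context. If the context does not contain "
        "the answer, say what is missing and ask a clarifying question. "
        "Cite the source IDs you used in square brackets.\n\n"
        f"Context:\n{evidence}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "Are gift cards refundable?",
    [{"source_id": "policy-01#2", "title": "Refund policy", "section": "Gift cards",
      "text": "Gift cards are non-refundable."}],
)
print(prompt)
```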

But do not pretend prompt instructions guarantee truth. Models can still hallucinate, overgeneralize, or combine chunks incorrectly. Retrieval quality, evals, and post-generation checks matter.

Evals: the difference between a demo and a system

RAG evals should measure both retrieval and generation. If the final answer is wrong, you need to know whether retrieval missed the evidence or generation misused it.

| Layer | Metric or check | What it tells you |
|---|---|---|
| Retrieval | recall@K | Did we retrieve the needed evidence? |
| Retrieval | MRR / NDCG | Was evidence near the top? |
| Retrieval | filter accuracy | Did permissions and metadata work? |
| Generation | groundedness | Is the answer supported by context? |
| Generation | answer correctness | Did it answer the question? |
| Generation | citation accuracy | Do citations support the claim? |
| System | latency, cost, fallback rate | Can this run in production? |

Build an eval set from real user questions, expected answers, and gold evidence chunks. Include adversarial cases: no-answer questions, stale policies, similar product names, conflicting docs, permission-restricted documents, and numeric questions.
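
With gold evidence chunks in the eval set, retrieval recall@K is a few lines of code; the record format below is an assumption about how the labeled set might be stored.

```python
def recall_at_k(eval_set: list[dict], k: int) -> float:
    """Fraction of questions whose gold evidence chunks all appear in the top-k retrieved."""
    hits = 0
    for example in eval_set:
        retrieved_top_k = set(example["retrieved_ids"][:k])
        if set(example["gold_chunk_ids"]) <= retrieved_top_k:
            hits += 1
    return hits / len(eval_set)

# Illustrative eval records: real ones would come from labeled user questions.
eval_set = [
    {"question": "Are gift cards refundable?",
     "gold_chunk_ids": ["policy-01#2"],
     "retrieved_ids": ["policy-01#2", "policy-01#1", "faq-03#4"]},
    {"question": "How long do refunds take?",
     "gold_chunk_ids": ["policy-01#1"],
     "retrieved_ids": ["faq-03#4", "policy-01#3", "policy-01#1"]},
]
print(recall_at_k(eval_set, k=2))   # 0.5: the second question misses at k=2
```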

A high-quality answer says: “I would not judge the system only by thumbs-up feedback. I would maintain a labeled eval set and break results down by retrieval miss, synthesis error, stale data, and unsupported claim.”

Hallucination control and grounding

RAG reduces hallucination risk; it does not remove it. Controls include:

  • retrieve enough relevant evidence
  • instruct the model to abstain when evidence is missing
  • include source IDs and require citations
  • verify that cited chunks support key claims
  • use answer templates for high-risk domains
  • block answers from unauthorized documents
  • log unsupported answer patterns for review

For enterprise RAG, access control is non-negotiable. Filter by user permissions before generation, not after. Never show a citation to a document the user cannot access. If document-level permissions are too coarse, chunk-level ACLs may be needed.
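
A minimal sketch of pre-generation permission filtering, assuming each chunk carries an allowed_groups set in its metadata as in the earlier metadata sketch; the group names are invented for illustration.

```python
def filter_by_permissions(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Drop evidence the user may not see before it reaches the prompt,
    so the model can never answer from or cite a restricted document."""
    return [c for c in chunks if c["allowed_groups"] & user_groups]

candidates = [
    {"id": "hr-05#1", "allowed_groups": {"hr"}, "text": "Salary bands ..."},
    {"id": "faq-03#4", "allowed_groups": {"support", "all-staff"}, "text": "Refunds ..."},
]
visible = filter_by_permissions(candidates, user_groups={"support"})
print([c["id"] for c in visible])   # only the chunk this user may read
```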

Also mention data freshness. If policies change daily, the ingestion pipeline must support incremental updates, deletion, versioning, and cache invalidation. A stale but confidently cited answer is still a failure.
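
Incremental updates are often handled as an upsert keyed by document ID plus a content hash, so unchanged chunks are skipped and removed chunks are deleted; the in-memory dict below is a stand-in for whatever index you actually use.

```python
import hashlib

index: dict[str, dict] = {}   # chunk_id -> {"hash": ..., "text": ..., "version": ...}

def upsert_document(doc_id: str, chunks: list[str], version: int) -> None:
    """Re-index only chunks whose content changed; drop chunks that disappeared."""
    seen = set()
    for i, text in enumerate(chunks):
        chunk_id = f"{doc_id}#{i}"
        seen.add(chunk_id)
        digest = hashlib.sha256(text.encode()).hexdigest()
        if index.get(chunk_id, {}).get("hash") != digest:
            # Re-embed and overwrite only when the text actually changed.
            index[chunk_id] = {"hash": digest, "text": text, "version": version}
    # Deletion: remove chunks that no longer exist in the new document version.
    for chunk_id in [c for c in index if c.startswith(f"{doc_id}#") and c not in seen]:
        del index[chunk_id]

upsert_document("policy-01", ["Refunds take 14 days.", "Gift cards are non-refundable."], version=1)
upsert_document("policy-01", ["Refunds take 30 days."], version=2)
print(sorted(index))   # the stale gift-card chunk is gone; the refund chunk was re-indexed
```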

Latency and cost tradeoffs

A RAG request can include query rewriting, embedding, vector search, lexical search, reranking, prompt assembly, generation, and post-checks. Each step adds latency and cost.

Optimization levers:

  • cache embeddings for repeated queries
  • cache retrieval results for common questions when permissions allow
  • use hybrid retrieval before expensive reranking
  • limit context to high-quality chunks
  • stream generated answers
  • route simple FAQ questions to cheaper paths
  • use smaller models for classification or reranking
  • precompute summaries for long documents
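
One of the cheaper levers above, caching retrieval results for common questions, can be sketched as a TTL cache keyed by the normalized query plus the user's permission scope, so cached evidence never leaks across access boundaries; the key scheme and TTL are assumptions.

```python
import time

CACHE: dict[tuple, tuple[float, list[str]]] = {}
TTL_SECONDS = 300   # short TTL so stale evidence ages out quickly

def cached_retrieve(query: str, permission_scope: frozenset[str], retrieve) -> list[str]:
    """Serve repeated (query, scope) pairs from cache; fall through to retrieval otherwise."""
    key = (query.strip().lower(), permission_scope)
    now = time.monotonic()
    if key in CACHE and now - CACHE[key][0] < TTL_SECONDS:
        return CACHE[key][1]
    results = retrieve(query, permission_scope)
    CACHE[key] = (now, results)
    return results

# Illustrative retriever stub standing in for the real hybrid retrieval call.
def fake_retrieve(query, scope):
    return ["faq-03#4"]

print(cached_retrieve("How long do refunds take?", frozenset({"support"}), fake_retrieve))
print(cached_retrieve("how long do refunds take? ", frozenset({"support"}), fake_retrieve))  # cache hit
```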

In an interview, give a target such as “p95 under two seconds for internal support answers” only if the prompt asks for it. Otherwise, speak in tradeoffs: more reranking improves precision but costs latency; more context improves recall but may confuse generation.

Common RAG interview traps

The biggest trap is treating RAG as “embed documents, retrieve top five, call the LLM.” That is a prototype, not a system.

Other traps:

  • chunking without preserving headings or tables
  • ignoring permissions and tenant isolation
  • relying on vector search for exact dates, IDs, or filters
  • evaluating only final answer thumbs-up instead of retrieval recall
  • passing too much context and hoping the model sorts it out
  • failing to handle conflicting or stale documents
  • not storing source metadata for citations
  • using synthetic questions only and missing real user language
  • forgetting deletion and re-indexing behavior

If asked to debug hallucinations, split the diagnosis: Was the right source retrieved? Was it ranked high enough? Was the chunk complete? Did the prompt include too much irrelevant context? Did the model ignore an abstention rule? Did the source itself conflict with another source?

Interview answer template

For a RAG design prompt, use this sequence:

  1. Clarify corpus, users, permissions, freshness needs, and answer risk.
  2. Design ingestion: parsing, cleaning, chunking, metadata, embeddings, indexing.
  3. Design query path: query understanding, hybrid retrieval, filtering, reranking, prompt assembly, generation.
  4. Define fallback behavior for no evidence, conflicting evidence, or restricted evidence.
  5. Define evals for retrieval and generation separately.
  6. Discuss latency, cost, monitoring, and human review.

A good closing: “I would ship only after retrieval recall is strong on a labeled eval set. Prompt tuning cannot compensate for missing evidence, and answer ratings are hard to interpret unless we know whether the retrieval layer worked.”

Resume and interview language

Good resume bullets name the system parts and measurable quality:

  • “Built a hybrid RAG assistant over 120K support docs with structure-aware chunking, ACL filters, reranking, and citation checks, reducing unsupported answers in evals by 31%.”
  • “Created a retrieval eval set with gold evidence chunks, separating retrieval misses from synthesis errors and improving recall@10 by 18 points.”
  • “Implemented incremental re-indexing and document version metadata so policy updates appeared in search within 15 minutes.”

If you only built a prototype, say what you evaluated. “Built a RAG prototype” is weak. “Compared fixed-size and section-aware chunking on 200 real questions and improved gold-chunk recall@5” is strong.

Prep checklist

Before a RAG system design interview, prepare one diagram of indexing and query paths. Be ready to explain chunking tradeoffs, embedding limitations, hybrid retrieval, reranking, eval sets, hallucination controls, permissions, freshness, latency, and cost. Have a crisp answer for “What do you do when the answer is not in the documents?” The correct answer is not to hallucinate politely. It is to abstain, explain what is missing, and route the user to the next best action.