Most RAG systems fail for a boring reason: the model is asked to answer with the wrong context.
The first version usually looks simple. Split documents into chunks, embed the chunks, store them in a vector database, retrieve the top matches, and pass them to the LLM. That is useful for demos, internal FAQs, and small knowledge bases. It is not enough for production systems where users ask ambiguous questions, documents change, tables matter, exact terms matter, and answers need to be audited.
Advanced RAG is not one technique. It is an architecture around retrieval quality.
The goal is to build a system that can:
- prepare source data without losing meaning
- retrieve both semantic matches and exact matches
- route different queries to different indexes or tools
- decompose complex questions into answerable subquestions
- rerank and compress context before generation
- verify whether the answer is grounded in the retrieved evidence
- measure regressions before users find them
Microsoft's guide to advanced retrieval-augmented generation systems is a useful baseline because it frames RAG as three connected phases: ingestion, inference, and evaluation. That is the right mental model. A production RAG system is not just a vector search call. It is a data pipeline, a query pipeline, and a measurement system.
This guide expands that baseline into the architecture patterns we use when deciding how far a RAG system needs to go.
The simple RAG pattern and why it breaks
Naive RAG usually follows this flow:
- Load documents.
- Split them into chunks.
- Create embeddings.
- Store embeddings in a vector index.
- Embed the user query.
- Retrieve top-K similar chunks.
- Put the chunks into the prompt.
- Generate an answer.
That works when the user's language is close to the indexed text and the answer lives cleanly inside one or two chunks. It breaks when any of the following is true:
- the user asks a vague or multi-part question
- the answer depends on exact identifiers, product names, dates, SKUs, ticket IDs, or legal terms
- the source document has tables, images, slides, code blocks, or nested sections
- the relevant chunk lost context when it was separated from the parent document
- the corpus has multiple domains that need different retrieval strategies
- the top vector matches are semantically related but not actually useful
- the model receives too much context and misses the important evidence
- documents are updated but old chunks remain in the index
In production, RAG quality is usually limited by retrieval and context construction, not by the final model alone.
The production architecture
A practical advanced RAG architecture has four layers:
| Layer | Job | Typical techniques |
|---|---|---|
| Ingestion | Turn raw source material into retrievable knowledge | parsing, normalization, metadata, chunking, contextual chunks, versioning |
| Retrieval | Find a broad candidate set | vector search, BM25, hybrid search, filters, graph retrieval, routers |
| Context assembly | Decide what the model should see | reranking, deduplication, parent-child expansion, compression, citation packing |
| Answer control | Generate, verify, observe, and improve | grounded generation, policy checks, answer validation, evals, feedback loops |
The important part is the separation of responsibilities.
The retriever should maximize recall. The reranker should improve precision. The context builder should fit the most useful evidence into the prompt. The generator should answer from that evidence. The evaluator should tell you which step failed.
When these responsibilities are blurred, teams debug RAG by changing prompts. That is usually the slowest way to fix a retrieval problem.
1. Ingestion: retrieval quality starts before search
Ingestion is where many RAG systems quietly lose accuracy.
Before chunking, the system should normalize and preserve the structure of the source material:
- document title
- source URL or system of record
- author, owner, team, product, customer, or account
- version and last updated time
- section hierarchy
- table captions and headers
- image captions or OCR output
- permissions and visibility rules
- document type
- lifecycle state such as draft, approved, deprecated, or archived
Microsoft's advanced RAG guidance calls out content preprocessing, metadata extraction, chunking strategy, organization, and update strategy as ingestion concerns. That matters because metadata is not decorative. It becomes a retrieval control surface.
For example, a customer support RAG system might need filters for product, region, plan, support tier, document version, and customer entitlement. If those fields are not captured during ingestion, the query pipeline has to guess.
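As a minimal sketch of metadata acting as a retrieval control surface, the snippet below filters chunks on fields captured at ingestion. The `Chunk` fields, the `filter_chunks` helper, and the sample data are illustrative assumptions, not part of any specific vector store API; most stores expose an equivalent filter expression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A chunk plus the metadata captured during ingestion (hypothetical schema)."""
    text: str
    product: str
    region: str
    lifecycle: str  # "draft", "approved", "deprecated", or "archived"
    version: int


def filter_chunks(chunks, *, product, region, allowed_states=("approved",)):
    """Keep only chunks whose metadata matches the query's constraints."""
    return [
        c for c in chunks
        if c.product == product
        and c.region == region
        and c.lifecycle in allowed_states
    ]


# Made-up sample data for illustration only.
chunks = [
    Chunk("Refunds are processed in 5 days.", "billing", "EU", "approved", 3),
    Chunk("Refunds are processed in 10 days.", "billing", "EU", "deprecated", 2),
    Chunk("Refunds are processed in 7 days.", "billing", "US", "approved", 3),
]

eu_billing = filter_chunks(chunks, product="billing", region="EU")
print([c.text for c in eu_billing])
```

If `lifecycle` or `region` had not been stored at ingestion, the deprecated and out-of-region chunks would compete with the correct one on similarity alone.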
Chunking strategy
There is no universal chunk size.
Small chunks improve precision but often lose context. Large chunks preserve context but can dilute retrieval and waste prompt budget. A better approach is to choose chunking based on document shape:
| Source type | Better chunking approach |
|---|---|
| Product docs | heading-aware sections with parent references |
| API docs | endpoint or method-level chunks with code examples attached |
| Contracts | clause-aware chunks with document, party, and effective-date metadata |
| Support tickets | issue, environment, root cause, and resolution blocks |
| Tables | row groups plus table title, column headers, and surrounding explanation |
| Slide decks | slide-level chunks with speaker notes and nearby slide context |
| Codebases | symbol-aware chunks: file, class, function, dependency, and usage context |
In many systems, the winning pattern is not "small chunks" or "large chunks." It is parent-child retrieval.
The index stores smaller child chunks for precise matching, but the prompt receives a larger parent section when the child is selected. Microsoft describes this idea as Small2Big: find a small unit, then give the model nearby or parent context.
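The parent-child idea can be sketched in a few lines. This is a toy version under stated assumptions: word overlap stands in for vector search, and the `parents` and `children` structures, the helper names, and the sample text are all hypothetical.

```python
import re


def words(text):
    """Lowercased word set, used here as a stand-in for an embedding."""
    return set(re.findall(r"\w+", text.lower()))


def search_children(query, children, top_k=2):
    """Match against small child chunks (toy lexical scoring, not real vector search)."""
    q = words(query)
    scored = [(len(q & words(c["text"])), c) for c in children]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


def expand_to_parents(hits, parents):
    """Swap each matched child for its parent section, without repeating a parent."""
    seen, sections = set(), []
    for hit in hits:
        parent_id = hit["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            sections.append(parents[parent_id])
    return sections


# Made-up sample corpus.
parents = {
    "sso": "Single sign-on. Admins can connect a SAML identity provider. "
           "Metadata XML must be uploaded before enabling enforcement.",
    "billing": "Billing. Failed payments are retried for 7 days. "
               "After the grace period the workspace is downgraded.",
}
children = [
    {"parent_id": "sso", "text": "Admins can connect a SAML identity provider."},
    {"parent_id": "sso", "text": "Metadata XML must be uploaded before enabling enforcement."},
    {"parent_id": "billing", "text": "Failed payments are retried for 7 days."},
    {"parent_id": "billing", "text": "After the grace period the workspace is downgraded."},
]

hits = search_children("connect SAML identity provider", children)
contexts = expand_to_parents(hits, parents)
print(contexts)
```

The child chunk wins the match because it is short and specific, while the prompt receives the full section, including the metadata-upload prerequisite that the child alone would have dropped.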
Contextual chunks
Chunking can destroy context. Anthropic's Contextual Retrieval pattern addresses this by adding a short, chunk-specific explanation before embedding and keyword indexing each chunk. The added context tells the retriever what the chunk means inside the larger document.
That is especially useful when a chunk says something like "revenue grew 3%" or "this setting is disabled by default." Without document context, the retriever may not know which company, product, period, or feature the chunk refers to.
Contextual chunks are most useful when:
- documents are long and section-dependent
- the same terms repeat across many products or customers
- chunks contain pronouns or references like "this feature," "the previous quarter," or "the policy"
- exact metadata is missing from the text itself
- you need both semantic retrieval and BM25 keyword retrieval to benefit from the added context
The tradeoff is indexing cost. You spend extra model calls during ingestion to create contextual descriptions. That is usually acceptable for high-value knowledge bases because it moves cost out of the user-facing path.
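A rough sketch of the ingestion step, assuming the contextual description is simply prepended to the chunk before it is embedded and keyword-indexed. `describe_chunk` is a hypothetical stand-in for the model call; here it only formats the title and section so the example runs offline, and the document title is invented.

```python
def describe_chunk(doc_title, section, chunk_text):
    """Hypothetical placeholder for an LLM call that situates a chunk in its document."""
    return f"From '{doc_title}', section '{section}'."


def contextualize(doc_title, section, chunk_text, describe=describe_chunk):
    """Return the text that gets embedded and keyword-indexed: context first, chunk second."""
    context = describe(doc_title, section, chunk_text)
    return f"{context}\n{chunk_text}"


chunk = "Revenue grew 3% over the previous quarter."
indexed_text = contextualize("ACME Corp Q2 2023 earnings report", "Revenue", chunk)
print(indexed_text)
```

Because the same enriched text feeds both the embedding model and the BM25 index, a query that names the company or the quarter can now match a chunk that never mentions either.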
2. Retrieval: do not rely on vector search alone
Vector search is strong at semantic similarity. It is weak when exact terms matter.
Azure AI Search's hybrid search documentation explains the production reason for combining vector and full-text search: vector search helps with conceptual similarity, while keyword search helps with exact matches such as product codes, jargon, dates, and names.
Hybrid retrieval usually works like this:
- Run dense vector search for semantic matches.
- Run sparse keyword search such as BM25 for exact matches.
- Merge candidate lists with a fusion method such as reciprocal rank fusion.
- Apply metadata filters where appropriate.
- Send a larger candidate set to a reranker.
This pattern is useful because the first retrieval stage should be generous. It should gather enough possible evidence so a later precision step can choose the best passages.
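The fusion step can be illustrated with reciprocal rank fusion, where each document scores the sum of `1 / (k + rank)` over every list it appears in. The sketch below assumes each retriever returns an ordered list of document IDs; `k=60` is a commonly used default, and the IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked ID lists into one, scoring each ID by sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]    # dense retrieval order
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # BM25 order

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Rank-based fusion avoids having to normalize BM25 scores against cosine similarities, which live on different scales. Documents found by both retrievers rise above documents found by only one.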
When vector search wins
Vector search is useful when users describe concepts differently from the source text:
- "How do I stop churn risk?" versus "customer health score deterioration"
- "Can users bring their own SSO?" versus "SAML identity provider configuration"
- "What happens if renewal is delayed?" versus "grace period and billing retry policy"
When keyword search wins
Keyword search is useful when the exact term is the signal:
- error codes
- legal clauses
- SKUs
- API names
- customer names
- ticket IDs
- version numbers
- compliance terms
Production RAG usually needs both.
3. Reranking: separate broad recall from final precision
The retriever should not be forced to pick the final context in one step.
A better pipeline retrieves a wider set of candidates, then reranks them with a model that compares the query and document text more directly. Anthropic describes reranking as a filtering step after initial retrieval, where a larger candidate set is scored and reduced before passing context to the model.
The pattern is:
- Retrieve top 50 to 200 candidates with hybrid search.
- Deduplicate near-identical chunks.
- Rerank candidates against the user query.
- Select the best evidence under a token budget.
- Preserve source metadata for citations.
Reranking adds latency and cost, but it often pays for itself because the generation model receives fewer irrelevant chunks. It also makes failures easier to debug: if the relevant chunk was retrieved but not reranked highly, you have a reranking problem; if it never appeared in the candidate set, you have a retrieval or ingestion problem.
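The steps above can be sketched as one function. Several simplifications are assumptions: `overlap_score` is a toy scorer standing in for a cross-encoder or hosted rerank model, deduplication only catches exact matches after case and whitespace normalization, and tokens are estimated by counting whitespace-separated words.

```python
import re


def overlap_score(query, passage):
    """Toy relevance score: fraction of query words found in the passage."""
    q = set(re.findall(r"\w+", query.lower()))
    p = set(re.findall(r"\w+", passage.lower()))
    return len(q & p) / len(q) if q else 0.0


def rerank_and_select(query, candidates, token_budget, score=overlap_score):
    """Deduplicate, rerank, and keep the best candidates that fit the budget."""
    seen, unique = set(), []
    for cand in candidates:
        key = " ".join(cand["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(cand)

    scored = [(score(query, c["text"]), c) for c in unique]
    scored = [pair for pair in scored if pair[0] > 0]   # drop zero-relevance passages
    scored.sort(key=lambda pair: pair[0], reverse=True)

    selected, used = [], 0
    for _, cand in scored:
        cost = len(cand["text"].split())   # crude token estimate
        if used + cost <= token_budget:
            selected.append(cand)          # source metadata travels with the text
            used += cost
    return selected


# Made-up candidates for illustration.
candidates = [
    {"source": "kb/errors.md", "text": "Error E42 means the API key has expired."},
    {"source": "kb/errors-copy.md", "text": "error e42 means the api key has expired."},
    {"source": "kb/intro.md", "text": "Welcome to the product overview."},
    {"source": "kb/billing.md", "text": "An error banner appears when billing fails."},
]

selected = rerank_and_select("What does error E42 mean?", candidates, token_budget=10)
print([c["source"] for c in selected])
```

Keeping `score` as a parameter means the toy scorer can be replaced by a real reranker without touching the deduplication or budgeting logic.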
4. Query transformation: fix the question before retrieval
Users do not write search-optimized questions.
They use acronyms, vague references, incomplete phrasing, and multi-part requests. Query transformation improves the retrieval query before searching.
Common transformations include:
| Technique | What it does | When to use it |
|---|---|---|
| Query rewriting | Rephrases the user's question for search | vague, conversational, or typo-heavy queries |
| Multi-query retrieval | Generates several search variants | when relevant docs use different terminology |
| Step-back prompting | Asks a broader conceptual question first | when the literal query is too narrow |
| HyDE | Generates a hypothetical answer/document, embeds that, then searches | when source language differs from user language |
| Metadata extraction | Converts the question into filters | when date, product, customer, region, or version matters |
| Query decomposition | Breaks a complex question into subqueries | multi-hop or multi-part questions |
LlamaIndex's Query Transform Cookbook includes HyDE as a query rewriting pattern in which a generated hypothetical text is embedded alongside or instead of the raw query. NVIDIA's query decomposition documentation for the RAG Blueprint describes the technique as breaking complex questions into focused subqueries, processing them independently, and synthesizing a final answer.
The practical rule is simple: transform only when it helps. A simple factual query should stay simple. Decomposition and multi-query retrieval add latency, cost, and more places for the system to drift.
5. Query routing: one index rarely fits every question
As the corpus grows, the system needs to decide where to search.
A company knowledge assistant may have:
- product docs
- CRM records
- support tickets
- Slack discussions
- onboarding guides
- API references
- contracts
- analytics tables
- web search
Putting all of that into one vector index creates noisy retrieval. A query router classifies the user's intent and sends the query to the right source or combination of sources.
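Routing can be deterministic, model-based, or embedding-based, and in every case it should respect permissions. Most production routers combine several of these: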
- Deterministic routing: if the query mentions an account ID, search CRM and support systems.
- Model routing: an LLM classifies the query as technical, legal, sales, account, policy, or analytics.
- Embedding routing: compare the query against descriptions of available indexes or tools.
- Permission-aware routing: search only sources the user is allowed to access.
The router should return not just sources, but a reason. That reason is useful for logging, debugging, and human review.
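A minimal deterministic router that returns both sources and a reason might look like the sketch below. The `ACC-<digits>` account ID format, the source names, and the keyword list are hypothetical; a model-based or embedding-based router could return the same `RouteDecision` shape.

```python
import re
from dataclasses import dataclass

ACCOUNT_ID = re.compile(r"\bACC-\d+\b")                        # assumed ID format
API_TERMS = re.compile(r"\b(endpoint|api|sdk)\b", re.IGNORECASE)


@dataclass(frozen=True)
class RouteDecision:
    sources: tuple
    reason: str


def route(query, allowed_sources):
    """Pick sources with simple rules, then drop anything the user may not access."""
    if ACCOUNT_ID.search(query):
        wanted, reason = ("crm", "support_tickets"), "query mentions an account ID"
    elif API_TERMS.search(query):
        wanted, reason = ("api_reference",), "query uses API vocabulary"
    else:
        wanted, reason = ("product_docs",), "no specific signal, using default index"

    permitted = tuple(s for s in wanted if s in allowed_sources)
    if len(permitted) < len(wanted):
        reason += "; some sources dropped by permissions"
    return RouteDecision(permitted, reason)


decision = route("Why is ACC-1042 seeing login failures?", {"crm", "product_docs"})
print(decision)
```

Logging the `reason` string alongside the query makes it possible to audit misroutes later without replaying the request.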
6. GraphRAG: when relationships matter
Some questions are not answered well by isolated chunks.
Examples:
- "Which customers are blocked by the same integration issue?"
- "What themes are emerging across this quarter's support escalations?"
- "Which vendors, controls, and policies are connected to this compliance risk?"
- "What changed in the architecture after the migration?"
GraphRAG adds a relationship layer. Microsoft Research describes GraphRAG as combining text extraction, network analysis, LLM prompting, and summarization. The Microsoft GraphRAG architecture shows an indexing pipeline that loads documents, chunks them, extracts a graph and claims, embeds chunks and entities, detects communities, and generates reports.
GraphRAG is not necessary for every RAG system. It is useful when:
- entities and relationships are the core of the question
- the answer requires corpus-level synthesis, not just passage lookup
- you need to explore clusters, themes, dependencies, or communities
- the same entity appears across many documents with different local context
The tradeoff is operational complexity. You now maintain graph extraction, entity resolution, community summaries, and graph-aware retrieval. Use it when the problem deserves that structure.
7. Agentic RAG: when retrieval is part of reasoning
Fixed RAG retrieves once before generation. Agentic RAG lets the model decide when and how to retrieve while it reasons.
LangChain's retrieval documentation distinguishes 2-step RAG, agentic RAG, and hybrid RAG. In a 2-step system, retrieval always happens before generation. In an agentic system, the agent can call retrieval tools during the interaction. In a hybrid system, you add validation and refinement steps while keeping more control than a fully agentic loop.
Agentic RAG is useful when:
- the user question is exploratory
- the system has multiple tools or sources
- the answer requires follow-up retrieval
- the model needs to inspect one result before deciding what to retrieve next
- the workflow includes validation, correction, or escalation
It is risky when the system needs predictable latency, strict cost bounds, or deterministic compliance behavior. For high-risk domains, a hybrid architecture is often better: the model can request more evidence, but the workflow controls tool access, retry limits, approval gates, and final checks.
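The hybrid idea, where the model may ask for more evidence but the workflow caps how often, can be sketched as a bounded loop. `retrieve` and `next_query` are hypothetical callables: the first wraps whatever retrieval tool is allowed, and the second stands in for the model deciding whether a follow-up search is needed. The knowledge base here is a made-up dictionary.

```python
def bounded_retrieval_loop(question, retrieve, next_query, max_rounds=3):
    """Let the model request follow-up retrievals, but never more than max_rounds."""
    evidence, query, rounds = [], question, 0
    while query is not None and rounds < max_rounds:
        evidence.extend(retrieve(query))
        rounds += 1
        query = next_query(question, evidence)   # None means "enough evidence"
    return evidence, rounds


# Toy stand-ins so the sketch runs without a model or an index.
kb = {
    "outage cause": ["Outage caused by expired certificate cert-7."],
    "cert-7 owner": ["cert-7 is owned by the platform team."],
}


def retrieve(query):
    return kb.get(query, [])


def next_query(question, evidence):
    text = " ".join(evidence)
    if "cert-7" in text and "owned by" not in text:
        return "cert-7 owner"
    return None


evidence, rounds = bounded_retrieval_loop("outage cause", retrieve, next_query)
print(rounds, evidence)
```

The cap on rounds is what gives the workflow a predictable worst-case latency and cost, regardless of what the model asks for.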
8. Context assembly: the prompt is a scarce resource
After retrieval and reranking, the system still has to build the final context.
This step decides:
- which chunks enter the prompt
- how much parent context to include
- whether to compress or summarize evidence
- how to order the evidence
- how to include citations
- how to handle contradictory sources
- how to stay within the model's context window
Do not treat context assembly as string concatenation.
A good context builder should prefer:
- newer approved documents over older drafts
- primary sources over summaries
- source diversity when answering broad questions
- exact matches for identifiers
- parent context when the selected chunk is ambiguous
- clear citation metadata for every evidence block
It should also remove duplicate or near-duplicate evidence. More context is not always better. Irrelevant passages can distract the model and increase cost.
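A small sketch of a context builder that applies some of these preferences. The ordering rule (lifecycle state first, then recency, then reranker score), the `Evidence` fields, and the citation label format are assumptions chosen for illustration; words are used as a crude stand-in for tokens.

```python
from dataclasses import dataclass
from datetime import date

LIFECYCLE_PRIORITY = {"approved": 0, "draft": 1, "deprecated": 2, "archived": 3}


@dataclass(frozen=True)
class Evidence:
    source: str
    text: str
    lifecycle: str
    updated: date
    score: float   # relevance score from the reranker


def build_context(evidence, max_words):
    """Order evidence by trust and recency, drop duplicates, and label each block."""
    ordered = sorted(
        evidence,
        key=lambda e: (
            LIFECYCLE_PRIORITY.get(e.lifecycle, 9),
            -e.updated.toordinal(),
            -e.score,
        ),
    )
    blocks, seen, used = [], set(), 0
    for e in ordered:
        key = " ".join(e.text.lower().split())
        cost = len(e.text.split())
        if key in seen or used + cost > max_words:
            continue
        seen.add(key)
        used += cost
        label = f"[{len(blocks) + 1}] ({e.source}, {e.lifecycle}, updated {e.updated.isoformat()})"
        blocks.append(f"{label}\n{e.text}")
    return "\n\n".join(blocks)


# Made-up evidence for illustration.
items = [
    Evidence("policy-v1.md", "Refund window is 14 days.", "deprecated", date(2022, 1, 10), 0.9),
    Evidence("policy-v2.md", "Refund window is 30 days.", "approved", date(2024, 3, 1), 0.8),
    Evidence("policy-v2-mirror.md", "Refund window is 30 days.", "approved", date(2024, 3, 1), 0.7),
]

context = build_context(items, max_words=8)
print(context)
```

The deprecated policy has the highest reranker score here, yet it is the one left out once the budget is spent. That is the point of ordering on lifecycle before relevance in this sketch.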
9. Answer generation: force grounding without pretending it is guaranteed
The generation prompt should make the contract clear:
- answer only from provided evidence when the question is knowledge-base grounded
- cite the sources used
- say when the evidence is missing or insufficient
- separate facts from assumptions
- preserve uncertainty
- avoid using stale retrieved content when metadata says a newer version exists
But prompts are not enough. The system should validate the answer after generation.
Post-generation checks can include:
- citation coverage: does every factual claim map to evidence?
- contradiction checks: does the answer conflict with retrieved text?
- policy checks: does the answer reveal restricted data or unsafe guidance?
- completeness checks: did the answer address every part of the user question?
- format checks: does the response match the required output shape?
For internal tools, it can be better to show the user a transparent "not enough evidence" answer than to generate a polished guess.
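As one example of a post-generation check, the sketch below measures citation coverage. It assumes the generator was instructed to cite with bracketed numbers such as `[1]` placed before the sentence's final punctuation. It only verifies that each sentence carries a valid citation marker, not that the cited passage actually supports the claim, which would need an entailment or LLM-based check on top.

```python
import re


def citation_coverage(answer, evidence_ids):
    """Return (fraction of sentences with a valid citation, list of uncited sentences)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = []
    for sentence in sentences:
        cited = set(re.findall(r"\[(\d+)\]", sentence))
        # A sentence fails if it has no citation or cites an ID that was never provided.
        if not cited or not cited <= evidence_ids:
            uncited.append(sentence)
    if not sentences:
        return 0.0, uncited
    return (len(sentences) - len(uncited)) / len(sentences), uncited


answer = "Refunds are available for 30 days [1]. Enterprise plans get 60 days."
coverage, uncited = citation_coverage(answer, evidence_ids={"1", "2"})
print(coverage, uncited)
```

A validator could block or flag any answer whose coverage falls below a threshold and fall back to the "not enough evidence" response instead.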
10. Evaluation: advanced RAG needs regression tests
RAG systems degrade quietly.
New documents are added. Old documents stay indexed. Chunking changes. Embedding models change. Reranker thresholds change. User behavior changes. The only way to keep quality stable is to measure it.
Microsoft's guide recommends golden datasets, assessment pipelines, feedback capture, logging, and safeguards. In practice, a useful eval set includes:
- real user questions
- approved answers
- expected source documents
- acceptable alternate phrasings
- metadata such as topic, difficulty, source type, and risk level
- negative examples where the system should refuse or say it does not know
Measure more than final answer quality.
| Metric | What it tells you |
|---|---|
| Recall@K | Did retrieval find the right evidence at all? |
| MRR or NDCG | Did the right evidence rank near the top? |
| Reranker win rate | Did reranking improve the candidate order? |
| Groundedness | Are answer claims supported by retrieved evidence? |
| Citation accuracy | Do citations point to the right source passages? |
| Refusal accuracy | Does the system say "not enough evidence" when appropriate? |
| Latency | Is the pipeline fast enough for the workflow? |
| Cost per answered query | Is the architecture economically sustainable? |
Debugging becomes much easier when each eval case stores the full trace: rewritten query, selected route, filters, retrieved candidates, reranked candidates, final context, generated answer, citations, and validation results.
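The two retrieval metrics at the top of the table are simple to compute once each eval case records the retrieved IDs and the expected source documents. This is a minimal sketch with made-up document IDs; the case structure is an assumption.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the expected documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first expected document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


eval_cases = [
    {"retrieved": ["d3", "d1", "d9"], "relevant": {"d1"}},
    {"retrieved": ["d4", "d5", "d6"], "relevant": {"d6", "d7"}},
]

mean_recall = sum(
    recall_at_k(c["retrieved"], c["relevant"], k=3) for c in eval_cases
) / len(eval_cases)
mrr = sum(
    reciprocal_rank(c["retrieved"], c["relevant"]) for c in eval_cases
) / len(eval_cases)

print(f"Recall@3={mean_recall:.2f}  MRR={mrr:.3f}")
```

Running these on the candidate list before and after reranking gives the reranker win rate a concrete basis: if Recall@K is high but MRR is low, the evidence is being found but ranked too far down.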
Recommended architecture by maturity
Not every team needs GraphRAG, agentic retrieval, and decomposition on day one.
Stage 1: reliable baseline
Use this when the corpus is small and mostly clean.
- structured ingestion
- heading-aware chunking
- embeddings
- metadata filters
- source citations
- basic eval set
Stage 2: production retrieval
Use this when users ask real questions and retrieval misses matter.
- contextual chunks
- hybrid vector plus BM25 retrieval
- broader candidate retrieval
- reranking
- parent-child expansion
- query rewriting for ambiguous questions
- answer validation
- feedback capture
Stage 3: multi-source intelligence
Use this when the system spans many domains or tools.
- query router
- source-specific indexes
- permissions-aware retrieval
- query decomposition for multi-hop questions
- graph retrieval for relationship-heavy domains
- agentic retrieval for exploratory workflows
- golden dataset regression suite
- full observability across every retrieval step
A reference advanced RAG flow
Here is a practical production flow:
- User asks a question.
- Policy and permission checks run first.
- Query classifier decides whether the question is simple, ambiguous, multi-hop, or tool-based.
- Query transformer rewrites, expands, or decomposes the query only if needed.
- Router selects one or more indexes or tools.
- Hybrid retrieval gathers candidates from vector and keyword search.
- Metadata filters remove irrelevant, stale, unauthorized, or wrong-version content.
- Reranker scores the candidate set.
- Context builder deduplicates, expands parent context, compresses if needed, and packs citations.
- Generator answers from the provided evidence.
- Validator checks grounding, citations, policy, and completeness.
- Trace and feedback are stored for evaluation.
This is more complex than a demo. It is also the difference between a chatbot that sounds confident and a system an operator can trust.
Common architecture mistakes
Mistake 1: treating embeddings as the whole retrieval system
Embeddings are one retrieval signal. They are not a complete architecture. Exact matching, metadata, graph relationships, and recency often matter as much as semantic similarity.
Mistake 2: chunking without looking at the documents
Blind token-based chunking destroys structure. Production ingestion should respect headings, tables, code blocks, clauses, and parent-child relationships.
Mistake 3: adding agents before retrieval is measurable
Agentic RAG can improve exploratory workflows, but it also makes behavior less predictable. Add it after you can measure baseline retrieval quality.
Mistake 4: no versioning strategy
If users can receive answers from outdated policies or deprecated docs, the system is not production-ready. Track source versions, timestamps, and lifecycle states.
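A minimal sketch of a version guard, assuming each indexed chunk carries `doc_id`, `version`, and `state` fields (hypothetical names). It keeps only chunks from the newest approved version of each document, which also covers the case where an older version was never marked as retired.

```python
def live_chunks(chunks):
    """Keep only chunks that belong to the newest approved version of each document."""
    latest = {}
    for c in chunks:
        if c["state"] == "approved":
            latest[c["doc_id"]] = max(latest.get(c["doc_id"], 0), c["version"])
    return [
        c for c in chunks
        if c["state"] == "approved" and latest.get(c["doc_id"]) == c["version"]
    ]


# Made-up index contents for illustration.
index = [
    {"doc_id": "refund-policy", "version": 1, "state": "approved", "text": "14 days"},
    {"doc_id": "refund-policy", "version": 2, "state": "approved", "text": "30 days"},
    {"doc_id": "refund-policy", "version": 3, "state": "draft", "text": "45 days"},
    {"doc_id": "sla", "version": 1, "state": "approved", "text": "99.9% uptime"},
]

live = live_chunks(index)
print([(c["doc_id"], c["version"]) for c in live])
```

The same rule can run at query time as a filter or at ingestion time as a cleanup job that removes superseded chunks from the index.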
Mistake 5: no negative evals
RAG evals should include questions the system cannot answer. Otherwise, teams only measure how well the system answers friendly questions.
Bottom line
Advanced RAG is a retrieval architecture, not a bigger prompt.
Start with clean ingestion and measurable retrieval. Add hybrid search when exact terms matter. Add reranking when top-K quality is noisy. Add query transformation when users ask messy questions. Add routing when one index becomes too broad. Add GraphRAG when relationships matter. Add agentic retrieval when the workflow benefits from iterative evidence gathering.
The winning architecture is the smallest one that retrieves the right evidence, shows its work, and fails honestly when the evidence is not there.
For teams building AI systems around private business knowledge, that is the bar: not a chatbot that answers everything, but a workflow that can find, prove, and safely use the right context.
Sources
- Microsoft Learn: Build advanced retrieval-augmented generation systems
- Azure AI Search: Hybrid search using vectors and full text
- Anthropic Engineering: Introducing Contextual Retrieval
- Microsoft GraphRAG: Architecture
- Microsoft Research: Project GraphRAG
- LlamaIndex: Query Transform Cookbook
- NVIDIA: Query Decomposition for NVIDIA RAG Blueprint
- LangChain: Retrieval documentation