Make RAG Trustworthy in Production
- Nandita Krishnan
- Oct 13
- 5 min read
Updated: Oct 13
Two weeks after launch, a policy assistant confidently told employees they could expense $65/day. The actual limit was $75, updated three months earlier. That wasn't a "model hallucination"—it was stale retrieval with no uncertainty handling, quietly shipping bad guidance at scale.
This post shows how to stop that. We'll turn "embed → top-k → paste" into a production-ready loop that adapts its strategy, checks its own evidence, and admits uncertainty when support is weak.
In this post, you’ll build:
Adaptive retrieval: sparse + dense + rerank, chosen per query
Iterative reasoning: retrieve → reflect → refine until support is sufficient (or abstain)
Trusted grounding: citations with timestamps/versions, ACL-aware retrieval, designed abstention
A LangGraph state graph you can paste to orchestrate the loop end-to-end
The Naive Pattern (and Why It Breaks)
Most RAG today looks like this:
# Embed query → similarity search → stuff context → generate
chunks = vector_store.similarity_search(query, k=5)
context = "\n".join(chunks)
answer = llm.generate(f"Context: {context}\n\nQ: {query}")This works for demos. In production, it fails:
Similarity ≠ answerability
Single-shot brittleness
No recency awareness
No abstention path
That's how the expense-policy incident from the opening happened: the assistant used exactly this pattern, the index was stale, and the system had no way to know.
What Actually Changed
1. Adaptive Retrieval
Systems now choose retrieval strategies based on query type:
Sparse search (BM25) for exact term matching (policy names, IDs)
Dense embeddings for semantic similarity
Hybrid with cross-encoder reranking to surface answer spans
# Query classification routes to strategy
plan = classify_query(query) # → lookup, analysis, multi_hop
strategy = select_strategy(plan) # → sparse, dense, hybrid
k = select_k(plan) # dynamic top-k
# Hybrid retrieval + reranking
candidates = retrieve(query, strategy, k=k)
reranked = cross_encoder.rerank(query, candidates, top_n=3)

Why it matters: Hybrid + rerank typically improves answer-bearing precision at low k; use it when you need fewer, better spans.
These choices set the stage; the loop that follows decides whether we've got enough or need another pass.
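The helper functions above (classify_query, select_strategy, select_k) aren't defined in the snippet. As a minimal sketch, a rule-based router might look like this (the patterns, labels, and k values are illustrative assumptions, not a prescription):

import re

def classify_query(query: str) -> str:
    """Rough heuristics: IDs or quoted names suggest lookup; comparisons suggest multi-hop."""
    if re.search(r"\b[A-Z]{2,}-\d+\b", query) or '"' in query:
        return "lookup"        # exact identifiers or quoted policy names
    if " and " in query.lower() or "compare" in query.lower():
        return "multi_hop"     # likely needs decomposition
    return "analysis"

def select_strategy(plan: str) -> str:
    # Exact-match lookups favor sparse; open-ended questions favor hybrid
    return {"lookup": "sparse", "analysis": "hybrid", "multi_hop": "hybrid"}[plan]

def select_k(plan: str) -> int:
    # Fewer, sharper chunks for lookups; more context for multi-hop synthesis
    return {"lookup": 3, "analysis": 8, "multi_hop": 12}[plan]

A rule-based router is also the easiest place to stay deterministic; swap in an LLM classifier only if these rules misroute often.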
2. Iterative Reasoning
After retrieving, good systems reflect:
Is this evidence sufficient?
Are sources current?
Should I decompose this question?
If not, they refine the query and retrieve again.
for attempt in range(max_iterations):
    chunks = retrieve(query, plan)
    reflection = evaluate_sufficiency(query, chunks)
    if reflection.sufficient:
        break
    # Guardrail: respect latency budget
    if elapsed_ms() > 2500:
        break
    query = refine_query(query, reflection.gaps)

This turns retrieval from preprocessing into active reasoning.
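evaluate_sufficiency can be an LLM judge (the reflection pass below shows that variant) or, as a cheaper first gate, a heuristic over reranker scores and timestamps. A sketch, assuming each chunk dict carries score and timestamp fields (both assumptions about your chunk schema):

from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Sufficiency:
    sufficient: bool
    gaps: list[str] = field(default_factory=list)

def evaluate_sufficiency(query: str, chunks: list[dict],
                         min_score: float = 0.5, max_age_days: int = 365) -> Sufficiency:
    gaps = []
    # Keep only chunks that clear the relevance threshold
    strong = [c for c in chunks if c.get("score", 0.0) >= min_score]
    if not strong:
        gaps.append("no chunk passes the relevance threshold")
    # Flag the case where every supporting chunk is old
    cutoff = datetime.now() - timedelta(days=max_age_days)
    if strong and all(datetime.fromisoformat(c["timestamp"]) < cutoff for c in strong):
        gaps.append("all supporting chunks are stale")
    return Sufficiency(sufficient=not gaps, gaps=gaps)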
3. Trusted Grounding
Evidence now carries metadata: source, timestamp, author, version, access permissions.
Three concrete policies:
Always carry provenance (source, timestamp, version, ACL)
Filter by recency when questions are time-sensitive
Abstain with reason on low confidence or conflicts
If support is weak or sources conflict, return uncertainty with citations and offer a next step (broaden timeframe, relax filters, or escalate).
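Recency filtering and ACL enforcement are cheapest to apply at retrieval time, before anything reaches the prompt. A sketch over the chunk schema shown in the next snippet (the thresholds and the time_sensitive flag are assumptions):

from datetime import datetime, timedelta

def filter_chunks(chunks: list[dict], user_groups: set[str],
                  time_sensitive: bool, max_age_days: int = 180) -> list[dict]:
    allowed = []
    cutoff = datetime.now() - timedelta(days=max_age_days)
    for c in chunks:
        # Enforce ACLs at retrieval time, never after generation
        if not user_groups & set(c.get("access_scope", [])):
            continue
        # Drop stale evidence when the question is time-sensitive
        if time_sensitive and datetime.fromisoformat(c["timestamp"]) < cutoff:
            continue
        allowed.append(c)
    return allowed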
# Each chunk has provenance
chunk = {
    "content": "...",
    "source": "Policy_v2.4.pdf",
    "timestamp": "2024-06-03",
    "access_scope": ["finance_team"]
}

# Abstain on low support
if low_support or contradictions:
    return explain_uncertainty(citations)  # no answer > hallucination

# Generate with citations
answer = generate_with_citations(query, chunks)

4. Integration into Orchestration
RAG has dissolved into agent workflows: instead of one big retrieval pass up front, it's called repeatedly in small passes as agents reason through tasks.

Agent loop: classify → retrieve → evaluate → retry or generate
Mental Model: RAG as Context Compiler
The useful reframe: RAG compiles high-level queries into precise, verified context through multiple passes.
Think: parse intent → optimize retrieval → verify support → emit only what's needed—then re-run if gaps remain.
Like a code compiler:
Parse intent (query type, entities, constraints)
Optimize retrieval (select strategy, adjust k, apply filters)
Inline only necessary spans with citations
Verify provenance and contradictions
Emit compact, structured context
And like compilers, it's iterative: profile → refine → re-compile.
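Taken literally, the "compiled" output is a small, typed artifact that the generator consumes instead of raw documents. A sketch of what that artifact might hold (the field names are illustrative, not a standard):

from dataclasses import dataclass

@dataclass
class CompiledContext:
    spans: list[str]        # only the answer-bearing text, not whole documents
    citations: list[dict]   # source, section_id, timestamp, version per span
    confidence: float       # support score from the sufficiency check
    warnings: list[str]     # e.g. "newest citation is 14 months old"

# The generator only ever sees a CompiledContext, never the raw corpus.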
The rule: Agents decide; retrieval supplies. Agents without retrieval are blind. Retrieval without agency is dumb.
Cognitive Layers (A Better Taxonomy)
Instead of "RAG variants," think about what cognitive function you're adding:
Perceptual — Reformulate queries to match corpus structure (query expansion, entity normalization)
Selective — Decide what and how much to retrieve (adaptive k, confidence gating, hybrid strategies)
Reflective — Assess sufficiency and trigger re-queries (self-evaluation, gap detection)
Compositional — Decompose complex questions into subtasks, retrieve per subtask, synthesize
Relational — Connect facts across sources (graph-based retrieval, entity linking)
This vocabulary helps you evolve incrementally by adding layers where failures occur.
Practical Patterns
Corpus Preparation
Chunk at semantic boundaries (sections, not arbitrary tokens); cap at ~300–800 tokens
Attach metadata: source, timestamp, version, ACLs, section_id
Include heading breadcrumbs in metadata for better rerank prompts
Deduplicate but preserve version history
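A minimal sketch of section-boundary chunking that attaches this metadata (the markdown-heading heuristic and the field names are assumptions about your corpus; token-length capping is omitted for brevity):

import re

def chunk_by_sections(doc_text: str, source: str, version: str,
                      timestamp: str, acl: list[str]) -> list[dict]:
    # Split before markdown-style headings; swap in your format's section markers
    sections = [s for s in re.split(r"\n(?=#{1,3} )", doc_text) if s.strip()]
    chunks = []
    for i, section in enumerate(sections):
        heading = section.splitlines()[0].lstrip("# ").strip()
        chunks.append({
            "content": section.strip(),
            "source": source,
            "section_id": f"{source}#s{i}",
            "heading": heading,           # breadcrumb for rerank prompts
            "version": version,
            "timestamp": timestamp,
            "access_scope": acl,
        })
    return chunks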
Hybrid Retrieval + Reranking
from langchain.retrievers import EnsembleRetriever

# Combine BM25 + embeddings; k is set on the underlying retrievers
ensemble = EnsembleRetriever(
    retrievers=[bm25, vector_store.as_retriever(search_kwargs={"k": 10})],
    weights=[0.4, 0.6]
)
candidates = ensemble.invoke(query)

# Cross-encoder reranking for precision at low k
reranked = cross_encoder.rerank(query, candidates, top_n=3)
If cross-encoder latency is too high, start with RRF (Reciprocal Rank Fusion) or a light reranker like monoT5-small; add cross-encoder only on the final shortlist. If recall is 0 after an iteration, broaden k or relax filters and expand entities (synonyms/aliases) before reranking.
Cache reranker features to control latency/cost. Target: retrieval 80-250ms, rerank 150-500ms, p95 end-to-end under 2.5s.
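Reciprocal Rank Fusion itself is only a few lines of plain Python. A sketch, assuming each retriever hands back an ordered list of document IDs:

from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf([bm25_ids, dense_ids])[:10]   # rerank only this shortlist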
Reflection Pass
# After drafting an answer, ask for a structured sufficiency check
# (assumes `llm` is a LangChain chat model that supports structured output)
from pydantic import BaseModel

class Reflection(BaseModel):
    sufficient: bool
    suggested_refinement: str

reflection = llm.with_structured_output(Reflection).invoke(f"""
Query: {query}
Retrieved: {context}
Draft: {answer}
Are citations sufficient and recent? If not, how to refine?
""")

if not reflection.sufficient:
    refined_query = reflection.suggested_refinement
    # Retry with better query

Trust Layer
Surface citations with timestamps for knowledge responses
Prefer span-level citations (section_id + offsets) when your chunker supports it
Detect contradictions between sources and explain conflicts
Enforce ACLs at retrieval time, not after generation
Abstain on low confidence: "Found conflicting evidence in [A], [B]"
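A cheap contradiction signal that costs no LLM calls: check whether retrieval returned more than one version of the same source. A sketch, assuming chunks carry the version field recommended under Corpus Preparation:

from collections import defaultdict

def version_conflicts(chunks: list[dict]) -> dict[str, set[str]]:
    """Map each source to the set of versions retrieved; more than one means a potential conflict."""
    versions = defaultdict(set)
    for c in chunks:
        versions[c["source"]].add(c.get("version", "unversioned"))
    return {src: v for src, v in versions.items() if len(v) > 1}

# conflicts = version_conflicts(chunks)
# if conflicts: cite both versions and abstain from a single answer (explain_uncertainty above)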
With LangGraph
Here's the state graph for agentic RAG:
from langgraph.graph import StateGraph, START, END
from typing import TypedDict
class RAGState(TypedDict):
    query: str
    chunks: list
    sufficient: bool
    answer: str
graph = StateGraph(RAGState)
graph.add_node("plan", plan_retrieval)
graph.add_node("retrieve", hybrid_retrieve)
graph.add_node("reflect", evaluate_context)
graph.add_node("refine", refine_query)
graph.add_node("generate", generate_grounded)
graph.add_edge(START, "plan")
graph.add_edge("plan", "retrieve")
graph.add_edge("retrieve", "reflect")
graph.add_conditional_edges(
"reflect",
lambda s: "generate" if s["sufficient"] else "refine",
{"generate": "generate", "refine": "refine"}
)
graph.add_edge("refine", "retrieve")
graph.add_edge("generate", END)
app = graph.compile()

The graph handles iteration. The system decides when to retry, when to generate, and when to give up.

The conditional edge routes to generate or refine based on the sufficiency check
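Each node is a plain function that takes the state and returns only the fields it updates. A sketch of the reflect node, reusing the evaluate_sufficiency helper sketched earlier (that pairing is an assumption, not part of the LangGraph API):

def evaluate_context(state: RAGState) -> dict:
    """Reflect node: decide whether the retrieved chunks support the query."""
    verdict = evaluate_sufficiency(state["query"], state["chunks"])
    # Returning a partial dict updates just these keys in the graph state
    return {"sufficient": verdict.sufficient}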
Observability: instrument every trace so runs are reproducible and regressions alert:
Log query, strategy, k, filters, latency per stage, support_score from the sufficiency judge, age_of_newest_citation_days, reranker_model, and reranker_seed (if supported)
Store the chosen spans and citations; record abstention reasons
Alert on latency > SLO, support below threshold, or conflict spikes
Consider LangSmith tracing for end-to-end observability
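A sketch of a per-query trace record covering those fields (the function and field names are illustrative; adapt to your logging stack):

import json
import logging

logger = logging.getLogger("rag")

def log_trace(query: str, strategy: str, k: int, latency_ms: dict,
              support_score: float, newest_citation_age_days: int,
              reranker_model: str, citations: list[str],
              abstained: bool = False, abstention_reason: str | None = None) -> None:
    """Emit one structured record per query for dashboards and alerting."""
    record = {
        "query": query,
        "strategy": strategy,                        # sparse | dense | hybrid
        "k": k,
        "latency_ms": latency_ms,                    # per-stage timings
        "support_score": support_score,              # from the sufficiency judge
        "age_of_newest_citation_days": newest_citation_age_days,
        "reranker_model": reranker_model,
        "citations": citations,                      # chosen span/section ids
        "abstained": abstained,
        "abstention_reason": abstention_reason,
    }
    logger.info(json.dumps(record))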
Measuring What Matters
Traditional metrics (precision@k, NDCG) don't capture real quality. Measure groundedness:
Citation support: Does the retrieved evidence back claims?
Context relevance: Did retrieval return answer-bearing spans?
Answerability rate: % answered confidently vs uncertain
Faithfulness: No unsupported extrapolations
# LLM-as-judge for groundedness
score = evaluate_groundedness(
answer=answer,
citations=citations,
retrieved=chunks
)
# Returns 0.0-1.0: how well the answer is supported by retrieved chunks
Caveat: Use LLM judges, but calibrate with periodic human spot-checks and pairwise comparisons; recent surveys highlight bias and variance risks. Ensure no train/test or corpus leakage—hold out documents and validate groundedness only against the retrieved spans.
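evaluate_groundedness isn't a library call here; a minimal LLM-as-judge sketch using structured output (the prompt wording and the 0–1 scale are assumptions, and llm is assumed to be a LangChain chat model as elsewhere in this post):

from pydantic import BaseModel, Field

class Groundedness(BaseModel):
    score: float = Field(ge=0.0, le=1.0, description="Fraction of claims supported by the retrieved chunks")
    unsupported_claims: list[str]

def evaluate_groundedness(answer: str, citations: list, retrieved: list) -> float:
    # Judge strictly against the retrieved evidence, not the model's prior knowledge
    judge = llm.with_structured_output(Groundedness)
    verdict = judge.invoke(
        "Judge only against the retrieved text below; ignore prior knowledge.\n"
        f"Retrieved chunks:\n{retrieved}\n\nAnswer:\n{answer}\n\nCitations: {citations}"
    )
    return verdict.score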
Three Myths
"Large context windows killed RAG."
In most production settings, selective retrieval improves cost and latency and reduces noise. Even with large context windows, research shows that models can struggle with relevant information buried in long contexts (Lost in the Middle). Retrieval is optimization, not a workaround for a limited context.
"Agents replaced RAG
"LangGraph agents invoke retrievers repeatedly—just more carefully. Orchestration improved; the need for evidence didn't vanish.
"RAG = vector database"
Vector DBs are one component. Modern retrieval means hybrid search, reranking, metadata filters, and provenance tracking.
When to Use This
Good fits:
Frequently updated knowledge bases
Multi-document synthesis
Compliance/audit requirements
Need for citations and provenance
Maybe not:
Simple FAQ lookups (naive RAG is fine)
Static documents
Extreme latency needs (<500ms)
Try It
Build it yourself with the official tutorials:
Agentic RAG Tutorial — Step-by-step implementation
LangGraph How-Tos — Agent patterns and graph API examples
Retrievers Concepts — Deep dive on retrieval strategies
The tutorial includes an evaluation harness, cost profiling, and before/after comparisons.



