
Make RAG Trustworthy in Production

  • Writer: Nandita Krishnan
  • Oct 13
  • 5 min read

Updated: Oct 13

Two weeks after launch, a policy assistant confidently told employees they could expense $65/day. The actual limit was $75, updated three months earlier. That wasn't a "model hallucination"; it was stale retrieval with no uncertainty handling, quietly shipping bad guidance at scale.

This post shows how to stop that. We'll turn "embed → top-k → paste" into a production-ready loop that adapts its strategy, checks its own evidence, and admits uncertainty when support is weak.


In this post, you’ll build:

  • Adaptive retrieval: sparse + dense + rerank, chosen per query

  • Iterative reasoning: retrieve → reflect → refine until support is sufficient (or abstain)

  • Trusted grounding: citations with timestamps/versions, ACL-aware retrieval, designed abstention

  • A LangGraph state graph you can paste to orchestrate the loop end-to-end

The Naive Pattern (and Why It Breaks)

Most RAG today looks like this:


# Embed query → similarity search → stuff context → generate
chunks = vector_store.similarity_search(query, k=5)
context = "\n".join(doc.page_content for doc in chunks)
answer = llm.invoke(f"Context: {context}\n\nQ: {query}")

This works for demos. In production, it fails:

  • Similarity ≠ answerability

  • Single-shot brittleness

  • No recency awareness

  • No abstention path

A team shipped a policy assistant with this pattern. Two weeks in, it quoted last year's expense policy with total confidence. The index was stale; the system didn't know.


What Actually Changed

1. Adaptive Retrieval

Systems now choose retrieval strategies based on query type:

  • Sparse search (BM25) for exact term matching (policy names, IDs)

  • Dense embeddings for semantic similarity

  • Hybrid with cross-encoder reranking to surface answer spans

# Query classification routes to strategy
plan = classify_query(query)  # → lookup, analysis, multi_hop
strategy = select_strategy(plan)  # → sparse, dense, hybrid
k = select_k(plan)  # dynamic top-k

# Hybrid retrieval + reranking
candidates = retrieve(query, strategy, k=k)
reranked = cross_encoder.rerank(query, candidates, top_n=3)

Why it matters: Hybrid + rerank typically improves answer-bearing precision at low k; use it when you need fewer, better spans.


These choices set the stage; the loop that follows decides whether we've got enough or need another pass.


2. Iterative Reasoning

After retrieving, good systems reflect:

  • Is this evidence sufficient?

  • Are sources current?

  • Should I decompose this question?

If not, they refine the query and retrieve again.


for attempt in range(max_iterations):
    chunks = retrieve(query, plan)
    reflection = evaluate_sufficiency(query, chunks)

    if reflection.sufficient:
        break

    # Guardrail: respect latency budget
    if elapsed_ms() > 2500:
        break

    query = refine_query(query, reflection.gaps)

This turns retrieval from preprocessing into active reasoning.


3. Trusted Grounding

Evidence now carries metadata: source, timestamp, author, version, access permissions.

Three concrete policies:

  1. Always carry provenance (source, timestamp, version, ACL)

  2. Filter by recency when questions are time-sensitive

  3. Abstain with reason on low confidence or conflicts

If support is weak or sources conflict, return uncertainty with citations and offer a next step (broaden timeframe, relax filters, or escalate).


# Each chunk has provenance
chunk = {
    "content": "...",
    "source": "Policy_v2.4.pdf",
    "timestamp": "2024-06-03",
    "access_scope": ["finance_team"]
}

# Abstain on low support
if low_support or contradictions:
    return explain_uncertainty(citations)  # no answer > hallucination

# Generate with citations
answer = generate_with_citations(query, chunks)

4. Integration into Orchestration

RAG has dissolved into agent workflows: it's invoked repeatedly, in small passes, as agents reason through tasks.



Agent loop: classify → retrieve → evaluate → retry or generate


Mental Model: RAG as Context Compiler

The useful reframe: RAG compiles high-level queries into precise, verified context through multiple passes.


Think: parse intent → optimize retrieval → verify support → emit only what's needed—then re-run if gaps remain.


Like a code compiler:

  • Parse intent (query type, entities, constraints)

  • Optimize retrieval (select strategy, adjust k, apply filters)

  • Inline only necessary spans with citations

  • Verify provenance and contradictions

  • Emit compact, structured context


And like compilers, it's iterative: profile → refine → re-compile.
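
A minimal sketch of that compile loop, with hypothetical helpers (parse_intent, optimize_retrieval, verify_support, refine_plan, emit_context) standing in for the components described above:


# Hypothetical "context compiler" pass; the helpers are placeholders
# for the retrieval, verification, and formatting steps above.
def compile_context(query: str, max_passes: int = 3):
    intent = parse_intent(query)               # query type, entities, constraints
    plan = optimize_retrieval(intent)          # strategy, k, metadata filters

    for _ in range(max_passes):
        spans = retrieve(query, plan)          # candidate evidence spans
        report = verify_support(query, spans)  # provenance + contradiction check
        if report.sufficient:
            # Inline only answer-bearing spans, with citations attached
            return emit_context(spans, report.citations)
        plan = refine_plan(plan, report.gaps)  # profile → refine → re-compile

    return emit_uncertainty(report.gaps)       # abstain rather than guess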


The rule: Agents decide; retrieval supplies. Agents without retrieval are blind. Retrieval without agency is dumb.

Cognitive Layers (A Better Taxonomy)

Instead of "RAG variants," think about what cognitive function you're adding:


  • Perceptual — Reformulate queries to match corpus structure (query expansion, entity normalization)

  • Selective — Decide what and how much to retrieve (adaptive k, confidence gating, hybrid strategies)

  • Reflective — Assess sufficiency and trigger re-queries (self-evaluation, gap detection)

  • Compositional — Decompose complex questions into subtasks, retrieve per subtask, synthesize

  • Relational — Connect facts across sources (graph-based retrieval, entity linking)


This vocabulary helps you evolve incrementally by adding layers where failures occur.
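
For example, a perceptual layer can be nothing more than a rewrite step in front of your existing retriever. A sketch, where normalize_entities and expand_query are hypothetical helpers (say, an entity alias map and a synonym table for your corpus):


# Illustrative perceptual layer: reformulate the query before retrieval.
def perceptual_layer(query: str) -> list[str]:
    canonical = normalize_entities(query)   # "T&E policy" → "Travel & Expense Policy"
    variants = expand_query(canonical)      # add synonyms / aliases
    return [canonical, *variants]

# Retrieve over every reformulation, then dedupe and rerank downstream
queries = perceptual_layer("what's the T&E daily limit?")
candidates = [c for q in queries for c in retrieve(q, plan)]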


Practical Patterns

Corpus Preparation

  • Chunk at semantic boundaries (sections, not arbitrary tokens); cap at ~300–800 tokens

  • Attach metadata: source, timestamp, version, ACLs, section_id

  • Include heading breadcrumbs in metadata for better rerank prompts

  • Deduplicate but preserve version history

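As a sketch of what that preparation can look like (the splitter, field names, and helpers here are assumptions, not a prescribed schema):


# Sketch: chunk at section boundaries and attach provenance metadata.
# split_into_sections and dedupe_keep_versions are hypothetical helpers.
def prepare_corpus(doc, version: str, acl: list[str]) -> list[dict]:
    chunks = []
    for section in split_into_sections(doc.text, max_tokens=800):
        chunks.append({
            "content": section.text,
            "source": doc.filename,
            "section_id": section.id,
            "breadcrumbs": section.heading_path,  # e.g. "Travel > Per Diem"
            "timestamp": doc.last_modified,
            "version": version,
            "access_scope": acl,
        })
    return dedupe_keep_versions(chunks)           # dedupe, keep version history
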

Hybrid Retrieval + Reranking

from langchain.retrievers import EnsembleRetriever

# Combine BM25 + embeddings (each retriever configured with its own k)
ensemble = EnsembleRetriever(
    retrievers=[bm25, vector_store.as_retriever(search_kwargs={"k": 10})],
    weights=[0.4, 0.6]
)

candidates = ensemble.invoke(query)

# Cross-encoder reranking for precision at low k
reranked = cross_encoder.rerank(query, candidates, top_n=3)

If cross-encoder latency is too high, start with RRF (Reciprocal Rank Fusion) or a light reranker like monoT5-small; add cross-encoder only on the final shortlist. If recall is 0 after an iteration, broaden k or relax filters and expand entities (synonyms/aliases) before reranking.
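
RRF itself is only a few lines. A minimal version over rank-ordered lists of document IDs (60 is the commonly used constant; bm25_ids and dense_ids are assumed to be the ranked results from each retriever):


# Minimal Reciprocal Rank Fusion over rank-ordered lists of document IDs.
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([bm25_ids, dense_ids])[:10]  # shortlist for the cross-encoder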

Cache reranker features to control latency/cost. Target: retrieval 80-250ms, rerank 150-500ms, p95 end-to-end under 2.5s.


Reflection Pass

# After drafting an answer, ask for a structured sufficiency judgment.
# Reflection is assumed to be a Pydantic model with `sufficient`
# and `suggested_refinement` fields.
reflection = llm.with_structured_output(Reflection).invoke(f"""
Query: {query}
Retrieved: {context}
Draft: {answer}

Are citations sufficient and recent? If not, how to refine?
""")

if not reflection.sufficient:
    refined_query = reflection.suggested_refinement
    # Retry with better query

Trust Layer

  • Surface citations with timestamps for knowledge responses

  • Prefer span-level citations (section_id + offsets) when your chunker supports it

  • Detect contradictions between sources and explain conflicts

  • Enforce ACLs at retrieval time, not after generation

  • Abstain on low confidence: "Found conflicting evidence in [A], [B]"

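A sketch of enforcing the last two points at retrieval time; the filter format and helpers (months_ago, conflicting_versions, citations_for) are illustrative, not any specific vector store's API:


# Apply ACL and recency constraints as retrieval-time filters.
def retrieve_trusted(query: str, user_groups: list[str], time_sensitive: bool):
    filters = {"access_scope": {"any_of": user_groups}}  # ACL before generation
    if time_sensitive:
        filters["timestamp"] = {"gte": months_ago(12)}   # recency gate
    chunks = retrieve(query, plan, filters=filters)

    if conflicting_versions(chunks):
        # Surface the conflict instead of silently picking one version
        return explain_uncertainty(citations_for(chunks))
    return chunks
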

With LangGraph

Here's the state graph for agentic RAG:


from langgraph.graph import StateGraph, START, END
from typing import TypedDict

class RAGState(TypedDict):
    query: str
    chunks: list
    sufficient: bool
    answer: str

graph = StateGraph(RAGState)

graph.add_node("plan", plan_retrieval)
graph.add_node("retrieve", hybrid_retrieve)
graph.add_node("reflect", evaluate_context)
graph.add_node("refine", refine_query)
graph.add_node("generate", generate_grounded)

graph.add_edge(START, "plan")
graph.add_edge("plan", "retrieve")
graph.add_edge("retrieve", "reflect")
graph.add_conditional_edges(
    "reflect",
    lambda s: "generate" if s["sufficient"] else "refine",
    {"generate": "generate", "refine": "refine"}
)

graph.add_edge("refine", "retrieve")
graph.add_edge("generate", END)

app = graph.compile()

The graph handles iteration. The system decides when to retry, when to generate, when to give up.



The conditional edge routes to "generate" or "refine" based on the sufficiency check.

Observability: Log query, strategy, k, filters, latency per stage, support_score from sufficiency judge, age_of_newest_citation_days, reranker_model, and reranker_seed (if supported) for reproducible traces. Store chosen spans and citations. Record abstention reasons. Alert on: latency > SLO, support below threshold, or conflict spikes. Consider using LangSmith tracing for end-to-end observability.
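
A minimal shape for that per-pass trace record, using only the standard library (the state keys follow the fields listed above and are assumptions about your pipeline's state):


import json
import logging

# One structured log record per retrieval pass.
def log_retrieval_pass(state: dict) -> None:
    logging.info(json.dumps({
        "query": state["query"],
        "strategy": state["strategy"],
        "k": state["k"],
        "filters": state.get("filters"),
        "latency_ms_per_stage": state["latency_ms_per_stage"],
        "support_score": state["support_score"],  # from sufficiency judge
        "age_of_newest_citation_days": state["age_of_newest_citation_days"],
        "reranker_model": state.get("reranker_model"),
        "abstention_reason": state.get("abstention_reason"),
        "citations": state.get("citations"),
    }))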


Measuring What Matters

Traditional retrieval metrics (precision@k, NDCG) measure ranking quality, not whether the final answer is actually supported. Measure groundedness:

  • Citation support: Does the retrieved evidence back claims?

  • Context relevance: Did retrieval return answer-bearing spans?

  • Answerability rate: % answered confidently vs uncertain

  • Faithfulness: No unsupported extrapolations


# LLM-as-judge for groundedness

score = evaluate_groundedness(
    answer=answer,
    citations=citations,
    retrieved=chunks
)
# Returns 0.0-1.0: how well the answer is supported by retrieved chunks

Caveat: Use LLM judges, but calibrate with periodic human spot-checks and pairwise comparisons; recent surveys highlight bias and variance risks. Ensure no train/test or corpus leakage—hold out documents and validate groundedness only against the retrieved spans.


Three Myths

"Large context windows killed RAG."

In most production settings, selective retrieval improves cost and latency and reduces noise. Even with large context windows, research shows that models can struggle with relevant information buried in long contexts (Lost in the Middle). Retrieval is optimization, not a workaround for a limited context.


"Agents replaced RAG

"LangGraph agents invoke retrievers repeatedly—just more carefully. Orchestration improved; the need for evidence didn't vanish.


"RAG = vector database"

Vector DBs are one component. Modern retrieval means hybrid search, reranking, metadata filters, and provenance tracking.

When to Use This

Good fits:

  • Frequently updated knowledge bases

  • Multi-document synthesis

  • Compliance/audit requirements

  • Need for citations and provenance


Maybe not:

  • Simple FAQ lookups (naive RAG is fine)

  • Static documents

  • Extreme latency needs (<500ms)

Try It

Build it yourself with the official tutorials:

The tutorials include an evaluation harness, cost profiling, and before/after comparisons.

