Failure Modes

When RAG breaks, and how to diagnose and fix retrieval and generation problems

RAG systems fail in predictable ways. Understanding these failure modes helps you diagnose issues quickly and apply targeted fixes. When an answer is wrong, is it a retrieval problem or a generation problem?

The Blame Game: Retrieval vs Generation

Diagnose: Retrieval or Generation?

Scenario: a wrong answer is produced.

  • Query: "What is the maximum file size limit?"
  • Wrong answer: "The maximum file size is 10MB"
  • Actual answer: "The maximum file size is 100MB"

When RAG produces a wrong answer, the cause falls into one of two categories.

Retrieval failure means the relevant documents were not retrieved. The LLM never saw the right information—it could not have answered correctly because the answer was never in its context.

Generation failure means the relevant documents were retrieved, but the LLM produced a wrong answer anyway. The information was there; the model just did not use it correctly.

The diagnosis method is straightforward: look at the retrieved passages. Do they contain the correct answer? If not, you have a retrieval problem. If they do contain the answer but the output is wrong, you have a generation problem. This distinction is crucial because the fixes differ completely.
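When you have a labeled expected answer, this first classification step can be partially automated. A minimal sketch; the substring check is a simplification, and production systems might use fuzzy or semantic matching instead:

```python
def classify_failure(expected_answer: str, retrieved_passages: list[str]) -> str:
    """Classify a wrong answer: did retrieval surface the correct answer?

    Substring matching is a simplification; fuzzy or semantic matching
    would be more robust in practice.
    """
    found = any(expected_answer.lower() in p.lower() for p in retrieved_passages)
    return "generation_failure" if found else "retrieval_failure"

passages = [
    "Uploads are capped at 100MB per file.",
    "Supported formats include PDF and DOCX.",
]
# The correct answer (100MB) was retrieved, so a wrong output
# would point at generation, not retrieval.
print(classify_failure("100MB", passages))  # generation_failure
```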

Common Failure Modes

Vocabulary mismatch

  • Symptom: query terms differ from document terms
  • Fix: hybrid search, query expansion

Out of domain

  • Symptom: no relevant docs exist
  • Fix: admit uncertainty, expand corpus

Chunking issues

  • Symptom: answer split across chunks
  • Fix: overlap, hierarchical retrieval

Retrieval Failures

Vocabulary mismatch occurs when the query uses different words than the relevant document. A user asks "How to fix slow code" but the relevant document discusses "Performance optimization techniques." No keyword overlap exists, and the embedding model may not bridge the semantic gap. The fix is query expansion or hybrid search that combines BM25 keyword matching with dense retrieval.
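One common way to combine BM25 with dense retrieval is reciprocal rank fusion (RRF), which merges ranked lists without needing to calibrate the two incompatible score scales. A sketch (the document IDs are hypothetical):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked result lists (e.g. BM25 and dense retrieval).

    Each document scores 1 / (k + rank) per list it appears in; k=60 is
    the value commonly used in the RRF literature.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_perf_tips", "doc_style_guide", "doc_faq"]
dense_hits = ["doc_profiling", "doc_perf_tips", "doc_faq"]
# doc_perf_tips ranks first: it appears near the top of both lists.
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

Documents found by both retrievers rise to the top even when neither retriever alone ranked them first.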

Out-of-domain queries happen when users ask about topics not covered in your corpus. The system retrieves tangentially related but wrong documents—something that mentions similar words but answers a different question. The fix involves admitting uncertainty when retrieval confidence is low, expanding the corpus, or detecting out-of-domain queries explicitly.

Chunking artifacts arise when the answer spans two chunks and neither chunk alone is sufficient. The chunking boundary fell in the wrong place. The fix is overlapping chunks so information appears in multiple chunks, hierarchical retrieval that can return parent documents, or simply larger chunks.
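A minimal character-based overlapping chunker illustrates the overlap fix; a token- or sentence-aware splitter would be preferable in practice:

```python
def chunk_with_overlap(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the
    last `overlap` characters of the previous one, so information near
    a boundary appears in both neighboring chunks."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With `chunk_size=200` and `overlap=50`, a sentence that straddles a boundary at position 200 is fully contained in the second chunk, which starts at position 150.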

Embedding model mismatch happens when your embedding model was trained on general text but your corpus is highly technical. The semantic space does not represent domain-specific concepts well—technical terms that should be similar end up far apart. The fix is using a domain-specific embedding model or fine-tuning on your data.

Ambiguous queries like "Apple support" could refer to the company, the fruit, or Apple Records. Wrong disambiguation leads to wrong retrieval. The fix involves query clarification (asking the user), entity disambiguation using context, or leveraging user history.

Generation Failures

Ignoring context occurs when the LLM has strong prior beliefs that override the retrieved context. The model might say "According to the context..." and then generate from its own knowledge instead. Larger models with more parametric knowledge are more prone to this. The fix is stronger grounding instructions that emphasize using only the provided context, or using smaller models with less confident priors.

Lost in the middle is a well-documented phenomenon where the relevant passage sits in the middle of the context but the LLM attends primarily to the beginning and end. The answer is right there, but the model does not see it. The fix is reordering passages to put the most relevant at the beginning or end, or reducing total context length so there is no "middle" to get lost in.
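The reordering fix can be sketched as follows: given passages already sorted by relevance, alternate them between the front and the back of the context so the best passages sit at the edges, where attention is strongest:

```python
def reorder_for_edges(passages_by_relevance: list[str]) -> list[str]:
    """Place the most relevant passages at the beginning and end of the
    context, pushing the least relevant toward the middle."""
    front, back = [], []
    for i, passage in enumerate(passages_by_relevance):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

# Most relevant first and second-most relevant last; the weakest
# passage (p5) lands in the middle.
print(reorder_for_edges(["p1", "p2", "p3", "p4", "p5"]))
# ['p1', 'p3', 'p5', 'p4', 'p2']
```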

Contradictory sources arise when multiple retrieved passages disagree. The LLM may pick one (possibly the wrong one) or confuse them into an incoherent answer. The fix involves ranking sources by quality so the most reliable comes first, or instructing the model explicitly to note contradictions rather than resolve them silently.

Hallucinated citations happen when the LLM claims "[Source: X]" but source X does not actually say that. The model fabricates citations that sound plausible. The fix is programmatic verification—check that quoted text actually appears in the cited source—or using structured output that can be validated.
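A sketch of programmatic citation verification using exact case-insensitive matching; real systems often need fuzzy matching to tolerate minor paraphrasing:

```python
def verify_citation(quoted_text: str, sources: dict[str, str], source_id: str) -> bool:
    """Check that a quoted passage actually appears in the cited source.

    Returns False for unknown source IDs as well as missing quotes.
    """
    source = sources.get(source_id, "")
    return quoted_text.strip().lower() in source.lower()

docs = {"upload-guide": "The maximum file size is 100MB for uploads."}
print(verify_citation("the maximum file size is 100MB", docs, "upload-guide"))  # True
print(verify_citation("the maximum file size is 10MB", docs, "upload-guide"))   # False
```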

Instruction following failure manifests when you tell the model to say "I don't know" if unsure, but it answers anyway with fabricated information. Some models are optimized to be helpful to the point of confabulation. The fix involves prompt engineering, fine-tuning specifically for instruction following, or output validation that detects and filters problematic responses.

Diagnostic Workflow

Debug workflow:

  1. Reproduce: get the exact failing query and response
  2. Check retrieval: examine the retrieved passages
  3. Classify: retrieval or generation failure?
  4. Investigate: find the root cause
  5. Fix: apply a targeted solution
  6. Add a test: prevent regression

Step 1: Reproduce. Get the exact query and answer that failed. You cannot debug what you cannot reproduce.

Step 2: Check retrieval. What passages were retrieved? Do they contain the answer? This is the crucial classification step.

Step 3: Classify failure. Is this a retrieval problem or a generation problem? The answer determines everything that follows.

Step 4: Investigate root cause. For retrieval failures, check similarity scores and run different query phrasings to understand why the right document was not found. For generation failures, examine the prompt and try different models to understand why the model did not use the available information.

Step 5: Apply fix. Use a targeted fix for the specific failure mode. Then validate that your fix does not break other cases—regressions are easy to introduce.

Step 6: Add to test set. Every fixed failure becomes a test case. Prevent regression by adding it to your evaluation suite.

Failure Detection Systems

Automated failure detection: example dashboard signals

  • Retrieval confidence: 0.45 (low similarity scores)
  • Citation accuracy: 78% (most citations verified)
  • User satisfaction: 3.2/5 (below target of 4.0)
  • Uncertainty language: 12% (model hedging detected)

Alerts detected:

  • Low retrieval confidence may indicate vocabulary mismatch
  • User satisfaction below target—investigate recent queries

Low retrieval confidence can be detected automatically. If all retrieved passages have low similarity scores, the system likely found nothing relevant. This is a signal to either admit uncertainty or try alternative retrieval strategies.
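A minimal detector for this signal; the 0.6 threshold is an assumption you would tune against your own retriever's score distribution:

```python
def retrieval_confident(similarity_scores: list[float], threshold: float = 0.6) -> bool:
    """Return True only if at least one retrieved passage clears the
    similarity threshold. The 0.6 default is illustrative; calibrate it
    on your own embedding model's score distribution."""
    return bool(similarity_scores) and max(similarity_scores) >= threshold

print(retrieval_confident([0.45, 0.41, 0.38]))  # False — admit uncertainty
print(retrieval_confident([0.72, 0.41]))        # True
```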

High uncertainty in generation manifests as hedging language ("It might be...", "I'm not sure but...") or low token probabilities. These signals indicate the model is uncertain and the answer may be unreliable.
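Hedging language can be flagged with a simple pattern list; the patterns below are illustrative, not exhaustive:

```python
import re

# Illustrative hedge phrases — extend this list from your own logs.
HEDGE_PATTERNS = [
    r"\bmight be\b",
    r"\bnot sure\b",
    r"\bpossibly\b",
    r"\bit could be\b",
    r"\bi think\b",
]

def contains_hedging(answer: str) -> bool:
    """Flag answers that use uncertain language and may be unreliable."""
    return any(re.search(p, answer, re.IGNORECASE) for p in HEDGE_PATTERNS)

print(contains_hedging("It might be 10MB, but I'm not sure."))  # True
print(contains_hedging("The maximum file size is 100MB."))      # False
```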

Contradiction detection catches cases where the answer contradicts the retrieved passages. If the LLM says one thing but the sources say another, something went wrong in generation.

Citation verification failure is detectable by checking whether claimed citations actually match source text. If the model claims "[Source: X]" but X does not contain that information, flag the response.

User feedback signals like thumbs down, follow-up questions that rephrase the original, or explicit corrections all indicate failures. These are noisy but valuable signals for improvement.

Monitoring in Production

Track metrics that signal failures before users complain.

Retrieval health includes average top-k similarity scores over time and the fraction of queries with no high-confidence results. If retrieval quality degrades, it shows up here first.

Generation quality is harder to measure automatically. User satisfaction ratings provide direct signal. Follow-up question rate indicates confusion—if users frequently rephrase, they are not getting good answers. Citation accuracy rate measures how often claimed citations actually match sources.

Latency distribution serves as a proxy for health. Sudden increases may indicate problems like index corruption or resource exhaustion. Track P99 latency, not just average, for reliability insights.
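P99 can be computed from a window of latency samples with the nearest-rank method; at scale you would use a streaming sketch (for example, t-digest) rather than storing every sample:

```python
import math

def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile over a batch of samples.

    Fine for a sliding window of recent latencies; use a streaming
    estimator in high-volume production pipelines.
    """
    ordered = sorted(samples)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies_ms = list(range(1, 101))  # stand-in for observed latencies
print(percentile(latencies_ms, 99))  # 99
print(percentile(latencies_ms, 50))  # 50
```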

Error rates—failed generations, timeouts, empty responses—should be near zero. Any increase warrants investigation.

Recovery Strategies

When failure is detected, have a plan.

Graceful degradation means being honest about uncertainty. Instead of a confident wrong answer, return "I'm not confident in this answer. Here's what I found:" followed by the sources without synthesis. Users can at least see the raw material.

Fallback to web search expands scope when the internal corpus fails. If you cannot find the answer internally, try broader search. Clearly label results as coming from external sources so users know the provenance.

Human escalation routes difficult queries to human experts. This serves two purposes: users get accurate answers, and you get training data for improving the system.

Retry with modifications attempts automatic recovery. Rephrase the query using synonyms or expansions. Increase retrieval k to get more candidate documents. Try a different model that might handle the case better.
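The retry loop might look like this sketch, where `retrieve` and `generate` are hypothetical callables standing in for your pipeline and the 0.6 confidence threshold is an assumption:

```python
def answer_with_retries(query, retrieve, generate, start_k=5, max_k=20, threshold=0.6):
    """Widen retrieval before giving up.

    `retrieve(query, k)` is assumed to return (passages, similarity_scores);
    `generate(query, passages)` returns the final answer. Both are
    hypothetical stand-ins for the real pipeline.
    """
    k = start_k
    while k <= max_k:
        passages, scores = retrieve(query, k=k)
        if scores and max(scores) >= threshold:
            return generate(query, passages)
        k *= 2  # retry with a broader candidate set
    return None  # caller falls back: web search, human escalation, etc.
```

Returning `None` rather than a low-confidence answer lets the caller choose the next recovery strategy explicitly.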

Key Takeaways

  • Distinguish retrieval failures (didn't find it) from generation failures (found it, answered wrong)
  • Common retrieval failures: vocabulary mismatch, bad chunking, domain mismatch, ambiguity
  • Common generation failures: ignoring context, lost in middle, hallucination, instruction failure
  • Systematic debugging: reproduce → check retrieval → classify → investigate → fix → add to tests
  • Monitor in production: retrieval scores, user feedback, citation accuracy, latency
  • Build recovery strategies: graceful degradation, fallbacks, human escalation
  • Every production failure is a test case waiting to be added