Context Optimization

How much to retrieve, what to include, and prompt engineering for RAG

The context you provide to the LLM determines answer quality. Too little context: missing information. Too much: confusion, cost, and potential hallucination. Finding the right balance is essential for effective RAG.

Context Window Constraints

Example budget for an 8,000-token window: system prompt (200) + retrieved context (2,000) + question (50) + response (1,000) = 3,250 tokens used, leaving a 4,750-token buffer.

LLMs have finite context windows (4K to 200K+ tokens). Within that window, you must fit:

  • System prompt: Instructions and guidelines (~100-500 tokens)
  • Retrieved context: Your passages (~1000-8000 tokens typical)
  • User question: The query (~20-200 tokens)
  • Response space: Room for the answer (~500-2000 tokens)

Common mistakes:

  • Stuffing too much context, leaving no room for response
  • Using entire context window when a fraction would suffice
  • Not accounting for response length in planning
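The budget arithmetic above can be sketched as a simple check. This is a minimal illustration with made-up function names and the example numbers from this section, not a real tokenizer; in practice you would count tokens with your model's tokenizer.

```python
def fits_window(system_tokens: int, context_tokens: int,
                question_tokens: int, response_tokens: int,
                window: int = 8000) -> bool:
    """Return True if all four components fit within the context window."""
    used = system_tokens + context_tokens + question_tokens + response_tokens
    return used <= window

def remaining_budget(system_tokens: int, question_tokens: int,
                     response_tokens: int, window: int = 8000) -> int:
    """Tokens left for retrieved context after reserving the other parts."""
    return window - (system_tokens + question_tokens + response_tokens)
```

Reserving response space up front (rather than filling the window with context) avoids the first two mistakes listed above.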

How Many Passages?

More is not always better:

For a moderate selection (about four passages, ~1,600 tokens), typical trade-offs look like:

  • Coverage (~90%): how much relevant info is included
  • Focus (~80%): signal-to-noise ratio
  • Cost (~1,600 tokens): token usage
  • Lost-in-the-middle risk (low): info ignored due to position

This is a good balance of coverage and focus.

Too few (1-2 passages):

  • May miss relevant information
  • Faster, cheaper
  • Higher risk of incomplete answers

Too many (10+ passages):

  • Diminishing returns on relevance
  • Higher cost (more tokens)
  • "Lost in the middle" effect: models neglect mid-context information
  • Potential confusion from contradictory passages

Sweet spot (3-5 passages): Often optimal. Enough coverage without overwhelming.

Calibrate empirically for your domain.
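One way to calibrate is to sweep the passage count k and measure answer quality against token cost. A minimal sketch of the selection and cost side, assuming passages arrive as (score, text) pairs and a rough per-passage token estimate (both assumptions, not from any library):

```python
def select_top_k(passages: list[tuple[float, str]], k: int) -> list[tuple[float, str]]:
    """Pick the k highest-scoring passages."""
    ranked = sorted(passages, key=lambda p: p[0], reverse=True)
    return ranked[:k]

def context_cost(selected: list[tuple[float, str]],
                 tokens_per_passage: int = 400) -> int:
    """Rough token cost, assuming ~400 tokens per passage."""
    return len(selected) * tokens_per_passage
```

Running your evaluation set at k = 1..10 and plotting quality against `context_cost` usually makes the sweet spot visible.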

Lost in the Middle

Research shows LLMs attend more to the beginning and end of context than the middle. Relevant information buried in the middle may be missed.

Implications:

  • Put most relevant passages first
  • Consider ending with important context too
  • Avoid long middle sections with critical information
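The implications above suggest a "sandwich" placement: most relevant passage first, second most relevant last, the rest in the middle. A minimal sketch, assuming (score, text) pairs; the function name is illustrative:

```python
def sandwich_order(passages: list[tuple[float, str]]) -> list[tuple[float, str]]:
    """Place the top passage first and the runner-up last, countering
    the lost-in-the-middle effect; remaining passages fill the middle."""
    ranked = sorted(passages, key=lambda p: p[0], reverse=True)
    if len(ranked) <= 2:
        return ranked
    return [ranked[0]] + ranked[2:] + [ranked[1]]
```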

Context Ordering Strategies

Most relevant first (e.g., passages ranked by similarity score: P1 0.95, P2 0.88, P3 0.82, P4 0.75) is the default choice: key info sits at the start, where attention is highest.

Relevance-ordered: Most similar passages first. Natural and usually effective.

Reverse-ordered: Least similar first, most similar last. Places key info at the end.

Interleaved: Alternate high/low relevance. Distributes importance.

Source-grouped: Group by document source. Good for multi-source answers.

Chronological: Order by date for time-sensitive queries.

Empirically, relevance-ordered with most important first works best for most cases.
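A few of these strategies can be sketched directly, assuming passages are dicts with "score" and "source" keys (an illustrative shape, not a standard API):

```python
def relevance_ordered(passages: list[dict]) -> list[dict]:
    """Most similar passages first (the usual default)."""
    return sorted(passages, key=lambda p: p["score"], reverse=True)

def reverse_ordered(passages: list[dict]) -> list[dict]:
    """Least similar first, so key info lands at the end of the context."""
    return sorted(passages, key=lambda p: p["score"])

def source_grouped(passages: list[dict]) -> list[dict]:
    """Group by document source, keeping relevance order within groups."""
    groups: dict[str, list[dict]] = {}
    for p in relevance_ordered(passages):
        groups.setdefault(p["source"], []).append(p)
    return [p for group in groups.values() for p in group]
```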

Prompt Templates

A basic grounded template:

Use ONLY the following context to answer. If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}

Answer:

The grounding instruction reduces hallucination by telling the model to admit uncertainty.

Effective RAG prompts include:

Clear role definition:

You are a technical documentation assistant.

Context delineation:

Use ONLY the following context to answer:
---
{context}
---

Grounding instructions:

If the answer is not in the context, say "I don't have that information."
Do not make up information.

Citation guidance:

Cite sources using [Source: title] format.

Response format:

Provide a concise answer in 2-3 sentences.
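Combining the components above into one prompt is straightforward string assembly. A minimal sketch; the template wording mirrors the examples in this section, and the function name is illustrative:

```python
TEMPLATE = """You are a technical documentation assistant.

Use ONLY the following context to answer. If the answer is not in the
context, say "I don't have that information." Do not make up information.
Cite sources using [Source: title] format.

Context:
---
{context}
---

Question: {question}

Answer (2-3 sentences):"""

def build_prompt(passages: list[tuple[str, str]], question: str) -> str:
    """passages: list of (title, text) pairs, already retrieved and ordered."""
    context = "\n\n".join(f'From "{title}":\n{text}' for title, text in passages)
    return TEMPLATE.format(context=context, question=question)
```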

Passage Formatting

How you format passages matters:

Numbered passages:

[1] First passage text here...
[2] Second passage text here...

Enables citation by number.

Source-labeled:

From "User Guide v2.1":
Passage text here...
 
From "API Reference":
More passage text...

Good for transparency.

XML-style:

<passage source="doc1.pdf" page="5">
Passage text here...
</passage>

Structured, machine-readable.
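Two of these formats as small helpers, assuming simple inputs (a list of strings for numbering; dicts with "source", "page", and "text" keys for the XML style):

```python
def numbered(passages: list[str]) -> str:
    """Format passages as [1], [2], ... so the model can cite by number."""
    return "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, 1))

def xml_style(passages: list[dict]) -> str:
    """Wrap each passage in a <passage> tag carrying source metadata."""
    return "\n".join(
        f'<passage source="{p["source"]}" page="{p["page"]}">\n{p["text"]}\n</passage>'
        for p in passages
    )
```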

Context Compression

When context is too long, compress it:

Extractive summarization: Pull key sentences from each passage.

Abstractive summarization: Rewrite passages to be shorter while preserving meaning.

Query-focused extraction: Keep only sentences relevant to the query.

Progressive summarization: Summarize long documents in stages.

Trade-off: Compression may lose nuance. Use when necessary, not by default.
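Query-focused extraction is the simplest of these to sketch. This version uses a crude word-overlap heuristic (an assumption for illustration; production systems often score sentences with embedding similarity instead):

```python
import re

def query_focused_extract(passage: str, query: str, min_overlap: int = 1) -> str:
    """Keep only sentences sharing at least min_overlap words with the query."""
    query_words = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", passage)
    kept = [s for s in sentences
            if len(set(re.findall(r"\w+", s.lower())) & query_words) >= min_overlap]
    return " ".join(kept)
```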

Dynamic Context Selection

Adapt context to the query:

Confidence-based cutoff: Include passages above similarity threshold.

Marginal relevance: Diversify context to cover different aspects.

Query-type aware: Different query types need different context amounts.

Iterative retrieval: Start with less, retrieve more if initial answer is uncertain.
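A confidence-based cutoff is a one-liner in spirit: keep only passages above a similarity threshold, capped at a maximum count. A minimal sketch with illustrative defaults, assuming (score, text) pairs:

```python
def cutoff_select(passages: list[tuple[float, str]],
                  threshold: float = 0.75, max_k: int = 5) -> list[tuple[float, str]]:
    """Keep passages scoring at or above the threshold, best first, capped at max_k."""
    above = [p for p in passages if p[0] >= threshold]
    above.sort(key=lambda p: p[0], reverse=True)
    return above[:max_k]
```

With iterative retrieval, you might call this with a low `max_k` first and raise it only if the initial answer expresses uncertainty.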

Cost Optimization

Context tokens cost money:

Input tokens are usually 10-50% cheaper than output tokens, but far more numerous: 3,000 context tokens + 500 output tokens = 3,500 total.

Strategies:

  • Use shorter context for simple queries
  • Cache common contexts
  • Summarize verbose passages
  • Truncate passages to relevant sections
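The per-request arithmetic is worth making explicit. The prices below are placeholders chosen so input is cheaper than output, consistent with the range above; substitute your provider's actual rates:

```python
def request_cost(context_tokens: int, output_tokens: int,
                 input_price_per_1k: float = 0.001,
                 output_price_per_1k: float = 0.0015) -> float:
    """Dollar cost of one request: per-1k prices times token counts."""
    return (context_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k
```

At these placeholder rates, 3,000 context tokens plus 500 output tokens cost $0.00375; halving the context (e.g., via summarization or truncation) saves 40% of that.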

Key Takeaways

  • Balance context size: too little misses info, too much confuses and costs
  • Account for response space when planning context window usage
  • "Lost in the middle" effect: put important passages first (or last)
  • Use clear formatting: numbered passages or source labels for citations
  • Include grounding instructions: tell the model to admit uncertainty
  • Compress context when necessary but preserve essential information
  • Typical sweet spot: 3-5 well-chosen passages