Evaluation Metrics

Precision, recall, MRR, NDCG: measuring retrieval quality with worked examples

You cannot improve what you do not measure. Retrieval evaluation quantifies how well your system finds relevant documents. Different metrics capture different aspects of quality: did we find everything? Are results ranked correctly? Is the first result good?

Precision and Recall

The fundamental trade-off in retrieval.

[Interactive visualization: retrieving 8 documents and finding 5 of the 10 relevant ones gives precision 5/8 = 62.5% and recall 5/10 = 50%; results are color-coded as relevant & retrieved, false positives, and missed (false negatives).]

Precision: Of the documents you retrieved, what fraction are relevant?

$$\text{Precision} = \frac{\text{Relevant Retrieved}}{\text{Total Retrieved}}$$

Recall: Of all relevant documents, what fraction did you retrieve?

$$\text{Recall} = \frac{\text{Relevant Retrieved}}{\text{Total Relevant}}$$

Precision@k: Precision computed on top k results only.

Recall@k: Recall computed on top k results only.

Trade-off: Retrieving more documents increases recall but often decreases precision; retrieving fewer increases precision but misses relevant documents.

Worked Example: Precision and Recall

Query: "Python error handling"

  • Total relevant documents in corpus: 10
  • You retrieve 8 documents
  • Of those 8, 5 are relevant

$$\text{Precision} = \frac{5}{8} = 0.625 = 62.5\%$$

$$\text{Recall} = \frac{5}{10} = 0.50 = 50\%$$

You found half the relevant documents (moderate recall), and most of what you found was relevant (decent precision).
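The arithmetic above is easy to check in a few lines of Python. The document IDs here are invented purely for illustration:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from retrieved doc IDs and the relevant set."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved)  # fraction of retrieved docs that are relevant
    recall = hits / len(relevant)      # fraction of relevant docs that were retrieved
    return precision, recall

# 10 relevant docs in the corpus; we retrieve 8, of which 5 are relevant.
relevant = {f"rel{i}" for i in range(10)}
retrieved = ["rel0", "junk0", "rel1", "rel2", "junk1", "rel3", "junk2", "rel4"]
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.625 0.5
```

For Precision@k and Recall@k, pass `retrieved[:k]` instead of the full result list.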

Mean Reciprocal Rank (MRR)

When you care most about the first relevant result.

Interactive: MRR calculator

| Query | First Relevant Rank | Reciprocal Rank |
| --- | --- | --- |
| Python error handling | 1 | 1.000 |
| Machine learning basics | 3 | 0.333 |
| REST API design | 2 | 0.500 |

Mean Reciprocal Rank = (1.000 + 0.333 + 0.500) / 3 ≈ 0.611

For each query, find the rank of the first relevant document. The reciprocal rank is 1/rank.

$$\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}$$

Example:

  • Query 1: First relevant at rank 1 → RR = 1/1 = 1.0
  • Query 2: First relevant at rank 3 → RR = 1/3 = 0.33
  • Query 3: First relevant at rank 2 → RR = 1/2 = 0.5

$$\text{MRR} = \frac{1.0 + 0.33 + 0.5}{3} \approx 0.61$$

Use MRR when: Users typically click the first good result (search engines, Q&A).
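The three-query example above can be sketched as a small function (a minimal implementation, not a specific library's API):

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR from the 1-based rank of the first relevant result per query.

    Use None for queries where no relevant document was retrieved (RR = 0).
    """
    rr = [1 / rank if rank is not None else 0.0 for rank in first_relevant_ranks]
    return sum(rr) / len(rr)

# First relevant result at ranks 1, 3, and 2 for the three example queries.
print(round(mean_reciprocal_rank([1, 3, 2]), 3))  # 0.611
```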

Normalized Discounted Cumulative Gain (NDCG)

When you care about the ranking quality of all results.

[Interactive NDCG calculator: for relevance scores [3, 2, 0, 1, 2] in ranked order, DCG@5 = 10.48, IDCG@5 = 10.82, and NDCG@5 = 10.48 / 10.82 ≈ 0.969, using DCG = Σ (2^rel − 1) / log₂(rank + 1).]

NDCG accounts for:

  1. Graded relevance (not just relevant/not)
  2. Position (earlier is better)
  3. Normalization (compare across queries)

DCG (Discounted Cumulative Gain):

$$\text{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

Gain from each result is discounted by log of position.

NDCG:

$$\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}$$

Normalize by the ideal DCG (perfect ranking).

Worked Example: NDCG

Results with relevance scores [3, 2, 0, 1, 2] (3 = highly relevant):

DCG@5:

$$\text{DCG@5} = \frac{2^3-1}{\log_2 2} + \frac{2^2-1}{\log_2 3} + \frac{2^0-1}{\log_2 4} + \frac{2^1-1}{\log_2 5} + \frac{2^2-1}{\log_2 6} = 7 + 1.89 + 0 + 0.43 + 1.16 = 10.48$$

Ideal ranking [3, 2, 2, 1, 0] (note the discounts follow the ideal positions, so the second rel = 2 lands at rank 3):

$$\text{IDCG@5} = \frac{2^3-1}{\log_2 2} + \frac{2^2-1}{\log_2 3} + \frac{2^2-1}{\log_2 4} + \frac{2^1-1}{\log_2 5} + \frac{2^0-1}{\log_2 6} = 7 + 1.89 + 1.50 + 0.43 + 0 = 10.82$$

$$\text{NDCG@5} = \frac{10.48}{10.82} \approx 0.97$$

The ranking is close to ideal but not perfect: the irrelevant document (rel = 0) at rank 3 sits above a rel = 2 and a rel = 1 document.
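The same computation, as a short sketch using the exponential-gain formulation from the formulas above:

```python
import math

def dcg(relevances):
    """DCG with the 2^rel - 1 gain and log2(rank + 1) discount (ranks are 1-based)."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize DCG by the ideal DCG: the same scores sorted best-first."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

scores = [3, 2, 0, 1, 2]
print(round(dcg(scores), 2))   # 10.48
print(round(ndcg(scores), 3))  # 0.969
```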

Which Metric to Use?


| Metric | Use When | Captures |
| --- | --- | --- |
| Precision@k | You show a fixed k results | Quality of what's shown |
| Recall@k | Missing documents is costly | Coverage of relevant docs |
| MRR | First result matters most | Quality of top result |
| NDCG@k | Ranking order matters | Graded ranking quality |

For semantic search: NDCG and Recall are typically most important. You want high recall (find everything relevant) with good ranking (best first).

For RAG: Recall@k matters most. If the LLM sees the relevant passages, it can answer. Ranking within context matters less.

Evaluation Best Practices

1. Create a test set: Queries with labeled relevant documents. 50-100 queries minimum.

2. Use multiple metrics: No single metric tells the whole story.

3. Segment by query type: Performance may vary by category, length, or difficulty.

4. Compare to baseline: Measure improvement relative to previous system or simple baseline.

5. Statistical significance: Improvements should be significant, not random variation.

6. Regular evaluation: Re-evaluate as data and queries evolve.
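A minimal harness tying these practices together might look like the sketch below. The `(query, relevant_ids)` test-set format and the `search_fn` callable are assumptions for illustration, not a specific library's API:

```python
def evaluate(test_set, search_fn, k=10):
    """Average Precision@k, Recall@k, and MRR over (query, relevant_ids) pairs."""
    totals = {"precision@k": 0.0, "recall@k": 0.0, "mrr": 0.0}
    for query, relevant in test_set:
        retrieved = search_fn(query)[:k]
        hits = [doc for doc in retrieved if doc in relevant]
        totals["precision@k"] += len(hits) / len(retrieved) if retrieved else 0.0
        totals["recall@k"] += len(hits) / len(relevant)
        # 1-based rank of the first relevant result, or None if absent.
        first = next((i + 1 for i, doc in enumerate(retrieved) if doc in relevant), None)
        totals["mrr"] += 1 / first if first else 0.0
    return {name: value / len(test_set) for name, value in totals.items()}

# Toy test set and retriever, purely for illustration.
test_set = [("q1", {"a", "b"}), ("q2", {"c"})]
fake_results = {"q1": ["a", "x", "b"], "q2": ["y", "c"]}
print(evaluate(test_set, fake_results.get, k=3))
```

Reporting several metrics at once makes it harder to fool yourself by optimizing one number at the expense of the others.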

Common Pitfalls

Incomplete labels: If you only label top-retrieved docs, recall is artificially inflated.

Position bias in labels: Labelers rate higher-ranked docs as more relevant.

Query set bias: Test queries not representative of production traffic.

Metric gaming: Optimizing metric without improving user experience.

Key Takeaways

  • Precision measures quality of retrieved results; Recall measures coverage of relevant documents
  • MRR focuses on the first relevant result—use for single-answer scenarios
  • NDCG measures graded ranking quality—use when result order matters
  • No single metric tells the whole story; use multiple metrics
  • For RAG, Recall@k is typically most important (LLM needs to see relevant passages)
  • Build a labeled test set and evaluate regularly
  • Watch for evaluation pitfalls: incomplete labels, bias, unrepresentative queries