How to Improve RAG Answers With Reranking

Coding Liquids blog cover featuring Sagnik Bhattacharya for improving RAG answers with reranking.

Vector search is fast but imprecise. It finds documents that are semantically similar to the query, but 'similar' and 'relevant' are not the same thing. Reranking adds a second, more precise step that reorders the results by actual relevance.

This guide shows how to add reranking to an existing RAG pipeline and explains when it helps most.


Quick answer

After vector search retrieves candidate chunks (e.g., the top 20), pass them through a cross-encoder reranker that scores each chunk against the query. Take the top 3-5 reranked results for generation. This typically improves answer quality by 10-30%, though the gain depends heavily on your data and queries.

Reranking is the right fix when:

  • Your RAG system retrieves plausible but wrong chunks.
  • Vector search returns results that are similar in topic but not relevant to the specific question.
  • You have tried improving chunking and embeddings but still get mediocre answers.

Why vector search is not enough

Vector search using embeddings compresses entire passages into fixed-length vectors. This compression loses nuance — two passages about the same topic get similar vectors even if one answers the question and the other does not.

Reranking uses a cross-encoder that reads the query and each candidate passage together, producing a much more accurate relevance score. It is slower but more precise.
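The architectural difference can be sketched with placeholder scoring functions (real bi- and cross-encoders are neural networks; `embed` and `score_pair` here are stand-ins, not real model APIs):

```python
# Bi-encoder: each text becomes a vector independently; relevance is
# reduced to a geometric comparison of two compressed representations.
def bi_encoder_score(embed, query, passage):
    q, p = embed(query), embed(passage)   # two separate forward passes
    dot = sum(a * b for a, b in zip(q, p))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(q) * norm(p))      # cosine similarity

# Cross-encoder: one forward pass over the concatenated pair, so the
# model can attend to interactions between query and passage tokens
# before producing a single relevance score.
def cross_encoder_score(score_pair, query, passage):
    return score_pair(query + " [SEP] " + passage)
```

The key point is where the information loss happens: the bi-encoder commits to a fixed vector before it knows the question, while the cross-encoder sees the question and the passage at the same time.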


How reranking works

The pipeline becomes: embed query → vector search (retrieve top 20-50) → rerank (score each against query) → take top 3-5 → generate answer.

The reranker is a cross-encoder model that takes (query, passage) pairs and outputs a relevance score. Unlike bi-encoders used in vector search, cross-encoders see both texts together and can capture fine-grained relevance.

  • Retrieve a larger initial set from vector search (20-50 candidates)
  • Score each candidate with the cross-encoder reranker
  • Sort by reranker score and take the top 3-5
  • Pass the reranked results to the generation model
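The steps above can be sketched as one pipeline function. Here `retriever`, `reranker`, and `generate` are placeholders you would wire to your own vector store, cross-encoder, and LLM call:

```python
def answer(query, retriever, reranker, generate, retrieve_k=20, keep_k=5):
    # Step 1: cheap, broad recall from the vector index
    candidates = retriever(query, top_k=retrieve_k)
    # Step 2: precise scoring of each (query, passage) pair
    scores = reranker([(query, c["text"]) for c in candidates])
    # Step 3: keep only the best few for the prompt
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    context = [c for c, _ in ranked[:keep_k]]
    # Step 4: generate an answer grounded on the reranked context
    return generate(query, context)
```

Keeping retrieval, reranking, and generation behind separate function boundaries also makes it easy to A/B test the pipeline with and without the reranking step.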

Choosing a reranker

For local use, cross-encoder models from sentence-transformers work well (e.g., `cross-encoder/ms-marco-MiniLM-L-6-v2`). For cloud use, Cohere Rerank and Jina Reranker are popular hosted options.

The reranker does not need to be large. A small cross-encoder model (50-100MB) can dramatically improve results because it is doing a different, more focused task than the embedding model.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# After vector search returns candidates (vector_search stands in for
# your own retrieval function; each candidate is a dict with a "text" key)
candidates = vector_search(query, top_k=20)

# Rerank
pairs = [(query, chunk["text"]) for chunk in candidates]
scores = reranker.predict(pairs)

# Sort by score and take top 5
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
top_chunks = [chunk for chunk, score in ranked[:5]]
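The reranked chunks then feed the generation step. One minimal way to assemble them into a prompt (the template here is illustrative, not a prescribed format):

```python
def build_prompt(query, top_chunks):
    # Join the reranked chunks into a single context block
    context = "\n\n".join(c["text"] for c in top_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

Because the reranker has already filtered for relevance, the prompt stays short and the generation model spends its attention on passages that actually answer the question.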

When reranking helps most

Reranking helps most when your document collection is large, when queries are specific, and when the embedding model produces many 'near-miss' results that are topically related but not specifically relevant.

If your RAG system already retrieves the right chunks most of the time, reranking will not help much. Test retrieval quality first before adding reranking.
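Testing retrieval quality first can be as simple as measuring hit rate over a small labeled set of (query, relevant document id) pairs. A minimal sketch, where `retrieve` is whatever retrieval function you are evaluating:

```python
def hit_rate(eval_set, retrieve, k=3):
    """Fraction of queries whose known-relevant doc appears in the top k."""
    hits = 0
    for query, relevant_id in eval_set:
        top = retrieve(query, top_k=k)
        if relevant_id in [d["id"] for d in top]:
            hits += 1
    return hits / len(eval_set)
```

Run this once on plain vector search and once on the reranked pipeline; if the plain number is already high, reranking is unlikely to be the bottleneck.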

Performance considerations

Reranking adds latency — typically 50-200ms for 20 candidates with a small cross-encoder. This is acceptable for most applications but matters for real-time chat.

If latency is critical, reduce the number of candidates passed to the reranker (10 instead of 50) or use a faster reranker model.
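Before tuning, measure the actual cost on your hardware. A small helper that times one rerank call (it assumes a reranker object with a `predict` method taking (query, passage) pairs, as the sentence-transformers `CrossEncoder` does):

```python
import time

def time_rerank(reranker, query, candidates):
    # Time a single rerank pass over the candidate texts
    start = time.perf_counter()
    scores = reranker.predict([(query, text) for text in candidates])
    elapsed_ms = (time.perf_counter() - start) * 1000
    return scores, elapsed_ms
```

Timing with 10, 20, and 50 candidates shows how latency scales for your model and lets you pick the largest candidate count your latency budget allows.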

Worked example: improving a knowledge base RAG system

A company knowledge base RAG system retrieves the right document only 60% of the time. After adding a cross-encoder reranker that rescores the top 20 vector search results, the correct document appears in the top 3 results 85% of the time. Answer quality improves noticeably because the generation model now has better context.

Common mistakes

  • Adding reranking before testing whether retrieval is the actual bottleneck.
  • Using too few initial candidates (reranking 5 results adds little value — you need 20+).
  • Not considering the latency cost of reranking in real-time applications.

When to use something else

If your problem is chunking quality rather than retrieval precision, see chunking documents for RAG. For building the RAG pipeline from scratch, see building a RAG app.

How to apply this in a real AI project

Reranking becomes much more useful once it is tied to the rest of the workflow around it. In real work, the result depends on model selection, prompt design, tool integration, evaluation, and the operational reality of shipping AI features, not only on following one local tip correctly.

That is why the biggest win rarely comes from one clever move in isolation. It comes from making the surrounding process easier to review, easier to repeat, and easier to hand over when another person inherits the workbook or codebase later.

  • Test with realistic inputs before shipping, not just the examples that inspired the idea.
  • Keep the human review step visible so the workflow stays trustworthy as it scales.
  • Measure what matters for your use case instead of relying on general benchmarks.

How to extend the workflow after this guide

Once the core technique works, the next leverage usually comes from standardising it. That might mean naming inputs more clearly, keeping one review checklist, or pairing this page with neighbouring guides so the process becomes repeatable rather than person-dependent.

The follow-on guides below are the most natural next steps from this guide. They help move the reader from one useful page into a stronger connected system.

Related guides on this site

These guides cover RAG pipeline building, chunking strategies, and output evaluation.

Want to use AI tools more effectively?

My courses cover practical AI workflows, from spreadsheet automation to app development, with real projects and honest tool comparisons.

Browse AI courses