RAG Is 80% Search Engineering and 20% LLM

09 Jun 2026

Most “RAG systems” I get shown are a vector database, an LLM, and a bit of hope. They demo beautifully. You ask a question, the right answer comes back, everyone nods. Then real users turn up and you find out retrieval was the whole game all along.

I have spent a lot of the last few years building and debugging retrieval, and I have come round to a blunt view. Retrieval-Augmented Generation is sold as plug-and-play, but in production it is one of the most underestimated systems in modern AI. Here is where it actually breaks.

Retrieval is the bottleneck, not generation

This is the bit that surprises people who have only seen the demo.

If the wrong passages come back, the smartest model on the planet will summarise the wrong thing. Fluently, confidently, in a tone that makes everyone believe it. The generation step is rarely where things go wrong. Almost every RAG failure I have debugged was actually a search failure wearing an LLM costume.

So when your RAG “mostly works” but occasionally says something confidently wrong, the instinct is to blame the model or tweak the prompt. Usually the real problem is upstream. The model was handed bad context and did exactly what it was told.
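One way to check where a bad answer came from is to log what was retrieved and compare it with the passages a human says contain the answer. The sketch below assumes a hypothetical `Trace` record with hand-labelled fields; none of these names come from a real library, and the labels have to come from someone actually reading the results.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """One logged RAG request (hypothetical log shape, for illustration)."""
    question: str
    retrieved_ids: list[str]  # passage ids handed to the model, in rank order
    relevant_ids: set[str]    # passage ids a human marked as containing the answer
    answer_correct: bool      # human judgement of the final answer


def diagnose(trace: Trace) -> str:
    """Label a request by where it went wrong, if anywhere."""
    if trace.answer_correct:
        return "ok"
    # None of the passages known to hold the answer reached the model:
    # this is a search failure, whatever the answer sounded like.
    if not trace.relevant_ids.intersection(trace.retrieved_ids):
        return "retrieval_failure"
    # The right passage was in the context and the answer was still wrong.
    return "generation_failure"
```

Running this over a few dozen labelled failures gives a rough split between the two buckets, which is usually enough to decide whether to touch the prompt or the retriever first.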

Chunking is a design decision, not a default

How you split your documents quietly decides how good your retrieval can ever be.

Split too small and you shred the context. A chunk that is half a sentence retrieves well on keywords but means nothing on its own. Split too big and you bury the signal. The one relevant line is now drowning in three paragraphs of surrounding text, and the embedding averages out into mush.

There is no universal chunk size, despite what the tutorial said. The right size depends on your content (legal documents and chat logs are not the same) and on the shape of your queries. You find it by looking at real failures, not by copying a number from a blog post.
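To make the trade-off concrete, here is a minimal sketch of a chunker that packs whole sentences up to a size limit and repeats a little context between chunks. The function name, the `max_chars` and `overlap` parameters, and the naive regex sentence splitter are all assumptions for illustration, and the defaults are placeholders to tune against real failures, not recommendations.

```python
import re


def chunk_sentences(text: str, max_chars: int = 500, overlap: int = 1) -> list[str]:
    """Pack whole sentences into chunks that aim to stay under max_chars,
    repeating the last `overlap` sentences at the start of the next chunk.
    A single sentence longer than max_chars is still emitted whole."""
    # Naive split on ., ! or ? followed by whitespace; real content
    # (abbreviations, code, tables) needs a better splitter.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            carry = current[-overlap:] if overlap > 0 else []
            # Drop the overlap if it would push the new chunk over the limit.
            if len(" ".join(carry + [sentence])) > max_chars:
                carry = []
            current = carry
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Changing `max_chars` and `overlap` and re-running the same set of failing queries is a cheap way to see how much of a retrieval problem is really a chunking problem.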

“Closest vector” is not “most useful passage”

Cosine similarity is a starting point, not an answer.

The nearest vector is often not the most useful passage for the question. Two pieces of text can be semantically close and still unhelpful, and a keyword-exact match can be more valuable than anything the embeddings surface. Production retrieval almost always ends up needing more than raw vector search:

Reranking, to reorder the top candidates with a model that looks at the query and passage together.
Metadata filters, so you are searching the right subset before you ever compare vectors.
Hybrid search, keyword and semantic together, because each catches what the other misses.
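The three pieces above can be sketched together in a few lines. This is an illustration under stated assumptions, not how any particular engine does it: the keyword and vector rankings are taken as given lists of ids, the fusion method is reciprocal rank fusion (one common choice, picked here for simplicity), and `score_fn` is a hypothetical stand-in for a reranking model that scores a query and passage together. All function names are made up for the example.

```python
from collections import defaultdict
from typing import Callable


def filter_by_metadata(docs: dict[str, dict], **required: object) -> set[str]:
    """Ids of documents whose metadata matches every required key/value."""
    return {
        doc_id for doc_id, meta in docs.items()
        if all(meta.get(key) == value for key, value in required.items())
    }


def reciprocal_rank_fusion(rankings: list[list[str]], allowed: set[str],
                           k: int = 60) -> list[str]:
    """Merge ranked id lists. Each list adds 1 / (k + rank) to a document's
    score, with ranks counted among the allowed documents only."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        rank = 0
        for doc_id in ranking:
            if doc_id not in allowed:
                continue  # outside the metadata subset
            rank += 1
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def rerank(query: str, candidates: list[str], texts: dict[str, str],
           score_fn: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Reorder candidates by a score computed on query and passage together."""
    ranked = sorted(candidates, key=lambda doc_id: score_fn(query, texts[doc_id]),
                    reverse=True)
    return ranked[:top_n]
```

In a real system the metadata filter would normally be pushed down into the keyword and vector queries themselves, so the subset is narrowed before any vectors are compared. It is applied to the ranked lists here only to keep the sketch self-contained.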

Similarity logic likes to sprawl

One more from real life. On an older search stack I once went looking for how “find similar” was implemented, and found the same idea written five slightly different ways across the codebase. Each had drifted. Each returned slightly different results for what users thought was a single feature.

This is worth watching for, because retrieval logic leaks. The moment “find similar” or “related items” matters to your product, it tends to get reimplemented wherever someone needed it, and now you are maintaining several subtly different search behaviours and wondering why results feel inconsistent. Give it one home and guard it.
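As a rough picture of what "one home" can mean, here is a sketch of a single shared function that every caller goes through. It assumes item vectors are already available as plain lists in a dictionary; `find_similar`, `cosine`, and the `top_k` default are hypothetical names and values, not taken from the stack described above.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(item_id: str, vectors: dict[str, list[float]],
                 top_k: int = 5) -> list[str]:
    """The single shared definition of "similar": cosine over stored vectors,
    excluding the item itself, ties broken by id so results are stable."""
    target = vectors[item_id]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != item_id]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_k]]
```

Guarding it can be as simple as a small regression test with a handful of known items and their expected neighbours, so any change to the definition is deliberate and visible.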

It is a loop, not a launch

You do not ship RAG and walk away. You ship it and start watching.

Which questions retrieve nothing useful? Where does the model hedge or hallucinate because the context was thin? Which sources never get retrieved even though they should? That feedback is the work. You tune chunking, adjust filters, add reranking, improve the underlying data, and you keep doing it. The system is never finished, because the way people ask questions keeps surprising you.
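A first pass at answering those questions can come straight from retrieval logs. The sketch below assumes a hypothetical log shape (a query plus a list of `(source_id, score)` hits) and an arbitrary `min_score` cut-off; both are placeholders to adapt to whatever your retriever actually records.

```python
from collections import Counter


def retrieval_report(logs: list[dict], all_sources: set[str],
                     min_score: float = 0.5) -> dict:
    """Summarise retrieval logs.

    Each entry is assumed to look like:
        {"query": str, "hits": [(source_id, score), ...]}
    """
    thin_queries: list[str] = []
    seen: Counter = Counter()
    for entry in logs:
        strong = [src for src, score in entry["hits"] if score >= min_score]
        if not strong:
            # Nothing scored above the cut-off: the model got thin context.
            thin_queries.append(entry["query"])
        seen.update(src for src, _ in entry["hits"])
    return {
        "thin_queries": thin_queries,
        # Candidates for "should have been retrieved but never was".
        "never_retrieved": sorted(all_sources - set(seen)),
        "most_retrieved": seen.most_common(3),
    }
```

The report only points at where to look. Someone still has to read the thin queries and the never-retrieved sources to decide whether the fix is chunking, filters, reranking, or the data itself.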

The honest split

If I had to put a number on it, RAG is something like 80% search engineering and 20% LLM. The model matters, but it is the easy, mostly-solved part. The hard, durable engineering is in retrieval: ingestion, chunking, metadata, hybrid search, reranking, and the feedback loop that keeps it honest.

Teams that treat RAG as an LLM problem ship impressive demos. Teams that treat it as a retrieval problem ship things people actually rely on.

So if your RAG mostly works and you want it to fully work, do not start with the model. Go and look at what is being retrieved. I would bet the gap is there.

Abdul Qabiz
