RAG Is Not Magic. Here Is How to Build It Right.
Retrieval-augmented generation is the most useful pattern in applied LLM work. It is also the one most often built badly.
Why everyone defaults to RAG
Before retrieval-augmented generation became the default pattern, LLM apps had a hard problem. Models trained on last year's data did not know about this year's products, policies or contracts. Even when they had seen a document during training, they could not cite it. They had generalized over it, not memorized it.
RAG fixes that by changing the architecture. You retrieve the relevant documents first, then ask the model to reason over them. The model becomes an interpreter. Not a memorizer.
It is a great pattern. It is also implemented badly more often than it is implemented well.
The four ways most teams break RAG
1. Treating embeddings as a solved problem
Most tutorials walk you through embedding a corpus, dumping it in a vector database and pulling the top-k nearest neighbors. That works in tutorials. In production, the chunk you retrieve is often not the chunk you need.
Why? Because relevance and proximity are different things. A query about refund policy can match a document about refund history. Same words. Different meaning. Naive similarity search finds documents that look like the query, not documents that answer it.
Fix: combine dense retrieval (vector search) with sparse retrieval (BM25 or keyword). Then put a reranker on top. Hybrid search plus a cross-encoder reranker is the floor for any serious RAG system. Not the ceiling.
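Here is what that floor can look like in code. A minimal sketch, assuming the sentence-transformers and rank_bm25 packages; the model names, the 50/50 blend weights, and the three-document corpus are illustrative placeholders, not a prescribed stack.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder

docs = [
    "Refunds are issued within 14 days of purchase for annual plans.",
    "Refund history for your account is listed under Billing > Invoices.",
    "API keys can be rotated from the settings page without downtime.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)
bm25 = BM25Okapi([d.lower().split() for d in docs])

def retrieve(query: str, candidates: int = 50, final_k: int = 4) -> list[str]:
    # Dense scores: vectors are L2-normalized, so dot product = cosine similarity.
    dense = doc_vecs @ embedder.encode(query, normalize_embeddings=True)
    # Sparse scores: BM25 over whitespace tokens catches exact-term matches.
    sparse = np.asarray(bm25.get_scores(query.lower().split()))
    # Min-max normalize each signal before blending so neither dominates.
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-9)
    blended = 0.5 * norm(dense) + 0.5 * norm(sparse)
    pool = np.argsort(blended)[::-1][:candidates]
    # The cross-encoder reads each (query, doc) pair jointly: slower, far sharper.
    ce_scores = reranker.predict([(query, docs[i]) for i in pool])
    order = pool[np.argsort(ce_scores)[::-1]]
    return [docs[i] for i in order[:final_k]]
```

The blend weights and candidate pool size are exactly the kind of knobs your eval set (more on that below) should be tuning.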
2. Ignoring chunk boundaries
Documents get split into chunks before they get embedded. Where you split matters more than people realize. A sentence cut mid-clause. A table broken across two chunks. A heading separated from the paragraph it introduces. Each one quietly destroys the semantics that retrieval depends on.
Fix: chunk semantically, not by character count. Parse PDFs as structured documents. Respect headings, tables, lists. The boring infra work pays back compounding interest in answer quality.
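To make the principle concrete, here is a minimal sketch of structure-aware chunking for markdown-like text. The max_chars budget and the heading regex are assumptions; a real pipeline would sit behind a proper PDF or HTML parser, but the splitting logic is the same idea.

```python
import re

def chunk_by_heading(text: str, max_chars: int = 1200) -> list[str]:
    # Split on markdown headings, keeping each heading attached to its body.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: fall back to paragraph boundaries,
        # never a mid-sentence character cut.
        heading, _, body = section.partition("\n")
        buf = heading
        for para in body.split("\n\n"):
            if len(buf) + len(para) > max_chars and buf != heading:
                chunks.append(buf)
                buf = heading  # repeat the heading so context survives the split
            buf += "\n\n" + para
        chunks.append(buf.strip())
    return chunks
```

The detail that pays off most is the repeated heading: when a long section splits, every resulting chunk still knows what it is about.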
3. No eval pipeline
You cannot improve what you cannot measure. Most RAG implementations are built, tested manually on a handful of queries and then deployed. They work for those queries. Everything else is hope.
Fix: build an eval set before you ship. Define what "correct" looks like for the use case. Run it on every retrieval-parameter change, every model swap, every prompt edit. RAG quality silently degrades when an upstream model updates. Without evals you will find out from a customer.
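A sketch of the smallest useful version: gold queries mapped to the chunk IDs that should come back, scored with recall@k and gated on a threshold. The dataset shape, the 0.9 threshold, and the retrieve_fn interface are all assumptions to adapt to your own pipeline.

```python
# Gold queries mapped to the chunk IDs that should be retrieved for them.
EVAL_SET = [
    {"query": "What is the refund window?", "relevant_ids": {"policy-refunds-01"}},
    {"query": "How do I rotate an API key?", "relevant_ids": {"docs-keys-03"}},
]

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 4) -> float:
    # Fraction of the relevant chunks that appear in the top k results.
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def run_eval(retrieve_fn, k: int = 4, threshold: float = 0.9) -> bool:
    scores = [
        recall_at_k(retrieve_fn(case["query"]), case["relevant_ids"], k)
        for case in EVAL_SET
    ]
    mean = sum(scores) / len(scores)
    print(f"recall@{k}: {mean:.2f} over {len(scores)} cases")
    return mean >= threshold  # gate the deploy on this
```

Wire run_eval into CI so a failing score blocks the deploy instead of reaching a customer.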
4. Stuffing the context window
Sending twenty retrieved chunks does not make the model more accurate. It makes it less accurate. Models lose the thread inside long contexts. They start latching onto whatever sits near the start or end of the prompt and ignoring the middle. This "lost in the middle" effect is documented behavior, not a hunch.
Fix: retrieve aggressively, then rerank ruthlessly. Four high-quality chunks beat sixteen mediocre ones every time.
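If the reranker from the earlier sketch hands back scored chunks, the trim step is a few lines. Both cutoffs below are illustrative defaults you would tune against the eval set, not magic numbers.

```python
def trim(scored_chunks: list[tuple[str, float]],
         max_k: int = 4, min_score: float = 0.2) -> list[str]:
    # Cap the count AND drop anything below an absolute relevance score:
    # an irrelevant chunk in the prompt is worse than an empty slot.
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, score in ranked[:max_k] if score >= min_score]
```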
What a real production RAG pipeline looks like
At WTM, a production RAG pipeline has six layers:
- Ingestion. Structured parsing, semantic chunking, metadata extraction.
- Indexing. Vector embeddings plus keyword index (hybrid).
- Retrieval. Hybrid search plus a cross-encoder reranker.
- Generation. Grounded prompt, citation enforcement, refusal logic (a prompt sketch follows this list).
- Evaluation. Automated eval suite, human spot checks.
- Observability. Query logs, retrieval logs, sampled quality scores.
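The generation layer deserves one concrete illustration, since "grounded prompt" hides most of the work. A minimal sketch with citation enforcement and an explicit refusal path; the prompt wording and chunk format are illustrative, not a fixed recipe.

```python
def build_grounded_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    # chunks are (chunk_id, text) pairs handed over by the retrieval layer
    context = "\n\n".join(f"[{cid}]\n{text}" for cid, text in chunks)
    return (
        "Answer the question using ONLY the sources below.\n"
        "Cite the source id in brackets after every claim.\n"
        "If the sources do not contain the answer, reply exactly: "
        '"I can\'t answer that from the available documents."\n\n'
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The refusal line matters as much as the citations. Without an explicit out, the model will improvise one.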
None of those layers are technically complex. All of them are easy to skip. The ones you skip are the ones that show up in your incident channel six months later.
The first question to ask
Before you pick a vector database, before you pick an embedding model, ask one thing. What does correct look like, and how will we know when we have it?
Define the eval first. Build the pipeline second. Deploy when the eval passes.
That order is not optional.