Retrieval-augmented generation (RAG)
Give a model facts it never trained on by retrieving relevant text and putting it in the prompt.
Prerequisites
- Embeddings & semantic search
- Calling an LLM API
You will learn
- Explain why RAG reduces hallucination and stale answers
- Build a retrieve-then-generate pipeline end to end
- Spot the common failure points and how to mitigate them
A model only knows what it was trained on, up to its cutoff date. RAG is how you make it answer questions about your documents, your prices, today's policy: anything it never saw. The idea is simple: find the relevant text first, then ask the model to answer using only that text.
Overview
Retrieval-augmented generation has two stages. Retrieve: given a question, find the most relevant chunks from your knowledge base (using the embeddings and semantic search from last week). Generate: put those chunks into the prompt as context and ask the model to answer based on them. The model stops guessing from memory and starts reading from the material you supplied.
Key ideas
Why it works
Two of the biggest LLM weaknesses are hallucination and stale knowledge. RAG addresses both: the answer is grounded in retrieved text you control, and that text can be as fresh as your last update. You are not retraining the model — you are changing what it reads at question time.
Chunking your documents
Long documents are split into smaller chunks before embedding, because retrieval works better on focused passages than on whole files, and because chunks must fit the context window. A few hundred tokens per chunk with a small overlap between consecutive chunks is a sensible default — the overlap stops a sentence that straddles a boundary from losing its meaning.
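A minimal sketch of chunking with overlap, to make the default above concrete. It counts whitespace-separated words as a rough stand-in for tokens, which is an assumption: a real pipeline would count tokens with the tokenizer of its embedding model. The name chunk_text and its parameters are hypothetical, not part of the earlier lesson.

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into overlapping chunks.

    Sizes are measured in whitespace-separated words, used here only
    as an approximation of tokens (an assumption of this sketch).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text,
        # so the final chunk is not repeated as a shorter fragment.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_text(sample, chunk_size=4, overlap=1))
```

Each chunk starts with the last word of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.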
The pipeline in code
This builds on the embed and cosine_similarity helpers from the embeddings lesson.
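For readers who do not have that lesson at hand, here is one way cosine_similarity might look. This is a sketch in plain Python, not necessarily the version from the embeddings lesson. The embed helper is left out because it depends on whichever embedding model you use; the pipeline only assumes it takes a string and returns a list of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Returns a value from -1.0 to 1.0; higher means more similar.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat it as "not similar".
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```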
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def answer_with_rag(question, chunks, chunk_vectors, top_k=3):
    # Embed the question so it can be compared with the chunk vectors.
    q_vec = embed(question)
    # Rank chunks by similarity to the question, most similar first.
    ranked = sorted(
        zip(chunks, chunk_vectors),
        key=lambda pair: cosine_similarity(q_vec, pair[1]),
        reverse=True,
    )
    # Keep the top_k chunks and join them with a visible separator.
    context = "\n\n---\n\n".join(chunk for chunk, _ in ranked[:top_k])
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=400,
        system=(
            "Answer using only the provided context. "
            "If the answer is not in the context, say you do not know."
        ),
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    # The reply text is in the first content block of the response.
    return response.content[0].text
The instruction that prevents lies
Notice the system prompt: answer only from context, and admit when the answer is not there. Without this, the model happily fills gaps from its training memory, which defeats the purpose. Telling it to say "I do not know" is what keeps RAG honest.
Quick recap
- RAG = retrieve relevant chunks, then generate an answer grounded in them.
- It fixes stale knowledge and reduces hallucination without retraining.
- Chunk documents into focused passages with a little overlap before embedding.
- Instruct the model to answer only from context and admit when it cannot.
- When answers are wrong, debug retrieval first — most failures are there.
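To make the last recap point concrete, here is a rough sketch of how you might inspect retrieval on its own, before the model is involved. The function top_matches is hypothetical and takes the similarity function as a parameter, so the demo can run with toy vectors and a plain dot product instead of real embeddings.

```python
def top_matches(q_vec, chunks, chunk_vectors, similarity, top_k=3):
    """Return the top_k (score, chunk) pairs, highest score first."""
    scored = [
        (similarity(q_vec, vec), chunk)
        for chunk, vec in zip(chunks, chunk_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def dot(a, b):
    # Toy similarity for the demo; the lesson's pipeline uses cosine_similarity.
    return sum(x * y for x, y in zip(a, b))


# Made-up chunks and vectors, only to show the shape of the output.
chunks = ["refund policy", "shipping times", "refund deadlines"]
vectors = [[1, 0], [0, 1], [1, 1]]
for score, chunk in top_matches([2, 1], chunks, vectors, dot, top_k=2):
    print(f"{score:.2f}  {chunk}")
```

If the chunk that holds the answer is missing from this list, no prompt wording will recover it; the fix belongs in chunking, embedding, or top_k.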