Retrieval-augmented generation (RAG)

A model knows only what it was trained on, up to its training cutoff. RAG is how you make it answer questions about material it never saw: your documents, your prices, today's policy. The idea is simple: find the relevant text first, then ask the model to answer using only that text.

Overview

Retrieval-augmented generation has two stages. Retrieve: given a question, find the most relevant chunks from your knowledge base (using the embeddings and semantic search from last week). Generate: put those chunks into the prompt as context and ask the model to answer based on them. The model stops guessing from memory and starts reading from the material you supplied.

Key ideas

Why it works

Two of the biggest LLM weaknesses are hallucination and stale knowledge. RAG addresses both: the answer is grounded in retrieved text you control, and that text can be as fresh as your last update. You are not retraining the model — you are changing what it reads at question time.

Chunking your documents

Long documents are split into smaller chunks before embedding, for two reasons. Retrieval works better on focused passages than on whole files, and the retrieved chunks have to fit in the prompt's context window alongside the question. A few hundred tokens per chunk with a small overlap between consecutive chunks is a sensible default. The overlap makes it more likely that a sentence straddling a boundary appears whole in at least one chunk.
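
The splitting step described above can be sketched as follows. This is a minimal illustration, not a recommended chunker: the name chunk_text and its parameters are hypothetical, and it counts whitespace-separated words as a rough stand-in for tokens, since exact token counts depend on the tokenizer in use.

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into overlapping chunks.

    Sizes are measured in whitespace-separated words, which is only an
    approximation of tokens (an assumption made for this sketch).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the document,
        # so no trailing chunk consists only of repeated overlap.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk would then be embedded once, ahead of time, for example with chunk_vectors = [embed(c) for c in chunks], assuming the embed helper from the embeddings lesson. Splitting on paragraph or sentence boundaries instead of a fixed word count is a common refinement.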

The pipeline in code

This builds on the embed and cosine_similarity helpers from the embeddings lesson.

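from anthropic import Anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = Anthropic()

def answer_with_rag(question, chunks, chunk_vectors, top_k=3):
    # embed and cosine_similarity are the helpers from the embeddings lesson.
    q_vec = embed(question)

    # Retrieve: rank every chunk by similarity to the question, best first.
    ranked = sorted(
        zip(chunks, chunk_vectors),
        key=lambda pair: cosine_similarity(q_vec, pair[1]),
        reverse=True,
    )
    # Keep the top_k chunks, separated so the model can tell them apart.
    context = "\n\n---\n\n".join(chunk for chunk, _ in ranked[:top_k])

    # Generate: ask the model to answer from the retrieved context only.
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=400,
        system=(
            "Answer using only the provided context. "
            "If the answer is not in the context, say you do not know."
        ),
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    # The reply is a list of content blocks; the first one holds the text.
    return response.content[0].text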
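
The retrieval half of answer_with_rag can also be run on its own, which makes it possible to see which chunks would reach the model and with what scores. The sketch below is an illustration under stated assumptions: retrieve is a hypothetical name, the cosine_similarity shown here is a plain-Python stand-in for the helper from the embeddings lesson, and the three-dimensional vectors are made-up toy values in place of real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Stand-in for the lesson's helper: cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)

def retrieve(q_vec, chunks, chunk_vectors, top_k=3):
    """Return the top_k (score, chunk) pairs, highest score first."""
    scored = [
        (cosine_similarity(q_vec, vec), chunk)
        for chunk, vec in zip(chunks, chunk_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy data: hand-written vectors standing in for real embeddings.
chunks = [
    "Refunds are accepted within 30 days.",
    "Shipping takes 3 to 5 business days.",
    "Our office is closed on public holidays.",
]
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
q_vec = [0.9, 0.1, 0.0]  # pretend embedding of a question about refunds

for score, chunk in retrieve(q_vec, chunks, chunk_vectors, top_k=2):
    print(f"{score:.3f}  {chunk}")
```

If the chunk that contains the answer does not appear in this list, no prompt wording will recover it, so checking the retrieved chunks is a reasonable first step when an answer looks wrong.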

The instruction that prevents lies

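Notice the system prompt: answer only from context, and admit when the answer is not there. Without this, the model tends to fill gaps from its training memory, which defeats the purpose. Telling it to say "I do not know" gives it an explicit alternative to guessing. That lowers the risk of invented answers rather than removing it, so the retrieved context still has to contain the right text.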

Quick recap

RAG = retrieve relevant chunks, then generate an answer grounded in them.
It fixes stale knowledge and reduces hallucination without retraining.
Chunk documents into focused passages with a little overlap before embedding.
Instruct the model to answer only from context and admit when it cannot.
When answers are wrong, debug retrieval first — most failures are there.
