Evals: knowing if it works

You changed the prompt and it feels better. Is it actually better, or did you just get lucky on the three examples you tried? Without evals you cannot tell, and you cannot improve a system you cannot measure. Evals turn "feels better" into a number you can track.

Overview

An eval is a test set for an AI system: a collection of inputs paired with a way to judge whether the output is acceptable. You run your system over the set, grade each output, and get a score. When you change a prompt, model, or retrieval setting, you re-run and compare scores instead of guessing. This is the single habit that separates teams that ship reliable AI from teams that thrash.

Key ideas

Why vibes fail

Manual spot-checking tests a handful of cases and forgets them next time. A change that fixes one case often breaks another you are not looking at. An eval set is your regression guard: it remembers every case that ever mattered and tells you, every run, whether you broke any of them.

Build the test set

Start small and real. Ten to fifty cases drawn from actual usage beats hundreds of invented ones. Each case is an input and a way to check the output.

eval_set = [
    {"input": "Where do you ship?", "must_include": ["Vijayawada"]},
    {"input": "What is your return window?", "must_include": ["7 days"]},
    {"input": "Do you sell electronics?", "must_include": ["do not", "no"]},
]

Grade automatically

For factual or structured tasks, a simple programmatic check is enough and is fast and free.

def grade(output, case):
    text = output.lower()
    return any(phrase.lower() in text for phrase in case["must_include"])
 
def run_evals(answer_fn):
    passed = sum(grade(answer_fn(c["input"]), c) for c in eval_set)
    return passed / len(eval_set)
 
print(f"Score: {run_evals(my_system):.0%}")

When to use an LLM as judge

Some outputs cannot be checked by string matching — tone, helpfulness, whether a summary is faithful. For those, use a second model call as a grader: give it the input, the output, and a rubric, and ask for a pass/fail or a score. Keep the rubric specific, and spot-check the judge itself, because a vague judge is as unreliable as a vague generator.

Quick recap

Evals replace "feels better" with a score you can compare across changes.
They are a regression guard: every fixed case stays checked forever.
Start with 10–50 real cases, each with a way to grade the output.
Use programmatic checks for facts and structure; use an LLM judge for tone and faithfulness.
Add every production bug to the eval set so it cannot return unnoticed.

eval_set = [
    {"input": "Where do you ship?", "must_include": ["Vijayawada"]},
    {"input": "What is your return window?", "must_include": ["7 days"]},
    {"input": "Do you sell electronics?", "must_include": ["do not", "no"]},
]

Grade automatically

For factual or structured tasks, a simple programmatic check is enough and is fast and free.

def grade(output, case):
    text = output.lower()
    return any(phrase.lower() in text for phrase in case["must_include"])
 
def run_evals(answer_fn):
    passed = sum(grade(answer_fn(c["input"]), c) for c in eval_set)
    return passed / len(eval_set)
 
print(f"Score: {run_evals(my_system):.0%}")

When to use an LLM as judge

Quick recap

Evals replace "feels better" with a score you can compare across changes.
They are a regression guard: every fixed case stays checked forever.
Start with 10–50 real cases, each with a way to grade the output.
Use programmatic checks for facts and structure; use an LLM judge for tone and faithfulness.
Add every production bug to the eval set so it cannot return unnoticed.