Evals: knowing if it works
Measure your AI system with a test set and graders so you can improve it on purpose.
Prerequisites
- Calling an LLM API
- Prompting that actually works
You will learn
- Explain why vibes are not enough to judge an AI system
- Build a small eval set with inputs and expected outcomes
- Grade outputs automatically, including with an LLM judge
You changed the prompt and it feels better. Is it actually better, or did you just get lucky on the three examples you tried? Without evals you cannot tell, and you cannot improve a system you cannot measure. Evals turn "feels better" into a number you can track.
Overview
An eval is a test set for an AI system: a collection of inputs paired with a way to judge whether the output is acceptable. You run your system over the set, grade each output, and get a score. When you change a prompt, model, or retrieval setting, you re-run and compare scores instead of guessing. Re-running the same set after every change is the habit that lets a team improve reliably instead of thrashing.
Key ideas
Why vibes fail
Manual spot-checking tests a handful of cases and forgets them next time. A change that fixes one case often breaks another you are not looking at. An eval set is your regression guard: it remembers every case that ever mattered and tells you, every run, whether you broke any of them.
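To make the regression-guard idea concrete, here is a minimal sketch that compares per-case results from two runs. The case IDs and the pass/fail values are made up for illustration, and `find_regressions` and `find_fixes` are hypothetical helper names, not part of any library.

```python
def find_regressions(before, after):
    """Case IDs that passed in the earlier run but fail in the later one."""
    return sorted(
        case_id for case_id, passed in before.items()
        if passed and not after.get(case_id, False)
    )


def find_fixes(before, after):
    """Case IDs that failed (or were absent) earlier but pass now."""
    return sorted(
        case_id for case_id, passed in after.items()
        if passed and not before.get(case_id, False)
    )


# Illustrative results only: case ID -> did the output pass?
before = {"shipping": True, "returns": False, "electronics": True}
after = {"shipping": True, "returns": True, "electronics": False}

score_before = sum(before.values()) / len(before)
score_after = sum(after.values()) / len(after)

print(f"Score: {score_before:.0%} -> {score_after:.0%}")
print("Fixed:", find_fixes(before, after))
print("Regressed:", find_regressions(before, after))
```

In this made-up run the overall score does not move, yet one case was fixed and another broke. Looking at results per case, not only at the total, is what surfaces that kind of trade.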
Build the test set
Start small and real. Ten to fifty cases drawn from actual usage beat hundreds of invented ones. Each case is an input and a way to check the output.
eval_set = [
    {"input": "Where do you ship?", "must_include": ["Vijayawada"]},
    {"input": "What is your return window?", "must_include": ["7 days"]},
    {"input": "Do you sell electronics?", "must_include": ["do not", "no"]},
]
Grade automatically
For factual or structured tasks, a simple programmatic check is enough, and it is fast and free.
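import re
def grade(output, case):
    # Pass if ANY listed phrase appears, case-insensitively, as a whole word or phrase.
    # The word boundaries stop a short phrase like "no" from matching inside "not" or "know".
    text = output.lower()
    return any(re.search(rf"\b{re.escape(p.lower())}\b", text) for p in case["must_include"])
def run_evals(answer_fn):
    # Fraction of cases in eval_set whose output passes its check.
    passed = sum(grade(answer_fn(c["input"]), c) for c in eval_set)
    return passed / len(eval_set)
# my_system is your own function: it takes a question string and returns an answer string.
print(f"Score: {run_evals(my_system):.0%}")
When to use an LLM as judge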
Some outputs cannot be checked by string matching: tone, helpfulness, or whether a summary is faithful to its source. For those, use a second model call as a grader: give it the input, the output, and a rubric, and ask for a pass/fail verdict or a score. Keep the rubric specific, and spot-check the judge's verdicts against your own, because a vague judge is as unreliable as a vague generator.
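A judge along these lines could look like the sketch below. It assumes a `call_model` function that takes a prompt string and returns the model's reply as text; that function, the prompt wording, and the helper names are all assumptions for illustration, so swap in whatever client you already use. A stand-in `fake_model` lets the sketch run without an API key.

```python
JUDGE_PROMPT = """You are grading a customer-support answer.

Question: {question}
Answer: {answer}

Rubric: {rubric}

Reply with exactly one word: PASS or FAIL."""


def judge(question, answer, rubric, call_model):
    """Ask a grader model for a verdict. Returns True for PASS.

    call_model is an assumed callable: prompt string in, reply string out.
    """
    prompt = JUDGE_PROMPT.format(question=question, answer=answer, rubric=rubric)
    reply = call_model(prompt).strip().upper()
    # Anything that is not a clear PASS counts as a failure.
    return reply.startswith("PASS")


def judge_agreement(judge_verdicts, human_verdicts):
    """Fraction of spot-checked cases where the judge matches a human label."""
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)


def fake_model(prompt):
    # Stand-in for a real model call so the example runs offline.
    return "PASS" if "Happy to help" in prompt else "FAIL"


rubric = "The answer is polite and states the return window."
question = "What is your return window?"
answers = [
    "You can return items within 7 days. Happy to help!",
    "Read the policy.",
]

verdicts = [judge(question, a, rubric, fake_model) for a in answers]
human_labels = [True, False]  # your own verdicts on the same outputs

print(verdicts)
print(f"Judge agreement: {judge_agreement(verdicts, human_labels):.0%}")
```

Asking for a one-word verdict keeps parsing trivial, and treating any unclear reply as a failure means a confused judge cannot inflate your score. If agreement with your own labels is low, tighten the rubric before trusting the judge's numbers.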
Quick recap
- Evals replace "feels better" with a score you can compare across changes.
- They are a regression guard: every fixed case stays checked forever.
- Start with 10–50 real cases, each with a way to grade the output.
- Use programmatic checks for facts and structure; use an LLM judge for tone and faithfulness.
- Add every production bug to the eval set so it cannot return unnoticed.