Evaluating LLM Systems in Production: Beyond the Vibe Check

April 24, 2026 · 5 min read

AI, LLM, Evals, Testing, Production

Every LLM project I have seen goes through the same first phase: someone runs the pipeline on a few examples, reads the outputs, and says "yeah, that looks good." Ship it.

The vibe check is fine for a prototype. The problem is what happens three months later, when you want to change the prompt, swap the model, or tweak the retrieval, and you realize you have no way of knowing whether the change made things better or worse. So nobody touches anything. The system fossilizes at "pretty good," held together by fear.

I have lived this on a legal AI product, where "pretty good" is not an acceptable answer when a lawyer relies on the output. Here is the evaluation stack we converged on, in order of effort.

Level 1: a golden dataset, even a small one

The single highest-ROI artifact: a set of real inputs paired with known-good outputs. Not synthetic examples you invented, but real cases from production, with the answer an expert validated.

Fifty cases is enough to start. The discipline is in the curation: cover the easy majority, but deliberately over-represent the hard cases (the ambiguous contract clause, the question with no answer in the corpus, the input that is subtly malformed). Your dataset should look like your nightmares, not your demo.

Then the rule that changes everything: no change to a prompt, a model, or a retrieval step ships without running the golden set. It is a regression test, exactly like in classical engineering, except the assertions are fuzzier. Which brings us to scoring.
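
To make the gate concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `GoldenCase` fields, the `run_golden_set`, `regressions`, and `gate` helpers, and the toy pipelines standing in for a real system. The scorer is passed in as a function so it can be swapped for something fuzzier than exact match.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One expert-validated case (hypothetical shape, adapt to your data)."""
    case_id: str
    input_text: str
    expected: str
    tags: tuple[str, ...] = ()


def run_golden_set(
    cases: list[GoldenCase],
    pipeline: Callable[[str], str],
    scorer: Callable[[str, str], bool],
) -> dict[str, bool]:
    """Run every case through the pipeline and record pass/fail per case."""
    return {c.case_id: scorer(pipeline(c.input_text), c.expected) for c in cases}


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Cases that passed before the change and fail after it."""
    return sorted(
        cid for cid, ok in baseline.items() if ok and not candidate.get(cid, False)
    )


def gate(baseline: dict[str, bool], candidate: dict[str, bool],
         max_regressions: int = 0) -> bool:
    """True if the change is allowed to ship under the regression budget."""
    return len(regressions(baseline, candidate)) <= max_regressions


def exact(output: str, expected: str) -> bool:
    return output.strip() == expected


# Toy stand-ins for the current and the modified pipeline.
def old_pipeline(text: str) -> str:
    return {"Notice period in clause 4?": "30 days"}.get(text, "UNKNOWN")


def new_pipeline(text: str) -> str:
    return {"Notice period in clause 4?": "30 days"}.get(text, "Probably 60 days")


cases = [
    GoldenCase("easy-1", "Notice period in clause 4?", "30 days"),
    GoldenCase("hard-1", "Who is the guarantor?", "UNKNOWN", tags=("no-answer",)),
]
baseline = run_golden_set(cases, old_pipeline, exact)
candidate = run_golden_set(cases, new_pipeline, exact)
print(regressions(baseline, candidate), gate(baseline, candidate))
```

Comparing per-case results rather than a single aggregate score is a deliberate choice in this sketch: an overall pass rate can stay flat while a hard case quietly flips from pass to fail.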

Level 2: scoring, deterministic where possible, LLM-judge where not

Some outputs can be checked mechanically. Extraction tasks: did the dates, amounts, party names match? Classification: exact label comparison. Structured output: does it parse, does it validate against the schema? Do this wherever you can; it is cheap and unambiguous.
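
A sketch of what those mechanical checks can look like, using only the Python standard library. The helper names (`field_matches`, `label_matches`, `parses_and_validates`) and the normalization rules are assumptions, and the hand-rolled key/type check is a stand-in for a real schema validator.

```python
from __future__ import annotations

import json


def _norm(value: object) -> str | None:
    """Collapse whitespace and case so trivial formatting differences pass."""
    if value is None:
        return None
    return " ".join(str(value).split()).casefold()


def field_matches(predicted: dict, expected: dict) -> dict[str, bool]:
    """Extraction: compare each expected field (dates, amounts, parties)."""
    return {k: _norm(predicted.get(k)) == _norm(v) for k, v in expected.items()}


def label_matches(predicted: str, expected: str) -> bool:
    """Classification: exact label comparison after normalization."""
    return _norm(predicted) == _norm(expected)


def parses_and_validates(raw: str, required: dict[str, type]) -> bool:
    """Structured output: must parse as a JSON object with typed required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


result = field_matches(
    {"party": "ACME  Corp", "amount": "1000"},
    {"party": "acme corp", "amount": "1000", "date": "2024-01-01"},
)
print(result)
```

Returning a per-field result instead of a single boolean makes it easier to see which field regressed when a run fails.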

For free-text outputs (summaries, analyses, recommendations) you need a judge, and the pragmatic option is an LLM grading against a rubric. This works, with caveats I learned the painful way:

A judge needs a rubric, not a feeling. "Rate this 1 to 10" produces noise. "Does the summary mention the termination clause? yes/no. Does it invent any fact absent from the source? yes/no" produces signal. Decompose quality into binary checks.
Calibrate the judge against humans before trusting it. Take 30 cases, have experts score them, have the judge score them, measure agreement. Ours initially over-rewarded confident, well-written answers that were subtly wrong, the same bias the underlying model has. We fixed the rubric until agreement was acceptable. Skipping this step means automating your own blind spot.
Use a different model as judge than the one generating, or at minimum a different configuration. A model grading its own homework is exactly as reliable as that sounds.
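
The rubric and the calibration step can be sketched as follows. This is an illustration under stated assumptions, not the setup described above: `ask_judge` is a hypothetical callable wrapping whatever judge model you use, the `Check` structure is invented for the example, and Cohen's kappa is one reasonable agreement measure among several.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    name: str
    question: str
    passing_answer: bool  # which yes/no answer counts as a pass


RUBRIC = [
    Check("mentions_termination",
          "Does the summary mention the termination clause?", True),
    Check("no_invented_facts",
          "Does it invent any fact absent from the source?", False),
]


def grade(source: str, summary: str,
          ask_judge: Callable[[str, str, str], bool]) -> dict[str, bool]:
    """Ask one yes/no question per check; True in the result means 'passed'."""
    return {
        c.name: ask_judge(c.question, source, summary) == c.passing_answer
        for c in RUBRIC
    }


def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Agreement between two binary raters, corrected for chance agreement."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_h, p_j = sum(human) / n, sum(judge) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    if expected == 1.0:  # both raters gave a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Expert verdicts vs judge verdicts on the same check across four cases.
human_labels = [True, True, False, False]
judge_labels = [True, False, False, False]
print(cohens_kappa(human_labels, judge_labels))
```

Raw percent agreement looks flattering when most cases pass; a chance-corrected measure is harder to fool. What counts as "acceptable" agreement is a judgment call for your domain and is not something this sketch decides.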

Level 3: production is the real eval

Offline evals tell you a change is safe to ship. They do not tell you the system works for real users on real traffic, because production inputs drift in ways your golden set will always lag behind.

Three things proved their worth:

Log everything, structured. Every generation: full input context, output, model version, prompt version, latency, token counts. This is your raw material. When a user reports a bad output, you replay the exact context instead of guessing. And it feeds the next iteration of the golden dataset: yesterday's production failure is tomorrow's regression case.
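
As an illustration, a structured record covering the fields listed above could look like this. The `GenerationRecord` class, its field names, and the `to_golden_case` helper are assumptions for the sketch; the point is one self-describing JSON line per generation.

```python
from __future__ import annotations

import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class GenerationRecord:
    """One generation, with everything needed to replay it later."""
    input_context: str      # full context sent to the model, not just the question
    output: str
    model_version: str
    prompt_version: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def to_jsonl(record: GenerationRecord) -> str:
    """Serialize to a single JSON line (newlines in text are escaped by json)."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


def to_golden_case(record: GenerationRecord, expert_answer: str) -> dict:
    """Turn a logged failure into a regression case once an expert has fixed it."""
    return {
        "case_id": record.record_id,
        "input": record.input_context,
        "expected": expert_answer,
        "source": "production-failure",
    }


record = GenerationRecord(
    input_context="Clause 4: ...\nQuestion: what is the notice period?",
    output="60 days",
    model_version="model-2026-01",   # hypothetical identifiers
    prompt_version="v12",
    latency_ms=840.0,
    prompt_tokens=512,
    completion_tokens=8,
)
line = to_jsonl(record)
print(line)
```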

Cheap online signals. Did the user retry the question? Edit the generated draft heavily? Abandon the flow? None of these is a quality score, but they cluster around problems and tell you where to look.
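
Two of these signals are cheap to approximate with the standard library's `difflib`. The sketch below is an assumption-heavy illustration: the function names and the 0.8 similarity threshold are invented, and the numbers should be tuned on your own traffic.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def edit_ratio(generated: str, final: str) -> float:
    """How much of the draft the user changed: 0.0 untouched, 1.0 rewritten."""
    return 1.0 - SequenceMatcher(None, generated, final).ratio()


def _normalize(text: str) -> str:
    return " ".join(text.casefold().split())


def is_retry(previous_question: str, next_question: str,
             threshold: float = 0.8) -> bool:
    """Treat a near-identical follow-up question as a retry (assumed threshold)."""
    similarity = SequenceMatcher(
        None, _normalize(previous_question), _normalize(next_question)
    ).ratio()
    return similarity >= threshold


print(edit_ratio("The notice period is 30 days.", "The notice period is 30 days."))
print(is_retry("What is the notice period?", "what is the  notice period"))
```

These are pointers for where to look, not quality scores: a heavily edited draft may simply reflect a user's house style.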

Human-in-the-loop sampling. Every week, a domain expert reviews a sample of production outputs against the rubric. On the legal product, this calibration ritual surfaced the gap between what our metrics said and what experts saw: the same lesson I learned on analytics scoring. The hard part is never the math; it is the alignment between measurement and ground truth.
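
One way to draw that weekly sample is sketched below. Stratifying by a key (for example, whether an online signal flagged the record) is an assumption of this sketch rather than something the process above prescribes, as are the `weekly_sample` name and its parameters. The fixed seed makes the draw reproducible, so the same week's sample can be regenerated later.

```python
from __future__ import annotations

import random
from collections import defaultdict
from typing import Callable


def weekly_sample(
    records: list[dict],
    per_stratum: int,
    key: Callable[[dict], str],
    seed: int,
) -> list[dict]:
    """Draw up to `per_stratum` records from each group, reproducibly."""
    rng = random.Random(seed)
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample: list[dict] = []
    for name in sorted(groups):  # fixed iteration order keeps the draw stable
        items = groups[name]
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample


# Toy log: every fifth record was flagged by an online signal.
records = [{"id": i, "flagged": i % 5 == 0} for i in range(20)]


def by_flag(record: dict) -> str:
    return "flagged" if record["flagged"] else "unflagged"


picked = weekly_sample(records, per_stratum=3, key=by_flag, seed=17)
print(len(picked), sum(r["flagged"] for r in picked))
```

Sampling some unflagged records alongside the flagged ones matters: reviewing only what the signals already caught cannot reveal what they miss.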

The loop matters more than any single component. Production failures become golden cases, golden cases gate releases, releases generate logs, logs feed expert review. A flywheel, not a checklist.

What it costs, honestly

Building this took a few weeks spread over a quarter, plus a recurring hour or two of expert time per week. That expert time is the part teams resist: pulling a senior lawyer or a CS lead away from their job to grade outputs feels expensive.

It is the cheapest insurance you will ever buy. The alternative is the fossilized system: nobody dares change the prompt, model upgrades get postponed for months, and quality discussions happen via anecdote and seniority instead of data. I have watched a team argue for two weeks about whether a change "felt better." The eval run would have taken twenty minutes.

Start the dataset on day one

Start the golden dataset the day you start the prototype, not the month before launch. Every interesting case you encounter during development is a free eval case, so capture it then, because reconstructing it later is miserable.

And accept that evals for LLM systems are never finished in the way tests for deterministic code can be. The distribution shifts, the models change, the users surprise you. The goal is not a perfect score. The goal is being able to change your system without fear. That is what evals buy you, and it is worth every hour.
