Blog

What Are Evals?

This Week's Term: Evals - structured tests that measure whether an AI system is performing to your standards, consistently and over time.

AI Terminology · AI Safety · Quality Assurance · Castifai · Production AI

If you've been building with AI, you've probably experienced the "vibe check" approach to quality. You try a prompt, look at the output, think "yeah, that seems good," and move on. It works when you're experimenting. It breaks down the moment you're running anything in production.

Here's why: traditional software is deterministic. The same input produces the same output, every time. If your login function works on Tuesday, it works on Wednesday. AI systems don't have that guarantee. The same prompt can produce subtly different outputs each time. And when the model provider pushes an update - which happens regularly - a capability that worked perfectly last week might degrade without warning.

Evals replace the vibe check with evidence.

Why this matters for business leaders

Think of evals as the quality assurance and governance layer for AI systems. In traditional software, you have unit tests, integration tests, and QA processes that catch problems before they reach users. Evals serve the same function for AI, but they're designed for systems where outputs are probabilistic rather than deterministic.

This makes evals a risk-control mechanism, not just a technical concern. When your AI system handles customer-facing interactions, generates content that represents your brand, or makes recommendations that affect business outcomes - you need more than someone occasionally checking if "it looks right."

Model updates are a particular risk. A new version might improve general reasoning while degrading performance on your specific use case. Without evals, you won't know until customers complain. With evals, you catch it before deployment.

How I built evals into Castifai

When I was building Castifai, I realized early that subjective quality assessment wouldn't scale. I couldn't personally review every generated infographic, and my sense of "good enough" wasn't necessarily aligned with what users actually valued. So I built two feedback loops.

User-facing flags. Users can mark any generated output as "good" or "needs improvement." This creates a continuous stream of real-world quality signals that goes beyond my own judgment. When multiple users flag a specific visual style as problematic, that's data - not opinion.

Prompt-level tracking. I cluster the feedback by visual style and prompt template. This means I can see that my "corporate clean" style has a 92% satisfaction rate while "bold colorful" is at 78%. When a model update lands, I can immediately check if those numbers shift. If "corporate clean" drops to 85% after an update, I know exactly which prompts need retuning - without waiting for a wave of complaints.
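As a rough sketch of what that prompt-level tracking can look like in code: the helper below computes a satisfaction rate per style from user flags and surfaces any style whose rate drops past a tolerance after a model update. The style names come from the article; the data, function names, and 5-point drop threshold are illustrative assumptions, not Castifai's actual implementation.

```python
def satisfaction_rate(flags):
    """flags is a list of booleans: True = 'good', False = 'needs improvement'."""
    return sum(flags) / len(flags) if flags else 0.0

def find_regressions(before, after, max_drop=0.05):
    """Return styles whose satisfaction rate fell by more than max_drop
    between two feedback windows (e.g. pre- and post-model-update)."""
    regressions = {}
    for style, flags in after.items():
        old = satisfaction_rate(before.get(style, []))
        new = satisfaction_rate(flags)
        if old - new > max_drop:
            regressions[style] = (old, new)
    return regressions

# Illustrative feedback, mirroring the numbers in the text.
before = {"corporate clean": [True] * 92 + [False] * 8,   # 92% satisfied
          "bold colorful":   [True] * 78 + [False] * 22}  # 78% satisfied
after  = {"corporate clean": [True] * 85 + [False] * 15,  # dropped to 85%
          "bold colorful":   [True] * 79 + [False] * 21}  # roughly stable

print(find_regressions(before, after))  # only "corporate clean" is flagged
```

The point of clustering by style or template is exactly this: a single aggregate score would average the regression away, while per-template rates tell you which prompts to retune.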

A starter framework for your organization

You don't need a sophisticated testing infrastructure to start. Here's a practical framework:

1. Define your quality bar. Write down 3-5 pass/fail criteria tied to business impact. Not "the output should be good" but "the output must include accurate pricing" or "the response must reference the customer's specific situation." These criteria should be binary - pass or fail, no maybes.

2. Build a representative test set. Collect 20-50 examples that represent your actual use cases, including edge cases. The common mistake is testing only the happy path. Include the weird inputs, the ambiguous requests, the cases that currently trip up your team.

3. Establish consistent scoring. Whether you use human reviewers, automated checks, or a combination - the method should be repeatable. If two reviewers can't agree on whether an output passes, your criteria aren't specific enough.

4. Set release gates. Before any model update or prompt change goes live, it must pass your test set at a predetermined threshold. "90% pass rate on the test set" is a concrete gate. "Looks good to the team" is not.
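The four steps above can be sketched as a tiny eval harness: binary criteria (step 1), a test set of outputs (step 2), a repeatable scoring function (step 3), and a pass-rate gate (step 4). The specific checks, test set, and 90% threshold here are illustrative assumptions standing in for your own business criteria.

```python
# Step 1: binary pass/fail criteria. Each check answers yes or no, no maybes.
def includes_pricing(output: str) -> bool:
    return "$" in output  # stand-in for a real "accurate pricing" check

def non_empty_response(output: str) -> bool:
    return bool(output.strip())

CRITERIA = [includes_pricing, non_empty_response]

# Step 3: a repeatable scoring rule - an output passes only if every check does.
def passes(output: str) -> bool:
    return all(check(output) for check in CRITERIA)

# Step 4: gate a release on pass rate over the whole test set.
def run_eval(outputs, threshold=0.9):
    rate = sum(passes(o) for o in outputs) / len(outputs)
    return rate, rate >= threshold

# Step 2: a (tiny) representative test set of model outputs,
# including one edge case that omits pricing entirely.
outputs = ["Plan A costs $29/month.", "Plan B is $99/year.", "Contact sales."]
rate, ship = run_eval(outputs)
print(f"pass rate {rate:.0%}, release gate {'passed' if ship else 'blocked'}")
```

In practice the checks would call out to human review, string matching, or an LLM judge, but the shape stays the same: the gate is a number compared against a threshold, not a team's impression.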

The connection to everything else

Evals tie directly to the themes in the rest of this issue. They operationalize your first principles - if you've committed to "quality and trust are non-negotiable," evals are how you enforce that commitment. They enable the value metrics shift - instead of measuring how many outputs your AI generates, you measure how many meet your quality bar. And they protect product quality as models evolve, which is exactly the reliability challenge I described in the tool spotlight.

Without evals, your AI commitment is aspirational. With them, it's operational.

If you are curious and want to dive deeper, I recommend watching the Wharton analysis on 2026 AI trends by Stefano Puntoni, which covers how evaluation frameworks are becoming central to enterprise AI strategy.

Your action step

Create a one-page eval specification for your most important AI use case. Include: your top 5 test scenarios, the top 3 failure modes you're most worried about, how you'll score outputs (pass/fail criteria), and the minimum threshold for launch. One page. That's your starting point for moving from vibe checks to evidence-based AI quality.
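If it helps to make the one-pager concrete, the same specification can be captured as a small structured record that lives in version control next to your prompts. This is a hypothetical sketch; every field value below is illustrative, not a recommendation for your use case.

```python
from dataclasses import dataclass

@dataclass
class EvalSpec:
    """A one-page eval specification as a checked-in artifact."""
    use_case: str
    test_scenarios: list[str]       # your top 5 scenarios
    failure_modes: list[str]        # the top 3 failures you fear most
    pass_fail_criteria: list[str]   # binary checks, stated in plain language
    launch_threshold: float         # minimum pass rate before anything ships

spec = EvalSpec(
    use_case="customer support reply drafting",
    test_scenarios=["refund request", "angry escalation", "ambiguous question",
                    "multi-part question", "out-of-scope request"],
    failure_modes=["invented policy details", "off-brand tone",
                   "missing required disclaimer"],
    pass_fail_criteria=["references the customer's specific situation",
                        "quotes only real, current policies",
                        "includes the required disclaimer"],
    launch_threshold=0.90,
)
print(f"{spec.use_case}: gate at {spec.launch_threshold:.0%}")
```

Writing it as data rather than a document has one nice side effect: the spec can be diffed, reviewed, and wired directly into the release gate that enforces it.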

Originally published in Think Big Newsletter #19.