What are Evals

Evals are a way to measure the quality of your Agents and Teams. Kern provides multiple dimensions for evaluating them.

Learn how to evaluate your Kern Agents and Teams across four dimensions: accuracy (simple correctness checks), agent as judge (custom quality criteria), performance (runtime and memory), and reliability (tool calls).
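
As a rough illustration of what the performance and reliability dimensions look at, the sketch below uses only the Python standard library. It is not Kern's implementation: `measure`, `missing_tool_calls`, and `fake_agent_run` are hypothetical names introduced here, and the stand-in agent simply echoes its prompt and reports a fixed list of tool names.

```python
import time
import tracemalloc
from typing import Any, Callable


def measure(func: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float, int]:
    """Call func once and return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def missing_tool_calls(expected: list[str], recorded: list[str]) -> list[str]:
    """Return the expected tool names that never appear in the recorded calls."""
    seen = set(recorded)
    return [name for name in expected if name not in seen]


def fake_agent_run(prompt: str) -> tuple[str, list[str]]:
    # Hypothetical stand-in for an agent run: returns a reply plus the names
    # of the tools it "called". A real run would invoke a model.
    scratch = [prompt] * 10_000  # allocate something so memory use is visible
    return f"echo: {prompt} ({len(scratch)} copies)", ["multiply", "exponentiate"]


(reply, tools_used), seconds, peak_bytes = measure(fake_agent_run, "10*5 squared")
print(f"runtime: {seconds:.6f}s, peak memory: {peak_bytes} bytes")
print("missing tools:", missing_tool_calls(["multiply", "exponentiate"], tools_used))
```

A single call gives a noisy timing; repeating the measurement several times and averaging would be the natural next step when comparing configurations.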

Evaluation Dimensions

Quick Start

Here's a simple example of running an accuracy evaluation:

from typing import Optional
from kern.agent import Agent
from kern.eval.accuracy import AccuracyEval, AccuracyResult
from kern.models.openai import OpenAIResponses
from kern.tools.calculator import CalculatorTools

# Create an evaluation
evaluation = AccuracyEval(
    model=OpenAIResponses(id="gpt-5.2"),
    agent=Agent(model=OpenAIResponses(id="gpt-5.2"), tools=[CalculatorTools()]),
    input="What is 10*5 then to the power of 2? do it step by step",
    expected_output="2500",
    additional_guidelines="Agent output should include the steps and the final answer.",
)

# Run the evaluation
result: Optional[AccuracyResult] = evaluation.run(print_results=True)

Best Practices

  • Start Simple: Begin with basic accuracy tests before progressing to complex performance and reliability evaluations
  • Use Multiple Test Cases: Don't rely on a single test case; build comprehensive test suites that cover edge cases
  • Track Over Time: Monitor your eval metrics continuously as you iterate on your agents
  • Combine Dimensions: Evaluate across all four dimensions for a holistic view of agent quality
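
The "multiple test cases" and "track over time" practices can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Kern's API: `Case`, `run_suite`, `record_run`, and `toy_agent` are hypothetical names, and scoring here is a simple substring check rather than an LLM judge.

```python
import json
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Case:
    # Hypothetical test-case record; not a Kern type.
    input: str
    expected_output: str


def run_suite(agent_fn: Callable[[str], str], cases: list[Case]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    if not cases:
        raise ValueError("suite needs at least one case")
    passed = sum(1 for c in cases if c.expected_output in agent_fn(c.input))
    return passed / len(cases)


def record_run(history_path: Path, label: str, pass_rate: float) -> None:
    """Append one result as a JSON line so runs can be compared over time."""
    entry = {"label": label, "pass_rate": pass_rate, "timestamp": time.time()}
    with history_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def toy_agent(prompt: str) -> str:
    # Stand-in for a real agent call: it only knows one answer.
    answers = {"What is 10*5?": "10*5 = 50"}
    return answers.get(prompt, "I don't know")


cases = [
    Case("What is 10*5?", "50"),
    Case("What is 2 to the power of 10?", "1024"),  # case the toy agent misses
]
pass_rate = run_suite(toy_agent, cases)
print(f"pass rate: {pass_rate:.0%}")
```

Calling `record_run(Path("eval_history.jsonl"), "baseline", pass_rate)` after each run (the file name is an arbitrary choice) builds a history that can be plotted or diffed as the agent changes.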

Guides

Dive deeper into each evaluation dimension:

  1. Accuracy Evals - Learn LLM-as-a-judge techniques and multiple test case strategies
  2. Agent as Judge Evals - Define custom quality criteria with flexible scoring strategies
  3. Performance Evals - Measure latency, memory usage, and compare different configurations
  4. Reliability Evals - Test tool calls, error handling, and rate limiting behavior