What are Evals

Evals are a way to measure the quality of your Agents and Teams. Kern provides multiple dimensions for evaluating them.

Learn how to evaluate your Kern Agents and Teams across four dimensions: accuracy (simple correctness checks), agent as judge (custom quality criteria), performance (runtime and memory), and reliability (tool calls).

Evaluation Dimensions

Accuracy

How closely the Agent's response matches an expected output, scored using LLM-as-a-judge methodology.

Agent as Judge

Evaluate custom quality criteria using LLM-as-a-judge with scoring.

Performance

The performance of the Agent's response, including latency and memory footprint.

Reliability

The reliability of the Agent's response, including tool calls and error handling.
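
To make the Performance dimension concrete, here is a minimal sketch of what measuring latency and memory footprint can look like. It uses only the Python standard library, not Kern's performance eval API, which this page does not show. The names `PerfSample`, `measure`, and `fake_agent_run` are hypothetical, and `fake_agent_run` is a stand-in for whatever call invokes your Agent.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PerfSample:
    seconds: float   # wall-clock time for one call
    peak_bytes: int  # peak Python heap allocation traced during the call


def measure(func: Callable[[], object], iterations: int = 5) -> List[PerfSample]:
    """Run func several times, recording latency and peak memory per run."""
    samples = []
    for _ in range(iterations):
        tracemalloc.start()
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
        tracemalloc.stop()
        samples.append(PerfSample(seconds=elapsed, peak_bytes=peak))
    return samples


def fake_agent_run() -> str:
    # Hypothetical stand-in for an Agent call: allocates some memory and returns text.
    data = [str(i) for i in range(10_000)]
    return ",".join(data)


samples = measure(fake_agent_run, iterations=3)
avg_seconds = sum(s.seconds for s in samples) / len(samples)
max_peak = max(s.peak_bytes for s in samples)
print(f"avg latency: {avg_seconds:.6f}s, peak memory: {max_peak} bytes")
```

Tracing allocations with `tracemalloc` slows the measured call, so in practice you may prefer to time and trace in separate runs. Running several iterations and looking at the average smooths out one-off spikes.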

Quick Start

Here's a simple example of running an accuracy evaluation:

from typing import Optional
from kern.agent import Agent
from kern.eval.accuracy import AccuracyEval, AccuracyResult
from kern.models.openai import OpenAIResponses
from kern.tools.calculator import CalculatorTools

# Create an evaluation
evaluation = AccuracyEval(
    model=OpenAIResponses(id="gpt-5.2"),
    agent=Agent(model=OpenAIResponses(id="gpt-5.2"), tools=[CalculatorTools()]),
    input="What is 10*5 then to the power of 2? do it step by step",
    expected_output="2500",
    additional_guidelines="Agent output should include the steps and the final answer.",
)

# Run the evaluation
result: Optional[AccuracyResult] = evaluation.run(print_results=True)

Best Practices

Start Simple: Begin with basic accuracy tests before progressing to complex performance and reliability evaluations
Use Multiple Test Cases: Don't rely on a single test case—build comprehensive test suites that cover edge cases
Track Over Time: Monitor your eval metrics continuously as you iterate on your agents
Combine Dimensions: Evaluate across all four dimensions for a holistic view of agent quality
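
The "Use Multiple Test Cases" and "Track Over Time" practices can be sketched as a small suite runner that reports a pass rate you can log after each change. This is an illustration under stated assumptions, not Kern's API: `EvalCase`, `run_suite`, and `stub_judge` are hypothetical names, and the stub answers from a canned table so the example runs without a model. In a real setup, the `passes` callable would run an evaluation such as the `AccuracyEval` from the Quick Start and decide pass or fail from its result.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    input: str
    expected_output: str


def run_suite(cases: List[EvalCase], passes: Callable[[EvalCase], bool]) -> float:
    """Return the fraction of cases for which `passes` returns True."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = 0
    for case in cases:
        ok = passes(case)
        print(f"{'PASS' if ok else 'FAIL'}: {case.input}")
        passed += int(ok)
    return passed / len(cases)


# Hypothetical canned answers standing in for a real Agent.
# The last answer is deliberately wrong to show a failing edge case.
CANNED_ANSWERS = {
    "What is 10*5?": "50",
    "What is 50 to the power of 2?": "2500",
    "What is 0 divided by 5?": "1",
}


def stub_judge(case: EvalCase) -> bool:
    # Assumption: exact string match replaces the LLM judge in this sketch.
    return CANNED_ANSWERS.get(case.input) == case.expected_output


cases = [
    EvalCase("What is 10*5?", "50"),
    EvalCase("What is 50 to the power of 2?", "2500"),
    EvalCase("What is 0 divided by 5?", "0"),  # edge case
]

pass_rate = run_suite(cases, stub_judge)
history = [pass_rate]  # append one entry per run to track the metric over time
print(f"pass rate: {pass_rate:.2f}")
```

Keeping the judge as a plain callable lets the same suite run against a stub in quick local checks and against a real evaluation when you want a full quality signal.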

Guides

Dive deeper into each evaluation dimension:

Accuracy Evals - Learn LLM-as-a-judge techniques and multiple test case strategies
Agent as Judge Evals - Define custom quality criteria with flexible scoring strategies
Performance Evals - Measure latency, memory usage, and compare different configurations
Reliability Evals - Test tool calls, error handling, and rate limiting behavior