What are Evals
Evals are a way to measure the quality of your Agents and Teams. Kern provides multiple dimensions for evaluating Agents.
Learn how to evaluate your Kern Agents and Teams across multiple dimensions - accuracy (simple correctness checks), agent as judge (custom quality criteria), performance (runtime and memory), and reliability (tool calls).
Evaluation Dimensions
Accuracy
How correct the Agent's response is compared with an expected output, judged using an LLM-as-a-judge methodology.
Agent as Judge
Evaluate custom quality criteria using LLM-as-a-judge with scoring.
Performance
The performance of the Agent's response, including latency and memory footprint.
Reliability
The reliability of the Agent's response, including tool calls and error handling.
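To make the reliability dimension more concrete, here is a minimal sketch of a tool-call check written in plain Python. It is not Kern's reliability API: the `check_tool_calls` helper and the tool names `multiply` and `exponentiate` are hypothetical, and the sketch assumes you can obtain the list of tool names the Agent invoked during a run.

```python
from typing import Iterable, List, Tuple


def check_tool_calls(
    actual_calls: Iterable[str], expected_calls: Iterable[str]
) -> Tuple[bool, List[str]]:
    """Return (passed, missing), where `missing` lists expected tools never called."""
    seen = set(actual_calls)
    missing = [name for name in expected_calls if name not in seen]
    return (not missing, missing)


# Hypothetical tool names recorded from an agent run (assumption, not Kern output).
recorded_calls = ["multiply", "exponentiate"]

passed, missing = check_tool_calls(recorded_calls, ["multiply", "exponentiate"])
print(f"passed={passed}, missing={missing}")
```

A check like this only verifies that the expected tools were called at least once; ordering and argument validation would need additional logic.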
Quick Start
Here's a simple example of running an accuracy evaluation:
```python
from typing import Optional
from kern.agent import Agent
from kern.eval.accuracy import AccuracyEval, AccuracyResult
from kern.models.openai import OpenAIResponses
from kern.tools.calculator import CalculatorTools

# Create an evaluation
evaluation = AccuracyEval(
    model=OpenAIResponses(id="gpt-5.2"),
    agent=Agent(model=OpenAIResponses(id="gpt-5.2"), tools=[CalculatorTools()]),
    input="What is 10*5 then to the power of 2? do it step by step",
    expected_output="2500",
    additional_guidelines="Agent output should include the steps and the final answer.",
)

# Run the evaluation
result: Optional[AccuracyResult] = evaluation.run(print_results=True)
```
Best Practices
- Start Simple: Begin with basic accuracy tests before progressing to complex performance and reliability evaluations
- Use Multiple Test Cases: Don't rely on a single test case. Build comprehensive test suites that cover edge cases
- Track Over Time: Monitor your eval metrics continuously as you iterate on your agents
- Combine Dimensions: Evaluate across all four dimensions for a holistic view of agent quality
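The advice on using multiple test cases and tracking metrics over time can be sketched as a small suite runner. This is an illustration under stated assumptions, not part of Kern: `EvalCase`, `SuiteSummary`, `run_suite`, and the 0 to 10 score scale with a pass threshold of 7 are all hypothetical, and `exact_match_scorer` is a stub standing in for whatever scoring call you actually use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    input: str
    expected_output: str


@dataclass
class SuiteSummary:
    total: int
    passed: int
    mean_score: float

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def run_suite(
    cases: List[EvalCase],
    score_case: Callable[[EvalCase], float],
    pass_threshold: float = 7.0,  # assumed 0-10 scale; adjust to your scorer
) -> SuiteSummary:
    """Score every case and aggregate the results into one summary."""
    scores = [score_case(case) for case in cases]
    passed = sum(1 for score in scores if score >= pass_threshold)
    mean = sum(scores) / len(scores) if scores else 0.0
    return SuiteSummary(total=len(scores), passed=passed, mean_score=mean)


def exact_match_scorer(case: EvalCase) -> float:
    # Hypothetical stub: pretends the agent always answers "2500".
    agent_answer = "2500"
    return 10.0 if agent_answer == case.expected_output else 0.0


CASES = [
    EvalCase("What is 10*5 then to the power of 2?", "2500"),
    EvalCase("What is 50 squared?", "2500"),
    EvalCase("What is 2 to the power of 10?", "1024"),  # edge case the stub fails
]

if __name__ == "__main__":
    summary = run_suite(CASES, exact_match_scorer)
    print(f"{summary.passed}/{summary.total} passed, mean score {summary.mean_score:.2f}")
```

Storing each `SuiteSummary` alongside a timestamp or commit identifier is one simple way to track how these numbers change as the agent evolves.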
Guides
Dive deeper into each evaluation dimension:
- Accuracy Evals - Learn LLM-as-a-judge techniques and multiple test case strategies
- Agent as Judge Evals - Define custom quality criteria with flexible scoring strategies
- Performance Evals - Measure latency, memory usage, and compare different configurations
- Reliability Evals - Test tool calls, error handling, and rate limiting behavior
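As a rough illustration of what the performance dimension measures, the following sketch times a callable and records its peak memory using only the Python standard library. It does not use Kern's performance eval; `measure` and `fake_agent_run` are hypothetical names, and `fake_agent_run` merely stands in for a real agent call.

```python
import time
import tracemalloc
from typing import Callable, Dict, List


def measure(fn: Callable[[], object], iterations: int = 5) -> Dict[str, float]:
    """Run fn several times; report average latency and the highest peak memory."""
    latencies: List[float] = []
    peaks: List[int] = []
    for _ in range(iterations):
        tracemalloc.start()
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peaks.append(peak)
    return {
        "avg_seconds": sum(latencies) / len(latencies),
        "max_peak_bytes": float(max(peaks)),
    }


def fake_agent_run() -> str:
    # Hypothetical stand-in for a real agent invocation.
    return "".join(str(i) for i in range(10_000))


if __name__ == "__main__":
    print(measure(fake_agent_run))
```

Note that `tracemalloc` adds overhead, so latencies measured this way run somewhat higher than untraced ones; measuring time and memory in separate passes avoids that if precise timings matter.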