Comparison Accuracy Evaluation

Example showing how to evaluate agent accuracy on comparison tasks.

Create a Python file named accuracy_comparison.py

from typing import Optional

from kern.agent import Agent
from kern.eval.accuracy import AccuracyEval, AccuracyResult
from kern.models.openai import OpenAIResponses
from kern.tools.calculator import CalculatorTools

evaluation = AccuracyEval(
    name="Comparison Evaluation",
    model=OpenAIResponses(id="gpt-5.2"),
    # Agent under evaluation; it is instructed to use the calculator tools.
    agent=Agent(
        model=OpenAIResponses(id="gpt-5.2"),
        tools=[CalculatorTools()],
        instructions="You must use the calculator tools for comparisons.",
    ),
    input="9.11 and 9.9 -- which is bigger?",
    expected_output="9.9",
    additional_guidelines="It's OK for the output to include additional text or information relevant to the comparison.",
)

# Run the evaluation and fail if the average score falls below 8.
result: Optional[AccuracyResult] = evaluation.run(print_results=True)
assert result is not None and result.avg_score >= 8
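
The expected output used above can be double-checked offline, without calling a model. The following is a minimal sketch that does not use kern at all; the helper name `larger_decimal` is hypothetical and introduced only for illustration. It parses both inputs as exact decimals, which is the numeric comparison the agent is expected to get right.

```python
from decimal import Decimal


def larger_decimal(a: str, b: str) -> str:
    """Return whichever of two decimal strings is numerically larger.

    Hypothetical helper for illustration; not part of kern.
    """
    # Decimal parses the text exactly, so "9.9" compares as 9.90 against 9.11
    # rather than as a version-style "9, then 9" against "9, then 11".
    return a if Decimal(a) >= Decimal(b) else b


if __name__ == "__main__":
    print(larger_decimal("9.11", "9.9"))  # prints 9.9
```

The returned string matches the `expected_output` value passed to the evaluation.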

Set up your virtual environment

On macOS/Linux:

uv venv --python 3.12
source .venv/bin/activate

On Windows:

uv venv --python 3.12
.venv\Scripts\activate

Install dependencies

uv pip install -U openai kern-ai

Export your OpenAI API key

On macOS/Linux:

export OPENAI_API_KEY="your_openai_api_key_here"

On Windows (PowerShell):

$Env:OPENAI_API_KEY="your_openai_api_key_here"
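
Before running the evaluation, it can help to confirm that the key is visible to Python. This is a small sketch using only the standard library; `has_openai_key` is a hypothetical helper name, not something provided by kern or the openai package, and it only checks that the variable is set, not that the key is valid.

```python
import os
from typing import Mapping


def has_openai_key(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if OPENAI_API_KEY is present and non-empty.

    Hypothetical helper for illustration; it does not validate the key.
    """
    # Treat a missing variable and a blank value the same way.
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if has_openai_key():
        print("OPENAI_API_KEY is set.")
    else:
        print("OPENAI_API_KEY is not set; export it before running the evaluation.")
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary.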

Run Agent

python accuracy_comparison.py