Binary Agent as Judge

Binary pass/fail evaluation without numeric scoring

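This example runs a customer service agent, then uses AgentAsJudgeEval to judge the agent's response against a professional-tone criterion. The verdict is a binary PASS or FAIL rather than a numeric score.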

Add the following code to a Python file named agent_as_judge_binary.py

from kern.agent import Agent
from kern.db.sqlite import SqliteDb
from kern.eval.agent_as_judge import AgentAsJudgeEval
from kern.models.openai import OpenAIResponses

# Set up a database to persist eval results
db = SqliteDb(db_file="tmp/agent_as_judge_binary.db")

# Agent whose response will be evaluated
agent = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    instructions="You are a customer service agent. Respond professionally.",
    db=db,
)

# Generate the response to judge
response = agent.run("I need help with my account")

# Judge that checks the response against a single pass/fail criterion
evaluation = AgentAsJudgeEval(
    name="Professional Tone Check",
    criteria="Response must maintain professional tone without informal language or slang",
    db=db,
)

# Run the evaluation on the same input and the agent's output
result = evaluation.run(
    input="I need help with my account",
    output=str(response.content),
    print_results=True,
    print_summary=True,
)

# Report the binary verdict of the first (and only) evaluation result
print(f"Result: {'PASSED' if result.results[0].passed else 'FAILED'}")
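
The last line of the example reads a single verdict from result.results[0].passed. If you evaluate several outputs, you may want to tally the verdicts. The following is a minimal sketch of that tally, not part of the library: summarize_binary and verdict are hypothetical helper names, and the only assumption about the eval results is the one the example already relies on, namely that each item exposes a boolean passed attribute. Stand-in objects are used so the sketch runs without an API key.

```python
from types import SimpleNamespace
from typing import Iterable, Tuple


def summarize_binary(results: Iterable) -> Tuple[int, int]:
    """Return (passed, failed) counts for items with a boolean `passed` attribute."""
    items = list(results)  # materialize once so generators are also supported
    passed = sum(1 for item in items if item.passed)
    return passed, len(items) - passed


def verdict(results: Iterable) -> str:
    """Overall verdict: PASSED only if there is at least one result and none failed."""
    passed, failed = summarize_binary(results)
    return "PASSED" if passed > 0 and failed == 0 else "FAILED"


# Stand-in results (hypothetical); in the example above you would pass result.results
demo_results = [SimpleNamespace(passed=True), SimpleNamespace(passed=False)]
print(summarize_binary(demo_results))
print(verdict(demo_results))
```

Treating an empty result list as FAILED is a design choice in this sketch: an evaluation that produced no verdict should not be reported as a pass.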

Set up your virtual environment

macOS/Linux:
uv venv --python 3.12
source .venv/bin/activate

Windows:
uv venv --python 3.12
.venv\Scripts\activate

Install dependencies

uv pip install -U kern-ai openai

Export your OpenAI API key

macOS/Linux:
export OPENAI_API_KEY="your_openai_api_key_here"

Windows (PowerShell):
$Env:OPENAI_API_KEY="your_openai_api_key_here"
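
If the key is missing, the failure may only surface when the model is first called. A small check at the top of the script can report the problem earlier. This is a sketch using only the Python standard library; require_env is a hypothetical helper name, not something the library provides.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the example")
    return value


# Usage (at the top of agent_as_judge_binary.py):
# require_env("OPENAI_API_KEY")
```

Because the variable is read from the environment of the running process, it needs to be exported in the same shell session that runs the example.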

Run the example

python agent_as_judge_binary.py