Agent as Judge Evals

Agent as Judge evals measure custom quality criteria for your Agents and Teams using LLM-as-a-judge methodology.

With Agent as Judge evaluations, you provide the criteria (such as "professional tone", "factual accuracy", or "user-friendliness"), and an evaluator model assesses how well the Agent's output meets them, returning either a numeric score or a pass/fail verdict.
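To make the LLM-as-a-judge idea concrete, the sketch below shows one way a judge prompt could be assembled from the criteria, the input, and the output. It is purely illustrative: `build_judge_prompt` is a hypothetical helper, not part of kern, and the prompt kern actually sends to the evaluator model may look quite different.

```python
from typing import List, Optional, Union


def build_judge_prompt(
    criteria: str,
    agent_input: str,
    agent_output: str,
    additional_guidelines: Optional[Union[str, List[str]]] = None,
) -> str:
    """Assemble a plain-text prompt asking a judge model to grade one response.

    Hypothetical helper for illustration only; not kern's internal prompt.
    """
    # Accept a single guideline string or a list of them.
    if additional_guidelines is None:
        guidelines: List[str] = []
    elif isinstance(additional_guidelines, str):
        guidelines = [additional_guidelines]
    else:
        guidelines = list(additional_guidelines)

    sections = [
        "You are evaluating an AI agent's response.",
        f"Criteria: {criteria}",
    ]
    if guidelines:
        sections.append("Guidelines:\n" + "\n".join(f"- {g}" for g in guidelines))
    sections.append(f"Input:\n{agent_input}")
    sections.append(f"Output:\n{agent_output}")
    sections.append("Score the output from 1 to 10 and explain your reasoning.")
    return "\n\n".join(sections)


prompt = build_judge_prompt(
    criteria="Explanation should be clear and beginner-friendly",
    agent_input="Explain what an API is",
    agent_output="An API is a set of rules that lets programs talk to each other.",
    additional_guidelines="Penalize unexplained jargon",
)
print(prompt)
```

The evaluator sees the criteria, any extra guidelines, and the input/output pair together, so the quality of the verdict depends heavily on how specific the criteria are.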

Basic Example

In this example, AgentAsJudgeEval takes the Agent's input and output and scores the response against the custom criteria on a 1-10 scale. A score of 7 or higher passes.

from kern.agent import Agent
from kern.db.sqlite import SqliteDb
from kern.eval.agent_as_judge import AgentAsJudgeEval
from kern.models.openai import OpenAIResponses

# Setup database to persist eval results
db = SqliteDb(db_file="tmp/agent_as_judge_basic.db")

# The agent whose response will be evaluated
agent = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    instructions="You are a technical writer. Explain concepts clearly and concisely.",
    db=db,
)

response = agent.run("Explain what an API is")

evaluation = AgentAsJudgeEval(
    name="Explanation Quality",
    criteria="Explanation should be clear, beginner-friendly, and use simple language",
    scoring_strategy="numeric",  # Score 1-10
    threshold=7,  # Pass if score >= 7
    db=db,
)

# Judge the agent's output against the criteria
result = evaluation.run(
    input="Explain what an API is",
    output=str(response.content),
    print_results=True,
)

Custom Evaluator Agent

You can use a custom agent to evaluate responses with specific instructions:

from kern.agent import Agent
from kern.eval.agent_as_judge import AgentAsJudgeEval
from kern.models.openai import OpenAIResponses

# The agent whose response will be evaluated
agent = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    instructions="Explain technical concepts simply.",
)

response = agent.run("Explain what an API is")

# Create a custom evaluator with specific instructions
custom_evaluator = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    description="Strict technical evaluator",
    instructions="You are a strict evaluator. Only pass exceptionally clear and accurate explanations.",
)

evaluation = AgentAsJudgeEval(
    name="Technical Accuracy",
    criteria="Explanation must be technically accurate and comprehensive",
    evaluator_agent=custom_evaluator,
)

result = evaluation.run(
    input="Explain what an API is",
    output=str(response.content),
    print_results=True,
    print_summary=True,
)

Params

Parameter	Type	Default	Description
`criteria`	`str`	`""`	The evaluation criteria describing what makes a good response (required).
`scoring_strategy`	`Literal["numeric", "binary"]`	`"binary"`	Scoring mode: `"numeric"` (1-10 scale) or `"binary"` (pass/fail).
`threshold`	`int`	`7`	Minimum score to pass (only used for numeric strategy).
`on_fail`	`Optional[Callable]`	`None`	Callback function triggered when evaluation fails.
`additional_guidelines`	`Optional[Union[str, List[str]]]`	`None`	Extra evaluation guidelines beyond the main criteria.
`name`	`Optional[str]`	`None`	Name for the evaluation.
`model`	`Optional[Model]`	`None`	Model to use for judging (defaults to gpt-5-mini if not provided).
`evaluator_agent`	`Optional[Agent]`	`None`	Custom agent to use as evaluator.
`print_summary`	`bool`	`False`	Print summary of evaluation results.
`print_results`	`bool`	`False`	Print detailed evaluation results.
`file_path_to_save_results`	`Optional[str]`	`None`	File path to save evaluation results.
`debug_mode`	`bool`	`False`	Enable debug mode for detailed logging.
`db`	`Optional[Union[BaseDb, AsyncBaseDb]]`	`None`	Database to store evaluation results.
`telemetry`	`bool`	`True`	Enable telemetry.
`run_in_background`	`bool`	`False`	Run evaluation as background task (non-blocking).
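The relationship between `scoring_strategy` and `threshold` can be sketched in plain Python. This is an illustration of the rule described in the table (numeric scores on a 1-10 scale pass when they reach the threshold, while binary verdicts ignore it), not kern's implementation. The name `is_passing` and the choice to reject out-of-range scores are assumptions made for the example.

```python
from typing import Literal, Union


def is_passing(
    verdict: Union[int, bool],
    scoring_strategy: Literal["numeric", "binary"] = "binary",
    threshold: int = 7,
) -> bool:
    """Return True if a judge verdict counts as a pass.

    Hypothetical helper mirroring the documented rule, not kern's code.
    """
    if scoring_strategy == "binary":
        # Binary mode: the judge already returned pass/fail; threshold is unused.
        return bool(verdict)
    if scoring_strategy == "numeric":
        # Booleans are ints in Python, so reject them explicitly in numeric mode.
        if isinstance(verdict, bool) or not 1 <= verdict <= 10:
            raise ValueError("numeric scores must be integers from 1 to 10")
        return verdict >= threshold
    raise ValueError(f"unknown scoring_strategy: {scoring_strategy!r}")


print(is_passing(8, "numeric"))            # score 8 against the default threshold of 7
print(is_passing(8, "numeric", threshold=9))
print(is_passing(False))                   # binary verdict, threshold ignored
```

Raising the threshold makes a numeric evaluation stricter without changing the criteria text, which is useful when the same criteria are reused across environments.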

Methods

run() / arun()

Run the evaluation synchronously (run()) or asynchronously (arun()).

Parameter	Type	Default	Description
`input`	`Optional[str]`	`None`	Input text for single evaluation.
`output`	`Optional[str]`	`None`	Output text for single evaluation.
`cases`	`Optional[List[Dict[str, str]]]`	`None`	List of input/output pairs for batch evaluation.
`print_summary`	`bool`	`False`	Print summary of evaluation results.
`print_results`	`bool`	`False`	Print detailed evaluation results.

Note

Provide either (input, output) for single evaluation OR cases for batch evaluation, not both.
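The rule in the note can be expressed as a small normalization step. The sketch below is a hypothetical helper (`normalize_eval_inputs` is not a kern function) that turns either calling style into a list of cases. The `"input"` and `"output"` dictionary keys are an assumption, since the table only describes `cases` as a list of input/output pairs.

```python
from typing import Dict, List, Optional


def normalize_eval_inputs(
    input: Optional[str] = None,
    output: Optional[str] = None,
    cases: Optional[List[Dict[str, str]]] = None,
) -> List[Dict[str, str]]:
    """Return a list of cases from either a single pair or a batch.

    Hypothetical helper; the "input"/"output" keys are an assumed case shape.
    """
    has_single = input is not None or output is not None
    if has_single and cases is not None:
        raise ValueError("Provide either (input, output) or cases, not both")
    if cases is not None:
        # Batch mode: copy the list so the caller's list is not mutated later.
        return list(cases)
    if input is None or output is None:
        raise ValueError("Single evaluation needs both input and output")
    # Single mode: wrap the pair as a one-element batch.
    return [{"input": input, "output": output}]


single = normalize_eval_inputs(input="Explain what an API is", output="An API is ...")
batch = normalize_eval_inputs(
    cases=[
        {"input": "What is REST?", "output": "REST is an architectural style ..."},
        {"input": "What is JSON?", "output": "JSON is a data format ..."},
    ]
)
print(len(single), len(batch))
```

Treating a single pair as a one-element batch keeps the downstream scoring loop identical for both calling styles.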

Examples

Basic Agent as Judge

Basic usage with numeric scoring and failure callbacks

Agent as Judge as Post-Hook

Automatic evaluation after agent runs