Agent as Judge Evals
Agent as Judge evals measure custom quality criteria for your Agents and Teams using LLM-as-a-judge methodology.
Agent as Judge evaluations let you define custom quality criteria and use an LLM to score your Agent's responses. You provide evaluation criteria (like "professional tone", "factual accuracy", or "user-friendliness"), and an evaluator model assesses how well the Agent's output meets those standards.
Basic Example
In this example, the AgentAsJudgeEval will evaluate the output of the Agent with their input, providing a score of the Agent's response according to the custom criteria provided.
1from kern.agent import Agent2from kern.db.sqlite import SqliteDb3from kern.eval.agent_as_judge import AgentAsJudgeEval4from kern.models.openai import OpenAIResponses56# Setup database to persist eval results7db = SqliteDb(db_file="tmp/agent_as_judge_basic.db")89agent = Agent(10 model=OpenAIResponses(id="gpt-5.2"),11 instructions="You are a technical writer. Explain concepts clearly and concisely.",12 db=db,13)1415response = agent.run("Explain what an API is")1617evaluation = AgentAsJudgeEval(18 name="Explanation Quality",19 criteria="Explanation should be clear, beginner-friendly, and use simple language",20 scoring_strategy="numeric", # Score 1-1021 threshold=7, # Pass if score >= 722 db=db,23)2425result = evaluation.run(26 input="Explain what an API is",27 output=str(response.content),28 print_results=True,29)Custom Evaluator Agent
You can use a custom agent to evaluate responses with specific instructions:
1from kern.agent import Agent2from kern.eval.agent_as_judge import AgentAsJudgeEval3from kern.models.openai import OpenAIResponses45agent = Agent(6 model=OpenAIResponses(id="gpt-5.2"),7 instructions="Explain technical concepts simply.",8)910response = agent.run("Explain what an API is")1112# Create a custom evaluator with specific instructions13custom_evaluator = Agent(14 model=OpenAIResponses(id="gpt-5.2"),15 description="Strict technical evaluator",16 instructions="You are a strict evaluator. Only pass exceptionally clear and accurate explanations.",17)1819evaluation = AgentAsJudgeEval(20 name="Technical Accuracy",21 criteria="Explanation must be technically accurate and comprehensive",22 evaluator_agent=custom_evaluator,23)2425result = evaluation.run(26 input="Explain what an API is",27 output=str(response.content),28 print_results=True,29 print_summary=True,30)
Params
| Parameter | Type | Default | Description |
|---|---|---|---|
criteria | str | "" | The evaluation criteria describing what makes a good response (required). |
scoring_strategy | Literal["numeric", "binary"] | "binary" | Scoring mode: "numeric" (1-10 scale) or "binary" (pass/fail). |
threshold | int | 7 | Minimum score to pass (only used for numeric strategy). |
on_fail | Optional[Callable] | None | Callback function triggered when evaluation fails. |
additional_guidelines | Optional[Union[str, List[str]]] | None | Extra evaluation guidelines beyond the main criteria. |
name | Optional[str] | None | Name for the evaluation. |
model | Optional[Model] | None | Model to use for judging (defaults to gpt-5-mini if not provided). |
evaluator_agent | Optional[Agent] | None | Custom agent to use as evaluator. |
print_summary | bool | False | Print summary of evaluation results. |
print_results | bool | False | Print detailed evaluation results. |
file_path_to_save_results | Optional[str] | None | File path to save evaluation results. |
debug_mode | bool | False | Enable debug mode for detailed logging. |
db | Optional[Union[BaseDb, AsyncBaseDb]] | None | Database to store evaluation results. |
telemetry | bool | True | Enable telemetry. |
run_in_background | bool | False | Run evaluation as background task (non-blocking). |
Methods
run() / arun()
Run the evaluation synchronously (run()) or asynchronously (arun()).
| Parameter | Type | Default | Description |
|---|---|---|---|
input | Optional[str] | None | Input text for single evaluation. |
output | Optional[str] | None | Output text for single evaluation. |
cases | Optional[List[Dict[str, str]]] | None | List of input/output pairs for batch evaluation. |
print_summary | bool | False | Print summary of evaluation results. |
print_results | bool | False | Print detailed evaluation results. |
Provide either (input, output) for single evaluation OR cases for batch evaluation, not both.
Examples
Basic Agent as Judge
Basic usage with numeric scoring and failure callbacks
Agent as Judge as Post-Hook
Automatic evaluation after agent runs