Standard application logging (such as structured JSON logs or text printouts) is insufficient for diagnosing issues in agentic systems. Because agents operate in loops and call external tools based on non-deterministic model outputs, linear logs cannot answer key questions during outages. To debug a failing agent, you must trace the exact sequence of tool calls, prompt iterations, and repair attempts that occurred during the execution.
The Correct Span Shape for Agents
Traditional tracing treats each HTTP request as a single span. For agents, we must design a hierarchical span tree that maps to the execution graph. Every agent invocation represents a root span, with nested child spans for each step in the workflow. Key attributes that should be attached to each span include:
- Root Span: Invocation parameters, system prompt, and final output status.
- Model Spans: Model name, tokens used (prompt and completion), finish reason, and temperature.
- Tool Spans: Tool name, input arguments, raw string output, and execution latency.
- Repair Spans: The syntax error message, repair strategy used, and number of iterations.

Native OpenTelemetry Tracing
Integrating observability directly into the agent framework boundary ensures every span is captured automatically. By exporting trace data via OpenTelemetry, you can visualize execution graphs and pinpoint bottlenecks inside standard APM tools (like Datadog, Honeycomb, or Jaeger).
# Wire OpenTelemetry tracing to the framework client
from kern.obs import trace_agent
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("run_agent") as span:
# Captures nested spans for LLM calls and tool execution automatically
result = my_agent.run("Process billing query")Metrics that Matter
While monitoring latency and token costs is important, the metrics that indicate system drift are repair rates, tool retry frequencies, and fallback escalations. An increase in repair rates usually indicates a change in the model's behavior or a data shift in user inputs. By setting alerts on these system-level metrics, you can identify and resolve prompts or schema issues before they impact end-users.
