All articles
Observability

Observability for AI agents

Design patterns for tracing non-deterministic agent workflows, and the metrics SREs should monitor.

March 30, 20268 min read
Observability for AI agents

Standard application logging (such as structured JSON logs or text printouts) is insufficient for diagnosing issues in agentic systems. Because agents operate in loops and call external tools based on non-deterministic model outputs, linear logs cannot answer key questions during outages. To debug a failing agent, you must trace the exact sequence of tool calls, prompt iterations, and repair attempts that occurred during the execution.

The Correct Span Shape for Agents

Traditional tracing treats each HTTP request as a single span. For agents, we must design a hierarchical span tree that maps to the execution graph. Every agent invocation represents a root span, with nested child spans for each step in the workflow. Key attributes that should be attached to each span include:

  • Root Span: Invocation parameters, system prompt, and final output status.
  • Model Spans: Model name, tokens used (prompt and completion), finish reason, and temperature.
  • Tool Spans: Tool name, input arguments, raw string output, and execution latency.
  • Repair Spans: The syntax error message, repair strategy used, and number of iterations.
Observability dashboard close-up with cyan and violet trend lines
Observability dashboard close-up with cyan and violet trend lines

Native OpenTelemetry Tracing

Integrating observability directly into the agent framework boundary ensures every span is captured automatically. By exporting trace data via OpenTelemetry, you can visualize execution graphs and pinpoint bottlenecks inside standard APM tools (like Datadog, Honeycomb, or Jaeger).

python
# Wire OpenTelemetry tracing to the framework client
from kern.obs import trace_agent
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("run_agent") as span:
    # Captures nested spans for LLM calls and tool execution automatically
    result = my_agent.run("Process billing query")

Metrics that Matter

While monitoring latency and token costs is important, the metrics that indicate system drift are repair rates, tool retry frequencies, and fallback escalations. An increase in repair rates usually indicates a change in the model's behavior or a data shift in user inputs. By setting alerts on these system-level metrics, you can identify and resolve prompts or schema issues before they impact end-users.