Observability Guide¶

What you'll learn¶

OTEL distributed tracing for agent runs
Prometheus metrics and Grafana dashboards
The append-only event log for audit and replay
Behavior monitoring and anomaly detection
Setting up Jaeger locally with Docker Compose

Architecture overview¶

graph LR
    Agent["Grampus Agent"] --> Tracer["GrampusTracer\n(OTEL SDK)"]
    Agent --> Metrics["GrampusMetrics\n(Prometheus)"]
    Agent --> EventLog["EventLog\n(append-only)"]
    Agent --> Monitor["BehaviorMonitor\n(anomaly detection)"]

    Tracer --> OTELCol["OTEL Collector"]
    OTELCol --> Jaeger["Jaeger UI\nlocalhost:16686"]

    Metrics --> Prom["Prometheus\nlocalhost:9090"]
    Prom --> Grafana["Grafana\nlocalhost:3000"]

    EventLog --> Dapr["Dapr State Store\n(PostgreSQL)"]
    Monitor --> PubSub["Dapr Pub/Sub\n(cost/tool events)"]

OTEL tracing¶

Setup¶

from grampus.observability.tracer import GrampusTracer

tracer = GrampusTracer(
    service_name="research-agent",
    otel_endpoint="http://localhost:4317",   # OTEL Collector gRPC
    enabled=True,
)

Span types¶

Every significant agent action produces a span. All spans for a session are nested under that session's root agent.run span:

Span type	Triggered by	Key attributes
`agent.run`	`AgentRunner.run()`	`agent.name`, `agent.model`, `session.id`
`agent.llm_call`	Each LLM API request	`model`, `input_tokens`, `output_tokens`, `cost_usd`, `stop_reason`
`agent.tool_call`	Each tool execution	`tool.name`, `tool.duration_ms`, `tool.success`
`agent.memory_read`	`MemoryManager.recall()`	`memory.type`, `memory.query`, `memory.results_count`
`agent.memory_write`	`MemoryManager.remember()`	`memory.type`, `memory.source_type`, `memory.trust_level`
`agent.decision`	End of each ReAct iteration	`agent.step`, `decision.action` (tool_call vs final_answer)

Manual spans (for custom code)¶

with tracer.span("agent.custom_step", attributes={"step.name": "validate_input"}):
    validated = validate_user_input(user_input)

Viewing traces in Jaeger¶

Navigate to http://localhost:16686. Select service research-agent. Each agent run appears as a root span agent.run with nested child spans:

agent.run  (session-42, 2.3s)
├── agent.memory_read  (0.05s)  query="capital of Brazil"
├── agent.llm_call     (0.8s)   model=claude-sonnet-4-6, tokens=312, cost=$0.0002
├── agent.tool_call    (0.3s)   tool=web_search
├── agent.memory_write (0.02s)  type=episodic
└── agent.llm_call     (0.9s)   model=claude-sonnet-4-6, tokens=489, cost=$0.0003

Prometheus metrics¶

Expose the metrics endpoint¶

from grampus.observability.metrics import GrampusMetrics

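# Port for the /metrics server. Use a different port if Prometheus itself
# already listens on 9090 on the same host.
metrics = GrampusMetrics(port=9090)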
await metrics.start()   # starts /metrics HTTP server

Available metrics¶

Counters (cumulative, ever-increasing):

Metric	Labels	Description
`nexus_tokens_total`	`model`, `agent_name`, `token_type`	Total tokens consumed
`nexus_cost_usd_total`	`model`, `agent_name`	Total USD spent
`nexus_tool_calls_total`	`tool_name`, `agent_name`, `status`	Tool executions
`nexus_errors_total`	`error_code`, `agent_name`	Errors by type
`nexus_agent_runs_total`	`agent_name`, `status`	Agent run completions

Gauges (current snapshot):

Metric	Labels	Description
`grampus_active_agents`	`agent_name`	Currently running agents

Histograms (latency distributions):

Metric	Labels	Description
`grampus_llm_latency_seconds`	`model`, `agent_name`	LLM call duration
`grampus_tool_latency_seconds`	`tool_name`, `agent_name`	Tool execution duration
`nexus_agent_run_duration_seconds`	`agent_name`	Full agent run duration

Sample Grafana queries¶

# Average LLM latency per model (last 5 minutes)
sum by (model) (rate(grampus_llm_latency_seconds_sum[5m]))
  / sum by (model) (rate(grampus_llm_latency_seconds_count[5m]))

# Token cost rate (USD per hour)
rate(nexus_cost_usd_total[1h]) * 3600

# Tool error rate
rate(nexus_errors_total{error_code=~"tool.*"}[5m])

# P99 agent run duration per agent
histogram_quantile(0.99, sum by (le, agent_name) (rate(nexus_agent_run_duration_seconds_bucket[5m])))
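
To make the P99 query less of a black box, the sketch below reimplements the core idea behind `histogram_quantile` for classic histograms: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified approximation, not Prometheus's actual code, and the bucket bounds and counts are made-up illustrative values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with a math.inf bucket,
    mirroring the `le` labels of a Prometheus histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = cumulative - prev_count
            if in_bucket == 0:
                return upper
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, cumulative
    return prev_bound


# Illustrative cumulative counts for buckets le=0.5, 1, 2, 5, +Inf (seconds).
run_duration_buckets = [(0.5, 10), (1.0, 60), (2.0, 90), (5.0, 99), (math.inf, 100)]

p50 = histogram_quantile(0.50, run_duration_buckets)
p99 = histogram_quantile(0.99, run_duration_buckets)
print(f"P50 ~ {p50:.2f}s, P99 ~ {p99:.2f}s")
```

The estimate can never be more precise than the bucket boundaries, so a P99 that sits exactly on a bucket bound usually means the buckets are too coarse for that latency range.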

Event log¶

The event log captures every agent action as an immutable, append-only record. It provides full audit trails and supports forensic debugging ("why did the agent do X at step 5?").

from grampus.observability.events import EventLog

event_log = EventLog(state_store=state_store)

# Query events for a session
events = await event_log.get_events(session_id="session-42")
for event in events:
    print(f"[{event.timestamp}] {event.event_type}: {event.summary}")

Event types¶

Event type	Triggered by
`agent.started`	`AgentRunner.run()` called
`agent.completed`	Successful `ExecutionResult` returned
`agent.failed`	Unhandled exception in runner
`llm.called`	LLM API request sent
`llm.responded`	LLM API response received
`tool.called`	`ToolExecutor.execute()` called
`tool.completed`	Tool returned result
`tool.failed`	Tool raised exception
`memory.read`	`MemoryManager.recall()` called
`memory.written`	`MemoryManager.remember()` called
`safety.violation`	Safety check detected an issue

Replay events¶

Events can be replayed to reconstruct the exact state at any point in a run:

# Reconstruct state at step 3
state_at_step_3 = await event_log.replay_to_step(
    session_id="session-42",
    step=3,
)
print(f"Messages at step 3: {len(state_at_step_3.messages)}")
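
Conceptually, replay is a fold over the ordered events up to the requested step. The sketch below shows that idea with an in-memory log; `ToyEvent`, `ToyState`, and `ToyEventLog` are hypothetical types, the payload fields are assumptions, and the real EventLog (backed by the state store) may work differently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyEvent:
    """Immutable record; frozen so an appended event cannot be modified."""
    step: int
    event_type: str
    payload: dict


@dataclass
class ToyState:
    messages: list = field(default_factory=list)
    tool_results: list = field(default_factory=list)


class ToyEventLog:
    """Append-only log: events can be added and read, never edited or removed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def replay_to_step(self, step):
        """Rebuild state by applying every event up to and including `step`."""
        state = ToyState()
        for event in self._events:
            if event.step > step:
                break  # events are appended in step order
            if event.event_type == "llm.responded":
                state.messages.append(event.payload["text"])
            elif event.event_type == "tool.completed":
                state.tool_results.append(event.payload["result"])
        return state


log = ToyEventLog()
log.append(ToyEvent(1, "llm.responded", {"text": "I should search the web."}))
log.append(ToyEvent(2, "tool.completed", {"result": "Brasília"}))
log.append(ToyEvent(3, "llm.responded", {"text": "The capital is Brasília."}))
log.append(ToyEvent(4, "llm.responded", {"text": "Anything else?"}))

state_at_step_3 = log.replay_to_step(3)
print(f"Messages at step 3: {len(state_at_step_3.messages)}")
```

Since the log is never mutated, replaying to the same step always yields the same state, which is what makes it usable for audits.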

Behavior monitor¶

The BehaviorMonitor tracks agent behavior patterns over time and alerts on anomalies.

from grampus.observability.behavior import BehaviorMonitor

monitor = BehaviorMonitor(
    agent_name="research-agent",
    window_hours=24,         # analyze behavior over last 24 hours
    alert_threshold=2.5,     # alert if metric exceeds 2.5× rolling average
)

Monitored patterns¶

Pattern	What triggers an alert
Tool usage shift	A tool is called at least 3× more or 3× less often than its baseline
Cost spike	Cost per run exceeds 2.5× rolling average
Memory access anomaly	Memory reads from unusual sources
Error rate spike	Error rate exceeds 2.5× baseline
Latency spike	P95 latency exceeds 2.5× baseline
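
The multiplier-based rules in the table boil down to comparing a current value against a rolling average. The following is a minimal sketch of that comparison, assuming a simple arithmetic mean over the window; `exceeds_baseline` is a hypothetical helper, the cost figures are illustrative, and BehaviorMonitor may compute its baseline differently.

```python
from collections import deque
from statistics import mean


def exceeds_baseline(window, current, threshold=2.5):
    """Return (is_anomaly, baseline) for `current` against the window's mean."""
    if not window:
        return False, 0.0  # no history yet, so nothing to compare against
    baseline = mean(window)
    return current > threshold * baseline, baseline


# Illustrative per-run costs in USD; maxlen drops the oldest runs automatically.
recent_costs = deque([0.020, 0.030, 0.025, 0.025], maxlen=100)

for cost in (0.05, 0.07):
    is_anomaly, baseline = exceeds_baseline(recent_costs, cost)
    print(f"cost={cost:.3f} baseline={baseline:.3f} anomaly={is_anomaly}")
```

With a baseline of about $0.025 per run, the alert line sits at roughly $0.0625, so a $0.05 run passes and a $0.07 run is flagged.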

Checking for anomalies¶

anomalies = await monitor.detect_anomalies(session_id="session-42")
for anomaly in anomalies:
    print(f"  [{anomaly.severity}] {anomaly.pattern}: {anomaly.description}")
    print(f"  Current: {anomaly.current_value:.2f}, Baseline: {anomaly.baseline_value:.2f}")

Local Jaeger setup¶

Add to docker-compose.yml:

services:
  jaeger:
    image: jaegertracing/all-in-one:1.62
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTEL gRPC
      - "4318:4318"     # OTEL HTTP
    environment:
      COLLECTOR_OTLP_ENABLED: "true"

Then configure Grampus to send traces:

# grampus.yaml
observability:
  otel_enabled: true
  otel_endpoint: http://localhost:4317
  log_level: INFO
  metrics_enabled: true

Start everything:

docker compose up -d

Open the Jaeger UI at http://localhost:16686.
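
If the UI does not load or no traces appear, a quick first check is whether the published ports accept TCP connections. The helper below is a hypothetical convenience script, not part of Grampus; it only confirms that something is listening and does not validate the OTLP protocol.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports published by the Jaeger service in the compose file above.
    for name, port in [("Jaeger UI", 16686), ("OTLP gRPC", 4317), ("OTLP HTTP", 4318)]:
        status = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} (localhost:{port}): {status}")
```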

Grafana dashboard¶

A pre-built 14-panel Grafana dashboard is included at grafana/dashboards/grampus-overview.json. It auto-provisions when you start the Grafana stack:

docker compose -f grafana/docker-compose.grafana.yml up -d
# Open http://localhost:3000  (default login: admin / admin)
# Dashboard auto-provisions under the "Grampus" folder

Dashboard panels¶

Panel	Type	Description
Agent throughput	Time series	Runs completed per minute
LLM call rate	Time series	Model API calls per minute
P50 LLM latency	Stat	50th percentile LLM response time
P95 LLM latency	Stat	95th percentile LLM response time
P99 LLM latency	Stat	99th percentile LLM response time
LLM latency histogram	Heatmap	Full latency distribution over time
Cost per model	Bar chart	Cumulative USD spend broken down by model
Active agents	Gauge	Current count of running agent sessions
Error rate	Time series	Errors per minute by error type
Tool call rate	Time series	Tool executions per minute
Tool success rate	Stat	Percentage of tool calls that succeeded
Tokens per run (avg)	Stat	Average total tokens per agent run
Session cost (avg)	Stat	Average USD per session
Top tools by call count	Table	Most-used tools ranked by invocation count

Template variables¶

The dashboard includes two template variables:

$datasource — switch between Prometheus instances
$agent_id — filter all panels to a specific agent ID (empty = all agents)

Prometheus metrics endpoint¶

The Grampus server exposes a Prometheus-compatible metrics endpoint at GET /metrics. Point your Prometheus scrape config at it:

# prometheus.yml
scrape_configs:
  - job_name: grampus
    static_configs:
      - targets: ["localhost:8000"]
    metrics_path: /metrics
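
What Prometheus scrapes from this endpoint is plain text in the Prometheus exposition format. The sketch below parses a small sample of that format into name, labels, and value; the sample lines and their values are illustrative rather than captured from a real server, and the parser deliberately ignores optional timestamps and other edge cases.

```python
import re

# metric_name{label="value",...} 123.4   (the label block is optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative payload, shaped like a GET /metrics response.
sample_text = """\
# HELP grampus_active_agents Number of currently running agent sessions
# TYPE grampus_active_agents gauge
grampus_active_agents 3
nexus_llm_calls_total{agent_id="research-agent",model="claude-sonnet-4-6"} 42
"""

for name, labels, value in parse_metrics(sample_text):
    print(name, labels, value)
```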

Available metrics¶

Counters (ever-increasing, reset on restart):

Metric	Labels	Description
`nexus_llm_calls_total`	`agent_id`, `model`	Total LLM API calls made
`nexus_tool_calls_total`	`agent_id`, `tool_name`	Total tool executions
`nexus_cost_usd_total`	`agent_id`, `model`	Cumulative USD spend
`nexus_errors_total`	`agent_id`	Total errors by agent
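
Because these counters restart from zero when the process restarts, raw values should not be subtracted directly. PromQL's `rate()` and `increase()` compensate by treating any drop in value as a reset. The sketch below shows that core rule on a list of scraped samples; it is a simplification that leaves out the extrapolation to window boundaries that Prometheus also performs, and the sample values are illustrative.

```python
def counter_increase(samples):
    """Total increase across samples, treating any drop as a counter reset."""
    increase = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # The counter restarted from zero, so the whole new value is growth.
            increase += curr
    return increase


# Illustrative scrapes of a counter; the drop from 150 to 5 is a restart.
scrapes = [100, 130, 150, 5, 25]
print(counter_increase(scrapes))
```

A naive `last - first` on the same samples would give -75, which is why dashboards should wrap counters in `rate()` or `increase()` rather than plotting them raw.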

Gauges (current snapshot):

Metric	Labels	Description
`grampus_active_agents`	—	Number of currently running agent sessions

Histograms (latency distributions with _bucket, _sum, _count suffixes):

Metric	Labels	Description
`grampus_llm_latency_seconds`	`agent_id`, `model`	LLM API call duration
`grampus_tool_latency_seconds`	`agent_id`, `tool_name`	Tool execution duration

Next steps¶

Observability API reference → — Full GrampusTracer and GrampusMetrics reference
Evaluation guide → — Correlate eval results with traces
Deployment guide → — Configure OTEL for Kubernetes