Architecture Decision Records

ADRs capture the reasoning behind major design decisions. Each decision is permanent unless explicitly superseded.

ADR-001: Dapr as Infrastructure Backbone

Status: Accepted

Context: Need durable execution, state management, pub/sub, security, and observability without building from scratch. Building these correctly (with mTLS, ETag concurrency, distributed locks, workflow checkpointing) would require 6–12 months of distributed systems engineering.

Decision: Use Dapr runtime as the infrastructure layer. Agent code talks only to Dapr APIs, never directly to databases or message brokers.

Consequences:
- Requires a Dapr sidecar running alongside the agent process
- Adds operational complexity in Kubernetes (two containers per pod)
- All infrastructure is swappable via Dapr component YAML: no code changes are needed to switch from Redis to Kafka for pub/sub
- mTLS between services is zero-configuration
- Dapr emits OTEL traces automatically for all state and messaging operations

ADR-002: PostgreSQL + pgvector as Primary State Store

Status: Accepted

Context: Need ACID-compliant, auditable storage for memory with vector search capability. Running separate databases for relational data (facts, events) and vector data (embeddings) increases operational complexity.

Decision: PostgreSQL 16 with pgvector extension as primary state store. Redis as cache layer for working memory and tool result caching.

Consequences:
- Single database for relational + vector data reduces operational complexity
- pgvectorscale benchmarks show competitive performance (471 QPS at 99% recall on 50M vectors)
- Graph queries use recursive CTEs in PostgreSQL rather than a separate graph database (revisit if proven insufficient)
- ACID transactions enable reliable provenance and audit trails without saga complexity
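The recursive-CTE approach to graph queries can be sketched as follows. This is a minimal illustration, not the project's schema: the fact_edges table and the reachable_from helper are hypothetical, and SQLite (from the Python standard library) stands in for PostgreSQL because both accept the standard WITH RECURSIVE syntax.

```python
import sqlite3

# Hypothetical edge table. SQLite is an assumption made so the sketch runs
# without a database server; the CTE itself is standard SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_edges (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO fact_edges VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")],
)

REACHABLE_SQL = """
WITH RECURSIVE reachable(node, depth) AS (
    SELECT ?, 0
    UNION
    SELECT e.dst, r.depth + 1
    FROM fact_edges AS e
    JOIN reachable AS r ON e.src = r.node
    WHERE r.depth < ?
)
SELECT node, MIN(depth) FROM reachable GROUP BY node ORDER BY MIN(depth), node
"""


def reachable_from(start: str, max_depth: int = 5) -> list:
    """Return (node, shortest_depth) pairs reachable from start.

    The depth bound keeps the recursion finite even if the graph has cycles.
    """
    return conn.execute(REACHABLE_SQL, (start, max_depth)).fetchall()
```

Calling reachable_from("a") walks the chain a, b, c, d and ignores the disconnected x, y edge. The depth bound is one simple guard against cycles; a production query might track visited paths instead.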

ADR-003: Pydantic v2 for All Data Models

Status: Accepted

Context: Need strict validation, serialization, and schema generation for agent definitions, tool parameters, memory records, and API contracts. Inconsistent validation across the codebase is a primary source of subtle bugs in agent frameworks.

Decision: All public types are Pydantic v2 BaseModel subclasses. Configuration uses pydantic-settings.

Consequences:
- Strict mode catches construction-time bugs (wrong types, missing required fields)
- JSON schema generation from ToolDefinition.to_function_schema() enables automatic tool documentation for LLMs
- Serialization is consistent: one code path for storage, HTTP, and logging
- Adds ~100ms import time (negligible for agent workflows where LLM calls dominate)

ADR-004: Async-First Architecture

Status: Accepted

Context: Agents are I/O-bound: LLM API calls (1–30s), tool execution (10ms–30s), database operations (1–50ms). Synchronous I/O would block the calling thread (or the event loop, if one is running) for the duration of every call, preventing concurrency.

Decision: All I/O operations use async/await. httpx for HTTP, asyncpg for PostgreSQL, aioredis for Redis. Synchronous tool functions run in thread pool executors.

Consequences:
- Entire call chain must be async; synchronous callers use asyncio.run()
- Test fixtures require pytest-asyncio with asyncio_mode = "auto"
- Enables serving many concurrent agent runs in a single process
- CPU-bound tool execution (e.g., image processing) runs in loop.run_in_executor()
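The thread-pool offloading described in ADR-004 can be sketched with the standard library alone. The names slow_sync_tool, call_tool, run_many, and run_sync are hypothetical and exist only for this illustration; the real runner's API is not shown in this document.

```python
import asyncio
import time


def slow_sync_tool(x: int) -> int:
    """Hypothetical synchronous tool that blocks its thread briefly."""
    time.sleep(0.1)
    return x * 2


async def call_tool(x: int) -> int:
    # Offload the blocking function to the default thread pool so the
    # event loop stays free to schedule other coroutines.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, slow_sync_tool, x)


async def run_many(inputs: list) -> list:
    # gather preserves input order in its result list.
    return await asyncio.gather(*(call_tool(x) for x in inputs))


def run_sync(inputs: list) -> list:
    """Entry point for synchronous callers, as noted in the consequences."""
    return asyncio.run(run_many(inputs))
```

Because the three blocking calls in run_sync([1, 2, 3]) run in separate worker threads, total wall-clock time is close to one call rather than three.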

ADR-005: Event Sourcing for Agent Actions

Status: Accepted

Context: Need full auditability, replay capability, and time-travel debugging for agent executions. Compliance requirements in financial, healthcare, and legal domains require verifiable audit trails.

Decision: Every agent action (LLM call, tool call, memory read/write, decision point) is stored as an immutable event in an append-only log via Dapr state store.

Consequences:
- Current state is derived from event replay
- Enables forensic debugging: "why did the agent do X at step 5?"
- Adds write amplification (every action writes to both operational state and event log)
- Enables compliance-ready audit trails without additional infrastructure
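A minimal in-memory sketch of the append-only log and replay idea follows. Event, EventLog, and the state shape are hypothetical; the actual implementation persists events through the Dapr state store, which this sketch does not attempt to model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    seq: int       # 1-based position in the log
    kind: str      # e.g. "llm_call", "tool_call", "memory_write"
    payload: dict


class EventLog:
    """Hypothetical append-only log; events are never updated or deleted."""

    def __init__(self) -> None:
        self._events: list = []

    def append(self, kind: str, payload: dict) -> Event:
        event = Event(seq=len(self._events) + 1, kind=kind, payload=dict(payload))
        self._events.append(event)
        return event

    def replay(self, upto: Optional[int] = None) -> dict:
        """Rebuild state from events; upto enables time-travel to a given step."""
        state = {"memory": {}, "tool_calls": 0}
        for event in self._events:
            if upto is not None and event.seq > upto:
                break
            if event.kind == "memory_write":
                state["memory"][event.payload["key"]] = event.payload["value"]
            elif event.kind == "tool_call":
                state["tool_calls"] += 1
        return state
```

Replaying with upto set to an earlier sequence number reconstructs what the agent knew at that step, which is the basis of the forensic debugging mentioned above.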

ADR-006: Memory Write Provenance as Non-Negotiable

Status: Accepted

Context: Memory poisoning attacks (MINJA, MemoryGraft) achieve 95%+ success rates against unprotected agents. Memory is the primary attack surface for persistent agent compromise — a successful poisoning persists across sessions and survives context window limits.

Decision: Every memory write must include provenance metadata (source_type, source_id, trust_level, content_hash). Memory writes without provenance are rejected. This is enforced at the DaprStateStore wrapper level, not at the application level, so it cannot be bypassed by application code.

Consequences:
- Adds ~2ms overhead per memory write (SHA-256 hash computation + metadata storage)
- All retrieval queries can filter by trust level
- Memory auditor can verify integrity via content hashes
- Makes memory poisoning attacks significantly harder: tampered content changes the hash, triggering auditor alerts
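The provenance rule can be illustrated with a small in-memory store. This is a sketch under stated assumptions: ProvenanceStore, ProvenanceError, and the integer trust_level scale are hypothetical, and only the four metadata field names come from the decision above.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    source_type: str
    source_id: str
    trust_level: int   # assumed integer scale; higher means more trusted
    content_hash: str


class ProvenanceError(ValueError):
    """Raised when a write lacks valid provenance (hypothetical name)."""


def content_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class ProvenanceStore:
    """Hypothetical wrapper that rejects writes without valid provenance."""

    def __init__(self) -> None:
        self._records: dict = {}

    def write(self, key: str, content: str, provenance: Optional[Provenance] = None) -> None:
        if provenance is None:
            raise ProvenanceError("memory write rejected: missing provenance")
        if provenance.content_hash != content_hash(content):
            raise ProvenanceError("memory write rejected: content_hash mismatch")
        self._records[key] = (content, provenance)

    def verify(self, key: str) -> bool:
        """Auditor check: does stored content still match its recorded hash?"""
        content, provenance = self._records[key]
        return content_hash(content) == provenance.content_hash

    def read(self, min_trust: int = 0) -> list:
        """Return contents whose trust level is at least min_trust."""
        return [c for c, p in self._records.values() if p.trust_level >= min_trust]
```

Placing the check inside the store wrapper, rather than in calling code, mirrors the enforcement point chosen in the decision.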

ADR-007: Sandbox by Default for Tool Execution

Status: Accepted

Context: Agents execute LLM-generated code and call external APIs. Unsandboxed execution grants LLM-generated actions full access to the host system — filesystem, network, environment variables, and other processes. OpenClaw (2024) found 190 security advisories in popular agent frameworks due to unsandboxed execution.

Decision: All tool code execution runs in Docker container sandbox by default. Network access, filesystem access, and resource limits are configured per-tool. Opt-out requires explicit configuration.

Consequences:
- Adds ~200ms cold-start latency for first tool call in a session (container spin-up)
- Warm container reuse reduces subsequent calls to ~10ms overhead
- Prevents host system compromise via prompt injection → code execution
- Requires Docker daemon running alongside agent

ADR-008: OpenTelemetry for All Observability

Status: Accepted

Context: Need distributed tracing, metrics, and logs that work with any backend (Jaeger, Prometheus, Grafana, Datadog, Honeycomb, etc.). Vendor lock-in to a specific observability platform would prevent adoption in organizations with existing tooling.

Decision: OpenTelemetry is the observability standard. Dapr provides infrastructure-level OTEL automatically. Grampus adds agent-specific custom spans: agent.run, agent.llm_call, agent.tool_call, agent.memory_read, agent.memory_write, agent.decision.

Consequences:
- Any OTEL-compatible backend works out of the box: change the exporter endpoint, not the code
- Custom spans enable agent-specific debugging that generic APM tools cannot provide
- Agents running in Kubernetes benefit from Dapr's automatic service mesh tracing
- Token cost and model information are captured as span attributes, enabling cost analysis via trace queries

ADR-009: Code Agents as Primary, JSON Tool Calling as Fallback

Status: Accepted

Context: Smolagents research (2024) demonstrated that agents writing Python code compose tools more flexibly and handle data transformations more naturally than JSON tool calling. Code agents can chain tool calls, use Python data structures, and perform calculations without additional LLM calls.

Decision: Support both code agents (LLM writes Python executed in sandbox) and JSON tool calling (standard function calling). Code agents are the recommended default for complex, multi-step tasks.

Consequences:
- Requires robust sandboxing (ADR-007)
- Code execution captures stdout, stderr, and return values
- Sandbox Python namespace includes registered tools as callable functions
- Simpler tasks can use JSON tool calling to avoid sandbox overhead
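The namespace and output-capture behaviour listed above can be sketched as follows. This is an illustration only: run_agent_code is a hypothetical helper, the convention of reading a variable named result is an assumption, and plain exec() provides no isolation at all, so it stands in for the Docker sandbox of ADR-007 purely to show the data flow.

```python
import contextlib
import io


def run_agent_code(code: str, tools: dict) -> dict:
    """Execute LLM-written code with tools exposed as callables.

    NOTE: exec() is NOT a sandbox. It is used here only to show how tools
    enter the namespace and how stdout, stderr, and a result are captured.
    """
    namespace = dict(tools)  # registered tools become plain function names
    stdout, stderr = io.StringIO(), io.StringIO()
    error = None
    with contextlib.redirect_stdout(stdout), contextlib.redirect_stderr(stderr):
        try:
            exec(code, namespace)
        except Exception as exc:  # surface failures to the caller as data
            error = f"{type(exc).__name__}: {exc}"
    return {
        "stdout": stdout.getvalue(),
        "stderr": stderr.getvalue(),
        "result": namespace.get("result"),  # assumed convention for return values
        "error": error,
    }
```

A code agent can chain several tool calls and do arithmetic in one execution, which is the flexibility the context paragraph attributes to code agents over JSON tool calling.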

ADR-010: MCP + A2A Protocol Support from Day One

Status: Accepted

Context: MCP (Model Context Protocol) is becoming the standard for tool integration (97M monthly SDK downloads as of 2025). A2A (Agent-to-Agent) enables cross-framework agent discovery. Building custom tool protocols creates ecosystem lock-in and prevents Grampus agents from using the growing ecosystem of MCP-compatible tools.

Decision: Implement MCP client in the tool layer (Phase 6). Implement A2A discovery in the orchestration layer (Phase 7+). Both are standards-compliant implementations, not custom protocols.

Consequences:
- Grampus agents can use any MCP-compatible tool server (filesystem, browser, databases, APIs)
- Other frameworks' agents can discover and invoke Grampus agents via A2A
- Avoids ecosystem lock-in: Grampus works alongside LangGraph, CrewAI, and Autogen
- Requires tracking protocol evolution as both MCP and A2A mature

ADR-011: Consolidated HTMX + Jinja2 Web UI

Status: Accepted

Context: Multiple post-launch phases require visual interfaces: memory inspector (D9), eval dashboard (D10), cost analytics, alert management, and an execution trace viewer. Two alternative approaches were considered: (a) separate CLI commands for each feature, or (b) separate web apps or SPAs per feature. Both create fragmentation — users must remember different URLs or commands, state cannot be shared across views (e.g., filtering by agent_id in the sidebar should filter all pages), and each feature reimplements the same table/chart components.

Decision: All web UI phases build into a single consolidated web app served at /ui/ from the existing FastAPI server. Technology stack: HTMX (loaded from CDN — no npm, no build step) + Jinja2 templates for server-side rendering. The D9 phase builds the shell (base template, sidebar navigation, layout system) and all subsequent UI phases add pages to it. One new optional dependency: jinja2>=3.0 added to the server extras group (already a transitive FastAPI dependency in practice).

Exception: The Visual Agent Builder (drag-and-drop graph editor) requires rich interactivity — sortable nodes, canvas pan/zoom, live edge drawing — that HTMX cannot support. That feature uses a minimal React SPA bundled at src/grampus/server/ui/static/builder/ and served at /ui/builder/. It is the only component permitted to introduce a frontend build step.

Consequences:
- No Node.js toolchain required to run or develop the UI; uv sync is sufficient
- Single URL entry point; sidebar navigation shared across all views
- HTMX partial endpoints (/ui/<feature>/_<partial>) enable dynamic updates (live cost tickers, SSE-driven agent status) without full page reloads
- HTMX has limits on complex client-side interactivity: sufficient for developer dashboards, not for visual graph editors (see exception above)
- Static assets (CSS, minimal JS helpers) live in src/grampus/server/ui/static/ and are served by FastAPI's StaticFiles mount
- Jinja2 templates live in src/grampus/server/ui/templates/ with a base.html that all pages extend

ADR-012: Multi-Agent Debate as a First-Class Orchestration Primitive

Status: Accepted

Context: High-stakes agent tasks (legal analysis, medical triage, financial decisions) cannot rely on a single LLM call because (a) individual models hallucinate on specialised questions and (b) there is no confidence signal that a single model can reliably self-report. Two prior approaches exist: prompt-level self-consistency (same model, multiple samples) and multi-agent crews (different agents, different roles). Self-consistency degrades on hard questions because sampling diversity is bounded by a single model's knowledge. Crews require pre-defined pipelines and do not provide a convergence signal. Research (Du et al. ICML 2024; M3MAD-Bench ICLR 2025) demonstrates that heterogeneous models arguing toward a shared answer reach substantially higher accuracy than either alternative.

Decision: Implement DebateOrchestrator as a standalone orchestration primitive in src/grampus/orchestration/debate/. It operates on a single question rather than a task pipeline, runs all debaters concurrently per round via asyncio.gather, and integrates with the existing Graph engine via debate_node(). Four specific research findings are baked into the design:

Heterogeneous panels — DebaterConfig.model_id allows mixing model families, not just temperatures. The aggregator uses debater.weight to handle unequal capability.
Sycophancy resistance — Round 2+ prompts require debaters to restate their prior answer verbatim before evaluating peers, and to cite specific logical evidence for any position change (ACL 2025 CONSENSAGENT).
Adaptive routing — If a fast routing model reports confidence ≥ threshold, the full debate is bypassed. This eliminates ~40% of unnecessary calls with no quality loss (arXiv 2504.05047).
Act-vs-escalate — When the final convergence score is below escalate_threshold, the result sets escalate_to_human=True rather than silently returning a low-confidence answer ("From Debate to Decision", April 2026).

Consequences:
- Zero new runtime dependencies: stdlib json, asyncio, re, time plus existing Pydantic and OTEL
- debate_node() integrates cleanly with the existing Graph conditional-edge API; human escalation uses the existing human_node
- Concurrent debaters within a round mean latency is bounded by the slowest debater, not the sum, so each round takes no longer than its slowest single LLM call
- Cost scales as num_debaters × num_rounds, but adaptive routing mitigates this for easy questions
- The convergence detector uses Jaccard word-overlap clustering (no ML model, no embedding calls), which is fast and deterministic
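A minimal sketch of Jaccard word-overlap clustering as a convergence signal follows. The function names, the greedy first-match clustering, and the 0.5 similarity threshold are assumptions for illustration; the document does not specify how DebateOrchestrator clusters answers or weights them.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase word set; punctuation is ignored."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def cluster_answers(answers: list, threshold: float = 0.5) -> list:
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters: list = []  # each cluster is a list of answer indices
    for i, answer in enumerate(answers):
        for cluster in clusters:
            if jaccard(answers[cluster[0]], answer) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def convergence_score(answers: list, weights=None, threshold: float = 0.5) -> float:
    """Share of total debater weight held by the largest cluster (0.0 to 1.0)."""
    if not answers:
        return 0.0
    weights = weights or [1.0] * len(answers)
    clusters = cluster_answers(answers, threshold)
    best = max(sum(weights[i] for i in cluster) for cluster in clusters)
    return best / sum(weights)
```

A score below a configured escalate_threshold would correspond to the act-vs-escalate branch described in the decision.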

ADR-013: Dual-Process Uncertainty Quantification as a First-Class Runner Feature

Status: Accepted

Context: Agents produce unreliable outputs at unknown rates. Single-call verbalized confidence (asking the model to write "confidence": 0.8) has a documented ECE of 0.377+ even on frontier models (arXiv 2412.14737, KDD 2025 survey) — aligned models cluster at 90–100% confidence regardless of factual accuracy. Existing frameworks either ignore this or apply per-call thresholds that do not account for how uncertainty accumulates across sequential steps. A grounding error in step 1 biases all downstream reasoning (the "Spiral of Hallucination"), so per-step overconfidence checking is insufficient. There is also no standard mechanism for agents to escalate irreversible actions (send_email, delete, deploy) to humans when confidence is too low.

Decision: Implement UncertaintyMonitor as an optional hook in AgentRunner, not as a separate layer. Four research findings are baked directly into the implementation:

Dual-process estimation (arXiv 2601.15703, Jan 2026) — System 1 (fast): P(True) self-evaluation fused with verbalized confidence, both calibrated. System 2 (slow, opt-in): adaptive semantic entropy sampling when fused confidence is in the uncertain middle zone.
P(True) as primary fast signal (Kadavath et al. 2022) — A single follow-up call asking "Is your answer correct?" achieves ECE ≈ 0.10 on frontier models without logit access. Verbalized confidence (weight 0.4) remains a weak supporting signal alongside P(True) (weight 0.6).
Adaptive semantic entropy (arXiv 2504.03579, 2025) — Start with 2 samples; early-stop if Jaccard ≥ 0.60 (saves ~47% cost); extend to max_samples on disagreement. Pessimistic fusion min(fast, entropy_conf) prevents over-optimism.
SAUP propagation (arXiv 2412.01033, ACL 2025 pp. 6064–6073) — Per-step situational weights (decision=0.70, llm=0.55, tool=0.45, memory_read=0.35) ensure a confident step cannot erase uncertain history. 20% AUROC improvement over single-step UQ.

The three-tier escalation ladder (Zylos Research, April 2026) maps propagated confidence → action: PROCEED → PROCEED_WITH_LOG → PAUSE_FOR_HUMAN → ABORT. Irreversible tool names trigger PAUSE at MEDIUM uncertainty. A System-2 reflection prompt is injected before PAUSE so the next LLM call sees explicit uncertainty acknowledgment.

Consequences:
- Zero new required dependencies: stdlib math, json, re, asyncio plus existing Pydantic and OTEL
- uncertainty_monitor=None (the default) means zero overhead for agents that don't need UQ
- Two hooks in the runner loop: post-LLM (checks response confidence) and pre-tool (checks before irreversible actions); both break the loop cleanly with hit_limit = False
- UncertaintyError (code UNCERTAINTY_CRITICAL) gives callers a machine-readable signal on ABORT
- uncertainty_guard_node() provides an explicit graph checkpoint between nodes, composable with the existing debate_node() and human_node() primitives
- OTEL spans (uncertainty.estimate, uncertainty.semantic, uncertainty.escalate) are emitted per step when a tracer is provided, enabling confidence dashboards alongside cost and latency metrics
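The fast-path fusion and the escalation ladder of ADR-013 can be sketched as below. The 0.6/0.4 weights and the min() fusion come from the decision text; the numeric thresholds (0.8, 0.5, 0.25) and all function names are hypothetical, chosen only to make the ladder concrete.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    PROCEED = "proceed"
    PROCEED_WITH_LOG = "proceed_with_log"
    PAUSE_FOR_HUMAN = "pause_for_human"
    ABORT = "abort"


P_TRUE_WEIGHT = 0.6       # from the decision text
VERBALIZED_WEIGHT = 0.4   # from the decision text


def fuse_fast(p_true: float, verbalized: float) -> float:
    """System 1: weighted blend of P(True) and verbalized confidence."""
    return P_TRUE_WEIGHT * p_true + VERBALIZED_WEIGHT * verbalized


def fuse_pessimistic(fast: float, entropy_conf: Optional[float] = None) -> float:
    """System 2 (opt-in): take the lower of the two estimates when available."""
    return fast if entropy_conf is None else min(fast, entropy_conf)


def choose_action(confidence: float, irreversible: bool,
                  high: float = 0.8, medium: float = 0.5, low: float = 0.25) -> Action:
    """Map confidence to an action. Threshold values are assumptions."""
    if confidence >= high:
        return Action.PROCEED
    if confidence >= medium:
        # Medium uncertainty: irreversible tools pause, everything else logs.
        return Action.PAUSE_FOR_HUMAN if irreversible else Action.PROCEED_WITH_LOG
    if confidence >= low:
        return Action.PAUSE_FOR_HUMAN
    return Action.ABORT
```

Taking the minimum in fuse_pessimistic means a reassuring fast estimate can never override a low semantic-entropy confidence.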

ADR-014: Long-Horizon Planning as a First-Class Orchestration Layer

Status: Accepted

Context: The existing AgentRunner implements a greedy ReAct loop where each step is chosen independently from the prior step's observation. Research shows this is fundamentally broken for long-horizon tasks: locally optimal step choices lead to early commitments that compound — the longer the task, the worse the degradation ("Why Reasoning Fails to Plan", arXiv 2601.22311, Jan 2026). Existing mitigation strategies — increasing max_iterations, adding chain-of-thought — do not address the core problem of myopic greedy selection. Two additional failure modes motivated this decision: (a) passing full conversation history to every LLM call is the dominant token-cost driver for multi-step tasks, and (b) there is no recovery mechanism when an intermediate step fails other than starting over.

Decision: Implement PlanningRunner as a distinct orchestration layer that wraps AgentRunner without modifying it. Four research findings are baked directly into the implementation:

Task-Decoupled Planning / scoped context (arXiv 2601.07577, Jan 2026) — Each subgoal executor receives only: global task + one-line summaries of completed steps + current subgoal description. The full conversation history is never passed. This reduces token usage by ~82% on long plans and confines error propagation to the active node.
Fallback before replanning (ReAcTree, arXiv 2511.02424, AAMAS 2026) — When a subgoal fails after max_retries, a pre-specified fallback_strategy is tried once before triggering a full (partial) replan. This doubles success rate (61% vs 31%) at negligible cost.
Partial replan only (Google DeepMind Subgoal Framework, arXiv 2603.19685, Mar 2026) — When replanning is triggered, only the downstream unfinished subgoals are regenerated. Completed subgoals and their outputs are preserved. This reduces replan cost and eliminates the "restart from scratch" failure mode.
Adaptive engagement ("Learning When to Plan", arXiv 2509.03581) — A cheap complexity estimate call gates planning engagement. Tasks estimated at ≤ complexity_threshold tool calls delegate directly to AgentRunner, eliminating planning overhead (~40% of queries in typical workloads).

An optional FLARE-inspired lookahead (arXiv 2601.22311) generates n candidate execution paths before each subgoal and selects the highest-scoring approach. It is advisory only: parse failures are silently swallowed and execution continues without a hint.

Consequences:
- AgentRunner is unchanged: PlanningRunner wraps it, so all existing ReAct agents continue to work without modification
- Subgoal DAG topology is validated at plan creation: unique IDs, no missing dependency references, no cycles (Kahn's algorithm); PlanningError(code="CIRCULAR_DEPENDENCY") is raised on cycle detection
- PostconditionVerifier introduces one extra LLM call per subgoal; with the fast model tier this is negligible relative to subgoal execution cost
- Parallel wave execution via asyncio.gather matches the existing Graph engine's parallel branch model; the same event loop runs both
- planning_node() integrates cleanly with the existing Graph conditional-edge API; failure escalation uses the existing human_node pattern
- Zero new required dependencies: stdlib asyncio, json, re, collections plus existing Pydantic and structlog
- PlanningError is a top-level peer of OrchestrationError, not a subclass, because planning failures are structurally different from runner failures (they occur before execution begins or during plan maintenance, not during the ReAct loop)
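The DAG validation and wave grouping mentioned above can be sketched with a level-by-level variant of Kahn's algorithm. build_waves and PlanCycleError are hypothetical names; the real code raises PlanningError(code="CIRCULAR_DEPENDENCY"), whose constructor is not shown in this document.

```python
class PlanCycleError(ValueError):
    """Stand-in for PlanningError(code="CIRCULAR_DEPENDENCY") (hypothetical)."""


def build_waves(deps: dict) -> list:
    """Group subgoals into waves that can run concurrently.

    deps maps each subgoal ID to the IDs it depends on. Dict keys are unique
    by construction, so only missing references and cycles need checking.
    """
    for node, parents in deps.items():
        for parent in parents:
            if parent not in deps:
                raise KeyError(f"{node} depends on unknown subgoal {parent}")

    indegree = {node: len(set(parents)) for node, parents in deps.items()}
    children = {node: [] for node in deps}
    for node, parents in deps.items():
        for parent in set(parents):
            children[parent].append(node)

    wave = sorted(node for node, degree in indegree.items() if degree == 0)
    waves, seen = [], 0
    while wave:
        waves.append(wave)
        seen += len(wave)
        next_wave = []
        for node in wave:
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_wave.append(child)
        wave = sorted(next_wave)

    # Any node never reaching indegree 0 sits on (or behind) a cycle.
    if seen != len(deps):
        raise PlanCycleError("CIRCULAR_DEPENDENCY")
    return waves
```

Each inner list is one wave; all subgoals in a wave have their dependencies satisfied by earlier waves and could be passed to asyncio.gather together.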

ADR-015: Artifact-Centric Collaboration as a First-Class Orchestration Pattern

Status: Accepted

Context: Multi-agent workflows that pass text strings between agents cannot enforce structure, detect conflicts, or guarantee consistency. Agents working on the same document or codebase independently create silently incompatible outputs. The Specification Gap paper (arXiv 2603.24284, March 2026) showed that implicit shared specifications reduce two-agent integration accuracy by 25–39 percentage points. STORM (arXiv 2605.20563, May 2026) showed that post-hoc conflict resolution is worse than write-time detection by 18.7 points on Commit0-Lite. Existing frameworks have no native artifact primitive — they pass strings or serialize to JSON ad hoc.

Decision: Implement ArtifactStore, SectionLockManager, ArtifactCollaborator, and ArtifactCrew in src/grampus/orchestration/artifact/. Key design choices:

Schema-first (Specification Gap): every artifact section has an explicit SectionSchema with description, content_type, and required_fields before any agent is assigned. Implicit specs are rejected at artifact creation time.
MESI-inspired ownership states (Token Coherence, arXiv 2603.15183): UNOWNED → CLAIMED → REVIEWING → MERGED. Prevents any silent writes and converts synchronization cost from O(n×S×|D|) to O((n+W)×|D|).
Write-time conflict detection (STORM): schema validation + dependency version check runs inside ArtifactStore.write_section() before persisting. Conflicts surface at write time, not post-hoc merge.
TODO-claim via Dapr distributed lock (CodeCRDT, arXiv 2510.18893): atomic, at-most-one-winner section claiming reuses the existing Phase 2 lock primitive.
Scoped per-agent context (CAID, arXiv 2603.21489): each agent receives only the artifact schema + its assigned section + one-line summaries of completed dependencies. Full artifact history is never passed, preventing error propagation across sections.
Wave-based parallel execution: sections within the same topological wave execute concurrently via asyncio.gather. Integration checks run between waves.

Consequences:
- Zero new required dependencies: Dapr lock already in Phase 2; all else is stdlib + existing Pydantic
- ArtifactCrew(agents=[...]) is the primary API; artifact_node() enables single-section graph integration
- Artifact.schema is immutable after creation; sections are mutable only through the claim/write/release lifecycle
- Circular dependencies in section DAGs are detected at wave-build time via Kahn's algorithm with ArtifactConflictError(code="CIRCULAR_DEPENDENCY")
- Content type validation is strict: JSON sections must pass required_fields check; TEXT/MARKDOWN accept any string; CODE sections accept any string
- ArtifactConflictError and ArtifactSectionNotFoundError are top-level peers of OrchestrationError in the error hierarchy
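The ownership states and write-time validation can be sketched for a single JSON section. Section, ConflictError, and the method names are hypothetical simplifications: the real design uses a Dapr distributed lock for claiming and also checks dependency versions, neither of which is modelled here.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    UNOWNED = "unowned"
    CLAIMED = "claimed"
    REVIEWING = "reviewing"
    MERGED = "merged"


class ConflictError(RuntimeError):
    """Stand-in for ArtifactConflictError (hypothetical)."""


@dataclass
class Section:
    name: str
    required_fields: tuple = ()
    state: State = State.UNOWNED
    owner: Optional[str] = None
    content: Optional[str] = None

    def claim(self, agent_id: str) -> None:
        # At most one winner: only an UNOWNED section can be claimed.
        if self.state is not State.UNOWNED:
            raise ConflictError(f"{self.name} already claimed by {self.owner}")
        self.state, self.owner = State.CLAIMED, agent_id

    def write(self, agent_id: str, content: str) -> None:
        # Conflicts surface here, at write time, before anything is stored.
        if self.state is not State.CLAIMED or self.owner != agent_id:
            raise ConflictError(f"{agent_id} does not hold a claim on {self.name}")
        data = json.loads(content)
        if not isinstance(data, dict):
            raise ConflictError("JSON section must be an object")
        missing = [f for f in self.required_fields if f not in data]
        if missing:
            raise ConflictError(f"missing required fields: {missing}")
        self.content, self.state = content, State.REVIEWING

    def merge(self) -> None:
        if self.state is not State.REVIEWING:
            raise ConflictError(f"{self.name} is not under review")
        self.state = State.MERGED
```

Because every mutation goes through claim, write, and merge, a write from an agent that never claimed the section is rejected rather than silently applied.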

ADR-016: Dual-Tier Agent Self-Improvement as a First-Class Runner Feature

Status: Accepted

Context: Agents repeat mistakes across sessions because each run starts from the same static system prompt with no memory of past failures. Reflexion (NeurIPS 2023) demonstrated that verbal self-reflection stored in persistent memory enables agents to improve without weight updates. The 2025 SAGE framework (arXiv 2512.17102) extended this by showing that extracting validated reusable skills from successes produces compounding improvement (+8.9% goal completion, 26% fewer steps). ME-ICPO (arXiv 2603.01335, March 2026) established a theoretical grounding for self-reflection as in-context policy optimization. No competitor framework has shipped both failure reflection and success skill extraction as built-in primitives.

Decision: Implement ReflexionEngine and SkillLibrary in src/grampus/memory/reflexion/ as optional hooks in AgentRunner. Three integration points: (1) post-failure hook generates and stores a verbal reflection, (2) post-success hook attempts skill extraction, (3) pre-LLM-call hook retrieves and injects relevant reflections + skills. The PromptOptimizer completes the loop by automatically proposing and evaluating system prompt mutations when an EvalSuite is available.

Key design choices:
1. Both hooks are opt-in and suppressed (reflexion_engine=None by default; all hooks wrapped in contextlib.suppress(Exception)), so self-improvement never crashes the core execution path.
2. Skill lifecycle (SAGE): new skills start unvalidated; promote to validated after ≥3 successful uses; demote below success_rate=0.4 after ≥5 uses; delete below 0.2.
3. Quality confidence for reflections (ME-ICPO): a second LLM call rates reflection quality on 0–1; low-quality reflections (< 0.3) are stored but not surfaced, preventing low-signal noise from polluting context.
4. ProceduralMemory reuse: skills and reflections are stored as Procedure records with procedure_type=SKILL/REFLECTION. No new Dapr key namespace and no new storage infrastructure are needed.

Consequences:
- Zero new required dependencies: stdlib only plus existing Pydantic, Dapr, OTEL
- AgentRunner with reflexion_engine=None, skill_library=None (the defaults) is behaviorally identical to the pre-F1 runner
- PromptOptimizer.optimize() calls EvalSuite N+1 times (1 baseline + N candidates), so only use it on non-production agents or with fast/cheap model configs
- SkillLibrary.run_sequential() enables SAGE-style batch improvement where skills from earlier tasks in a sequence accelerate later tasks
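The skill lifecycle thresholds can be sketched as a small state update. Skill and record_use are hypothetical names, and one detail is an assumption: the text does not say whether the 0.2 deletion threshold also waits for 5 uses, so this sketch applies the same minimum to both demotion and deletion.

```python
from dataclasses import dataclass

PROMOTE_SUCCESSES = 3   # promote to validated after >= 3 successful uses
MIN_USES = 5            # demotion needs >= 5 uses; assumed for deletion too
DEMOTE_BELOW = 0.4
DELETE_BELOW = 0.2


@dataclass
class Skill:
    name: str
    uses: int = 0
    successes: int = 0
    status: str = "unvalidated"

    @property
    def success_rate(self) -> float:
        return self.successes / self.uses if self.uses else 0.0


def record_use(skill: Skill, success: bool) -> str:
    """Update counters and return the skill's new lifecycle status."""
    skill.uses += 1
    if success:
        skill.successes += 1

    rate = skill.success_rate
    if skill.uses >= MIN_USES and rate < DELETE_BELOW:
        skill.status = "deleted"
    elif skill.uses >= MIN_USES and rate < DEMOTE_BELOW:
        skill.status = "unvalidated"
    elif skill.successes >= PROMOTE_SUCCESSES:
        skill.status = "validated"
    return skill.status
```

Checking deletion before demotion, and demotion before promotion, keeps a skill with many early successes from staying validated once its success rate collapses.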

ADR-017: Three-Tier User Memory Hierarchy as a First-Class Memory Layer

Status: Accepted

Context: Agents currently have no persistent model of the individual user. Each session starts cold — the agent cannot remember that this user is a senior engineer who prefers concise answers and is currently migrating a legacy system. Single-layer key-value user profiles (e.g., {expertise: "high"}) fail in practice because: (1) facts become stale without temporal validity metadata (Beyond Dialogue Time, arXiv 2601.07468); (2) extracting facts from noisy conversations without a reflective correction pass amplifies hallucinations during clustering (Bi-Mem, arXiv 2601.06490); (3) a flat profile has no mechanism to promote actively-relevant facts above infrequently-accessed background context (HMO, arXiv 2604.01670).

Decision: Implement UserFact, UserProfile, UserMemoryStore, FactExtractor, and ProfileSynthesizer in src/grampus/memory/user/. The design uses three tiers: Tier 3 (UserEpisodes — raw interactions in existing EpisodicMemory), Tier 2 (UserFacts — extracted, temporally-grounded facts about the user), and Tier 1 (UserProfile — synthesized persona, rebuilt from facts every N new extractions). UserMemoryAdapter integrates both hooks into AgentRunner as opt-in, zero-crash additions.

Key design choices:
1. Temporal validity on every fact (Beyond Dialogue Time): each UserFact has valid_from and valid_until. Contradicted facts are expired rather than overwritten, preserving history.
2. Bidirectional construction (Bi-Mem): inductive agent (FactExtractor) works bottom-up; reflective agent (ProfileSynthesizer) works top-down. This prevents hallucination amplification.
3. Deduplication by cosine similarity before storing: existing facts with similarity > 0.90 get a confidence update (EMA) rather than a duplicate record.
4. Synthesis threshold (HMO): ProfileSynthesizer only fires every 10 new facts (configurable), which prevents thrashing on rapid-fire short sessions while ensuring the profile stays fresh.
5. Context injection is selective: get_context() uses cosine similarity to surface only the facts most relevant to the current query. The full UserFact list is never injected wholesale.
6. Zero behavioral change when disabled: user_memory_adapter=None (the default) means the AgentRunner behaves identically to pre-F2. user_id=None silently skips all hooks.

Consequences:
- Zero new required dependencies: existing Dapr state store, embedding_service, and model_client
- FactExtractor and ProfileSynthesizer each make 1 LLM call post-session; total overhead is 2 cheap LLM calls (temperature=0.2, max_tokens=400/300) per session end
- UserFacts and UserProfile persist independently of the agent, so the same user model is available to any agent that shares the same UserMemoryStore instance
- user_id is explicit: there is no implicit user tracking; the caller must pass it
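The deduplication rule (cosine similarity above 0.90 triggers a confidence update instead of a new record) can be sketched as follows. Fact, upsert_fact, and the EMA smoothing factor alpha=0.3 are assumptions; the document gives the 0.90 threshold and names EMA but not its parameter.

```python
import math
from dataclasses import dataclass


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Fact:
    text: str
    embedding: list
    confidence: float


def upsert_fact(facts: list, new: Fact, threshold: float = 0.90, alpha: float = 0.3) -> Fact:
    """Merge new into a near-duplicate fact, or append it as a new record.

    alpha is a hypothetical EMA smoothing factor.
    """
    for existing in facts:
        if cosine(existing.embedding, new.embedding) > threshold:
            # Exponential moving average of confidence; no duplicate is stored.
            existing.confidence = (1 - alpha) * existing.confidence + alpha * new.confidence
            return existing
    facts.append(new)
    return new
```

Repeated observations of the same fact therefore raise or lower its confidence gradually rather than growing the fact list.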

ADR-018: Graph-Structured Memory Consolidation and Lifecycle Tiers

Status: Accepted

Context: The existing four-layer memory system (working, episodic, semantic, procedural) treats memory as a flat store of records. Two failure modes emerge at scale: (1) flat vector search over tens of thousands of episodic records degrades in quality and speed — relevant concepts buried under irrelevant matches; (2) all records are treated equally regardless of how often they're accessed, wasting retrieval overhead on rarely-used cold memories. The 2026 research produced two complementary solutions. GAM (arXiv 2604.12285) showed that building a two-level knowledge graph — a transient event-progression-graph per session and a stable topic-associative-network triggered by semantic shift — improves reasoning accuracy on long-horizon tasks. MemOS (arXiv 2505.22101, May 2025) showed that managing memory as a hot/warm/cold lifecycle resource achieves 35.24% token savings in production. FluxMem (arXiv 2602.14038) showed that adaptive routing per query type — graph traversal vs. flat vector vs. sequential — outperforms any fixed retrieval strategy.

Decision: Implement two new sub-packages: src/grampus/memory/graph/ (GraphBuilder, SemanticConsolidator, GraphRetriever) and src/grampus/memory/lifecycle/ (LifecycleTierManager, AdaptiveRetriever). Both are additive enhancements — the existing four memory layers are unchanged. MemoryManager receives three optional params (graph_consolidator, lifecycle_manager, adaptive_router); when all three are None, behavior is identical to pre-F3. AgentRunner receives one optional graph_builder param.

Key design choices:
1. Semantic-shift-triggered consolidation (GAM): the EventGraph integrates into the MemoryGraph only when cosine distance between current and last-consolidated state exceeds 0.30, which prevents transient noise from contaminating stable knowledge. Time-based consolidation is explicitly rejected.
2. Hot/warm/cold tiers map to existing infrastructure (MemOS): HOT = in-context (working memory), WARM = Redis cache (already in Dapr components), COLD = Postgres/Dapr state. No new infrastructure.
3. Adaptive routing is keyword-based, not ML-based (FluxMem inspiration): simple heuristics classify query type (sequential keywords → SEQUENTIAL; long queries or causal keywords → GRAPH; else → FLAT). This avoids adding an embedding call just to route queries.
4. SchematicMemory is implemented as tagged SemanticFacts, not a new layer: ConceptNodes confirmed by ≥ 5 episodes with high frequency are tagged category="schematic" in SemanticMemory and always surfaced at the top of recall results. No new Dapr key namespace.

Consequences:
- Zero new required dependencies: stdlib collections (for BFS deque), math, json, uuid, datetime, plus existing Pydantic, Dapr client, embedding_service, model_client
- MemoryGraph is persisted as a single Dapr key per agent, so no graph database is required (adjacency list in JSON); revisit if graphs exceed 10K nodes per agent
- SemanticConsolidator makes 1 LLM call per consolidation trigger; with semantic-shift gating this averages 1–3 calls per 30-minute session, not per event
- LifecycleTierManager.sweep() should be called at session start to demote stale HOT records from the previous session; add it to the AgentRunner.run() pre-loop via contextlib.suppress
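The keyword-based routing heuristic can be sketched in a few lines. The three route names come from the design choice above; the keyword sets, the 20-word cutoff for a "long" query, and the route_query name are hypothetical.

```python
import re
from enum import Enum


class Route(Enum):
    SEQUENTIAL = "sequential"
    GRAPH = "graph"
    FLAT = "flat"


# Hypothetical keyword lists and length cutoff, for illustration only.
SEQUENTIAL_KEYWORDS = {"before", "after", "first", "then", "next", "last", "previously"}
CAUSAL_KEYWORDS = {"why", "because", "cause", "caused", "led", "effect"}
LONG_QUERY_WORDS = 20


def route_query(query: str) -> Route:
    """Classify a query with plain keyword checks; no embedding call needed."""
    words = re.findall(r"[a-z']+", query.lower())
    word_set = set(words)
    if word_set & SEQUENTIAL_KEYWORDS:
        return Route.SEQUENTIAL
    if len(words) > LONG_QUERY_WORDS or word_set & CAUSAL_KEYWORDS:
        return Route.GRAPH
    return Route.FLAT
```

The check order matters: a query containing both sequential and causal keywords is routed as SEQUENTIAL in this sketch, which is one possible tie-break among several.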

ADR-019: Two-Tier Causal Analysis (Trace Tracing + Lightweight SCM)

Status: Accepted

Context: Agents have no mechanism to distinguish root causes from cascading effects in failures, and no persistent model of what actions cause what outcomes. Two distinct failure modes motivated F4: (1) post-failure diagnosis — when an agent fails at step 8, it is non-obvious whether step 2 or step 6 caused it; cascading failures look identical to root failures in flat logs; (2) proactive intervention reasoning — agents cannot answer "what would have happened if I had skipped that tool call?" without re-executing. The Rung Collapse proof (arXiv 2602.11675) established that LLMs cannot perform causal inference natively. However, two 2026 papers showed practical paths that do not require solving the LLM-native causal reasoning problem: AgentTrace (arXiv 2603.14688, March 2026) showed that causal graphs reconstructed from execution logs localize root causes with sub-second latency and high accuracy without any LLM inference at debug time. Causal-aware LLMs (IJCAI 2025, arXiv 2505.24710) showed that LLMs as graph-labelers (not causal reasoners) combined with code-level do-calculus produces reliable interventional answers.

Decision: Implement a two-tier causal analysis layer in src/grampus/causal/: Tier 1 (CausalTracer) reconstructs causal graphs from the existing Phase 9 event log and diagnoses root causes post-hoc with no LLM inference. Tier 2 (CausalWorldModel) builds a persistent SCM the LLM populates during execution; SimpleCausalInference answers P(Y|do(X)) queries via pure-Python backdoor adjustment. Both tiers are additive opt-ins to AgentRunner — when both params are None, behavior is identical to pre-F4.
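To make the P(Y|do(X)) computation concrete, here is a minimal sketch of backdoor adjustment over discrete observations, using the standard formula P(y | do(x)) = Σ_z P(y | x, z) · P(z). The function name `backdoor_adjust`, the row format, and the choice to return None when a stratum has no treated rows are assumptions for illustration; the ADR does not show the SimpleCausalInference API.

```python
from collections import Counter
from typing import Optional


def backdoor_adjust(rows, treatment, outcome, adjust_for, x, y) -> Optional[float]:
    """Estimate P(outcome=y | do(treatment=x)) as sum_z P(y | x, z) * P(z)."""
    if not rows:
        return None

    def stratum_key(row):
        return tuple(row[name] for name in adjust_for)

    z_counts = Counter(stratum_key(row) for row in rows)
    total = 0.0
    for z, count in z_counts.items():
        treated = [r for r in rows if stratum_key(r) == z and r[treatment] == x]
        if not treated:
            # No rows with treatment=x in this stratum: the estimate is undefined here.
            return None
        p_y_given_xz = sum(1 for r in treated if r[outcome] == y) / len(treated)
        total += p_y_given_xz * (count / len(rows))
    return total


# Hypothetical observations with one confounder z.
observations = [
    {"z": 0, "x": 1, "y": 1}, {"z": 0, "x": 1, "y": 0},
    {"z": 0, "x": 0, "y": 0}, {"z": 0, "x": 0, "y": 0},
    {"z": 1, "x": 1, "y": 1}, {"z": 1, "x": 1, "y": 1},
    {"z": 1, "x": 0, "y": 1}, {"z": 1, "x": 0, "y": 0},
]
p_do_x1 = backdoor_adjust(observations, "x", "y", ["z"], x=1, y=1)
p_do_x0 = backdoor_adjust(observations, "x", "y", ["z"], x=0, y=1)
```

The adjustment set (`adjust_for`) must satisfy the backdoor criterion for the estimate to be valid; choosing it from the graph is outside this sketch.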

Key design choices: 1. LLM labels, code reasons — the LLM's job is only to identify and name causal relationships from text. SimpleCausalInference does all causal inference. This circumvents the Rung Collapse limitation entirely. 2. CausalTracer uses the existing event log (Phase 9, ADR-005) — no new storage infrastructure. Three edge types (sequential, data-dependency, failure-cascade) are reconstructed purely from log structure. 3. Root cause composite score = 0.6 × structural + 0.4 × positional, matching the AgentTrace and CHIEF signal weighting from the papers. 4. SimpleCausalInference is zero-new-deps — pure Python backdoor adjustment over small DAGs (< 200 variables). For larger graphs, the optional causal extras group can wrap DoWhy instead. 5. Tier 1 feeds Tier 2 — absorb_diagnosis() converts structurally validated causal chains from failure diagnosis into WorldModelGraph edges, giving the SCM ground-truth signal that bypasses LLM extraction uncertainty. 6. WorldModelGraph storage follows the F3 MemoryGraph pattern: one Dapr key per agent, entity = "causal_world_model". No graph database required for typical agent world models.
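The composite score in design choice 3 is a plain weighted sum. The sketch below assumes both component scores are already normalised to [0, 1]; how CausalTracer derives them is not described here, and the function and candidate names are hypothetical.

```python
STRUCTURAL_WEIGHT = 0.6
POSITIONAL_WEIGHT = 0.4


def composite_root_cause_score(structural: float, positional: float) -> float:
    """Weighted sum from the ADR: 0.6 * structural + 0.4 * positional."""
    return STRUCTURAL_WEIGHT * structural + POSITIONAL_WEIGHT * positional


# Hypothetical candidates: step id -> (structural score, positional score).
candidates = {"step_2": (0.9, 0.8), "step_6": (0.5, 0.4)}
ranked = sorted(
    candidates,
    key=lambda step: composite_root_cause_score(*candidates[step]),
    reverse=True,
)
```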

Consequences: - Zero new required dependencies — stdlib re, uuid, math, collections.deque, json, contextlib, asyncio plus existing Pydantic, Dapr client, model_client - CausalTracer.diagnose() requires the Phase 9 event store to expose get_events_for_session(session_id, agent_id) -> list[dict]; if that method is not yet present on the event store, add it as part of this phase - SimpleCausalInference assumes a DAG; is_dag() should be checked before running intervene() on user-provided graphs; cyclic world models are silently handled by returning is_identifiable=False - The post-session failure hook in AgentRunner requires AgentState.last_event_id (optional field); if the event log does not surface this, the hook falls back to session_id as a proxy failure marker
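The acyclicity check recommended above can be done with Kahn's algorithm in a few lines of stdlib Python. This is a sketch under the assumption that the graph is available as a dict of outgoing edges; the name `is_acyclic` is illustrative and is not the signature of the library's is_dag().

```python
from collections import deque


def is_acyclic(edges: dict) -> bool:
    """Kahn's algorithm: a directed graph is a DAG iff every node can be topologically removed."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)

    indegree = {node: 0 for node in nodes}
    for targets in edges.values():
        for target in targets:
            indegree[target] += 1

    queue = deque(node for node, degree in indegree.items() if degree == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for target in edges.get(node, []):
            indegree[target] -= 1
            if indegree[target] == 0:
                queue.append(target)
    # Any node left over sits on, or downstream of, a cycle.
    return removed == len(nodes)
```

Running a check like this before an intervention query turns the silent is_identifiable=False outcome into an explicit, early signal for user-provided graphs.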

ADR-020: Adversarial Red-Teaming as a First-Class Evaluation Primitive¶

Status: Accepted

Context: Agent safety testing in the industry is largely manual, expert-driven, and non-reproducible. Two developments in 2026 changed this calculus: (1) OWASP released the first dedicated Agentic Top 10 (ASI01–ASI10:2026), providing a standardised taxonomy for agent-specific attacks distinct from classic LLM jailbreaks; (2) automated red-teaming frameworks (AgenticRed, arXiv 2601.13518; Dreadnode, arXiv 2605.04019) demonstrated 85–100% attack success rates (ASR) with sub-hour campaign execution, making manual red-teaming insufficient. Grampus has a uniquely rich attack surface: four memory layers (including the F1–F3 reflexion/user/graph additions), sandboxed code execution, multi-agent crews with A2A, and the F4 causal world model, all of which are novel attack vectors not covered by classic LLM red-teaming.

Decision: Implement src/grampus/evaluation/red_team/ as a first-class evaluation primitive alongside the existing EvalSuite. Architecture: Attacker (generates payloads) → Target (Grampus agent under test) → Judge (evaluates success) with an optional mutation feedback loop for failed attempts. Six attack strategy implementations cover the highest-impact OWASP Agentic Top 10 categories. Every finding maps to both the OWASP category and one of the four security properties formalized in arXiv 2603.19469 (task alignment, action alignment, source authorization, data isolation).

Key design choices: 1. Strategy + Judge separation: strategies generate payloads deterministically (reproducible); the judge evaluates success with LLM + rule-based fallback. 2. target_fn decoupling: RedTeamRunner takes any async (messages) -> str callable, not an AgentRunner instance. This lets users red-team agents running as HTTP servers, not just local instances. 3. Rule-based judge always runs: even with LLM judge enabled, rule-based regex patterns provide a fallback when LLM confidence < 0.5 or when the model is unavailable. 4. One mutation retry: when a payload fails and a model_client is available, AttackerAgent generates one adaptive mutation (AgenticRed pattern) before recording the result. This doubles ASR on rule-based targets without significant overhead. 5. CLI exit code 1 on CRITICAL/HIGH: grampus redteam exits non-zero on high-severity findings, enabling CI/CD pipeline integration (block merges that introduce vulnerabilities).
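Design choice 3 (the rule-based judge always runs, and takes over when LLM confidence is below 0.5 or the model is unavailable) might look roughly like this. The `Verdict` type, the example patterns, and the function names are hypothetical; only the 0.5 threshold and the fallback conditions come from the text above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    success: bool  # True means the attack succeeded
    source: str    # "llm" or "rules"


# Hypothetical patterns indicating a leaked prompt or credential.
LEAK_PATTERNS = [
    re.compile(r"BEGIN SYSTEM PROMPT", re.IGNORECASE),
    re.compile(r"api[_-]?key\s*[:=]", re.IGNORECASE),
]


def rule_based_success(response: str) -> bool:
    return any(pattern.search(response) for pattern in LEAK_PATTERNS)


def judge(response: str, llm_success: Optional[bool], llm_confidence: Optional[float]) -> Verdict:
    """Combine an optional LLM verdict with the always-on rule-based check."""
    rule_hit = rule_based_success(response)  # runs regardless of the LLM result
    llm_unavailable = llm_success is None or llm_confidence is None
    if llm_unavailable or llm_confidence < 0.5:
        return Verdict(success=rule_hit, source="rules")
    return Verdict(success=llm_success, source="llm")
```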

Consequences: - Zero new required dependencies — all stdlib + existing Pydantic, structlog, model_client - grampus redteam agent.py requires the agent file to expose get_agent_config() and run_conversation(messages) — a thin adapter contract - The RedTeamRunner is intentionally decoupled from AgentRunner to avoid re-initializing Dapr and memory infrastructure for each attack payload; the target_fn handles that - Multi-turn attacks (ReasoningHijackStrategy) require the target_fn to maintain conversation state across the prior_turns list — stateless target_fns will see reduced multi-turn ASR

ADR-021: Document Processing Tools — Optional Extras with Graceful Degradation¶

Status: Accepted

Context: Agents need to ingest PDF, Word (.docx), and Excel (.xlsx) documents for RAG pipelines and episodic memory. The three document libraries (pymupdf4llm, python-docx, openpyxl) add ~50 MB to the install. Not every Grampus deployment needs document ingestion — a pure API agent, a CLI tool, or a code-generation agent has no use for these libraries and should not be penalized with extra install weight.

Decision: All document libraries live under pip install grampus-ai[documents] as optional extras. The three tool functions (read_pdf, read_docx, read_excel) check for their respective imports at call time and return ToolError(code="MISSING_DEPENDENCY") with a clear install hint when the extra is absent. The chunking layer (DocumentChunker) is pure Python and always available — agents can chunk arbitrary text without the extras installed.
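A call-time dependency check of this kind can be sketched with `importlib.util.find_spec`. The helper name `missing_dependency`, the exact result keys beyond "ok", and the hint wording are assumptions; the ADR only specifies the MISSING_DEPENDENCY code and the `grampus-ai[documents]` extra.

```python
import importlib.util
from typing import Optional


def missing_dependency(module_name: str, extra: str = "documents") -> Optional[dict]:
    """Return an error payload if module_name is not importable, else None."""
    if importlib.util.find_spec(module_name) is not None:
        return None
    return {
        "ok": False,
        "code": "MISSING_DEPENDENCY",
        "hint": f"pip install 'grampus-ai[{extra}]'",
    }


def read_docx_sketch(path: str) -> dict:
    """Illustrative reader: checks for python-docx at call time and never raises."""
    error = missing_dependency("docx")
    if error is not None:
        return error
    try:
        import docx  # provided by the python-docx package

        document = docx.Document(path)
        text = "\n".join(paragraph.text for paragraph in document.paragraphs)
        return {"ok": True, "text": text}
    except Exception as exc:  # keep the "never raise" contract
        return {"ok": False, "code": "READ_FAILED", "hint": str(exc)}
```

Checking at call time rather than import time means the tool module itself always imports cleanly, so agents on a lean install can still register the tool and get an actionable message.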

Chunking strategy: Recursive chunking (2026 benchmark winner, 69% E2E accuracy over 50 papers) is the default. The context_header field stores the heading breadcrumb ("Title > Section > Sub") separately from content; embedding layers concatenate the two for self-contained retrieval without polluting the stored text. The target chunk size is 512 tokens, and the FIXED strategy supports 10% overlap for sliding-window retrieval.
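The FIXED strategy's overlap and the header-plus-content concatenation can be illustrated as below. This is a sketch, not the DocumentChunker implementation: it treats list items as stand-ins for model tokens, and the function names and the blank-line separator are assumptions.

```python
def fixed_chunks(tokens: list, size: int = 512, overlap_ratio: float = 0.10) -> list:
    """Split tokens into windows of `size` that share `overlap_ratio` of their length."""
    overlap = int(size * overlap_ratio)
    step = size - overlap
    if size <= 0 or step <= 0:
        raise ValueError("size must be positive and overlap_ratio must be below 1.0")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


def embedding_input(context_header: str, content: str) -> str:
    """Text sent to the embedder; the stored content itself stays header-free."""
    return f"{context_header}\n\n{content}" if context_header else content


words = [f"w{i}" for i in range(25)]
windows = fixed_chunks(words, size=10, overlap_ratio=0.10)
```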

PDF reader priority: PyMuPDF (fitz) is preferred over pypdf because it is faster and handles complex layouts better. pypdf is the fallback when PyMuPDF is absent.

Consequences: - Core Grampus install stays lean; document-capable deployments add ~50 MB with [documents]. - [documents] is the established pattern for all future heavy optional dependency groups. - All three tool functions always return {"ok": bool, ...} — never raise; callers need no try/except. - Excel sheets are capped at 1000 rows to prevent runaway memory usage on large spreadsheets; truncation is noted in the chunk content. - The documents group is also included in [all] for convenience.

ADR-022: Code Analysis Tools — stdlib AST Engine + Subprocess Lint Runners¶

Status: Accepted

Context: Agents analyzing code need targeted structural queries, not raw file reads. Research (arXiv 2603.27277, Codebase-Memory, March 2026) demonstrated that structured code analysis tools reduce agent token usage by 10x and tool calls by 2.1x versus grep+file-read patterns. The tool surface must answer: "what's in this file?", "where is X defined?", "what are the lint issues?", "what are the type errors?"

Decision: Five tools built in two tiers: (1) pure-stdlib AST engine for symbol extraction, complexity, import analysis, and symbol search — zero dependencies; (2) subprocess thin wrappers around ruff and mypy — both already in the Grampus toolchain — with graceful degradation when not on PATH. No tree-sitter (binary dep, overkill for Python-first framework). No radon/lizard (cyclomatic complexity computed directly from ast.NodeVisitor in ~20 lines). No new entries in [project.dependencies].
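For a sense of how small the stdlib complexity computation is, here is a sketch built on ast.NodeVisitor. It uses one common counting convention (1 plus one per branch point, plus one per extra boolean operand); the class name and the exact set of counted nodes are assumptions and may differ from the Grampus tool.

```python
import ast


class ComplexityVisitor(ast.NodeVisitor):
    """Counts decision points in a parsed module or function."""

    def __init__(self) -> None:
        self.complexity = 1  # one path through straight-line code

    def _branch(self, node: ast.AST) -> None:
        self.complexity += 1
        self.generic_visit(node)  # keep descending into nested branches

    visit_If = visit_For = visit_AsyncFor = visit_While = _branch
    visit_ExceptHandler = visit_IfExp = _branch

    def visit_BoolOp(self, node: ast.BoolOp) -> None:
        # "a and b and c" adds two extra paths
        self.complexity += len(node.values) - 1
        self.generic_visit(node)


def cyclomatic_complexity(source: str) -> int:
    visitor = ComplexityVisitor()
    visitor.visit(ast.parse(source))
    return visitor.complexity


SAMPLE = """
def f(x):
    if x > 0 and x < 10:
        return 1
    for i in range(x):
        pass
    return 0
"""
sample_complexity = cyclomatic_complexity(SAMPLE)
```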

Consequences: - All five tools work on any Grampus installation — no [analysis] extras required - Lint and type-check tools degrade gracefully: return ok with available=False + install hint - Subprocess runners are tested with mocked subprocess — integration against real ruff/mypy is handled implicitly by the existing CI which runs ruff and mypy on every push - Symbol search is O(files) — the 200-file default cap keeps it interactive-speed for typical repos
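The graceful-degradation behaviour for subprocess runners can be sketched with `shutil.which`. The function name `run_tool_sketch`, the result keys other than `ok` and `available`, and the timeout value are assumptions for illustration.

```python
import shutil
import subprocess


def run_tool_sketch(tool: str, args: list, timeout: float = 60.0) -> dict:
    """Run an external tool if it is on PATH; otherwise report it as unavailable."""
    executable = shutil.which(tool)
    if executable is None:
        # Not an error: the caller gets ok=True with an install hint.
        return {"ok": True, "available": False, "hint": f"pip install {tool}"}
    try:
        completed = subprocess.run(
            [executable, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,  # linters exit non-zero when they find issues
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "available": True, "hint": f"{tool} timed out"}
    return {
        "ok": True,
        "available": True,
        "returncode": completed.returncode,
        "output": completed.stdout,
    }
```

A call such as `run_tool_sketch("ruff", ["check", "src/"])` would then return either the lint output or an install hint, without raising in the missing-tool case.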

ADR-023: Multi-Provider Embedding Service with Per-Memory-Type Routing¶

Status: Accepted

Context: EmbeddingService was hardwired to a single OpenAI client. Three production problems motivated this change: (1) no way to use cheaper/faster local embeddings (Ollama) for low-stakes memory types like working memory while keeping a higher-quality model for semantic memory; (2) no way to use Cohere's domain-tuned Embed v3 models; (3) a silent dimension-mismatch bug: switching providers without updating the pgvector column dimensions silently drops all writes with no error, confirmed in multiple production incident reports (2025–2026). Additionally, Cohere Embed v3+ requires an input_type parameter ("search_document" vs "search_query") that the old single-provider API had no mechanism to expose; omitting it is a silent quality degradation.

Decision: Introduce EmbeddingProvider ABC with three concrete implementations (OpenAIEmbeddingProvider, CohereEmbeddingProvider, OllamaEmbeddingProvider). Refactor EmbeddingService to wrap any provider while preserving the existing .embed() / .embed_batch() interface exactly — all call sites are unchanged. Add .dimensions property to surface the provider's output dimension for pgvector validation at setup time, not at write time. Add EmbeddingRouter for optional per-memory-type provider routing (opt-in; existing code that passes a single EmbeddingService is unaffected). Add an optional input_type parameter to embed() and embed_batch() so Cohere's search_query / search_document distinction is correctly handled without leaking provider-specific concepts into callers that don't need it.

Consequences: - All existing .embed(text) call sites continue to work without modification - OllamaEmbeddingProvider uses httpx (already a core dep) — zero new required dependencies - OpenAIEmbeddingProvider and CohereEmbeddingProvider require their respective optional extras ([openai], [cohere]) - .dimensions property enables pgvector setup code to validate column width before the first write, converting the silent dimension-mismatch bug into a startup-time error - EmbeddingRouter is duck-type compatible with EmbeddingService for the three shared methods (.embed(), .embed_batch(), .dimensions), so it can be injected anywhere an EmbeddingService is accepted - Cache keys now include a provider name prefix — existing cached embeddings are invalidated on upgrade (acceptable: the cache is a performance optimisation, not ground truth)
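The provider abstraction and the setup-time dimension check could be shaped roughly as follows. The class names `EmbeddingProviderSketch` and `FakeProvider`, the `validate_dimensions` helper, and the method signatures are illustrative assumptions; the real EmbeddingProvider ABC is not shown in this ADR.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Optional


class EmbeddingProviderSketch(ABC):
    """Minimal provider interface: an output dimension plus an embed call."""

    @property
    @abstractmethod
    def dimensions(self) -> int: ...

    @abstractmethod
    async def embed(self, text: str, input_type: Optional[str] = None) -> list: ...


class FakeProvider(EmbeddingProviderSketch):
    """Test double returning zero vectors; ignores input_type like a non-Cohere provider would."""

    def __init__(self, dimensions: int) -> None:
        self._dimensions = dimensions

    @property
    def dimensions(self) -> int:
        return self._dimensions

    async def embed(self, text: str, input_type: Optional[str] = None) -> list:
        return [0.0] * self._dimensions


def validate_dimensions(provider: EmbeddingProviderSketch, column_dimensions: int) -> None:
    """Fail at setup time instead of letting mismatched writes disappear later."""
    if provider.dimensions != column_dimensions:
        raise ValueError(
            f"provider emits {provider.dimensions}-d vectors but the column is "
            f"vector({column_dimensions})"
        )


provider = FakeProvider(dimensions=8)
validate_dimensions(provider, column_dimensions=8)  # passes silently
vector = asyncio.run(provider.embed("hello", input_type="search_query"))
```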

ADR-024: Lifecycle Hook Plugin System — stdlib Entry Points with Async-Native Registry¶

Status: Accepted

Context: Production deployments of Grampus require observability integrations (Datadog, Splunk), compliance controls (PII redaction, audit logging, HIPAA content filtering), and cross-cutting concerns (rate limiting, cost allocation, canary routing) that cannot be baked into the core framework without creating vendor coupling. Three prior approaches were considered: (a) subclassing AgentRunner / MemoryManager — brittle, requires forking for each integration; (b) middleware wrapping via httpx-style transports — applies only to HTTP calls, misses in-process hooks; (c) event callbacks via asyncio.Queue — decoupled but no pre-hook mutation capability, no blocking support. None of these patterns cover the full lifecycle (start → LLM call → tool call → memory write → end → error) with both observational and mutating semantics.

Decision: Implement a src/grampus/plugins/ package providing a two-tier hook system: (1) pre-hooks (pre_llm_call, pre_tool_call, pre_memory_write) run sequentially in priority order, thread their return values as a transformation pipeline, and surface HookBlockedError as SafetyError/MemorySecurityError with code="PLUGIN_BLOCKED"; (2) observational hooks (on_agent_start, on_agent_end, post_llm_call, post_tool_call, post_memory_write, on_error) run concurrently via asyncio.gather, with individual plugin failures logged and suppressed — a broken plugin never crashes agent execution. Third-party plugins are discovered via importlib.metadata.entry_points(group="grampus.plugins").

Key design choices: 1. plugin_manager=None default — AgentRunner and MemoryManager with no plugin manager are behaviorally identical to pre-H49. Zero overhead for deployments that don't use plugins. 2. HookBlockedError is the only bubbling exception — all other plugin exceptions are suppressed in both pre-hooks (caught, logged, chain continues) and observational hooks (gathered, logged, suppressed). This asymmetry is intentional: mutations must succeed cleanly or be skipped, but observation failures must never crash the agent. 3. Frozen context dataclasses — all 7 context objects are @dataclass(frozen=True). Plugins receive read-only contexts; they cannot modify agent state through the context object. 4. Inline imports under TYPE_CHECKING — PluginManager appears only in TYPE_CHECKING blocks in runner.py and memory/manager.py; actual plugin types are imported inline inside if self._plugins: guards. This eliminates any circular import risk. 5. Priority controls sequential order — lower priority integer runs earlier in pre-hooks. Observational hooks use insertion order (priority-independence for concurrent dispatch). 6. GrampusPlugin base class with no-op hooks — subclasses override only the hooks they need; all others are silent pass-throughs by default.
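The two dispatch modes described above (sequential, priority-ordered pre-hooks that thread a value, and concurrent observational hooks whose failures are suppressed) can be sketched as below. The function names, the `(priority, hook)` tuple format, and the example hooks are assumptions; only the semantics are taken from the text.

```python
import asyncio
import logging

logger = logging.getLogger("plugin_sketch")


class HookBlockedError(Exception):
    """The only exception allowed to propagate out of a pre-hook."""


async def run_pre_hooks(hooks: list, value):
    """hooks: (priority, async callable) pairs; lower priority runs earlier."""
    for _, hook in sorted(hooks, key=lambda item: item[0]):
        try:
            value = await hook(value)
        except HookBlockedError:
            raise
        except Exception:
            logger.exception("pre-hook failed; skipping it and continuing the chain")
    return value


async def run_observers(observers: list, event) -> int:
    """Run observers concurrently; return the number that failed."""
    results = await asyncio.gather(*(obs(event) for obs in observers), return_exceptions=True)
    failures = [result for result in results if isinstance(result, Exception)]
    for failure in failures:
        logger.warning("observer failed: %r", failure)
    return len(failures)


async def shout(text):
    return text.upper()

async def broken(text):
    raise ValueError("plugin bug")

async def exclaim(text):
    return text + "!"

async def blocker(text):
    raise HookBlockedError("blocked by policy")

async def ok_observer(event):
    return None


hooks = [(10, exclaim), (1, shout), (5, broken)]
transformed = asyncio.run(run_pre_hooks(hooks, "hi"))
```

In this run the priority-1 hook uppercases the input, the priority-5 hook fails and is skipped, and the priority-10 hook appends the exclamation mark.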

Consequences: - Zero new required dependencies: stdlib importlib.metadata, asyncio, dataclasses only - Third-party plugins ship as Python packages with [project.entry-points."grampus.plugins"] in their pyproject.toml; create_manager_from_entry_points() loads them automatically - Pre-hook mutation (messages, tool arguments, memory content) enables compliance plugins to redact PII, inject system context, or rewrite arguments without modifying agent code - The HookBlockedError → SafetyError(code="PLUGIN_BLOCKED") / MemorySecurityError(code="PLUGIN_BLOCKED") mapping gives callers a machine-readable signal that is distinct from model errors, tool errors, and budget errors - contextlib.suppress(Exception) wraps all observational hook calls in runner and memory manager; plugin failures in on_agent_start, post_llm_call, etc. are logged but never surface to the caller

ADR-025: Content-Addressed Agent Versioning with Deterministic A/B Routing¶

Status: Accepted

Context: Agent definitions (system prompt, tools, temperature, model) change over time and teams need to track what was deployed when, roll back safely, and run controlled experiments. Three specific failure modes motivated this design: (1) without version identity, a prompt regression is invisible until users complain — there is no diff, no audit trail, and no rollback path; (2) existing A/B testing patterns require a separate routing service and database, creating operational overhead for a single toggle; (3) user assignment to A/B buckets is often non-sticky — the same user sees different agent behaviors on consecutive calls — which contaminates experiment results and degrades user experience.

Decision: Implement a self-contained versioning layer in src/grampus/versioning/ with four interlocking components:

1. Content-addressed version IDs: compute_version_id(definition) produces a deterministic SHA-256 over a canonicalized (key-sorted, tool-list-sorted) JSON representation of the AgentDefinition. The same definition always produces the same ID regardless of when or where it is created. Identical definitions are deduplicated at save time without special logic.
2. Dapr-backed persistence: VersionStore stores versions and deployments via the existing DaprStateStore abstraction. Two internal Pydantic wrappers (_VersionIndex, _DeploymentHistory) store the per-agent version index and capped (50-entry) deployment history as first-class state entries, enabling resilient list_versions() that skips corrupt records rather than failing.
3. Sticky deterministic A/B routing: VersionRouter.resolve() assigns users to control or treatment by computing SHA-256(experiment_id:user_id) % 100 < int(split * 100). The hash is deterministic: the same user always lands in the same bucket for a given experiment, with no server-side session state required. The routing logic is wrapped in contextlib.suppress so a broken experiment never crashes agent execution.
4. Pure-Python significance testing: two_proportion_z_test (for eval pass rate) and welch_t_test (for continuous metrics) are implemented from scratch using stdlib math.erfc and Lentz's continued-fraction regularized incomplete beta function. No scipy dependency. Auto-promotion fires when p < auto_promote_threshold and both groups have >= min_samples runs.
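The content-addressed ID and the sticky bucket assignment described above both reduce to a SHA-256 over a canonical string. The sketch below assumes the definition is a plain dict with a `tools` list of strings, and assumes that hash values below the split threshold map to treatment; neither detail is confirmed by this ADR, and the function names are illustrative.

```python
import hashlib
import json


def compute_version_id_sketch(definition: dict) -> str:
    """SHA-256 over key-sorted JSON with the tool list sorted."""
    canonical = dict(definition)
    canonical["tools"] = sorted(canonical.get("tools", []))
    encoded = json.dumps(canonical, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def assign_bucket(experiment_id: str, user_id: str, split: float) -> str:
    """Deterministic bucket: SHA-256(experiment_id:user_id) % 100 < int(split * 100)."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).hexdigest()
    in_treatment = int(digest, 16) % 100 < int(split * 100)
    return "treatment" if in_treatment else "control"  # assumed mapping


# Same definition, different key order and tool order.
v1 = compute_version_id_sketch({"model": "m", "temperature": 0.2, "tools": ["search", "calc"]})
v2 = compute_version_id_sketch({"tools": ["calc", "search"], "temperature": 0.2, "model": "m"})
v3 = compute_version_id_sketch({"model": "m", "temperature": 0.7, "tools": ["search", "calc"]})
```

Because the bucket depends only on the experiment and user identifiers, repeated calls for the same pair always agree, which is what makes the assignment sticky without session state.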

Consequences: - Zero new required dependencies — stdlib hashlib, difflib, math, uuid plus existing Pydantic and Dapr client - AgentRunner gains one optional version_router parameter; default None means no behavioral change — all existing callers are unaffected - VersionRouter is duck-type injectable: any object with async resolve(agent_id, user_id) can be substituted in tests without importing the full Dapr stack - compute_version_id is pure (no I/O, no randomness) — identical inputs always produce identical outputs, making version IDs reproducible across environments and process restarts - Deployment history is capped at 50 entries per agent; older entries are silently dropped — acceptable since the audit trail in the Dapr event log (ADR-005) is the authoritative record - The welch_t_test fallback for continuous metrics (avg_cost_usd, avg_latency_seconds) does not store raw sample arrays, so it uses a 10%-difference heuristic rather than a true p-value; teams needing rigorous continuous-metric significance should call record_eval_result on a discretized pass/fail threshold and use the eval_pass_rate metric path instead
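For the eval pass-rate path, a two-sided two-proportion z-test needs only `math.sqrt` and `math.erfc`. The sketch below uses the standard pooled-variance form; the function names, argument order, and the shape of the promotion check are assumptions and may not match the library's two_proportion_z_test.

```python
import math


def two_proportion_z_test_sketch(successes_a: int, n_a: int, successes_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: both groups share one pass rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if standard_error == 0.0:
        return 0.0, 1.0  # all passes or all failures: no evidence of a difference
    z = (p_b - p_a) / standard_error
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def should_auto_promote(p_value: float, n_a: int, n_b: int, threshold: float, min_samples: int) -> bool:
    """Promotion rule from the ADR: p below threshold and both groups large enough."""
    return p_value < threshold and n_a >= min_samples and n_b >= min_samples


z_stat, p_val = two_proportion_z_test_sketch(50, 100, 70, 100)
```

This check only establishes that the groups differ; deciding which variant to promote would still require comparing the two pass rates.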

ADR-026: RAG Pipeline as a First-Class Demo Template¶

Status: Accepted

Context: RAG (Retrieval-Augmented Generation) is the highest-demand agent use case in production deployments. Every team building with Grampus needs to index documents and answer questions from them. Without a complete, working reference implementation, each team reinvents the same pipeline — often making the same mistakes: dense-only retrieval (misses keyword queries), IVFFlat indexing (requires training, poor incremental performance), context concatenation without position-aware ordering, and no citation grounding. A complete template that avoids these mistakes removes the most common adoption barrier.

Decision: Implement demos/rag/ as a production-ready RAG template using the full Grampus stack. Key design choices baked in from 2025-2026 research and benchmarks:

1. Hybrid BM25 + vector search with RRF: PostgreSQL tsvector provides BM25 at zero additional infrastructure cost. RRF constant k=60 is the research-validated default.
2. HNSW over IVFFlat: no training pass, better recall, handles incremental inserts.
3. Lost-in-the-middle reordering: places the highest-scoring chunks at the start and end of the context window, based on 2023 Stanford findings replicated across 2024-2025.
4. Namespace scoping as a hard requirement: every SQL query filters by namespace, preventing cross-tenant data leakage without application-level enforcement.
5. Dimension mismatch detection at setup time: RAGStore.setup() checks existing table dimensions against the embedding service, converting a silent data corruption bug into a clear startup error.
6. Closure-based tool factory: make_retrieve_tool() binds store and embedding service into the tool function without global state, demonstrating the correct pattern for stateful tools in Grampus.
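Two of the choices above are easy to show in isolation: reciprocal rank fusion with k=60, and reordering so the strongest chunks sit at the edges of the context. This is a sketch on in-memory lists; the function names and the alternating placement scheme are assumptions, and the template itself may implement both steps differently (for example, fusing in SQL).

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def reorder_for_context(chunks_best_first: list) -> list:
    """Alternate chunks between the front and the back so the weakest end up in the middle."""
    front, back = [], []
    for index, chunk in enumerate(chunks_best_first):
        (front if index % 2 == 0 else back).append(chunk)
    return front + back[::-1]


bm25_hits = ["a", "b", "c"]      # hypothetical keyword ranking
vector_hits = ["b", "c", "d"]    # hypothetical vector ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
ordered = reorder_for_context([1, 2, 3, 4, 5])  # 1 is the best chunk
```

With these inputs, documents that appear in both rankings outrank those that lead only one of them, which is the behaviour hybrid search relies on.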

Consequences: - Template uses asyncpg directly for pgvector operations — Dapr state store API does not expose arbitrary SQL needed for hybrid search. Added as [rag] optional extra. - Template works without Dapr sidecar (only PostgreSQL required) to minimize quickstart friction. Production deployments add Dapr for embedding caching via Redis. - demos/ is not type-checked by mypy (demo code, not library code). New RAGError in src/grampus/core/errors.py is the only library addition from this phase. - Evaluation script uses LLM-as-judge (Claude) — requires Anthropic API key. The scoring is intentionally simplified (2 metrics) to remain readable as a reference, not to replace a full RAGAS setup in production.