Uncertainty Quantification
Uncertainty Quantification (UQ) gives every Grampus agent a real-time confidence signal and a three-tier escalation ladder. Instead of silently returning a low-quality answer when an LLM is unsure, UQ measures how confident the model actually is — step by step, across the entire run — and takes a principled action: proceed, log a warning, pause for human review, or abort.
Why verbalized confidence is not enough
Most agentic frameworks ask the model to write "confidence": 0.9 in its JSON output and treat that as the ground truth. Research shows this is unreliable:
| Signal | ECE (lower = better) | Notes |
|---|---|---|
| Verbalized confidence | 0.377+ | Aligned models cluster at 90–100% regardless of accuracy (arXiv 2412.14737, KDD 2025) |
| P(True) self-evaluation | ~0.10 | Single follow-up call; works on any black-box API |
| Semantic entropy | best AUROC | N-sample entropy; most accurate but slower (Farquhar et al., Nature 2024) |
Grampus uses all three signals in a dual-process architecture: a fast path that always runs, and a slow path that activates only when the fast path is uncertain.
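For readers unfamiliar with the ECE column in the table above, the following is a minimal sketch of the standard equal-width-bin Expected Calibration Error. It is not taken from Grampus; the function name and the choice of 10 bins are assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    # Collect (confidence, correctness) pairs per bin.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# An overconfident model: always reports 0.95 but is right only 6 times out of 10.
overconfident = expected_calibration_error([0.95] * 10, [True] * 6 + [False] * 4)
print(f"ECE = {overconfident:.2f}")
```

A model that always reports 0.95 while being correct 60% of the time has a gap of 0.35 between stated confidence and accuracy, which is the kind of miscalibration the verbalized-confidence row describes.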
Architecture: Dual-Process AUQ
          LLM response
                │
                ▼
┌─────────────────────────────────────────────┐
│  System 1 (fast, always runs)               │
│  1. Extract verbalized confidence           │
│  2. P(True) follow-up call (optional)       │
│  3. Weighted, calibrated fusion             │
└───────────────┬─────────────────────────────┘
                │ fused ∈ trigger zone?
                ▼
┌─────────────────────────────────────────────┐
│  System 2 (slow, opt-in)                    │
│  4. Adaptive semantic entropy sampling      │
│     ├─ Start with 2 samples                 │
│     ├─ Jaccard ≥ 0.6 → early stop           │
│     └─ Else extend to max_samples           │
│  5. Pessimistic fusion: min(fast, entropy)  │
└───────────────┬─────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────┐
│  SAUP Propagation (across steps)            │
│  propagated = w·fused + (1-w)·cumulative    │
│  weights: decision=0.70, llm=0.55,          │
│           tool=0.45, memory_read=0.35       │
└───────────────┬─────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────┐
│  Three-Tier Escalation Policy               │
│ ≥ 0.80 → LOW      → PROCEED                 │
│ ≥ 0.60 → MEDIUM   → PROCEED_WITH_LOG        │
│ ≥ 0.40 → HIGH     → PAUSE_FOR_HUMAN         │
│ < 0.40 → CRITICAL → ABORT (UncertaintyError)│
└─────────────────────────────────────────────┘
Quick start
from grampus.orchestration import AgentRunner, UncertaintyMonitor, UncertaintyPolicy

policy = UncertaintyPolicy(
    low_threshold=0.80,     # PROCEED at or above this
    medium_threshold=0.60,  # PROCEED_WITH_LOG at or above this (and below low_threshold)
    high_threshold=0.40,    # PAUSE_FOR_HUMAN at or above this; ABORT below it
    enable_p_true=True,     # run P(True) follow-up call
    irreversible_tool_names=["send_email", "delete_records", "deploy"],
)
monitor = UncertaintyMonitor(policy=policy)
runner = AgentRunner(
    model_client=client,
    tool_executor=executor,
    uncertainty_monitor=monitor,
)
result = await runner.run(agent_def, "Summarise this legal brief.", session_id="s1")
if result.status == AgentStatus.WAITING_FOR_HUMAN:
    meta = result.metadata.get("uncertainty", {})
    print(f"Paused, level: {meta['overall_level']}, confidence: {meta['cumulative_confidence']:.2f}")
Escalation tiers
| Level | Propagated confidence | Action | What happens |
|---|---|---|---|
| LOW | ≥ 0.80 | PROCEED | Run continues normally |
| MEDIUM | ≥ 0.60 | PROCEED_WITH_LOG | Warning logged; run continues |
| HIGH | ≥ 0.40 | PAUSE_FOR_HUMAN | status = WAITING_FOR_HUMAN; optional reflection prompt injected |
| CRITICAL | < 0.40 | ABORT | UncertaintyError raised |
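The table can be read as a simple threshold cascade. The sketch below illustrates that mapping in plain Python; `classify` and its string return values are hypothetical stand-ins for illustration, not the Grampus classes, and the defaults mirror the thresholds listed above.

```python
def classify(
    propagated: float,
    low_threshold: float = 0.80,
    medium_threshold: float = 0.60,
    high_threshold: float = 0.40,
) -> tuple[str, str]:
    """Map a propagated confidence to an (uncertainty level, action) pair."""
    # Each threshold is the inclusive floor of its tier.
    if propagated >= low_threshold:
        return "LOW", "PROCEED"
    if propagated >= medium_threshold:
        return "MEDIUM", "PROCEED_WITH_LOG"
    if propagated >= high_threshold:
        return "HIGH", "PAUSE_FOR_HUMAN"
    return "CRITICAL", "ABORT"


for value in (0.91, 0.72, 0.45, 0.12):
    level, action = classify(value)
    print(f"{value:.2f} -> {level:8s} -> {action}")
```

Note that the level names describe uncertainty, not confidence: a high confidence value yields the LOW level.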
Irreversible tool override
When a tool name matches any entry in irreversible_tool_names (case-insensitive substring match), MEDIUM escalates to PAUSE_FOR_HUMAN. LOW is always safe, even for irreversible tools.
policy = UncertaintyPolicy(
    irreversible_tool_names=["send_email", "delete", "deploy", "transfer"],
)
If the agent is about to call send_email_to_client and cumulative confidence is 0.72 (MEDIUM), execution pauses — even though MEDIUM normally proceeds.
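The matching rule described above can be sketched as follows. This is an illustration of a case-insensitive substring check, assuming the configured entry is searched for inside the tool name (as in the send_email_to_client example); the helper names are hypothetical and not part of the Grampus API.

```python
def is_irreversible(tool_name: str, irreversible_tool_names: list[str]) -> bool:
    """True if any configured entry occurs inside the tool name, ignoring case."""
    lowered = tool_name.lower()
    return any(entry.lower() in lowered for entry in irreversible_tool_names)


def apply_override(level: str, action: str, tool_name: str,
                   irreversible_tool_names: list[str]) -> str:
    """Escalate MEDIUM to PAUSE_FOR_HUMAN for irreversible tools; leave other tiers alone."""
    if level == "MEDIUM" and is_irreversible(tool_name, irreversible_tool_names):
        return "PAUSE_FOR_HUMAN"
    return action


names = ["send_email", "delete", "deploy", "transfer"]
print(apply_override("MEDIUM", "PROCEED_WITH_LOG", "send_email_to_client", names))
print(apply_override("LOW", "PROCEED", "send_email_to_client", names))
```

Because the match is a substring test, a short entry such as "delete" also covers tools like delete_records or bulk_delete, so entries should be chosen with that breadth in mind.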
Reflection injection
When HIGH uncertainty is detected on an LLM step and inject_reflection_on_high=True (the default), Grampus injects a System-2 reflection message before pausing:
Before you continue, assess your own uncertainty explicitly.
List the specific things you are NOT confident about in your current reasoning.
For each uncertain point state: (1) what you know, (2) what you don't know,
(3) what additional information would resolve the uncertainty.
The reflection appears as a SYSTEM message in result.messages. When you call runner.resume() with a human response, the next LLM call sees the reflection and the human's guidance together.
Semantic entropy (slow path)
Enable semantic entropy sampling for high-stakes tasks where P(True) accuracy is not sufficient:
policy = UncertaintyPolicy(
    enable_p_true=True,
    enable_semantic_sampling=True,  # opt-in slow path
)
estimator = UncertaintyEstimator(
    min_samples=2,               # adaptive: start here
    max_samples=5,               # extend to this if first pair disagrees
    early_stop_jaccard=0.60,     # first-pair agreement threshold
    semantic_trigger_low=0.50,   # only sample when fused is in this zone
    semantic_trigger_high=0.72,
)
monitor = UncertaintyMonitor(estimator=estimator, policy=policy)
The adaptive algorithm (arXiv 2504.03579) saves ~47% of sampling cost:
- Sample 2 responses at temperature 0.8.
- If first-pair Jaccard similarity ≥ 0.60 → stop early (model is clearly consistent).
- Otherwise → extend to max_samples.
- Compute Shannon entropy over semantic clusters. Cluster two responses together if Jaccard ≥ 0.40.
- confidence = 1.0 - H_norm. Take min(fast_path, entropy_conf) (pessimistic fusion).
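The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the Grampus implementation: Jaccard similarity is computed over lower-cased whitespace tokens, clustering is greedy against the first member of each cluster, entropy is normalised by log(n) for n samples, and `sample` is a hypothetical callable that returns one model response (sampling temperature is not modelled here).

```python
import math
from typing import Callable


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased whitespace tokens (assumed tokenisation)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cluster(responses: list[str], threshold: float = 0.40) -> list[list[str]]:
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    clusters: list[list[str]] = []
    for response in responses:
        for members in clusters:
            if jaccard(response, members[0]) >= threshold:
                members.append(response)
                break
        else:
            clusters.append([response])
    return clusters


def entropy_confidence(responses: list[str]) -> float:
    """1 - normalised Shannon entropy over cluster sizes, clamped at 0."""
    n = len(responses)
    if n < 2:
        return 1.0
    probs = [len(c) / n for c in cluster(responses)]
    entropy = -sum(p * math.log(p) for p in probs)
    # log(n) is the entropy when every sample forms its own cluster.
    return max(0.0, 1.0 - entropy / math.log(n))


def slow_path(sample: Callable[[], str], fast_conf: float,
              min_samples: int = 2, max_samples: int = 5,
              early_stop_jaccard: float = 0.60) -> tuple[float, int]:
    """Return (pessimistically fused confidence, number of samples drawn)."""
    responses = [sample() for _ in range(min_samples)]
    # Extend only when the first pair disagrees.
    if jaccard(responses[0], responses[1]) < early_stop_jaccard:
        responses += [sample() for _ in range(max_samples - min_samples)]
    return min(fast_conf, entropy_confidence(responses)), len(responses)


# Deterministic stub in place of a real model: the first two answers agree.
answers = iter(["Paris is the capital", "paris is the capital"])
print(slow_path(lambda: next(answers), fast_conf=0.65))
```

When the first pair agrees, only two samples are drawn and the fast-path value is returned unchanged, because a single cluster has zero entropy. The extra cost is paid only when the first pair diverges.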
SAUP propagation across steps
A single uncertain step should not be erased by subsequent confident steps. SAUP (arXiv 2412.01033, ACL 2025) weights each step type by its forward impact:
| Step type | Weight (w) | Effect |
|---|---|---|
| decision | 0.70 | High influence: decisions cascade downstream |
| llm_call | 0.55 | Moderate: reasoning may drift |
| tool_call | 0.45 | Lower: results are often grounding facts |
| memory_read | 0.35 | Lowest: retrieval rarely introduces new uncertainty |
Formula: propagated(t) = w × fused(t) + (1 − w) × cumulative(t−1)
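With w = 0.55, each update keeps a (1 − w) = 0.45 share of the previous cumulative value, so a highly uncertain step 1 still lowers the propagated confidence at step 3, even when steps 2 and 3 are confident.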
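A short worked example makes the formula concrete. The sketch below applies it to three llm_call steps; it assumes the cumulative value starts at 1.0 and that each step's propagated value becomes the next step's cumulative value, neither of which is stated explicitly above. The `propagate` helper is hypothetical.

```python
STEP_WEIGHTS = {
    "decision": 0.70,
    "llm_call": 0.55,
    "tool_call": 0.45,
    "memory_read": 0.35,
}


def propagate(steps: list[tuple[str, float]], initial: float = 1.0) -> list[float]:
    """Apply propagated(t) = w * fused(t) + (1 - w) * cumulative(t - 1) to each step."""
    cumulative = initial  # assumed starting value
    history = []
    for step_type, fused in steps:
        w = STEP_WEIGHTS[step_type]
        cumulative = w * fused + (1.0 - w) * cumulative
        history.append(cumulative)
    return history


# One very uncertain step followed by two confident ones.
trace = propagate([("llm_call", 0.20), ("llm_call", 0.95), ("llm_call", 0.95)])
print([round(value, 3) for value in trace])
```

Under these assumptions the propagated values are about 0.56, 0.77 and 0.87. The second step is still in the MEDIUM tier despite a fused confidence of 0.95, and the third step remains below 0.95 because part of the first step's uncertainty is carried forward.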
Graph integration: uncertainty_guard_node
Insert an explicit uncertainty checkpoint between graph nodes:
from grampus.orchestration import uncertainty_guard_node, Graph, human_node

guard = uncertainty_guard_node(
    monitor,
    step_type="decision",          # used for SAUP weight lookup
    escalate_node="human_review",  # sets metadata["uncertainty_escalate"] = True
)

async def route(state):
    if state.metadata.get("uncertainty_escalate"):
        return "human_review"
    return "next_step"

graph = (
    Graph(graph_id="qa")
    .add_node("llm_step", llm_handler, entry=True)
    .add_node("guard", guard)
    .add_conditional_edge("guard", route, {"human_review": "human_review", "next_step": "final"})
    .add_node("human_review", human_node("Uncertain answer: please review."))
    .add_node("final", final_handler)
)
Belief state and metadata
After each runner.run(), uncertainty metadata is attached to result.metadata["uncertainty"]:
{
    "overall_level": "medium",      # UncertaintyLevel as string
    "cumulative_confidence": 0.74,  # EMA-propagated session-level confidence
    "total_steps": 4,
    "high_uncertainty_steps": 1,
    "last_step_id": "llm_3"
}
Access the full AgentBeliefState from the monitor after a run:
belief = monitor.get_belief_state()
for step in belief.step_uncertainties:
    print(f"{step.step_id}: level={step.level} propagated={step.propagated_confidence:.3f}")
OTEL spans
Up to three custom spans are emitted per step (when a tracer is passed to UncertaintyMonitor):
| Span | Emitted when | Key attributes |
|---|---|---|
| uncertainty.estimate | Every step | step_id, step_type, verbalized_confidence, p_true_confidence, fused_confidence, propagated_confidence, level, action, p_true_ran, samples_used |
| uncertainty.semantic | Semantic sampling ran | step_id, sample_count, early_stopped |
| uncertainty.escalate | HIGH or CRITICAL level | step_id, level, cumulative_confidence, irreversible |
from grampus.observability.tracer import GrampusTracer
tracer = GrampusTracer(service_name="my-agent", otlp_endpoint="http://localhost:4317")
monitor = UncertaintyMonitor(policy=policy, tracer=tracer)
Handling a paused run
result = await runner.run(agent_def, task, session_id="s1")
if result.status == AgentStatus.WAITING_FOR_HUMAN:
    # Inspect what caused the pause
    meta = result.metadata.get("uncertainty", {})
    print(f"Paused: {meta['overall_level']} confidence ({meta['cumulative_confidence']:.2f})")
    # Optionally read the reflection message
    for msg in result.messages:
        if msg.role == Role.SYSTEM and "NOT confident" in (msg.content or ""):
            print("Reflection:", msg.content)
    # Resume after human provides guidance
    human_guidance = "Focus on section 3.2 of the document."
    resumed = await runner.resume("my-agent", "s1", human_guidance)
Configuration reference
UncertaintyPolicy
| Field | Type | Default | Description |
|---|---|---|---|
| low_threshold | float | 0.80 | Propagated confidence floor for LOW (PROCEED) |
| medium_threshold | float | 0.60 | Floor for MEDIUM (PROCEED_WITH_LOG) |
| high_threshold | float | 0.40 | Floor for HIGH (PAUSE_FOR_HUMAN) |
| enable_p_true | bool | True | Run P(True) follow-up call after each LLM response |
| enable_semantic_sampling | bool | False | Enable adaptive semantic entropy slow path |
| irreversible_tool_names | list[str] | [] | Substrings; MEDIUM → PAUSE on match |
| inject_reflection_on_high | bool | True | Inject System-2 reflection before PAUSE |
UncertaintyEstimator
| Field | Type | Default | Description |
|---|---|---|---|
| enable_p_true | bool | True | Controls P(True) calls |
| verbalized_weight | float | 0.4 | Fusion weight for verbalized signal |
| p_true_weight | float | 0.6 | Fusion weight for P(True) signal |
| verbalized_calibration_bias | float | 0.25 | ECE correction for verbalized confidence |
| p_true_calibration_bias | float | 0.10 | ECE correction for P(True) |
| min_samples | int | 2 | Adaptive entropy: minimum samples before early-stop check |
| max_samples | int | 5 | Adaptive entropy: extend to this on disagreement |
| semantic_trigger_low | float | 0.50 | Lower bound of sampling trigger zone |
| semantic_trigger_high | float | 0.72 | Upper bound of sampling trigger zone |
| early_stop_jaccard | float | 0.60 | First-pair agreement threshold for early stop |
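To show how the fusion weights, calibration biases and trigger zone in this table might interact, here is a rough sketch of a fast-path fusion. The exact formula Grampus uses is not given in this document, so the details are assumptions: each bias is subtracted from its raw signal and clamped to [0, 1], the two signals are combined as a weight-normalised average, and the trigger zone bounds are treated as inclusive. The function names are hypothetical.

```python
from typing import Optional


def _clamp(value: float) -> float:
    return max(0.0, min(1.0, value))


def fuse_fast_path(
    verbalized: Optional[float],
    p_true: Optional[float],
    verbalized_weight: float = 0.4,
    p_true_weight: float = 0.6,
    verbalized_calibration_bias: float = 0.25,
    p_true_calibration_bias: float = 0.10,
) -> float:
    """Weighted average of bias-corrected signals (assumed form of the fusion)."""
    signals: list[tuple[float, float]] = []
    if verbalized is not None:
        signals.append((verbalized_weight, _clamp(verbalized - verbalized_calibration_bias)))
    if p_true is not None:
        signals.append((p_true_weight, _clamp(p_true - p_true_calibration_bias)))
    if not signals:
        raise ValueError("at least one confidence signal is required")
    total_weight = sum(weight for weight, _ in signals)
    return sum(weight * value for weight, value in signals) / total_weight


def in_trigger_zone(fused: float, low: float = 0.50, high: float = 0.72) -> bool:
    """True when the slow path would be considered (inclusive bounds assumed)."""
    return low <= fused <= high


fused = fuse_fast_path(verbalized=0.95, p_true=0.80)
print(f"fused={fused:.2f}, slow path considered: {in_trigger_zone(fused)}")
```

In this sketch a verbalized 0.95 and a P(True) of 0.80 both correct to 0.70, so the fused value is 0.70 and falls inside the default trigger zone of 0.50 to 0.72.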
Research citations
| Finding | Source | Baked-in design decision |
|---|---|---|
| Verbalized ECE ≥ 0.377 for frontier models | arXiv 2412.14737; ACM 3711896.3736569 | verbalized_calibration_bias=0.25; verbalized is weak signal (weight 0.4) |
| P(True) ECE ≈ 0.10 | Kadavath et al. 2022; validated 2023–2025 | P(True) is primary fast-path signal (weight 0.6) |
| Adaptive sampling saves 47% cost | arXiv 2504.03579 | min_samples=2, early-stop on Jaccard ≥ 0.6 |
| Semantic entropy best AUROC | Farquhar et al. 2024, Nature | Slow path; pessimistic fusion min(fast, entropy) |
| SAUP 20% AUROC improvement | arXiv 2412.01033; ACL 2025 pp. 6064–6073 | Per-step-type situational weights in UncertaintyPropagator |
| Dual-Process AUQ | arXiv 2601.15703, Jan 2026 | System 1 always; System 2 on uncertain zone |
| Three-tier escalation (production consensus) | Zylos Research April 2026 | PROCEED → PROCEED_WITH_LOG → PAUSE_FOR_HUMAN → ABORT |