Multi-Agent Debate¶
Multi-agent debate runs the same question past several LLMs simultaneously, lets them critique each other's reasoning, then aggregates the results into a single high-confidence answer. Use it when a single LLM call is not reliable enough — factual questions with high stakes, decisions with legal or financial impact, or any case where you need an auditable confidence score alongside the answer.
When to use debate vs. crew¶
| | Crew | Debate |
|---|---|---|
| Structure | Pipeline — each agent does a different job | Panel — all agents answer the same question |
| Composition | Decided before execution | Configured via DebateConfig |
| Output | Accumulated outputs from N agents | Single aggregated answer + confidence score |
| Cost model | Sequential; stops at each member | Concurrent per round; early-stops when agents agree |
| Best for | Research → critique → write pipelines | High-stakes Q&A, fact-checking, uncertain classifications |
Prerequisites¶
Debate uses whatever model clients you already configure — no additional dependencies.
Minimal example¶
import asyncio
import os

from grampus.core.models.anthropic import AnthropicClient
from grampus.orchestration.debate import (
    AggregationStrategy,
    DebateConfig,
    DebaterConfig,
    DebateOrchestrator,
)

async def main() -> None:
    client = AnthropicClient(api_key=os.environ["GRAMPUS_MODEL__ANTHROPIC_API_KEY"])
    cfg = DebateConfig(
        debaters=[
            DebaterConfig(model_client=client, model_id="claude-haiku-4-5", temperature=0.5),
            DebaterConfig(model_client=client, model_id="claude-sonnet-4-6", temperature=0.7),
            DebaterConfig(
                model_client=client,
                model_id="claude-sonnet-4-6",
                temperature=0.9,
                role_hint="You are a skeptical devil's advocate.",
            ),
        ],
        max_rounds=3,
        aggregation=AggregationStrategy.WEIGHTED_VOTE,
    )
    orch = DebateOrchestrator(cfg)
    result = await orch.run("Is the following contract clause legally enforceable in California? ...")
    print(f"Answer: {result.final_answer}")
    print(f"Confidence: {result.confidence:.0%}")
    print(f"Rounds run: {result.total_rounds_run}")
    print(f"Cost: ${result.total_cost_usd:.4f}")
    if result.escalate_to_human:
        print("⚠ Low confidence — flagged for human review")

asyncio.run(main())
How it works¶
Round structure¶
sequenceDiagram
participant Q as Question
participant D0 as Debater 0 (haiku)
participant D1 as Debater 1 (sonnet)
participant D2 as Debater 2 (devil's advocate)
participant Agg as Aggregator
Note over D0,D2: Round 1 — independent reasoning
Q->>D0: "Answer independently"
Q->>D1: "Answer independently"
Q->>D2: "Answer independently"
D0-->>Q: Answer A + confidence 0.82
D1-->>Q: Answer A + confidence 0.71
D2-->>Q: Answer B + confidence 0.65
Note over D0,D2: Round 2 — sycophancy-resistant critique
Q->>D0: "Restate your answer. Critique peers. Update only if logically compelled."
Q->>D1: same prompt + all peer answers
Q->>D2: same prompt + all peer answers
Note over Agg: Convergence ≥ threshold → stop early
Agg->>Q: Final answer: A, confidence 0.78
All debaters in each round run concurrently via asyncio.gather. Round N starts only after all of round N-1's positions are in.
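The round barrier described above can be illustrated with a small sketch. This is not the Grampus implementation: `fake_debater` and `run_rounds` are hypothetical stand-ins, and the "model call" is just a short sleep. It only shows how `asyncio.gather` keeps debaters concurrent within a round while guaranteeing that the next round sees every position from the previous one.

```python
import asyncio


async def fake_debater(index: int, question: str, previous: list[str]) -> str:
    # Hypothetical stand-in for a model call; debaters finish in reverse order.
    await asyncio.sleep(0.01 * (3 - index))
    return f"debater {index} saw {len(previous)} earlier positions"


async def run_rounds(question: str, num_debaters: int, max_rounds: int) -> list[list[str]]:
    transcript: list[list[str]] = []
    previous: list[str] = []
    for _ in range(max_rounds):
        # gather() returns only once every debater has finished, and it keeps
        # results in call order regardless of which coroutine finished first.
        positions = await asyncio.gather(
            *(fake_debater(i, question, previous) for i in range(num_debaters))
        )
        previous = list(positions)
        transcript.append(previous)
    return transcript


transcript = asyncio.run(run_rounds("example question", num_debaters=3, max_rounds=2))
for round_number, positions in enumerate(transcript, start=1):
    print(round_number, positions)
```

Wall-clock time per round is bounded by the slowest debater, not by the sum of all debaters.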
Adaptive routing¶
When adaptive_routing=True (the default), the orchestrator first sends the question to a single fast model and checks the reported confidence. If confidence ≥ routing_confidence_threshold (default 0.85), the full debate is skipped and that answer is returned immediately with routing_decision=SINGLE_AGENT. This eliminates ~40% of unnecessary debate calls with no quality loss (arXiv 2504.05047).
cfg = DebateConfig(
    ...,
    adaptive_routing=True,
    routing_confidence_threshold=0.85,  # skip debate if this confident
    routing_model_client=haiku_client,  # fast model for the routing check
    routing_model_id="claude-haiku-4-5",
)
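The routing check reduces to a single threshold comparison. The sketch below is an illustration only: the `Route` enum and `decide_route` function are hypothetical names, chosen to mirror the `"debate"` and `"single_agent"` values that the result reports.

```python
from enum import Enum


class Route(Enum):
    # Hypothetical enum mirroring the documented routing_decision values.
    SINGLE_AGENT = "single_agent"
    DEBATE = "debate"


def decide_route(confidence: float, threshold: float = 0.85) -> Route:
    """Skip the debate when the routing model is already confident enough."""
    return Route.SINGLE_AGENT if confidence >= threshold else Route.DEBATE


print(decide_route(0.91).value)
print(decide_route(0.60).value)
```

Raising the threshold sends more questions to the full panel; lowering it saves cost but places more trust in the routing model's self-reported confidence.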
Sycophancy resistance¶
A known failure mode in multi-agent debate is sycophancy: agents flip their position not because of logic, but because of social pressure from peers (ACL 2025 CONSENSAGENT). Grampus mitigates this with a mandatory three-step prompt for rounds 2+:
- Restate your previous answer verbatim
- Critique each peer position based on evidence
- Update only if there is a compelling logical reason — and explicitly state that reason
Position changes are tracked in DebaterPosition.changed_from_previous and change_justification.
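To make the three steps concrete, here is one way such a prompt could be assembled. The wording and the `build_critique_prompt` helper are assumptions made for illustration; the actual prompt text Grampus sends is not reproduced here.

```python
def build_critique_prompt(question: str, own_answer: str, peer_answers: list[str]) -> str:
    """Assemble a round 2+ prompt following the restate / critique / update steps."""
    peers = "\n".join(f"- Peer {i + 1}: {answer}" for i, answer in enumerate(peer_answers))
    return (
        f"Question: {question}\n\n"
        f"Your previous answer: {own_answer}\n\n"
        f"Peer answers:\n{peers}\n\n"
        "1. Restate your previous answer verbatim.\n"
        "2. Critique each peer position based on evidence.\n"
        "3. Update your answer only if there is a compelling logical reason, "
        "and state that reason explicitly."
    )


prompt = build_critique_prompt("Is the clause enforceable?", "A", ["A", "B"])
print(prompt)
```

Requiring the restatement first anchors the debater to its own reasoning before it reads the peers' positions.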
Convergence detection¶
After each round, a ConvergenceDetector clusters positions by Jaccard word-overlap similarity. Two answers are considered equivalent if their word-set Jaccard score is ≥ 0.4. The convergence score is the fraction of debaters whose positions fall in the largest cluster:
convergence score = (size of largest cluster) / (number of debaters)
When score >= convergence_threshold (default 0.8), the debate stops early and the final DebateRound has stopped_early=True.
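A minimal sketch of this kind of detector follows. It makes two assumptions not confirmed by the text: clustering is greedy (each answer joins the first cluster whose first member it resembles), and the score is the share of answers in the largest cluster. All function names are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def cluster(answers: list[str], threshold: float = 0.4) -> list[list[int]]:
    """Greedy clustering (an assumption): compare against each cluster's first member."""
    clusters: list[list[int]] = []
    for i, answer in enumerate(answers):
        for members in clusters:
            if jaccard(answer, answers[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def convergence_score(answers: list[str], threshold: float = 0.4) -> float:
    """Share of answers that fall in the largest cluster (assumed definition)."""
    if not answers:
        return 0.0
    return max(len(c) for c in cluster(answers, threshold)) / len(answers)


answers = [
    "yes it is enforceable",
    "yes it is enforceable under state law",
    "no the clause is void",
]
print(cluster(answers), convergence_score(answers))
```

Word overlap is a coarse proxy for agreement: two answers that share most of their wording but differ in one decisive word (for example "enforceable" versus "void") can still land in the same cluster, so short answers deserve extra scrutiny.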
Escalation to humans¶
When the final convergence score is below escalate_threshold (default 0.5), the result sets escalate_to_human=True. This surfaces low-confidence cases rather than silently returning a guess ("From Debate to Decision", April 2026).
Aggregation strategies¶
Majority vote (fast baseline)¶
Finds the largest Jaccard-similarity cluster in the final round. The position with the highest self-reported confidence in that cluster is the representative. Returns the average cluster confidence.
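The selection rule can be sketched as follows, assuming the clusters have already been computed as lists of debater indices. The `Position` dataclass and `majority_vote` function are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Position:
    # Hypothetical minimal position record for this sketch.
    answer: str
    confidence: float


def majority_vote(positions: list[Position], clusters: list[list[int]]) -> tuple[str, float]:
    """Return the most confident answer in the largest cluster and that cluster's mean confidence."""
    largest = max(clusters, key=len)
    members = [positions[i] for i in largest]
    representative = max(members, key=lambda p: p.confidence)
    average = sum(p.confidence for p in members) / len(members)
    return representative.answer, average


positions = [Position("A", 0.82), Position("A", 0.71), Position("B", 0.65)]
answer, confidence = majority_vote(positions, clusters=[[0, 1], [2]])
print(answer, round(confidence, 3))
```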
Weighted vote (recommended)¶
aggregation=AggregationStrategy.WEIGHTED_VOTE
# Assign higher weight to your most capable debater
DebaterConfig(..., weight=2.0) # this debater's vote counts double
Scores each cluster by sum(debater.weight × position.confidence). Picks the highest-scoring cluster. This lets you bias toward a more capable model without fully excluding others.
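The scoring rule is easy to check by hand. The sketch below assumes clusters are given as lists of debater indices; `score_clusters` and `pick_cluster` are hypothetical helper names.

```python
def score_clusters(
    clusters: list[list[int]], weights: list[float], confidences: list[float]
) -> list[float]:
    """Score each cluster as the sum of weight * confidence over its members."""
    return [sum(weights[i] * confidences[i] for i in members) for members in clusters]


def pick_cluster(
    clusters: list[list[int]], weights: list[float], confidences: list[float]
) -> list[int]:
    """Return the cluster with the highest weighted score."""
    scores = score_clusters(clusters, weights, confidences)
    return clusters[scores.index(max(scores))]


clusters = [[0, 1], [2]]
confidences = [0.82, 0.71, 0.65]

# With weight 2.0 the lone dissenter scores 1.30 against 1.53 and still loses.
print(pick_cluster(clusters, [1.0, 1.0, 2.0], confidences))
# With weight 3.0 it scores 1.95 and overrides the two-debater majority.
print(pick_cluster(clusters, [1.0, 1.0, 3.0], confidences))
```

A large enough weight lets a single debater outvote the rest of the panel, so weights are best kept modest unless that is the intent.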
Judge model (highest quality)¶
from grampus.orchestration.debate import AggregationStrategy, DebateConfig, DebaterConfig

judge_cfg = DebaterConfig(
    model_client=opus_client,
    model_id="claude-opus-4-7",
    temperature=0.0,
)
cfg = DebateConfig(
    ...,
    aggregation=AggregationStrategy.JUDGE,
    judge_config=judge_cfg,
)
After all rounds, a separate judge model receives all final-round positions verbatim and synthesises a single answer with explicit reasoning. Falls back to majority vote if the judge returns non-parseable output.
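The fallback behaviour can be sketched as a parse-or-fallback wrapper. The judge's real output format is not documented here, so the JSON shape (`answer` and `confidence` keys) and both function names are assumptions made purely for illustration.

```python
import json
from typing import Callable, Optional, Tuple

Verdict = Tuple[str, float]


def parse_judge_output(raw: str) -> Optional[Verdict]:
    """Return (answer, confidence), or None when the output cannot be parsed."""
    try:
        data = json.loads(raw)
        return str(data["answer"]), float(data["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None


def aggregate_with_judge(raw: str, fallback: Callable[[], Verdict]) -> Verdict:
    """Use the judge's verdict when it parses, otherwise call the fallback."""
    verdict = parse_judge_output(raw)
    return verdict if verdict is not None else fallback()


def majority_fallback() -> Verdict:
    # Placeholder standing in for a majority-vote result.
    return ("B", 0.5)


print(aggregate_with_judge('{"answer": "A", "confidence": 0.9}', majority_fallback))
print(aggregate_with_judge("I think the answer is A.", majority_fallback))
```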
Integrating with the Graph engine¶
debate_node() wraps a DebateOrchestrator as a graph node, and optionally routes low-confidence results to a human review node:
from grampus.orchestration import Graph, debate_node, human_node
from grampus.orchestration.debate import DebateOrchestrator

orch = DebateOrchestrator(cfg)

async def route(state):
    return "escalate" if state.metadata.get("debate_escalate") else "end"

graph = (
    Graph(graph_id="contract-review")
    .add_node("debate", debate_node(orch, on_escalate="human_review"), entry=True)
    .add_conditional_edge("debate", route, {"escalate": "human_review", "end": None})
    .add_node("human_review", human_node("Low-confidence answer — please review."))
)
# Run inside an async function; initial_state is the graph's starting state.
result = await graph.execute(initial_state)
After the debate node runs, the last ASSISTANT message contains:
message.metadata["debate_result"] # full DebateResult dict
message.metadata["debate_confidence"] # float
message.metadata["debate_escalate"] # bool — True when human review needed
message.metadata["debate_rounds"] # int
message.metadata["debate_routing"] # "debate" or "single_agent"
Budget enforcement¶
Pass a CostTracker to cap total spend across all debate rounds:
from grampus.orchestration import CostTracker
tracker = CostTracker(agent_id="qa-agent", session_id="s1", budget_usd=0.05)
orch = DebateOrchestrator(cfg, cost_tracker=tracker)
The orchestrator calls check_budget() before each round. If the budget is exhausted, BudgetExceededError is raised immediately.
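One consequence of checking before each round is worth seeing in a sketch. The classes below are hypothetical stand-ins, not the Grampus `CostTracker` or `BudgetExceededError`; they only assume that the check compares accumulated spend against the cap.

```python
class BudgetExceeded(Exception):
    """Hypothetical stand-in for the library's budget error."""


class SimpleBudget:
    """Hypothetical tracker: accumulates spend and checks it against a cap."""

    def __init__(self, budget_usd: float) -> None:
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd

    def check_budget(self) -> None:
        if self.spent_usd >= self.budget_usd:
            raise BudgetExceeded(f"spent ${self.spent_usd:.2f} of ${self.budget_usd:.2f}")


def run_rounds(budget: SimpleBudget, round_costs: list[float]) -> int:
    """Check the budget before each round, then record that round's cost."""
    completed = 0
    for cost in round_costs:
        budget.check_budget()
        budget.record(cost)
        completed += 1
    return completed


budget = SimpleBudget(budget_usd=0.05)
try:
    run_rounds(budget, [0.03, 0.03, 0.03])
except BudgetExceeded as exc:
    print(f"stopped before round 3: {exc}")
```

Under these assumptions the second round starts with $0.03 spent and pushes the total to $0.06, so a pre-round check can overshoot the cap by up to one round's cost. If the real tracker behaves the same way, leave headroom when setting budget_usd.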
Observability¶
Every debate emits structured OTEL spans:
| Span | Key attributes |
|---|---|
| debate.run | question_len, num_debaters, max_rounds, aggregation, adaptive_routing |
| debate.route_check | routing_model_id, confidence, decision |
| debate.round | round_number, convergence_score, stopped_early |
| debate.debater | debater_index, model_id, confidence, changed |
| debate.aggregate | strategy, final_confidence, escalate_to_human |
Pass any tracer with a span(name, **attrs) context manager interface:
from grampus.observability.tracer import GrampusTracer
tracer = GrampusTracer(service_name="my-agent", otlp_endpoint="http://localhost:4317")
orch = DebateOrchestrator(cfg, tracer=tracer)
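For tests or local debugging, an in-memory tracer with the same `span(name, **attrs)` shape may be enough. The `ListTracer` class below is a hypothetical sketch; what the orchestrator expects the context manager to yield is not specified here, so yielding the record dict is an assumption.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List


class ListTracer:
    """Hypothetical in-memory tracer exposing span(name, **attrs) as a context manager."""

    def __init__(self) -> None:
        self.spans: List[Dict[str, Any]] = []

    @contextmanager
    def span(self, name: str, **attrs: Any) -> Iterator[Dict[str, Any]]:
        record: Dict[str, Any] = {"name": name, "attrs": dict(attrs)}
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Record the span even if the wrapped block raised.
            record["duration_s"] = time.perf_counter() - start
            self.spans.append(record)


tracer = ListTracer()
with tracer.span("debate.round", round_number=1, stopped_early=False):
    pass  # the traced work would run here

print(tracer.spans[0]["name"], tracer.spans[0]["attrs"])
```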
Configuration reference¶
DebateConfig¶
| Field | Type | Default | Description |
|---|---|---|---|
| debaters | list[DebaterConfig] | required (min 2) | Panel of debaters |
| max_rounds | int | 3 | Maximum debate rounds before forced aggregation |
| aggregation | AggregationStrategy | WEIGHTED_VOTE | How to pick the winner |
| convergence_threshold | float | 0.8 | Fraction of debaters that must agree to stop early |
| adaptive_routing | bool | True | Skip debate when single-agent confidence is high |
| routing_confidence_threshold | float | 0.85 | Confidence that triggers routing bypass |
| routing_model_client | ModelClient \| None | None → debaters[0] | Fast model for routing check |
| routing_model_id | str | "" → debaters[0] | Model ID for routing check |
| judge_config | DebaterConfig \| None | None | Required when aggregation=JUDGE |
| cost_budget_usd | float \| None | None | Hard budget ceiling across all rounds |
| escalate_threshold | float | 0.5 | Set escalate_to_human=True when convergence is below this |
DebaterConfig¶
| Field | Type | Default | Description |
|---|---|---|---|
| model_client | ModelClient | required | Any Grampus model client |
| model_id | str | required | Model identifier string |
| temperature | float | 0.7 | Sampling temperature |
| role_hint | str | "" | Appended to the system prompt — use for adversarial personas |
| weight | float | 1.0 | Vote weight for WEIGHTED_VOTE aggregation |
DebateResult fields¶
| Field | Type | Description |
|---|---|---|
| final_answer | str | Aggregated winning answer |
| final_reasoning | str | Reasoning from the winning position |
| confidence | float | Aggregated confidence score (0–1) |
| escalate_to_human | bool | True when convergence < escalate_threshold |
| rounds | list[DebateRound] | Full per-round transcript |
| routing_decision | RoutingDecision | "debate" or "single_agent" |
| total_rounds_run | int | Rounds actually run (may be < max_rounds due to early stop) |
| converged | bool | Whether early stopping triggered |
| final_convergence_score | float | Convergence in the final round |
| total_token_usage | TokenUsage | Cumulative tokens across all rounds |
| total_cost_usd | float | Total spend across all rounds |
| duration_seconds | float | Wall-clock time |
Design notes¶
Use heterogeneous models — a panel of different models outperforms one model sampled at varied temperatures. On questions of equal difficulty, claude-haiku + claude-sonnet + claude-opus consistently beats claude-sonnet × 3 (M3MAD-Bench, ICLR 2025).
Three debaters is usually enough — beyond five debaters, cost grows linearly while accuracy improvement diminishes. Start with three: a fast model, a balanced model, and one with a devil's advocate role_hint.
Trust confidence, not consensus — WEIGHTED_VOTE with tuned weights outperforms pure majority in ambiguous cases. If you have a model you trust more, give it a weight of 2–3× to let it break ties.
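Watch the escalation rate — escalation fires when the final convergence score falls below escalate_threshold. If escalate_to_human=True fires too often, either reduce escalate_threshold or increase max_rounds so debaters have more time to converge. If it fires too rarely, raise the threshold.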
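Because escalation is triggered when the final convergence score is below escalate_threshold, the effect of a candidate threshold can be estimated offline from recorded final_convergence_score values. The helper and the sample scores below are hypothetical and only illustrate the direction of the effect.

```python
def escalation_rate(final_scores: list[float], escalate_threshold: float) -> float:
    """Fraction of past debates that would escalate at the given threshold."""
    if not final_scores:
        return 0.0
    escalated = sum(score < escalate_threshold for score in final_scores)
    return escalated / len(final_scores)


# Made-up convergence scores from five past debates.
scores = [0.33, 0.67, 1.0, 0.67, 0.33]
for threshold in (0.5, 0.7):
    print(threshold, escalation_rate(scores, threshold))
```

A higher threshold escalates more debates and a lower one escalates fewer, so replaying logged scores this way gives a cheap preview before changing the live configuration.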
Next steps¶
- Multi-Agent Crew → — Sequential pipeline patterns (researcher → critic → writer)
- Agent Handoffs → — Runtime delegation between agents
- Cost Management → — Budget enforcement and spend reporting
- Orchestration API → — Full DebateOrchestrator reference