Multi-Agent Debate¶

Multi-agent debate runs the same question past several LLMs simultaneously, lets them critique each other's reasoning, then aggregates the results into a single high-confidence answer. Use it when a single LLM call is not reliable enough — factual questions with high stakes, decisions with legal or financial impact, or any case where you need an auditable confidence score alongside the answer.

When to use debate vs. crew¶

	Crew	Debate
Structure	Pipeline — each agent does a different job	Panel — all agents answer the same question
Composition	Decided before execution	Configured via `DebateConfig`
Output	Accumulated outputs from N agents	Single aggregated answer + confidence score
Cost model	Sequential; one call per member, in order	Concurrent per round; early-stops when agents agree
Best for	Research → critique → write pipelines	High-stakes Q&A, fact-checking, uncertain classifications

Prerequisites¶

pip install "grampus-ai[anthropic]"   # or openai, gemini

Debate uses whatever model clients you already configure — no additional dependencies.

Minimal example¶

import asyncio
import os

from grampus.core.models.anthropic import AnthropicClient
from grampus.orchestration.debate import (
    AggregationStrategy,
    DebateConfig,
    DebaterConfig,
    DebateOrchestrator,
)


async def main() -> None:
    client = AnthropicClient(api_key=os.environ["GRAMPUS_MODEL__ANTHROPIC_API_KEY"])

    cfg = DebateConfig(
        debaters=[
            DebaterConfig(model_client=client, model_id="claude-haiku-4-5", temperature=0.5),
            DebaterConfig(model_client=client, model_id="claude-sonnet-4-6", temperature=0.7),
            DebaterConfig(
                model_client=client,
                model_id="claude-sonnet-4-6",
                temperature=0.9,
                role_hint="You are a skeptical devil's advocate.",
            ),
        ],
        max_rounds=3,
        aggregation=AggregationStrategy.WEIGHTED_VOTE,
    )

    orch = DebateOrchestrator(cfg)
    result = await orch.run("Is the following contract clause legally enforceable in California?  ...")

    print(f"Answer:     {result.final_answer}")
    print(f"Confidence: {result.confidence:.0%}")
    print(f"Rounds run: {result.total_rounds_run}")
    print(f"Cost:       ${result.total_cost_usd:.4f}")

    if result.escalate_to_human:
        print("⚠ Low confidence — flagged for human review")


asyncio.run(main())

How it works¶

Round structure¶

sequenceDiagram
    participant Q as Question
    participant D0 as Debater 0 (haiku)
    participant D1 as Debater 1 (sonnet)
    participant D2 as Debater 2 (devil's advocate)
    participant Agg as Aggregator

    Note over D0,D2: Round 1 — independent reasoning
    Q->>D0: "Answer independently"
    Q->>D1: "Answer independently"
    Q->>D2: "Answer independently"
    D0-->>Q: Answer A + confidence 0.82
    D1-->>Q: Answer A + confidence 0.71
    D2-->>Q: Answer B + confidence 0.65

    Note over D0,D2: Round 2 — sycophancy-resistant critique
    Q->>D0: "Restate your answer. Critique peers. Update only if logically compelled."
    Q->>D1: same prompt + all peer answers
    Q->>D2: same prompt + all peer answers

    Note over Agg: Convergence ≥ threshold → stop early
    Agg->>Q: Final answer: A, confidence 0.78

All debaters in each round run concurrently via asyncio.gather. Round N starts only after all of round N-1's positions are in.
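
The round barrier described above can be sketched in plain asyncio. This is an illustrative stand-in, not the Grampus implementation: `Position`, `ask`, and `run_rounds` are hypothetical names, and `ask` fakes the model call.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Position:
    debater: str
    answer: str
    confidence: float


async def ask(debater: str, question: str, peers: list[Position]) -> Position:
    """Hypothetical stand-in for a model call; a real debater would query an LLM."""
    await asyncio.sleep(0)  # yield control, as a network call would
    answer = "A" if not peers else peers[0].answer
    return Position(debater, answer, 0.8)


async def run_rounds(question: str, debaters: list[str], max_rounds: int) -> list[list[Position]]:
    transcript: list[list[Position]] = []
    previous: list[Position] = []
    for _ in range(max_rounds):
        # Every debater in the round runs concurrently. gather() returns only
        # when all positions are in, so the next round sees the full set.
        current = await asyncio.gather(*(ask(d, question, previous) for d in debaters))
        transcript.append(list(current))
        previous = list(current)
    return transcript


transcript = asyncio.run(run_rounds("Is X true?", ["d0", "d1", "d2"], max_rounds=2))
```

`asyncio.gather` returns results in the order the awaitables were passed, so positions line up with the debater list regardless of which call finishes first.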

Adaptive routing¶

When adaptive_routing=True (the default), the orchestrator first sends the question to a single fast model and checks the reported confidence. If confidence ≥ routing_confidence_threshold (default 0.85), the full debate is skipped and that answer is returned immediately with routing_decision=SINGLE_AGENT. Routing of this kind is reported to avoid roughly 40% of debate calls with no quality loss (arXiv 2504.05047).

cfg = DebateConfig(
    ...,
    adaptive_routing=True,
    routing_confidence_threshold=0.85,     # skip debate if this confident
    routing_model_client=haiku_client,     # fast model for the routing check
    routing_model_id="claude-haiku-4-5",
)
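
The bypass rule itself is a single comparison. The helper below is a hypothetical sketch of that rule (the function name is not part of Grampus); it mainly shows that the threshold is inclusive, as the "≥" above implies.

```python
def routing_decision(fast_confidence: float, threshold: float = 0.85) -> str:
    """Return "single_agent" when the fast model is confident enough, else "debate"."""
    return "single_agent" if fast_confidence >= threshold else "debate"


for confidence in (0.60, 0.85, 0.95):
    print(f"{confidence:.2f} -> {routing_decision(confidence)}")
```

Raising the threshold sends more questions to the full debate, which costs more but relies less on one model's self-reported confidence.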

Sycophancy resistance¶

A known failure mode in multi-agent debate is sycophancy: agents flip their position not because of logic, but because of social pressure from peers (ACL 2025 CONSENSAGENT). Grampus mitigates this with a mandatory three-step prompt for rounds 2+:

1. Restate your previous answer verbatim
2. Critique each peer position based on evidence
3. Update only if there is a compelling logical reason, and explicitly state that reason

Position changes are tracked in DebaterPosition.changed_from_previous and change_justification.
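
As a rough illustration, a round 2+ prompt following these three steps could be assembled like this. The wording and the `build_critique_prompt` helper are assumptions made for the example; the actual prompt text Grampus sends is not shown in this guide.

```python
def build_critique_prompt(question: str, own_answer: str, peer_answers: list[str]) -> str:
    """Assemble a three-step critique prompt (hypothetical wording)."""
    peers = "\n".join(f"- Peer {i + 1}: {a}" for i, a in enumerate(peer_answers))
    return (
        f"Question: {question}\n\n"
        f"Your previous answer: {own_answer}\n\n"
        f"Peer answers:\n{peers}\n\n"
        "1. Restate your previous answer verbatim.\n"
        "2. Critique each peer position based on evidence.\n"
        "3. Update only if there is a compelling logical reason, "
        "and state that reason explicitly.\n"
    )


prompt = build_critique_prompt("Is the clause enforceable?", "Yes", ["Yes", "No"])
print(prompt)
```

Putting the restatement first forces the model to commit to its earlier position before it reads the instruction to evaluate peers.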

Convergence detection¶

After each round, a ConvergenceDetector clusters positions by Jaccard word-overlap similarity. Two answers are considered equivalent if their word-set Jaccard score is at least 0.4. The convergence score is:

score = size_of_largest_cluster / total_debaters

When score >= convergence_threshold (default 0.8), the debate stops early and the final DebateRound is marked stopped_early=True.
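
A minimal sketch of the scoring follows. The tokenisation (lowercased `\w+` words) and the greedy clustering against each cluster's first member are assumptions; this guide specifies only the Jaccard cutoff and the score formula.

```python
import re


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def cluster(answers: list[str], similarity: float = 0.4) -> list[list[int]]:
    """Greedy clustering (assumed): join the first cluster whose seed is similar enough."""
    clusters: list[list[int]] = []
    for i, answer in enumerate(answers):
        for members in clusters:
            if jaccard(answer, answers[members[0]]) >= similarity:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def convergence_score(answers: list[str]) -> float:
    if not answers:
        return 0.0
    return max(len(c) for c in cluster(answers)) / len(answers)


answers = [
    "Yes, the clause is enforceable",
    "Yes, the clause is likely enforceable",
    "No, California law voids it",
]
score = convergence_score(answers)  # two of three answers cluster together
```

Word overlap is a coarse proxy for agreement: two answers that share most of their wording but differ by a single "not" would still land in the same cluster, so short yes/no style answers deserve extra care.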

Escalation to humans¶

When the final convergence score is below escalate_threshold (default 0.5), the result sets escalate_to_human=True. This surfaces low-confidence cases rather than silently returning a guess ("From Debate to Decision", April 2026).
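
With a three-debater panel the convergence score can only be 1/3, 2/3, or 1, so it helps to see how the two default thresholds divide those cases. The `classify` helper is hypothetical and only applies the two comparisons described above.

```python
def classify(score: float, convergence_threshold: float = 0.8, escalate_threshold: float = 0.5) -> dict:
    """Apply the early-stop and escalation comparisons to one convergence score."""
    return {
        "would_stop_early": score >= convergence_threshold,
        "escalate_to_human": score < escalate_threshold,
    }


for largest_cluster in (1, 2, 3):
    score = largest_cluster / 3
    print(f"{largest_cluster}/3 agree -> {classify(score)}")
```

Under these defaults a 2-of-3 split neither stops early nor escalates: the debate runs to max_rounds and the aggregated answer is returned without a review flag.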

Aggregation strategies¶

Majority vote (fast baseline)¶

aggregation=AggregationStrategy.MAJORITY_VOTE

Finds the largest Jaccard-similarity cluster in the final round. The position with the highest self-reported confidence in that cluster is the representative. Returns the average cluster confidence.
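
The selection rule can be sketched as follows, assuming positions have already been grouped into clusters. `Position` and `majority_vote` are illustrative names, and ties between equally sized clusters resolve to the first one here, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Position:
    answer: str
    confidence: float


def majority_vote(clusters: list[list[Position]]) -> tuple[str, float]:
    """Pick the largest cluster, its most confident member, and the mean confidence."""
    largest = max(clusters, key=len)
    representative = max(largest, key=lambda p: p.confidence)
    mean_confidence = sum(p.confidence for p in largest) / len(largest)
    return representative.answer, mean_confidence


clusters = [
    [Position("Enforceable", 0.9), Position("Likely enforceable", 0.7)],
    [Position("Void", 0.65)],
]
answer, confidence = majority_vote(clusters)  # mean of 0.9 and 0.7
```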

Weighted vote (recommended)¶

aggregation=AggregationStrategy.WEIGHTED_VOTE

# Assign higher weight to your most capable debater
DebaterConfig(..., weight=2.0)   # this debater's vote counts double

Scores each cluster by sum(debater.weight × position.confidence). Picks the highest-scoring cluster. This lets you bias toward a more capable model without fully excluding others.
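
A small worked example shows how a weight can flip the outcome. The `Vote` type and `weighted_vote` function are hypothetical; only the scoring formula comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    answer: str
    confidence: float
    weight: float = 1.0


def cluster_score(cluster: list[Vote]) -> float:
    return sum(v.weight * v.confidence for v in cluster)


def weighted_vote(clusters: list[list[Vote]]) -> list[Vote]:
    """Return the cluster with the highest sum of weight x confidence."""
    return max(clusters, key=cluster_score)


# Two default-weight debaters agree on A; one double-weight debater says B.
weighted = [[Vote("A", 0.6), Vote("A", 0.6)], [Vote("B", 0.7, weight=2.0)]]
unweighted = [[Vote("A", 0.6), Vote("A", 0.6)], [Vote("B", 0.7)]]

winner_weighted = weighted_vote(weighted)[0].answer      # B: 1.4 beats 1.2
winner_unweighted = weighted_vote(unweighted)[0].answer  # A: 1.2 beats 0.7
```

A weight of 2.0 is enough for one fairly confident debater to outvote two lukewarm ones, so weights are worth tuning against questions with known answers.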

Judge model (highest quality)¶

from grampus.orchestration.debate import AggregationStrategy, DebateConfig, DebaterConfig

judge_cfg = DebaterConfig(
    model_client=opus_client,
    model_id="claude-opus-4-7",
    temperature=0.0,
)

cfg = DebateConfig(
    ...,
    aggregation=AggregationStrategy.JUDGE,
    judge_config=judge_cfg,
)

After all rounds, a separate judge model receives all final-round positions verbatim and synthesises a single answer with explicit reasoning. Falls back to majority vote if the judge returns non-parseable output.
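
The fallback path might look like the sketch below. It assumes the judge is asked to reply with a JSON object containing an "answer" field; that output format, and both function names, are assumptions made for illustration rather than the documented Grampus contract.

```python
import json
from typing import Callable, Optional


def parse_judge_output(raw: str) -> Optional[dict]:
    """Return the parsed verdict, or None when the output is not usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and isinstance(data.get("answer"), str):
        return data
    return None


def aggregate_with_judge(raw_judge_output: str, fallback: Callable[[], str]) -> str:
    """Use the judge's answer when parseable; otherwise call the fallback aggregator."""
    parsed = parse_judge_output(raw_judge_output)
    return parsed["answer"] if parsed is not None else fallback()


ok = aggregate_with_judge('{"answer": "A", "reasoning": "Stronger evidence."}', lambda: "B")
bad = aggregate_with_judge("I think the answer is A.", lambda: "B")
```

Passing the fallback as a callable means the majority vote is computed only when the judge output cannot be parsed.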

Integrating with the Graph engine¶

debate_node() wraps a DebateOrchestrator as a graph node, and optionally routes low-confidence results to a human review node:

from grampus.orchestration import Graph, debate_node, human_node
from grampus.orchestration.debate import DebateOrchestrator

orch = DebateOrchestrator(cfg)

async def route(state):
    return "escalate" if state.metadata.get("debate_escalate") else "end"

graph = (
    Graph(graph_id="contract-review")
    .add_node("debate", debate_node(orch, on_escalate="human_review"), entry=True)
    .add_conditional_edge("debate", route, {"escalate": "human_review", "end": None})
    .add_node("human_review", human_node("Low-confidence answer — please review."))
)

# Call from inside an async function; initial_state is your graph's starting state
result = await graph.execute(initial_state)

After the debate node runs, the last ASSISTANT message contains:

message.metadata["debate_result"]      # full DebateResult dict
message.metadata["debate_confidence"]  # float
message.metadata["debate_escalate"]    # bool — True when human review needed
message.metadata["debate_rounds"]      # int
message.metadata["debate_routing"]     # "debate" or "single_agent"

Budget enforcement¶

Pass a CostTracker to cap total spend across all debate rounds:

from grampus.orchestration import CostTracker

tracker = CostTracker(agent_id="qa-agent", session_id="s1", budget_usd=0.05)
orch = DebateOrchestrator(cfg, cost_tracker=tracker)

The orchestrator calls check_budget() before each round. If the budget is exhausted, BudgetExceededError is raised immediately.
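
The check-before-each-round pattern can be sketched with local stand-ins. `SimpleBudget` and the `BudgetExceededError` defined here are hypothetical classes written for this example, not the Grampus ones, and treating "exhausted" as spent ≥ budget is an assumption.

```python
class BudgetExceededError(RuntimeError):
    """Local stand-in for illustration; not the Grampus exception class."""


class SimpleBudget:
    def __init__(self, budget_usd: float) -> None:
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd

    def check_budget(self) -> None:
        if self.spent_usd >= self.budget_usd:
            raise BudgetExceededError(
                f"spent ${self.spent_usd:.2f} of ${self.budget_usd:.2f}"
            )


def run_rounds(budget: SimpleBudget, round_costs: list[float]) -> int:
    """Run rounds until done or the budget check fails; returns rounds completed."""
    completed = 0
    for cost in round_costs:
        budget.check_budget()  # checked before the round starts, not during it
        budget.record(cost)
        completed += 1
    return completed


budget = SimpleBudget(budget_usd=0.05)
try:
    run_rounds(budget, [0.03, 0.03, 0.03])
except BudgetExceededError as exc:
    print(f"Stopped: {exc}")
```

In this sketch the second round starts with $0.03 spent and finishes at $0.06, so a pre-round check can overshoot the cap by up to one round's cost. Leave headroom in the budget if the ceiling is strict.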

Observability¶

Every debate emits structured OTEL spans:

Span	Key attributes
`debate.run`	`question_len`, `num_debaters`, `max_rounds`, `aggregation`, `adaptive_routing`
`debate.route_check`	`routing_model_id`, `confidence`, `decision`
`debate.round`	`round_number`, `convergence_score`, `stopped_early`
`debate.debater`	`debater_index`, `model_id`, `confidence`, `changed`
`debate.aggregate`	`strategy`, `final_confidence`, `escalate_to_human`

Pass any tracer with a span(name, **attrs) context manager interface:

from grampus.observability.tracer import GrampusTracer

tracer = GrampusTracer(service_name="my-agent", otlp_endpoint="http://localhost:4317")
orch = DebateOrchestrator(cfg, tracer=tracer)
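
For tests or local debugging, an in-memory tracer that satisfies the span(name, **attrs) context manager shape could look like this. `ListTracer` is a hypothetical helper; whether the orchestrator also expects methods on the yielded span object is not stated here, so treat this as a starting point.

```python
import time
from contextlib import contextmanager
from typing import Any, Iterator


class ListTracer:
    """Collects finished spans in memory instead of exporting them."""

    def __init__(self) -> None:
        self.spans: list[dict[str, Any]] = []

    @contextmanager
    def span(self, name: str, **attrs: Any) -> Iterator[dict[str, Any]]:
        record: dict[str, Any] = {"name": name, "attrs": dict(attrs)}
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Record the span even if the body raised.
            record["duration_s"] = time.perf_counter() - start
            self.spans.append(record)


tracer = ListTracer()
with tracer.span("debate.round", round_number=1, convergence_score=0.67):
    pass  # the traced work would run here

print(tracer.spans[0]["name"], tracer.spans[0]["attrs"])
```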

Configuration reference¶

DebateConfig¶

Field	Type	Default	Description
`debaters`	`list[DebaterConfig]`	required (min 2)	Panel of debaters
`max_rounds`	`int`	`3`	Maximum debate rounds before forced aggregation
`aggregation`	`AggregationStrategy`	`WEIGHTED_VOTE`	How to pick the winner
`convergence_threshold`	`float`	`0.8`	Fraction of debaters that must agree to stop early
`adaptive_routing`	`bool`	`True`	Skip debate when single-agent confidence is high
`routing_confidence_threshold`	`float`	`0.85`	Confidence that triggers routing bypass
`routing_model_client`	`ModelClient | None`	`None` → debaters[0]	Fast model for routing check
`routing_model_id`	`str`	`""` → debaters[0]	Model ID for routing check
`judge_config`	`DebaterConfig | None`	`None`	Required when `aggregation=JUDGE`
`cost_budget_usd`	`float | None`	`None`	Hard budget ceiling across all rounds
`escalate_threshold`	`float`	`0.5`	Set `escalate_to_human=True` when final convergence is below this

DebaterConfig¶

Field	Type	Default	Description
`model_client`	`ModelClient`	required	Any Grampus model client
`model_id`	`str`	required	Model identifier string
`temperature`	`float`	`0.7`	Sampling temperature
`role_hint`	`str`	`""`	Appended to the system prompt — use for adversarial personas
`weight`	`float`	`1.0`	Vote weight for `WEIGHTED_VOTE` aggregation

DebateResult fields¶

Field	Type	Description
`final_answer`	`str`	Aggregated winning answer
`final_reasoning`	`str`	Reasoning from the winning position
`confidence`	`float`	Aggregated confidence score (0–1)
`escalate_to_human`	`bool`	`True` when convergence < `escalate_threshold`
`rounds`	`list[DebateRound]`	Full per-round transcript
`routing_decision`	`RoutingDecision`	`"debate"` or `"single_agent"`
`total_rounds_run`	`int`	Rounds actually run (may be < `max_rounds` due to early stop)
`converged`	`bool`	Whether early stopping triggered
`final_convergence_score`	`float`	Convergence in the final round
`total_token_usage`	`TokenUsage`	Cumulative tokens across all rounds
`total_cost_usd`	`float`	Total spend across all rounds
`duration_seconds`	`float`	Wall-clock time

Design notes¶

Use heterogeneous models: a panel of different models outperforms the same model sampled at varied temperatures. At identical question difficulty, claude-haiku + claude-sonnet + claude-opus consistently beats claude-sonnet × 3 (M3MAD-Bench, ICLR 2025).

Three debaters is usually enough: cost grows linearly with the number of debaters, while accuracy gains diminish beyond five. Start with a fast model, a balanced model, and one with a devil's advocate role_hint.

Trust confidence, not consensus: WEIGHTED_VOTE with tuned weights outperforms pure majority in ambiguous cases. If you have a model you trust more, give it a weight of 2–3× to let it break ties.

Watch the escalation rate: if escalate_to_human=True fires too often, either reduce escalate_threshold or increase max_rounds. If it fires too rarely, raise the threshold.

Next steps¶

Multi-Agent Crew → — Sequential pipeline patterns (researcher → critic → writer)
Agent Handoffs → — Runtime delegation between agents
Cost Management → — Budget enforcement and spend reporting
Orchestration API → — Full DebateOrchestrator reference