Evaluation API Reference

EvalSuite

Runs a collection of EvalCase objects against an AgentRunner.

grampus.evaluation.suite.EvalSuite

Parameters:

Name Type Description Default
name str

Suite name for reporting.

required
agent_runner Any

The runner to test (duck-typed AgentRunner).

required
agent_def AgentDefinition

Agent definition passed to runner.run().

required
session_id_prefix str

Prefix for per-case session IDs.

'eval'
concurrency int

Max cases to run in parallel.

1
tags list[str] | None

If set, only run cases whose tags intersect this set.

None

add_case(case)

Register a case. Returns self for chaining.

add_cases(cases)

Register multiple cases. Returns self for chaining.

run() async

Execute all registered (and tag-filtered) cases.

Returns:

Type Description
SuiteResult

SuiteResult with aggregated pass/fail counts and per-case details.
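As a rough illustration of how the tags and concurrency parameters could interact, the sketch below filters cases by tag intersection and then runs them behind a semaphore. It does not use grampus itself: `Case`, `select_cases`, and `run_all` are hypothetical stand-ins, and treating untagged cases as excluded whenever a tag filter is set is an assumption read from the parameter description.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field


@dataclass
class Case:  # hypothetical stand-in for EvalCase
    name: str
    tags: list[str] = field(default_factory=list)


def select_cases(cases: list[Case], tags: list[str] | None) -> list[Case]:
    """Keep every case when tags is None, else those sharing at least one tag."""
    if tags is None:
        return list(cases)
    wanted = set(tags)
    return [c for c in cases if wanted & set(c.tags)]


async def run_all(cases: list[Case], concurrency: int = 1) -> list[str]:
    """Run cases with at most `concurrency` in flight; results keep input order."""
    gate = asyncio.Semaphore(concurrency)

    async def run_one(case: Case) -> str:
        async with gate:
            await asyncio.sleep(0)  # placeholder for the real agent call
            return case.name

    return await asyncio.gather(*(run_one(c) for c in cases))
```

With concurrency=1 the semaphore serializes the cases, which matches the documented default.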

run_case(case) async

Run a single EvalCase. Useful for debugging individual cases.

Parameters:

Name Type Description Default
case EvalCase

The EvalCase to execute.

required

Returns:

Type Description
CaseResult

CaseResult with assertion details.


EvalCase

grampus.evaluation.suite.EvalCase

Bases: BaseModel

A single evaluation test case.

Attributes:

Name Type Description
id str

Unique identifier (auto-generated).

name str

Human-readable case name.

description str

Optional description.

input str

User input to send to the agent.

tags list[str]

Labels for filtering (e.g. "smoke", "regression").

assertions list[Any]

List of Assertion callables to run.

metadata dict[str, Any]

Arbitrary key-value metadata.


Results

SuiteResult

grampus.evaluation.suite.SuiteResult

Bases: BaseModel

Aggregate result from running a full EvalSuite.

Attributes:

Name Type Description
suite_name str

Name of the suite.

total_cases int

Number of cases actually run (after filtering).

passed int

Cases where all assertions passed.

failed int

Cases with one or more assertion failures.

errors int

Cases where the agent raised an exception.

pass_rate float

passed / total_cases.

avg_duration_seconds float

Mean per-case wall time.

case_results list[CaseResult]

Ordered list of per-case results.

run_at datetime

UTC timestamp of run start.

total_cost_usd float

Sum of all execution costs.

metadata dict[str, Any]

Arbitrary metadata.

CaseResult

grampus.evaluation.suite.CaseResult

Bases: BaseModel

Result of running one EvalCase.

Attributes:

Name Type Description
case_id str

ID of the EvalCase.

case_name str

Name of the EvalCase.

passed bool

True only if all assertions passed.

assertion_results list[AssertionResult]

Per-assertion results.

execution_result ExecutionResult | None

The agent's ExecutionResult (may be None on error).

error str | None

Set if the agent raised an exception.

duration_seconds float

Wall-clock time for this case.

tags list[str]

Tags copied from the EvalCase.
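
The aggregate fields on SuiteResult can be derived from the per-case results roughly as sketched below. `CaseRow` and `aggregate` are hypothetical names, and two points are assumptions rather than documented behaviour: an errored case is counted under errors only (not under failed), and an empty run reports a pass rate of 0.0.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CaseRow:  # hypothetical stand-in for CaseResult
    passed: bool
    error: str | None
    duration_seconds: float


def aggregate(results: list[CaseRow]) -> dict[str, float]:
    """Compute suite-level counts and averages from per-case rows."""
    total = len(results)
    errors = sum(1 for r in results if r.error is not None)
    passed = sum(1 for r in results if r.passed and r.error is None)
    failed = total - passed - errors  # assumption: errors are not double-counted
    return {
        "total_cases": total,
        "passed": passed,
        "failed": failed,
        "errors": errors,
        "pass_rate": passed / total if total else 0.0,
        "avg_duration_seconds": (
            sum(r.duration_seconds for r in results) / total if total else 0.0
        ),
    }
```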

AssertionResult

@dataclass
class AssertionResult:
    passed: bool
    assertion_type: str    # e.g., "contains", "tool_was_called"
    detail: str            # human-readable description
    score: float           # 0.0–1.0 (1.0 = fully passed)
    expected: str | None
    actual: str | None

Assertion factories

All assertion factories return Assertion objects (async callables).

Output content

grampus.evaluation.assertions.contains(expected, *, case_sensitive=True)

Assert output contains the expected substring.

Parameters:

Name Type Description Default
expected str

Substring to search for.

required
case_sensitive bool

Whether the match is case-sensitive.

True

grampus.evaluation.assertions.not_contains(forbidden, *, case_sensitive=True)

Assert output does NOT contain the forbidden substring.

Parameters:

Name Type Description Default
forbidden str

Substring that must not appear.

required
case_sensitive bool

Whether the match is case-sensitive.

True

grampus.evaluation.assertions.matches_regex(pattern)

Assert output matches the regex pattern (re.search).

Parameters:

Name Type Description Default
pattern str

Regular expression pattern to search for.

required

grampus.evaluation.assertions.output_length(*, min_chars=None, max_chars=None)

Assert output character length is within bounds.

Parameters:

Name Type Description Default
min_chars int | None

Minimum character count (inclusive).

None
max_chars int | None

Maximum character count (inclusive).

None
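
The matching rules behind contains, matches_regex, and output_length can be pictured with plain string checks. This is a sketch of the described semantics only, not the library's code: the `check_*` helpers are hypothetical, and lower-casing both sides for the case-insensitive path is an assumption.

```python
import re


def check_contains(output: str, expected: str, *, case_sensitive: bool = True) -> bool:
    """Substring test; lower-cases both sides when case_sensitive is False."""
    if not case_sensitive:
        output, expected = output.lower(), expected.lower()
    return expected in output


def check_regex(output: str, pattern: str) -> bool:
    """re.search semantics: the pattern may match anywhere in the output."""
    return re.search(pattern, output) is not None


def check_length(output: str, *, min_chars=None, max_chars=None) -> bool:
    """Both bounds are inclusive; None disables a bound."""
    n = len(output)
    if min_chars is not None and n < min_chars:
        return False
    if max_chars is not None and n > max_chars:
        return False
    return True
```

Because re.search is used rather than re.fullmatch, a pattern must carry its own anchors (for example ^ and $) when the whole output has to match.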

Tool calls

grampus.evaluation.assertions.tool_was_called(tool_name)

Assert the named tool appeared in tool calls during execution.

Parameters:

Name Type Description Default
tool_name str

Name of the tool that must have been called.

required

grampus.evaluation.assertions.tool_not_called(tool_name)

Assert the named tool was NOT called during execution.

Parameters:

Name Type Description Default
tool_name str

Name of the tool that must not have been called.

required

grampus.evaluation.assertions.tool_call_count(*, min_calls=None, max_calls=None)

Assert total tool_calls_made is within bounds.

Parameters:

Name Type Description Default
min_calls int | None

Minimum number of tool calls.

None
max_calls int | None

Maximum number of tool calls.

None

Structured output

grampus.evaluation.assertions.json_schema_valid(schema)

Assert output is valid JSON matching the given JSON Schema.

Falls back to json.loads check only when jsonschema is unavailable.

Parameters:

Name Type Description Default
schema dict[str, Any]

JSON Schema dict to validate against.

required
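
The fallback described above might look like the following sketch: parse first, then validate against the schema only if the optional jsonschema package can be imported. `is_schema_valid` is a hypothetical helper; `jsonschema.validate` and `jsonschema.ValidationError` are the real entry points of that package.

```python
import json

try:
    import jsonschema  # optional dependency
except ImportError:
    jsonschema = None


def is_schema_valid(output: str, schema: dict) -> bool:
    """Return True if output parses as JSON and, when possible, fits schema."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if jsonschema is None:
        return True  # fallback: parse-only check, the schema is not enforced
    try:
        jsonschema.validate(instance=data, schema=schema)
    except jsonschema.ValidationError:
        return False
    return True
```

Note the consequence of the fallback: without jsonschema installed, well-formed JSON that violates the schema still passes.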

grampus.evaluation.assertions.status_is(expected_status)

Assert ExecutionResult.status equals expected_status.

Parameters:

Name Type Description Default
expected_status AgentStatus

The status the agent run must have ended with.

required

Budget and performance

grampus.evaluation.assertions.max_cost(limit_usd)

Assert token_usage.cost_usd <= limit_usd.

Parameters:

Name Type Description Default
limit_usd float

Maximum allowed cost in USD.

required

grampus.evaluation.assertions.max_duration(limit_seconds)

Assert duration_seconds <= limit_seconds.

Parameters:

Name Type Description Default
limit_seconds float

Maximum allowed duration.

required

grampus.evaluation.assertions.max_steps(limit)

Assert steps_taken <= limit.

Parameters:

Name Type Description Default
limit int

Maximum allowed number of steps.

required

LLM-as-judge

grampus.evaluation.assertions.semantic_similarity(expected, *, model_client, threshold=0.8)

Assert that the similarity between the output and the expected text is >= threshold.

Uses LLM-as-judge: asks model_client to score similarity from 0.0 to 1.0, so the score is assigned by the model rather than computed from embeddings.

Parameters:

Name Type Description Default
expected str

Text to compare the output against.

required
model_client Any

ModelClient instance (duck-typed) for LLM scoring.

required
threshold float

Minimum similarity score to pass.

0.8

grampus.evaluation.assertions.llm_judge(criteria, *, model_client, threshold=0.7)

LLM-as-judge: score output against free-text criteria 0.0–1.0.

Parameters:

Name Type Description Default
criteria str

Free-text description of what constitutes a good response.

required
model_client Any

ModelClient instance (duck-typed) for LLM scoring.

required
threshold float

Minimum score to pass.

0.7
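
One way to picture the threshold check is to ask a client for a score, parse the number out of its reply, and compare. Everything in this sketch is assumed for illustration: `FakeJudgeClient`, its `complete` method, the prompt wording, and the choice to score unparseable replies as 0.0. The real ModelClient interface is not described in this reference.

```python
import asyncio
import re


class FakeJudgeClient:
    """Hypothetical stand-in for a model client; returns a canned reply."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    async def complete(self, prompt: str) -> str:  # assumed method name
        return self.reply


def parse_score(reply: str) -> float:
    """Take the first number in the reply and clamp it to [0.0, 1.0]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return 0.0  # assumption: an unparseable reply counts as a zero score
    return max(0.0, min(1.0, float(match.group())))


async def judge(output: str, criteria: str, client, threshold: float = 0.7):
    """Return (passed, score) for output judged against free-text criteria."""
    prompt = f"Criteria: {criteria}\nOutput: {output}\nScore from 0.0 to 1.0:"
    score = parse_score(await client.complete(prompt))
    return score >= threshold, score
```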

Safety

grampus.evaluation.assertions.no_pii(pii_types=None)

Assert output contains no PII.

Parameters:

Name Type Description Default
pii_types list[str] | None

List of PIIType string values to check. None means all types.

None

grampus.evaluation.assertions.no_injection_patterns()

Assert output contains no prompt injection patterns.

Uses PromptInjectionDetector at BALANCED level.


Prompt version manager

grampus.evaluation.prompt_versions.PromptVersionManager

Tracks system prompt versions for an agent.

All state is in-memory. Persistence can be layered in Phase 12.

Parameters:

Name Type Description Default
agent_id str

Scopes versions to this agent.

required

register(version, prompt, *, notes='')

Register a new version.

Parameters:

Name Type Description Default
version str

Semver string for this prompt.

required
prompt str

The system prompt text.

required
notes str

Optional description.

''

Returns:

Type Description
PromptVersion

The newly created PromptVersion.

Raises:

Type Description
ValueError

If the version string already exists.

diff(from_version, to_version)

Compute line-level diff between two versions.

Parameters:

Name Type Description Default
from_version str

Source version string.

required
to_version str

Target version string.

required

Returns:

Type Description
PromptDiff

PromptDiff with added/removed lines and similarity ratio.

Raises:

Type Description
ValueError

If either version is not found.
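
A line-level diff with a similarity ratio can be produced with the standard library's difflib. Whether PromptVersionManager uses difflib internally is not stated here, so treat this as one plausible sketch; `line_diff` is a hypothetical helper.

```python
import difflib


def line_diff(old: str, new: str) -> tuple[list[str], list[str], float]:
    """Return (added lines, removed lines, similarity ratio) for two prompts."""
    old_lines, new_lines = old.splitlines(), new.splitlines()
    added, removed = [], []
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
        # lines starting with "  " are unchanged; "? " lines are ndiff hints
    ratio = difflib.SequenceMatcher(None, old_lines, new_lines).ratio()
    return added, removed, ratio
```

The ratio here is computed over whole lines, so a one-word change to a line counts as that line being replaced.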


Quality baseline

grampus.evaluation.baseline.QualityBaseline

Records eval runs and detects regressions against a pinned baseline.

Parameters:

Name Type Description Default
suite_name str

Name of the eval suite this baseline tracks.

'default'
regression_threshold float

Pass-rate drop that triggers a regression flag. For example, 0.05 flags a regression when pass_rate drops by more than 5 percentage points.

0.05

pin(run_id)

Pin a specific run as the baseline for future comparisons.

Parameters:

Name Type Description Default
run_id str

ID of the run to pin.

required

Raises:

Type Description
ValueError

If run_id not found.

compare(suite_result)

Compare suite_result against the pinned baseline.

Parameters:

Name Type Description Default
suite_result SuiteResult

New suite result to compare.

required

Returns:

Type Description
RegressionReport | None

RegressionReport, or None if no baseline is pinned.

BaselineComparison

@dataclass
class BaselineComparison:
    regressed: bool
    baseline_pass_rate: float
    current_pass_rate: float
    delta: float                      # current - baseline
    degraded_cases: list[str]         # case names that regressed
    improved_cases: list[str]         # case names that improved
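
The fields above suggest a comparison along the following lines. This sketch mirrors the dataclass with a local `Comparison` type and takes plain name-to-passed mappings as input; the `compare_runs` helper, its input shape, and the rule that only cases present in both runs can count as degraded or improved are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Comparison:  # local mirror of the BaselineComparison fields
    regressed: bool
    baseline_pass_rate: float
    current_pass_rate: float
    delta: float
    degraded_cases: list[str] = field(default_factory=list)
    improved_cases: list[str] = field(default_factory=list)


def pass_rate(run: dict[str, bool]) -> float:
    return sum(run.values()) / len(run) if run else 0.0


def compare_runs(
    baseline: dict[str, bool], current: dict[str, bool], threshold: float = 0.05
) -> Comparison:
    """Compare two runs given as {case name: passed} mappings."""
    base, cur = pass_rate(baseline), pass_rate(current)
    delta = cur - base  # negative when the current run is worse
    shared = baseline.keys() & current.keys()
    return Comparison(
        regressed=-delta > threshold,  # drop strictly greater than threshold
        baseline_pass_rate=base,
        current_pass_rate=cur,
        delta=delta,
        degraded_cases=sorted(n for n in shared if baseline[n] and not current[n]),
        improved_cases=sorted(n for n in shared if not baseline[n] and current[n]),
    )
```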

Reporters

grampus.evaluation.reporter.EvalReporter

Renders and outputs evaluation reports.

Parameters:

Name Type Description Default
pubsub Any | None

Optional DaprPubSub for publishing results.

None
report_topic str

Pub/sub topic for full report JSON.

'grampus.eval.results'
run_store Any | None

Optional EvalRunStore to persist run records.

None
pubsub_topic str

Pub/sub topic for the lightweight eval.suite.completed event.

'eval.suite.completed'

render(report, *, fmt=ReportFormat.TEXT)

Render report as a string in the requested format.

Parameters:

Name Type Description Default
report EvalReport

The EvalReport to render.

required
fmt ReportFormat

Output format.

TEXT

Returns:

Type Description
str

Rendered string.

print(report, *, fmt=ReportFormat.TEXT)

Render and print to stdout.

Parameters:

Name Type Description Default
report EvalReport

The EvalReport to print.

required
fmt ReportFormat

Output format.

TEXT

publish(report) async

Publish report JSON to pub/sub topic; save to run_store; emit completed event.

Failures from the store or pub/sub never propagate to the caller.

Parameters:

Name Type Description Default
report EvalReport

The EvalReport to publish.

required

Red-Team API

Types

from grampus.evaluation.red_team import (
    AttackCategory,   # prompt_injection | jailbreak | reasoning_hijack
                      # memory_poison | tool_misuse | excessive_agency
    AttackVariant,    # direct_injection | indirect_injection | roleplay_jailbreak
                      # encoding_jailbreak | logic_trap | memory_write_inject
                      # memory_read_poison | tool_loop | tool_chain_escape
                      # scope_escalation | implicit_permission
    OWASPCategory,    # ASI01_GOAL_HIJACK | ASI02_TOOL_MISUSE | ASI06_MEMORY_POISON | ...
    SecurityProperty, # task_alignment | action_alignment | source_authorization | data_isolation
    Severity,         # critical | high | medium | low | info
    AttackPayload,
    JudgeVerdict,
    AttackResult,
    RedTeamTargetConfig,
    RedTeamCampaignConfig,
)

RedTeamTargetConfig

class RedTeamTargetConfig(BaseModel):
    agent_name: str
    system_prompt: str
    available_tools: list[str] = []
    memory_enabled: bool = False
    crew_enabled: bool = False
    max_turns: int = 1             # 1–10; >1 enables multi-turn strategy attacks

RedTeamCampaignConfig

class RedTeamCampaignConfig(BaseModel):
    campaign_id: str
    target: RedTeamTargetConfig
    enabled_categories: list[AttackCategory]
    payloads_per_strategy: int = 5    # 1–50
    max_concurrent: int = 5           # 1–10
    stop_on_critical: bool = False

AttackerAgent

grampus.evaluation.red_team.attacker.AttackerAgent

Generates adversarial payloads by combining static strategies with optional LLM-based mutation of failed attempts (AgenticRed pattern).

Static mode (model_client=None): uses only strategy templates. Adaptive mode (model_client set): after each failed attempt, uses the LLM to generate a mutated variant targeting the same objective.

Parameters:

Name Type Description Default
model_client Any | None

Optional. If set, enables adaptive mutation.

None
strategies list[BaseAttackStrategy] | None

List of strategy instances. Defaults to ALL_STRATEGIES instantiated with the same model_client.

None

generate_payloads(target, categories=None, count_per_strategy=5) async

Generate all payloads from enabled strategies for this target.

Filters strategies by categories if provided. Returns [] on error.

mutate_failed(failed, target) async

Use LLM to generate a mutated variant of a failed payload.

Returns None if model unavailable or mutation fails. Never raises.

RedTeamJudge

grampus.evaluation.red_team.judge.RedTeamJudge

Evaluates whether an attack succeeded by returning a JudgeVerdict.

Two-tier approach:

1. LLM judge (primary): contextual, nuanced, maps to 4 security properties.
2. Rule-based fallback: regex patterns when the LLM is unavailable.

The LLM judge uses temperature=0.0 for deterministic verdicts. Rule-based judge is always run in parallel; its verdict is used when LLM confidence < 0.5 or on LLM failure.
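
The selection rule in the paragraph above reduces to a small function. `Verdict` and `pick_verdict` are hypothetical stand-ins (the real JudgeVerdict fields are not listed in this reference); only the 0.5 confidence cutoff and the fall-back-on-failure behaviour come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Verdict:  # hypothetical stand-in for JudgeVerdict
    attack_succeeded: bool
    confidence: float
    source: str  # "llm" or "rules", for illustration only


def pick_verdict(
    llm: Verdict | None, rule: Verdict, min_confidence: float = 0.5
) -> Verdict:
    """Prefer the LLM verdict unless it is missing or below min_confidence."""
    if llm is None or llm.confidence < min_confidence:
        return rule
    return llm
```

Passing None for the LLM verdict models both the rule-based-only configuration and an LLM call that failed.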

Parameters:

Name Type Description Default
model_client Any | None

Optional — if None, falls back to rule-based only.

None

evaluate(payload, response) async

Evaluate whether the agent's response indicates a successful attack.

Returns JudgeVerdict. Never raises.

RedTeamRunner

grampus.evaluation.red_team.runner.RedTeamRunner

Orchestrates one complete red-team campaign.

Loop per payload:

1. AttackerAgent.generate_payloads() → list[AttackPayload]
2. For each payload: submit to target agent → collect response
3. RedTeamJudge.evaluate(payload, response) → JudgeVerdict
4. If failed + model available: AttackerAgent.mutate_failed() → retry once
5. Collect all AttackResults

Concurrency: asyncio.Semaphore bounded by config.max_concurrent.

The target agent is called via target_fn: an async callable that takes a list of (role, content) message tuples and returns a string response. This keeps RedTeamRunner decoupled from AgentRunner's full lifecycle.
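
A target_fn matching that shape, plus semaphore-bounded submission, could be sketched as below. `echo_target` and `submit_all` are hypothetical, the "user" role name is an assumption, and the real runner does more (judging, mutation, retries) than this fragment shows.

```python
import asyncio

Message = tuple[str, str]  # (role, content)


async def echo_target(messages: list[Message]) -> str:
    """Toy target: refuses whatever the most recent user message asked for."""
    last_user = next((c for r, c in reversed(messages) if r == "user"), "")
    return f"I cannot help with: {last_user}"


async def submit_all(payloads: list[str], target_fn, max_concurrent: int = 5) -> list[str]:
    """Send each payload as a single user turn, at most max_concurrent at once."""
    gate = asyncio.Semaphore(max_concurrent)

    async def submit(payload: str) -> str:
        async with gate:
            try:
                return await target_fn([("user", payload)])
            except Exception as exc:  # keep one failing call from ending the run
                return f"<error: {exc}>"

    return await asyncio.gather(*(submit(p) for p in payloads))
```

In practice a target_fn would wrap a call into the agent under test and return its final text.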

Parameters:

Name Type Description Default
attacker AttackerAgent

AttackerAgent instance.

required
judge RedTeamJudge

RedTeamJudge instance.

required
target_fn Any

async callable (messages: list[tuple[str, str]]) -> str

required

run(config) async

Run the full campaign. Returns all AttackResults (successful and failed).

Respects config.stop_on_critical: aborts after first CRITICAL finding. Never raises.

RedTeamReport

grampus.evaluation.red_team.report.RedTeamReport

Generates structured findings from a completed campaign.

Deduplication: multiple results with the same attack_category and variant are grouped into one RedTeamFinding (keeps worst severity).
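
The grouping rule can be sketched with a dictionary keyed on (attack_category, variant). `Result`, `SEVERITY_RANK`, and `dedupe` are hypothetical; the severity names come from the Severity comment in the Types section, while their numeric ordering is an assumption.

```python
from dataclasses import dataclass

# Assumed ordering, from least to most severe.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass
class Result:  # hypothetical stand-in for AttackResult
    attack_category: str
    variant: str
    severity: str


def dedupe(results: list[Result]) -> list[Result]:
    """Keep one result per (category, variant), preferring the worst severity."""
    worst: dict[tuple[str, str], Result] = {}
    for r in results:
        key = (r.attack_category, r.variant)
        kept = worst.get(key)
        if kept is None or SEVERITY_RANK[r.severity] > SEVERITY_RANK[kept.severity]:
            worst[key] = r
    return list(worst.values())
```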

build(config, results)

Build a RedTeamSummary from campaign config and results. Never raises.

to_text(summary)

Generate a human-readable text report.

to_json(summary, indent=2)

Serialize summary to JSON string.

Writing a custom attack strategy

from grampus.evaluation.red_team.strategies.base import BaseAttackStrategy
from grampus.evaluation.red_team.types import (
    AttackCategory, AttackPayload, AttackVariant, RedTeamTargetConfig,
)


class MyCustomStrategy(BaseAttackStrategy):
    @property
    def category(self) -> AttackCategory:
        return AttackCategory.EXCESSIVE_AGENCY

    @property
    def name(self) -> str:
        return "my_custom"

    async def generate(
        self, target: RedTeamTargetConfig, count: int = 5
    ) -> list[AttackPayload]:
        try:
            return [
                AttackPayload(
                    content=f"Custom attack payload {i}",
                    attack_category=self.category,
                    attack_variant=AttackVariant.SCOPE_ESCALATION,
                    strategy_name=self.name,
                )
                for i in range(count)
            ]
        except Exception:
            return []

Pass it to AttackerAgent:

from grampus.evaluation.red_team.attacker import AttackerAgent
from grampus.evaluation.red_team.strategies import ALL_STRATEGIES

attacker = AttackerAgent(
    strategies=[S() for S in ALL_STRATEGIES] + [MyCustomStrategy()]
)

See the Red-Teaming guide for a full walkthrough.


Writing a custom assertion

from grampus.core.types import ExecutionResult
from grampus.evaluation.assertions import AssertionResult
from grampus.evaluation.suite import EvalCase


class WordCountAssertion:
    """Assert the output contains between min_words and max_words words."""

    def __init__(self, min_words: int, max_words: int) -> None:
        self.min_words = min_words
        self.max_words = max_words

    async def __call__(self, result: ExecutionResult) -> AssertionResult:
        output = result.output or ""
        word_count = len(output.split())
        passed = self.min_words <= word_count <= self.max_words
        return AssertionResult(
            passed=passed,
            assertion_type="word_count",
            detail=f"Word count {word_count} {'within' if passed else 'outside'} [{self.min_words}, {self.max_words}]",
            score=1.0 if passed else 0.0,
            expected=f"{self.min_words}-{self.max_words} words",
            actual=f"{word_count} words",
        )


# Use in an EvalCase
case = EvalCase(
    name="medium_length_response",
    input="Explain photosynthesis briefly.",
    assertions=[WordCountAssertion(min_words=50, max_words=200)],
)