Evaluation API Reference¶
EvalSuite¶
Runs a collection of EvalCase objects against an AgentRunner.
grampus.evaluation.suite.EvalSuite¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Suite name for reporting. | required |
| `agent_runner` | `Any` | The runner to test (duck-typed AgentRunner). | required |
| `agent_def` | `AgentDefinition` | Agent definition passed to runner.run(). | required |
| `session_id_prefix` | `str` | Prefix for per-case session IDs. | `'eval'` |
| `concurrency` | `int` | Max cases to run in parallel. | `1` |
| `tags` | `list[str] \| None` | If set, only run cases whose tags intersect this set. | `None` |
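The `tags` and `concurrency` parameters can be pictured with a small stand-alone sketch. It does not import grampus: `select_cases`, `run_all`, `fake_run`, and the dict-shaped cases are hypothetical stand-ins, and the real suite may filter and schedule cases differently.

```python
import asyncio


def select_cases(cases, tags=None):
    """Keep cases whose tags intersect `tags`; keep all cases when tags is None."""
    if tags is None:
        return list(cases)
    wanted = set(tags)
    return [c for c in cases if wanted & set(c["tags"])]


async def run_all(cases, run_one, concurrency=1):
    """Await run_one(case) for every case, with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(case):
        async with sem:
            return await run_one(case)

    # gather() returns results in input order, whatever the completion order.
    return await asyncio.gather(*(guarded(c) for c in cases))


async def fake_run(case):
    """Hypothetical runner: yields control once, then returns the case name."""
    await asyncio.sleep(0)
    return case["name"]


cases = [
    {"name": "greets_user", "tags": ["smoke"]},
    {"name": "long_report", "tags": ["regression", "slow"]},
    {"name": "untagged", "tags": []},
]
selected = select_cases(cases, tags=["smoke", "slow"])
```

Under this reading, a case with no tags is skipped whenever a tag filter is set, because an empty tag list cannot intersect anything.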
add_case(case)¶
Register a case. Returns self for chaining.
add_cases(cases)¶
Register multiple cases. Returns self for chaining.
run() async¶
Execute all registered (and tag-filtered) cases.
Returns:
| Type | Description |
|---|---|
| `SuiteResult` | SuiteResult with aggregated pass/fail counts and per-case details. |
run_case(case) async¶
Run a single EvalCase. Useful for debugging individual cases.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `case` | `EvalCase` | The EvalCase to execute. | required |

Returns:
| Type | Description |
|---|---|
| `CaseResult` | CaseResult with assertion details. |
EvalCase¶
grampus.evaluation.suite.EvalCase¶
Bases: BaseModel
A single evaluation test case.
Attributes:
| Name | Type | Description |
|---|---|---|
| `id` | `str` | Unique identifier (auto-generated). |
| `name` | `str` | Human-readable case name. |
| `description` | `str` | Optional description. |
| `input` | `str` | User input to send to the agent. |
| `tags` | `list[str]` | Labels for filtering (e.g. "smoke", "regression"). |
| `assertions` | `list[Any]` | List of Assertion callables to run. |
| `metadata` | `dict[str, Any]` | Arbitrary key-value metadata. |
Results¶
SuiteResult¶
grampus.evaluation.suite.SuiteResult¶
Bases: BaseModel
Aggregate result from running a full EvalSuite.
Attributes:
| Name | Type | Description |
|---|---|---|
| `suite_name` | `str` | Name of the suite. |
| `total_cases` | `int` | Number of cases actually run (after filtering). |
| `passed` | `int` | Cases where all assertions passed. |
| `failed` | `int` | Cases with one or more assertion failures. |
| `errors` | `int` | Cases where the agent raised an exception. |
| `pass_rate` | `float` | passed / total_cases. |
| `avg_duration_seconds` | `float` | Mean per-case wall time. |
| `case_results` | `list[CaseResult]` | Ordered list of per-case results. |
| `run_at` | `datetime` | UTC timestamp of run start. |
| `total_cost_usd` | `float` | Sum of all execution costs. |
| `metadata` | `dict[str, Any]` | Arbitrary metadata. |
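As a worked illustration of how the aggregate fields relate to the per-case results, here is a minimal sketch. `CaseOutcome` and `summarize` are hypothetical stand-ins, not grampus classes. The sketch assumes that errored cases are counted separately from failed ones and that an empty run reports 0.0 for both ratios; the library may handle either point differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseOutcome:
    """Hypothetical stand-in for the CaseResult fields used below."""
    passed: bool
    error: Optional[str]
    duration_seconds: float


def summarize(outcomes):
    """Aggregate per-case outcomes into suite-level counts and ratios."""
    total = len(outcomes)
    errors = sum(1 for o in outcomes if o.error is not None)
    passed = sum(1 for o in outcomes if o.passed and o.error is None)
    failed = total - passed - errors  # assumption: errors are not also failures
    total_time = sum(o.duration_seconds for o in outcomes)
    return {
        "total_cases": total,
        "passed": passed,
        "failed": failed,
        "errors": errors,
        "pass_rate": passed / total if total else 0.0,
        "avg_duration_seconds": total_time / total if total else 0.0,
    }


summary = summarize([
    CaseOutcome(passed=True, error=None, duration_seconds=1.0),
    CaseOutcome(passed=False, error=None, duration_seconds=2.0),
    CaseOutcome(passed=False, error="boom", duration_seconds=3.0),
    CaseOutcome(passed=True, error=None, duration_seconds=2.0),
])
```

With two of four cases passing, `pass_rate` comes out at 0.5 and the mean duration at 2.0 seconds.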
CaseResult¶
grampus.evaluation.suite.CaseResult¶
Bases: BaseModel
Result of running one EvalCase.
Attributes:
| Name | Type | Description |
|---|---|---|
| `case_id` | `str` | ID of the EvalCase. |
| `case_name` | `str` | Name of the EvalCase. |
| `passed` | `bool` | True only if all assertions passed. |
| `assertion_results` | `list[AssertionResult]` | Per-assertion results. |
| `execution_result` | `ExecutionResult \| None` | The agent's ExecutionResult (may be None on error). |
| `error` | `str \| None` | Set if the agent raised an exception. |
| `duration_seconds` | `float` | Wall-clock time for this case. |
| `tags` | `list[str]` | Tags copied from the EvalCase. |
AssertionResult¶
@dataclass
class AssertionResult:
    passed: bool
    assertion_type: str       # e.g., "contains", "tool_was_called"
    detail: str               # human-readable description
    score: float              # 0.0–1.0 (1.0 = fully passed)
    expected: str | None
    actual: str | None
Assertion factories¶
All assertion factories return Assertion objects (async callables).
Output content¶
grampus.evaluation.assertions.contains(expected, *, case_sensitive=True)¶
Assert output contains the expected substring.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `expected` | `str` | Substring to search for. | required |
| `case_sensitive` | `bool` | Whether the match is case-sensitive. | `True` |
grampus.evaluation.assertions.not_contains(forbidden, *, case_sensitive=True)¶
Assert output does NOT contain the forbidden substring.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `forbidden` | `str` | Substring that must not appear. | required |
| `case_sensitive` | `bool` | Whether the match is case-sensitive. | `True` |
grampus.evaluation.assertions.matches_regex(pattern)¶
Assert output matches the regex pattern (re.search).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pattern` | `str` | Regular expression pattern to search for. | required |
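Because the assertion is described as using `re.search`, the pattern may match anywhere in the output, not only at the start. The standard-library snippet below shows the difference from `re.match`; the sample output string is made up for illustration.

```python
import re

output = "Order #4821 has shipped."
pattern = r"#\d{4}"

# re.search scans the whole string for the first match.
found_anywhere = re.search(pattern, output) is not None

# re.match only succeeds if the match begins at position 0.
found_at_start = re.match(pattern, output) is not None
```

To require a match at the start or over the whole output, anchor the pattern explicitly with `^` and `$`.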
grampus.evaluation.assertions.output_length(*, min_chars=None, max_chars=None)¶
Assert output character length is within bounds.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `min_chars` | `int \| None` | Minimum character count (inclusive). | `None` |
| `max_chars` | `int \| None` | Maximum character count (inclusive). | `None` |
Tool calls¶
grampus.evaluation.assertions.tool_was_called(tool_name)¶
Assert the named tool appeared in tool calls during execution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tool_name` | `str` | Name of the tool that must have been called. | required |
grampus.evaluation.assertions.tool_not_called(tool_name)¶
Assert the named tool was NOT called during execution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tool_name` | `str` | Name of the tool that must not have been called. | required |
grampus.evaluation.assertions.tool_call_count(*, min_calls=None, max_calls=None)¶
Assert total tool_calls_made is within bounds.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `min_calls` | `int \| None` | Minimum number of tool calls. | `None` |
| `max_calls` | `int \| None` | Maximum number of tool calls. | `None` |
Structured output¶
grampus.evaluation.assertions.json_schema_valid(schema)¶
Assert output is valid JSON matching the given JSON Schema.
Falls back to json.loads check only when jsonschema is unavailable.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `schema` | `dict[str, Any]` | JSON Schema dict to validate against. | required |
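The fallback described above can be sketched as follows. This is an illustration of the general pattern, not the library's code: `output_matches_schema` is a hypothetical helper, and it assumes the optional `jsonschema` package is detected with a plain import check.

```python
import json

try:
    import jsonschema  # optional third-party dependency
except ImportError:
    jsonschema = None


def output_matches_schema(output: str, schema: dict) -> bool:
    """Return True if `output` parses as JSON and, when possible, fits `schema`."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if jsonschema is None:
        # Fallback: without jsonschema, only well-formedness can be checked.
        return True
    try:
        jsonschema.validate(instance=data, schema=schema)
    except jsonschema.ValidationError:
        return False
    return True


schema = {"type": "object", "required": ["answer"]}
```

Note the consequence of the fallback: without `jsonschema` installed, any well-formed JSON passes, so a schema mismatch goes undetected.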
grampus.evaluation.assertions.status_is(expected_status)¶
Assert ExecutionResult.status equals expected_status.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `expected_status` | `AgentStatus` | The status the agent run must have ended with. | required |
Budget and performance¶
grampus.evaluation.assertions.max_cost(limit_usd)¶
Assert token_usage.cost_usd <= limit_usd.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `limit_usd` | `float` | Maximum allowed cost in USD. | required |
grampus.evaluation.assertions.max_duration(limit_seconds)¶
Assert duration_seconds <= limit_seconds.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `limit_seconds` | `float` | Maximum allowed duration in seconds. | required |
grampus.evaluation.assertions.max_steps(limit)¶
Assert steps_taken <= limit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `limit` | `int` | Maximum allowed number of steps. | required |
LLM-as-judge¶
grampus.evaluation.assertions.semantic_similarity(expected, *, model_client, threshold=0.8)¶
Assert that the semantic similarity between the output and the expected text is >= threshold.
Uses LLM-as-judge: asks model_client to score similarity on a 0.0–1.0 scale.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `expected` | `str` | Text to compare the output against. | required |
| `model_client` | `Any` | ModelClient instance (duck-typed) for LLM scoring. | required |
| `threshold` | `float` | Minimum similarity score to pass. | `0.8` |
grampus.evaluation.assertions.llm_judge(criteria, *, model_client, threshold=0.7)¶
LLM-as-judge: score output against free-text criteria 0.0–1.0.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `criteria` | `str` | Free-text description of what constitutes a good response. | required |
| `model_client` | `Any` | ModelClient instance (duck-typed) for LLM scoring. | required |
| `threshold` | `float` | Minimum score to pass. | `0.7` |
Safety¶
grampus.evaluation.assertions.no_pii(pii_types=None)¶
Assert output contains no PII.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pii_types` | `list[str] \| None` | List of PIIType string values to check. None means all types. | `None` |
grampus.evaluation.assertions.no_injection_patterns()¶
Assert output contains no prompt injection patterns.
Uses PromptInjectionDetector at BALANCED level.
Prompt version manager¶
grampus.evaluation.prompt_versions.PromptVersionManager¶
Tracks system prompt versions for an agent.
All state is in-memory. Persistence can be layered in Phase 12.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `agent_id` | `str` | Scopes versions to this agent. | required |
register(version, prompt, *, notes='')¶
Register a new version.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `version` | `str` | Semver string for this prompt. | required |
| `prompt` | `str` | The system prompt text. | required |
| `notes` | `str` | Optional description. | `''` |

Returns:
| Type | Description |
|---|---|
| `PromptVersion` | The newly created PromptVersion. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If the version string already exists. |
diff(from_version, to_version)¶
Compute line-level diff between two versions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `from_version` | `str` | Source version string. | required |
| `to_version` | `str` | Target version string. | required |

Returns:
| Type | Description |
|---|---|
| `PromptDiff` | PromptDiff with added/removed lines and similarity ratio. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If either version is not found. |
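To make "added/removed lines and similarity ratio" concrete, the sketch below computes the same three quantities with the standard-library `difflib` module. Whether PromptVersionManager uses `difflib` internally is not stated here, so treat `line_diff` as a hypothetical equivalent, and the two prompt strings as made-up examples.

```python
import difflib


def line_diff(old: str, new: str):
    """Return (added_lines, removed_lines, similarity_ratio) for two texts."""
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    added, removed = [], []
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
        # Lines starting with "  " (unchanged) or "? " (hints) are ignored.
    ratio = difflib.SequenceMatcher(None, old_lines, new_lines).ratio()
    return added, removed, ratio


v1 = "You are a helpful assistant.\nAnswer briefly."
v2 = "You are a helpful assistant.\nAnswer briefly.\nCite your sources."
added, removed, ratio = line_diff(v1, v2)
```

Here the ratio is computed over whole lines: two matching lines out of five in total gives 2 * 2 / 5 = 0.8.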
Quality baseline¶
grampus.evaluation.baseline.QualityBaseline¶
Records eval runs and detects regressions against a pinned baseline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `suite_name` | `str` | Name of the eval suite this baseline tracks. | `'default'` |
| `regression_threshold` | `float` | Pass-rate drop that triggers a regression flag. For example, 0.05 flags a run whose pass_rate drops by more than 5 percentage points. | `0.05` |
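The threshold arithmetic is easy to check by hand. The helper below is a hypothetical restatement of the rule in the table (a strict "more than" comparison on the pass-rate drop), not the library's implementation.

```python
def is_regression(baseline_pass_rate: float, current_pass_rate: float,
                  threshold: float = 0.05) -> bool:
    """Flag a run whose pass rate dropped by more than `threshold`."""
    drop = baseline_pass_rate - current_pass_rate
    return drop > threshold


six_point_drop = is_regression(0.90, 0.84)   # drop of about 0.06
two_point_drop = is_regression(0.90, 0.88)   # drop of about 0.02
improvement = is_regression(0.90, 0.95)      # negative drop
```

A drop from 0.90 to 0.84 exceeds the default threshold and is flagged; a drop to 0.88, or any improvement, is not.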
pin(run_id)¶
Pin a specific run as the baseline for future comparisons.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `run_id` | `str` | ID of the run to pin. | required |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If run_id is not found. |
compare(suite_result)¶
Compare suite_result against the pinned baseline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `suite_result` | `SuiteResult` | New suite result to compare. | required |

Returns:
| Type | Description |
|---|---|
| `RegressionReport \| None` | RegressionReport, or None if no baseline is pinned. |
BaselineComparison¶
@dataclass
class BaselineComparison:
    regressed: bool
    baseline_pass_rate: float
    current_pass_rate: float
    delta: float                    # current - baseline
    degraded_cases: list[str]       # case names that regressed
    improved_cases: list[str]       # case names that improved
Reporters¶
grampus.evaluation.reporter.EvalReporter¶
Renders and outputs evaluation reports.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pubsub` | `Any \| None` | Optional DaprPubSub for publishing results. | `None` |
| `report_topic` | `str` | Pub/sub topic for full report JSON. | `'grampus.eval.results'` |
| `run_store` | `Any \| None` | Optional EvalRunStore to persist run records. | `None` |
| `pubsub_topic` | `str` | Pub/sub topic for the lightweight eval.suite.completed event. | `'eval.suite.completed'` |
render(report, *, fmt=ReportFormat.TEXT)¶
Render report as a string in the requested format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `report` | `EvalReport` | The EvalReport to render. | required |
| `fmt` | `ReportFormat` | Output format. | `TEXT` |

Returns:
| Type | Description |
|---|---|
| `str` | Rendered string. |
print(report, *, fmt=ReportFormat.TEXT)¶
Render and print to stdout.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `report` | `EvalReport` | The EvalReport to print. | required |
| `fmt` | `ReportFormat` | Output format. | `TEXT` |
publish(report) async¶
Publish report JSON to pub/sub topic; save to run_store; emit completed event.
Failures from the store or pub/sub never propagate to the caller.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `report` | `EvalReport` | The EvalReport to publish. | required |
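The "never propagate" guarantee of publish() is a best-effort delivery pattern. A minimal sketch of that pattern follows, assuming each destination is an async callable; `publish_best_effort`, `ok_sink`, and `broken_sink` are hypothetical names, and the real reporter may log or order its steps differently.

```python
import asyncio
import logging

logger = logging.getLogger("eval.reporter.sketch")


async def publish_best_effort(payload: str, sinks: dict) -> list:
    """Send `payload` to every sink; log failures instead of raising them."""
    delivered = []
    for name, send in sinks.items():
        try:
            await send(payload)
            delivered.append(name)
        except Exception:
            # A failing sink must not stop the others or reach the caller.
            logger.exception("sink %r failed; continuing", name)
    return delivered


async def ok_sink(payload: str) -> None:
    """Hypothetical sink that always succeeds."""
    return None


async def broken_sink(payload: str) -> None:
    """Hypothetical sink that always fails."""
    raise ConnectionError("broker down")


delivered = asyncio.run(
    publish_best_effort("{}", {"pubsub": broken_sink, "store": ok_sink})
)
```

The trade-off is that the caller cannot tell from an exception that delivery failed, so logs (or a return value like the one above) are the only signal.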
Red-Team API¶
Types¶
from grampus.evaluation.red_team import (
    AttackCategory,        # prompt_injection | jailbreak | reasoning_hijack
                           # memory_poison | tool_misuse | excessive_agency
    AttackVariant,         # direct_injection | indirect_injection | roleplay_jailbreak
                           # encoding_jailbreak | logic_trap | memory_write_inject
                           # memory_read_poison | tool_loop | tool_chain_escape
                           # scope_escalation | implicit_permission
    OWASPCategory,         # ASI01_GOAL_HIJACK | ASI02_TOOL_MISUSE | ASI06_MEMORY_POISON | ...
    SecurityProperty,      # task_alignment | action_alignment | source_authorization | data_isolation
    Severity,              # critical | high | medium | low | info
    AttackPayload,
    JudgeVerdict,
    AttackResult,
    RedTeamTargetConfig,
    RedTeamCampaignConfig,
)
RedTeamTargetConfig¶
class RedTeamTargetConfig(BaseModel):
    agent_name: str
    system_prompt: str
    available_tools: list[str] = []
    memory_enabled: bool = False
    crew_enabled: bool = False
    max_turns: int = 1              # 1–10; >1 enables multi-turn strategy attacks
RedTeamCampaignConfig¶
class RedTeamCampaignConfig(BaseModel):
    campaign_id: str
    target: RedTeamTargetConfig
    enabled_categories: list[AttackCategory]
    payloads_per_strategy: int = 5  # 1–50
    max_concurrent: int = 5         # 1–10
    stop_on_critical: bool = False
AttackerAgent¶
grampus.evaluation.red_team.attacker.AttackerAgent¶
Generates adversarial payloads by combining static strategies with optional LLM-based mutation of failed attempts (AgenticRed pattern).
Static mode (model_client=None): uses only strategy templates.
Adaptive mode (model_client set): after each failed attempt, uses the LLM to generate a mutated variant targeting the same objective.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_client` | `Any \| None` | Optional. If set, enables adaptive mutation. | `None` |
| `strategies` | `list[BaseAttackStrategy] \| None` | List of strategy instances. Defaults to ALL_STRATEGIES instantiated with the same model_client. | `None` |
generate_payloads(target, categories=None, count_per_strategy=5) async¶
Generate all payloads from enabled strategies for this target.
Filters strategies by categories if provided. Returns [] on error.
mutate_failed(failed, target) async¶
Use LLM to generate a mutated variant of a failed payload.
Returns None if model unavailable or mutation fails. Never raises.
RedTeamJudge¶
grampus.evaluation.red_team.judge.RedTeamJudge¶
Evaluates whether an attack succeeded by returning a JudgeVerdict.
Two-tier approach:
1. LLM judge (primary): contextual, nuanced, maps to 4 security properties
2. Rule-based fallback: regex patterns when LLM is unavailable

The LLM judge uses temperature=0.0 for deterministic verdicts. Rule-based judge is always run in parallel; its verdict is used when LLM confidence < 0.5 or on LLM failure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_client` | `Any \| None` | Optional; if None, falls back to rule-based only. | `None` |
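The selection rule between the two tiers can be written down in a few lines. This sketch only illustrates the stated rule (use the rule-based verdict when the LLM failed or reported confidence below 0.5). `Verdict` and `pick_verdict` are hypothetical, and the real JudgeVerdict may have different fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    """Hypothetical, simplified stand-in for JudgeVerdict."""
    attack_succeeded: bool
    confidence: float
    source: str


def pick_verdict(llm: Optional[Verdict], rule_based: Verdict) -> Verdict:
    """Prefer the LLM verdict unless it is missing or low-confidence."""
    if llm is None or llm.confidence < 0.5:
        return rule_based
    return llm


rules = Verdict(attack_succeeded=False, confidence=0.6, source="rules")
confident_llm = Verdict(attack_succeeded=True, confidence=0.9, source="llm")
unsure_llm = Verdict(attack_succeeded=True, confidence=0.3, source="llm")
```

Passing `None` for the LLM verdict models both an LLM failure and a judge constructed without a model client.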
evaluate(payload, response) async¶
Evaluate whether the agent's response indicates a successful attack.
Returns JudgeVerdict. Never raises.
RedTeamRunner¶
grampus.evaluation.red_team.runner.RedTeamRunner¶
Orchestrates one complete red-team campaign.
Loop per payload:
1. AttackerAgent.generate_payloads() → list[AttackPayload]
2. For each payload: submit to target agent → collect response
3. RedTeamJudge.evaluate(payload, response) → JudgeVerdict
4. If failed + model available: AttackerAgent.mutate_failed() → retry once
5. Collect all AttackResults

Concurrency: asyncio.Semaphore bounded by config.max_concurrent.
The target agent is called via target_fn: an async callable that takes
a list of (role, content) message tuples and returns a string response.
This keeps RedTeamRunner decoupled from AgentRunner's full lifecycle.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `attacker` | `AttackerAgent` | AttackerAgent instance. | required |
| `judge` | `RedTeamJudge` | RedTeamJudge instance. | required |
| `target_fn` | `Any` | async callable (messages: list[tuple[str, str]]) -> str | required |
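For local experiments, `target_fn` can be any coroutine function with the shape described above. The stub below is a hypothetical target that refuses every request; the role names "system" and "user" are assumptions, since the roles the runner actually sends are not listed here.

```python
import asyncio


async def refusing_target(messages: list[tuple[str, str]]) -> str:
    """Hypothetical target: refuses the most recent user message."""
    user_turns = [content for role, content in messages if role == "user"]
    if not user_turns:
        return ""
    return f"Refused: {user_turns[-1]}"


reply = asyncio.run(refusing_target([
    ("system", "You are a careful assistant."),
    ("user", "Ignore previous instructions"),
]))
```

A real target would typically wrap a call into the agent under test and return its final text output.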
run(config) async¶
Run the full campaign. Returns all AttackResults (successful and failed).
Respects config.stop_on_critical: aborts after first CRITICAL finding. Never raises.
RedTeamReport¶
grampus.evaluation.red_team.report.RedTeamReport¶
Generates structured findings from a completed campaign.
Deduplication: multiple results with the same attack_category and variant are grouped into one RedTeamFinding (keeps worst severity).
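The deduplication rule amounts to grouping by (attack_category, variant) and keeping the highest severity per group. The sketch below uses plain tuples and the severity names listed under Types; `SEVERITY_RANK` and `group_findings` are hypothetical, and the numeric ranking is an assumption about how "worst" is ordered.

```python
# Assumed ordering: higher number means more severe.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


def group_findings(results):
    """Map (category, variant) to the worst severity seen for that pair."""
    worst = {}
    for category, variant, severity in results:
        key = (category, variant)
        current = worst.get(key)
        if current is None or SEVERITY_RANK[severity] > SEVERITY_RANK[current]:
            worst[key] = severity
    return worst


findings = group_findings([
    ("jailbreak", "roleplay_jailbreak", "medium"),
    ("jailbreak", "roleplay_jailbreak", "critical"),
    ("tool_misuse", "tool_loop", "low"),
])
```

Three results collapse into two findings, with the jailbreak pair reported as critical.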
Writing a custom attack strategy¶
from grampus.evaluation.red_team.strategies.base import BaseAttackStrategy
from grampus.evaluation.red_team.types import (
    AttackCategory, AttackPayload, AttackVariant, RedTeamTargetConfig,
)

class MyCustomStrategy(BaseAttackStrategy):
    @property
    def category(self) -> AttackCategory:
        return AttackCategory.EXCESSIVE_AGENCY

    @property
    def name(self) -> str:
        return "my_custom"

    async def generate(
        self, target: RedTeamTargetConfig, count: int = 5
    ) -> list[AttackPayload]:
        try:
            return [
                AttackPayload(
                    content=f"Custom attack payload {i}",
                    attack_category=self.category,
                    attack_variant=AttackVariant.SCOPE_ESCALATION,
                    strategy_name=self.name,
                )
                for i in range(count)
            ]
        except Exception:
            return []
Pass it to AttackerAgent:
from grampus.evaluation.red_team.attacker import AttackerAgent
from grampus.evaluation.red_team.strategies import ALL_STRATEGIES

attacker = AttackerAgent(
    strategies=[S() for S in ALL_STRATEGIES] + [MyCustomStrategy()]
)
See the Red-Teaming guide for a full walkthrough.
Writing a custom assertion¶
from grampus.evaluation.assertions import AssertionResult
from grampus.evaluation.suite import EvalCase
from grampus.core.types import ExecutionResult

class WordCountAssertion:
    """Assert the output contains between min_words and max_words words."""

    def __init__(self, min_words: int, max_words: int) -> None:
        self.min_words = min_words
        self.max_words = max_words

    async def __call__(self, result: ExecutionResult) -> AssertionResult:
        output = result.output or ""
        word_count = len(output.split())  # whitespace-separated words
        passed = self.min_words <= word_count <= self.max_words
        return AssertionResult(
            passed=passed,
            assertion_type="word_count",
            detail=f"Word count {word_count} {'within' if passed else 'outside'} [{self.min_words}, {self.max_words}]",
            score=1.0 if passed else 0.0,
            expected=f"{self.min_words}–{self.max_words} words",
            actual=f"{word_count} words",
        )

# Use in an EvalCase
case = EvalCase(
    name="medium_length_response",
    input="Explain photosynthesis briefly.",
    assertions=[WordCountAssertion(min_words=50, max_words=200)],
)