Evaluation Guide¶
What you'll learn¶
- Why eval-driven development is essential for agents
- Structure of
EvalCaseandEvalSuite - All 16 assertion types with examples
- Prompt version management
- Quality baselines and regression detection
- CI integration with JUnit XML output
Why eval-driven development?¶
Unit tests verify that code runs correctly. Eval suites verify that agents behave correctly. An agent can pass all its unit tests and still:
- Hallucinate tool arguments
- Fail to use a required tool
- Produce output that violates safety policies
- Cost 3× more than expected
- Regress when the system prompt is tweaked
Eval suites catch all of these.
EvalCase structure¶
from grampus.evaluation.suite import EvalCase
from grampus.evaluation.assertions import contains, tool_was_called, max_cost
case = EvalCase(
name="research_uses_search",
description="Research agent must call web_search before answering",
input="What is the capital of Brazil?",
tags=["smoke", "regression"],
assertions=[
tool_was_called("web_search"),
contains("Brasília"),
max_cost(0.05),
],
)
| Field | Type | Description |
|---|---|---|
name |
str |
Unique case identifier |
description |
str |
Human-readable description |
input |
str |
User message sent to the agent |
tags |
list[str] |
Used for filtering: smoke, regression, safety, etc. |
assertions |
list[Assertion] |
Checks run against the ExecutionResult |
EvalSuite¶
from grampus.evaluation.suite import EvalSuite
suite = EvalSuite(
name="research-agent-suite",
agent_runner=runner,
agent_def=agent_def,
session_id_prefix="eval", # sessions: eval-0, eval-1, ...
concurrency=4, # run 4 cases in parallel
tags=["smoke"], # run only cases tagged "smoke" (None = run all)
)
suite.add_case(case)
suite.add_cases([case2, case3, case4])
# Chain style
suite.add_case(case5).add_case(case6)
result = await suite.run()
Reading results¶
print(f"Suite: {result.suite_name}")
print(f"Pass rate: {result.pass_rate:.0%} ({result.passed}/{result.total_cases})")
print(f"Total cost: ${result.total_cost_usd:.4f}")
print(f"Avg duration: {result.avg_duration_seconds:.2f}s")
for case_result in result.case_results:
status = "PASS" if case_result.passed else "FAIL"
print(f" [{status}] {case_result.case_name}")
if not case_result.passed:
for ar in case_result.assertion_results:
if not ar.passed:
print(f" {ar.assertion_type}: {ar.detail}")
All 16 assertion types¶
Output content¶
contains(expected, *, case_sensitive=True)
from grampus.evaluation.assertions import contains
contains("Brasília") # output must contain "Brasília"
contains("brasília", case_sensitive=False) # case-insensitive match
not_contains(forbidden, *, case_sensitive=True)
from grampus.evaluation.assertions import not_contains
not_contains("I don't know") # agent must not say this
not_contains("Error")
matches_regex(pattern)
from grampus.evaluation.assertions import matches_regex
matches_regex(r"\d{4}-\d{2}-\d{2}") # output contains a date
matches_regex(r"https?://\S+") # output contains a URL
output_length(*, min_chars=None, max_chars=None)
from grampus.evaluation.assertions import output_length
output_length(min_chars=100) # at least 100 chars
output_length(max_chars=2000) # at most 2000 chars
output_length(min_chars=50, max_chars=500)
Tool calls¶
tool_was_called(tool_name)
from grampus.evaluation.assertions import tool_was_called
tool_was_called("web_search") # web_search must have been called
tool_not_called(tool_name)
from grampus.evaluation.assertions import tool_not_called
tool_not_called("delete_file") # delete_file must NOT have been called
tool_call_count(*, min_calls=None, max_calls=None)
from grampus.evaluation.assertions import tool_call_count
tool_call_count(min_calls=1) # at least one tool call
tool_call_count(max_calls=5) # no more than 5 tool calls
tool_call_count(min_calls=2, max_calls=4)
Structured output¶
json_schema_valid(schema)
from grampus.evaluation.assertions import json_schema_valid
json_schema_valid({
"type": "object",
"required": ["answer", "sources"],
"properties": {
"answer": {"type": "string"},
"sources": {"type": "array", "items": {"type": "string"}},
},
})
status_is(expected_status)
from grampus.core.types import AgentStatus
from grampus.evaluation.assertions import status_is
status_is(AgentStatus.COMPLETED) # agent must not have failed
Budget and performance¶
max_cost(limit_usd)
max_duration(limit_seconds)
from grampus.evaluation.assertions import max_duration
max_duration(30.0) # must complete in under 30 seconds
max_steps(limit)
from grampus.evaluation.assertions import max_steps
max_steps(5) # agent must complete in 5 or fewer iterations
LLM-as-judge¶
semantic_similarity(expected, *, model_client, threshold=0.8)
Use this when exact string matching is too brittle:
from grampus.evaluation.assertions import semantic_similarity
semantic_similarity(
expected="Brasília is the capital of Brazil, founded in 1960.",
model_client=client,
threshold=0.8, # 0.8 cosine similarity required
)
llm_judge(criteria, *, model_client, threshold=0.7)
Ask a second LLM to score the output against free-text criteria:
from grampus.evaluation.assertions import llm_judge
llm_judge(
criteria=(
"The response must: (1) name the correct capital city, "
"(2) cite at least one source, (3) be written in a professional tone."
),
model_client=client,
threshold=0.7, # LLM judge must score >= 0.7/1.0
)
Safety assertions¶
no_pii(pii_types=None)
from grampus.evaluation.assertions import no_pii
no_pii() # no PII of any type in output
no_pii(pii_types=["email", "phone"]) # no email or phone in output
no_injection_patterns()
from grampus.evaluation.assertions import no_injection_patterns
no_injection_patterns() # output contains no prompt injection patterns
Prompt version management¶
from grampus.evaluation.prompt_versions import PromptVersionManager
manager = PromptVersionManager(state_store=state_store)
# Register a new version
await manager.register(
agent_name="research-agent",
version="v2",
system_prompt="You are a research assistant. Always cite sources with URLs.",
notes="Added URL citation requirement",
)
# List all versions
versions = await manager.list_versions("research-agent")
for v in versions:
print(f" {v.version}: {v.notes}")
# Diff two versions
diff = await manager.diff("research-agent", from_version="v1", to_version="v2")
print(diff)
# A/B test: run the same eval suite against both versions
result_v1 = await suite_v1.run()
result_v2 = await suite_v2.run()
print(f"v1 pass rate: {result_v1.pass_rate:.0%}")
print(f"v2 pass rate: {result_v2.pass_rate:.0%}")
# Pin the winning version
await manager.set_active(agent_name="research-agent", version="v2")
# Rollback if needed
await manager.set_active(agent_name="research-agent", version="v1")
Quality baselines¶
Pin a baseline score and detect regressions automatically:
from grampus.evaluation.baseline import QualityBaseline
baseline = QualityBaseline(state_store=state_store, agent_name="research-agent")
# Establish baseline after a known-good run
result = await suite.run()
await baseline.pin(suite_result=result)
print(f"Baseline pinned: {result.pass_rate:.0%}")
# On subsequent runs, compare against baseline
new_result = await suite.run()
comparison = await baseline.compare(new_result)
if comparison.regressed:
print(f"REGRESSION: pass rate dropped from {comparison.baseline_pass_rate:.0%} "
f"to {comparison.current_pass_rate:.0%}")
print(f"Degraded cases: {comparison.degraded_cases}")
else:
print(f"No regression detected ({new_result.pass_rate:.0%})")
Reporters¶
Streaming eval assertions¶
Standard eval assertions test a completed ExecutionResult. Streaming eval assertions go further — they measure how a response streams, catching latency regressions and stall conditions that only appear at runtime.
import asyncio
from grampus.evaluation.streaming import (
StreamingEvalCase,
StreamingEvalSuite,
chunk_count_between,
first_token_within,
min_throughput,
no_repetition,
no_stall,
stream_contains,
stream_output_length,
token_usage_reported,
)
async def main() -> None:
suite = StreamingEvalSuite(runner=runner)
suite.add_case(
StreamingEvalCase(
name="fast-response",
user_message="What is 2 + 2?",
assertions=[
first_token_within(seconds=2.0),
no_stall(max_gap_seconds=5.0),
min_throughput(tokens_per_second=10.0),
stream_contains("4"),
no_repetition(window=20),
],
)
)
suite.add_case(
StreamingEvalCase(
name="detailed-explanation",
user_message="Explain the water cycle.",
assertions=[
first_token_within(seconds=3.0),
stream_output_length(min_chars=200, max_chars=2000),
no_stall(max_gap_seconds=8.0),
chunk_count_between(min_count=5, max_count=200),
token_usage_reported(),
],
)
)
results = await suite.run()
print(f"Pass rate: {results.pass_rate:.0%} ({results.passed}/{results.total_cases})")
asyncio.run(main())
For the full list of streaming assertions and parameters, see the Streaming guide.
CI integration¶
Gate deployments on eval pass rate:
# Fail the pipeline if pass rate drops below 90%
grampus eval eval_suite.py --format junit --output results.xml --fail-under 0.9
echo $? # 0 if passed, 1 if below threshold
In GitHub Actions:
- name: Run eval suite
run: grampus eval tests/eval_suite.py --format junit --output results.xml --fail-under 0.9
- name: Publish test results
uses: EnricoMi/publish-unit-test-result-action@v2
if: always()
with:
files: results.xml
Next steps¶
- Evaluation API reference → — Full
EvalSuiteand assertion reference - Safety guide → — Write injection and PII safety assertions
- Observability guide → — Correlate eval runs with OTEL traces