Prompt Playground¶

grampus playground is an interactive CLI for testing prompts, comparing model responses, and iterating on system prompts before wiring an agent into production. Think of it as a REPL for prompt engineering — with cost tracking, session history, and direct export to eval cases.

Quick start¶

$ grampus playground start --model claude-haiku-4-5

Grampus Prompt Playground
Model: claude-haiku-4-5  |  Cost: $0.0000
Type /help for commands. Ctrl+C to exit.

[claude-haiku-4-5] > What is the capital of France?

╭─── claude-haiku-4-5 ──────────────────────────────────────────────────╮
│ Paris is the capital of France.                                         │
╰─ ↑45 ↓8 tokens · $0.0001 · 0.8s ─────────────────────────────────────╯

[claude-haiku-4-5] >

Responses stream in real time. Each response shows input/output token count, cost, and latency.

REPL commands¶

Command	Description
`/model <name>`	Switch to a different model for subsequent turns
`/models`	List available model names from configured providers
`/system <text>`	Set the system prompt for this session
`/system file:<path>`	Load a system prompt from a local file
`/compare <model2> [model3]`	Run the last user message against additional models concurrently
`/cost`	Show accumulated cost for this session
`/reset`	Clear conversation history (system prompt preserved)
`/save [name]`	Save the current session to `~/.grampus/playground/`
`/load <name>`	Load a previously saved session
`/sessions`	List all saved sessions
`/export [path]`	Export the last turn as an `EvalCase` JSON file
`/version save <name>`	Save the current system prompt as a versioned entry
`/version diff <v1> <v2>`	Diff two saved system prompt versions
`/help`	Show all available commands
`/exit`	Exit the REPL

One-shot mode¶

Run a single prompt and exit — useful in scripts:

$ grampus playground run "What is the capital of France?" --model gpt-4o-mini
Paris is the capital of France.
↑52 ↓5 tokens · $0.00003 · 0.3s

Options:

Flag	Default	Description
`--model TEXT`	`claude-haiku-4-5`	Model to use
`--system TEXT`	`None`	System prompt string
`--no-stream`	`False`	Disable streaming (wait for full response)

Comparing models¶

Test the same prompt across multiple models simultaneously to compare quality and cost:

$ grampus playground compare "Explain async/await in Python in one paragraph" \
    --models claude-haiku-4-5,gpt-4o-mini,llama3.2

Running on 3 models concurrently...

╭─── claude-haiku-4-5 ─────────────────────────────────────────────────────╮
│ async/await is Python's syntax for writing asynchronous code that runs    │
│ concurrently without blocking...                                           │
╰─ ↑62 ↓89 tokens · $0.0001 · 0.9s ────────────────────────────────────────╯

╭─── gpt-4o-mini ──────────────────────────────────────────────────────────╮
│ In Python, async/await enables non-blocking I/O operations by allowing    │
│ functions to pause execution while waiting for results...                  │
╰─ ↑62 ↓94 tokens · $0.0001 · 0.7s ────────────────────────────────────────╯

╭─── llama3.2 (ollama) ────────────────────────────────────────────────────╮
│ The async/await pattern lets Python programs handle multiple operations    │
│ simultaneously by yielding control during I/O waits...                     │
╰─ ↑62 ↓81 tokens · $0.0000 · 1.2s ────────────────────────────────────────╯

Total cost: $0.0002  |  Fastest: gpt-4o-mini (0.7s)  |  Cheapest: llama3.2 ($0.00)

You can also run /compare gpt-4o-mini llama3.2 from inside the REPL to compare the most recent message against other models without retyping the prompt.

Saving and reusing sessions¶

Sessions save everything: conversation history, system prompt, model choice, and cost summary.

# From inside the REPL
[claude-haiku-4-5] > /save python-tutor
Session saved: python-tutor

# List saved sessions
$ grampus playground sessions
NAME             MODEL              TURNS  COST      SAVED
python-tutor     claude-haiku-4-5   8      $0.0012   2026-06-01 14:22
billing-tests    gpt-4o-mini        3      $0.0003   2026-05-30 09:15

# Resume a session
$ grampus playground start --load python-tutor

Sessions are stored as JSON files in ~/.grampus/playground/. They are portable — you can commit them to your repository as regression fixtures.

Exporting to eval cases¶

Any turn can be exported as an EvalCase JSON that feeds directly into an EvalSuite:

# Inside the REPL, after getting a response you want to test
[claude-haiku-4-5] > What is the capital of Brazil?
╭─── claude-haiku-4-5 ─────────╮
│ Brasília is the capital ...   │
╰─ ↑42 ↓12 tokens · $0.0001 ───╯

[claude-haiku-4-5] > /export cases/capital_brazil.json
Exported: cases/capital_brazil.json

The exported file is an EvalCase with the user message pre-filled and a contains assertion based on the observed response:

{
  "name": "capital_brazil",
  "input": "What is the capital of Brazil?",
  "assertions": [
    {"type": "contains", "expected": "Brasília"}
  ]
}

Load it in your eval suite:

import json
from grampus.evaluation.suite import EvalCase, EvalSuite

with open("cases/capital_brazil.json") as f:
    data = json.load(f)

case = EvalCase(**data)
suite.add_case(case)