crewship eval
Two distinct evaluation surfaces share the crewship eval namespace:
- Mission-level: replay a single mission deterministically or regression-diff two missions. Uses the Quartermaster engine. See Quartermaster guide.
- Routine-level: sweep the workspace’s eval-* routines × tier matrix, head-to-head tier compare, save/diff regression baselines. Built on top of the Routines runtime. See Routines guide and Routines cookbook.
crewship eval <subcommand> [flags]
Mission-level replay and regression enqueue work asynchronously and return a run_id (poll completion via crewship eval runs). The runs subcommand itself is a synchronous list query against existing eval runs. Routine-level subcommands all run synchronously and emit results inline.
Subcommands
| Group | Command | Description |
|---|---|---|
| Mission | replay <mission-id> | Replay a mission deterministically. |
| Mission | regression <baseline-id> <candidate-id> | Regression-diff two missions. |
| Mission | runs | List recent eval runs. |
| Routine | scenarios | Batch-run eval-* routines × tier list × N runs, output pass-rate matrix. |
| Routine | compare <slug> | Head-to-head: run one scenario back-to-back on two tiers. |
| Routine | baseline save <name> | Snapshot the current matrix as a regression baseline. |
| Routine | baseline list | List stored baselines. |
| Routine | baseline show <name> | Inspect a stored baseline’s matrix. |
| Routine | baseline diff <name> | Re-run + diff against a stored baseline. CI-friendly (non-zero exit on regression). |
| Routine | baseline delete <name> | Remove a stored baseline. |
crewship eval replay <mission-id>
crewship eval replay MIS-42 --seed 42
| Flag | Type | Default | Description |
|---|---|---|---|
| --seed | int | 0 | Deterministic seed for the replay (0 = server default). |
Sample output:
Replay queued: run_id=er_a1b2c3d4e5f60718 status=queued
Auth: Requires OWNER or ADMIN role (403 otherwise). Mission must belong to your workspace (404 otherwise, same shape as “not found” so existence isn’t leaked).
crewship eval regression <baseline-id> <candidate-id>
crewship eval regression MIS-41 MIS-42
No flags. Both mission IDs must belong to your workspace — a partial spoof (valid baseline + foreign candidate) still 404s.
Sample output:
Regression queued: run_id=er_b2c3d4e5f6071829 status=queued
crewship eval runs
crewship eval runs
crewship eval runs --limit 20
| Flag | Type | Default | Description |
|---|---|---|---|
| --limit | int | 50 | Max rows (1-500). |
Sample output:
ID                   KIND        MISSION  BASELINE  STATUS     CREATED
er_a1b2c3d4e5f60718  replay      MIS-42             completed  2026-04-17T10:00:00Z
er_b2c3d4e5f6071829  regression  MIS-42   MIS-41    completed  2026-04-17T10:15:00Z
er_c3d4e5f607182930  replay      MIS-43             failed     2026-04-17T10:30:00Z
Status progresses queued -> running -> completed | failed. For regression runs, the result column (visible in JSON output) contains either no_regression or regressed: <delta summary> like regressed: tool success -8% cost +22%.
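The status lifecycle and the regression result string can be handled in a script roughly as follows. This is a minimal Python sketch, not part of the CLI: the helper names are hypothetical, and the delta summary is kept as an opaque string because its exact format is not specified here.

```python
from typing import Optional, Tuple

# Terminal states taken from the lifecycle above:
# queued -> running -> completed | failed.
TERMINAL_STATUSES = {"completed", "failed"}


def is_terminal(status: str) -> bool:
    """Return True once a run can no longer change state."""
    return status in TERMINAL_STATUSES


def parse_regression_result(result: str) -> Tuple[bool, Optional[str]]:
    """Split a regression result into (regressed, delta_summary).

    Assumes only the two shapes described above: "no_regression" or
    "regressed: <delta summary>". The summary is returned verbatim.
    """
    if result == "no_regression":
        return False, None
    prefix = "regressed:"
    if result.startswith(prefix):
        return True, result[len(prefix):].strip()
    raise ValueError(f"unrecognised regression result: {result!r}")
```

A polling loop would call is_terminal on each listed run's status and only inspect the result once it returns True.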
Fetch a specific run result
The CLI does not yet expose a dedicated get subcommand. For a specific run, query the API:
curl -H "Authorization: Bearer $CREWSHIP_TOKEN" \
"$CREWSHIP_URL/api/v1/eval/runs?limit=200" | jq '.rows[] | select(.id == "er_...")'
Or watch the Crew Journal for eval.run_started, eval.metric, eval.regression_detected entries correlated by the run_id in the payload.
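The curl and jq lookup above could also be written in Python. This is a sketch under the same assumptions as the jq filter: the response is JSON with a top-level rows array whose entries carry an id field. The function names fetch_runs and find_run are hypothetical.

```python
import json
import os
import urllib.request
from typing import Any, Dict, List, Optional


def fetch_runs(limit: int = 200) -> List[Dict[str, Any]]:
    """Fetch recent eval runs; mirrors the curl call above."""
    url = f"{os.environ['CREWSHIP_URL']}/api/v1/eval/runs?limit={limit}"
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {os.environ['CREWSHIP_TOKEN']}"}
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumption: rows live under a top-level "rows" key, as in the jq filter.
    return payload["rows"]


def find_run(rows: List[Dict[str, Any]], run_id: str) -> Optional[Dict[str, Any]]:
    """Return the row whose id matches run_id, or None if it is not present."""
    return next((row for row in rows if row.get("id") == run_id), None)
```

Because the lookup filters a single page of results, a run older than the most recent limit rows will come back as None.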
Routine-level eval
The routine-level commands target the eval suite seeded under the eval- slug prefix (see the Routines migration guide for the framework concept and cookbook recipe 6 for the eval-driven promotion workflow).
Unlike the mission commands, these run synchronously and emit pass/fail and stats inline.
crewship eval scenarios
Batch-run the workspace’s eval-* routines across one or more tiers and report a pass-rate matrix. This is the canonical harness for asking “do a weak and a strong agent reach the same outcome?”: the DSL and inputs are identical, and only the resolved tier differs across cells.
# Default: every eval-* routine, once, at the workspace's authored tier
crewship eval scenarios
# Targeted sweep with N runs per (scenario, tier) cell
crewship eval scenarios --scenarios eval-extract-emails,eval-classify-sentiment \
--tiers fast,smart --runs 5
# Machine-readable output for CI / regression dashboards
crewship eval scenarios --runs 3 --tiers fast,smart -f json
crewship eval scenarios --runs 3 --tiers fast,smart -f markdown
| Flag | Description |
|---|---|
| --scenarios | Comma-separated routine slugs (default: every routine whose slug starts with eval-). |
| --tiers | Comma-separated tier overrides (trivial/fast/moderate/smart). Empty = use authored complexity. |
| --runs | Runs per (scenario, tier) cell (default 1). |
| --inputs | JSON inputs forwarded to every run (default: each routine’s authored defaults). |
| --fail-fast | Abort on the first non-pass instead of running every cell. |
Output (table format): rows = scenarios, columns = tiers, cells = pass/total $avg-cost. A cell with 5/5 $0.0012 means every run passed the routine’s gate at that tier. Use the JSON or Markdown formats for downstream tooling.
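To make the cell semantics concrete, the following Python sketch parses one table cell of the form pass/total $avg-cost into its parts and derives the pass rate. It assumes the cell text looks exactly like the 5/5 $0.0012 example above. The Cell type and parse_cell helper are hypothetical, and real tooling should prefer the JSON output.

```python
import re
from dataclasses import dataclass

# Matches cells such as "5/5 $0.0012": passes, total runs, average cost.
CELL_PATTERN = re.compile(r"^(\d+)/(\d+)\s+\$(\d+(?:\.\d+)?)$")


@dataclass(frozen=True)
class Cell:
    passed: int
    total: int
    avg_cost: float

    @property
    def pass_rate(self) -> float:
        """Fraction of runs that passed the routine's gate (0.0 if no runs)."""
        return self.passed / self.total if self.total else 0.0


def parse_cell(text: str) -> Cell:
    """Parse one matrix cell; raises ValueError on any other shape."""
    match = CELL_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a pass/total $avg-cost cell: {text!r}")
    return Cell(int(match.group(1)), int(match.group(2)), float(match.group(3)))
```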
crewship eval compare <slug>
Run ONE eval scenario back-to-back on two tiers and report a head-to-head verdict. Use this when investigating a specific scenario rather than sweeping the whole suite.
crewship eval compare eval-extract-emails
crewship eval compare eval-syllogism-reasoning --tier-a moderate --tier-b smart
crewship eval compare eval-classify-sentiment -f markdown
| Flag | Default | Description |
|---|---|---|
| --tier-a | fast | Tier override for side A. |
| --tier-b | smart | Tier override for side B. |
| --inputs | | JSON inputs forwarded to both sides (default: routine’s authored defaults). |
Verdict (top of output):
| Verdict | Meaning |
|---|---|
| AGREE-PASS | Both tiers passed the routine’s gate. |
| AGREE-FAIL | Both tiers failed (regression on the routine itself, not a tier-strength signal). |
| DIVERGE-A-PASS | Only side A passed: weak-tier-only or strong-tier-only path discovered. |
| DIVERGE-B-PASS | Only side B passed. |
| AMBIGUOUS | At least one side errored at transport level, so the verdict can’t be cleanly assigned. |
--tier-a and --tier-b MUST differ — passing the same value on both errors out.
Output text identity is NOT asserted — two LLM runs are rarely identical at the byte level. The verdict is about gate-pass agreement, not text agreement.
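The verdict table can be read as a small decision function over the two gate outcomes. The Python sketch below illustrates that mapping only and is not the CLI's implementation. The Outcome enum and compare_verdict name are hypothetical, and it assumes a transport-level error on either side takes precedence over any pass/fail result.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"    # the run passed the routine's gate
    FAIL = "fail"    # the run completed but failed the gate
    ERROR = "error"  # transport-level failure, so no gate result exists


def compare_verdict(side_a: Outcome, side_b: Outcome) -> str:
    """Map two per-side outcomes to the head-to-head verdict."""
    if Outcome.ERROR in (side_a, side_b):
        return "AMBIGUOUS"
    if side_a is Outcome.PASS and side_b is Outcome.PASS:
        return "AGREE-PASS"
    if side_a is Outcome.FAIL and side_b is Outcome.FAIL:
        return "AGREE-FAIL"
    # Exactly one side passed at this point.
    return "DIVERGE-A-PASS" if side_a is Outcome.PASS else "DIVERGE-B-PASS"
```

Note that the function never looks at output text, which matches the point above: only gate-pass agreement is compared.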
crewship eval baseline
File-local snapshots of the eval matrix (~/.crewship/eval-baselines/<name>.json) plus regression diff. Per-developer / per-CI-run state — no server-side migration, no RBAC surface. Designed to be the CI gate that fails the build when a model swap or DSL refactor changes gate-pass behaviour.
baseline save <name>
Run the eval matrix once and persist it as a regression baseline.
# Snapshot current main behaviour as the baseline
crewship eval baseline save main --tiers fast,smart --runs 5
| Flag | Default | Description |
|---|---|---|
| --scenarios | (every eval-*) | Comma-separated routine slugs. |
| --tiers | (no override) | Comma-separated tier overrides. |
| --runs | 5 | Runs per (scenario, tier) cell. |
| --inputs | | JSON inputs forwarded to every run. |
Baselines record the source workspace ID at save time. diff refuses to compare across workspaces, because reusing a baseline name in another workspace would otherwise silently report bogus regressions.
Names are restricted to [A-Za-z0-9_-]{1,64} to prevent path-traversal in the on-disk file location.
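The name restriction can be illustrated with a short Python sketch that validates a name and derives the on-disk location. It uses the pattern and directory layout stated above. The baseline_path helper and its home parameter are hypothetical, added so the example can be exercised without touching a real home directory.

```python
import re
from pathlib import Path
from typing import Optional

# Allowed baseline names: 1 to 64 characters from [A-Za-z0-9_-].
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def baseline_path(name: str, home: Optional[Path] = None) -> Path:
    """Return the baseline file path for a validated name.

    fullmatch rejects anything outside the allowed alphabet, including
    "/" and ".", so the result cannot escape the baselines directory.
    """
    if NAME_PATTERN.fullmatch(name) is None:
        raise ValueError(f"invalid baseline name: {name!r}")
    root = home if home is not None else Path.home()
    return root / ".crewship" / "eval-baselines" / f"{name}.json"
```

Validating before joining the path is the important ordering: a name such as ../main never reaches the filesystem layer.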
baseline list
crewship eval baseline list
crewship eval baseline list -f json
Columns: NAME, GENERATED, SCENARIOS, TIERS, RUNS/CELL.
baseline show <name>
crewship eval baseline show main
Prints the full matrix from the stored baseline (cells = pass/total $avg-cost). Use -f json for the raw record.
baseline diff <name>
Re-run the matrix against the same scenario / tier / runs combo and diff each cell against the stored baseline. Exits non-zero on any regression beyond --tolerance.
# In CI: fail on regression beyond ±10pp pass-rate change
crewship eval baseline diff main --tiers fast,smart --runs 5
# Looser tolerance for known-flaky routines, JSON output for CI capture
crewship eval baseline diff main --tolerance 0.20 --runs 10 -f json
| Flag | Default | Description |
|---|---|---|
| --scenarios | (every eval-*) | Comma-separated routine slugs. |
| --tiers | (no override) | Comma-separated tier overrides. |
| --runs | 5 | Runs per cell. |
| --inputs | | JSON inputs forwarded to every run. |
| --tolerance | 0.10 | Pass-rate delta tolerance for the STABLE verdict (e.g. 0.10 = ±10pp). |
Per-cell verdict:
| Verdict | When |
|---|---|
| STABLE | Pass-rate delta within ±tolerance. |
| IMPROVED | Pass-rate rose by more than tolerance. |
| REGRESSION | Pass-rate dropped by more than tolerance. Triggers non-zero exit. |
| NEW | Cell exists in current run but not in baseline. |
| REMOVED | Cell exists in baseline but not in current run (warning only). |
Cells that exist in NEITHER baseline nor current are skipped (they get no row).
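The per-cell rules above can be sketched in Python as a comparison of two pass-rate maps keyed by (scenario, tier). This is an illustrative model rather than the CLI's code: the function names are hypothetical, pass rates are assumed to be fractions between 0 and 1, and the exit code 1 stands in for "non-zero".

```python
from typing import Dict, Optional, Tuple

CellKey = Tuple[str, str]  # (scenario slug, tier)


def cell_verdict(
    baseline: Optional[float], current: Optional[float], tolerance: float = 0.10
) -> str:
    """Classify one cell; None means the cell is absent on that side."""
    if baseline is None:
        return "NEW"
    if current is None:
        return "REMOVED"
    delta = current - baseline
    if delta > tolerance:
        return "IMPROVED"
    if delta < -tolerance:
        return "REGRESSION"
    return "STABLE"


def diff_matrix(
    baseline: Dict[CellKey, float],
    current: Dict[CellKey, float],
    tolerance: float = 0.10,
) -> Dict[CellKey, str]:
    """Diff every cell present on at least one side (the union of keys)."""
    keys = sorted(baseline.keys() | current.keys())
    return {
        key: cell_verdict(baseline.get(key), current.get(key), tolerance)
        for key in keys
    }


def exit_code(verdicts: Dict[CellKey, str]) -> int:
    """Non-zero only when at least one cell regressed."""
    return 1 if "REGRESSION" in verdicts.values() else 0
```

Iterating over the union of keys is what makes cells absent from both sides produce no row. A delta that lands exactly on the tolerance is subject to floating-point rounding in this sketch, and how the CLI treats that boundary is not specified here.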
baseline delete <name>
crewship eval baseline delete main
Permanently removes the on-disk baseline file. No confirmation prompt — file is per-developer state.