crewship eval

Two distinct evaluation surfaces share the crewship eval namespace:

Mission-level — replay a single mission deterministically or regression-diff two missions. Uses the Quartermaster engine. See Quartermaster guide.
Routine-level — sweep the workspace’s eval-* routines × tier matrix, head-to-head tier compare, save/diff regression baselines. Built on top of the Routines runtime. See Routines guide and Routines cookbook.

crewship eval <subcommand> [flags]

Mission-level replay and regression enqueue work asynchronously and return a run_id (poll completion via crewship eval runs). The runs subcommand itself is a synchronous list query against existing eval runs. Routine-level subcommands all run synchronously and emit results inline.

Subcommands

Group	Command	Description
Mission	`replay <mission-id>`	Replay a mission deterministically.
Mission	`regression <baseline-id> <candidate-id>`	Regression-diff two missions.
Mission	`runs`	List recent eval runs.
Routine	`scenarios`	Batch-run `eval-*` routines × tier list × N runs, output pass-rate matrix.
Routine	`compare <slug>`	Head-to-head: run one scenario back-to-back on two tiers.
Routine	`baseline save <name>`	Snapshot the current matrix as a regression baseline.
Routine	`baseline list`	List stored baselines.
Routine	`baseline show <name>`	Inspect a stored baseline’s matrix.
Routine	`baseline diff <name>`	Re-run + diff against a stored baseline. CI-friendly (non-zero exit on regression).
Routine	`baseline delete <name>`	Remove a stored baseline.

`crewship eval replay <mission-id>`

Replay a single mission deterministically. Enqueues asynchronously and returns a run_id — poll completion via eval runs.

crewship eval replay MIS-42 --seed 42

Flag	Type	Default	Description
`--seed`	`int`	`0`	Deterministic seed for the replay (0 = server default).

Sample output:

Replay queued: run_id=er_a1b2c3d4e5f60718 status=queued

Requires OWNER or ADMIN role (403 otherwise). Mission must belong to your workspace (404 otherwise, same shape as “not found” so existence isn’t leaked).

`crewship eval regression <baseline-id> <candidate-id>`

Regression-diff two missions. Enqueues asynchronously and returns a run_id.

crewship eval regression MIS-41 MIS-42

No flags. Both mission IDs must belong to your workspace — a partial spoof (valid baseline + foreign candidate) still 404s. Sample output:

Regression queued: run_id=er_b2c3d4e5f6071829 status=queued

`crewship eval runs`

crewship eval runs
crewship eval runs --limit 20

Flag	Type	Default	Description
`--limit`	`int`	`50`	Max rows (1-500).

Sample output:

ID                         KIND        MISSION     BASELINE    STATUS       CREATED
er_a1b2c3d4e5f60718        replay      MIS-42                  completed    2026-04-17T10:00:00Z
er_b2c3d4e5f6071829        regression  MIS-42      MIS-41      completed    2026-04-17T10:15:00Z
er_c3d4e5f607182930        replay      MIS-43                  failed       2026-04-17T10:30:00Z

Status progresses queued -> running -> completed | failed. For regression runs, the result column (visible in JSON output) contains either no_regression or regressed: <delta summary> like regressed: tool success -8% cost +22%.

Fetch a specific run result

The CLI does not yet expose a dedicated get subcommand. For a specific run, query the API:

# $CREWSHIP_SERVER is the instance URL; pass a CLI token as the bearer.
curl -H "Authorization: Bearer crewship_cli_xxxxx" \
  "$CREWSHIP_SERVER/api/v1/eval/runs?limit=200" | jq '.rows[] | select(.id == "er_...")'

Or watch the Crew Journal for eval.run_started, eval.metric, eval.regression_detected entries correlated by the run_id in the payload.

Routine-level eval

The routine-level commands target the eval suite seeded under the eval- slug prefix (see the Routines migration guide for the framework concept and cookbook recipe 6 for the eval-driven promotion workflow). Unlike the mission commands, these run synchronously and emit pass/fail and stats inline.

`crewship eval scenarios`

Batch-run the workspace’s eval-* routines across one or more tiers and report a pass-rate matrix. The canonical “weak vs strong agent same outcome?” harness — same DSL, same inputs, only the resolved tier differs across cells.

# Default: every eval-* routine, once, at the workspace's authored tier
crewship eval scenarios

# Targeted sweep with N runs per (scenario, tier) cell
crewship eval scenarios --scenarios eval-extract-emails,eval-classify-sentiment \
                        --tiers fast,smart --runs 5

# Machine-readable output for CI / regression dashboards
crewship eval scenarios --runs 3 --tiers fast,smart -f json
crewship eval scenarios --runs 3 --tiers fast,smart -f markdown

Flag	Description
`--scenarios`	Comma-separated routine slugs (default: every routine whose slug starts with `eval-`).
`--tiers`	Comma-separated tier overrides (`trivial`/`fast`/`moderate`/`smart`). Empty = use authored complexity.
`--runs`	Runs per (scenario, tier) cell (default 1).
`--inputs`	JSON inputs forwarded to every run (default: each routine’s authored defaults).
`--fail-fast`	Abort on the first non-pass instead of running every cell.

Output (table format): rows = scenarios, columns = tiers, cells = pass/total $avg-cost. A cell with 5/5 $0.0012 means every run passed the routine’s gate at that tier. Use the JSON or Markdown formats for downstream tooling.

`crewship eval compare <slug>`

Run ONE eval scenario back-to-back on two tiers and report a head-to-head verdict. Use this when investigating a specific scenario rather than sweeping the whole suite.

crewship eval compare eval-extract-emails
crewship eval compare eval-syllogism-reasoning --tier-a moderate --tier-b smart
crewship eval compare eval-classify-sentiment -f markdown

Flag	Default	Description
`--tier-a`	`fast`	Tier override for side A.
`--tier-b`	`smart`	Tier override for side B.
`--inputs`		JSON inputs forwarded to both sides (default: routine’s authored defaults).

Verdict (top of output):

Verdict	Meaning
`AGREE-PASS`	Both tiers passed the routine’s gate.
`AGREE-FAIL`	Both tiers failed (regression on the routine itself, not a tier-strength signal).
`DIVERGE-A-PASS`	Only side A passed — weak-tier-only or strong-tier-only path discovered.
`DIVERGE-B-PASS`	Only side B passed.
`AMBIGUOUS`	At least one side errored at transport level — verdict can’t be cleanly assigned.

--tier-a and --tier-b MUST differ — passing the same value on both errors out.

Output text identity is NOT asserted — two LLM runs are rarely identical at the byte level. The verdict is about gate-pass agreement, not text agreement.

`crewship eval baseline`

File-local snapshots of the eval matrix (~/.crewship/eval-baselines/<name>.json) plus regression diff. Per-developer / per-CI-run state — no server-side migration, no RBAC surface. Designed to be the CI gate that fails the build when a model swap or DSL refactor changes gate-pass behaviour.

`baseline save <name>`

Run the eval matrix once and persist it as a regression baseline.

# Snapshot current main behaviour as the baseline
crewship eval baseline save main --tiers fast,smart --runs 5

Flag	Default	Description
`--scenarios`	(every `eval-*`)	Comma-separated routine slugs.
`--tiers`	(no override)	Comma-separated tier overrides.
`--runs`	`5`	Runs per (scenario, tier) cell.
`--inputs`		JSON inputs forwarded to every run.

Baselines record the source workspace ID at save time. diff refuses to compare across workspaces — reusing a baseline name in another workspace would silently report bogus regressions. Names are restricted to [A-Za-z0-9_-]{1,64} to prevent path-traversal in the on-disk file location.

`baseline list`

crewship eval baseline list
crewship eval baseline list -f json

Columns: NAME, GENERATED, SCENARIOS, TIERS, RUNS/CELL.

`baseline show <name>`

crewship eval baseline show main

Prints the full matrix from the stored baseline (cells = pass/total $avg-cost). Use -f json for the raw record.

`baseline diff <name>`

Re-run the matrix against the same scenario / tier / runs combo and diff each cell against the stored baseline. Exits non-zero on any regression beyond --tolerance.

# In CI: fail on regression beyond ±10pp pass-rate change
crewship eval baseline diff main --tiers fast,smart --runs 5

# Looser tolerance for known-flaky routines, JSON output for CI capture
crewship eval baseline diff main --tolerance 0.20 --runs 10 -f json

Flag	Default	Description
`--scenarios`	(every `eval-*`)	Comma-separated routine slugs.
`--tiers`	(no override)	Comma-separated tier overrides.
`--runs`	`5`	Runs per cell.
`--inputs`		JSON inputs forwarded to every run.
`--tolerance`	`0.10`	Pass-rate delta tolerance for the `STABLE` verdict (e.g. `0.10` = ±10pp).

Per-cell verdict:

Verdict	When
`STABLE`	Pass-rate delta within ±tolerance.
`IMPROVED`	Pass-rate rose by more than tolerance.
`REGRESSION`	Pass-rate dropped by more than tolerance. Triggers non-zero exit.
`NEW`	Cell exists in current run but not in baseline.
`REMOVED`	Cell exists in baseline but not in current run (warning only).

Cells that exist in NEITHER baseline nor current are skipped (they get no row).

`baseline delete <name>`

crewship eval baseline delete main

Permanently removes the on-disk baseline file. No confirmation prompt — the file is per-developer state.

Quartermaster guide — mission-level replay/regression engine.
Routines guide — concept of routines + DSL spec.
Routines cookbook — recipe 6 walks the eval-driven promotion workflow end-to-end.
Routine CLI — routine bench, routine doctor (per-routine variance + preflight).
Eval API.

​crewship eval

​Subcommands

​crewship eval replay <mission-id>

​crewship eval regression <baseline-id> <candidate-id>

​crewship eval runs

​Fetch a specific run result

​Routine-level eval

​crewship eval scenarios

​crewship eval compare <slug>

​crewship eval baseline

​baseline save <name>

​baseline list

​baseline show <name>

​baseline diff <name>

​baseline delete <name>

​Related

crewship eval

Subcommands

`crewship eval replay <mission-id>`

`crewship eval regression <baseline-id> <candidate-id>`

`crewship eval runs`

Fetch a specific run result

Routine-level eval

`crewship eval scenarios`

`crewship eval compare <slug>`

`crewship eval baseline`

`baseline save <name>`

`baseline list`

`baseline show <name>`

`baseline diff <name>`

`baseline delete <name>`

Related