crewship eval

Two distinct evaluation surfaces share the crewship eval namespace:
  1. Mission-level: replay a single mission deterministically, or regression-diff two missions. Uses the Quartermaster engine. See Quartermaster guide.
  2. Routine-level: sweep the workspace’s eval-* routines across a tier matrix, compare two tiers head-to-head, and save or diff regression baselines. Built on top of the Routines runtime. See Routines guide and Routines cookbook.
crewship eval <subcommand> [flags]
Mission-level replay and regression enqueue work asynchronously and return a run_id (poll completion via crewship eval runs). The runs subcommand itself is a synchronous list query against existing eval runs. Routine-level subcommands all run synchronously and emit results inline.

Subcommands

| Group | Command | Description |
| --- | --- | --- |
| Mission | replay <mission-id> | Replay a mission deterministically. |
| Mission | regression <baseline-id> <candidate-id> | Regression-diff two missions. |
| Mission | runs | List recent eval runs. |
| Routine | scenarios | Batch-run eval-* routines × tier list × N runs, output pass-rate matrix. |
| Routine | compare <slug> | Head-to-head: run one scenario back-to-back on two tiers. |
| Routine | baseline save <name> | Snapshot the current matrix as a regression baseline. |
| Routine | baseline list | List stored baselines. |
| Routine | baseline show <name> | Inspect a stored baseline’s matrix. |
| Routine | baseline diff <name> | Re-run and diff against a stored baseline. CI-friendly (non-zero exit on regression). |
| Routine | baseline delete <name> | Remove a stored baseline. |

crewship eval replay <mission-id>

crewship eval replay MIS-42 --seed 42
| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --seed | int | 0 | Deterministic seed for the replay (0 = server default). |
Sample output:
Replay queued: run_id=er_a1b2c3d4e5f60718 status=queued
Auth: Requires OWNER or ADMIN role (403 otherwise). Mission must belong to your workspace (404 otherwise, same shape as “not found” so existence isn’t leaked).
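Because replay only enqueues work, a script usually needs the run_id from the status line before it can poll. The sketch below is one way to pull it out. It assumes the CLI prints the line exactly as in the sample output above; extract_run_id is a hypothetical helper name, not part of crewship.

```shell
# Hypothetical helper: print the run_id token from a "queued" status line.
# Assumes the "run_id=<id> status=<state>" layout shown in the sample output.
extract_run_id() {
  printf '%s\n' "$1" | sed -n 's/.*run_id=\([A-Za-z0-9_]*\).*/\1/p'
}

# Possible usage (not executed here):
#   run_id=$(extract_run_id "$(crewship eval replay MIS-42 --seed 42)")
```

The same helper would apply to the status line printed by crewship eval regression, since it uses the same run_id=... token.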

crewship eval regression <baseline-id> <candidate-id>

crewship eval regression MIS-41 MIS-42
No flags. Both mission IDs must belong to your workspace — a partial spoof (valid baseline + foreign candidate) still 404s. Sample output:
Regression queued: run_id=er_b2c3d4e5f6071829 status=queued

crewship eval runs

crewship eval runs
crewship eval runs --limit 20
| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --limit | int | 50 | Max rows (1-500). |
Sample output:
ID                         KIND        MISSION     BASELINE    STATUS       CREATED
er_a1b2c3d4e5f60718        replay      MIS-42                  completed    2026-04-17T10:00:00Z
er_b2c3d4e5f6071829        regression  MIS-42      MIS-41      completed    2026-04-17T10:15:00Z
er_c3d4e5f607182930        replay      MIS-43                  failed       2026-04-17T10:30:00Z
Status progresses queued -> running -> completed | failed. For regression runs, the result column (visible in JSON output) contains either no_regression or regressed: <delta summary> like regressed: tool success -8% cost +22%.
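The status progression above suggests a simple polling loop for scripts that need to block until a run finishes. The following is a minimal sketch, assuming the table layout matches the sample output; run_status and wait_for_run are hypothetical helper names, and the loop has no timeout.

```shell
# Hypothetical helper: read `crewship eval runs` table output on stdin and
# print the STATUS of run $1. STATUS is the second-to-last column, so this
# works whether or not the BASELINE column is filled in.
run_status() {
  awk -v id="$1" '$1 == id { print $(NF-1); found = 1 } END { exit !found }'
}

# Hypothetical helper: poll until run $1 reaches a terminal status.
# $2 = seconds between polls (default 5). Returns 0 on completed,
# 1 on failed, 2 if the run is not in the listing. No timeout is applied.
wait_for_run() {
  while :; do
    state=$(crewship eval runs --limit 500 | run_status "$1") || return 2
    case $state in
      completed) return 0 ;;
      failed)    return 1 ;;
    esac
    sleep "${2:-5}"
  done
}
```

Parsing the table is brittle if the column layout changes; where a machine-readable format is available, that would be the safer input.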

Fetch a specific run result

The CLI does not yet expose a dedicated get subcommand. For a specific run, query the API:
curl -H "Authorization: Bearer $CREWSHIP_TOKEN" \
  "$CREWSHIP_URL/api/v1/eval/runs?limit=200" | jq '.rows[] | select(.id == "er_...")'
Or watch the Crew Journal for eval.run_started, eval.metric, and eval.regression_detected entries, correlated by the run_id in the payload.

Routine-level eval

The routine-level commands target the eval suite seeded under the eval- slug prefix (see the Routines migration guide for the framework concept and cookbook recipe 6 for the eval-driven promotion workflow). Unlike the mission commands, these run synchronously and emit pass/fail and stats inline.

crewship eval scenarios

Batch-run the workspace’s eval-* routines across one or more tiers and report a pass-rate matrix. This is the canonical harness for the question “do a weak and a strong agent reach the same outcome?”: the DSL and inputs are identical, and only the resolved tier differs across cells.
# Default: every eval-* routine, once, at the workspace's authored tier
crewship eval scenarios

# Targeted sweep with N runs per (scenario, tier) cell
crewship eval scenarios --scenarios eval-extract-emails,eval-classify-sentiment \
                        --tiers fast,smart --runs 5

# Machine-readable output for CI / regression dashboards
crewship eval scenarios --runs 3 --tiers fast,smart -f json
crewship eval scenarios --runs 3 --tiers fast,smart -f markdown
| Flag | Description |
| --- | --- |
| --scenarios | Comma-separated routine slugs (default: every routine whose slug starts with eval-). |
| --tiers | Comma-separated tier overrides (trivial/fast/moderate/smart). Empty = use authored complexity. |
| --runs | Runs per (scenario, tier) cell (default 1). |
| --inputs | JSON inputs forwarded to every run (default: each routine’s authored defaults). |
| --fail-fast | Abort on the first non-pass instead of running every cell. |
Output (table format): rows = scenarios, columns = tiers, cells = pass/total $avg-cost. A cell with 5/5 $0.0012 means every run passed the routine’s gate at that tier. Use the JSON or Markdown formats for downstream tooling.
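To make the cell format concrete, the sketch below converts a single cell such as 5/5 $0.0012 into a pass rate. It assumes the pass/total $avg-cost layout described above; cell_pass_rate is a hypothetical helper name, not a crewship command.

```shell
# Hypothetical helper: turn one matrix cell ('pass/total $avg-cost') into a
# pass rate printed with two decimals. Fails on a malformed or zero total.
cell_pass_rate() {
  printf '%s\n' "$1" | awk '{
    split($1, parts, "/")            # parts[1] = passes, parts[2] = total
    if (parts[2] + 0 == 0) exit 1
    printf "%.2f\n", parts[1] / parts[2]
  }'
}
```

For anything beyond a quick check, the JSON format is likely easier to consume than the rendered table.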

crewship eval compare <slug>

Run ONE eval scenario back-to-back on two tiers and report a head-to-head verdict. Use this when investigating a specific scenario rather than sweeping the whole suite.
crewship eval compare eval-extract-emails
crewship eval compare eval-syllogism-reasoning --tier-a moderate --tier-b smart
crewship eval compare eval-classify-sentiment -f markdown
| Flag | Default | Description |
| --- | --- | --- |
| --tier-a | fast | Tier override for side A. |
| --tier-b | smart | Tier override for side B. |
| --inputs | | JSON inputs forwarded to both sides (default: routine’s authored defaults). |
Verdict (top of output):
| Verdict | Meaning |
| --- | --- |
| AGREE-PASS | Both tiers passed the routine’s gate. |
| AGREE-FAIL | Both tiers failed (regression on the routine itself, not a tier-strength signal). |
| DIVERGE-A-PASS | Only side A passed: weak-tier-only or strong-tier-only path discovered. |
| DIVERGE-B-PASS | Only side B passed. |
| AMBIGUOUS | At least one side errored at transport level, so the verdict can’t be cleanly assigned. |
--tier-a and --tier-b MUST differ — passing the same value on both errors out. Output text identity is NOT asserted — two LLM runs are rarely identical at the byte level. The verdict is about gate-pass agreement, not text agreement.
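The verdict table can be read as a mapping from the two sides' gate outcomes to a label. The sketch below illustrates that mapping only; it is not the CLI's implementation, and the pass / fail / error outcome labels and the compare_verdict name are assumptions made for the example.

```shell
# Hypothetical helper: map the gate outcome of side A ($1) and side B ($2)
# to a verdict label. Outcome labels pass/fail/error are illustrative.
compare_verdict() {
  case "$1:$2" in
    error:*|*:error) echo AMBIGUOUS ;;        # transport-level error on either side
    pass:pass)       echo AGREE-PASS ;;
    fail:fail)       echo AGREE-FAIL ;;
    pass:fail)       echo DIVERGE-A-PASS ;;
    fail:pass)       echo DIVERGE-B-PASS ;;
    *) echo "unknown outcome: $1 / $2" >&2; return 2 ;;
  esac
}
```

Note that the mapping looks only at gate outcomes, which matches the point above that output text is never compared.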

crewship eval baseline

Baselines are file-local snapshots of the eval matrix, stored at ~/.crewship/eval-baselines/<name>.json, plus a regression diff against them. They are per-developer / per-CI-run state: there is no server-side migration and no RBAC surface. The feature is designed to be the CI gate that fails the build when a model swap or DSL refactor changes gate-pass behaviour.

baseline save <name>

Run the eval matrix once and persist it as a regression baseline.
# Snapshot current main behaviour as the baseline
crewship eval baseline save main --tiers fast,smart --runs 5
| Flag | Default | Description |
| --- | --- | --- |
| --scenarios | (every eval-*) | Comma-separated routine slugs. |
| --tiers | (no override) | Comma-separated tier overrides. |
| --runs | 5 | Runs per (scenario, tier) cell. |
| --inputs | | JSON inputs forwarded to every run. |
Baselines record the source workspace ID at save time. diff refuses to compare across workspaces, because reusing a baseline name in another workspace would otherwise silently report bogus regressions. Names are restricted to [A-Za-z0-9_-]{1,64} to prevent path traversal in the on-disk file location.
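A wrapper script can apply the same naming rule before calling the CLI, so that a bad name fails early. This is a minimal sketch of the [A-Za-z0-9_-]{1,64} check in POSIX shell; is_valid_baseline_name is a hypothetical helper, and the CLI's own validation remains authoritative.

```shell
# Hypothetical helper: succeed only if $1 matches [A-Za-z0-9_-]{1,64}.
is_valid_baseline_name() {
  # Reject empty names and names longer than 64 characters.
  [ -n "$1" ] && [ "${#1}" -le 64 ] || return 1
  # Reject any character outside letters, digits, underscore, and hyphen.
  # This also rules out "/" and ".", which path traversal would need.
  case $1 in
    *[!A-Za-z0-9_-]*) return 1 ;;
  esac
  return 0
}
```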

baseline list

crewship eval baseline list
crewship eval baseline list -f json
Columns: NAME, GENERATED, SCENARIOS, TIERS, RUNS/CELL.

baseline show <name>

crewship eval baseline show main
Prints the full matrix from the stored baseline (cells = pass/total $avg-cost). Use -f json for the raw record.

baseline diff <name>

Re-run the matrix against the same scenario / tier / runs combo and diff each cell against the stored baseline. Exits non-zero on any regression beyond --tolerance.
# In CI: fail on regression beyond ±10pp pass-rate change
crewship eval baseline diff main --tiers fast,smart --runs 5

# Looser tolerance for known-flaky routines, JSON output for CI capture
crewship eval baseline diff main --tolerance 0.20 --runs 10 -f json
| Flag | Default | Description |
| --- | --- | --- |
| --scenarios | (every eval-*) | Comma-separated routine slugs. |
| --tiers | (no override) | Comma-separated tier overrides. |
| --runs | 5 | Runs per cell. |
| --inputs | | JSON inputs forwarded to every run. |
| --tolerance | 0.10 | Pass-rate delta tolerance for the STABLE verdict (e.g. 0.10 = ±10pp). |
Per-cell verdict:
| Verdict | When |
| --- | --- |
| STABLE | Pass-rate delta within ±tolerance. |
| IMPROVED | Pass-rate rose by more than tolerance. |
| REGRESSION | Pass-rate dropped by more than tolerance. Triggers non-zero exit. |
| NEW | Cell exists in current run but not in baseline. |
| REMOVED | Cell exists in baseline but not in current run (warning only). |
Cells that exist in NEITHER baseline nor current are skipped (they get no row).
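The per-cell rules above can be sketched as a small classifier over two pass rates. This illustrates the documented rules rather than the CLI's actual code: cell_verdict is a hypothetical name, an empty argument stands for an absent cell, and treating a delta of exactly ±tolerance as STABLE (with a small epsilon for floating-point noise) is an assumption.

```shell
# Hypothetical helper: classify one cell.
# $1 = baseline pass rate, $2 = current pass rate (empty string = cell absent),
# $3 = tolerance (defaults to 0.10, the documented default).
cell_verdict() {
  if [ -z "$1" ] && [ -z "$2" ]; then return 0; fi   # in neither: no row
  if [ -z "$1" ]; then echo NEW; return 0; fi
  if [ -z "$2" ]; then echo REMOVED; return 0; fi
  awk -v base="$1" -v cur="$2" -v tol="${3:-0.10}" 'BEGIN {
    eps = 1e-9                    # absorbs floating-point noise at the boundary
    delta = cur - base
    if (delta < -(tol + eps))     print "REGRESSION"
    else if (delta > tol + eps)   print "IMPROVED"
    else                          print "STABLE"
  }'
}
```

With the default tolerance, a drop from 1.00 to 0.80 classifies as REGRESSION, while a drop from 0.80 to 0.70 sits on the boundary and stays STABLE under the inclusive reading assumed here.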

baseline delete <name>

crewship eval baseline delete main
Permanently removes the on-disk baseline file. No confirmation prompt — file is per-developer state.