Quartermaster
Quartermaster is the eval framework. It reads the Crew Journal and derives typed "what happened" artifacts: a step-by-step trajectory, aggregate metrics, regression reports that compare two runs, and LLM-as-judge verdicts for qualitative scoring. Replay in this package means observational replay: rehydrate the trajectory from the journal and recompute metrics. Re-executing agents end-to-end is a later tier and not in scope. The durable index of runs lives in `eval_runs` (migration 53). The background worker moves each run's status from `queued` to `running`, and then to `completed` or `failed`.
Metrics
Noisy journal entries (`exec.output_chunk`, `container.metrics`, `network.port_*`) are dropped during extraction so metrics reflect meaningful actions, not chatter.
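The filtering described above can be sketched as follows. This is a minimal illustration, not Quartermaster's extraction code: the `Entry` type, the helper names, and the `tool.call` and `network.port_opened` entry types are hypothetical, and only the three dropped patterns come from this page.

```go
package main

import (
	"fmt"
	"strings"
)

// Entry is a hypothetical, minimal stand-in for a Crew Journal record.
type Entry struct {
	Type string
}

// noisyExact and noisyPrefixes mirror the entry types named on this page.
var noisyExact = map[string]bool{
	"exec.output_chunk": true,
	"container.metrics": true,
}

var noisyPrefixes = []string{"network.port_"}

// isNoise reports whether an entry type should be skipped during extraction.
func isNoise(entryType string) bool {
	if noisyExact[entryType] {
		return true
	}
	for _, p := range noisyPrefixes {
		if strings.HasPrefix(entryType, p) {
			return true
		}
	}
	return false
}

// filterEntries returns only the entries that count as meaningful actions.
func filterEntries(entries []Entry) []Entry {
	kept := make([]Entry, 0, len(entries))
	for _, e := range entries {
		if !isNoise(e.Type) {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	journal := []Entry{
		{Type: "tool.call"}, // hypothetical meaningful entry type
		{Type: "exec.output_chunk"},
		{Type: "network.port_opened"}, // hypothetical member of network.port_*
		{Type: "container.metrics"},
	}
	for _, e := range filterEntries(journal) {
		fmt.Println(e.Type)
	}
}
```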
Replay
Replay computes a `seed_signature` (sha256 over the step-type and tool-name sequence). If you replay the same mission twice with no drift, the signature is stable; a divergence flags intermittent tool behaviour or non-determinism.
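One way to compute a signature of this kind is sketched below. The `Step` type, the field names, and the NUL separator between fields are assumptions made for illustration; the actual encoding Quartermaster hashes is not specified on this page.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Step is a hypothetical trajectory step; only the two hashed fields are modelled.
type Step struct {
	Type string
	Tool string
}

// seedSignature hashes the ordered (step type, tool name) sequence.
// The NUL separators are an assumption: they keep ("ab", "c") distinct from ("a", "bc").
func seedSignature(steps []Step) string {
	h := sha256.New()
	for _, s := range steps {
		h.Write([]byte(s.Type))
		h.Write([]byte{0})
		h.Write([]byte(s.Tool))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Step types and tool names here are illustrative only.
	steps := []Step{
		{Type: "tool_call", Tool: "git.clone"},
		{Type: "tool_call", Tool: "shell.exec"},
	}
	fmt.Println(seedSignature(steps))
}
```

Because the hash covers the sequence in order, two replays agree only if they made the same kinds of steps with the same tools in the same order, which is what makes a mismatch a useful drift signal.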
Workspace-scoped: the mission must belong to the caller’s workspace. OWNER or ADMIN role required.
The handler returns 202 Accepted immediately; the actual extract, compute, and emit work runs in a background goroutine with a 10-minute deadline. Poll via `crewship eval runs` or `GET /api/v1/eval/runs`.
Regression
| Metric | Direction that counts as regression | Threshold |
|---|---|---|
| ToolSuccessRate | Down | > 5% |
| TotalCostUSD | Up | > 20% |
| StepsToGoal | Up | > 30% |
| Hallucinations | Up | Any increase |
A comparison that crosses any threshold is recorded with `result: "regressed: <delta summary>"` and emits `eval.regression_detected` into the journal (which Episodic memory picks up).
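The threshold table can be read as the check sketched below. This is an illustration rather than the contents of `internal/quartermaster/regression.go`: the `Metrics` struct and function names are hypothetical, and the percentages are assumed to be relative changes against the baseline, which this page does not state explicitly.

```go
package main

import "fmt"

// Metrics is a hypothetical container for the four compared values.
type Metrics struct {
	ToolSuccessRate float64 // fraction in [0, 1]
	TotalCostUSD    float64
	StepsToGoal     int
	Hallucinations  int
}

// relChange returns (candidate-baseline)/baseline, or 0 when the baseline is 0.
// Treating thresholds as relative change is an assumption.
func relChange(baseline, candidate float64) float64 {
	if baseline == 0 {
		return 0
	}
	return (candidate - baseline) / baseline
}

// detectRegressions returns the names of metrics that crossed a threshold.
func detectRegressions(base, cand Metrics) []string {
	var out []string
	if relChange(base.ToolSuccessRate, cand.ToolSuccessRate) < -0.05 {
		out = append(out, "ToolSuccessRate")
	}
	if relChange(base.TotalCostUSD, cand.TotalCostUSD) > 0.20 {
		out = append(out, "TotalCostUSD")
	}
	if relChange(float64(base.StepsToGoal), float64(cand.StepsToGoal)) > 0.30 {
		out = append(out, "StepsToGoal")
	}
	if cand.Hallucinations > base.Hallucinations {
		out = append(out, "Hallucinations")
	}
	return out
}

func main() {
	base := Metrics{ToolSuccessRate: 0.90, TotalCostUSD: 1.00, StepsToGoal: 10, Hallucinations: 0}
	cand := Metrics{ToolSuccessRate: 0.80, TotalCostUSD: 1.10, StepsToGoal: 14, Hallucinations: 1}
	fmt.Println(detectRegressions(base, cand))
}
```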
Both mission IDs must belong to the caller’s workspace — a partial spoof (valid baseline + foreign candidate) still 404s.
List runs
LLM-as-judge
The `Judge` interface is provider-neutral: callers plug in Ollama, Anthropic, or a stub.
EnsembleJudge runs the same grade multiple times with rubric-shuffle anti-bias (permuting rubric order per invocation) and aggregates via median + stddev. Stddev > 0.25 lowers the verdict’s Confidence and flips HumanEscalate = true.
Used by future eval rubrics and crew-to-crew critique flows.
CLI
See `crewship eval`.
Gotchas
- Seed is informational today. `Replay` computes a seed signature but does not re-execute agents with that seed; it is observational replay only. The seed column exists so future re-execution can anchor to it.
- Regression thresholds are hard-coded. Edit `internal/quartermaster/regression.go` to change them. A DB-backed config is a follow-up.
- 10-minute deadline on workers. A mission with very long journals may not finish replay. If you see `failed: context deadline exceeded`, raise the `ReplayBudget` in the handler or trim the mission.
- Terminal updates use a fresh context. The worker has 10 min; each status write gets 5s. If the worker deadline fires, the terminal `failed` row write still succeeds because its context is independent.
Related
- Crew Journal: `eval.run_started`, `eval.metric`, `eval.regression_detected`.
- Episodic memory: embeds regression events for future recall.
- `crewship eval`, Eval API.