# Quartermaster

> Mission replay, regression detection, and LLM-as-judge evaluation over journal trajectories.

Quartermaster is the eval framework. It reads the [Crew Journal](/guides/crew-journal) and derives typed "what happened" artifacts: a step-by-step trajectory, aggregate metrics, regression reports that compare two runs, and LLM-as-judge verdicts for qualitative scoring.

<Note>
  **Replay** in this package means **observational replay** -- rehydrate the trajectory from the journal and recompute metrics. Re-executing agents end-to-end is a later tier and not in scope.
</Note>

The durable index of runs lives in `eval_runs` (migration 53). The background worker moves each run's status from `queued` to `running`, then to `completed` or `failed`.

## Metrics

```go theme={null}
type EvalMetrics struct {
    ToolCallCount    int
    ToolSuccessRate  float64  // 0..1 passed/total over exec+keeper outcomes
    StepsToGoal      int
    ConvergenceRatio float64  // optimal/actual heuristic
    TotalCostUSD     float64
    TotalTokens      int64
    Hallucinations   int      // guardrail.output_blocked count at warn/error
    FailureModes     []string // MAST taxonomy from journal patterns
}
```

Metrics are derived from the filtered trajectory. Low-value entry types (`exec.output_chunk`, `container.metrics`, `network.port_*`) are dropped during extraction so that metrics reflect meaningful actions, not chatter.
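The filtering step can be sketched as follows. This is a minimal illustration, not the actual extraction code: the `Entry` type, the helper names, and the sample entry types `exec.command` and `network.port_opened` are assumptions made for the example. Only the three dropped patterns come from this page.

```go
package main

import (
	"fmt"
	"strings"
)

// Entry is a hypothetical, simplified journal entry; the real journal schema
// is not shown in this guide.
type Entry struct {
	Type string
}

// Exact entry types that are dropped before metrics are computed.
var lowValueExact = map[string]bool{
	"exec.output_chunk": true,
	"container.metrics": true,
}

// Prefixes standing in for wildcard patterns such as network.port_*.
var lowValuePrefixes = []string{"network.port_"}

// isLowValue reports whether an entry type should be dropped.
func isLowValue(entryType string) bool {
	if lowValueExact[entryType] {
		return true
	}
	for _, p := range lowValuePrefixes {
		if strings.HasPrefix(entryType, p) {
			return true
		}
	}
	return false
}

// filterTrajectory keeps only the entries that count as meaningful actions.
func filterTrajectory(entries []Entry) []Entry {
	kept := make([]Entry, 0, len(entries))
	for _, e := range entries {
		if !isLowValue(e.Type) {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	journal := []Entry{
		{Type: "exec.command"},        // hypothetical entry type
		{Type: "exec.output_chunk"},   // dropped
		{Type: "network.port_opened"}, // hypothetical; dropped via prefix match
		{Type: "guardrail.output_blocked"},
	}
	for _, e := range filterTrajectory(journal) {
		fmt.Println(e.Type)
	}
}
```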

## Replay

```
POST /api/v1/eval/replay
Body: {"mission_id": "MIS-42", "seed": 42}
-> {"run_id": "er_a1b2c3d4e5f60718", "status": "queued"}
```

Replay reads the mission's journal, extracts a trajectory, and computes a `seed_signature` (sha256 over the step-type and tool-name sequence). If you replay the same mission twice with no drift, the signature is stable; a diverging signature flags intermittent tool behaviour or non-determinism.
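A signature of this kind could be computed roughly as below. This is a sketch under assumptions: the `Step` type, the NUL-byte separators, and the hex encoding are illustrative choices, and the exact byte layout Quartermaster hashes is not documented here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Step is a hypothetical trajectory step carrying only the two fields
// that feed the signature.
type Step struct {
	Type string
	Tool string
}

// seedSignature hashes the ordered sequence of step types and tool names.
// The NUL separators (an assumption) keep ("ab", "c") distinct from ("a", "bc").
func seedSignature(steps []Step) string {
	h := sha256.New()
	for _, s := range steps {
		h.Write([]byte(s.Type))
		h.Write([]byte{0})
		h.Write([]byte(s.Tool))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Step type and tool names here are made up for the example.
	first := []Step{{"tool_call", "grep"}, {"tool_call", "sed"}}
	second := []Step{{"tool_call", "grep"}, {"tool_call", "sed"}}
	drifted := []Step{{"tool_call", "sed"}, {"tool_call", "grep"}}

	fmt.Println("stable: ", seedSignature(first) == seedSignature(second))
	fmt.Println("drifted:", seedSignature(first) != seedSignature(drifted))
}
```

Because only the step-type and tool-name sequence is hashed, two runs with different outputs but the same sequence of actions still produce the same signature.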

Workspace-scoped: the mission must belong to the caller's workspace. `OWNER` or `ADMIN` role required.

<Info>
  The handler returns **202 Accepted** immediately; the actual extract, compute, and emit work runs in a background goroutine with a 10-minute deadline. Poll via `crewship eval runs` or `GET /api/v1/eval/runs`.
</Info>

## Regression

```
POST /api/v1/eval/regression
Body: {"baseline_mission_id": "MIS-41", "candidate_mission_id": "MIS-42"}
-> {"run_id": "er_...", "status": "queued"}
```

Regression computes metrics for both missions and compares them. Delta signs matter: cost going up is a regression, while tool success rate going up is an improvement.

| Metric            | Direction that counts as regression | Threshold        |
| ----------------- | ----------------------------------- | ---------------- |
| `ToolSuccessRate` | Down                                | > 5% (absolute)  |
| `TotalCostUSD`    | Up                                  | > 15% (relative) |
| `StepsToGoal`     | Up                                  | > 20% (relative) |
| `Hallucinations`  | Up                                  | Any increase     |
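
The thresholds in the table can be read as the comparison sketched below. It is an illustration only, not the contents of `internal/quartermaster/regression.go`: the `Metrics` subset, the helper names, the summary wording, and the handling of a zero baseline are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Metrics is a trimmed-down stand-in for EvalMetrics with only the
// fields the table compares.
type Metrics struct {
	ToolSuccessRate float64
	TotalCostUSD    float64
	StepsToGoal     int
	Hallucinations  int
}

// relIncrease returns the relative increase from base to cand.
// Assumption: a non-positive baseline yields 0 (no relative comparison).
func relIncrease(base, cand float64) float64 {
	if base <= 0 {
		return 0
	}
	return (cand - base) / base
}

// detectRegressions returns one summary string per tripped threshold.
func detectRegressions(base, cand Metrics) []string {
	var found []string
	// Absolute drop of more than 5 percentage points.
	if drop := base.ToolSuccessRate - cand.ToolSuccessRate; drop > 0.05 {
		found = append(found, fmt.Sprintf("tool success -%.0f%%", drop*100))
	}
	// Relative cost increase of more than 15%.
	if inc := relIncrease(base.TotalCostUSD, cand.TotalCostUSD); inc > 0.15 {
		found = append(found, fmt.Sprintf("cost +%.0f%%", inc*100))
	}
	// Relative step-count increase of more than 20%.
	if inc := relIncrease(float64(base.StepsToGoal), float64(cand.StepsToGoal)); inc > 0.20 {
		found = append(found, fmt.Sprintf("steps +%.0f%%", inc*100))
	}
	// Any increase in hallucinations counts.
	if d := cand.Hallucinations - base.Hallucinations; d > 0 {
		found = append(found, fmt.Sprintf("hallucinations +%d", d))
	}
	return found
}

func main() {
	baseline := Metrics{ToolSuccessRate: 0.90, TotalCostUSD: 1.00, StepsToGoal: 10}
	candidate := Metrics{ToolSuccessRate: 0.82, TotalCostUSD: 1.10, StepsToGoal: 10}

	if found := detectRegressions(baseline, candidate); len(found) > 0 {
		fmt.Println("regressed: " + strings.Join(found, ", "))
	} else {
		fmt.Println("no regression") // hypothetical wording
	}
}
```

In this example the cost rises by 10%, which stays under the 15% threshold, so only the tool success drop is reported.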

When a regression is detected the run completes with `result: "regressed: <delta summary>"` and emits `eval.regression_detected` into the journal (which [Episodic memory](/guides/episodic-memory) picks up).

Both mission IDs must belong to the caller's workspace -- a partial spoof (valid baseline + foreign candidate) still 404s.

## List runs

```
GET /api/v1/eval/runs?limit=50
```

Returns runs newest-first, scoped to the caller's workspace. `limit` accepts 1-200 and defaults to 50.

```json theme={null}
{
  "rows": [
    {
      "id": "er_a1b2c3d4",
      "kind": "regression",
      "mission_id": "MIS-42",
      "baseline_mission_id": "MIS-41",
      "status": "completed",
      "result": "regressed: tool success -8%",
      "created_at": "2026-04-17T10:00:00Z"
    }
  ],
  "count": 1
}
```
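
A client might decode this response along the following lines. The struct and function names are hypothetical, and the field set is inferred from the sample above only; the real response may carry more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// EvalRun mirrors one element of "rows" in the sample response.
// The Go type name and field names are assumptions for this sketch.
type EvalRun struct {
	ID                string    `json:"id"`
	Kind              string    `json:"kind"`
	MissionID         string    `json:"mission_id"`
	BaselineMissionID string    `json:"baseline_mission_id"`
	Status            string    `json:"status"`
	Result            string    `json:"result"`
	CreatedAt         time.Time `json:"created_at"` // parsed from RFC 3339
}

// ListRunsResponse mirrors the top-level object.
type ListRunsResponse struct {
	Rows  []EvalRun `json:"rows"`
	Count int       `json:"count"`
}

// decodeRuns parses a response body from GET /api/v1/eval/runs.
func decodeRuns(body []byte) (ListRunsResponse, error) {
	var resp ListRunsResponse
	err := json.Unmarshal(body, &resp)
	return resp, err
}

func main() {
	body := []byte(`{"rows":[{"id":"er_a1b2c3d4","kind":"regression",
		"mission_id":"MIS-42","baseline_mission_id":"MIS-41",
		"status":"completed","result":"regressed: tool success -8%",
		"created_at":"2026-04-17T10:00:00Z"}],"count":1}`)

	resp, err := decodeRuns(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, r := range resp.Rows {
		fmt.Printf("%s %s %s: %s\n", r.ID, r.Kind, r.Status, r.Result)
	}
}
```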

## LLM-as-judge

The `Judge` interface is provider-neutral -- callers plug Ollama, Anthropic, or a stub.

```go theme={null}
type Judge interface {
    Judge(ctx context.Context, prompt string, rubric []string) (JudgeVerdict, error)
}
```

`EnsembleJudge` runs `k` randomly selected judges from the pool. Each judge sees a **rubric-shuffled** copy (the rubric order is permuted per invocation to mitigate position bias), and the ensemble aggregates the verdicts via the median score and the averaged confidence. A standard deviation above 0.25 annotates the verdict's reasoning with a high-disagreement warning; an averaged confidence below 0.9 sets `HumanEscalate = true`.
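
The shuffle and aggregation steps might look like the sketch below. It is not the `EnsembleJudge` implementation: the `JudgeVerdict` fields other than `HumanEscalate`, the helper names, the use of the population standard deviation over scores, and the even-count median rule are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"math/rand"
	"sort"
)

// JudgeVerdict is a simplified stand-in; only HumanEscalate is named in this guide.
type JudgeVerdict struct {
	Score         float64
	Confidence    float64
	Reasoning     string
	HumanEscalate bool
}

// shuffledRubric returns a permuted copy, leaving the caller's slice untouched.
func shuffledRubric(rubric []string, seed int64) []string {
	out := append([]string(nil), rubric...)
	rng := rand.New(rand.NewSource(seed))
	rng.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

// aggregate combines per-judge verdicts: median score, mean confidence.
func aggregate(verdicts []JudgeVerdict) (JudgeVerdict, error) {
	n := len(verdicts)
	if n == 0 {
		return JudgeVerdict{}, errors.New("no verdicts to aggregate")
	}
	scores := make([]float64, n)
	var scoreSum, confSum float64
	for i, v := range verdicts {
		scores[i] = v.Score
		scoreSum += v.Score
		confSum += v.Confidence
	}
	sort.Float64s(scores)
	median := scores[n/2]
	if n%2 == 0 { // assumption: average the two middle scores
		median = (scores[n/2-1] + scores[n/2]) / 2
	}
	mean := scoreSum / float64(n)
	var sq float64
	for _, s := range scores {
		sq += (s - mean) * (s - mean)
	}
	stddev := math.Sqrt(sq / float64(n)) // population stddev (assumption)

	out := JudgeVerdict{Score: median, Confidence: confSum / float64(n)}
	if stddev > 0.25 {
		out.Reasoning = fmt.Sprintf("high disagreement between judges (stddev %.2f)", stddev)
	}
	out.HumanEscalate = out.Confidence < 0.9
	return out, nil
}

func main() {
	fmt.Println(shuffledRubric([]string{"accuracy", "safety", "brevity"}, 42))
	v, err := aggregate([]JudgeVerdict{
		{Score: 0.2, Confidence: 0.95},
		{Score: 0.8, Confidence: 0.95},
		{Score: 0.9, Confidence: 0.95},
	})
	if err != nil {
		fmt.Println("aggregate failed:", err)
		return
	}
	fmt.Printf("score=%.2f escalate=%v note=%q\n", v.Score, v.HumanEscalate, v.Reasoning)
}
```

Taking the median rather than the mean keeps one outlier judge from dragging the ensemble score, while the standard deviation still surfaces that the outlier existed.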

The judge is intended for future eval rubrics and crew-to-crew critique flows.

## CLI

```bash theme={null}
crewship eval replay MIS-42 --seed 42
crewship eval regression MIS-41 MIS-42
crewship eval runs --limit 20
```

Full reference: [`crewship eval`](/cli/eval).

## Gotchas

* **Seed is informational today.** `Replay` computes a seed signature but does not re-execute agents with that seed -- observational replay only. The seed column exists so future re-execution can anchor to it.
* **Regression thresholds are hard-coded.** Edit `internal/quartermaster/regression.go` to change them. A DB-backed config is a follow-up.
* **10-minute deadline on workers.** Replay of a mission with a very long journal may not finish in time. If you see `failed: context deadline exceeded`, raise the `ReplayBudget` in the handler or trim the mission.
* **Terminal updates use a fresh context.** The worker has 10 min; each status write gets 5s. If the worker deadline fires, the terminal `failed` row write still succeeds because its context is independent.
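
The last two points describe a common Go pattern: the work runs under one deadline, and the terminal status write gets its own context that is not derived from the worker's. A minimal sketch, with hypothetical function names and a shortened budget so the example finishes quickly:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runWorker runs work under a deadline, then records the terminal status
// with a fresh context so an expired worker deadline cannot cancel the write.
func runWorker(
	budget, writeTimeout time.Duration,
	work func(context.Context) error,
	writeStatus func(context.Context, string) error,
) error {
	workCtx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	status := "completed"
	if err := work(workCtx); err != nil {
		status = "failed: " + err.Error()
	}

	// Derived from Background, not from workCtx: independent deadline.
	writeCtx, cancelWrite := context.WithTimeout(context.Background(), writeTimeout)
	defer cancelWrite()
	return writeStatus(writeCtx, status)
}

// simulateOverrun stands in for a replay that exceeds its budget and
// returns the status that was recorded.
func simulateOverrun() (string, error) {
	var recorded string
	err := runWorker(10*time.Millisecond, time.Second,
		func(ctx context.Context) error {
			<-ctx.Done() // pretend the journal is too long to finish in time
			return ctx.Err()
		},
		func(ctx context.Context, status string) error {
			if err := ctx.Err(); err != nil {
				return err // would trigger if the write shared the dead context
			}
			recorded = status
			return nil
		})
	return recorded, err
}

func main() {
	status, err := simulateOverrun()
	fmt.Println(status, err)
}
```

Had the status write reused `workCtx`, it would have been cancelled along with the work and the run would have stayed in `running`.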

## Related

* [Crew Journal](/guides/crew-journal) -- `eval.run_started`, `eval.metric`, `eval.regression_detected`.
* [Episodic memory](/guides/episodic-memory) -- embeds regression events for future recall.
* [`crewship eval`](/cli/eval), [Eval API](/api-reference/eval).