Eval - Crewship

Quartermaster replays a mission deterministically or diffs two missions to detect regressions in tool success, cost, and step signature. Both mutating endpoints return 202 Accepted immediately and run the work in a background goroutine; poll List runs for results. See the Quartermaster guide.

All endpoints require authentication and are workspace-scoped. Mutating endpoints (replay, regression) require OWNER or ADMIN role. Mission IDs must belong to the caller’s workspace — cross-tenant IDs return 404 with the same shape as “not found”.

Endpoints

Method	Endpoint	Purpose
POST	`/api/v1/eval/replay`	Queue a deterministic mission replay
POST	`/api/v1/eval/regression`	Queue a baseline-vs-candidate regression
GET	`/api/v1/eval/runs`	List eval runs (poll for results)

Queueing runs

Replay and regression both return 202 Accepted immediately and perform the work in a 10-minute background goroutine. Poll via List runs.

Queue a replay

POST /api/v1/eval/replay

Request body:

{
  "mission_id": "MIS-42",
  "seed": 42
}

Field	Type	Required	Description
`mission_id`	string	Yes	Target mission. Must be in the caller’s workspace.
`seed`	integer	No	Deterministic seed recorded in the run row. 0 = server default.

Response: 202 Accepted

{
  "run_id": "er_a1b2c3d4e5f60718",
  "status": "queued"
}

Errors:

Status	Condition
400	Invalid JSON or missing `mission_id`.
401	No workspace.
403	Not OWNER/ADMIN.
404	`mission_id` not in your workspace.
500	DB / token generation failure.

Queue a regression

POST /api/v1/eval/regression

Request body:

{
  "baseline_mission_id": "MIS-41",
  "candidate_mission_id": "MIS-42"
}

Field	Type	Required	Description
`baseline_mission_id`	string	Yes	The reference mission.
`candidate_mission_id`	string	Yes	The mission under test.

Both must be in the caller’s workspace. The handler checks them independently so a partial spoof still 404s. Response: 202 Accepted

{
  "run_id": "er_b2c3d4e5f6071829",
  "status": "queued"
}

Errors: Same as replay, plus 400 if either mission ID is empty.

Results

Poll for run status, results, and per-run token/cost totals.

List runs

GET /api/v1/eval/runs?limit=50

Query parameters:

Param	Type	Default	Description
`limit`	integer	50	1-200.

Response: 200 OK

{
  "rows": [
    {
      "id": "er_a1b2c3d4e5f60718",
      "workspace_id": "ws_123",
      "kind": "replay",
      "mission_id": "MIS-42",
      "status": "completed",
      "result": "ok",
      "seed": 42,
      "signature": "7c1b...",
      "total_tokens": 184251,
      "total_cost_usd": 0.8421,
      "regressed": false,
      "created_by": "user_123",
      "created_at": "2026-04-17T10:00:00Z",
      "completed_at": "2026-04-17T10:02:41Z"
    },
    {
      "id": "er_b2c3d4e5f6071829",
      "workspace_id": "ws_123",
      "kind": "regression",
      "baseline_mission_id": "MIS-41",
      "candidate_mission_id": "MIS-42",
      "status": "completed",
      "result": "regressed: tool success -8% cost +22%",
      "seed": 0,
      "total_tokens": 201044,
      "total_cost_usd": 0.9133,
      "regressed": true,
      "created_at": "2026-04-17T10:15:00Z"
    }
  ],
  "count": 2,
  "limit": 50
}

The serialized struct (quartermaster.RunRecord) marks mission_id, baseline_mission_id, candidate_mission_id, result, signature, created_by, and completed_at as omitempty — empty values are dropped from the row, not emitted as ""/null. seed, total_tokens, total_cost_usd, and regressed are always present.

Field	Type	Description
`rows[].kind`	string	`replay` or `regression`.
`rows[].status`	string	`queued`, `running`, `completed`, `failed`.
`rows[].result`	string	Human-readable outcome. On failure, the error message. Regression success uses `no_regression`; a detected regression reads `regressed: <delta summary>`. Omitted while empty.
`rows[].signature`	string	sha256 over step-type + tool-name sequence; stable across deterministic replays. Populated for replay runs; omitted when empty.
`rows[].total_tokens`	integer	Total tokens consumed by the run.
`rows[].total_cost_usd`	number	Total USD cost of the run.
`rows[].completed_at`	string?	RFC3339 completion timestamp; omitted until the run finishes.
`rows[].regressed`	boolean	For regression kind only; true if at least one metric crossed the threshold.

Tenancy and role gates

All reads + writes scoped to the session’s workspace.
Decisions on replay/regression require OWNER or ADMIN.
Cross-tenant mission IDs return 404, not 403, to avoid leaking cross-workspace existence.

Journal side-effects

The background worker emits eval.run_started at the start, eval.metric for each computed metric, and eval.regression_detected when a regression run crosses a threshold. Correlate by run_id in the payload.

​Endpoints

​Queueing runs

​Queue a replay

​Queue a regression

​Results

​List runs

​Tenancy and role gates

​Journal side-effects

​Related

Endpoints

Queueing runs

Queue a replay

Queue a regression

Results

List runs

Tenancy and role gates

Journal side-effects

Related