All endpoints require authentication and are workspace-scoped. Mutating endpoints (replay, regression) require the OWNER or ADMIN role. Mission IDs must belong to the caller’s workspace; a cross-tenant ID returns 404 with the same response shape as a mission that does not exist. See the Quartermaster guide.

Replay and regression both return 202 Accepted immediately and do the work in a background goroutine that is limited to 10 minutes. Poll List Runs to follow progress.

Queue a replay

POST /api/v1/eval/replay
Request body:
{
  "mission_id": "MIS-42",
  "seed": 42
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| mission_id | string | Yes | Target mission. Must be in the caller’s workspace. |
| seed | integer | No | Deterministic seed recorded in the run row. 0 = server default. |
Response: 202 Accepted
{
  "run_id": "er_a1b2c3d4e5f60718",
  "status": "queued"
}
Errors:
| Status | Condition |
| --- | --- |
| 400 | Invalid JSON or missing mission_id. |
| 401 | No workspace. |
| 403 | Not OWNER/ADMIN. |
| 404 | mission_id not in your workspace. |
| 500 | DB / token generation failure. |
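As a minimal client-side sketch of the replay call described above, the following builds the request without sending it. The `build_replay_request` helper, the `base_url` and `token` parameters, and the bearer-token `Authorization` header are assumptions for illustration; this reference only states that authentication is required, not which scheme is used.

```python
import json
import urllib.request


def build_replay_request(base_url: str, token: str, mission_id: str,
                         seed: int = 0) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/eval/replay request."""
    if not mission_id:
        # The server answers 400 for a missing mission_id; fail early instead.
        raise ValueError("mission_id is required")
    body = json.dumps({"mission_id": mission_id, "seed": seed}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/eval/replay",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth. The reference does not name the scheme.
            "Authorization": f"Bearer {token}",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it; on success the 202 body carries the `run_id` to look for in List Runs.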

Queue a regression

POST /api/v1/eval/regression
Request body:
{
  "baseline_mission_id": "MIS-41",
  "candidate_mission_id": "MIS-42"
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| baseline_mission_id | string | Yes | The reference mission. |
| candidate_mission_id | string | Yes | The mission under test. |

Both must be in the caller’s workspace. The handler checks them independently, so a partial spoof still 404s.

Response: 202 Accepted
{
  "run_id": "er_b2c3d4e5f6071829",
  "status": "queued"
}
Errors: Same as replay, plus 400 if either mission ID is empty.
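The regression request and its 202 response could be handled on the client roughly as follows. This is a sketch: `regression_body`, `parse_queue_response`, and `QueueError` are hypothetical names, and the only server behavior relied on is what is documented here (both IDs required, 202 with a `run_id` on success).

```python
import json


class QueueError(Exception):
    """Raised when a queue endpoint answers with anything other than 202."""


def regression_body(baseline_mission_id: str, candidate_mission_id: str) -> bytes:
    """Encode the JSON body for POST /api/v1/eval/regression."""
    if not baseline_mission_id or not candidate_mission_id:
        # The server returns 400 if either ID is empty; reject it locally first.
        raise ValueError("baseline_mission_id and candidate_mission_id are both required")
    return json.dumps({
        "baseline_mission_id": baseline_mission_id,
        "candidate_mission_id": candidate_mission_id,
    }).encode("utf-8")


def parse_queue_response(status: int, raw: bytes) -> str:
    """Return the run_id from a 202 response, or raise QueueError."""
    if status != 202:
        # A 404 here means at least one mission is not in the caller's workspace;
        # by design it is indistinguishable from a mission that does not exist.
        raise QueueError(f"queue request failed with HTTP {status}")
    return json.loads(raw)["run_id"]
```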

List runs

GET /api/v1/eval/runs?limit=50
Query parameters:
| Param | Type | Default | Description |
| --- | --- | --- | --- |
| limit | integer | 50 | Allowed range: 1-200. |
Response: 200 OK
{
  "rows": [
    {
      "id": "er_a1b2c3d4e5f60718",
      "workspace_id": "ws_123",
      "kind": "replay",
      "mission_id": "MIS-42",
      "baseline_mission_id": "",
      "candidate_mission_id": "",
      "seed": 42,
      "seed_signature": "sha256:7c1b...",
      "status": "completed",
      "result": "ok",
      "tokens": 184251,
      "cost_usd": 0.8421,
      "regressed": false,
      "created_by": "user_123",
      "created_at": "2026-04-17T10:00:00Z",
      "updated_at": "2026-04-17T10:02:41Z"
    },
    {
      "id": "er_b2c3d4e5f6071829",
      "kind": "regression",
      "baseline_mission_id": "MIS-41",
      "candidate_mission_id": "MIS-42",
      "status": "completed",
      "result": "regressed: tool success -8% cost +22%",
      "regressed": true,
      "created_at": "2026-04-17T10:15:00Z"
    }
  ],
  "count": 2,
  "limit": 50
}
| Field | Type | Description |
| --- | --- | --- |
| rows[].kind | string | replay or regression. |
| rows[].status | string | queued, running, completed, failed. |
| rows[].result | string | Human-readable outcome. On failure, the error message. |
| rows[].seed_signature | string | sha256 over step-type + tool-name sequence; stable across deterministic replays. |
| rows[].regressed | boolean | For regression kind only; true if at least one metric crossed the threshold. |
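Because queued work finishes asynchronously, a client typically polls this endpoint until the run reaches `completed` or `failed`. The loop below is one possible sketch. `fetch_runs` is a hypothetical callable that performs the GET and returns the decoded JSON, and the 11-minute default timeout is an assumption chosen to sit slightly above the 10-minute server-side limit.

```python
import time
from typing import Callable, Optional

TERMINAL_STATUSES = {"completed", "failed"}


def find_run(listing: dict, run_id: str) -> Optional[dict]:
    """Return the row whose id matches run_id, or None if it is not listed."""
    for row in listing.get("rows", []):
        if row.get("id") == run_id:
            return row
    return None


def wait_for_run(fetch_runs: Callable[[], dict], run_id: str,
                 timeout_s: float = 660.0, interval_s: float = 5.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> dict:
    """Poll List Runs until the run is completed or failed, then return its row."""
    deadline = clock() + timeout_s
    while True:
        row = find_run(fetch_runs(), run_id)
        if row is not None and row.get("status") in TERMINAL_STATUSES:
            return row
        if clock() >= deadline:
            raise TimeoutError(f"run {run_id} did not finish within {timeout_s}s")
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays. If more runs than `limit` were created after the one being tracked, it may not appear in the listing, so a larger `limit` may be needed.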

Tenancy and role gates

  • All reads and writes are scoped to the session’s workspace.
  • Queueing a replay or regression requires OWNER or ADMIN.
  • Cross-tenant mission IDs return 404, not 403, to avoid leaking cross-workspace existence.

Journal side-effects

The background worker emits eval.run_started at the start, eval.metric for each computed metric, and eval.regression_detected when a regression run crosses a threshold. Correlate by run_id in the payload.
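To correlate journal entries with a run, a consumer can group the three event types by `run_id`. The sketch below assumes each journal entry is a dict with a `type` key and a `payload` dict; that envelope shape is an assumption, since this page documents only the event names and the presence of `run_id` in the payload.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

EVAL_EVENT_TYPES = ("eval.run_started", "eval.metric", "eval.regression_detected")


def group_eval_events(events: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group eval.* journal events by the run_id found in their payload."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for event in events:
        # Assumption: the event name lives under "type".
        if event.get("type") not in EVAL_EVENT_TYPES:
            continue
        run_id = (event.get("payload") or {}).get("run_id")
        if run_id:
            grouped[run_id].append(event)
    return dict(grouped)


def regression_detected(events_for_run: Iterable[dict]) -> bool:
    """True if any event for the run is eval.regression_detected."""
    return any(e.get("type") == "eval.regression_detected" for e in events_for_run)
```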