Skip to main content

What is a Routine?

A Routine is a declarative recipe for repeatable AI work. You author it once (or an agent authors it for you when it spots a repetitive pattern), it runs the same way every time, and it lives in your workspace as a reusable asset alongside crews, skills, and credentials. Each routine bundles:
  • A JSON DSL definition — inputs, outputs, ordered or DAG-structured steps, validation gates, declared egress, credential requirements
  • Authorship metadata — which crew, which agent (or user), via which path
  • Triggers — cron schedules and webhook tokens that fire it autonomously
  • Run history — every invocation as an immutable journal trace with step-level events
  • Versions — every save creates a new version; rollback is one API call
Compared to other layers in Crewship:
LayerScopeAuthoringUse case
RoutineatomicAI-authored OR human DSLOne repeatable AI workflow
Recipe (future)compositeMarketplace templateCrew + agents + integrations + routines bundled
Cyclic Issue (future)issue-trackerrecurring user issue”Standup every Monday” without leaving Issues
The user-facing label is Routine across web UI, CLI, agent system prompts and marketing surfaces. The backend identifier (database table, Go package, HTTP route paths) remains pipelines for backwards compatibility.
Naval theme. Routines are the boring, accurate, repeated procedures a ship’s crew runs every day — the equivalent of naval drills in the Crewship metaphor. Programs your AI agents follow.

Three authoring paths

Crewship is unusual among workflow systems because agents author routines, not just execute them. Three paths converge on the same pipelines table:

Agent

The most common path. An agent that’s solved a repetitive problem twice posts to http://localhost:9119/pipelines/save from inside its container; the sidecar injects authorship and forwards to the main API. Next time [AVAILABLE ROUTINES] block in the system prompt advertises it to other crews.You can also just describe a routine in chat — “make a routine that summarizes yesterday’s commits and posts to Slack.” A crew Lead carries the bundled routine-author skill (an authoring playbook): it clarifies the essentials, grounds the DSL in the crew’s [CONNECTED INTEGRATIONS] and [AVAILABLE ROUTINES], writes and test-runs the routine, then tells you whether it went live or landed as a proposed routine for a Manager to approve (see Governance).

UI

Open /routines, click + New routine, pick a starter template, edit the DSL JSON, click Test & Save. Test_run runs against the execution tier; on pass the routine is persisted with authored_via=user_api and the JWT user as author. An existing routine’s DSL is also fully editable in-place from its detail page’s Editor / JSON tab (CodeMirror, format + revert + copy) — see the note below on how that path’s Save differs from Test & Save.

CLI

crewship routine save --name "..." --definition file.json --author-crew <crew_id>. The server validates the DSL on save (same gate as the UI). CI-friendly: validate offline first with crewship routine validate file.json, then save.
Editing an existing routine’s DSL bypasses the test-run gate. The Editor / JSON tab’s Save button posts straight to /pipelines/save with skip_test_gate: true — it does not send the definition through test_run first. The server only honors that flag for OWNER/ADMIN (lower roles get 403), so this is a fast lane for trusted operators, not a hole in the gate: a MEMBER/MANAGER can view and copy the DSL from this tab but can’t save an edit without going through the create flow’s Test & Save. A follow-up will chain test_run → save_token → save behind one button so any MANAGER+ role can edit safely.

DSL anatomy

Minimal valid routine:
{
  "dsl_version": "1.0",
  "name": "summarize-text",
  "description": "Summarize input text in 3 bullets.",
  "inputs": [
    { "name": "text", "type": "string", "required": true }
  ],
  "outputs": [
    { "name": "summary", "type": "string" }
  ],
  "steps": [
    {
      "id": "summarize",
      "type": "agent_run",
      "agent_slug": "tomas",
      "complexity": "fast",
      "prompt": "Summarize: {{ inputs.text }}"
    }
  ]
}

Top-level fields

FieldTypeNotes
dsl_versionstringAlways "1.0" for now. Forward-compat field.
namestringSlug-friendly identifier. Workspace-unique.
display_namestringOptional pretty label.
descriptionstringOne-line summary; shown in lists + [AVAILABLE ROUTINES].
inputs[]array of InputSpecDeclared parameters.
outputs[]array of OutputSpecRead from the final step’s output by name. Documentary in MVP.
steps[]array of StepSequential by default; DAG with needs:.
execution_tierobject{preferred, fallback} overriding workspace default.
estimated_cost_usdnumberAuthor estimate; UI surfaces overrun warnings.
egress_targetsarray of stringDeclared outbound hosts. Enforced at run time for every http step and http hook, including redirect hops; empty/omitted = unrestricted at the routine layer (the crew network policy and SSRF guards still apply). See http.
integrations_requiredarray of stringDeclared third-party integrations (Composio connector slugs like "github", "slack"). Declared AND enforced at run time — see Required integrations below. Slugs are lowercased/trimmed.
credentials_requiredarray of CredReqDeclared credential needs (type + scope). Declared-only today (not yet enforced at run time).
resourcesobjectAgent-declared capability manifest — datastores[] and tools[] the routine touches that can’t be inferred from the step graph. Declared AND enforced at run time against the crew’s container resources — see Required resources below. See Capability manifest below for the shape.
max_cost_usdnumberRuntime cost cap; run aborts if exceeded between steps.
concurrency_keystringTemplate that gates how many runs can be in flight at once for the same workspace + rendered key. Typical pattern "{{ inputs.account_id }}" to serialise per-tenant runs. Empty (default) = no gate. See Concurrency + idempotency recipe and the fail-fast note in Troubleshooting.
max_concurrentintegerCap on simultaneous runs sharing the resolved concurrency_key. Defaults to 1 when concurrency_key is set (strict per-key serialisation); ignored otherwise.
guardrailsobjectPer-routine Lookout policy. guardrails.input.prompt_injection.action: block (default) | sanitize | log.
evalobjectContinuous online grading. See Online eval below.
agentlessbooleanToken-zero guarantee: the routine may only contain http / code / wait / transform steps and can never invoke an LLM. See Agentless routines.

InputSpec

{
  "name": "since",
  "type": "string",
  "required": false,
  "default": "yesterday",
  "description": "ISO date or 'yesterday'/'today'",
  "min": 0,
  "max": 100
}
type is one of string | integer | number | boolean | array | object. min / max are *float64 so decimal constraints work for number types.

Template substitution

Anywhere a string is interpolated (prompt, http URL/body/headers, wait until, code, transform, conditional if), placeholder {{ ... }} resolves against:
  • inputs.X — declared input value
  • steps.Y.output — full text output of an earlier step
  • steps.Y.output.path — JSON path into a step’s output (when output parses as JSON)
  • env.AUTHOR_CREW_NAME / etc. — read-only allowlist of execution context
Save-time validator walks every template-bearing field (prompt, nested inputs, http url/body/headers, wait until, event_filter, approval_prompt, code body, code env values, transform input, transform expression, if condition) and rejects placeholders that reference unknown inputs or unseen-yet steps.
Templates are regex substitution, not expression evaluation. There is no arithmetic, no function calls, no inline conditionals. If you need logic, add another step.
Save-time validation walks every template-bearing field and rejects placeholders that reference unknown inputs or unseen-yet steps — so a typo’d {{ inputs.tyop }} fails at save, not at 3am in production.

Required integrations

A routine can DECLARE the third-party integrations (Composio connectors) it needs with the top-level integrations_required array, and the run path will block a run when the executing crew hasn’t connected one of them. This closes the “integration forgotten” gap — the same shape as egress_targets, but for connectors instead of hosts.
{
  "name": "triage-incident",
  "integrations_required": ["slack", "github"],
  "steps": [ ... ]
}
Semantics:
  • Declared is always allowed. Saving a routine that names an integration the crew lacks is fine — declaring is a contract, not a connection. Only the run enforces. (Save-time validation only checks the list is well-formed: non-empty slugs, lowercased/trimmed, within a sane count cap.)
  • Enforced at run time. Before a run starts, Crewship resolves the integrations the routine’s author crew has connected and compares them to integrations_required. If any are missing the run is blocked with an RFC 7807 Problem Details response, HTTP 422, carrying a machine-readable missing_integrations: string[] member and a human detail like routine requires integration "slack" not connected for crew "Marketing". The UI uses missing_integrations to render a Connect action. The run never starts — no tokens spent.
  • No-op fast path. An empty / absent integrations_required does zero resolution work — no overhead for routines that don’t use it.
  • Fail-open. If integration-availability resolution itself errors, the run is allowed (a warning is logged). A bug in resolution must never wedge every run of every routine — a forgotten integration is a soft, recoverable failure; a hard block on all runs would be a self-inflicted outage.
  • run is gated; dry_run is not. A live run executes against the author crew’s agents, so it’s gated (fail fast rather than land an unrunnable routine). The internal save gate’s draft validation applies the same integration check. dry_run is a preview that invokes nothing, so it’s left ungated — it shows what the routine would need, even integrations the crew hasn’t connected yet.
Integration availability is resolved from the crew’s connected Composio connectors (the MCP server rows the bind flow writes). Two limitations: under the workspace default connector (every agent inherits all connected apps), the gate treats integrations as available without enumerating them; and resolution reflects what’s wired, not live connection health (a revoked account still reads as available until its binding is removed). credentials_required remains declared-only for now — it is not yet enforced at run time the way integrations_required is.

Capability manifest

Every routine has a derived capability manifest — the full “what this routine touches” blast radius. The detail API (GET .../routines/{id}) returns it under a manifest member so the UI can render a data-flow diagram and governance can reason about the whole footprint of a run, not just its visible steps. Most of the manifest is auto-derived from the DSL — you don’t declare it:
Manifest fieldDerived from
integrationsintegrations_required (normalized)
egressegress_targets plus any host parseable from http step URLs (templated {{ }} URLs are skipped)
credentialscredentials_required
agentsagent_slug of every agent_run step
routinespipeline_slug of every call_pipeline step
toolscode step runtimes plus declared resources.tools
datastoresdeclared resources.datastores
has_http / has_codewhether any http / code step exists
The walk covers routine-level (before_all / after_all / on_failure) and per-step (before / after) lifecycle hooks too — a capability hidden in an on_failure hook is still part of the blast radius. Every list is deduped + sorted and always rendered as [] (never null), so the diagram is stable.

The resources block

Two things can’t be inferred from the step graph: datastores a routine reads/writes, and CLI tools/scripts it runs. In production, code steps aren’t wired — agents run scripts (ansible, kubectl, …) via an agent_run step that shells out, so static analysis can’t see them. Declare them so they show up in the manifest:
{
  "name": "deploy-service",
  "resources": {
    "datastores": [
      { "type": "postgres", "name": "main", "note": "writes table runs" },
      { "type": "redis",    "name": "cache" }
    ],
    "tools": [
      { "type": "ansible",  "name": "deploy.yml" },
      { "type": "kubectl" }
    ]
  },
  "steps": [ ... ]
}
  • datastores[].typeany string, but use the canonical vocabulary redis | postgres | mysql | mongodb | other so it matches what the crew’s container catalog reports (the precondition gate compares type case-insensitively; a non-canonical type simply won’t match a crew resource).
  • tools[].typeany string; canonical examples ansible | terraform | kubectl | bash | python | other. Same matching rule as datastores.
  • name / note are free-form and optional.
Validation is lenient: declaring a resource is always allowed at save time (declaring is a contract, not a provisioning request). Save only rejects malformed entries — a blank or non-slug type, or more than 32 of either kind. The run path enforces availability — see Required resources.

Required resources

The resources block is a run-time precondition gate, the resource sibling of Required integrations: it states what the routine requires (datastores like Postgres/Redis, CLI tools like ansible), and the run path checks it against what the executing crew’s container actually has ([CONTAINER RESOURCES] — the crew’s sidecar datastores + installed tools). Semantics:
  • Enforced at run time. Before a run starts, Crewship resolves the author crew’s container resources and compares them to the declared resources.datastores + resources.tools. If the routine requires a datastore or tool the crew doesn’t have, the run is blocked with an RFC 7807 Problem Details response, HTTP 422, carrying a machine-readable missing_resources: [{ kind, type, name }] member (kind is "datastore" or "tool") and a human detail like routine needs datastore postgres, tool ansible, not available to crew "Ops". The run never starts — no tokens spent — until the resource is provisioned on the crew.
  • Matching. A required datastore is satisfied when the crew has a datastore of the same engine type (case-insensitive; the service name is advisory and not matched). A required tool is satisfied when the crew has an installed tool whose name matches the required type (the tool’s name, e.g. deploy.yml, is the concrete artifact and isn’t matched against the container).
  • No-op fast path. A routine that declares no resources block does zero resolution work. Only the declared resources are gated — code-step runtimes folded into the manifest’s tools (e.g. cel) are internal executor runtimes, not crew CLIs, and are never gated.
  • Fail-open. If the routine has no author crew, or resource resolution itself errors, the run is allowed (a warning is logged) — same reasoning as the integration gate: a resolver bug must never wedge every run.
  • run is gated; dry_run is not. (The internal save gate’s draft validation applies the same resource check.)

Agents already know what the container has

You don’t have to tell an agent that its crew runs Postgres or ships kubectl — Crewship surfaces it automatically. Every agent’s system prompt now includes a [CONTAINER RESOURCES] block listing the crew’s datastores (derived from the crew’s sidecar services — a service named postgres is reachable at host postgres on its declared port) and installed CLI tools (derived from the crew’s devcontainer features + mise toolchain, e.g. ansible, kubectl, git, python, node). The agent is instructed to use these directly instead of probing or trying to install them. So when you author a routine that needs a datastore, declare it under resources.datastores and connect via the host/port the agent already sees in that block. The block is omitted entirely when a crew has no services and no notable tools.

Step types

agent_run

{
  "id": "summarize",
  "type": "agent_run",
  "agent_slug": "tomas",
  "complexity": "fast",
  "prompt": "Summarize: {{ inputs.text }}",
  "model_override": null,
  "timeout_seconds": 60,
  "validation": {
    "schema": { "type": "object", "required": ["title"] },
    "must_not_contain": ["API_KEY=", "Bearer "],
    "min_length": 30
  },
  "on_fail": "escalate_tier"
}
complexity resolves through the workspace’s execution_tiers_json mapping into (adapter, model). With complexity: "fast", the agent’s CLI gets --model claude-haiku-4-5-20251001 (or whatever the workspace mapped fast to). model_override is the explicit pin that wins over complexity. on_fail lives at the step level (not inside validation) and is one of escalate_tier | abort | retry_stepescalate_tier walks the fallback chain (e.g., Haiku → Sonnet → Opus) until validation passes or the chain exhausts.

call_pipeline

{
  "id": "review_summary",
  "type": "call_pipeline",
  "pipeline_slug": "human-approval-step",
  "nested_inputs": {
    "content": "{{ steps.summarize.output }}",
    "approver_role": "marketing_lead"
  }
}
Composition primitive. Save-time cycle detection rejects loops; runtime depth limit caps at 10 levels.

http

{
  "id": "fetch",
  "type": "http",
  "http": {
    "method": "GET",
    "url": "{{ inputs.url }}",
    "max_response_bytes": 200000,
    "success_codes": [200],
    "credential_ref": { "type": "API_KEY", "inject_as": "bearer" }
  },
  "timeout_seconds": 30
}
Egress enforcement. Every http step (and http hook) passes two host gates at run time, checked pre-flight AND on every redirect hop (CheckRedirect callback):
  1. Routine layer — egress_targets. When the routine declares egress_targets, the request host must be one of the declared hosts or a subdomain of one (api.x.com matches target x.com; evilx.com does not). A routine that declares no egress_targets is unrestricted at this layer — backward-compatible with every routine that predates the field. The SSRF guard (private/link-local IPs, DNS-rebind-safe dialing) applies regardless.
  2. Crew layer — network policy. The authoring crew’s network policy (network_mode + allowed domains — the same dial that governs the crew’s agent containers) also applies to direct http steps. Crews on the default free mode are unaffected; a restricted crew’s routines can only reach the crew’s allowed domains (exact host match, same as the container proxy).
A blocked request fails the step with a structured error naming the step, the host, and which layer denied it — before any bytes leave the server. Credential injection. credential_ref.type is resolved at run time against the workspace credential vault by type (case-insensitive match on the vault type, e.g. API_KEY, GENERIC_SECRET), never by ID — so a shared routine runs against any workspace holding a credential of the right type. Only ACTIVE credentials resolve; credentials pinned to another crew are invisible; when several match, the authoring crew’s own credential wins over workspace-shared ones and the newest wins within each group (rotation). The decrypted value goes into the outbound request only — never into logs, the journal, or step output. If nothing matches, the request is sent without credentials (public endpoints keep working). Injection schemes: bearer (default), header with explicit name, query with explicit name.

wait

{
  "id": "human_approval",
  "type": "wait",
  "wait": {
    "kind": "approval",
    "approval_prompt": "Approve summary?",
    "timeout_sec": 86400
  }
}
Three kinds: approval (HITL token), datetime (sleep until ISO timestamp), event (filter on journal events). The DB-backed waitpoint store survives process restart for the approval kind: at boot, the run that was parked on the wait step is resumed and re-attaches to the original pending token, so the approval stays answerable (see Durability and restart recovery). Async, non-blocking. Hitting a kind: approval gate does not hold a goroutine or fail the run. A foreground crewship routine run returns promptly with status: WAITING (exit 0) and a waitpoint token, and releases its execution slot while it waits. Approving or rejecting resumes the run from the gate (already-completed steps are restored/skipped); a rejection resolves the run to FAILED cleanly rather than stranding it. Parked approvals whose timeout_sec elapsed are reconciled at the next boot scan. Approve via UI Inbox, CLI crewship routine waitpoints approve <token>, or API.

code

A code step has a runtime. Two runtimes are wired today — expr and cel — both in-process, pure-Go, token-zero (no container, no LLM, no filesystem, no network). Use expr for a single boolean comparison (wake-gate probes); reach for cel as soon as you need real logic (booleans, arithmetic, string/list ops) — its own code comments call it out as the general-purpose deterministic primitive.

runtime: expr (wired, token-zero)

- id: probe
  type: code
  code:
    runtime: expr
    code: "{{ inputs.spend_usd }} > {{ inputs.threshold_usd }}"
expr evaluates a single comparison and emits true or false:
  • Operators: > >= < <= == !=.
  • The body is Render()-ed first, so {{ inputs.x }} / {{ steps.y.output }} placeholders substitute before evaluation.
  • Anything that isn’t a single comparison (multiple operators, function calls, arbitrary code) fails closed with a clear error — expr is deliberately not a scripting language.
This is the canonical primitive for agentless probes and schedule wake-gates (emit true only when work is needed). See Wake gates.

runtime: cel (wired, token-zero, general logic)

- id: spike
  type: code
  code:
    runtime: cel
    # Real logic in one deterministic, token-zero step — no {{ }} needed,
    # inputs are exposed as a typed `inputs` map:
    code: 'inputs.spend_usd > inputs.threshold_usd && inputs.region in ["eu", "us"]'
cel evaluates a Google CEL expression — non-Turing-complete (every expression provably terminates), so it keeps the token-zero / no-execution-surface guarantee of expr while adding boolean operators (&&, ||, !), arithmetic, string ops, list/map membership, ternaries, and field access. Reach for it when expr’s single comparison isn’t enough. A bool result emits true/false; numeric and string results emit their canonical string form. Compile/eval errors fail closed.

runtime: bash | python | go (rejected at author time)

- id: process
  type: code
  code:
    runtime: python
    code: "import json; print(json.dumps({'sum': sum(input_nums)}))"
These are schema-legal runtime names but have no sandboxed runner wired. As of PR #710 they’re no longer “saves-cleanly-then-fails-at-3am”: a routine using one is rejected at save / apply / test_run time with a message pointing at the fix (runtime: expr or cel, or convert the step to agent_run).
runtime: bash, python, and go are not executable today and are rejected when you save, apply, or test-run a routine that uses them — the error names the offending step and suggests expr/cel or an agent_run conversion. Only expr and cel are wired.

transform

{
  "id": "extract_emails",
  "type": "transform",
  "transform": {
    "input": "{{ steps.fetch.output }}",
    "expression": ".items[] | .email"
  }
}
Pure-Go data reshaping with a tiny jq-flavoured subset. No LLM, no network — fully deterministic. Useful for wiring step outputs without paying for another agent_run.

Conditional if

Any step can carry "if": "{{ inputs.run_summary }}". The placeholder is rendered as a plain string — empty / false / 0 / no / off (case-insensitive) → step skipped and marked <skipped> in StepOutputs; anything else counts as truthy. Templates are plain substitution, not expression evaluation, so put the boolean upstream (e.g. set inputs.run_summary from the caller) rather than writing == / != inside the if value.

DAG with needs[]

{ "id": "fetch_a", "type": "http", "http": {...} },
{ "id": "fetch_b", "type": "http", "http": {...} },
{ "id": "merge",
  "type": "agent_run",
  "needs": ["fetch_a", "fetch_b"],
  "prompt": "Merge: {{ steps.fetch_a.output }} + {{ steps.fetch_b.output }}"
}
Steps with no overlapping needs execute in parallel (one goroutine wave per ready set). Final output picks the unique leaf node; for multi-leaf DAGs the first leaf in source order wins.

Lifecycle hooks (before_all / after_all / on_failure)

Routine-level hooks run deterministic side-channel steps around the main execution — a clean home for setup/teardown that isn’t part of the visible step graph. Hook steps must be code / http / transform (no agent_run — a hook must not recurse or spend tokens):
hooks:
  before_all:   # setup — its FAILURE aborts the run before any step executes
    id: claim
    type: http
    http: { method: POST, url: "https://ops.example.com/claim" }
  after_all:    # runs after a COMPLETED run — best-effort
    id: release
    type: http
    http: { method: POST, url: "https://ops.example.com/release" }
  on_failure:   # runs after a FAILED run — best-effort cleanup
    id: alert
    type: http
    http: { method: POST, url: "https://ops.example.com/alert" }
after_all and on_failure are best-effort — logged, but they never change the run’s outcome. Hooks fire only on the top-level run (not nested call_pipeline expansions) and are skipped on resume re-entry and dry-run. Per-step before / after hooks also exist and are included in the capability manifest walk. Full reference: Lifecycle hooks.

Per-step overrides (no version bump)

Tweak a single step’s prompt or model without bumping the routine version — the override is applied at run start over the versioned DSL, so the durable, authored definition stays the source of truth while an operator can patch and clear a live behavior quickly:
crewship routine step-override set my-routine summarize \
  --prompt "Summarize in 3 bullets, lead with the risk." --model smart
crewship routine step-override list my-routine
crewship routine step-override clear my-routine summarize
Only non-empty fields win — a prompt-only override leaves the authored model untouched. Full reference: Per-step prompt/model override.

Agentless routines

Declare "agentless": true to get a token-zero guarantee: the routine can never invoke an LLM, no matter who edits it later. This is what makes high-frequency automation (health checks, metric probes, TLS expiry watches) free to run on a tight cron.
{
  "dsl_version": "1.0",
  "name": "cost-spike-probe",
  "agentless": true,
  "egress_targets": ["billing.example.com"],
  "steps": [
    { "id": "fetch", "type": "http",
      "http": { "method": "GET", "url": "https://billing.example.com/today" } },
    { "id": "amount", "type": "transform",
      "transform": { "input": "{{ steps.fetch.output }}", "expression": ".spend_usd" } },
    { "id": "judge", "type": "code",
      "code": { "runtime": "expr", "code": "{{ steps.amount.output }} > 100" } }
  ]
}
The probe’s verdict comes from data reshaping plus a single comparison, not scripting: the http step fetches a JSON status, transform projects the number out of it, and the expr code step emits the boolean true/false — all token-zero. (A pure http + transform projection of an already-boolean field works too; reach for expr when you need the comparison.) Enforced at two layers:
  • Save time — validation rejects agent_run (direct LLM spend), call_pipeline (the target resolves by slug at runtime, so a nested routine could gain an agent step later and silently break the guarantee), and eval.online with sample_rate > 0 (online grading runs a grader agent against the routine’s runs).
  • Run time — the executor independently refuses to dispatch an LLM-capable step inside an agentless run, so even a definition written before the validator existed fails closed.
Everything else works as usual: egress allowlist, credentials, versioning, dry-run, schedules, webhooks. Agentless routines are also the only valid probes for wake gates.

Two-tier execution

The economic value-prop: an Opus-class authoring model designs the routine, a Haiku-class executor model runs each invocation. Workspace execution_tiers_json maps complexity classes to (adapter, model):
{
  "trivial":  { "primary": {"adapter":"claude","model":"claude-haiku-4-5-20251001"} },
  "fast":     { "primary": {"adapter":"claude","model":"claude-haiku-4-5-20251001"},
                "fallback":[{"adapter":"claude","model":"claude-sonnet-4-6"}] },
  "moderate": { "primary": {"adapter":"claude","model":"claude-sonnet-4-6"} },
  "smart":    { "primary": {"adapter":"claude","model":"claude-opus-4-7"} }
}
Per-step complexity annotation drives the resolver. With on_fail: "escalate_tier", a failed validation walks the fallback chain — practically: Haiku tries first, Sonnet on validation fail, Opus on second fail.
Tier override at runtime. The CLI flag --model <model> is constructed from the resolved tier and passed to the agent’s CLI adapter, so a routine’s complexity: "fast" actually fires Haiku, not the agent’s default. CLIAdapter is preserved (so the agent’s CLAUDE_CODE / GEMINI_CLI / etc. wiring stays intact); only the model name swaps.

Save validation gate

Save endpoints (sidecar /pipelines/save, user /api/v1/workspaces/{ws}/pipelines/save, internal /api/v1/internal/pipelines/save) require the routine to clear a validation gate before it persists. The gate is a dry-run validation of the draft, not a real execution — there is no “test run” mode (you cannot run an agent dry). The sidecar agent-authoring flow forwards the draft to /api/v1/internal/pipelines/test_run, which parses, schema-validates, and dry-runs it (rendering every template, invoking no agent); on success it sets last_test_run_passed and forwards to save. This is the self-improvement loop: an authoring agent that writes brittle DSL gets a structured failure report it can read and revise from. Without the gate, MVP would ship pipelines that pass schema but fail at runtime. Real execution happens on the first live run, and risky routines are human-reviewed (governance) before they go live.
skip_test_gate: true is honored only when the caller’s role is OWNER or ADMIN; lower roles get 403. Useful for hand-crafted DSL from known-good templates (the seed flow uses it).

Governance — agent proposes, human approves the risky ones

Routines have a lifecycle status (migration v128): active (live + runnable), proposed (awaiting approval), or disabled (admin airbag). The save validation gate still applies on top — status is an additional gate.

Maker-checker on save

When a routine is saved (by an agent via the sidecar, or by a user via the UI/CLI), Crewship classifies it. A save is risky if any of these hold:
  • it declares an integrations_required the author crew can’t currently satisfy (the same resolver the run gate uses);
  • it has any http/egress step (or routine-level egress_targets);
  • it has any code-runtime step;
  • it declares credentials_required.
Otherwise it’s safe — only agent_run / transform / call_pipeline / wait steps over satisfiable integrations, no egress, no credentials.
  • Safe → active. Goes live immediately, exactly as before.
  • Risky → proposed. The routine is persisted but not runnable, and a blocking inbox item is raised for MANAGER+ (the same Inbox surface as proposed skills). Approve it to go live, or reject it.
A proposed (or disabled) routine refuses run / run_batch with 409 Conflict"routine is awaiting approval" or "routine is disabled". dry_run always previews a saved routine, so it’s never blocked.
OWNER/ADMIN escape hatch — skip_governance_gate. Symmetric with skip_test_gate: passing "skip_governance_gate": true on the user save (POST /api/v1/workspaces/{ws}/pipelines/save) forces a risky definition live as active and raises no review item. Honored only for OWNER/ADMIN (lower roles get 403); it is deliberately not available on the agent/sidecar save path (InternalSave), so agent-authored risky routines are always reviewed. This is what the crewship seed flow uses so a freshly-seeded workspace’s hand-curated starter routines are immediately runnable instead of stuck “awaiting approval”. Use it only for DSL you trust.

Approve / reject (MANAGER+)

crewship routine approve <slug>   # status → active, resolves the inbox item
crewship routine reject  <slug>   # removes the routine, resolves the inbox item
  • POST /api/v1/workspaces/{ws}/pipelines/{slug}/approveMANAGER+. Flips to active.
  • POST /api/v1/workspaces/{ws}/pipelines/{slug}/rejectMANAGER+. Soft-deletes the proposed routine.

Disable / enable (OWNER/ADMIN airbag)

crewship routine disable <slug>   # status → disabled; cancels in-flight runs
crewship routine enable  <slug>   # status → active
  • POST /api/v1/workspaces/{ws}/pipelines/{slug}/disableOWNER/ADMIN. Flips to disabled and cancels any in-flight runs of that routine immediately.
  • POST /api/v1/workspaces/{ws}/pipelines/{slug}/enableOWNER/ADMIN. Returns it to active.
List responses (and crewship routine list) carry status; filter the queue with:
crewship routine list --status proposed

Triggers

Cron schedules

crewship routine schedules create --slug summarize-text \
    --name "daily-summary" --cron "0 9 * * *" --timezone "Europe/Prague" \
    --inputs '{"text":"…default"}'
5-field cron expression. Scheduler runs in-process and ticks every 30s, so minimum resolution is 1 minute.
Single-instance only — running multiple replicas would double-fire (no leader election yet).

Wake gates

A plain cron fires the full routine — including its LLM steps — on every tick, even when there is nothing worth the model’s attention. A wake gate fixes that: the schedule references an agentless probe routine, the scheduler runs the probe first on each tick (free of LLM spend by the agentless guarantee), and the main routine fires only when the probe’s final output is truthy. Same falsy rule as step if: conditions — empty, false, 0, null, nil, no, off (case-insensitive) skip the tick; anything else wakes the routine.
# LLM only runs when the probe prints something truthy:
crewship routine schedules create --slug cost-report \
    --cron "*/15 * * * *" \
    --wake-slug cost-spike-probe --wake-inputs '{"threshold":"100"}'

# Drop the gate later (fire on every tick again):
crewship routine schedules update <schedule_id> --no-wake
This gives schedules a cost ladder: agentless schedule (always free) → wake-gated schedule (probe free, LLM only on signal) → plain schedule (today’s default, unchanged). A freshly-seeded workspace ships a working example: the Demo: feed watch — wake on change schedule runs the agentless feed-watch-probe every 15 minutes and only wakes feed-change-report (an agent routine) when the watched feed drifts from its baseline. Point the probe’s url/expected_items inputs at your own endpoint to make it real. Semantics worth knowing:
  • The probe must be agentless: true, live in the same workspace, and can’t be the schedule’s own routine — all validated when the schedule is saved.
  • Probe errors fail open: a broken or deleted probe wakes the main routine instead of going silently blind, and records last_wake_status: ERROR so you can see the probe needs fixing.
  • A skipped tick advances next_run_at but leaves last_run_* untouched — run telemetry stays strictly about main runs. Wake telemetry lives in wake_check_count / wake_fire_count / last_wake_at / last_wake_status, and routine schedules list shows it as <probe> woke/checked in the WAKE column.
  • Probe executions are regular runs with triggered_via: wake_check, so they’re auditable in run history and filterable out of dashboards.

Webhooks

crewship routine webhooks create --slug pr-review-structured \
    --hmac-secret "$(openssl rand -hex 32)" --rate-limit 30
Output reveals the public URL and signing secret once (Stripe-style). External services POST event payload to /api/v1/webhooks/{token}. With HMAC, sender includes header X-Crewship-Signature: sha256=<hex_hmac_of_body>, validated server-side via hmac.Equal (timing-safe). Rate limited per token, per minute, default 60. Delivery is asynchronous. The endpoint verifies the signature, rate limit, governance status, and the routine’s concurrency_key gate synchronously, reserves a run id, then answers immediately:
HTTP/1.1 202 Accepted
{"run_id": "run_c...", "status": "PENDING", "deduped": false}
The routine executes in the background under the returned run_id, so senders with short delivery timeouts (GitHub ~10s, Stripe ~5s) never time out on long agent runs — and a sender closing the connection early cannot cancel an in-flight run: the run’s context derives from the server lifecycle, not the HTTP request. Poll the handle for the outcome:
crewship routine logs <run_id>          # persisted run state (status, step outputs, error)
crewship routine logs <run_id> --full   # full step-by-step journal timeline
# or: GET /api/v1/workspaces/{wsId}/pipeline-runs/{runId}
Redelivered events dedupe synchronously: a replay (same Idempotency-Key / X-Crewship-Event-ID, or identical bytes within the dedupe window) answers 202 with "status": "DEDUPED" and the original run’s id — the routine executes exactly once. A proposed/disabled routine answers 409 (policy block, nothing dispatched). A delivery arriving while the routine’s concurrency_key gate is at capacity answers 429 with a Retry-After header before anything is dispatched — a 429 never consumes the idempotency key, so redelivering the same event later executes it normally. Runs that hit a wait: approval step park as WAITING and resume once the waitpoint is approved, exactly as with other triggers.
The signing secret is shown once. To rotate it: delete the webhook + create a new one. There is no in-place rotation by design.

Manual

crewship routine run <slug> --inputs '{...}' or click the Run button in the UI detail panel. Same execution path. After you click Run or Test run in the UI, a live Run activity rail appears inline in the detail panel showing the just-started run step by step (started → each step → completed/failed) — so status is visible immediately without switching to the Runs tab. Full run history stays in the Runs tab; see the Activity guide for the rail, and the toolbar Activity Bar for a workspace-wide “what’s running now” view.
The Test run button calls the public, JWT-authed test_run endpoint (POST /api/v1/workspaces/{workspaceId}/pipelines/test_run). It validates a draft — parse + Validate + the integration and resource preconditions + a dry_run pass (no agent is invoked; you can’t run an agent “dry”) — and on success mints an HMAC save_token bound to (workspace, definition hash, user). Save verifies that token, so a draft can’t be saved as “test passed” unless it actually passed test_run. The UI button and the CLI both use this endpoint. To preview a saved routine instead, use Dry run — it walks the saved definition, renders templates, and returns the declared manifest (the blast radius) without invoking anything.

Deferred dispatch: delay, ttl, debounce, priority

A triggered run can be parked instead of firing immediately — useful for “run 60s from now” scheduling or for coalescing a burst of near-duplicate triggers into one run:
# fire 60s from now, expire if not dispatched within 5 min, high priority
crewship routine run my-routine --delay 60 --ttl 300 --priority 9

# coalesce a burst: repeated triggers sharing a key collapse into one run
crewship routine run my-routine --debounce-key vendor-42 --debounce-window 30 --debounce-max 300

crewship routine pending list            # not-yet-fired deferred triggers
crewship routine pending cancel <id>     # cancel before it fires
An in-process dispatcher (5s tick) fires due rows highest-priority-first and expires rows past their ttl. Immediate runs (no --delay / --debounce-key) are unaffected. Full reference, including the underlying API fields: Deferred dispatch.

Dry-run preview

Two execution modes, distinguished by Mode in the request body and surface:
ModeSide effectsIncrements invocation_countCost
runyes (agents called) — productionyesreal
dry_runno (templates rendered, agents skipped)noestimated
There is no test_run mode: you cannot run an agent “dry” — it executes arbitrary scripts (bash, ansible, curl) whose side effects can’t be intercepted — so a real run is always run. The agent-authoring save gate validates a draft via a dry_run (structure + templates), not a real execution. Dry-run is the safe “what would this routine do?” preview, and an honest static plan — not a proof the run will succeed. It walks the DSL, renders all template substitutions against the supplied inputs, resolves each step’s execution tier (adapter + model), and reports a would_execute list with per-step estimated cost. It also returns the routine’s declared manifest — the full blast radius (integrations, egress, credentials, agents, routines, datastores, tools, has_http, has_code) — so the UI can show “would use: ansible, Postgres, discord.com, agent jordan”. (A definition that no longer parses leaves manifest null and still returns the report.) No agents are invoked; no journal entries beyond a single pipeline.dry_run audit row are written.
crewship routine dry-run email-fetch --inputs '{"since":"yesterday"}'
In the UI: click Dry run in the routine detail panel. The would_execute report renders inline above the tab bar with per-step:
  • Step ID + type
  • Resolved tier_adapter:tier_model (e.g. claude:claude-haiku-4-5)
  • Estimated cost in USD (order-of-magnitude only — the executor uses a flat token-density heuristic, not real pricing)
  • would_call_agent / would_call_pipeline target
The estimate is intentionally labelled “estimated” everywhere it surfaces — it’s a planning aid, not a quote. Real cost only lands once you switch to run.

Versioning + rollback

Every save creates a new immutable row in pipeline_versions (v79 migration). The pipelines.head_version column points at the current. Rollback creates a NEW version on top of HEAD whose definition equals the target’s:
crewship routine versions email-fetch
crewship routine rollback email-fetch --to 3
History is preserved — you can roll forward to a future version by another rollback. There is no “delete version” — if a version was bad, the trail of “v3 → v4 (rollback to v2) → v5 (fix)” is the audit story you keep.

Bundle export / import

Routines are portable across workspaces:
# Export from workspace A
crewship routine export email-fetch --include-history > email-fetch.json

# Import into workspace B
cat email-fetch.json | crewship routine import
The bundle format is crewship-pipeline-bundle/v1: routine row + (optionally) the full version chain + change_summary annotations. Author identity is rewritten on import so the importing user becomes the new author. Slug is preserved; if it conflicts in the destination workspace the existing row updates (new version), or you change the bundle’s slug before import.

HITL waitpoints

A routine that includes a wait step of kind: approval parks the run on a DB-backed waitpoint — without holding a goroutine or an execution slot. The triggering crewship routine run returns immediately with status: WAITING and the token (see the wait step). Operators decide via three paths:
# CLI
crewship routine waitpoints list
crewship routine waitpoints approve <token> --comment "LGTM"
crewship routine waitpoints reject  <token> --comment "needs revision"

# UI: /routines → routine → Wait sub-tab → click Approve / Reject
# API: POST /api/v1/workspaces/{ws}/pipelines/waitpoints/{token}/approve
When you trigger a run from the routine detail page and it parks on an approval gate, you don’t have to leave the page: the Run activity panel switches its status to amber “Waiting for approval”, pins the parked step in the timeline, and shows an inline Approve / Reject banner (with an optional comment) for that run’s waitpoint. The same pending item also surfaces in the top-bar Inbox bell and the left-sidebar Inbox badge, so an approval waiting on you is visible whether or not you’re looking at the routine. The workspace-wide Wait points tab and /inbox remain the places to act on approvals for runs you didn’t just start. The decision comment is forwarded to the parked run as the wait step’s output, so downstream steps can read approval rationale via {{ steps.<wait_step_id>.output }}.

Durability and restart recovery

Run state is persisted to the pipeline_runs table at every step boundary: when a step starts, current_step_id is stamped; when it completes, the full step-outputs map and accumulated cost are flushed. A hard kill (crash, OOM, kill -9) therefore loses at most the step that was in flight. At boot, the server scans for runs left in queued/running from the previous process lifetime and resumes them from the next unfinished step:
  • Completed steps are restored, not re-executed. Their outputs feed downstream {{ steps.X.output }} templates exactly as if the process had never died.
  • The in-flight step re-executes from scratch — at-least-once semantics. For an agent_run step this means the agent call is re-issued (and re-billed); http/code steps with external side effects may fire twice. Design steps to be idempotent where that matters.
  • Runs parked on a wait approval step re-attach to the original waitpoint token. No duplicate approval card is created; the pending inbox item stays answerable across the restart, and approving it resumes the run.
  • DAG runs (needs:) resume at wave granularity — the parallel scheduler flushes state when each wave completes, so a kill mid-wave re-executes that wave’s unfinished steps.
  • call_pipeline boundaries are NOT persisted. A kill while a nested pipeline is executing re-runs the entire nested pipeline on resume — the parent’s call_pipeline step is the unit of recovery, and the nested run’s own per-step progress is not checkpointed. Keep nested pipelines short or idempotent if a mid-flight kill matters to you.
  • The accumulated cost is restored too, so max_cost_usd keeps counting across the restart instead of resetting. Caveat: cost is flushed at step boundaries, so whatever the killed in-flight step had already spent before the kill is not in the restored total — the cap under-counts the true spend by up to one step’s worth (and the re-executed step is billed again on top).
  • A resumed run that finds its concurrency slot occupied waits and retries with capped exponential backoff (2s doubling up to 60s) instead of failing — losing the slot race to a freshly-fired scheduled run is a timing collision, not a reason to abandon hours of restored work. If the server shuts down while a run is still waiting for its slot, the row stays in-flight and the next boot resumes it again.
  • A waitpoint that timed out while the process was down resumes, observes the expired token, and fails with wait step "X" (approval) timed out — distinct from an operator clicking deny (… denied).
When persisted state is insufficient to resume safely, the run is stamped interrupted instead — never silently dropped, never wrongly resumed. Fallback triggers: the pipeline row is gone or no longer parses, the definition changed since the run started (content-hash mismatch — this catches in-place edits even when every step id survives, not just renamed/removed steps), unreadable persisted inputs/outputs, or a non-resumable mode (only live run rows resume; a dry_run preview row from a previous lifetime is never re-run). The reason lands in the run’s error_message. Graceful shutdowns are different: an in-flight run cancelled by shutdown is finalized as cancelled (a terminal state) and is not resumed at next boot. Resume targets hard kills, where no terminal write could land. Set CREWSHIP_PIPELINE_RESUME=off to disable resume and restore the older stamp-everything-interrupted behaviour — useful if a crash loop would otherwise re-burn the in-flight agent step’s tokens on every restart. Default is on.

Online eval sampler

The online sampler watches completed routine runs and grades a configurable percentage of them through the existing rubric grader so production traffic continuously feeds the drift detector — not just on-demand replays or scheduled regression suites. Per-routine DSL:
name: nightly-summary
eval:
  online:
    sample_rate: 0.05              # grade 5% of runs; 0 disables, 1.0 grades every run
    grader_agent_slug: qa-grader   # references an agent in the AUTHOR crew with a rubric prompt
What happens on a tick (default cadence: every 1 minute):
  1. Scan pipeline_runs WHERE status = 'completed' AND completed_at > watermark.
  2. For each candidate, resolve the routine DSL. If eval.online is absent or sample_rate <= 0, skip.
  3. Draw from crypto/rand. If the sample lands above sample_rate, skip.
  4. Otherwise, INSERT into eval_runs with kind = 'online', status = 'queued'. The existing grader worker picks it up and writes the result back.
FieldNotes
sample_rateFloat [0, 1]. 0 disables — useful as an incident “pause grading” toggle. 1.0 is expensive but ok for newly-launched routines while calibrating the grader. Realistic prod default: 0.05.
grader_agent_slugRequired when sample_rate > 0. Missing grader is treated as a deterministic skip (logged once, no retry storm).
Correctness guarantees:
  • Schema-layer idempotencyeval_runs has a partial UNIQUE INDEX (pipeline_run_id) WHERE kind='online'. A duplicate sampler instance or a crash-recovery watermark replay can attempt the same enqueue twice; the second collapses to a no-op rather than queueing twice and double-billing the grader.
  • (completed_at, id) tuple cursor — parallel fan-out steps that complete at the same nanosecond all get graded; a timestamp-only cursor would orphan siblings.
  • Stuck-on-error watermark — a transient per-row error (resolver outage, entropy outage, enqueue conflict) freezes the watermark at the row before the failure so the next tick retries. Deterministic skips (no eval config, sample roll missed) advance normally.
  • Trace correlation — the eval row carries routine_slug + pipeline_run_id so an operator clicking a low-scoring grade in the eval UI lands on the actual trace via pipeline_run_id -> trace.
Limitations:
  • Watermark is in-memory only. On process restart it resets to now - 1h; outages longer than that leave a gap in grading coverage. The UNIQUE index keeps reprocessing harmless.
  • Page cap is 500 rows/tick. A workspace processing > 500 routine runs/minute at sample_rate = 1.0 will accumulate backlog; the watermark only advances past successfully-handled rows, so the backlog drains across subsequent ticks rather than being lost.

Validation gates and credential leak guards

Each step’s validation block runs after the step output materializes. The schema is a JSON Schema draft 2020-12 subset (most of type, required, properties, items, pattern, format, enum, etc.) plus three Crewship extensions:
  • must_not_contain: ["API_KEY=", "Bearer "] — output must include none of these substrings
  • must_contain: ["##"] — output must include all of these
  • min_length / max_length — convenience for non-JSON outputs
The must_not_contain gate is the credential-leak tripwire: if an agent is about to leak a real API key in its output, the gate fails the step before downstream consumers see it. Pair with on_fail: "abort" (set at the step level, alongside validation) for hard stop, or escalate_tier if you want the higher model to retry without the leak.

Observability

Every routine run emits a sequence of journal entries:
  • pipeline.run.started — once at run begin
  • pipeline.step.started — per step
  • pipeline.step.completed / pipeline.step.failed / pipeline.step.validation_failed — per step terminal
  • pipeline.run.completed / pipeline.run.failed — once at run end
Plus WebSocket broadcast on the workspace channel for live UI updates (PipelineRunNode in the orchestration graph + Runs sub-tab waterfall both subscribe).

Live visibility — what is a routine doing right now?

While at least one routine run is in flight, the dashboard surfaces it from anywhere in the app:
  • Header chip — a pulsing “N routines running” pill appears in the toolbar next to the Online / Crews pills (hidden when nothing is active). If any run is parked on a human approval it turns amber and appends ”· M awaiting approval”. Clicking opens a popover with the six newest active runs — routine name, short run id, elapsed time, cost so far, current step — each with Open trace ↗ (deep-link to /activity?run=<id>), Cancel (same manage-tier RBAC as the Runs tab), and a Review → shortcut into the routine for parked runs. With more than six active runs a “View all N running →” footer jumps to the Activity rail pre-filtered to the active bucket (/activity?status=active).
  • /routines sidebar — a routine with an active run gets a pulsing blue dot and a sub-line showing the current step and elapsed time (▶ ask-casey · 0:12); a parked run shows the amber ⏸ awaiting approval variant.
  • /routines list table — the status cell swaps the historical “last run” pill for a live Running (or amber Awaiting approval) pill with current step · elapsed · cost, and live routines bubble to the top of the table regardless of the chosen column sort.
All three surfaces share one workspace-scoped subscription (GET /api/v1/workspaces/{ws}/pipeline-runs?status=active + the pipeline.run.*/pipeline.step.started events, 3s poll while anything is active). status=active bundles running, queued, paused and waiting — the status a run parks in while a HITL waitpoint awaits a decision.

Run warnings

before_all/after_all/on_failure lifecycle hooks run best-effort: a failing after_all or on_failure hook (a teardown step like credential-release or cost-meter-close) never flips the run’s terminal status — a before_all failure is different and fails the run outright, since nothing downstream ran. A failed after_all/on_failure hook is instead recorded as a structured warning on the run so it doesn’t silently vanish into server logs while the run reports completed. Fetch it via GET /api/v1/workspaces/{ws}/pipeline-runs/{runId} — the response’s warnings array (always present, empty when there are none) has one entry per failed hook:
{ "stage": "hook after_all", "message": "credential release step timed out", "at": "2026-07-02T12:01:05Z" }
crewship routine logs <run_id> (the slug-free state lookup) prints a Warnings: section when the run has any, alongside the existing Error: line for the run’s own terminal status.

Per-step cost + duration

Both the UI Runs sub-tab waterfall and crewship routine logs <run_id> --slug X surface the cost_usd and duration_ms fields the executor stamps on every pipeline.step.completed event. Same data, two presentation surfaces:
  • UI: right-aligned columns next to each step row, with a footer total summing the run. An em-dash () renders for events that don’t carry cost (pipeline.step.started, .failed, live-only echoes) — easier to scan than $0.0000 next to real values.
  • CLI logs: extra DURATION and COST columns in the timeline output. Same em-dash rule for non-positive values.
Worker tier escalation is visible too — a failed validation gate that retries on a higher tier emits a second step.started+step.completed pair with its own cost on the retry, so the column-summed footer reflects the full spend (including retries), not just the first attempt.

Run tags & metadata

Tag and annotate a run at invoke time to make it filterable later:
crewship routine run cost-spike-probe \
  --tag prod --tag billing \
  --metadata '{"source":"manual","ticket":"OPS-42"}'
Tags are workspace-scoped labels (max 10/run, lowercased) surfaced on the run detail and usable as a filter (crewship routine runs <slug> --tag prod); replays inherit the source run’s tags. Metadata is a free-form JSON object stored on the run and returned by GET /pipeline-runs/{id} — set at invoke time today (mid-run mutation and {{ run.metadata.X }} templating are a follow-up).

Replay & error fingerprinting

Failed runs are bucketed by a stable error fingerprint (failing step + normalized message), so recurring failures group together instead of scrolling past one-by-one in run history:
crewship routine replay <run_id>                   # re-invoke with the run's captured inputs
crewship routine errors                             # list fingerprint groups
crewship routine bulk-replay --fingerprint <fp>      # replay every run in a fingerprint group after a fix
A replay is stamped is_replay=true + replay_of=<run_id>; gate a step on {{ env.is_replay }} to skip a side effect (e.g. a notification) on replay. Full reference: Run observability: tags, metadata, replay, errors. Three terminal-side observability surfaces, ordered by when you’d reach for each:
PhaseCommandWhy
Before runcrewship routine doctor <slug>Preflight ✓/⚠/✗ checklist — catches missing crew provisioning, agent slug typos, missing credentials, contradictory validation gates, tight cost caps. Cheap to run, returns in milliseconds, fails CI builds on FAIL.
During runcrewship routine watch <slug>Polls the runs endpoint and prints events as an ANSI-coloured timeline. --json for JSON Lines piping into jq; --once exits after the first run terminates (CI-friendly).
After runcrewship routine logs <run_id> --slug XOne-shot post-mortem dump — every step’s prompt, output, validation verdict, cost. Use to investigate a specific failure surfaced by watch or by the UI.
For variance characterisation across many runs (is this routine production-ready at this tier?), use crewship routine bench. For matrix-level cross-tier consistency (which scenarios diverge between Haiku and Opus?), use crewship eval scenarios and crewship eval baseline diff for CI regression gates.

RBAC

RoleReadRunSaveSchedules / Webhooksskip_test_gate
VIEWER
MEMBER
MANAGER
ADMIN
OWNER

Common patterns

”Run this routine every day at 9 AM"

crewship routine schedules create --slug daily-status-digest \
    --cron "0 9 * * *" --timezone "Europe/Prague"

"Trigger this routine from a GitHub Actions hook"

crewship routine webhooks create --slug pr-review-structured \
    --hmac-secret "$(openssl rand -hex 32)" --rate-limit 30
# Use the printed token + secret as Settings → Webhooks in the GH repo.

"Validate routine DSL in CI before committing”

- run: crewship routine validate routines/*.json
Exit code 1 on any invalid file.

”Discover what an agent is about to do, without running it”

crewship routine dry-run summarize-text --inputs '{"text":"..."}'
Returns a would_execute report with which agent, which tier, the rendered prompt, and an estimated cost. Zero side effects.

”Watch a metric on a tight cron, spend tokens only on a spike”

Author an agentless probe that prints true/false, then attach it as a wake gate:
crewship routine schedules create --slug incident-summary \
    --cron "*/5 * * * *" --wake-slug error-rate-probe
The probe runs every 5 minutes for free; the LLM routine fires only on the ticks where the probe’s output is truthy. routine schedules list shows how often the gate woke vs. checked.

”Cancel an in-flight run”

crewship routine cancel <run_id>
Signals the run goroutine; it stops at the next safe point and emits pipeline.run.failed with reason “cancelled”.

”Serialise runs per tenant / customer / repo”

Use concurrency_key with a template referencing the tenant-identifying input:
{
  "concurrency_key": "{{ inputs.account_id }}",
  "max_concurrent": 1,
  "inputs": [
    { "name": "account_id", "type": "string", "required": true }
  ]
}
Two requests for the same account_id queue rather than fan out; requests for different account_ids run in parallel. Pair with the Idempotency-Key header for webhook handlers (see the Concurrency + idempotency recipe). The platform fails fast if the rendered key is empty (missing input + no literal prefix in the template) — see Troubleshooting for why and how to fix.

Troubleshooting

Before going through this list: run crewship routine doctor <slug> first. Most “blind alley” failures (missing crew provisioning, agent slug typo, missing credential, contradictory validation gate, cost cap too tight) surface as a ✗ check on doctor before they ever cost an LLM call.
The save endpoint requires the validation gate cleared. The server validates the DSL on save (parse + schema + cycle detection); to clear the residual gate, send "last_test_run_at" (RFC3339, within the last 5 minutes) + "last_test_run_passed": true in the /save body. This is the body-trust path crewship routine save uses, and it mirrors the sidecar agent-authoring flow which sets the same fields after the internal dry-run validation. There is no public test_run endpoint to mint a token from — a real run can’t be done “dry”, so a real run is reserved for the first live crewship routine run.Alternative: crewship apply --skip-test-gate (CLI — the flag lives on apply, not routine save) / "skip_test_gate": true (API) if your role is OWNER/ADMIN and you trust the DSL — bypasses the gate explicitly.
Check that ?include_steps=1 is in the runs URL the UI fetches. The list endpoint defaults to run-level only to keep payload bounded; the detail panel passes the flag explicitly. After a refresh, the waterfall populates from journal entries plus live WebSocket events.
The scheduler ticks every 30s; with single-binary deployment, restarting the server resets the in-memory tick cursor. Pending schedules whose next_run_at has passed will fire on the next tick after restart. If you’ve edited the cron expression, next_run_at recomputes from now() — so a 0 9 * * * schedule edited at 10:30 won’t fire until 9 AM tomorrow.
HMAC mismatch: check the X-Crewship-Signature: sha256=<hex> header, computed as HMAC-SHA256(signing_secret, request_body) over the raw bytes the sender sent (not a re-serialized form). The server uses hmac.Equal for comparison so timing-safe.
Two-tier escalation walks the fallback chain. With on_fail: escalate_tier, a failed Haiku run retries on Sonnet (5-10× more expensive), then Opus (20-50× more expensive) before giving up. Tighten validation gates (loosen must_contain, raise min_length), or set max_cost_usd on the routine to abort the run between steps when a cost ceiling is hit.
Since resume-from-step landed, a restart between the wait step starting and a decision arriving re-attaches the run to its pending waitpoint at boot — approving via crewship routine waitpoints approve <token> resumes it. If the run shows interrupted instead, its persisted state was insufficient to resume (the reason is in the run’s error_message); the orphaned waitpoint can still be listed and rejected to clear the inbox. The boot log lines pipeline boot recovery done (resume-from-step) resumed=N interrupted=M and pipeline waitpoint store wired (...) stranded_pending=N show what recovery did.
The DSL declared a non-empty concurrency_key template but the rendered value is an empty string — typically because a referenced input was omitted at trigger time. Full error message:
pipeline: concurrency_key rendered to empty value (referenced input missing or empty): template "{{ inputs.<NAME> }}"
Why fail-fast: a routine that declares concurrency_key: "{{ inputs.account_id }}" is asking the platform to serialise runs per tenant. If account_id is missing, treating the empty key as “no gate” would silently allow unlimited parallelism for a routine the author explicitly asked us to serialise — a denial-of-self by misconfiguration. The executor refuses to start the run.Fixes, in order of preference:
  1. Supply the input. The caller (curl / crewship routine run --inputs '{...}' / scheduler / webhook) needs to pass the referenced input. For webhooks this means the inputs_template in the webhook config must produce it from the incoming payload.
  2. Set a default on the InputSpec. If the input is genuinely optional but you still want a tenant-style gate, add "default": "global" (or any non-empty sentinel) to the InputSpec. The platform merges defaults before rendering the key.
  3. Use a literal prefix. A template like "vendor-alert-{{ inputs.vendor_id }}" always renders non-empty (the literal vendor-alert- survives even when vendor_id is missing); the key still gates, just less granularly.
  4. Drop the gate. If you genuinely don’t want concurrency control, omit concurrency_key entirely (do NOT set it to "" — that’s the unset sentinel).

CLI reference

CommandPurpose
routine listWorkspace routine catalog
routine get <slug>Detail with full DSL
routine saveSave from JSON file
routine run <slug>Invoke against execution tier (--tier-override fast/smart/...). Deferred dispatch via --delay/--ttl/--debounce-key/--priority; exactly-once retries via --idempotency-key <key> (+ --idempotency-ttl <s>) — a duplicate key inside the window returns the original run as DEDUPED instead of executing again (same contract as the webhook Idempotency-Key header)
routine dry-run <slug>Preview without invoking
routine delete <slug>Soft-delete
routine doctor <slug>Preflight checklist (✓/⚠/✗) — catches blind alleys before run
routine bench <slug>N-runs variance characterisation — pass-rate + cost/latency stats
routine runs <slug>Run history from journal
routine records <slug>Run history from pipeline_runs projection (filterable by status)
routine logs <run_id>Full journal trace for one run (post-mortem)
routine watch <slug>Live event stream
routine cancel <run_id>Signal in-flight run
routine replay <run_id>Re-invoke a run with its captured inputs
routine errorsList failed-run error fingerprint groups
routine bulk-replay --fingerprint <fp>Replay every run in a fingerprint group
routine step-override set/list/clearLive prompt/model override for one step, no version bump
routine pending list/cancel <id>Inspect / cancel deferred (--delay / --debounce-key) triggers
routine versions <slug>Version history
routine rollback <slug> --to NRoll back to v N
routine export <slug>JSON bundle to stdout
routine import [bundle.json]Load from file or stdin
routine validate [file.json]Offline DSL check
routine schedules ...Cron CRUD
routine webhooks ...Webhook CRUD
routine waitpoints ...HITL inbox + decisions
For batch evaluation across the eval-* fleet (matrix sweep, head-to-head tier compare, baseline regression diff), see the eval CLI. Cookbook recipe 6 walks the eval-driven promotion workflow end-to-end. The pipeline alias is preserved for back-compat on the legacy subcommands (pipeline list, pipeline run, pipeline get, etc.). The post-rename additions — schedules, webhooks, waitpoints, validate, watch, logs, records, bench, doctor — are only registered under routine, so scripts that need them must switch.

Backend reference

  • Migrations v78 (pipelines + execution_tiers_json), v79 (versions + waitpoints), v80 (schedules), v81 (run idempotency), v82 (webhooks), v115 (schedule wake gates)
  • Source: internal/pipeline/ (~10 700 LOC, 36 files)
  • API: internal/api/pipelines.go, pipeline_runs.go, pipeline_schedules.go, pipeline_webhooks.go
  • Sidecar: internal/sidecar/pipelines.go (port 9119)
  • Frontend: app/(dashboard)/routines/, components/features/routines/, hooks use-pipelines*

Production notes

Two caveats worth internalizing before you lean on routines for anything time- or cost-sensitive — both are permanent architectural properties of the current single-binary deployment, not bugs waiting on a fix:
Single-instance only. The scheduler, run registry, and online eval watermark all assume one process. Running multiple replicas would double-fire cron schedules and webhooks (no leader election yet) — see Cron schedules.
Crash recovery is at-least-once, step-granular, not exact. A hard kill loses at most the in-flight step, which re-executes (and re-bills) on resume. call_pipeline has no nested checkpointing — a kill mid-nested-run re-executes the entire nested pipeline. max_cost_usd under-counts after a crash, since whatever the killed step had already spent isn’t in the restored total. See Durability and restart recovery for the full resume matrix.

Limitations (current MVP)

These are known gaps in the current MVP — none block production use, but they shape what you can rely on today.
  • Resume is at-least-once, step-granular — restart recovery re-enters runs from the last persisted step boundary (see Durability and restart recovery). The step that was in flight at the kill re-executes from scratch; there is no sub-step checkpointing, nested call_pipeline runs re-execute in full, and DAG runs recover at wave granularity. Runs whose definition changed since the run started (content-hash mismatch) fall back to interrupted. max_cost_usd under-counts true spend after a crash — see Production notes.
  • Single-instance scheduler — running multiple replicas would double-fire schedules.
  • credentials_required is declared-only — unlike integrations_required and resources, it is not yet enforced at run time (see Required integrations).
  • No cross-adapter tier swap yet — same-provider model swap (Haiku→Opus) works; Claude→Gemini swap requires a shorthand→constant mapping not yet wired.
  • No NL→cron converter — ops still hand-type 0 9 * * *. Foundation PRD has the design; not in MVP.