What is a Routine?
A Routine is a declarative recipe for repeatable AI work. You author it once (or an agent authors it for you when it spots a repetitive pattern), it runs the same way every time, and it lives in your workspace as a reusable asset alongside crews, skills, and credentials. Each routine bundles:
- A JSON DSL definition — inputs, outputs, ordered or DAG-structured steps, validation gates, declared egress, credential requirements
- Authorship metadata — which crew, which agent (or user), via which path
- Triggers — cron schedules and webhook tokens that fire it autonomously
- Run history — every invocation as an immutable journal trace with step-level events
- Versions — every save creates a new version; rollback is one API call
| Layer | Scope | Authoring | Use case |
|---|---|---|---|
| Routine | atomic | AI-authored OR human DSL | One repeatable AI workflow |
| Recipe (future) | composite | Marketplace template | Crew + agents + integrations + routines bundled |
| Cyclic Issue (future) | issue-tracker | recurring user issue | “Standup every Monday” without leaving Issues |

API paths and database tables still use the name pipelines for backwards compatibility.
Naval theme. Routines are the boring, accurate, repeated procedures a ship’s crew runs every day — the equivalent of naval drills in the Crewship metaphor. Programs your AI agents follow.
Three authoring paths
Crewship is unusual among workflow systems because agents author routines, not just execute them. Three paths converge on the same pipelines table:
Agent

The most common path. An agent that’s solved a repetitive problem twice posts to http://localhost:9119/pipelines/save from inside its container; the sidecar injects authorship and forwards to the main API. Next time, the [AVAILABLE ROUTINES] block in the system prompt advertises it to other crews.

You can also just describe a routine in chat — “make a routine that summarizes yesterday’s commits and posts to Slack.” A crew Lead carries the bundled routine-author skill (an authoring playbook): it clarifies the essentials, grounds the DSL in the crew’s [CONNECTED INTEGRATIONS] and [AVAILABLE ROUTINES], writes and test-runs the routine, then tells you whether it went live or landed as a proposed routine for a Manager to approve (see Governance).

UI

Open /routines, click + New routine, pick a starter template, edit the DSL JSON, click Test & Save. test_run runs against the execution tier; on pass the routine is persisted with authored_via=user_api and the JWT user as author. An existing routine’s DSL is also fully editable in-place from its detail page’s Editor / JSON tab (CodeMirror, format + revert + copy) — see the note below on how that path’s Save differs from Test & Save.

CLI

crewship routine save --name "..." --definition file.json --author-crew <crew_id>. The server validates the DSL on save (same gate as the UI). CI-friendly: validate offline first with crewship routine validate file.json, then save.

Editing an existing routine’s DSL bypasses the test-run gate. The Editor / JSON tab’s Save button posts straight to /pipelines/save with skip_test_gate: true — it does not send the definition through test_run first. The server only honors that flag for OWNER/ADMIN (lower roles get 403), so this is a fast lane for trusted operators, not a hole in the gate: a MEMBER/MANAGER can view and copy the DSL from this tab but can’t save an edit without going through the create flow’s Test & Save. A follow-up will chain test_run → save_token → save behind one button so any MANAGER+ role can edit safely.

DSL anatomy

Minimal valid routine:

Top-level fields
| Field | Type | Notes |
|---|---|---|
| dsl_version | string | Always "1.0" for now. Forward-compat field. |
| name | string | Slug-friendly identifier. Workspace-unique. |
| display_name | string | Optional pretty label. |
| description | string | One-line summary; shown in lists + [AVAILABLE ROUTINES]. |
| inputs[] | array of InputSpec | Declared parameters. |
| outputs[] | array of OutputSpec | Read from the final step’s output by name. Documentary in MVP. |
| steps[] | array of Step | Sequential by default; DAG with needs:. |
| execution_tier | object | {preferred, fallback} overriding workspace default. |
| estimated_cost_usd | number | Author estimate; UI surfaces overrun warnings. |
| egress_targets | array of string | Declared outbound hosts. Enforced at run time for every http step and http hook, including redirect hops; empty/omitted = unrestricted at the routine layer (the crew network policy and SSRF guards still apply). See http. |
| integrations_required | array of string | Declared third-party integrations (Composio connector slugs like "github", "slack"). Declared AND enforced at run time — see Required integrations below. Slugs are lowercased/trimmed. |
| credentials_required | array of CredReq | Declared credential needs (type + scope). Declared-only today (not yet enforced at run time). |
| resources | object | Agent-declared capability manifest — datastores[] and tools[] the routine touches that can’t be inferred from the step graph. Declared AND enforced at run time against the crew’s container resources — see Required resources below. See Capability manifest below for the shape. |
| max_cost_usd | number | Runtime cost cap; run aborts if exceeded between steps. |
| concurrency_key | string | Template that gates how many runs can be in flight at once for the same workspace + rendered key. Typical pattern "{{ inputs.account_id }}" to serialise per-tenant runs. Empty (default) = no gate. See Concurrency + idempotency recipe and the fail-fast note in Troubleshooting. |
| max_concurrent | integer | Cap on simultaneous runs sharing the resolved concurrency_key. Defaults to 1 when concurrency_key is set (strict per-key serialisation); ignored otherwise. |
| guardrails | object | Per-routine Lookout policy. guardrails.input.prompt_injection.action is one of block (default), sanitize, or log. |
| eval | object | Continuous online grading. See Online eval below. |
| agentless | boolean | Token-zero guarantee: the routine may only contain http / code / wait / transform steps and can never invoke an LLM. See Agentless routines. |
InputSpec
type is one of string | integer | number | boolean | array | object. min / max are *float64 so decimal constraints work for number types.
Template substitution
Anywhere a string is interpolated (prompt, http URL/body/headers, wait until, code, transform, conditional if), a {{ ... }} placeholder resolves against:
- inputs.X — declared input value
- steps.Y.output — full text output of an earlier step
- steps.Y.output.path — JSON path into a step’s output (when output parses as JSON)
- env.AUTHOR_CREW_NAME / etc. — read-only allowlist of execution context
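The resolution rules in the list above can be sketched in a few lines of Python. This is a minimal illustration, not Crewship's renderer: the names render and lookup are hypothetical, the placeholder grammar is an assumption, and env lookups are left out.

```python
import json
import re

# Assumed placeholder grammar: {{ dotted.reference }} with optional spaces.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z0-9_.\-]+)\s*\}\}")

def lookup(ref, inputs, step_outputs):
    """Resolve one reference such as inputs.x or steps.y.output.path."""
    parts = ref.split(".")
    if parts[0] == "inputs" and len(parts) == 2:
        return str(inputs[parts[1]])
    if parts[0] == "steps" and len(parts) >= 3 and parts[2] == "output":
        output = step_outputs[parts[1]]
        if len(parts) == 3:
            return output  # full text output of the earlier step
        value = json.loads(output)  # path access needs JSON output
        for key in parts[3:]:
            value = value[key]
        return str(value)
    raise KeyError(f"unknown placeholder: {ref}")

def render(template, inputs, step_outputs):
    """Plain substitution: every placeholder is replaced by its resolved text."""
    return PLACEHOLDER.sub(
        lambda m: lookup(m.group(1), inputs, step_outputs), template
    )
```

In this sketch an unknown input raises a KeyError at render time; the point of checking references when the routine is saved is to surface that class of mistake earlier.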
Save-time validation walks every template-bearing field and rejects placeholders that reference unknown inputs or unseen-yet steps — so a typo’d {{ inputs.tyop }} fails at save, not at 3am in production.

Required integrations

A routine can DECLARE the third-party integrations (Composio connectors) it needs with the top-level integrations_required array, and the run path will
block a run when the executing crew hasn’t connected one of them. This
closes the “integration forgotten” gap — the same shape as egress_targets,
but for connectors instead of hosts.
- Declared is always allowed. Saving a routine that names an integration the crew lacks is fine — declaring is a contract, not a connection. Only the run enforces. (Save-time validation only checks the list is well-formed: non-empty slugs, lowercased/trimmed, within a sane count cap.)
- Enforced at run time. Before a run starts, Crewship resolves the integrations the routine’s author crew has connected and compares them to integrations_required. If any are missing the run is blocked with an RFC 7807 Problem Details response, HTTP 422, carrying a machine-readable missing_integrations: string[] member and a human detail like routine requires integration "slack" not connected for crew "Marketing". The UI uses missing_integrations to render a Connect action. The run never starts — no tokens spent.
- No-op fast path. An empty / absent integrations_required does zero resolution work — no overhead for routines that don’t use it.
- Fail-open. If integration-availability resolution itself errors, the run is allowed (a warning is logged). A bug in resolution must never wedge every run of every routine — a forgotten integration is a soft, recoverable failure; a hard block on all runs would be a self-inflicted outage.
- run is gated; dry_run is not. A live run executes against the author crew’s agents, so it’s gated (fail fast rather than land an unrunnable routine). The internal save gate’s draft validation applies the same integration check. dry_run is a preview that invokes nothing, so it’s left ungated — it shows what the routine would need, even integrations the crew hasn’t connected yet.
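The gate semantics above (normalize, fast path, compare, fail-open) could look roughly like this. It is a sketch under assumptions: missing_integrations and the resolve_connected callback are hypothetical names, and the real resolver reads the crew's connector bindings rather than taking a callback.

```python
import logging

def normalize(slugs):
    """Lowercase and trim slugs, drop empties, dedupe."""
    return sorted({s.strip().lower() for s in slugs if s.strip()})

def missing_integrations(required, resolve_connected):
    """Return the required slugs the crew has not connected.

    An empty result means the run may start.
    """
    required = normalize(required or [])
    if not required:
        return []  # no-op fast path: nothing declared, nothing resolved
    try:
        connected = set(normalize(resolve_connected()))
    except Exception:
        # Fail-open: a resolver bug must not block every run.
        logging.warning("integration resolution failed; allowing run")
        return []
    return [slug for slug in required if slug not in connected]
```

A non-empty result is what would populate the missing_integrations member of the 422 response.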
Integration availability is resolved from the crew’s connected Composio
connectors (the MCP server rows the bind flow writes). Two limitations:
under the workspace default connector (every agent inherits all connected
apps), the gate treats integrations as available without enumerating them; and
resolution reflects what’s wired, not live connection health (a revoked
account still reads as available until its binding is removed).
credentials_required remains declared-only for now — it is not yet
enforced at run time the way integrations_required is.

Capability manifest
Every routine has a derived capability manifest — the full “what this routine touches” blast radius. The detail API (GET .../routines/{id}) returns
it under a manifest member so the UI can render a data-flow diagram and
governance can reason about the whole footprint of a run, not just its visible
steps.
Most of the manifest is auto-derived from the DSL — you don’t declare it:
| Manifest field | Derived from |
|---|---|
| integrations | integrations_required (normalized) |
| egress | egress_targets plus any host parseable from http step URLs (templated {{ }} URLs are skipped) |
| credentials | credentials_required |
| agents | agent_slug of every agent_run step |
| routines | pipeline_slug of every call_pipeline step |
| tools | code step runtimes plus declared resources.tools |
| datastores | declared resources.datastores |
| has_http / has_code | whether any http / code step exists |

The derivation walks routine-level (before_all / after_all / on_failure) and
per-step (before / after) lifecycle hooks too — a capability hidden in an
on_failure hook is still part of the blast radius. Every list is deduped +
sorted and always rendered as [] (never null), so the diagram is stable.
The resources block
Two things can’t be inferred from the step graph: datastores a routine
reads/writes, and CLI tools/scripts it runs. In production, code steps
aren’t wired — agents run scripts (ansible, kubectl, …) via an agent_run
step that shells out, so static analysis can’t see them. Declare them so they
show up in the manifest:
- datastores[].type — any string, but use the canonical vocabulary redis | postgres | mysql | mongodb | other so it matches what the crew’s container catalog reports (the precondition gate compares type case-insensitively; a non-canonical type simply won’t match a crew resource).
- tools[].type — any string; canonical examples ansible | terraform | kubectl | bash | python | other. Same matching rule as datastores.
- name / note are free-form and optional.

Validation rejects an entry with an empty type, or more than 32 of either kind.
The run path enforces availability — see Required resources.
Required resources
The resources block is a run-time precondition gate, the resource sibling
of Required integrations: it states what the routine
requires (datastores like Postgres/Redis, CLI tools like ansible), and the
run path checks it against what the executing crew’s container actually has
([CONTAINER RESOURCES] — the crew’s sidecar datastores + installed tools).
Semantics:
- Enforced at run time. Before a run starts, Crewship resolves the author crew’s container resources and compares them to the declared resources.datastores + resources.tools. If the routine requires a datastore or tool the crew doesn’t have, the run is blocked with an RFC 7807 Problem Details response, HTTP 422, carrying a machine-readable missing_resources: [{ kind, type, name }] member (kind is "datastore" or "tool") and a human detail like routine needs datastore postgres, tool ansible, not available to crew "Ops". The run never starts — no tokens spent — until the resource is provisioned on the crew.
- Matching. A required datastore is satisfied when the crew has a datastore of the same engine type (case-insensitive; the service name is advisory and not matched). A required tool is satisfied when the crew has an installed tool whose name matches the required type (the tool’s name, e.g. deploy.yml, is the concrete artifact and isn’t matched against the container).
- No-op fast path. A routine that declares no resources block does zero resolution work. Only the declared resources are gated — code-step runtimes folded into the manifest’s tools (e.g. cel) are internal executor runtimes, not crew CLIs, and are never gated.
- Fail-open. If the routine has no author crew, or resource resolution itself errors, the run is allowed (a warning is logged) — same reasoning as the integration gate: a resolver bug must never wedge every run.
- run is gated; dry_run is not. (The internal save gate’s draft validation applies the same resource check.)
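The matching rule above can be illustrated with a short sketch. The function name and its arguments are hypothetical; the assumption is that the crew's container resources have already been resolved into a list of datastore engine types and a list of installed tool names.

```python
def missing_resources(resources, crew_datastore_types, crew_tool_names):
    """Compare a routine's resources block with what the crew's container has.

    Only the type field is matched, case-insensitively; name is advisory.
    """
    have = {
        "datastore": {t.lower() for t in crew_datastore_types},
        "tool": {t.lower() for t in crew_tool_names},
    }
    missing = []
    for kind, key in (("datastore", "datastores"), ("tool", "tools")):
        for item in resources.get(key, []):
            if item["type"].lower() not in have[kind]:
                missing.append(
                    {"kind": kind, "type": item["type"], "name": item.get("name", "")}
                )
    return missing
```

Each entry in the result has the same kind / type / name shape as the missing_resources member described above.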
Agents already know what the container has
You don’t have to tell an agent that its crew runs Postgres or ships kubectl —
Crewship surfaces it automatically. Every agent’s system prompt now includes a
[CONTAINER RESOURCES] block listing the crew’s datastores (derived from the
crew’s sidecar services — a service named postgres is reachable at host
postgres on its declared port) and installed CLI tools (derived from the
crew’s devcontainer features + mise toolchain, e.g. ansible, kubectl, git,
python, node). The agent is instructed to use these directly instead of
probing or trying to install them. So when you author a routine that needs a
datastore, declare it under resources.datastores and connect via the host/port
the agent already sees in that block. The block is omitted entirely when a crew
has no services and no notable tools.
Step types
agent_run
complexity resolves through the workspace’s execution_tiers_json mapping into (adapter, model). With complexity: "fast", the agent’s CLI gets --model claude-haiku-4-5-20251001 (or whatever the workspace mapped fast to). model_override is the explicit pin that wins over complexity.
on_fail lives at the step level (not inside validation) and is one of escalate_tier | abort | retry_step — escalate_tier walks the fallback chain (e.g., Haiku → Sonnet → Opus) until validation passes or the chain exhausts.
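As a rough illustration of escalate_tier, the loop below walks a fallback chain until validation passes or the chain is exhausted. The function and callback names are hypothetical, and the chain is passed in directly rather than resolved from execution_tiers_json.

```python
def run_with_escalation(chain, run_step, validate):
    """Try each model in the fallback chain until the output validates.

    chain:    ordered model names, cheapest first
    run_step: callable(model) -> step output
    validate: callable(output) -> bool
    """
    attempted = []
    for model in chain:
        attempted.append(model)
        output = run_step(model)
        if validate(output):
            return output, attempted
    raise RuntimeError(f"validation failed on every tier: {attempted}")
```

With a chain of Haiku, Sonnet, Opus, a validation failure on the first tier costs one extra call on the second; only a failure on every tier fails the step.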
call_pipeline
http
Every http step (and http hook) passes two host gates at run time, checked pre-flight AND on every redirect hop (CheckRedirect callback):
- Routine layer — egress_targets. When the routine declares egress_targets, the request host must be one of the declared hosts or a subdomain of one (api.x.com matches target x.com; evilx.com does not). A routine that declares no egress_targets is unrestricted at this layer — backward-compatible with every routine that predates the field. The SSRF guard (private/link-local IPs, DNS-rebind-safe dialing) applies regardless.
- Crew layer — network policy. The authoring crew’s network policy (network_mode + allowed domains — the same dial that governs the crew’s agent containers) also applies to direct http steps. Crews on the default free mode are unaffected; a restricted crew’s routines can only reach the crew’s allowed domains (exact host match, same as the container proxy).
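The routine-layer rule (exact host or subdomain, never a bare suffix) is easy to get wrong, so here is a minimal sketch of it. The function name is hypothetical and the lowercasing is an assumption; the crew-layer policy and the SSRF guard are separate checks that this does not cover.

```python
def host_allowed(host, egress_targets):
    """Routine-layer check: exact host or a subdomain of a declared target."""
    if not egress_targets:
        return True  # nothing declared: unrestricted at this layer
    host = host.lower()
    for target in egress_targets:
        target = target.lower()
        # The leading dot is what keeps evilx.com from matching x.com.
        if host == target or host.endswith("." + target):
            return True
    return False
```

Because redirects are checked too, the same function would be applied to the host of every hop, not only to the original URL.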
credential_ref.type is resolved at run time
against the workspace credential vault by type (case-insensitive match
on the vault type, e.g. API_KEY, GENERIC_SECRET), never by ID — so a
shared routine runs against any workspace holding a credential of the right
type. Only ACTIVE credentials resolve; credentials pinned to another crew
are invisible; when several match, the authoring crew’s own credential wins
over workspace-shared ones and the newest wins within each group (rotation).
The decrypted value goes into the outbound request only — never into logs,
the journal, or step output. If nothing matches, the request is sent without
credentials (public endpoints keep working). Injection schemes: bearer
(default), header with explicit name, query with explicit name.
wait
Three kinds: approval (HITL token), datetime (sleep until ISO timestamp), event (filter on journal events). The DB-backed waitpoint store survives process restart for the approval kind: at boot, the run that was parked on the wait step is resumed and re-attaches to the original pending token, so the approval stays answerable (see Durability and restart recovery).
Async, non-blocking. Hitting a kind: approval gate does not hold a goroutine or fail the run. A foreground crewship routine run returns promptly with status: WAITING (exit 0) and a waitpoint token, and releases its execution slot while it waits. Approving or rejecting resumes the run from the gate (already-completed steps are restored/skipped); a rejection resolves the run to FAILED cleanly rather than stranding it. Parked approvals whose timeout_sec elapsed are reconciled at the next boot scan.
Approve via UI Inbox, CLI crewship routine waitpoints approve <token>, or API.
code
A code step has a runtime. Two runtimes are wired today — expr and cel — both in-process, pure-Go, token-zero (no container, no LLM, no filesystem, no network). Use expr for a single boolean comparison (wake-gate probes); reach for cel as soon as you need real logic (booleans, arithmetic, string/list ops) — its own code comments call it out as the general-purpose deterministic primitive.
runtime: expr (wired, token-zero)
expr evaluates a single comparison and emits true or false:
- Operators: >, >=, <, <=, ==, !=.
- The body is Render()-ed first, so {{ inputs.x }} / {{ steps.y.output }} placeholders substitute before evaluation.
- Anything that isn’t a single comparison (multiple operators, function calls, arbitrary code) fails closed with a clear error — expr is deliberately not a scripting language.

The typical use is a wake-gate probe (emit true only when work is needed). See Wake gates.
runtime: cel (wired, token-zero, general logic)
cel evaluates a Google CEL expression — non-Turing-complete (every expression provably terminates), so it keeps the token-zero / no-execution-surface guarantee of expr while adding boolean operators (&&, ||, !), arithmetic, string ops, list/map membership, ternaries, and field access. Reach for it when expr’s single comparison isn’t enough. A bool result emits true/false; numeric and string results emit their canonical string form. Compile/eval errors fail closed.
runtime: bash | python | go (rejected at author time)
These runtimes are not wired, and validation rejects them when the routine is authored (use runtime: expr or cel, or convert the step to agent_run).
transform
Conditional if
Any step can carry "if": "{{ inputs.run_summary }}". The placeholder is rendered as a plain string — empty / false / 0 / no / off (case-insensitive) → step skipped and marked <skipped> in StepOutputs; anything else counts as truthy. Templates are plain substitution, not expression evaluation, so put the boolean upstream (e.g. set inputs.run_summary from the caller) rather than writing == / != inside the if value.
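A small sketch of the truthiness rule makes the "plain substitution" caveat concrete. The function name is hypothetical; the falsy set is the one listed above.

```python
# Falsy strings for a rendered if value, compared case-insensitively.
FALSY = {"", "false", "0", "no", "off"}

def is_truthy(rendered):
    """Return True when the rendered if string should let the step run."""
    return rendered.lower() not in FALSY
```

Under this rule a rendered value such as 1 == 2 is truthy, because it is just a non-empty string that is never evaluated. That is why the comparison has to happen upstream.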
DAG with needs[]
Steps whose needs are all satisfied execute in parallel (one goroutine wave per ready set). Final output picks the unique leaf node; for multi-leaf DAGs the first leaf in source order wins.
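The wave scheduling and leaf selection described above can be sketched as follows. The step shape (a dict with id and needs keys) and both function names are assumptions for illustration, not the DSL's confirmed field names.

```python
def plan_waves(steps):
    """Group steps into waves; each wave holds the steps whose needs are done."""
    remaining = {s["id"]: set(s.get("needs", [])) for s in steps}
    done, waves = set(), []
    while remaining:
        # dicts keep insertion order, so each wave stays in source order
        ready = [sid for sid, needs in remaining.items() if needs <= done]
        if not ready:
            raise ValueError("cycle or unknown dependency in needs")
        waves.append(ready)
        done.update(ready)
        for sid in ready:
            del remaining[sid]
    return waves

def final_leaf(steps):
    """Pick the step nothing depends on; the first in source order wins."""
    needed = {n for s in steps for n in s.get("needs", [])}
    return [s["id"] for s in steps if s["id"] not in needed][0]
```

Each inner list is one ready set. In a real executor the steps of a wave would run concurrently; this sketch only computes the plan.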
Lifecycle hooks (before_all / after_all / on_failure)
Routine-level hooks run deterministic side-channel steps around the main execution — a clean home for setup/teardown that isn’t part of the visible step graph. Hook steps must be code / http / transform (no agent_run — a hook must not recurse or spend tokens):
after_all and on_failure are best-effort — logged, but they never change the run’s outcome. Hooks fire only on the top-level run (not nested call_pipeline expansions) and are skipped on resume re-entry and dry-run. Per-step before / after hooks also exist and are included in the capability manifest walk. Full reference: Lifecycle hooks.
Per-step overrides (no version bump)
Tweak a single step’s prompt or model without bumping the routine version — the override is applied at run start over the versioned DSL, so the durable, authored definition stays the source of truth while an operator can patch and clear a live behavior quickly:
Agentless routines
Declare "agentless": true to get a token-zero guarantee: the routine can never invoke an LLM, no matter who edits it later. This is what makes high-frequency automation (health checks, metric probes, TLS expiry watches) free to run on a tight cron.
The http step fetches a JSON status, transform projects the number out of it, and the expr code step emits the boolean true/false — all token-zero. (A pure http + transform projection of an already-boolean field works too; reach for expr when you need the comparison.)
Enforced at two layers:
- Save time — validation rejects agent_run (direct LLM spend), call_pipeline (the target resolves by slug at runtime, so a nested routine could gain an agent step later and silently break the guarantee), and eval.online with sample_rate > 0 (online grading runs a grader agent against the routine’s runs).
- Run time — the executor independently refuses to dispatch an LLM-capable step inside an agentless run, so even a definition written before the validator existed fails closed.
Two-tier execution
The economic value-prop: an Opus-class authoring model designs the routine, a Haiku-class executor model runs each invocation. Workspace execution_tiers_json maps complexity classes to (adapter, model):
Each step’s complexity annotation drives the resolver. With on_fail: "escalate_tier", a failed validation walks the fallback chain — practically: Haiku tries first, Sonnet on validation fail, Opus on second fail.
Tier override at runtime. The CLI flag --model <model> is constructed from the resolved tier and passed to the agent’s CLI adapter, so a routine’s complexity: "fast" actually fires Haiku, not the agent’s default. CLIAdapter is preserved (so the agent’s CLAUDE_CODE / GEMINI_CLI / etc. wiring stays intact); only the model name swaps.

Save validation gate

Save endpoints (sidecar /pipelines/save, user /api/v1/workspaces/{ws}/pipelines/save, internal /api/v1/internal/pipelines/save) require the routine to clear a validation gate before it persists. The gate is a dry-run validation of the draft, not a real execution — there is no “test run” mode (you cannot run an agent dry). The sidecar agent-authoring flow forwards the draft to /api/v1/internal/pipelines/test_run, which parses, schema-validates, and dry-runs it (rendering every template, invoking no agent); on success it sets last_test_run_passed and forwards to save.
This is the self-improvement loop: an authoring agent that writes brittle DSL gets a structured failure report it can read and revise from. Without the gate, MVP would ship pipelines that pass schema but fail at runtime. Real execution happens on the first live run, and risky routines are human-reviewed (governance) before they go live.
Governance — agent proposes, human approves the risky ones
Routines have a lifecycle status (migration v128): active (live + runnable), proposed (awaiting approval), or disabled (admin airbag). The save validation gate still applies on top — status is an additional gate.
Maker-checker on save
When a routine is saved (by an agent via the sidecar, or by a user via the UI/CLI), Crewship classifies it. A save is risky if any of these hold:
- it declares an integrations_required the author crew can’t currently satisfy (the same resolver the run gate uses);
- it has any http/egress step (or routine-level egress_targets);
- it has any code-runtime step;
- it declares credentials_required.

Everything else is safe: agent_run / transform / call_pipeline / wait steps over satisfiable integrations, no egress, no credentials.
- Safe → active. Goes live immediately, exactly as before.
- Risky → proposed. The routine is persisted but not runnable, and a blocking inbox item is raised for MANAGER+ (the same Inbox surface as proposed skills). Approve it to go live, or reject it.
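The classification rules above reduce to a short predicate. This is a sketch only: classify_save is a hypothetical name, the step dicts are assumed to carry a type key, and the list of unsatisfied integrations is assumed to come from the same resolver the run gate uses.

```python
def classify_save(definition, unsatisfied_integrations):
    """Return the status a saved routine would land in.

    definition:               parsed DSL as a dict (step shape is assumed)
    unsatisfied_integrations: slugs the author crew cannot currently satisfy
    """
    steps = definition.get("steps", [])
    risky = (
        bool(unsatisfied_integrations)
        or bool(definition.get("egress_targets"))
        or any(step.get("type") in ("http", "code") for step in steps)
        or bool(definition.get("credentials_required"))
    )
    return "proposed" if risky else "active"
```

Any single risky signal is enough to route the save to review; a routine made only of agent_run, transform, call_pipeline, and wait steps with none of those signals goes straight to active.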
A proposed (or disabled) routine refuses run / run_batch with 409 Conflict — "routine is awaiting approval" or "routine is disabled". dry_run always previews a saved routine, so it’s never blocked.
OWNER/ADMIN escape hatch — skip_governance_gate. Symmetric with skip_test_gate: passing "skip_governance_gate": true on the user save (POST /api/v1/workspaces/{ws}/pipelines/save) forces a risky definition live as active and raises no review item. Honored only for OWNER/ADMIN (lower roles get 403); it is deliberately not available on the agent/sidecar save path (InternalSave), so agent-authored risky routines are always reviewed. This is what the crewship seed flow uses so a freshly-seeded workspace’s hand-curated starter routines are immediately runnable instead of stuck “awaiting approval”. Use it only for DSL you trust.

Approve / reject (MANAGER+)

- POST /api/v1/workspaces/{ws}/pipelines/{slug}/approve — MANAGER+. Flips to active.
- POST /api/v1/workspaces/{ws}/pipelines/{slug}/reject — MANAGER+. Soft-deletes the proposed routine.

Disable / enable (OWNER/ADMIN airbag)

- POST /api/v1/workspaces/{ws}/pipelines/{slug}/disable — OWNER/ADMIN. Flips to disabled and cancels any in-flight runs of that routine immediately.
- POST /api/v1/workspaces/{ws}/pipelines/{slug}/enable — OWNER/ADMIN. Returns it to active.

List responses (and crewship routine list) carry status; filter the queue with:
Triggers
Cron schedules
Wake gates
A plain cron fires the full routine — including its LLM steps — on every tick, even when there is nothing worth the model’s attention. A wake gate fixes that: the schedule references an agentless probe routine, the scheduler runs the probe first on each tick (free of LLM spend by the agentless guarantee), and the main routine fires only when the probe’s final output is truthy. Same falsy rule as step if: conditions — empty, false, 0, null, nil, no, off (case-insensitive) skip the tick; anything else wakes the routine.
The example schedule runs feed-watch-probe every 15 minutes and only wakes feed-change-report (an agent routine) when the watched feed drifts from its baseline. Point the probe’s url/expected_items inputs at your own endpoint to make it real.
Semantics worth knowing:
- The probe must be agentless: true, live in the same workspace, and can’t be the schedule’s own routine — all validated when the schedule is saved.
- Probe errors fail open: a broken or deleted probe wakes the main routine instead of going silently blind, and records last_wake_status: ERROR so you can see the probe needs fixing.
- A skipped tick advances next_run_at but leaves last_run_* untouched — run telemetry stays strictly about main runs. Wake telemetry lives in wake_check_count / wake_fire_count / last_wake_at / last_wake_status, and routine schedules list shows it as <probe> woke/checked in the WAKE column.
- Probe executions are regular runs with triggered_via: wake_check, so they’re auditable in run history and filterable out of dashboards.
Webhooks
Webhook deliveries go to /api/v1/webhooks/{token}. With HMAC, the sender includes the header X-Crewship-Signature: sha256=<hex_hmac_of_body>, validated server-side via hmac.Equal (timing-safe). Rate limited per token, per minute, default 60.
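For senders, computing the signature header is a few lines with the Python standard library. This is a sketch of the sha256=<hex> scheme described above; the function names are hypothetical, and verify only mirrors the server-side check for illustration (the server itself uses Go's hmac.Equal).

```python
import hashlib
import hmac

def sign(secret: bytes, body: bytes) -> str:
    """Build the X-Crewship-Signature value for a raw request body."""
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"

def verify(secret: bytes, body: bytes, header: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), header.encode())
```

Sign the exact bytes that go on the wire. Re-serializing the JSON after signing changes the body and invalidates the signature.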
Delivery is asynchronous. The endpoint verifies the signature, rate limit, governance status, and the routine’s concurrency_key gate synchronously, reserves a run id, then answers immediately:
The response carries the reserved run_id, so senders with short delivery timeouts (GitHub ~10s, Stripe ~5s) never time out on long agent runs — and a sender closing the connection early cannot cancel an in-flight run: the run’s context derives from the server lifecycle, not the HTTP request. Poll the handle for the outcome:
A duplicate delivery (same Idempotency-Key / X-Crewship-Event-ID, or identical bytes within the dedupe window) answers 202 with "status": "DEDUPED" and the original run’s id — the routine executes exactly once. A proposed/disabled routine answers 409 (policy block, nothing dispatched). A delivery arriving while the routine’s concurrency_key gate is at capacity answers 429 with a Retry-After header before anything is dispatched — a 429 never consumes the idempotency key, so redelivering the same event later executes it normally. Runs that hit a wait: approval step park as WAITING and resume once the waitpoint is approved, exactly as with other triggers.
Manual
crewship routine run <slug> --inputs '{...}' or click the Run button in the UI detail panel. Same execution path.
After you click Run or Test run in the UI, a live Run activity rail appears inline in the detail panel showing the just-started run step by step (started → each step → completed/failed) — so status is visible immediately without switching to the Runs tab. Full run history stays in the Runs tab; see the Activity guide for the rail, and the toolbar Activity Bar for a workspace-wide “what’s running now” view.
The Test run button calls the public, JWT-authed test_run endpoint (POST /api/v1/workspaces/{workspaceId}/pipelines/test_run). It validates a draft — parse + Validate + the integration and resource preconditions + a dry_run pass (no agent is invoked; you can’t run an agent “dry”) — and on success mints an HMAC save_token bound to (workspace, definition hash, user). Save verifies that token, so a draft can’t be saved as “test passed” unless it actually passed test_run. The UI button and the CLI both use this endpoint. To preview a saved routine instead, use Dry run — it walks the saved definition, renders templates, and returns the declared manifest (the blast radius) without invoking anything.

Deferred dispatch: delay, ttl, debounce, priority
A triggered run can be parked instead of firing immediately — useful for “run 60s from now” scheduling or for coalescing a burst of near-duplicate triggers into one run:ttl. Immediate runs (no --delay / --debounce-key) are unaffected. Full reference, including the underlying API fields: Deferred dispatch.
Dry-run preview
Two execution modes, distinguished by Mode in the request body and surface:
| Mode | Side effects | Increments invocation_count | Cost |
|---|---|---|---|
| run | yes (agents called) — production | yes | real |
| dry_run | no (templates rendered, agents skipped) | no | estimated |

There is no test_run mode: you cannot run an agent “dry” — it executes arbitrary scripts (bash, ansible, curl) whose side effects can’t be intercepted — so a real run is always run. The agent-authoring save gate validates a draft via a dry_run (structure + templates), not a real execution.
Dry-run is the safe “what would this routine do?” preview, and an honest static plan — not a proof the run will succeed. It walks the DSL, renders all template substitutions against the supplied inputs, resolves each step’s execution tier (adapter + model), and reports a would_execute list with per-step estimated cost. It also returns the routine’s declared manifest — the full blast radius (integrations, egress, credentials, agents, routines, datastores, tools, has_http, has_code) — so the UI can show “would use: ansible, Postgres, discord.com, agent jordan”. (A definition that no longer parses leaves manifest null and still returns the report.) No agents are invoked; no journal entries beyond a single pipeline.dry_run audit row are written.
The would_execute report renders inline above the tab bar with per-step:
- Step ID + type
- Resolved tier_adapter:tier_model (e.g. claude:claude-haiku-4-5)
- Estimated cost in USD (order-of-magnitude only — the executor uses a flat token-density heuristic, not real pricing)
- would_call_agent / would_call_pipeline target
run.
Versioning + rollback
Every save creates a new immutable row in pipeline_versions (v79 migration). The pipelines.head_version column points at the current version. Rollback creates a NEW version on top of HEAD whose definition equals the target’s:
Bundle export / import
Routines are portable across workspaces. The bundle format is crewship-pipeline-bundle/v1: routine row + (optionally) the full version chain + change_summary annotations. Author identity is rewritten on import so the importing user becomes the new author. Slug is preserved; if it conflicts in the destination workspace the existing row updates (new version), or you change the bundle’s slug before import.
HITL waitpoints
A routine that includes a wait step of kind: approval parks the run on a DB-backed waitpoint, without holding a goroutine or an execution slot. The triggering crewship routine run returns immediately with status: WAITING and the token (see the wait step). Operators decide via three paths:
/inbox remain the places to act on approvals for runs
you didn’t just start.
The decision comment is forwarded to the parked run as the wait step’s output, so downstream steps can read approval rationale via {{ steps.<wait_step_id>.output }}.
Durability and restart recovery
Run state is persisted to the pipeline_runs table at every step boundary: when a step starts, current_step_id is stamped; when it completes, the full step-outputs map and accumulated cost are flushed. A hard kill (crash, OOM, kill -9) therefore loses at most the step that was in flight.
At boot, the server scans for runs left in queued/running from the previous process lifetime and resumes them from the next unfinished step:
- Completed steps are restored, not re-executed. Their outputs feed downstream {{ steps.X.output }} templates exactly as if the process had never died.
- The in-flight step re-executes from scratch (at-least-once semantics). For an agent_run step this means the agent call is re-issued (and re-billed); http/code steps with external side effects may fire twice. Design steps to be idempotent where that matters.
- Runs parked on a wait approval step re-attach to the original waitpoint token. No duplicate approval card is created; the pending inbox item stays answerable across the restart, and approving it resumes the run.
- DAG runs (needs:) resume at wave granularity: the parallel scheduler flushes state when each wave completes, so a kill mid-wave re-executes that wave’s unfinished steps.
- call_pipeline boundaries are NOT persisted. A kill while a nested pipeline is executing re-runs the entire nested pipeline on resume: the parent’s call_pipeline step is the unit of recovery, and the nested run’s own per-step progress is not checkpointed. Keep nested pipelines short or idempotent if a mid-flight kill matters to you.
- The accumulated cost is restored too, so max_cost_usd keeps counting across the restart instead of resetting. Caveat: cost is flushed at step boundaries, so whatever the killed in-flight step had already spent before the kill is not in the restored total. The cap under-counts the true spend by up to one step’s worth (and the re-executed step is billed again on top).
- A resumed run that finds its concurrency slot occupied waits and retries with capped exponential backoff (2s doubling up to 60s) instead of failing: losing the slot race to a freshly-fired scheduled run is a timing collision, not a reason to abandon hours of restored work. If the server shuts down while a run is still waiting for its slot, the row stays in-flight and the next boot resumes it again.
- A waitpoint that timed out while the process was down resumes, observes the expired token, and fails with wait step "X" (approval) timed out, distinct from an operator clicking deny (… denied).
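The slot-retry schedule from the list above (2s doubling up to 60s) can be written as a small generator. The function name is hypothetical, and whether the real executor adds jitter is not stated here, so none is modelled.

```python
from itertools import islice
from typing import Iterator


def slot_retry_delays(initial: float = 2.0, cap: float = 60.0) -> Iterator[float]:
    """Yield delays that double from `initial` and never exceed `cap`."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, cap)


print(list(islice(slot_retry_delays(), 7)))
# [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```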
A run that cannot be resumed safely is stamped interrupted instead: never silently dropped, never wrongly resumed. Fallback triggers: the pipeline row is gone or no longer parses; the definition changed since the run started (content-hash mismatch, which catches in-place edits even when every step id survives, not just renamed/removed steps); persisted inputs/outputs are unreadable; or the run is in a non-resumable mode (only live run rows resume; a dry_run preview row from a previous lifetime is never re-run). The reason lands in the run’s error_message.
Graceful shutdowns are different: an in-flight run cancelled by shutdown is finalized as cancelled (a terminal state) and is not resumed at next boot. Resume targets hard kills, where no terminal write could land.
Set CREWSHIP_PIPELINE_RESUME=off to disable resume and restore the older stamp-everything-interrupted behaviour — useful if a crash loop would otherwise re-burn the in-flight agent step’s tokens on every restart. Default is on.
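For a linear routine, the resume rule (restore completed steps, re-run from the first unfinished one) reduces to the loop below. It is a simplified sketch: resume and fake_execute are hypothetical names, and DAG waves, waitpoints, and cost restoration are deliberately left out.

```python
from typing import Callable


def resume(
    step_ids: list,
    persisted_outputs: dict,
    execute: Callable[[str, dict], str],
) -> dict:
    """Resume a linear run: restore finished steps, execute everything else."""
    outputs = dict(persisted_outputs)
    for step_id in step_ids:
        if step_id in outputs:
            continue  # completed before the kill: restored, not re-executed
        # The in-flight step and all later steps execute from scratch.
        outputs[step_id] = execute(step_id, outputs)
    return outputs


executed = []


def fake_execute(step_id: str, outputs: dict) -> str:
    executed.append(step_id)
    return f"{step_id}-out"


# "fetch" finished before the kill; "summarize" was in flight; "post" never started.
result = resume(["fetch", "summarize", "post"], {"fetch": "fetch-out"}, fake_execute)
print(executed)  # ['summarize', 'post']
```

The re-run of "summarize" is where at-least-once semantics show up: any side effect it had before the kill happens a second time.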
Online eval sampler
The online sampler watches completed routine runs and grades a configurable percentage of them through the existing rubric grader, so production traffic, not just on-demand replays or scheduled regression suites, continuously feeds the drift detector. It is configured per routine under eval.online in the DSL. Each tick runs these steps:
- Scan pipeline_runs WHERE status = 'completed' AND completed_at > watermark.
- For each candidate, resolve the routine DSL. If eval.online is absent or sample_rate <= 0, skip.
- Draw from crypto/rand. If the sample lands above sample_rate, skip.
- Otherwise, INSERT into eval_runs with kind = 'online', status = 'queued'. The existing grader worker picks it up and writes the result back.
| Field | Notes |
|---|---|
| sample_rate | Float in [0, 1]. 0 disables grading, which is useful as an incident “pause grading” toggle. 1.0 is expensive but OK for newly launched routines while calibrating the grader. Realistic prod default: 0.05. |
| grader_agent_slug | Required when sample_rate > 0. A missing grader is treated as a deterministic skip (logged once, no retry storm). |
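The per-run sampling decision can be sketched as follows. This is a Python analogue, not the server's Go code: should_sample is a hypothetical helper, secrets stands in for crypto/rand, and the optional draw parameter exists only to make the example deterministic.

```python
import secrets
from typing import Optional


def should_sample(sample_rate: float, draw: Optional[float] = None) -> bool:
    """Return True when a completed run should be enqueued for grading."""
    if sample_rate <= 0:
        return False  # absent or zero rate: deterministic skip
    if draw is None:
        # Uniform draw in [0, 1) from the operating system's CSPRNG.
        draw = secrets.randbelow(1_000_000) / 1_000_000
    return draw <= sample_rate  # a draw above the rate is skipped


print(should_sample(0.05, draw=0.01), should_sample(0.05, draw=0.5))  # True False
```

With sample_rate = 1.0 every draw falls at or below the rate, so every completed run is graded.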
Design properties:
- Schema-layer idempotency: eval_runs has a partial UNIQUE INDEX (pipeline_run_id) WHERE kind='online'. A duplicate sampler instance or a crash-recovery watermark replay can attempt the same enqueue twice; the second collapses to a no-op rather than queueing twice and double-billing the grader.
- (completed_at, id) tuple cursor: parallel fan-out steps that complete at the same nanosecond all get graded; a timestamp-only cursor would orphan siblings.
- Stuck-on-error watermark: a transient per-row error (resolver outage, entropy outage, enqueue conflict) freezes the watermark at the row before the failure so the next tick retries. Deterministic skips (no eval config, sample roll missed) advance normally.
- Trace correlation: the eval row carries routine_slug + pipeline_run_id so an operator clicking a low-scoring grade in the eval UI lands on the actual trace via pipeline_run_id -> trace.

Known limits:
- Watermark is in-memory only. On process restart it resets to now - 1h; outages longer than that leave a gap in grading coverage. The UNIQUE index keeps reprocessing harmless.
- Page cap is 500 rows/tick. A workspace processing > 500 routine runs/minute at sample_rate = 1.0 will accumulate backlog; the watermark only advances past successfully-handled rows, so the backlog drains across subsequent ticks rather than being lost.
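To see why the cursor is a (completed_at, id) tuple rather than a bare timestamp, consider three rows that share one completion time. The sketch below is an in-memory stand-in for the SQL scan; next_page and the row shape are assumptions for illustration.

```python
from datetime import datetime, timezone


def next_page(rows: list, cursor: tuple, limit: int = 500) -> list:
    """Return up to `limit` rows strictly after the (completed_at, id) cursor."""
    return sorted(r for r in rows if r > cursor)[:limit]


t = datetime(2025, 1, 1, tzinfo=timezone.utc)
rows = [(t, "run-a"), (t, "run-b"), (t, "run-c")]  # identical timestamps

# The cursor sits on run-a; tuple comparison still returns its siblings.
tuple_page = [run_id for _, run_id in next_page(rows, (t, "run-a"))]
print(tuple_page)  # ['run-b', 'run-c']

# A timestamp-only cursor (completed_at > t) would skip all of them.
timestamp_page = [run_id for ts, run_id in rows if ts > t]
print(timestamp_page)  # []
```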
Validation gates and credential leak guards
Each step’s validation block runs after the step output materializes. The schema is a JSON Schema draft 2020-12 subset (most of type, required, properties, items, pattern, format, enum, etc.) plus three Crewship extensions:
- must_not_contain: ["API_KEY=", "Bearer "] (output must include none of these substrings)
- must_contain: ["##"] (output must include all of these)
- min_length / max_length (convenience for non-JSON outputs)

The must_not_contain gate is the credential-leak tripwire: if an agent is about to leak a real API key in its output, the gate fails the step before downstream consumers see it. Pair it with on_fail: "abort" (set at the step level, alongside validation) for a hard stop, or with escalate_tier if you want the higher model to retry without the leak.
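The three extensions are simple string checks, which the sketch below mirrors. check_gates is a hypothetical helper, and measuring length in characters (rather than bytes) is an assumption; the JSON Schema part of validation is not covered.

```python
def check_gates(output: str, validation: dict) -> list:
    """Return a list of gate failures; an empty list means the output passes."""
    failures = []
    for needle in validation.get("must_not_contain", []):
        if needle in output:
            failures.append(f"must_not_contain: found {needle!r}")
    for needle in validation.get("must_contain", []):
        if needle not in output:
            failures.append(f"must_contain: missing {needle!r}")
    if "min_length" in validation and len(output) < validation["min_length"]:
        failures.append("min_length: output too short")
    if "max_length" in validation and len(output) > validation["max_length"]:
        failures.append("max_length: output too long")
    return failures


gates = {"must_not_contain": ["API_KEY=", "Bearer "], "must_contain": ["##"], "min_length": 10}
print(check_gates("## Summary\nAll green.", gates))  # []
print(check_gates("## Debug\nAPI_KEY=sk-123", gates))
# ["must_not_contain: found 'API_KEY='"]
```

A non-empty failure list is what a step-level on_fail policy would then act on.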
Observability
Every routine run emits a sequence of journal entries:
- pipeline.run.started: once, when the run begins
- pipeline.step.started: once per step
- pipeline.step.completed / pipeline.step.failed / pipeline.step.validation_failed: once per step, at its terminal state
- pipeline.run.completed / pipeline.run.failed: once, when the run ends
Live visibility — what is a routine doing right now?
While at least one routine run is in flight, the dashboard surfaces it from anywhere in the app:
- Header chip: a pulsing “N routines running” pill appears in the toolbar next to the Online / Crews pills (hidden when nothing is active). If any run is parked on a human approval it turns amber and appends “· M awaiting approval”. Clicking opens a popover with the six newest active runs (routine name, short run id, elapsed time, cost so far, current step), each with Open trace ↗ (deep-link to /activity?run=<id>), Cancel (same manage-tier RBAC as the Runs tab), and a Review → shortcut into the routine for parked runs. With more than six active runs, a “View all N running →” footer jumps to the Activity rail pre-filtered to the active bucket (/activity?status=active).
- /routines sidebar: a routine with an active run gets a pulsing blue dot and a sub-line showing the current step and elapsed time (▶ ask-casey · 0:12); a parked run shows the amber ⏸ awaiting approval variant.
- /routines list table: the status cell swaps the historical “last run” pill for a live Running (or amber Awaiting approval) pill with current step · elapsed · cost, and live routines bubble to the top of the table regardless of the chosen column sort.

These surfaces are fed by GET /api/v1/workspaces/{ws}/pipeline-runs?status=active plus the pipeline.run.*/pipeline.step.started events (3s poll while anything is active). status=active bundles running, queued, paused, and waiting; waiting is the status a run parks in while a HITL waitpoint awaits a decision.
Run warnings
The after_all and on_failure lifecycle hooks run best-effort: a failing after_all or on_failure hook (a teardown step like credential-release or cost-meter-close) never flips the run’s terminal status. A before_all failure is different and fails the run outright, since nothing downstream ran. A failed after_all/on_failure hook is instead recorded as a structured warning on the run, so it doesn’t silently vanish into server logs while the run reports completed.
Fetch it via GET /api/v1/workspaces/{ws}/pipeline-runs/{runId} — the response’s warnings array (always present, empty when there are none) has one entry per failed hook:
crewship routine logs <run_id> (the slug-free state lookup) prints a Warnings: section when the run has any, alongside the existing Error: line for the run’s own terminal status.
Per-step cost + duration
Both the UI Runs sub-tab waterfall and crewship routine logs <run_id> --slug X surface the cost_usd and duration_ms fields the executor stamps on every pipeline.step.completed event. Same data, two presentation surfaces:
- UI: right-aligned columns next to each step row, with a footer total summing the run. An em-dash (—) renders for events that don’t carry cost (pipeline.step.started, .failed, live-only echoes), which is easier to scan than $0.0000 next to real values.
- CLI logs: extra DURATION and COST columns in the timeline output. Same em-dash rule for non-positive values.

A retried step emits a second step.started + step.completed pair with its own cost on the retry, so the column-summed footer reflects the full spend (including retries), not just the first attempt.
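The footer total is a plain sum over pipeline.step.completed events, which is why retries are included automatically. The event dictionaries below are an assumed shape for illustration (only cost_usd and duration_ms are field names taken from this page), and run_totals is a hypothetical helper.

```python
def run_totals(events: list) -> tuple:
    """Sum cost_usd and duration_ms over pipeline.step.completed events."""
    completed = [e for e in events if e["type"] == "pipeline.step.completed"]
    cost = sum(e.get("cost_usd", 0.0) for e in completed)
    duration = sum(e.get("duration_ms", 0) for e in completed)
    return cost, duration


events = [
    {"type": "pipeline.step.started", "step_id": "draft"},
    {"type": "pipeline.step.completed", "step_id": "draft", "cost_usd": 0.002, "duration_ms": 900},
    # Retry of the same step: a second started/completed pair with its own cost.
    {"type": "pipeline.step.started", "step_id": "draft"},
    {"type": "pipeline.step.completed", "step_id": "draft", "cost_usd": 0.03, "duration_ms": 4100},
]
cost, duration = run_totals(events)
print(f"${cost:.4f} {duration}ms")  # $0.0320 5000ms
```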
Run tags & metadata
Tag and annotate a run at invoke time to make it filterable later. Tags filter run history (for example, crewship routine runs <slug> --tag prod); replays inherit the source run’s tags. Metadata is a free-form JSON object stored on the run and returned by GET /pipeline-runs/{id}. It is set at invoke time today (mid-run mutation and {{ run.metadata.X }} templating are a follow-up).
Replay & error fingerprinting
Failed runs are bucketed by a stable error fingerprint (failing step + normalized message), so recurring failures group together instead of scrolling past one-by-one in run history. A replayed run carries is_replay=true + replay_of=<run_id>; gate a step on {{ env.is_replay }} to skip a side effect (e.g. a notification) on replay. Full reference: Run observability: tags, metadata, replay, errors.
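One way to picture a "failing step + normalized message" fingerprint is shown below. The normalization rules (masking long hex ids and numbers) and the SHA-256 truncation are assumptions chosen for the example; the platform's actual normalizer is not described on this page.

```python
import hashlib
import re


def normalize(message: str) -> str:
    """Collapse run-specific details so equivalent errors compare equal."""
    message = re.sub(r"[0-9a-f]{8,}", "<id>", message.lower())  # long hex ids
    message = re.sub(r"\d+", "<n>", message)  # counters, ports, durations
    return re.sub(r"\s+", " ", message).strip()


def fingerprint(step_id: str, message: str) -> str:
    """Short stable hash of the failing step plus the normalized message."""
    digest = hashlib.sha256(f"{step_id}\n{normalize(message)}".encode()).hexdigest()
    return digest[:12]


a = fingerprint("fetch", "timeout after 30s calling run 9f8e7d6c5b4a")
b = fingerprint("fetch", "Timeout after 45s calling run 0a1b2c3d4e5f")
print(a == b)  # True
```

The same message on a different step hashes to a different group, which keeps unrelated failures apart even when their text matches.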
Three terminal-side observability surfaces, ordered by when you’d reach for each:
| Phase | Command | Why |
|---|---|---|
| Before run | crewship routine doctor <slug> | Preflight ✓/⚠/✗ checklist — catches missing crew provisioning, agent slug typos, missing credentials, contradictory validation gates, tight cost caps. Cheap to run, returns in milliseconds, fails CI builds on FAIL. |
| During run | crewship routine watch <slug> | Polls the runs endpoint and prints events as an ANSI-coloured timeline. --json for JSON Lines piping into jq; --once exits after the first run terminates (CI-friendly). |
| After run | crewship routine logs <run_id> --slug X | One-shot post-mortem dump — every step’s prompt, output, validation verdict, cost. Use to investigate a specific failure surfaced by watch or by the UI. |
For N-run variance characterisation (pass rate plus cost/latency stats), use crewship routine bench. For matrix-level cross-tier consistency (which scenarios diverge between Haiku and Opus?), use crewship eval scenarios, and crewship eval baseline diff for CI regression gates.
RBAC
| Role | Read | Run | Save | Schedules / Webhooks | skip_test_gate |
|---|---|---|---|---|---|
| VIEWER | ✓ | — | — | — | — |
| MEMBER | ✓ | ✓ | — | — | — |
| MANAGER | ✓ | ✓ | ✓ | ✓ | — |
| ADMIN | ✓ | ✓ | ✓ | ✓ | ✓ |
| OWNER | ✓ | ✓ | ✓ | ✓ | ✓ |
Common patterns
“Run this routine every day at 9 AM”
“Trigger this routine from a GitHub Actions hook”
“Validate routine DSL in CI before committing”
“Discover what an agent is about to do, without running it”
A dry-run returns a would_execute report with which agent, which tier, the rendered prompt, and an estimated cost. Zero side effects.
“Watch a metric on a tight cron, spend tokens only on a spike”
Author an agentless probe that prints true/false, then attach it as a wake gate:
routine schedules list shows how often the gate woke vs. checked.
“Cancel an in-flight run”
A cancelled run ends with pipeline.run.failed with reason “cancelled”.
“Serialise runs per tenant / customer / repo”
Use concurrency_key with a template referencing the tenant-identifying input:
Requests for the same account_id queue rather than fan out; requests for different account_ids run in parallel. Pair with the Idempotency-Key header for webhook handlers (see the Concurrency + idempotency recipe).
The platform fails fast if the rendered key is empty (missing input + no literal prefix in the template) — see Troubleshooting for why and how to fix.
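The fail-fast rule can be sketched as "merge defaults, render, refuse an empty result". render_concurrency_key, the placeholder regex, and the use of ValueError are assumptions for illustration; only the error text is taken from the Troubleshooting entry.

```python
import re

# Assumed subset of the template syntax: {{ inputs.NAME }}.
PLACEHOLDER = re.compile(r"\{\{\s*inputs\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render_concurrency_key(template: str, inputs: dict, defaults: dict) -> str:
    """Render the key after merging defaults; refuse to run on an empty result."""
    merged = {**defaults, **inputs}  # caller-supplied inputs override defaults
    key = PLACEHOLDER.sub(lambda m: str(merged.get(m.group(1), "")), template)
    if key == "":
        raise ValueError("pipeline: concurrency_key rendered to empty value")
    return key


print(render_concurrency_key("{{ inputs.account_id }}", {"account_id": "acme"}, {}))  # acme
print(render_concurrency_key("{{ inputs.account_id }}", {}, {"account_id": "global"}))  # global
print(render_concurrency_key("vendor-alert-{{ inputs.vendor_id }}", {}, {}))  # vendor-alert-
```

Calling it with "{{ inputs.account_id }}", no input, and no default raises instead of silently running without a gate.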
Troubleshooting
Save returns 422 "save requires a fresh, passing test_run"
The save endpoint requires a cleared validation gate. The server validates the DSL on save (parse + schema + cycle detection); to clear the residual gate, send "last_test_run_at" (RFC3339, within the last 5 minutes) + "last_test_run_passed": true in the /save body. This is the body-trust path crewship routine save uses, and it mirrors the sidecar agent-authoring flow, which sets the same fields after the internal dry-run validation. There is no public test_run endpoint to mint a token from: a real run can’t be done “dry”, so a real run is reserved for the first live crewship routine run.
Alternative: crewship apply --skip-test-gate (CLI; the flag lives on apply, not routine save) / "skip_test_gate": true (API) if your role is OWNER/ADMIN and you trust the DSL. This bypasses the gate explicitly.
Run starts but no step events appear in the UI
Check that ?include_steps=1 is in the runs URL the UI fetches. The list endpoint defaults to run-level only to keep the payload bounded; the detail panel passes the flag explicitly. After a refresh, the waterfall populates from journal entries plus live WebSocket events.
Schedule shows enabled=true but never fires
The scheduler ticks every 30s; with single-binary deployment, restarting the server resets the in-memory tick cursor. Pending schedules whose next_run_at has passed will fire on the next tick after restart. If you’ve edited the cron expression, next_run_at recomputes from now(), so a 0 9 * * * schedule edited at 10:30 won’t fire until 9 AM tomorrow.
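The recompute-from-now() behaviour is easy to check for the daily case. The sketch handles only a fixed daily time (the equivalent of 0 9 * * *), not general cron syntax; next_daily_run is a hypothetical helper, and how the real scheduler treats time zones is not covered here.

```python
from datetime import datetime, time, timedelta


def next_daily_run(now: datetime, at: time) -> datetime:
    """Next occurrence of a daily schedule strictly after `now`."""
    candidate = datetime.combine(now.date(), at, tzinfo=now.tzinfo)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot already passed
    return candidate


edited_at = datetime(2025, 3, 4, 10, 30)
print(next_daily_run(edited_at, time(9, 0)))  # 2025-03-05 09:00:00
```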
Webhook returns 401 / 403 on a known-good token
HMAC mismatch: check the X-Crewship-Signature: sha256=<hex> header, computed as HMAC-SHA256(signing_secret, request_body) over the raw bytes the sender sent (not a re-serialized form). The server compares with hmac.Equal, so the comparison is timing-safe.
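A sender-side sketch of the signature, useful for reproducing a mismatch locally. It follows the formula above using Python's standard hmac module; the secret and body values are placeholders, and sign/verify are hypothetical helper names rather than part of any Crewship SDK.

```python
import hashlib
import hmac


def sign(signing_secret: bytes, raw_body: bytes) -> str:
    """Build the header value: sha256=<hex HMAC-SHA256 of the raw body>."""
    return "sha256=" + hmac.new(signing_secret, raw_body, hashlib.sha256).hexdigest()


def verify(signing_secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Constant-time comparison, the Python counterpart of Go's hmac.Equal."""
    return hmac.compare_digest(sign(signing_secret, raw_body), header_value)


secret = b"example-signing-secret"   # placeholder value
body = b'{"account_id": "acme"}'     # exact bytes that go on the wire
header = sign(secret, body)

print(verify(secret, body, header))                      # True
print(verify(secret, b'{"account_id":"acme"}', header))  # False: re-serialized body
```

The second call fails only because a space was dropped during re-serialization, which is the most common cause of a mismatch on a known-good token.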
Run cost is higher than estimated_cost_usd
Two-tier escalation walks the fallback chain. With on_fail: escalate_tier, a failed Haiku run retries on Sonnet (5-10× more expensive), then Opus (20-50× more expensive) before giving up. Relax validation gates that fail spuriously so cheaper tiers pass more often (for example, loosen must_contain or the length bounds), or set max_cost_usd on the routine to abort the run between steps when a cost ceiling is hit.
"Pipeline X step Y stuck in pending state"
Since resume-from-step landed, a restart between the wait step starting and a decision arriving re-attaches the run to its pending waitpoint at boot; approving via crewship routine waitpoints approve <token> resumes it. If the run shows interrupted instead, its persisted state was insufficient to resume (the reason is in the run’s error_message); the orphaned waitpoint can still be listed and rejected to clear the inbox. The boot log lines pipeline boot recovery done (resume-from-step) resumed=N interrupted=M and pipeline waitpoint store wired (...) stranded_pending=N show what recovery did.
Run returns 500 "pipeline: concurrency_key rendered to empty value"
The DSL declared a non-empty concurrency_key template but the rendered value is an empty string, typically because a referenced input was omitted at trigger time.
Why fail-fast: a routine that declares concurrency_key: "{{ inputs.account_id }}" is asking the platform to serialise runs per tenant. If account_id is missing, treating the empty key as “no gate” would silently allow unlimited parallelism for a routine the author explicitly asked us to serialise, a denial-of-self by misconfiguration. The executor refuses to start the run.
Fixes, in order of preference:
- Supply the input. The caller (curl / crewship routine run --inputs '{...}' / scheduler / webhook) needs to pass the referenced input. For webhooks this means the inputs_template in the webhook config must produce it from the incoming payload.
- Set a default on the InputSpec. If the input is genuinely optional but you still want a tenant-style gate, add "default": "global" (or any non-empty sentinel) to the InputSpec. The platform merges defaults before rendering the key.
- Use a literal prefix. A template like "vendor-alert-{{ inputs.vendor_id }}" always renders non-empty (the literal vendor-alert- survives even when vendor_id is missing); the key still gates, just less granularly.
- Drop the gate. If you genuinely don’t want concurrency control, omit concurrency_key entirely (do NOT set it to ""; that’s the unset sentinel).
CLI reference
| Command | Purpose |
|---|---|
| routine list | Workspace routine catalog |
| routine get <slug> | Detail with full DSL |
| routine save | Save from JSON file |
| routine run <slug> | Invoke against execution tier (--tier-override fast/smart/...). Deferred dispatch via --delay/--ttl/--debounce-key/--priority; exactly-once retries via --idempotency-key <key> (+ --idempotency-ttl <s>): a duplicate key inside the window returns the original run as DEDUPED instead of executing again (same contract as the webhook Idempotency-Key header) |
| routine dry-run <slug> | Preview without invoking |
| routine delete <slug> | Soft-delete |
| routine doctor <slug> | Preflight checklist (✓/⚠/✗): catches blind alleys before run |
| routine bench <slug> | N-runs variance characterisation: pass-rate + cost/latency stats |
| routine runs <slug> | Run history from journal |
| routine records <slug> | Run history from pipeline_runs projection (filterable by status) |
| routine logs <run_id> | Full journal trace for one run (post-mortem) |
| routine watch <slug> | Live event stream |
| routine cancel <run_id> | Signal in-flight run |
| routine replay <run_id> | Re-invoke a run with its captured inputs |
| routine errors | List failed-run error fingerprint groups |
| routine bulk-replay --fingerprint <fp> | Replay every run in a fingerprint group |
| routine step-override set/list/clear | Live prompt/model override for one step, no version bump |
| routine pending list/cancel <id> | Inspect / cancel deferred (--delay / --debounce-key) triggers |
| routine versions <slug> | Version history |
| routine rollback <slug> --to N | Roll back to v N |
| routine export <slug> | JSON bundle to stdout |
| routine import [bundle.json] | Load from file or stdin |
| routine validate [file.json] | Offline DSL check |
| routine schedules ... | Cron CRUD |
| routine webhooks ... | Webhook CRUD |
| routine waitpoints ... | HITL inbox + decisions |
The pipeline alias is preserved for back-compat on the legacy subcommands (pipeline list, pipeline run, pipeline get, etc.). The post-rename additions (schedules, webhooks, waitpoints, validate, watch, logs, records, bench, doctor) are only registered under routine, so scripts that need them must switch.
Backend reference
- Migrations v78 (pipelines + execution_tiers_json), v79 (versions + waitpoints), v80 (schedules), v81 (run idempotency), v82 (webhooks), v115 (schedule wake gates)
- Source: internal/pipeline/ (~10 700 LOC, 36 files)
- API: internal/api/pipelines.go, pipeline_runs.go, pipeline_schedules.go, pipeline_webhooks.go
- Sidecar: internal/sidecar/pipelines.go (port 9119)
- Frontend: app/(dashboard)/routines/, components/features/routines/, hooks use-pipelines*
Production notes
Two caveats worth internalizing before you lean on routines for anything time- or cost-sensitive. Both are permanent architectural properties of the current single-binary deployment, not bugs waiting on a fix:
Limitations (current MVP)
These are known gaps in the current MVP. None block production use, but they shape what you can rely on today.
- Resume is at-least-once, step-granular: restart recovery re-enters runs from the last persisted step boundary (see Durability and restart recovery). The step that was in flight at the kill re-executes from scratch; there is no sub-step checkpointing, nested call_pipeline runs re-execute in full, and DAG runs recover at wave granularity. Runs whose definition changed since the run started (content-hash mismatch) fall back to interrupted.
- max_cost_usd under-counts true spend after a crash (see Production notes).
- Single-instance scheduler: running multiple replicas would double-fire schedules.
- credentials_required is declared-only: unlike integrations_required and resources, it is not yet enforced at run time (see Required integrations).
- No cross-adapter tier swap yet: same-provider model swap (Haiku→Opus) works; Claude→Gemini swap requires a shorthand→constant mapping not yet wired.
- No NL→cron converter: ops still hand-type 0 9 * * *. Foundation PRD has the design; not in MVP.