A Routine is a declarative recipe for repeatable AI work. You author it once (or an agent authors it for you when it spots a repetitive pattern), it runs the same way every time, and it lives in your workspace as a reusable asset alongside crews, skills, and credentials.Each routine bundles:
A JSON DSL definition — inputs, outputs, ordered or DAG-structured steps, validation gates, declared egress, credential requirements
Authorship metadata — which crew, which agent (or user), via which path
Triggers — cron schedules and webhook tokens that fire it autonomously
Run history — every invocation as an immutable journal trace with step-level events
Versions — every save creates a new version; rollback is one API call
Compared to other layers in Crewship:
Layer
Scope
Authoring
Use case
Routine
atomic
AI-authored OR human DSL
One repeatable AI workflow
Recipe(future)
composite
Marketplace template
Crew + agents + integrations + routines bundled
Cyclic Issue(future)
issue-tracker
recurring user issue
”Standup every Monday” without leaving Issues
The user-facing label is Routine across web UI, CLI, agent system prompts and marketing surfaces. The backend identifier (database table, Go package, HTTP route paths) remains pipelines for backwards compatibility.
Naval theme. Routines are the boring, accurate, repeated procedures a ship’s crew runs every day — the equivalent of naval drills in the Crewship metaphor. Programs your AI agents follow.
Crewship is unusual among workflow systems because agents author routines, not just execute them. Three paths converge on the same pipelines table:
Agent
The most common path. An agent that’s solved a repetitive problem twice posts to http://localhost:9119/pipelines/save from inside its container; the sidecar injects authorship and forwards to the main API. Next time [AVAILABLE ROUTINES] block in the system prompt advertises it to other crews.
UI
Open /routines, click + New routine, pick a starter template, edit the DSL JSON, click Test & Save. Test_run runs against the execution tier; on pass the routine is persisted with authored_via=user_api and the JWT user as author.
CLI
crewship routine save --name "..." --definition file.json --author-crew <crew_id>. Same test_run gate as UI. CI-friendly: validate offline first with crewship routine validate file.json, then save.
Runtime cost cap; run aborts if exceeded between steps.
concurrency_key
string
Template that gates how many runs can be in flight at once for the same workspace + rendered key. Typical pattern "{{ inputs.account_id }}" to serialise per-tenant runs. Empty (default) = no gate. See Concurrency + idempotency recipe and the fail-fast note in Troubleshooting.
max_concurrent
integer
Cap on simultaneous runs sharing the resolved concurrency_key. Defaults to 1 when concurrency_key is set (strict per-key serialisation); ignored otherwise.
Anywhere a string is interpolated (prompt, http URL/body/headers, wait until, code, transform, conditional if), placeholder {{ ... }} resolves against:
inputs.X — declared input value
steps.Y.output — full text output of an earlier step
steps.Y.output.path — JSON path into a step’s output (when output parses as JSON)
env.AUTHOR_CREW_NAME / etc. — read-only allowlist of execution context
Save-time validator walks every template-bearing field (prompt, nested inputs, http url/body/headers, wait until, event_filter, approval_prompt, code body, code env values, transform input, transform expression, if condition) and rejects placeholders that reference unknown inputs or unseen-yet steps.
Templates are regex substitution, not expression evaluation. There is no arithmetic, no function calls, no inline conditionals. If you need logic, add another step.
complexity resolves through the workspace’s execution_tiers_json mapping into (adapter, model). With complexity: "fast", the agent’s CLI gets --model claude-haiku-4-5-20251001 (or whatever the workspace mapped fast to). model_override is the explicit pin that wins over complexity.on_fail lives at the step level (not inside validation) and is one of escalate_tier | abort | retry_step — escalate_tier walks the fallback chain (e.g., Haiku → Sonnet → Opus) until validation passes or the chain exhausts.
Egress allowlist enforced both pre-flight AND on every redirect (CheckRedirect callback). Allowlist set per-workspace. Credential injection schemes: bearer (default), header with explicit name, query with explicit name.
Three kinds: approval (HITL token), datetime (sleep until ISO timestamp), event (filter on journal events). DB-backed waitpoint store survives process restart for approval kind; the recovery scan at boot reports stranded pending waitpoints.Approve via UI Inbox, CLI crewship routine waitpoints approve <token>, or API.
Runs in a sandboxed container with --cap-drop=ALL, no host mounts, network constrained by the same egress allowlist. Inputs auto-mapped to CREWSHIP_INPUT_<NAME> env vars.
Pure-Go data reshaping with a tiny jq-flavoured subset. No LLM, no network — fully deterministic. Useful for wiring step outputs without paying for another agent_run.
Any step can carry "if": "{{ inputs.run_summary }}". The placeholder is rendered as a plain string — empty / false / 0 / no / off (case-insensitive) → step skipped and marked <skipped> in StepOutputs; anything else counts as truthy. Templates are plain substitution, not expression evaluation, so put the boolean upstream (e.g. set inputs.run_summary from the caller) rather than writing == / != inside the if value.
Steps with no overlapping needs execute in parallel (one goroutine wave per ready set). Final output picks the unique leaf node; for multi-leaf DAGs the first leaf in source order wins.
The economic value-prop: an Opus-class authoring model designs the routine, a Haiku-class executor model runs each invocation. Workspace execution_tiers_json maps complexity classes to (adapter, model):
Per-step complexity annotation drives the resolver. With on_fail: "escalate_tier", a failed validation walks the fallback chain — practically: Haiku tries first, Sonnet on validation fail, Opus on second fail.
Tier override at runtime. The CLI flag --model <model> is constructed from the resolved tier and passed to the agent’s CLI adapter, so a routine’s complexity: "fast" actually fires Haiku, not the agent’s default. CLIAdapter is preserved (so the agent’s CLAUDE_CODE / GEMINI_CLI / etc. wiring stays intact); only the model name swaps.
Save endpoints (sidecar /pipelines/save, user /api/v1/workspaces/{ws}/pipelines/save, internal /api/v1/internal/pipelines/save) require a fresh passing test_run within the last 5 minutes — otherwise inline runs one against the execution tier.This is the self-improvement loop: an authoring agent that writes brittle DSL gets a structured failure report it can read and revise from. Without the gate, MVP would ship pipelines that pass schema but fail at runtime.skip_test_gate: true is honored only when the caller’s role is OWNER or ADMIN; lower roles get 403. Useful for hand-crafted DSL from known-good templates (the seed flow uses it).
5-field cron expression. Scheduler runs in-process and ticks every 30s, so minimum resolution is 1 minute. Single-instance only — running multiple replicas would double-fire (no leader election yet).
Output reveals the public URL and signing secret once (Stripe-style). External services POST event payload to /api/v1/webhooks/{token}. With HMAC, sender includes header X-Crewship-Signature: sha256=<hex_hmac_of_body>, validated server-side via hmac.Equal (timing-safe). Rate limited per token, per minute, default 60.To rotate the secret: delete the webhook + create a new one. There is no in-place rotation by design.
Three execution modes, distinguished by Mode in the request body and surface:
Mode
Side effects
Increments invocation_count
Cost
run
yes (agents called)
yes
real
test_run
yes (agents called)
no
real
dry_run
no (templates rendered, agents skipped)
no
estimated
Dry-run is the safe “what would this routine do?” preview. It walks the DSL, renders all template substitutions against the supplied inputs, resolves each step’s execution tier (adapter + model), and reports a would_execute list with per-step estimated cost. No agents are invoked, no journal entries beyond a single pipeline.dry_run audit row are written.
Estimated cost in USD (order-of-magnitude only — the executor uses a flat token-density heuristic, not real pricing)
would_call_agent / would_call_pipeline target
The estimate is intentionally labelled “estimated” everywhere it surfaces — it’s a planning aid, not a quote. Real cost only lands once you switch to run or test_run.
Every save creates a new immutable row in pipeline_versions (v79 migration). The pipelines.head_version column points at the current. Rollback creates a NEW version on top of HEAD whose definition equals the target’s:
History is preserved — you can roll forward to a future version by another rollback. There is no “delete version” — if a version was bad, the trail of “v3 → v4 (rollback to v2) → v5 (fix)” is the audit story you keep.
# Export from workspace Acrewship routine export email-fetch --include-history > email-fetch.json# Import into workspace Bcat email-fetch.json | crewship routine import
The bundle format is crewship-pipeline-bundle/v1: routine row + (optionally) the full version chain + change_summary annotations. Author identity is rewritten on import so the importing user becomes the new author. Slug is preserved; if it conflicts in the destination workspace the existing row updates (new version), or you change the bundle’s slug before import.
The decision comment is forwarded to the parked run as the wait step’s output, so downstream steps can read approval rationale via {{ steps.<wait_step_id>.output }}.
The online sampler watches completed routine runs and grades a configurable percentage of them through the existing rubric grader so production traffic continuously feeds the drift detector — not just on-demand replays or scheduled regression suites.Per-routine DSL:
name: nightly-summaryeval: online: sample_rate: 0.05 # grade 5% of runs; 0 disables, 1.0 grades every run grader_agent_slug: qa-grader # references an agent in the AUTHOR crew with a rubric prompt
What happens on a tick (default cadence: every 1 minute):
Scan pipeline_runs WHERE status = 'completed' AND completed_at > watermark.
For each candidate, resolve the routine DSL. If eval.online is absent or sample_rate <= 0, skip.
Draw from crypto/rand. If the sample lands above sample_rate, skip.
Otherwise, INSERT into eval_runs with kind = 'online', status = 'queued'. The existing grader worker picks it up and writes the result back.
Field
Notes
sample_rate
Float [0, 1]. 0 disables — useful as an incident “pause grading” toggle. 1.0 is expensive but ok for newly-launched routines while calibrating the grader. Realistic prod default: 0.05.
grader_agent_slug
Required when sample_rate > 0. Missing grader is treated as a deterministic skip (logged once, no retry storm).
Correctness guarantees:
Schema-layer idempotency — eval_runs has a partial UNIQUE INDEX (pipeline_run_id) WHERE kind='online'. A duplicate sampler instance or a crash-recovery watermark replay can attempt the same enqueue twice; the second collapses to a no-op rather than queueing twice and double-billing the grader.
(completed_at, id) tuple cursor — parallel fan-out steps that complete at the same nanosecond all get graded; a timestamp-only cursor would orphan siblings.
Stuck-on-error watermark — a transient per-row error (resolver outage, entropy outage, enqueue conflict) freezes the watermark at the row before the failure so the next tick retries. Deterministic skips (no eval config, sample roll missed) advance normally.
Trace correlation — the eval row carries routine_slug + pipeline_run_id so an operator clicking a low-scoring grade in the eval UI lands on the actual trace via pipeline_run_id -> trace.
Limitations:
Watermark is in-memory only. On process restart it resets to now - 1h; outages longer than that leave a gap in grading coverage. The UNIQUE index keeps reprocessing harmless.
Page cap is 500 rows/tick. A workspace processing > 500 routine runs/minute at sample_rate = 1.0 will accumulate backlog; the watermark only advances past successfully-handled rows, so the backlog drains across subsequent ticks rather than being lost.
Each step’s validation block runs after the step output materializes. The schema is a JSON Schema draft 2020-12 subset (most of type, required, properties, items, pattern, format, enum, etc.) plus three Crewship extensions:
must_not_contain: ["API_KEY=", "Bearer "] — output must include none of these substrings
must_contain: ["##"] — output must include all of these
min_length / max_length — convenience for non-JSON outputs
The must_not_contain gate is the credential-leak tripwire: if an agent is about to leak a real API key in its output, the gate fails the step before downstream consumers see it. Pair with on_fail: "abort" (set at the step level, alongside validation) for hard stop, or escalate_tier if you want the higher model to retry without the leak.
Every routine run emits a sequence of journal entries:
pipeline.run.started — once at run begin
pipeline.step.started — per step
pipeline.step.completed / pipeline.step.failed / pipeline.step.validation_failed — per step terminal
pipeline.run.completed / pipeline.run.failed — once at run end
Plus WebSocket broadcast on the workspace channel for live UI updates (PipelineRunNode in the orchestration graph + Runs sub-tab waterfall both subscribe).
Both the UI Runs sub-tab waterfall and crewship routine logs <run_id> --slug X surface the cost_usd and duration_ms fields the executor stamps on every pipeline.step.completed event. Same data, two presentation surfaces:
UI: right-aligned columns next to each step row, with a footer total summing the run. An em-dash (—) renders for events that don’t carry cost (pipeline.step.started, .failed, live-only echoes) — easier to scan than $0.0000 next to real values.
CLI logs: extra DURATION and COST columns in the timeline output. Same em-dash rule for non-positive values.
Worker tier escalation is visible too — a failed validation gate that retries on a higher tier emits a second step.started+step.completed pair with its own cost on the retry, so the column-summed footer reflects the full spend (including retries), not just the first attempt.Three terminal-side observability surfaces, ordered by when you’d reach for each:
Phase
Command
Why
Before run
crewship routine doctor <slug>
Preflight ✓/⚠/✗ checklist — catches missing crew provisioning, agent slug typos, missing credentials, contradictory validation gates, tight cost caps. Cheap to run, returns in milliseconds, fails CI builds on FAIL.
During run
crewship routine watch <slug>
Polls the runs endpoint and prints events as an ANSI-coloured timeline. --json for JSON Lines piping into jq; --once exits after the first run terminates (CI-friendly).
After run
crewship routine logs <run_id> --slug X
One-shot post-mortem dump — every step’s prompt, output, validation verdict, cost. Use to investigate a specific failure surfaced by watch or by the UI.
For variance characterisation across many runs (is this routine production-ready at this tier?), use crewship routine bench. For matrix-level cross-tier consistency (which scenarios diverge between Haiku and Opus?), use crewship eval scenarios and crewship eval baseline diff for CI regression gates.
Two requests for the same account_id queue rather than fan out; requests for different account_ids run in parallel. Pair with the Idempotency-Key header for webhook handlers (see the Concurrency + idempotency recipe).The platform fails fast if the rendered key is empty (missing input + no literal prefix in the template) — see Troubleshooting for why and how to fix.
Before going through this list: run crewship routine doctor <slug> first. Most “blind alley” failures (missing crew provisioning, agent slug typo, missing credential, contradictory validation gate, cost cap too tight) surface as a ✗ check on doctor before they ever cost an LLM call.
Save returns 422 “save requires a fresh, passing test_run”
The save endpoint requires the test gate cleared. Preferred path: call /test_run, capture the save_token from its response, and pass "save_token": "<token>" in the /save body. The token is a server-signed HMAC over (workspace_id | definition_hash | user_id | issued_at) and expires after 5 minutes — it cannot be forged from the client side.Alternatives:
--skip-test-gate (CLI) / "skip_test_gate": true (API) if your role is OWNER/ADMIN and you trust the DSL — bypasses the gate without running the test_run.
Legacy fallback: send "last_test_run_at" + "last_test_run_passed" in the body. Honoured for backwards compatibility, but trusts client-supplied values; new clients should always use save_token.
Check that ?include_steps=1 is in the runs URL the UI fetches. The list endpoint defaults to run-level only to keep payload bounded; the detail panel passes the flag explicitly. After a refresh, the waterfall populates from journal entries plus live WebSocket events.
The scheduler ticks every 30s; with single-binary deployment, restarting the server resets the in-memory tick cursor. Pending schedules whose next_run_at has passed will fire on the next tick after restart. If you’ve edited the cron expression, next_run_at recomputes from now() — so a 0 9 * * * schedule edited at 10:30 won’t fire until 9 AM tomorrow.
HMAC mismatch: check the X-Crewship-Signature: sha256=<hex> header, computed as HMAC-SHA256(signing_secret, request_body) over the raw bytes the sender sent (not a re-serialized form). The server uses hmac.Equal for comparison so timing-safe.
Two-tier escalation walks the fallback chain. With on_fail: escalate_tier, a failed Haiku run retries on Sonnet (5-10× more expensive), then Opus (20-50× more expensive) before giving up. Tighten validation gates (loosen must_contain, raise min_length), or set max_cost_usd on the routine to abort the run between steps when a cost ceiling is hit.
Probably a waitpoint without a matching listener (process restarted between the wait step starting and a decision arriving). Check crewship routine waitpoints list; if listed, decide via approve/reject. The recovery scan at server boot reports stranded waitpoints in the log line pipeline waitpoint store wired (...) stranded_pending=N.
Run returns 500 “pipeline: concurrency_key rendered to empty value”
The DSL declared a non-empty concurrency_key template but the rendered value is an empty string — typically because a referenced input was omitted at trigger time. Full error message:
pipeline: concurrency_key rendered to empty value (referenced input missing or empty): template "{{ inputs.<NAME> }}"
Why fail-fast: a routine that declares concurrency_key: "{{ inputs.account_id }}" is asking the platform to serialise runs per tenant. If account_id is missing, treating the empty key as “no gate” would silently allow unlimited parallelism for a routine the author explicitly asked us to serialise — a denial-of-self by misconfiguration. The executor refuses to start the run.Fixes, in order of preference:
Supply the input. The caller (curl / crewship routine run --inputs '{...}' / scheduler / webhook) needs to pass the referenced input. For webhooks this means the inputs_template in the webhook config must produce it from the incoming payload.
Set a default on the InputSpec. If the input is genuinely optional but you still want a tenant-style gate, add "default": "global" (or any non-empty sentinel) to the InputSpec. The platform merges defaults before rendering the key.
Use a literal prefix. A template like "vendor-alert-{{ inputs.vendor_id }}" always renders non-empty (the literal vendor-alert- survives even when vendor_id is missing); the key still gates, just less granularly.
Drop the gate. If you genuinely don’t want concurrency control, omit concurrency_key entirely (do NOT set it to "" — that’s the unset sentinel).
Run history from pipeline_runs projection (filterable by status)
routine logs <run_id>
Full journal trace for one run (post-mortem)
routine watch <slug>
Live event stream
routine cancel <run_id>
Signal in-flight run
routine versions <slug>
Version history
routine rollback <slug> --to N
Roll back to v N
routine export <slug>
JSON bundle to stdout
routine import [bundle.json]
Load from file or stdin
routine validate [file.json]
Offline DSL check
routine schedules ...
Cron CRUD
routine webhooks ...
Webhook CRUD
routine waitpoints ...
HITL inbox + decisions
For batch evaluation across the eval-* fleet (matrix sweep, head-to-head tier compare, baseline regression diff), see the eval CLI. Cookbook recipe 6 walks the eval-driven promotion workflow end-to-end.The pipeline alias is preserved for back-compat on the legacy subcommands (pipeline list, pipeline run, pipeline get, etc.). The post-rename additions — schedules, webhooks, waitpoints, validate, watch, logs, records, bench, doctor — are only registered under routine, so scripts that need them must switch.
Run state is in-memory — restart loses active runs. Pending waitpoints survive (DB-backed); active step execution does not. A pipeline_runs table with status enum + checkpoint is the next-step PR.
Single-instance scheduler — running multiple replicas would double-fire schedules.
DSL editor is read-only in UI — Monaco write-mode is a follow-up. Authoring through CLI / agent / API works fully.
No cross-adapter tier swap yet — same-provider model swap (Haiku→Opus) works; Claude→Gemini swap requires a shorthand→constant mapping not yet wired.
No NL→cron converter — ops still hand-type 0 9 * * *. Foundation PRD has the design; not in MVP.
No errors fingerprinting / bulk replay — one-by-one rerun via routine run only.