This cookbook collects the patterns we’ve seen pay off when promoting an AI workflow from prototype to production. Each example is a complete, runnable routine: copy, paste, adjust slugs to your crew, save with `crewship routine save`, validate with `crewship eval scenarios`.
If you’re new to routines, start with the routines guide for the conceptual model. This page assumes you know what agent_run, complexity, and validation are.
How to read these recipes
Every recipe has the same anatomy:
- Goal: what the routine does and when it’s worth using.
- DSL: the full JSON definition. No omissions, no `...`.
- Gates explained: why each `must_contain` / `schema` / `outcomes.criteria` is there. The gates are the design; copying the prompt without the gates doesn’t reproduce the routine.
- Failure modes & fixes: what goes wrong on weak tiers and the typical resolution.
Tier shorthand: `complexity: fast` (Haiku, cheap) and `complexity: smart` (Opus, robust).
Recipe 1 — Strict JSON extraction
Goal: turn freeform text into a typed JSON object that the next step can `transform` over without parsing prose. The canonical “make this prompt deterministic enough to be a function” pattern.
Gates explained
- `must_contain` on `{`, `}`, `"item"`, `"qty"`, `"unit_price"`, `"currency"`: anchors on JSON structure plus every required key name. Catches the most common failure: the model emits prose like “Sure, here’s the JSON: …” before the actual object.
- `must_not_contain` on \`\`\` and `Here is`: blocks code-fence wrappers and conversational lead-ins.
- `max_length: 400`: caps verbosity drift. A weak tier sometimes ignores “no prose” and adds a paragraph of explanation; the cap trips it.
- `validation.schema` (JSON Schema draft-2020-12): a full schema walk is enforced when set, compiled and cached per definition by `internal/pipeline/executor_validate.go`. Use it when you want type-checking; the substring anchors above are the cheap-but-effective first layer that short-circuits the expensive compile + walk on obvious failures.
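The layering of these gates can be sketched in Python. This is an illustration only, not the platform's executor: `check_gates` is a hypothetical name, the field types are assumptions (the recipe names the keys but not their types), and the final type check is a hand-rolled stand-in for a real JSON Schema walk.

```python
import json

REQUIRED_TOKENS = ["{", "}", '"item"', '"qty"', '"unit_price"', '"currency"']
BANNED_TOKENS = ["`" * 3, "Here is"]  # "`" * 3 is a code fence marker
MAX_LENGTH = 400

# Assumed field types, for illustration only.
EXPECTED_TYPES = {
    "item": str,
    "qty": (int, float),
    "unit_price": (int, float),
    "currency": str,
}


def check_gates(output: str) -> list[str]:
    """Return a list of gate failures; an empty list means the output passed."""
    failures = []

    # Layer 1: cheap substring anchors and the length cap.
    for token in REQUIRED_TOKENS:
        if token not in output:
            failures.append(f"output missing required token: {token}")
    for token in BANNED_TOKENS:
        if token in output:
            failures.append(f"output contains banned token: {token}")
    if len(output) > MAX_LENGTH:
        failures.append(f"output longer than {MAX_LENGTH} characters")
    if failures:
        return failures  # short-circuit before the more expensive parse

    # Layer 2: parse and type-check (stand-in for schema validation).
    try:
        obj = json.loads(output)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["output is not a JSON object"]
    for key, expected in EXPECTED_TYPES.items():
        if not isinstance(obj.get(key), expected):
            failures.append(f"field {key!r} has the wrong type")
    return failures
```

The ordering is the point of the sketch: an output wrapped in code fences or missing a key is rejected by string checks alone, and only outputs that look structurally plausible reach the parser.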
Failure modes & fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| `output missing required token: "currency"` | Worker forgot a field | Add the missing key to the prompt’s “EXACTLY these keys” list more explicitly |
| output contains banned token: \`\`\` | Wrapped in code fences | Add "do NOT use code fences" to the prompt; this is a prompt-engineering fix, not a gate-tightening one |
| Cost cap exceeded on smart tier | max_cost_usd: 0.05 too low | Bump to 0.50 — Claude Code overhead is ~$0.05–0.10 per step |
Recipe 2 — Cross-family rubric grading
Goal: a fast worker drafts the output, a smart grader scores it on a strict rubric, and the loop iterates if the worker misses the rubric. Mitigates self-preference bias by using a different grader family.
Why this pattern
A bare gate (must_contain: ["- "]) catches the bullet marker but not “did the worker actually summarise the topic, or just emit three placeholder bullets?” The rubric’s covers_topic + no_invented_facts criteria are what prove the output is grounded. The grader is a separate agent_slug (eva, Sonnet) so its self-preference bias doesn’t pile onto the worker’s (daniel, Haiku).
Gotchas
- `max_iterations: 3`: the loop is bounded so a stubborn worker can’t burn unbounded tokens. After 3 attempts, `on_fail: abort` lets the run fail honestly rather than ship a bad summary.
- `max_cost_usd: 1.50`: outcomes loops can iterate up to 3× the worker cost + 1 grader call per iteration. Budget accordingly.
- Don’t put more than ~10 criteria in one rubric; the grader’s verdict gets noisy. Split into two graded steps if you need a finer-grained rubric.
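A minimal Python sketch of a bounded draft-and-grade loop shows how the iteration bound and the cost cap interact. Everything here is hypothetical: `run_graded_loop`, the callable signatures, and the status strings are illustrative names, not the platform's API.

```python
from typing import Callable


def run_graded_loop(
    worker: Callable[[str], tuple[str, float]],
    grader: Callable[[str], tuple[bool, str, float]],
    max_iterations: int = 3,
    max_cost_usd: float = 1.50,
) -> dict:
    """Draft, grade, and retry with feedback until the rubric passes or a bound trips.

    worker(feedback) returns (draft, cost_usd).
    grader(draft) returns (passed, feedback, cost_usd).
    """
    feedback = ""
    spent = 0.0
    for attempt in range(1, max_iterations + 1):
        draft, worker_cost = worker(feedback)
        passed, feedback, grader_cost = grader(draft)
        spent += worker_cost + grader_cost  # one worker call + one grader call
        if spent > max_cost_usd:
            return {"status": "FAILED", "reason": "cost cap exceeded",
                    "attempts": attempt, "cost_usd": spent}
        if passed:
            return {"status": "PASSED", "output": draft,
                    "attempts": attempt, "cost_usd": spent}
    # Bounded: fail honestly instead of shipping an ungraded draft.
    return {"status": "FAILED", "reason": "rubric not satisfied",
            "attempts": max_iterations, "cost_usd": spent}
```

In the worst case the loop pays for three worker calls and three grader calls, which is why the cap in this recipe is sized well above the cost of a single attempt.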
Recipe 3 — Tier escalation on cost guardrail
Goal: pin a routine to the cheapest tier that satisfies the gate, but escalate automatically if the cheap tier fails. Production routines should have this; it’s how you catch a model regression without a 4am page.
Escalation flow
- Worker runs at `fast` tier (Haiku). Output fails `must_contain` because it added “I think the sentiment is…”
- `on_fail: escalate_tier` → executor walks `execution_tier.fallback`, retries on `moderate` (Sonnet).
- Sonnet’s output passes the gate. Run completes with `cost_usd` ≈ Haiku-cost + Sonnet-cost.
- Journal records both attempts so you can see which tier actually satisfied the gate.
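The flow above can be sketched as a walk over an ordered list of tiers. This is an assumption-laden illustration: `run_with_escalation`, the journal entry shape, and the status strings are hypothetical, and the real executor's fallback handling may differ.

```python
from typing import Callable


def run_with_escalation(
    tiers: list[str],
    run_step: Callable[[str], tuple[str, float]],
    gate: Callable[[str], bool],
) -> dict:
    """Try each tier in order until one output passes the gate.

    run_step(tier) returns (output, cost_usd); gate(output) returns True on pass.
    """
    journal = []
    total_cost = 0.0
    for tier in tiers:
        output, cost = run_step(tier)
        total_cost += cost  # failed attempts still cost money
        passed = gate(output)
        journal.append({"tier": tier, "cost_usd": cost, "passed": passed})
        if passed:
            return {"status": "COMPLETED", "output": output,
                    "cost_usd": total_cost, "journal": journal}
    return {"status": "FAILED", "cost_usd": total_cost, "journal": journal}
```

Note that the total cost is the sum over every attempted tier, and the journal keeps one entry per attempt so the tier that actually satisfied the gate stays visible.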
When NOT to use auto-escalation
- Critical-output routines where wrong-but-confident is worse than failed. Rubric-graded scenarios with `outcomes.on_fail: abort` are better: failing loud is preferable to silently spending more.
- Cost-budget-pinned routines where you’d rather see `FAILED: cost cap exceeded` than auto-bump to a $0.20 tier you didn’t budget for.
Recipe 4 — DAG with deterministic transform plumbing
Goal: combine non-LLM steps (http, transform, code) with LLM steps to keep cost down where determinism is achievable.
Why the transform step matters
Without `project_title`, the LLM step would see the entire JSON response (often KBs of data) and either truncate context or summarise the wrong field. A deterministic transform step (jq projection) reduces cost AND eliminates a class of hallucination: the LLM only sees what we explicitly extract.
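The effect of the projection can be illustrated in Python. The recipe itself uses a jq projection; this sketch swaps in plain Python to show the same idea, and the response shape (a `title` field alongside larger fields) is an assumption made up for the example.

```python
import json


def project_title(response_body: str) -> str:
    """Keep only the field the LLM step needs; drop everything else."""
    payload = json.loads(response_body)
    return json.dumps({"title": payload["title"]})


# A made-up HTTP response with one small useful field and a lot of noise.
raw = json.dumps({
    "id": 7,
    "title": "Quarterly report",
    "body": "lorem " * 500,
    "tags": ["a", "b"],
})
projected = project_title(raw)
print(len(raw), "->", len(projected), "characters")
```

Because the projection is deterministic, a missing `title` key fails immediately with a `KeyError` at this step instead of surfacing later as a confused summary.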
Egress allowlist
egress_targets: ["api.example.com"] is enforced at runtime. A typo’d URL host fails at the http step rather than going to a different server. Always set this — leaving it unset opens the routine to SSRF if any input is template-substituted into a URL.
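A host allowlist check of this kind might look like the following Python sketch. The function name `check_egress` and the exact-match policy are assumptions for illustration; the platform's own enforcement is not shown in this page.

```python
from urllib.parse import urlsplit


def check_egress(url: str, egress_targets: list[str]) -> None:
    """Raise PermissionError unless the URL's host exactly matches an allowed target."""
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if no host
    allowed = {target.lower() for target in egress_targets}
    if host is None or host not in allowed:
        raise PermissionError(f"egress to {host!r} is not in the allowlist")


check_egress("https://api.example.com/v1/items", ["api.example.com"])  # passes silently
```

Exact host matching is deliberate in this sketch: a suffix or substring match would let `api.example.com.evil.test` through, and parsing the URL first means userinfo tricks such as `https://api.example.com@evil.test/` are judged by their real host.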
Recipe 5 — Idempotent routine with concurrency key
Goal: a webhook-triggered routine that should never double-execute on retransmission. This is the Trigger.dev / Stripe webhook model: pair an idempotency key with a concurrency limit so retries are safe and a burst of duplicate events doesn’t fan out to N parallel executions.
Idempotency vs concurrency: what’s the difference?
- `concurrency_key` gates parallel runs: if two requests with the same key arrive at the same time, the second one waits (or 429s) until the first finishes.
- `Idempotency-Key` HTTP header dedupes across time: a second request with the same key (within the TTL) returns the original `run_id` with `status=DEDUPED` instead of executing again.
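The two mechanisms can be contrasted with a small in-memory Python sketch. It is a toy model under stated assumptions: `TriggerGate`, the `RUNNING` and `REJECTED_429` statuses, and the one-hour default TTL are hypothetical, and the sketch rejects a concurrent request rather than queueing it. Only `DEDUPED` comes from the text above.

```python
import time
from typing import Optional


class TriggerGate:
    """Toy model: idempotency dedupes across time, concurrency gates parallel runs."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl_seconds = ttl_seconds
        self._seen: dict[str, tuple[str, float]] = {}  # idempotency key -> (run_id, stored_at)
        self._running: set[str] = set()                # concurrency keys with a run in flight
        self._next_id = 1

    def trigger(self, idempotency_key: str, concurrency_key: str,
                now: Optional[float] = None) -> dict:
        now = time.monotonic() if now is None else now
        # Idempotency: same key within the TTL returns the original run.
        hit = self._seen.get(idempotency_key)
        if hit is not None and now - hit[1] < self.ttl_seconds:
            return {"run_id": hit[0], "status": "DEDUPED"}
        # Concurrency: a different event for a busy key must wait or be rejected.
        if concurrency_key in self._running:
            return {"run_id": None, "status": "REJECTED_429"}
        run_id = f"run-{self._next_id}"
        self._next_id += 1
        self._seen[idempotency_key] = (run_id, now)
        self._running.add(concurrency_key)
        return {"run_id": run_id, "status": "RUNNING"}

    def finish(self, concurrency_key: str) -> None:
        """Release the concurrency gate once the run completes."""
        self._running.discard(concurrency_key)
```

A retransmitted event hits the idempotency check and never reaches the concurrency gate, while a genuinely new event for the same order is held back only while the earlier run is still in flight.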
concurrency_key validation
The platform fails fast when a non-empty `concurrency_key` template renders to an empty string. Example: a routine declares `concurrency_key: "{{ inputs.order_id }}"` but the caller triggers without supplying `order_id`. The rendered key is `""`, which would otherwise be treated as “no gate” and silently allow unlimited parallelism. The executor instead rejects the run with an error.
If you want a single shared gate, use a constant key (for example `"global"`); if you want no gate, omit the field entirely.
Triggering safely
When a request is retransmitted with the same `Idempotency-Key`, the second request returns `status=DEDUPED` and the original run id. No double-charge, no double-fulfilment.
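Part of triggering safely is making sure the `concurrency_key` will not render empty, per the validation rule earlier in this recipe. A caller-side pre-flight check could be sketched as below. The function name, the simplified `{{ inputs.name }}` template handling, and the error message are all hypothetical; the platform's actual error text is not reproduced here.

```python
import re
from typing import Optional

# Simplified: only handles placeholders of the form {{ inputs.name }}.
_PLACEHOLDER = re.compile(r"\{\{\s*inputs\.(\w+)\s*\}\}")


def render_concurrency_key(template: Optional[str], inputs: dict) -> Optional[str]:
    """Render the key template; fail fast if a declared template renders empty."""
    if not template:
        return None  # field omitted: no gate was requested
    rendered = _PLACEHOLDER.sub(
        lambda match: str(inputs.get(match.group(1), "")), template
    )
    if not rendered.strip():
        # Hypothetical message; an empty key must not silently mean "no gate".
        raise ValueError("concurrency_key rendered to an empty string")
    return rendered
```

The sketch distinguishes the two cases the text calls out: an omitted field means no gate by intent, while a declared template that renders empty is treated as a caller error.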
Recipe 6 — Eval-driven promotion to production
Goal: take a hand-written routine and decide, with data, whether to ship it on the `fast` or `smart` tier. This is the workflow PRD §18 calls “operator promotion path.”
What to do at each verdict
| Bench verdict | Action |
|---|---|
| PRODUCTION_READY (≥90%) at fast | Ship. Save baseline. Add to CI. |
| FLAKY (70-90%) at fast | Look at the fail breakdown. If the top reason is cost-cap, bump the cap. If it is rubric-fail, tighten the prompt or escalate the tier. |
| UNRELIABLE (<70%) at fast | Don’t ship at fast. Bench at moderate. If still <70%, the gate is unsatisfiable; rewrite it. |
BROKEN (0%) | Auth issue, unsatisfiable gate, or always-failing guardrail. Check --cooldown-ms 1000 to rule out rate limiting. |
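The thresholds in the table can be written down as a small Python function. The verdict names and percentages come from the table; the function name and the choice to treat exactly 70% as FLAKY and exactly 90% as PRODUCTION_READY are this sketch's reading of the boundaries.

```python
def bench_verdict(passed: int, total: int) -> str:
    """Map a pass count to the bench verdict bands from the table."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    if passed == 0:
        return "BROKEN"
    # Integer arithmetic avoids float rounding at the 90% and 70% boundaries.
    if passed * 100 >= 90 * total:
        return "PRODUCTION_READY"
    if passed * 100 >= 70 * total:
        return "FLAKY"
    return "UNRELIABLE"
```

Checking for zero passes first matters: 0% would otherwise fall into the UNRELIABLE band, which suggests benching at a higher tier when the real problem is more likely auth or an unsatisfiable gate.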
Recipe 7 — Cross-tier compare (head-to-head)
Goal: investigate one specific scenario when the matrix from `eval scenarios` shows divergence.
A divergent verdict (`DIVERGE-B-PASS`) tells you Haiku can’t satisfy the gate but Opus can. Now you have a real data point for the “ship at fast?” decision: probably no, unless you can rewrite the gate.
Recipe 8 — Routine that calls another routine
Goal: compose a complex workflow from smaller, individually-tested routines. Each sub-routine has its own gates, costs, baseline.
Why compose
- Each sub-routine has its own benched baseline. A regression in `summarize-events` shows up as a regression in any composed routine that calls it, so you don’t have to re-bench the parent.
- Author-crew context is preserved per call. If `summarize-events` lives in the `quality` crew, calling it from an `engineering` routine still runs with `quality`’s persona.
- Cycle detection at save time: A → B → A is rejected. Maximum nested depth is 10 (PIPELINES.md §6.4).
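A save-time check for cycles and nesting depth could be sketched in Python as a depth-first walk over the call graph. This is a hedged illustration, not the platform's implementation: `validate_call_graph` and the dictionary representation are hypothetical, and counting the root routine as depth 1 is an assumption about how the limit of 10 is measured.

```python
MAX_DEPTH = 10  # from the limit stated above; root counted as depth 1 (assumption)


def validate_call_graph(calls: dict[str, list[str]], root: str) -> None:
    """Raise ValueError if the routines reachable from root form a cycle or nest too deep.

    calls maps a routine slug to the slugs of the routines it calls.
    """

    def visit(name: str, ancestors: list[str]) -> None:
        if name in ancestors:
            cycle = ancestors[ancestors.index(name):] + [name]
            raise ValueError("cycle detected: " + " -> ".join(cycle))
        if len(ancestors) + 1 > MAX_DEPTH:
            raise ValueError(f"nested depth exceeds {MAX_DEPTH} at {name!r}")
        for callee in calls.get(name, []):
            visit(callee, ancestors + [name])

    visit(root, [])


# A parent that calls one tested sub-routine passes the check.
validate_call_graph({"parent": ["summarize-events"], "summarize-events": []}, "parent")
```

Tracking the current ancestor chain rather than a global visited set means a sub-routine called from two different branches is allowed, and only a true loop back to an ancestor is rejected.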
What’s intentionally NOT in the cookbook
These exist in the routines documentation but aren’t recipes here because they don’t change the patterns above:
- Schedules: see routine schedules CLI. Add a cron and you have a periodic version of any recipe above.
- Webhooks: see routine webhooks CLI. HMAC-signed event triggers for any recipe above.
- Wait points: human approval gates between steps. See routines guide.