> ## Documentation Index
> Fetch the complete documentation index at: https://docs.crewship.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Routines cookbook

> Worked examples and patterns for production-ready routines: gates, tier escalation, outcomes rubrics, eval-driven promotion

This cookbook collects the patterns we've seen pay off when promoting an AI workflow from prototype to production. Each example is a complete, runnable routine — copy, paste, adjust slugs to your crew, save with `crewship routine save`, validate with `crewship eval scenarios`.

If you're new to routines, start with the [routines guide](/guides/routines) for the conceptual model. This page assumes you know what `agent_run`, `complexity`, and `validation` are.

## How to read these recipes

Every recipe has the same anatomy:

1. **Goal** — what the routine does and when it's worth using.
2. **DSL** — the full JSON definition. No omissions, no `...`.
3. **Gates explained** — why each `must_contain` / schema / `outcomes.criteria` is there. The gates are the design — copying the prompt without the gates doesn't reproduce the routine.
4. **Failure modes & fixes** — what goes wrong on weak tiers and the typical resolution.

<Note>
  The gates are the design. Copying a recipe's prompt without its `validation` / `outcomes` block does **not** reproduce the routine — the gates are what make the output reliable enough to depend on.
</Note>

Every routine here can be benched in one command:

```bash theme={null}
crewship routine bench <slug> --runs 10
```

That gives you the pass-rate + cost variance you need to decide between `complexity: fast` (Haiku, cheap) and `complexity: smart` (Opus, robust).

***

## Recipe 1 — Strict JSON extraction

**Goal**: turn freeform text into a typed JSON object that the next step can `transform` over without parsing prose. The canonical "make this prompt deterministic enough to be a function" pattern.

````json theme={null}
{
  "dsl_version": "1.0",
  "name": "extract-order",
  "display_name": "Extract order details",
  "estimated_cost_usd": 0.002,
  "max_cost_usd": 0.50,
  "credentials_required": [{ "type": "anthropic" }],
  "inputs": [
    { "name": "order_text", "type": "string", "required": true }
  ],
  "outputs": [
    { "name": "order", "type": "object" }
  ],
  "steps": [
    {
      "id": "extract",
      "type": "agent_run",
      "agent_slug": "viktor",
      "complexity": "fast",
      "on_fail": "escalate_tier",
      "prompt": "Reshape the order summary into a JSON object with EXACTLY these keys:\n  - \"item\":       string\n  - \"qty\":        integer\n  - \"unit_price\": number\n  - \"currency\":   ISO-4217 string\nOutput ONLY the raw JSON object. No prose, no code fences.\n\nOrder:\n{{ inputs.order_text }}",
      "validation": {
        "min_length": 30,
        "max_length": 400,
        "must_contain": ["{", "}", "\"item\"", "\"qty\"", "\"unit_price\"", "\"currency\""],
        "must_not_contain": ["```", "Here is", "API_KEY=", "Bearer "]
      }
    }
  ]
}
````

### Gates explained

* **`must_contain` on `{`, `}`, `"item"`, `"qty"`, `"unit_price"`, `"currency"`** — anchors on JSON structure + every required key name. Catches the most common failure: model emits prose like *"Sure, here's the JSON: ..."* before the actual object.
* **`must_not_contain` on ` ``` `, `Here is`** — blocks code-fence wrappers and conversational lead-ins.
* **`max_length: 400`** — caps verbosity drift. A weak tier sometimes ignores "no prose" and adds a paragraph of explanation; the cap trips it.
* **`validation.schema` (JSON Schema draft-2020-12)** — full schema walk is enforced when set, compiled and cached per definition by `internal/pipeline/executor_validate.go`. Use it when you want type-checking; the substring anchors above are the cheap-but-effective first layer that short-circuits the expensive compile + walk on obvious failures.

### Failure modes & fixes

| Symptom                                     | Likely cause                 | Fix                                                                               |
| ------------------------------------------- | ---------------------------- | --------------------------------------------------------------------------------- |
| `output missing required token: "currency"` | Worker forgot a field        | Add the missing key to the prompt's "EXACTLY these keys" list more explicitly     |
| `output contains banned token: ` \`\`\`     | Wrapped in code fences       | Add `"do NOT use code fences"` to prompt; prompt-engineering, not gate-tightening |
| Cost cap exceeded on smart tier             | `max_cost_usd: 0.05` too low | Bump to 0.50 — Claude Code overhead is \~\$0.05–0.10 per step                     |

***

## Recipe 2 — Cross-family rubric grading

**Goal**: a fast worker drafts the output, a smart grader scores it on a strict rubric, the loop iterates if the worker misses the rubric. Mitigates self-preference bias by using a different grader family.

````json theme={null}
{
  "dsl_version": "1.0",
  "name": "graded-summary",
  "display_name": "Cross-family graded summary",
  "estimated_cost_usd": 0.005,
  "max_cost_usd": 1.50,
  "credentials_required": [{ "type": "anthropic" }],
  "inputs": [
    { "name": "topic", "type": "string", "required": true }
  ],
  "outputs": [
    { "name": "summary", "type": "string" }
  ],
  "steps": [
    {
      "id": "summarize",
      "type": "agent_run",
      "agent_slug": "daniel",
      "complexity": "fast",
      "on_fail": "retry_step",
      "prompt": "Write a 3-bullet summary of the topic. Each bullet starts with '- ', between 5 and 25 words. No preamble.\n\nTopic:\n{{ inputs.topic }}",
      "validation": {
        "min_length": 30,
        "max_length": 1200,
        "must_contain": ["- "],
        "must_not_contain": ["```", "API_KEY="]
      },
      "outcomes": {
        "grader_agent_slug": "eva",
        "max_iterations": 3,
        "on_fail": "abort",
        "criteria": [
          { "name": "exactly_three_bullets", "rule": "Exactly three lines starting with '- '." },
          { "name": "each_bullet_in_range",  "rule": "Each bullet line contains 5-25 words." },
          { "name": "covers_topic",          "rule": "Across the bullets, at least two distinct facts from the topic appear." },
          { "name": "no_invented_facts",     "rule": "No bullet introduces a fact not present in the topic input." }
        ]
      }
    }
  ]
}
````

### Why this pattern

A bare gate (`must_contain: ["- "]`) catches the bullet marker but not "did the worker actually summarise the topic, or just emit three placeholder bullets?" The rubric's `covers_topic` + `no_invented_facts` criteria are what prove the output is grounded. The grader is a separate `agent_slug` (`eva`, Sonnet) so its self-preference bias doesn't pile onto the worker's (`daniel`, Haiku).

### Gotchas

* **`max_iterations: 3`** — the loop is bounded so a stubborn worker can't burn unbounded tokens. After 3 attempts, `on_fail: abort` lets the run fail honestly rather than ship a bad summary.
* **`max_cost_usd: 1.50`** — outcomes loops can iterate up to 3× the worker cost + 1 grader call per iteration. Budget accordingly.
* **Don't put more than \~10 criteria in one rubric** — the grader's verdict gets noisy. Split into two graded steps if you need finer-grained rubric.

***

## Recipe 3 — Tier escalation on cost guardrail

**Goal**: pin a routine to the cheapest tier that satisfies the gate, but escalate automatically if the cheap tier fails. Production routines should have this — it's how you catch a model regression without a 4am page.

````json theme={null}
{
  "dsl_version": "1.0",
  "name": "auto-escalate-classifier",
  "display_name": "Sentiment classifier (auto-escalating)",
  "estimated_cost_usd": 0.005,
  "max_cost_usd": 0.50,
  "credentials_required": [{ "type": "anthropic" }],
  "execution_tier": {
    "preferred": "fast",
    "fallback": ["moderate", "smart"]
  },
  "inputs": [
    { "name": "text", "type": "string", "required": true }
  ],
  "steps": [
    {
      "id": "classify",
      "type": "agent_run",
      "agent_slug": "daniel",
      "complexity": "fast",
      "on_fail": "escalate_tier",
      "prompt": "Classify sentiment as positive, negative, or neutral. Output exactly: `sentiment: <label>`\n\nText:\n{{ inputs.text }}",
      "validation": {
        "min_length": 18,
        "max_length": 50,
        "must_contain": ["sentiment:"],
        "must_not_contain": ["```", "I think", "API_KEY="]
      }
    }
  ]
}
````

### Escalation flow

1. Worker runs at `fast` tier (Haiku). Output fails `must_contain` because it added "I think the sentiment is..."
2. `on_fail: escalate_tier` → executor walks `execution_tier.fallback`, retries on `moderate` (Sonnet).
3. Sonnet's output passes the gate. Run completes with `cost_usd` ≈ Haiku-cost + Sonnet-cost.
4. Journal records both attempts so you can see which tier actually satisfied the gate.

### When NOT to use auto-escalation

* **Critical-output routines where wrong-but-confident is worse than failed.** Rubric-graded scenarios with `outcomes.on_fail: abort` are better — failing loud is preferable to silently spending more.
* **Cost-budget-pinned routines** where you'd rather see `FAILED: cost cap exceeded` than auto-bump to a \$0.20 tier you didn't budget for.

***

## Recipe 4 — DAG with deterministic transform plumbing

**Goal**: combine non-LLM steps (`http`, `transform`, `code`) with LLM steps to keep cost down where determinism is achievable.

````json theme={null}
{
  "dsl_version": "1.0",
  "name": "fetch-extract-summarize",
  "display_name": "Fetch JSON, project field, summarize",
  "estimated_cost_usd": 0.003,
  "max_cost_usd": 0.50,
  "egress_targets": ["api.example.com"],
  "credentials_required": [{ "type": "anthropic" }],
  "inputs": [
    { "name": "endpoint", "type": "string", "required": true }
  ],
  "steps": [
    {
      "id": "fetch",
      "type": "http",
      "http": {
        "method": "GET",
        "url": "{{ inputs.endpoint }}",
        "max_response_bytes": 200000,
        "success_codes": [200]
      },
      "timeout_seconds": 30
    },
    {
      "id": "project_title",
      "type": "transform",
      "needs": ["fetch"],
      "transform": {
        "input": "{{ steps.fetch.output }}",
        "expression": ".data.title"
      }
    },
    {
      "id": "summarize",
      "type": "agent_run",
      "agent_slug": "filip",
      "complexity": "fast",
      "needs": ["project_title"],
      "on_fail": "escalate_tier",
      "prompt": "One-sentence summary of the document title (max 20 words):\n\n{{ steps.project_title.output }}",
      "validation": {
        "min_length": 5,
        "max_length": 400,
        "must_not_contain": ["```", "API_KEY="]
      }
    }
  ]
}
````

### Why the transform step matters

Without `project_title`, the LLM step would see the entire JSON response (often KBs of data) and either truncate context or summarise the wrong field. A deterministic `transform` step (jq projection) reduces cost AND eliminates a class of hallucination — the LLM only sees what we explicitly extract.

### Egress allowlist

`egress_targets: ["api.example.com"]` is enforced at runtime. A typo'd URL host fails at the http step rather than going to a different server.

<Warning>
  **Always set `egress_targets`** — leaving it unset opens the routine to SSRF if any input is template-substituted into a URL.
</Warning>

***

## Recipe 5 — Idempotent routine with concurrency key

**Goal**: a webhook-triggered routine that should never double-execute on retransmission. The pattern follows the standard webhook idempotency model — pair an idempotency key with a concurrency limit so retries are safe and a burst of duplicate events doesn't fan out to N parallel executions.

```json theme={null}
{
  "dsl_version": "1.0",
  "name": "webhook-process-order",
  "display_name": "Process order webhook",
  "concurrency_key": "{{ inputs.order_id }}",
  "max_concurrent": 1,
  "estimated_cost_usd": 0.002,
  "max_cost_usd": 0.50,
  "credentials_required": [{ "type": "anthropic" }],
  "inputs": [
    { "name": "order_id",  "type": "string", "required": true },
    { "name": "raw_payload","type": "string", "required": true }
  ],
  "steps": [
    {
      "id": "validate_and_enrich",
      "type": "agent_run",
      "agent_slug": "viktor",
      "complexity": "fast",
      "prompt": "Parse the order webhook payload and emit a JSON status line.\n\n{{ inputs.raw_payload }}",
      "validation": {
        "min_length": 5,
        "max_length": 600,
        "must_not_contain": ["API_KEY="]
      }
    }
  ]
}
```

### Idempotency vs concurrency — what's the difference?

* **`concurrency_key`** gates *parallel* runs: if two requests with the same key arrive at the same time, the second one waits (or 429s) until the first finishes.
* **`Idempotency-Key` HTTP header** dedupes *across time*: a second request with the same key (within the TTL) returns the original `run_id` with `status=DEDUPED` instead of executing again.

Use both for a webhook handler. The concurrency key prevents a burst from fanning out; the idempotency header prevents a delayed retry from re-running an already-completed job.

### `concurrency_key` validation

The platform fails fast when a non-empty `concurrency_key` template renders to an empty string. Example: a routine declares `concurrency_key: "{{ inputs.order_id }}"` but the caller triggers without supplying `order_id` — the rendered key is `""`, which would otherwise be treated as "no gate" and silently allow unlimited parallelism. The executor instead returns:

```
pipeline: concurrency_key rendered to empty value (referenced input missing or empty): template "{{ inputs.order_id }}"
```

This protects against a class of denial-of-self bugs where a misconfigured caller bypasses the per-tenant gate the routine author asked for. If you genuinely want a static gate, use a literal value (e.g. `"global"`); if you want no gate, omit the field entirely.

### Triggering safely

```bash theme={null}
curl -X POST https://your-crewship/api/v1/workspaces/$WS/pipelines/webhook-process-order/run \
  -H "Authorization: Bearer $TOKEN" \
  -H "Idempotency-Key: order-12345-attempt-1" \
  -d '{"inputs":{"order_id":"12345","raw_payload":"..."}}'
```

If the webhook upstream re-fires with the same `Idempotency-Key`, the second request returns `status=DEDUPED` and the original run id. No double-charge, no double-fulfilment.

***

## Recipe 6 — Eval-driven promotion to production

**Goal**: take a hand-written routine and decide, with data, whether to ship it on `fast` or `smart` tier. This is the workflow PRD §18 calls "operator promotion path."

```bash theme={null}
# 1. Author the routine and save it. `routine save` always runs the
#    test_run gate inline before persisting (no skip flag) — a failing
#    test_run aborts the save. --author-crew takes the owning crew_id.
crewship routine save --name='my-summarizer' --definition=routine.json \
  --author-crew=<crew_id>

# 2. Bench at fast tier — 10 runs to characterise
crewship routine bench my-summarizer --runs 10 --tier-override fast

# Output:
#   Pass rate:  9/10  (90%)
#   Cost:       total $0.0521  /  mean $0.0052  /  p95 $0.0080
#   Duration:   p50 2400ms  /  p95 4100ms  /  max 5200ms
#   Verdict:    PRODUCTION_READY (≥90% pass rate over 10 runs)

# 3. Bench at smart tier for comparison
crewship routine bench my-summarizer --runs 10 --tier-override smart

# Output:
#   Pass rate:  10/10  (100%)
#   Cost:       total $0.4900 / mean $0.0490
#   Verdict:    PRODUCTION_READY

# 4. Decide: 90% at $0.005 vs 100% at $0.049 — fast tier is 10x cheaper.
#    Ship at fast. Save the matrix as a regression baseline.
crewship eval baseline save my-summarizer-v1 \
  --scenarios my-summarizer --tiers fast --runs 10

# 5. In CI, on every PR that touches the routine or its prompts:
crewship eval baseline diff my-summarizer-v1 \
  --scenarios my-summarizer --tiers fast --runs 10
# Exits 1 if pass rate dropped beyond --tolerance (default ±10pp)
```

### What to do at each verdict

| Bench verdict                     | Action                                                                                                                |
| --------------------------------- | --------------------------------------------------------------------------------------------------------------------- |
| `PRODUCTION_READY` (≥90%) at fast | Ship. Save baseline. Add to CI.                                                                                       |
| `FLAKY` (70-90%) at fast          | Look at fail breakdown. Top reason `cost-cap` → bump cap. Top reason `rubric-fail` → tighten prompt or escalate tier. |
| `UNRELIABLE` (\<70%) at fast      | Don't ship at fast. Bench at moderate. If still \<70%, the gate is unsatisfiable — rewrite.                           |
| `BROKEN` (0%)                     | Auth issue, unsatisfiable gate, or always-failing guardrail. Check `--cooldown-ms 1000` to rule out rate limiting.    |

***

## Recipe 7 — Cross-tier compare (head-to-head)

**Goal**: investigate one specific scenario when the matrix from `eval scenarios` shows divergence.

```bash theme={null}
crewship eval compare my-summarizer --tier-a fast --tier-b smart -f markdown
```

Output (markdown, paste into PR description):

````markdown theme={null}
## Eval compare — `my-summarizer` (DIVERGE-B-PASS)

| Side | Tier  | Status     | Cost (USD) | Duration (ms) |
| ---  | ---   | ---        | ---        | ---           |
| A    | fast  | FAILED     | $0.0080    | 4200          |
| B    | smart | COMPLETED  | $0.0490    | 5800          |

### Side A output
```
[paste of Haiku output that failed the gate]
```

### Side B output
```
[paste of Opus output that passed]
```

### Errors
- **A** at `summarize`: outcomes failed: bullet 2 contains 32 words (max 25)
````

The verdict (`DIVERGE-B-PASS`) tells you Haiku can't satisfy the gate but Opus can. Now you have a real data point for the "ship at fast?" decision: probably no, unless you can rewrite the gate.

***

## Recipe 8 — Routine that calls another routine

**Goal**: compose a complex workflow from smaller, individually-tested routines. Each sub-routine has its own gates, costs, baseline.

````json theme={null}
{
  "dsl_version": "1.0",
  "name": "weekly-digest-pipeline",
  "display_name": "Weekly digest (composes extract + summarize + format)",
  "estimated_cost_usd": 0.020,
  "max_cost_usd": 2.00,
  "credentials_required": [{ "type": "anthropic" }],
  "inputs": [
    { "name": "week", "type": "string", "required": true }
  ],
  "steps": [
    {
      "id": "extract",
      "type": "call_pipeline",
      "pipeline_slug": "extract-week-events",
      "inputs": { "week": "{{ inputs.week }}" }
    },
    {
      "id": "summarize",
      "type": "call_pipeline",
      "needs": ["extract"],
      "pipeline_slug": "summarize-events",
      "inputs": { "events": "{{ steps.extract.output }}" }
    },
    {
      "id": "format",
      "type": "agent_run",
      "agent_slug": "nela",
      "complexity": "fast",
      "needs": ["summarize"],
      "prompt": "Format as Slack-ready markdown:\n\n{{ steps.summarize.output }}",
      "validation": {
        "min_length": 100,
        "max_length": 4000,
        "must_contain": ["##"],
        "must_not_contain": ["```", "API_KEY="]
      }
    }
  ]
}
````

### Why compose

* Each sub-routine has its own benched baseline. A regression in `summarize-events` shows up as a regression in any composed routine that calls it — you don't have to re-bench the parent.
* Author-crew context is preserved per call. If `summarize-events` lives in the `quality` crew, calling it from a `engineering` routine still runs with `quality`'s persona.
* Cycle detection at save time: A → B → A is rejected. Maximum nested depth is 10 (PIPELINES.md §6.4).

***

## What's intentionally NOT in the cookbook

These exist in the routines documentation but aren't recipes here because they don't change the patterns above:

* **Schedules** — see [routine schedules CLI](/cli/routine#crewship-routine-schedules). Add a cron and you have a periodic version of any recipe above.
* **Webhooks** — see [routine webhooks CLI](/cli/routine#crewship-routine-webhooks). HMAC-signed event triggers for any recipe above.
* **Wait points** — human approval gates between steps. See [routines guide](/guides/routines).

The recipes here are the LLM-step shapes; the trigger surfaces are orthogonal.
