> ## Documentation Index
> Fetch the complete documentation index at: https://docs.crewship.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Paymaster

> Cost ledger and hierarchical budget enforcement for every LLM call the platform makes.

# Paymaster

Paymaster owns cost accounting and budget enforcement. It is the read/write side of the `cost_ledger` and `budget_limits` tables (migration 52, extended by migration 62) and it ships as middleware on the [LLM call stack](/guides/llm-middleware). Every priced LLM round-trip produces exactly one `cost_ledger` row plus a `llm.call` journal entry; metered rows additionally emit `cost.incurred`, and either side can emit `budget.warning` / `budget.exceeded` when limits are hit.

Paymaster separates two billing models that **cannot share a single \$ aggregate**:

* **Metered** — pay-per-token API-key calls. Real \$ math, real budgets.
* **Flat-rate** — subscription credentials (Anthropic Max, ChatGPT Plus, etc.). The \$200/mo is paid up front; the marginal cost of one more token is \$0 from our perspective.

Mixing them in the same KPI silently misleads operators, so the ledger row carries a `billing_mode` column and the UI keeps the two surfaces disjoint: KPIs and budgets show **metered** spend; subscriptions get a separate panel with call/token counts but no \$ figure.

## Data model

<Accordion title="cost_ledger + budget_limits schema (migration 52, extended by v62)">
  ```sql theme={null}
  -- One row per LLM round-trip. Columns added in migration 62 marked [v62].
  CREATE TABLE cost_ledger (
    id                    TEXT PRIMARY KEY,
    workspace_id          TEXT NOT NULL,
    crew_id               TEXT,
    agent_id              TEXT,
    mission_id            TEXT,
    provider              TEXT NOT NULL,
    model                 TEXT NOT NULL,
    input_tokens          INTEGER NOT NULL DEFAULT 0,
    output_tokens         INTEGER NOT NULL DEFAULT 0,
    cached_input_tokens   INTEGER NOT NULL DEFAULT 0,
    cache_creation_tokens INTEGER NOT NULL DEFAULT 0,
    cost_usd              REAL NOT NULL DEFAULT 0,
    tags                  TEXT NOT NULL DEFAULT '{}',
    ts                    TEXT NOT NULL,
    -- [v62] Billing-mode discriminator. 'metered' (default) | 'flat_rate'.
    billing_mode          TEXT NOT NULL DEFAULT 'metered',
    -- [v62] Live rate-limit signal harvested from upstream response headers.
    -- NULL when the call had no rate-limit headers (e.g. local Ollama, errors).
    quota_remaining_pct   REAL,
    quota_window          TEXT,             -- 'requests_per_min' | 'tokens_per_min' | 'input_tokens_per_min' | 'output_tokens_per_min' (most-restrictive)
    -- [v62] Display label for flat-rate rows. NULL for metered.
    subscription_plan     TEXT,
    -- [v62] Rate-card snapshot at write time. A future pricing.go
    -- change cannot retroactively rewrite history.
    -- Stored as plain REAL (nullable, no DEFAULT); the writer always
    -- supplies a non-NULL value so the column is effectively populated.
    rate_input_per_m      REAL,
    rate_output_per_m     REAL,
    rate_cached_in_per_m  REAL,
    rate_cache_write_per_m REAL NOT NULL DEFAULT 0,
    -- [v62] Provenance label per row. UI never renders a $ figure
    -- without telling the operator how trustworthy it is.
    cost_confidence       TEXT NOT NULL DEFAULT 'estimate'
  );

  -- [v62] Partial index for the Subscriptions panel — covers only flat-rate rows
  -- (most rows are 'metered'), so the index stays small and Subscriptions queries
  -- avoid a full-ledger scan. The (workspace_id, billing_mode, ts DESC) shape
  -- means "newest flat-rate row in this workspace" is an index-only seek.
  CREATE INDEX idx_cost_billing_mode
    ON cost_ledger (workspace_id, billing_mode, ts DESC)
    WHERE billing_mode = 'flat_rate';

  -- Optional per-scope limit. Walked hierarchically before every call.
  CREATE TABLE budget_limits (
    id           TEXT PRIMARY KEY,
    workspace_id TEXT NOT NULL,
    scope_kind   TEXT NOT NULL CHECK (scope_kind IN ('workspace','crew','mission','agent')),
    scope_id     TEXT NOT NULL,
    window       TEXT NOT NULL CHECK (window IN ('hour','day','week','month','mission')),
    limit_usd    REAL NOT NULL,
    mode         TEXT NOT NULL CHECK (mode IN ('soft','hard','tiered')),
    enabled      INTEGER NOT NULL DEFAULT 1
  );
  ```
</Accordion>

## Budget scopes

Budgets apply at four levels, narrowing from workspace down to agent. All applicable budgets are checked before every call; the most restrictive breach wins:

| ScopeKind   | Applies when the call's scope has...     | Example                                            |
| ----------- | ---------------------------------------- | -------------------------------------------------- |
| `workspace` | (always, when `scope_id = workspace_id`) | "This workspace cannot spend more than \$500/day." |
| `crew`      | `crew_id` set                            | "backend-team is capped at \$50/hour."             |
| `mission`   | `mission_id` set                         | "MIS-42 must finish under \$20 total."             |
| `agent`     | `agent_id` set                           | "viktor cannot exceed \$5/day."                    |

Windows: `hour`, `day`, `week`, `month`, and `mission` (window-less; sums the whole mission regardless of duration). Hour/day/week/month are calendar-aligned in UTC -- dashboards reset on the hour, at 00:00 UTC, Monday 00:00 UTC, and the 1st respectively.

## Enforcement modes

```go theme={null}
const (
    ModeSoft   EnforcementMode = "soft"   // never blocks, reports state=exceeded
    ModeHard   EnforcementMode = "hard"   // blocks at 100%, no warn tier
    ModeTiered EnforcementMode = "tiered" // warn at 80%, block at 100%
)
```

* **soft** -- `Enforce` never returns an error. Emits `budget.warning` when over 100% so the UI still paints it red. Use for cost-curious teams that want visibility without the platform pulling the plug.
* **hard** -- `BudgetExceededError` at 100%. No warn band. Emits `budget.exceeded`.
* **tiered** -- emits `budget.warning` at 80% (no block) and `budget.exceeded` at 100% (block). The default for new budgets; gives operators a window to react.

A tiered budget ties one emit per tick to the state it's in, so a once-warned call stays in the journal even if the subsequent call pushes it over the line.

## Billing modes

Every `cost_ledger` row is tagged `metered` or `flat_rate`. The orchestrator picks the mode from the credential type before exec and signals it to the sidecar via `CREWSHIP_BILLING_MODE` (and `CREWSHIP_SUBSCRIPTION_PLAN` for flat-rate). The sidecar copies the values onto every `POST /api/v1/internal/cost/record` it sends to the server.

| Mode        | When                                                                              | `cost_usd`                                  | `cost_confidence`       | Budgets evaluated?                |
| ----------- | --------------------------------------------------------------------------------- | ------------------------------------------- | ----------------------- | --------------------------------- |
| `metered`   | API-key credentials (Anthropic, OpenAI, Google, xAI, DeepSeek, Mistral)           | from `Estimate` (or provider-priced inline) | `precise` or `estimate` | yes                               |
| `flat_rate` | Subscription credentials (Claude Max, ChatGPT Plus, GitHub Copilot, Cursor, etc.) | **forced to `0`** in `Record`               | **forced to `unknown`** | no — `Middleware` skips `Enforce` |

`Record` enforces the flat-rate invariants regardless of caller input — the LiteLLM-style "NULL beats fake-\$0" pattern, adapted to SQLite's NOT NULL. A flat-rate row is still a complete audit trail (which credential, which agent, which mission, when, how many tokens), it just refuses to invent a dollar figure.

The `cost.incurred` journal entry is **suppressed** for flat-rate rows because no money was incurred. `llm.call` still emits, with the summary explicitly tagged `(flat-rate · <plan>)` so operators glancing at the timeline aren't misled.

## Cost confidence

Three values, written to every row:

* **`precise`** — provider returned a usage block we trust verbatim (e.g. Anthropic non-streaming).
* **`estimate`** — we computed it from `pricing.go` rates against parsed token counts.
* **`unknown`** — flat-rate row, or we couldn't parse the response body (streaming with no body buffer, opaque proxy).

The dashboard surfaces these as badges next to \$ figures. Aggregate KPIs deliberately do **not** mix confidences — when a rollup spans multiple bands, the lowest band wins so the operator sees the floor of certainty, not the average.

## Quota enforcement

The sidecar parses two header conventions on every upstream response:

* `anthropic-ratelimit-{requests,input-tokens,output-tokens}-{limit,remaining,reset}`
* `x-ratelimit-{remaining-tokens,remaining-requests,reset-tokens}` (OpenAI / xAI / DeepSeek)

The **most restrictive** window's `remaining / limit` ratio is surfaced as a single fraction. Once parsed, the sidecar calls `EnforceQuota` (after `Record`):

```go theme={null}
func EnforceQuota(
    ctx context.Context,
    j journal.Emitter,
    scope Scope,
    remainingPct float64,   // 0.0–1.0; smallest of the parsed windows
    window QuotaWindow,     // requests_per_min | tokens_per_min | input_tokens_per_min | output_tokens_per_min — display only
    hadStatus429 bool,      // upstream sent 429 → authoritative "out"
) error
```

| Signal                                                         | Effect                                                                                                                                                        |
| -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `hadStatus429 == true`                                         | emits `budget.exceeded` with `reason: "quota_exhausted"`, returns `*BudgetExceededError` (same shape as the \$-side so existing error handling keeps working) |
| `remainingPct < 0.20`                                          | emits `budget.warning`, returns `nil` (does **not** block)                                                                                                    |
| `remainingPct >= 0.20` or `0` (header missing) or `1.0` (full) | no-op                                                                                                                                                         |

Quota signals are **per-call, ephemeral** — they are journaled but not persisted to a side table. `EnforceQuota` is called *after* `Record` so the row that triggered the signal is still present in `cost_ledger` for forensics.

This is the only enforcement path that fires for flat-rate calls: subscription users hit provider rate limits before they hit a \$ ceiling. The error type is intentionally identical so `*BudgetExceededError` handlers do not need to know whether the cause was \$ or quota.

## Prompt-cache token flow

Cache token counts are provider-reported and surface as discounted ledger lines. The wire path:

<Steps>
  <Step title="Provider returns usage">
    Anthropic emits `cache_read_input_tokens` + `cache_creation_input_tokens` in the `usage` block (both non-streaming and `message_start` SSE events); OpenAI emits `prompt_tokens_details.cached_tokens` (no separate creation counter — caching is opaque on their side).
  </Step>

  <Step title="internal/llm/{anthropic,openai}.go parses">
    The fields into `Response.CachedInputToks` + `Response.CacheCreationToks`.
  </Step>

  <Step title="internal/llm/middleware.go plumbs">
    Them through `providerCaller` / `streamCaller` into `paymaster.CallResponse.CachedInputTokens` + `.CacheCreationTokens`.
  </Step>

  <Step title="paymaster.Middleware records">
    Them into the matching `cost_ledger` columns alongside the rate-card snapshot. Cache reads bill at `rate_cached_in_per_m`; cache creates at `rate_cache_write_per_m`.
  </Step>

  <Step title="telemetry.LLMMiddleware stamps">
    The same counts as `gen_ai.usage.cached_input_tokens` + `gen_ai.usage.cache_creation_tokens` on the `llm.call` span — see [Tracing](/guides/tracing).
  </Step>
</Steps>

The 10% discount for Anthropic cache reads (a `claude-haiku-4-5` cache read bills $0.10/M vs $1.00/M base input) lands automatically because the rate-card column is its own field; no per-row multiplier needed.

A regression earlier in this PR's history shipped the parser without consuming the cache fields — every workspace recorded zero cached tokens and the discount never triggered. The `cost_ledger.cached_input_tokens` column has been present since v62 but was effectively dead until this wiring landed.

## Cost controls

Beyond the ledger and budgets, a few knobs cut token spend directly without touching output quality.

### Cache-stable system prompt

Anthropic bills a cache read at \~10% of a fresh input token, but only when the request's **prefix is byte-identical** to a recent one. The orchestrator keeps the `--system-prompt` (preamble → persona → skills → memory files) stable within a day and pushes everything that changes turn-to-turn — conversation history, episodic recall, the `[MEMORY NUDGE]` and `[COST AWARENESS]` blocks — into a `[SESSION CONTEXT]` wrapper on the **user** message instead. Dynamic content in the system prefix used to force a full-price re-read on every message; moving it out lets the large stable prefix hit the cache. Watch `cost_ledger.cache_creation_tokens` (should fall) vs `cached_input_tokens` (should rise) across a multi-turn session to confirm the win.

### Auxiliary & sub-agent model routing

Auxiliary work (memory consolidation, Keeper/behavior evaluators, memory-health, negative learning) defaults to `anthropic/claude-haiku-4-5`. Override any slot per deployment — point it at a cheaper or local model without a redeploy:

```bash theme={null}
CREWSHIP_AUX_<SLOT>_PROVIDER   # curator|keeper|behavior|memory_health|negative|fallback
CREWSHIP_AUX_<SLOT>_MODEL
CREWSHIP_AUX_<SLOT>_TIMEOUT     # Go duration, e.g. 5s; bad value is ignored, deadline preserved
```

Example: `CREWSHIP_AUX_CURATOR_MODEL=llama3.1 CREWSHIP_AUX_CURATOR_PROVIDER=ollama`.

Delegated **worker** sub-agents (a lead handing a bounded sub-task to a crew member) rarely need the top model tier. Set `CREWSHIP_SUBAGENT_MODEL` to route them to a cheaper model; the lead planner keeps its own configured model. Unset = each agent uses its own model (no downgrade), so this is a pure opt-in.

### Turn caps & loop guard

Two independent guards stop a confused agent from burning budget:

* **`--max-turns`** — the adapter-side loop cap. Interactive runs default to `50`; **scheduled / routine runs default to `20`** (`orchestrator.RoutineMaxTurns`) because an unattended job with no human watching is where a stuck loop runs up the bill unnoticed. Override it per-run from the CLI: `crewship run <agent> --max-turns 15 "..."` (also on `crewship ask`). `0` (the default) leaves the built-in cap in place.
* **Loop guard** — the orchestrator aborts a run when the agent repeats the **identical** tool call (same name + same input) five times in a row, emitting an `exec.command` journal entry with `reason: loop_detected` so cost triage can tell a runaway loop apart from a real crash. Adapter-agnostic — it observes the normalized tool-call stream, so it covers every CLI.

## Rate-card snapshotting

Every metered row writes the `rate_*_per_m` columns from `RateCard(provider, model)` at the moment of write. When `pricing.go` is later edited (a provider repriced, a new model was added), historical rollups stay consistent: the old row continues to imply its old rate. New rollup queries can either trust `cost_usd` directly or recompute from `rate_*_per_m * tokens` to verify nothing has drifted.

Zero is the honest value for ollama/local (genuinely free) and for unknown-provider lookups (we couldn't price). The columns are stored as nullable `REAL` (the v62 migration adds them without a `NOT NULL` / `DEFAULT` constraint), but the writer always supplies an explicit zero in those cases, so historical rows never observe NULL in practice — readers can treat the column as effectively populated.

## Pricing — current rates

The canonical rate card lives in `internal/paymaster/pricing.go`. Adding a model is a one-line change. Prices are USD per 1,000,000 tokens and were verified against provider docs on 2026-04-30.

<Accordion title="Rate card — per-provider prices (USD / 1M tokens)">
  | Provider / Model                    | Input | Output | Cached input | Cache write |
  | ----------------------------------- | ----- | ------ | ------------ | ----------- |
  | **Anthropic**                       |       |        |              |             |
  | `claude-opus-4-7`                   | 5.00  | 25.00  | 0.50         | 6.25        |
  | `claude-sonnet-4-6`                 | 3.00  | 15.00  | 0.30         | 3.75        |
  | `claude-haiku-4-5`                  | 1.00  | 5.00   | 0.10         | 1.25        |
  | **OpenAI**                          |       |        |              |             |
  | `gpt-5.5` (alias `gpt-5`)           | 4.00  | 24.00  | 0.40         | 4.00        |
  | `gpt-5.4-mini` (alias `gpt-5-mini`) | 0.75  | 4.50   | 0.075        | 0.75        |
  | `gpt-5.4-nano` (alias `gpt-5-nano`) | 0.10  | 0.40   | 0.01         | 0.10        |
  | `o3-pro`                            | 20.00 | 80.00  | 5.00         | 20.00       |
  | **Google**                          |       |        |              |             |
  | `gemini-2.5-pro`                    | 2.50  | 15.00  | 0.625        | 2.50        |
  | `gemini-2.5-flash`                  | 0.10  | 0.40   | 0.025        | 0.10        |
  | `gemini-2.5-flash-lite`             | 0.05  | 0.20   | 0.0125       | 0.05        |
  | **xAI**                             |       |        |              |             |
  | `grok-4.20`                         | 2.00  | 6.00   | 2.00         | 2.00        |
  | `grok-4.1-fast`                     | 0.20  | 0.50   | 0.20         | 0.20        |
  | **DeepSeek**                        |       |        |              |             |
  | `deepseek-chat` (V3)                | 0.252 | 0.378  | 0.0252       | 0.252       |
  | `deepseek-reasoner`                 | 0.70  | 2.50   | 0.07         | 0.70        |
  | **Mistral**                         |       |        |              |             |
  | `codestral-2508`                    | 0.30  | 0.90   | 0.30         | 0.30        |
  | **Local**                           |       |        |              |             |
  | `ollama/*`, `local/*`               | 0     | 0      | 0            | 0           |
</Accordion>

<Note>
  **Anthropic Opus 4.7 was repriced to \$5/\$25 from the old \$15/\$75 in early 2026.** The Crewship rate card was updated on 2026-04-30; ledger rows written before that date carry the old rate as a snapshot in `rate_*_per_m` and will continue to read out at the historical price. New rows use the corrected rate.
</Note>

When `(provider, model)` misses the table, lookup falls through to a **per-provider ceiling** in `providerFallback` rather than \$0 — picking the most-expensive known tier means budgets warn correctly even on premium reasoning models, at the cost of a mild over-estimate for cheap unknown ones. Better to over-estimate than to silently bill \$0.

## The call path

`paymaster.Middleware(next, j, db)` wraps an `LLMCaller`. Order of operations:

<Steps>
  <Step title="Enforce">
    `Check` resolves every applicable budget; a hard/tiered breach short-circuits with `*BudgetExceededError`. The underlying call is never made, no ledger row is written, but `budget.exceeded` lands in the journal.
  </Step>

  <Step title="Call">
    Delegate to the inner caller.
  </Step>

  <Step title="Record">
    Fill `cost_usd` via `Estimate(provider, model, tokens)` if the provider didn't price the call inline, then `INSERT` the row and emit `llm.call` + `cost.incurred`.
  </Step>
</Steps>

Billing errors do NOT fail the call: if the provider succeeded the response is returned unchanged and the operator sees the audit gap in logs. If the call failed, Paymaster still attempts a partial-billing record so the trail isn't lost.

See [llm.middleware.go](/guides/llm-middleware) for the full stack (`telemetry -> paymaster -> lookout -> raw`).

## Read endpoints

* `GET /api/v1/paymaster/spend/by-crew?range=7d` -- one row per crew (metered only).
* `GET /api/v1/paymaster/spend/by-agent/{crewId}?range=24h` -- per-agent rollup inside a crew (metered only).
* `GET /api/v1/paymaster/spend/by-mission/{missionId}` -- single-mission total (metered only).
* `GET /api/v1/paymaster/top-spenders?limit=10&range=7d` -- highest-cost scopes (metered only).
* `GET /api/v1/paymaster/subscriptions?range=30d` -- flat-rate credentials grouped by `(plan, provider)` with call count, token count, last-used timestamp. **No \$ figure** — see "Billing modes" above for why.

Window syntax: `range=1h|24h|7d|30d` or explicit `since=<RFC3339>&until=<RFC3339>`. Default is 7 days.

The four `spend/*` endpoints filter `WHERE billing_mode = 'metered'` so flat-rate rows do not silently inflate metered totals. The Subscriptions panel is the only place flat-rate rows surface to the operator.

## Internal write endpoint

The sidecar writes ledger rows over the IPC socket, never directly. The handler is `POST /api/v1/internal/cost/record` (auth: `X-Internal-Token`). Authoritative scope (workspace / crew / agent IDs) comes from the sidecar's `IPCConfig` set at exec time, so an agent that captured the token still cannot forge cross-tenant attribution. Mirrors the `/internal/journal/emit` security model — see [Internal IPC API](/api-reference/internal).

Workspace isolation: the `by-agent` and `by-mission` handlers reject cross-tenant IDs with 404 (same shape as "not found") so existence isn't leaked. See `crewBelongsToWorkspace` / `missionBelongsToWorkspace` in `internal/api/paymaster_handler.go`.

## CLI

```bash theme={null}
crewship paymaster by-crew --range 7d
crewship paymaster by-agent backend-team --range 24h
crewship paymaster top --limit 5 --range 30d
```

Full reference: [`crewship paymaster`](/cli/paymaster).

## Creating a budget

<Warning>
  Budgets are inserted directly into `budget_limits` today -- there is no dedicated API. The UI's settings panel drives the same table.
</Warning>

Example workspace-wide hard cap of \$250/day:

```sql theme={null}
INSERT INTO budget_limits (id, workspace_id, scope_kind, scope_id, window, limit_usd, mode, enabled)
VALUES ('bl_default_daily', 'ws_123', 'workspace', 'ws_123', 'day', 250.0, 'hard', 1);
```

## Sidecar coverage of agent CLI calls

Paymaster wraps `llm.Provider.Complete()` for everything called from the Go side (summaries, Keeper, consolidation, quartermaster judge). For agent CLI calls (`claude`, `gemini`, `cursor-agent`, `droid`) the sidecar is the metering point: the proxy intercepts the outbound HTTPS request, parses the response body for usage tokens, harvests rate-limit headers, and async-POSTs a Call to `POST /api/v1/internal/cost/record`. The handler validates auth and calls `Record` + `EnforceQuota`.

What the sidecar can and cannot see:

| Traffic shape                                          | Token counts                                          | Quota signals                        |
| ------------------------------------------------------ | ----------------------------------------------------- | ------------------------------------ |
| Non-streaming JSON (Anthropic / OpenAI / Google)       | **yes** — per-provider body parser                    | **yes**                              |
| Streaming (`text/event-stream`)                        | **no** — body is opaque without a parallel SSE parser | **yes** — headers are always present |
| Pinned-cert TLS CONNECT (Claude Code Max OAuth tunnel) | **no** — body is end-to-end encrypted                 | **no**                               |

For pinned-cert tunnels the sidecar emits a "credential was used" attribution row (`billing_mode=flat_rate`, no tokens, `cost_confidence=unknown`) so the Subscriptions panel still shows usage. We deliberately do not MITM with a custom CA — CLIs with pinned certs would reject it, and the honest answer is that the body is private.

Body buffering uses `io.TeeReader` with a **10 MB cap** so streaming UX is preserved while a pathological upstream cannot OOM the sidecar. Reads beyond the cap pass through to the agent unmodified but are not parsed.

Look for the `llm.call` journal entry: if it exists, the call was metered. If you see `exec.command` with `claude` and no `llm.call`, the OAuth tunnel was used and the row will land as flat-rate.

## Gotchas

* **Soft budgets still emit entries.** A soft budget at 120% still writes `budget.warning` to the journal -- the UI uses this to paint red. Don't filter out "soft" when looking for "who was over".
* **Stream path is unmetered.** `wrappedProvider.Stream()` bypasses the full middleware (token counts arrive in the terminal `message_delta` event, which the sync `CallResponse` shape can't carry). Streaming callers pay through orchestrator-level accounting that predates this package.
* **Pricing is estimated.** `pricing.go` hard-codes per-token prices per (provider, model). When the provider ships a new model the estimate falls to zero until pricing is added -- check `internal/paymaster/pricing.go` after any provider upgrade.

## Related

* [LLM middleware](/guides/llm-middleware) -- layer order and composition.
* [`crewship paymaster`](/cli/paymaster) -- CLI.
* [Paymaster API](/api-reference/paymaster).
* [Crew Journal](/guides/crew-journal) -- `llm.call`, `cost.incurred`, `budget.*` entry types.
