Paymaster

Paymaster owns cost accounting and budget enforcement. It is the read/write side of the cost_ledger and budget_limits tables (migration 52, extended by migration 62) and it ships as middleware on the LLM call stack. Every priced LLM round-trip produces exactly one cost_ledger row plus a llm.call journal entry; metered rows additionally emit cost.incurred, and either side can emit budget.warning / budget.exceeded when limits are hit. Paymaster separates two billing models that cannot share a single $ aggregate:

Metered — pay-per-token API-key calls. Real $ math, real budgets.
Flat-rate — subscription credentials (Anthropic Max, ChatGPT Plus, etc.). The $200/mo is paid up front; the marginal cost of one more token is $0 from our perspective.

Mixing them in the same KPI silently misleads operators, so the ledger row carries a billing_mode column and the UI keeps the two surfaces disjoint: KPIs and budgets show metered spend; subscriptions get a separate panel with call/token counts but no $ figure.

Data model

cost_ledger + budget_limits schema (migration 52, extended by v62)

-- One row per LLM round-trip. Columns added in migration 62 marked [v62].
CREATE TABLE cost_ledger (
  id                    TEXT PRIMARY KEY,
  workspace_id          TEXT NOT NULL,
  crew_id               TEXT,
  agent_id              TEXT,
  mission_id            TEXT,
  provider              TEXT NOT NULL,
  model                 TEXT NOT NULL,
  input_tokens          INTEGER NOT NULL DEFAULT 0,
  output_tokens         INTEGER NOT NULL DEFAULT 0,
  cached_input_tokens   INTEGER NOT NULL DEFAULT 0,
  cache_creation_tokens INTEGER NOT NULL DEFAULT 0,
  cost_usd              REAL NOT NULL DEFAULT 0,
  tags                  TEXT NOT NULL DEFAULT '{}',
  ts                    TEXT NOT NULL,
  -- [v62] Billing-mode discriminator. 'metered' (default) | 'flat_rate'.
  billing_mode          TEXT NOT NULL DEFAULT 'metered',
  -- [v62] Live rate-limit signal harvested from upstream response headers.
  -- NULL when the call had no rate-limit headers (e.g. local Ollama, errors).
  quota_remaining_pct   REAL,
  quota_window          TEXT,             -- 'requests_per_min' | 'tokens_per_min' | 'input_tokens_per_min' | 'output_tokens_per_min' (most-restrictive)
  -- [v62] Display label for flat-rate rows. NULL for metered.
  subscription_plan     TEXT,
  -- [v62] Rate-card snapshot at write time. A future pricing.go
  -- change cannot retroactively rewrite history.
  -- Stored as plain REAL (nullable, no DEFAULT); the writer always
  -- supplies a non-NULL value so the column is effectively populated.
  rate_input_per_m      REAL,
  rate_output_per_m     REAL,
  rate_cached_in_per_m  REAL,
  rate_cache_write_per_m REAL NOT NULL DEFAULT 0,
  -- [v62] Provenance label per row. UI never renders a $ figure
  -- without telling the operator how trustworthy it is.
  cost_confidence       TEXT NOT NULL DEFAULT 'estimate'
);

-- [v62] Partial index for the Subscriptions panel — covers only flat-rate rows
-- (most rows are 'metered'), so the index stays small and Subscriptions queries
-- avoid a full-ledger scan. The (workspace_id, billing_mode, ts DESC) shape
-- means "newest flat-rate row in this workspace" is an index-only seek.
CREATE INDEX idx_cost_billing_mode
  ON cost_ledger (workspace_id, billing_mode, ts DESC)
  WHERE billing_mode = 'flat_rate';

-- Optional per-scope limit. Walked hierarchically before every call.
CREATE TABLE budget_limits (
  id           TEXT PRIMARY KEY,
  workspace_id TEXT NOT NULL,
  scope_kind   TEXT NOT NULL CHECK (scope_kind IN ('workspace','crew','mission','agent')),
  scope_id     TEXT NOT NULL,
  window       TEXT NOT NULL CHECK (window IN ('hour','day','week','month','mission')),
  limit_usd    REAL NOT NULL,
  mode         TEXT NOT NULL CHECK (mode IN ('soft','hard','tiered')),
  enabled      INTEGER NOT NULL DEFAULT 1
);

Budget scopes

Budgets apply at four levels, narrowing from workspace down to agent. All applicable budgets are checked before every call; the most restrictive breach wins:

ScopeKind	Applies when the call’s scope has…	Example
`workspace`	(always, when `scope_id = workspace_id`)	“This workspace cannot spend more than $500/day.”
`crew`	`crew_id` set	”backend-team is capped at $50/hour.”
`mission`	`mission_id` set	”MIS-42 must finish under $20 total.”
`agent`	`agent_id` set	”viktor cannot exceed $5/day.”

Windows: hour, day, week, month, and mission (window-less; sums the whole mission regardless of duration). Hour/day/week/month are calendar-aligned in UTC — dashboards reset on the hour, at 00:00 UTC, Monday 00:00 UTC, and the 1st respectively.

Enforcement modes

const (
    ModeSoft   EnforcementMode = "soft"   // never blocks, reports state=exceeded
    ModeHard   EnforcementMode = "hard"   // blocks at 100%, no warn tier
    ModeTiered EnforcementMode = "tiered" // warn at 80%, block at 100%
)

soft — Enforce never returns an error. Emits budget.warning when over 100% so the UI still paints it red. Use for cost-curious teams that want visibility without the platform pulling the plug.
hard — BudgetExceededError at 100%. No warn band. Emits budget.exceeded.
tiered — emits budget.warning at 80% (no block) and budget.exceeded at 100% (block). The default for new budgets; gives operators a window to react.

A tiered budget ties one emit per tick to the state it’s in, so a once-warned call stays in the journal even if the subsequent call pushes it over the line.

Billing modes

Every cost_ledger row is tagged metered or flat_rate. The orchestrator picks the mode from the credential type before exec and signals it to the sidecar via CREWSHIP_BILLING_MODE (and CREWSHIP_SUBSCRIPTION_PLAN for flat-rate). The sidecar copies the values onto every POST /api/v1/internal/cost/record it sends to the server.

Mode	When	`cost_usd`	`cost_confidence`	Budgets evaluated?
`metered`	API-key credentials (Anthropic, OpenAI, Google, xAI, DeepSeek, Mistral)	from `Estimate` (or provider-priced inline)	`precise` or `estimate`	yes
`flat_rate`	Subscription credentials (Claude Max, ChatGPT Plus, GitHub Copilot, Cursor, etc.)	forced to `0` in `Record`	forced to `unknown`	no — `Middleware` skips `Enforce`

Record enforces the flat-rate invariants regardless of caller input — the LiteLLM-style “NULL beats fake-$0” pattern, adapted to SQLite’s NOT NULL. A flat-rate row is still a complete audit trail (which credential, which agent, which mission, when, how many tokens), it just refuses to invent a dollar figure. The cost.incurred journal entry is suppressed for flat-rate rows because no money was incurred. llm.call still emits, with the summary explicitly tagged (flat-rate · <plan>) so operators glancing at the timeline aren’t misled.

Cost confidence

Three values, written to every row:

precise — provider returned a usage block we trust verbatim (e.g. Anthropic non-streaming).
estimate — we computed it from pricing.go rates against parsed token counts.
unknown — flat-rate row, or we couldn’t parse the response body (streaming with no body buffer, opaque proxy).

The dashboard surfaces these as badges next to $ figures. Aggregate KPIs deliberately do not mix confidences — when a rollup spans multiple bands, the lowest band wins so the operator sees the floor of certainty, not the average.

Quota enforcement

The sidecar parses two header conventions on every upstream response:

anthropic-ratelimit-{requests,input-tokens,output-tokens}-{limit,remaining,reset}
x-ratelimit-{remaining-tokens,remaining-requests,reset-tokens} (OpenAI / xAI / DeepSeek)

The most restrictive window’s remaining / limit ratio is surfaced as a single fraction. Once parsed, the sidecar calls EnforceQuota (after Record):

func EnforceQuota(
    ctx context.Context,
    j journal.Emitter,
    scope Scope,
    remainingPct float64,   // 0.0–1.0; smallest of the parsed windows
    window QuotaWindow,     // requests_per_min | tokens_per_min | input_tokens_per_min | output_tokens_per_min — display only
    hadStatus429 bool,      // upstream sent 429 → authoritative "out"
) error

Signal	Effect
`hadStatus429 == true`	emits `budget.exceeded` with `reason: "quota_exhausted"`, returns `*BudgetExceededError` (same shape as the $-side so existing error handling keeps working)
`remainingPct < 0.20`	emits `budget.warning`, returns `nil` (does not block)
`remainingPct >= 0.20` or `0` (header missing) or `1.0` (full)	no-op

Quota signals are per-call, ephemeral — they are journaled but not persisted to a side table. EnforceQuota is called after Record so the row that triggered the signal is still present in cost_ledger for forensics. This is the only enforcement path that fires for flat-rate calls: subscription users hit provider rate limits before they hit a $ ceiling. The error type is intentionally identical so *BudgetExceededError handlers do not need to know whether the cause was $ or quota.

Prompt-cache token flow

Cache token counts are provider-reported and surface as discounted ledger lines. The wire path:

Provider returns usage

Anthropic emits cache_read_input_tokens + cache_creation_input_tokens in the usage block (both non-streaming and message_start SSE events); OpenAI emits prompt_tokens_details.cached_tokens (no separate creation counter — caching is opaque on their side).

internal/llm/{anthropic,openai}.go parses

The fields into Response.CachedInputToks + Response.CacheCreationToks.

internal/llm/middleware.go plumbs

Them through providerCaller / streamCaller into paymaster.CallResponse.CachedInputTokens + .CacheCreationTokens.

paymaster.Middleware records

Them into the matching cost_ledger columns alongside the rate-card snapshot. Cache reads bill at rate_cached_in_per_m; cache creates at rate_cache_write_per_m.

telemetry.LLMMiddleware stamps

The same counts as gen_ai.usage.cached_input_tokens + gen_ai.usage.cache_creation_tokens on the llm.call span — see Tracing.

The 10% discount for Anthropic cache reads (a claude-haiku-4-5 cache read bills

0.10/M vs

1.00/M base input) lands automatically because the rate-card column is its own field; no per-row multiplier needed. A regression earlier in this PR’s history shipped the parser without consuming the cache fields — every workspace recorded zero cached tokens and the discount never triggered. The cost_ledger.cached_input_tokens column has been present since v62 but was effectively dead until this wiring landed.

Cost controls

Beyond the ledger and budgets, a few knobs cut token spend directly without touching output quality.

Cache-stable system prompt

Anthropic bills a cache read at ~10% of a fresh input token, but only when the request’s prefix is byte-identical to a recent one. The orchestrator keeps the --system-prompt (preamble → persona → skills → memory files) stable within a day and pushes everything that changes turn-to-turn — conversation history, episodic recall, the [MEMORY NUDGE] and [COST AWARENESS] blocks — into a [SESSION CONTEXT] wrapper on the user message instead. Dynamic content in the system prefix used to force a full-price re-read on every message; moving it out lets the large stable prefix hit the cache. Watch cost_ledger.cache_creation_tokens (should fall) vs cached_input_tokens (should rise) across a multi-turn session to confirm the win.

Auxiliary & sub-agent model routing

Auxiliary work (memory consolidation, Keeper/behavior evaluators, memory-health, negative learning) defaults to anthropic/claude-haiku-4-5. Override any slot per deployment — point it at a cheaper or local model without a redeploy:

CREWSHIP_AUX_<SLOT>_PROVIDER   # curator|keeper|behavior|memory_health|negative|fallback
CREWSHIP_AUX_<SLOT>_MODEL
CREWSHIP_AUX_<SLOT>_TIMEOUT     # Go duration, e.g. 5s; bad value is ignored, deadline preserved

Example: CREWSHIP_AUX_CURATOR_MODEL=llama3.1 CREWSHIP_AUX_CURATOR_PROVIDER=ollama. Delegated worker sub-agents (a lead handing a bounded sub-task to a crew member) rarely need the top model tier. Set CREWSHIP_SUBAGENT_MODEL to route them to a cheaper model; the lead planner keeps its own configured model. Unset = each agent uses its own model (no downgrade), so this is a pure opt-in.

Turn caps & loop guard

Two independent guards stop a confused agent from burning budget:

--max-turns — the adapter-side loop cap. Interactive runs default to 50; scheduled / routine runs default to 20 (orchestrator.RoutineMaxTurns) because an unattended job with no human watching is where a stuck loop runs up the bill unnoticed. Override it per-run from the CLI: crewship run <agent> --max-turns 15 "..." (also on crewship ask). 0 (the default) leaves the built-in cap in place.
Loop guard — the orchestrator aborts a run when the agent repeats the identical tool call (same name + same input) five times in a row, emitting an exec.command journal entry with reason: loop_detected so cost triage can tell a runaway loop apart from a real crash. Adapter-agnostic — it observes the normalized tool-call stream, so it covers every CLI.

Rate-card snapshotting

Every metered row writes the rate_*_per_m columns from RateCard(provider, model) at the moment of write. When pricing.go is later edited (a provider repriced, a new model was added), historical rollups stay consistent: the old row continues to imply its old rate. New rollup queries can either trust cost_usd directly or recompute from rate_*_per_m * tokens to verify nothing has drifted. Zero is the honest value for ollama/local (genuinely free) and for unknown-provider lookups (we couldn’t price). The columns are stored as nullable REAL (the v62 migration adds them without a NOT NULL / DEFAULT constraint), but the writer always supplies an explicit zero in those cases, so historical rows never observe NULL in practice — readers can treat the column as effectively populated.

Pricing — current rates

The canonical rate card lives in internal/paymaster/pricing.go. Adding a model is a one-line change. Prices are USD per 1,000,000 tokens and were verified against provider docs on 2026-04-30.

Rate card — per-provider prices (USD / 1M tokens)

Provider / Model	Input	Output	Cached input	Cache write
Anthropic
`claude-opus-4-7`	5.00	25.00	0.50	6.25
`claude-sonnet-4-6`	3.00	15.00	0.30	3.75
`claude-haiku-4-5`	1.00	5.00	0.10	1.25
OpenAI
`gpt-5.5` (alias `gpt-5`)	4.00	24.00	0.40	4.00
`gpt-5.4-mini` (alias `gpt-5-mini`)	0.75	4.50	0.075	0.75
`gpt-5.4-nano` (alias `gpt-5-nano`)	0.10	0.40	0.01	0.10
`o3-pro`	20.00	80.00	5.00	20.00
Google
`gemini-2.5-pro`	2.50	15.00	0.625	2.50
`gemini-2.5-flash`	0.10	0.40	0.025	0.10
`gemini-2.5-flash-lite`	0.05	0.20	0.0125	0.05
xAI
`grok-4.20`	2.00	6.00	2.00	2.00
`grok-4.1-fast`	0.20	0.50	0.20	0.20
DeepSeek
`deepseek-chat` (V3)	0.252	0.378	0.0252	0.252
`deepseek-reasoner`	0.70	2.50	0.07	0.70
Mistral
`codestral-2508`	0.30	0.90	0.30	0.30
Local
`ollama/`, `local/`	0	0	0	0

Anthropic Opus 4.7 was repriced to $5/$25 from the old $15/$75 in early 2026. The Crewship rate card was updated on 2026-04-30; ledger rows written before that date carry the old rate as a snapshot in rate_*_per_m and will continue to read out at the historical price. New rows use the corrected rate.

When (provider, model) misses the table, lookup falls through to a per-provider ceiling in providerFallback rather than $0 — picking the most-expensive known tier means budgets warn correctly even on premium reasoning models, at the cost of a mild over-estimate for cheap unknown ones. Better to over-estimate than to silently bill $0.

The call path

paymaster.Middleware(next, j, db) wraps an LLMCaller. Order of operations:

Enforce

Check resolves every applicable budget; a hard/tiered breach short-circuits with *BudgetExceededError. The underlying call is never made, no ledger row is written, but budget.exceeded lands in the journal.

Call

Delegate to the inner caller.

Record

Fill cost_usd via Estimate(provider, model, tokens) if the provider didn’t price the call inline, then INSERT the row and emit llm.call + cost.incurred.

Billing errors do NOT fail the call: if the provider succeeded the response is returned unchanged and the operator sees the audit gap in logs. If the call failed, Paymaster still attempts a partial-billing record so the trail isn’t lost. See llm.middleware.go for the full stack (telemetry -> paymaster -> lookout -> raw).

Read endpoints

GET /api/v1/paymaster/spend/by-crew?range=7d — one row per crew (metered only).
GET /api/v1/paymaster/spend/by-agent/{crewId}?range=24h — per-agent rollup inside a crew (metered only).
GET /api/v1/paymaster/spend/by-mission/{missionId} — single-mission total (metered only).
GET /api/v1/paymaster/top-spenders?limit=10&range=7d — highest-cost scopes (metered only).
GET /api/v1/paymaster/subscriptions?range=30d — flat-rate credentials grouped by (plan, provider) with call count, token count, last-used timestamp. No $ figure — see “Billing modes” above for why.

Window syntax: range=1h|24h|7d|30d or explicit since=<RFC3339>&until=<RFC3339>. Default is 7 days. The four spend/* endpoints filter WHERE billing_mode = 'metered' so flat-rate rows do not silently inflate metered totals. The Subscriptions panel is the only place flat-rate rows surface to the operator.

Internal write endpoint

The sidecar writes ledger rows over the IPC socket, never directly. The handler is POST /api/v1/internal/cost/record (auth: X-Internal-Token). Authoritative scope (workspace / crew / agent IDs) comes from the sidecar’s IPCConfig set at exec time, so an agent that captured the token still cannot forge cross-tenant attribution. Mirrors the /internal/journal/emit security model — see Internal IPC API. Workspace isolation: the by-agent and by-mission handlers reject cross-tenant IDs with 404 (same shape as “not found”) so existence isn’t leaked. See crewBelongsToWorkspace / missionBelongsToWorkspace in internal/api/paymaster_handler.go.

CLI

crewship paymaster by-crew --range 7d
crewship paymaster by-agent backend-team --range 24h
crewship paymaster top --limit 5 --range 30d

Full reference: crewship paymaster.

Creating a budget

Budgets are inserted directly into budget_limits today — there is no dedicated API. The UI’s settings panel drives the same table.

Example workspace-wide hard cap of $250/day:

INSERT INTO budget_limits (id, workspace_id, scope_kind, scope_id, window, limit_usd, mode, enabled)
VALUES ('bl_default_daily', 'ws_123', 'workspace', 'ws_123', 'day', 250.0, 'hard', 1);

Sidecar coverage of agent CLI calls

Paymaster wraps llm.Provider.Complete() for everything called from the Go side (summaries, Keeper, consolidation, quartermaster judge). For agent CLI calls (claude, gemini, cursor-agent, droid) the sidecar is the metering point: the proxy intercepts the outbound HTTPS request, parses the response body for usage tokens, harvests rate-limit headers, and async-POSTs a Call to POST /api/v1/internal/cost/record. The handler validates auth and calls Record + EnforceQuota. What the sidecar can and cannot see:

Traffic shape	Token counts	Quota signals
Non-streaming JSON (Anthropic / OpenAI / Google)	yes — per-provider body parser	yes
Streaming (`text/event-stream`)	no — body is opaque without a parallel SSE parser	yes — headers are always present
Pinned-cert TLS CONNECT (Claude Code Max OAuth tunnel)	no — body is end-to-end encrypted	no

For pinned-cert tunnels the sidecar emits a “credential was used” attribution row (billing_mode=flat_rate, no tokens, cost_confidence=unknown) so the Subscriptions panel still shows usage. We deliberately do not MITM with a custom CA — CLIs with pinned certs would reject it, and the honest answer is that the body is private. Body buffering uses io.TeeReader with a 10 MB cap so streaming UX is preserved while a pathological upstream cannot OOM the sidecar. Reads beyond the cap pass through to the agent unmodified but are not parsed. Look for the llm.call journal entry: if it exists, the call was metered. If you see exec.command with claude and no llm.call, the OAuth tunnel was used and the row will land as flat-rate.

Gotchas

Soft budgets still emit entries. A soft budget at 120% still writes budget.warning to the journal — the UI uses this to paint red. Don’t filter out “soft” when looking for “who was over”.
Stream path is unmetered. wrappedProvider.Stream() bypasses the full middleware (token counts arrive in the terminal message_delta event, which the sync CallResponse shape can’t carry). Streaming callers pay through orchestrator-level accounting that predates this package.
Pricing is estimated. pricing.go hard-codes per-token prices per (provider, model). When the provider ships a new model the estimate falls to zero until pricing is added — check internal/paymaster/pricing.go after any provider upgrade.

LLM middleware — layer order and composition.
crewship paymaster — CLI.
Paymaster API.
Crew Journal — llm.call, cost.incurred, budget.* entry types.

​Paymaster

​Data model

​Budget scopes

​Enforcement modes

​Billing modes

​Cost confidence

​Quota enforcement

​Prompt-cache token flow

​Cost controls

​Cache-stable system prompt

​Auxiliary & sub-agent model routing

​Turn caps & loop guard

​Rate-card snapshotting

​Pricing — current rates

​The call path

​Read endpoints

​Internal write endpoint

​CLI

​Creating a budget

​Sidecar coverage of agent CLI calls

​Gotchas

​Related

Paymaster

Data model

Budget scopes

Enforcement modes

Billing modes

Cost confidence

Quota enforcement

Prompt-cache token flow

Cost controls

Cache-stable system prompt

Auxiliary & sub-agent model routing

Turn caps & loop guard

Rate-card snapshotting

Pricing — current rates

The call path

Read endpoints

Internal write endpoint

CLI

Creating a budget

Sidecar coverage of agent CLI calls

Gotchas

Related