Paymaster
Paymaster owns cost accounting and budget enforcement. It is the read/write side of thecost_ledger and budget_limits tables (migration 52, extended by migration 62) and it ships as middleware on the LLM call stack. Every priced LLM round-trip produces exactly one cost_ledger row plus a llm.call journal entry; metered rows additionally emit cost.incurred, and either side can emit budget.warning / budget.exceeded when limits are hit.
Paymaster separates two billing models that cannot share a single $ aggregate:
- Metered — pay-per-token API-key calls. Real $ math, real budgets.
- Flat-rate — subscription credentials (Anthropic Max, ChatGPT Plus, etc.). The $200/mo is paid up front; the marginal cost of one more token is $0 from our perspective.
billing_mode column and the UI keeps the two surfaces disjoint: KPIs and budgets show metered spend; subscriptions get a separate panel with call/token counts but no $ figure.
Data model
cost_ledger + budget_limits schema (migration 52, extended by v62)
cost_ledger + budget_limits schema (migration 52, extended by v62)
Budget scopes
Budgets apply at four levels, narrowing from workspace down to agent. All applicable budgets are checked before every call; the most restrictive breach wins:| ScopeKind | Applies when the call’s scope has… | Example |
|---|---|---|
workspace | (always, when scope_id = workspace_id) | “This workspace cannot spend more than $500/day.” |
crew | crew_id set | ”backend-team is capped at $50/hour.” |
mission | mission_id set | ”MIS-42 must finish under $20 total.” |
agent | agent_id set | ”viktor cannot exceed $5/day.” |
hour, day, week, month, and mission (window-less; sums the whole mission regardless of duration). Hour/day/week/month are calendar-aligned in UTC — dashboards reset on the hour, at 00:00 UTC, Monday 00:00 UTC, and the 1st respectively.
Enforcement modes
- soft —
Enforcenever returns an error. Emitsbudget.warningwhen over 100% so the UI still paints it red. Use for cost-curious teams that want visibility without the platform pulling the plug. - hard —
BudgetExceededErrorat 100%. No warn band. Emitsbudget.exceeded. - tiered — emits
budget.warningat 80% (no block) andbudget.exceededat 100% (block). The default for new budgets; gives operators a window to react.
Billing modes
Everycost_ledger row is tagged metered or flat_rate. The orchestrator picks the mode from the credential type before exec and signals it to the sidecar via CREWSHIP_BILLING_MODE (and CREWSHIP_SUBSCRIPTION_PLAN for flat-rate). The sidecar copies the values onto every POST /api/v1/internal/cost/record it sends to the server.
| Mode | When | cost_usd | cost_confidence | Budgets evaluated? |
|---|---|---|---|---|
metered | API-key credentials (Anthropic, OpenAI, Google, xAI, DeepSeek, Mistral) | from Estimate (or provider-priced inline) | precise or estimate | yes |
flat_rate | Subscription credentials (Claude Max, ChatGPT Plus, GitHub Copilot, Cursor, etc.) | forced to 0 in Record | forced to unknown | no — Middleware skips Enforce |
Record enforces the flat-rate invariants regardless of caller input — the LiteLLM-style “NULL beats fake-$0” pattern, adapted to SQLite’s NOT NULL. A flat-rate row is still a complete audit trail (which credential, which agent, which mission, when, how many tokens), it just refuses to invent a dollar figure.
The cost.incurred journal entry is suppressed for flat-rate rows because no money was incurred. llm.call still emits, with the summary explicitly tagged (flat-rate · <plan>) so operators glancing at the timeline aren’t misled.
Cost confidence
Three values, written to every row:precise— provider returned a usage block we trust verbatim (e.g. Anthropic non-streaming).estimate— we computed it frompricing.gorates against parsed token counts.unknown— flat-rate row, or we couldn’t parse the response body (streaming with no body buffer, opaque proxy).
Quota enforcement
The sidecar parses two header conventions on every upstream response:anthropic-ratelimit-{requests,input-tokens,output-tokens}-{limit,remaining,reset}x-ratelimit-{remaining-tokens,remaining-requests,reset-tokens}(OpenAI / xAI / DeepSeek)
remaining / limit ratio is surfaced as a single fraction. Once parsed, the sidecar calls EnforceQuota (after Record):
| Signal | Effect |
|---|---|
hadStatus429 == true | emits budget.exceeded with reason: "quota_exhausted", returns *BudgetExceededError (same shape as the $-side so existing error handling keeps working) |
remainingPct < 0.20 | emits budget.warning, returns nil (does not block) |
remainingPct >= 0.20 or 0 (header missing) or 1.0 (full) | no-op |
EnforceQuota is called after Record so the row that triggered the signal is still present in cost_ledger for forensics.
This is the only enforcement path that fires for flat-rate calls: subscription users hit provider rate limits before they hit a $ ceiling. The error type is intentionally identical so *BudgetExceededError handlers do not need to know whether the cause was $ or quota.
Prompt-cache token flow
Cache token counts are provider-reported and surface as discounted ledger lines. The wire path:Provider returns usage
Anthropic emits
cache_read_input_tokens + cache_creation_input_tokens in the usage block (both non-streaming and message_start SSE events); OpenAI emits prompt_tokens_details.cached_tokens (no separate creation counter — caching is opaque on their side).internal/llm/{anthropic,openai}.go parses
The fields into
Response.CachedInputToks + Response.CacheCreationToks.internal/llm/middleware.go plumbs
Them through
providerCaller / streamCaller into paymaster.CallResponse.CachedInputTokens + .CacheCreationTokens.paymaster.Middleware records
Them into the matching
cost_ledger columns alongside the rate-card snapshot. Cache reads bill at rate_cached_in_per_m; cache creates at rate_cache_write_per_m.telemetry.LLMMiddleware stamps
The same counts as
gen_ai.usage.cached_input_tokens + gen_ai.usage.cache_creation_tokens on the llm.call span — see Tracing.claude-haiku-4-5 cache read bills 1.00/M base input) lands automatically because the rate-card column is its own field; no per-row multiplier needed.
A regression earlier in this PR’s history shipped the parser without consuming the cache fields — every workspace recorded zero cached tokens and the discount never triggered. The cost_ledger.cached_input_tokens column has been present since v62 but was effectively dead until this wiring landed.
Cost controls
Beyond the ledger and budgets, a few knobs cut token spend directly without touching output quality.Cache-stable system prompt
Anthropic bills a cache read at ~10% of a fresh input token, but only when the request’s prefix is byte-identical to a recent one. The orchestrator keeps the--system-prompt (preamble → persona → skills → memory files) stable within a day and pushes everything that changes turn-to-turn — conversation history, episodic recall, the [MEMORY NUDGE] and [COST AWARENESS] blocks — into a [SESSION CONTEXT] wrapper on the user message instead. Dynamic content in the system prefix used to force a full-price re-read on every message; moving it out lets the large stable prefix hit the cache. Watch cost_ledger.cache_creation_tokens (should fall) vs cached_input_tokens (should rise) across a multi-turn session to confirm the win.
Auxiliary & sub-agent model routing
Auxiliary work (memory consolidation, Keeper/behavior evaluators, memory-health, negative learning) defaults toanthropic/claude-haiku-4-5. Override any slot per deployment — point it at a cheaper or local model without a redeploy:
CREWSHIP_AUX_CURATOR_MODEL=llama3.1 CREWSHIP_AUX_CURATOR_PROVIDER=ollama.
Delegated worker sub-agents (a lead handing a bounded sub-task to a crew member) rarely need the top model tier. Set CREWSHIP_SUBAGENT_MODEL to route them to a cheaper model; the lead planner keeps its own configured model. Unset = each agent uses its own model (no downgrade), so this is a pure opt-in.
Turn caps & loop guard
Two independent guards stop a confused agent from burning budget:--max-turns— the adapter-side loop cap. Interactive runs default to50; scheduled / routine runs default to20(orchestrator.RoutineMaxTurns) because an unattended job with no human watching is where a stuck loop runs up the bill unnoticed. Override it per-run from the CLI:crewship run <agent> --max-turns 15 "..."(also oncrewship ask).0(the default) leaves the built-in cap in place.- Loop guard — the orchestrator aborts a run when the agent repeats the identical tool call (same name + same input) five times in a row, emitting an
exec.commandjournal entry withreason: loop_detectedso cost triage can tell a runaway loop apart from a real crash. Adapter-agnostic — it observes the normalized tool-call stream, so it covers every CLI.
Rate-card snapshotting
Every metered row writes therate_*_per_m columns from RateCard(provider, model) at the moment of write. When pricing.go is later edited (a provider repriced, a new model was added), historical rollups stay consistent: the old row continues to imply its old rate. New rollup queries can either trust cost_usd directly or recompute from rate_*_per_m * tokens to verify nothing has drifted.
Zero is the honest value for ollama/local (genuinely free) and for unknown-provider lookups (we couldn’t price). The columns are stored as nullable REAL (the v62 migration adds them without a NOT NULL / DEFAULT constraint), but the writer always supplies an explicit zero in those cases, so historical rows never observe NULL in practice — readers can treat the column as effectively populated.
Pricing — current rates
The canonical rate card lives ininternal/paymaster/pricing.go. Adding a model is a one-line change. Prices are USD per 1,000,000 tokens and were verified against provider docs on 2026-04-30.
Rate card — per-provider prices (USD / 1M tokens)
Rate card — per-provider prices (USD / 1M tokens)
| Provider / Model | Input | Output | Cached input | Cache write |
|---|---|---|---|---|
| Anthropic | ||||
claude-opus-4-7 | 5.00 | 25.00 | 0.50 | 6.25 |
claude-sonnet-4-6 | 3.00 | 15.00 | 0.30 | 3.75 |
claude-haiku-4-5 | 1.00 | 5.00 | 0.10 | 1.25 |
| OpenAI | ||||
gpt-5.5 (alias gpt-5) | 4.00 | 24.00 | 0.40 | 4.00 |
gpt-5.4-mini (alias gpt-5-mini) | 0.75 | 4.50 | 0.075 | 0.75 |
gpt-5.4-nano (alias gpt-5-nano) | 0.10 | 0.40 | 0.01 | 0.10 |
o3-pro | 20.00 | 80.00 | 5.00 | 20.00 |
gemini-2.5-pro | 2.50 | 15.00 | 0.625 | 2.50 |
gemini-2.5-flash | 0.10 | 0.40 | 0.025 | 0.10 |
gemini-2.5-flash-lite | 0.05 | 0.20 | 0.0125 | 0.05 |
| xAI | ||||
grok-4.20 | 2.00 | 6.00 | 2.00 | 2.00 |
grok-4.1-fast | 0.20 | 0.50 | 0.20 | 0.20 |
| DeepSeek | ||||
deepseek-chat (V3) | 0.252 | 0.378 | 0.0252 | 0.252 |
deepseek-reasoner | 0.70 | 2.50 | 0.07 | 0.70 |
| Mistral | ||||
codestral-2508 | 0.30 | 0.90 | 0.30 | 0.30 |
| Local | ||||
ollama/*, local/* | 0 | 0 | 0 | 0 |
Anthropic Opus 4.7 was repriced to $5/$25 from the old $15/$75 in early 2026. The Crewship rate card was updated on 2026-04-30; ledger rows written before that date carry the old rate as a snapshot in
rate_*_per_m and will continue to read out at the historical price. New rows use the corrected rate.(provider, model) misses the table, lookup falls through to a per-provider ceiling in providerFallback rather than $0 — picking the most-expensive known tier means budgets warn correctly even on premium reasoning models, at the cost of a mild over-estimate for cheap unknown ones. Better to over-estimate than to silently bill $0.
The call path
paymaster.Middleware(next, j, db) wraps an LLMCaller. Order of operations:
Enforce
Check resolves every applicable budget; a hard/tiered breach short-circuits with *BudgetExceededError. The underlying call is never made, no ledger row is written, but budget.exceeded lands in the journal.telemetry -> paymaster -> lookout -> raw).
Read endpoints
GET /api/v1/paymaster/spend/by-crew?range=7d— one row per crew (metered only).GET /api/v1/paymaster/spend/by-agent/{crewId}?range=24h— per-agent rollup inside a crew (metered only).GET /api/v1/paymaster/spend/by-mission/{missionId}— single-mission total (metered only).GET /api/v1/paymaster/top-spenders?limit=10&range=7d— highest-cost scopes (metered only).GET /api/v1/paymaster/subscriptions?range=30d— flat-rate credentials grouped by(plan, provider)with call count, token count, last-used timestamp. No $ figure — see “Billing modes” above for why.
range=1h|24h|7d|30d or explicit since=<RFC3339>&until=<RFC3339>. Default is 7 days.
The four spend/* endpoints filter WHERE billing_mode = 'metered' so flat-rate rows do not silently inflate metered totals. The Subscriptions panel is the only place flat-rate rows surface to the operator.
Internal write endpoint
The sidecar writes ledger rows over the IPC socket, never directly. The handler isPOST /api/v1/internal/cost/record (auth: X-Internal-Token). Authoritative scope (workspace / crew / agent IDs) comes from the sidecar’s IPCConfig set at exec time, so an agent that captured the token still cannot forge cross-tenant attribution. Mirrors the /internal/journal/emit security model — see Internal IPC API.
Workspace isolation: the by-agent and by-mission handlers reject cross-tenant IDs with 404 (same shape as “not found”) so existence isn’t leaked. See crewBelongsToWorkspace / missionBelongsToWorkspace in internal/api/paymaster_handler.go.
CLI
crewship paymaster.
Creating a budget
Example workspace-wide hard cap of $250/day:Sidecar coverage of agent CLI calls
Paymaster wrapsllm.Provider.Complete() for everything called from the Go side (summaries, Keeper, consolidation, quartermaster judge). For agent CLI calls (claude, gemini, cursor-agent, droid) the sidecar is the metering point: the proxy intercepts the outbound HTTPS request, parses the response body for usage tokens, harvests rate-limit headers, and async-POSTs a Call to POST /api/v1/internal/cost/record. The handler validates auth and calls Record + EnforceQuota.
What the sidecar can and cannot see:
| Traffic shape | Token counts | Quota signals |
|---|---|---|
| Non-streaming JSON (Anthropic / OpenAI / Google) | yes — per-provider body parser | yes |
Streaming (text/event-stream) | no — body is opaque without a parallel SSE parser | yes — headers are always present |
| Pinned-cert TLS CONNECT (Claude Code Max OAuth tunnel) | no — body is end-to-end encrypted | no |
billing_mode=flat_rate, no tokens, cost_confidence=unknown) so the Subscriptions panel still shows usage. We deliberately do not MITM with a custom CA — CLIs with pinned certs would reject it, and the honest answer is that the body is private.
Body buffering uses io.TeeReader with a 10 MB cap so streaming UX is preserved while a pathological upstream cannot OOM the sidecar. Reads beyond the cap pass through to the agent unmodified but are not parsed.
Look for the llm.call journal entry: if it exists, the call was metered. If you see exec.command with claude and no llm.call, the OAuth tunnel was used and the row will land as flat-rate.
Gotchas
- Soft budgets still emit entries. A soft budget at 120% still writes
budget.warningto the journal — the UI uses this to paint red. Don’t filter out “soft” when looking for “who was over”. - Stream path is unmetered.
wrappedProvider.Stream()bypasses the full middleware (token counts arrive in the terminalmessage_deltaevent, which the syncCallResponseshape can’t carry). Streaming callers pay through orchestrator-level accounting that predates this package. - Pricing is estimated.
pricing.gohard-codes per-token prices per (provider, model). When the provider ships a new model the estimate falls to zero until pricing is added — checkinternal/paymaster/pricing.goafter any provider upgrade.
Related
- LLM middleware — layer order and composition.
crewship paymaster— CLI.- Paymaster API.
- Crew Journal —
llm.call,cost.incurred,budget.*entry types.