Documentation Index
Fetch the complete documentation index at: https://docs.crewship.ai/llms.txt
Use this file to discover all available pages before exploring further.
Paymaster
Paymaster owns cost accounting and budget enforcement. It is the read/write side of the cost_ledger and budget_limits tables (migration 52, extended by migration 62) and it ships as middleware on the LLM call stack. Every priced LLM round-trip produces exactly one cost_ledger row plus a llm.call journal entry; metered rows additionally emit cost.incurred, and either side can emit budget.warning / budget.exceeded when limits are hit.
Paymaster separates two billing models that cannot share a single $ aggregate:
- Metered — pay-per-token API-key calls. Real $ math, real budgets.
- Flat-rate — subscription credentials (Anthropic Max, ChatGPT Plus, etc.). The $200/mo is paid up front; the marginal cost of one more token is $0 from our perspective.
Mixing them in the same KPI silently misleads operators, so the ledger row carries a billing_mode column and the UI keeps the two surfaces disjoint: KPIs and budgets show metered spend; subscriptions get a separate panel with call/token counts but no $ figure.
Data model
-- One row per LLM round-trip. Columns added in migration 62 marked [v62].
CREATE TABLE cost_ledger (
id TEXT PRIMARY KEY,
workspace_id TEXT NOT NULL,
crew_id TEXT,
agent_id TEXT,
mission_id TEXT,
provider TEXT NOT NULL,
model TEXT NOT NULL,
input_tokens INTEGER NOT NULL DEFAULT 0,
output_tokens INTEGER NOT NULL DEFAULT 0,
cached_input_tokens INTEGER NOT NULL DEFAULT 0,
cache_creation_tokens INTEGER NOT NULL DEFAULT 0,
cost_usd REAL NOT NULL DEFAULT 0,
tags TEXT NOT NULL DEFAULT '{}',
ts TEXT NOT NULL,
-- [v62] Billing-mode discriminator. 'metered' (default) | 'flat_rate'.
billing_mode TEXT NOT NULL DEFAULT 'metered',
-- [v62] Live rate-limit signal harvested from upstream response headers.
-- NULL when the call had no rate-limit headers (e.g. local Ollama, errors).
quota_remaining_pct REAL,
quota_window TEXT, -- 'requests_per_min' | 'tokens_per_min' | 'input_tokens_per_min' | 'output_tokens_per_min' (most-restrictive)
-- [v62] Display label for flat-rate rows. NULL for metered.
subscription_plan TEXT,
-- [v62] Rate-card snapshot at write time (Langfuse pattern). A future
-- pricing.go change cannot retroactively rewrite history.
-- Stored as plain REAL (nullable, no DEFAULT); the writer always
-- supplies a non-NULL value so the column is effectively populated.
rate_input_per_m REAL,
rate_output_per_m REAL,
rate_cached_in_per_m REAL,
rate_cache_write_per_m REAL NOT NULL DEFAULT 0,
-- [v62] Provenance label per row (Helicone pattern). UI never renders a
-- $ figure without telling the operator how trustworthy it is.
cost_confidence TEXT NOT NULL DEFAULT 'estimate'
);
-- [v62] Partial index for the Subscriptions panel — covers only flat-rate rows
-- (most rows are 'metered'), so the index stays small and Subscriptions queries
-- avoid a full-ledger scan. The (workspace_id, billing_mode, ts DESC) shape
-- means "newest flat-rate row in this workspace" is an index-only seek.
CREATE INDEX idx_cost_billing_mode
ON cost_ledger (workspace_id, billing_mode, ts DESC)
WHERE billing_mode = 'flat_rate';
-- Optional per-scope limit. Walked hierarchically before every call.
CREATE TABLE budget_limits (
id TEXT PRIMARY KEY,
workspace_id TEXT NOT NULL,
scope_kind TEXT NOT NULL CHECK (scope_kind IN ('workspace','crew','mission','agent')),
scope_id TEXT NOT NULL,
window TEXT NOT NULL CHECK (window IN ('hour','day','week','month','mission')),
limit_usd REAL NOT NULL,
mode TEXT NOT NULL CHECK (mode IN ('soft','hard','tiered')),
enabled INTEGER NOT NULL DEFAULT 1
);
Budget scopes
Budgets apply at four levels, narrowing from workspace down to agent. All applicable budgets are checked before every call; the most restrictive breach wins:
| ScopeKind | Applies when the call’s scope has… | Example |
|---|
workspace | (always, when scope_id = workspace_id) | “This workspace cannot spend more than $500/day.” |
crew | crew_id set | ”backend-team is capped at $50/hour.” |
mission | mission_id set | ”MIS-42 must finish under $20 total.” |
agent | agent_id set | ”viktor cannot exceed $5/day.” |
Windows: hour, day, week, month, and mission (window-less; sums the whole mission regardless of duration). Hour/day/week/month are calendar-aligned in UTC — dashboards reset on the hour, at 00:00 UTC, Monday 00:00 UTC, and the 1st respectively.
Enforcement modes
const (
ModeSoft EnforcementMode = "soft" // never blocks, reports state=exceeded
ModeHard EnforcementMode = "hard" // blocks at 100%, no warn tier
ModeTiered EnforcementMode = "tiered" // warn at 80%, block at 100%
)
- soft —
Enforce never returns an error. Emits budget.warning when over 100% so the UI still paints it red. Use for cost-curious teams that want visibility without the platform pulling the plug.
- hard —
BudgetExceededError at 100%. No warn band. Emits budget.exceeded.
- tiered — emits
budget.warning at 80% (no block) and budget.exceeded at 100% (block). The default for new budgets; gives operators a window to react.
A tiered budget ties one emit per tick to the state it’s in, so a once-warned call stays in the journal even if the subsequent call pushes it over the line.
Billing modes
Every cost_ledger row is tagged metered or flat_rate. The orchestrator picks the mode from the credential type before exec and signals it to the sidecar via CREWSHIP_BILLING_MODE (and CREWSHIP_SUBSCRIPTION_PLAN for flat-rate). The sidecar copies the values onto every POST /api/v1/internal/cost/record it sends to the server.
| Mode | When | cost_usd | cost_confidence | Budgets evaluated? |
|---|
metered | API-key credentials (Anthropic, OpenAI, Google, xAI, DeepSeek, Mistral) | from Estimate (or provider-priced inline) | precise or estimate | yes |
flat_rate | Subscription credentials (Claude Max, ChatGPT Plus, GitHub Copilot, Cursor, etc.) | forced to 0 in Record | forced to unknown | no — Middleware skips Enforce |
Record enforces the flat-rate invariants regardless of caller input — the LiteLLM-style “NULL beats fake-$0” pattern, adapted to SQLite’s NOT NULL. A flat-rate row is still a complete audit trail (which credential, which agent, which mission, when, how many tokens), it just refuses to invent a dollar figure.
The cost.incurred journal entry is suppressed for flat-rate rows because no money was incurred. llm.call still emits, with the summary explicitly tagged (flat-rate · <plan>) so operators glancing at the timeline aren’t misled.
Cost confidence
Three values, written to every row:
precise — provider returned a usage block we trust verbatim (e.g. Anthropic non-streaming).
estimate — we computed it from pricing.go rates against parsed token counts.
unknown — flat-rate row, or we couldn’t parse the response body (streaming with no body buffer, opaque proxy).
The dashboard surfaces these as badges next to $ figures. Aggregate KPIs deliberately do not mix confidences — when a rollup spans multiple bands, the lowest band wins so the operator sees the floor of certainty, not the average.
Quota enforcement
The sidecar parses two header conventions on every upstream response:
anthropic-ratelimit-{requests,input-tokens,output-tokens}-{limit,remaining,reset}
x-ratelimit-{remaining-tokens,remaining-requests,reset-tokens} (OpenAI / xAI / DeepSeek)
The most restrictive window’s remaining / limit ratio is surfaced as a single fraction. Once parsed, the sidecar calls EnforceQuota (after Record):
func EnforceQuota(
ctx context.Context,
j journal.Emitter,
scope Scope,
remainingPct float64, // 0.0–1.0; smallest of the parsed windows
window QuotaWindow, // requests_per_min | tokens_per_min | input_tokens_per_min | output_tokens_per_min — display only
hadStatus429 bool, // upstream sent 429 → authoritative "out"
) error
| Signal | Effect |
|---|
hadStatus429 == true | emits budget.exceeded with reason: "quota_exhausted", returns *BudgetExceededError (same shape as the $-side so existing error handling keeps working) |
remainingPct < 0.20 | emits budget.warning, returns nil (does not block) |
remainingPct >= 0.20 or 0 (header missing) or 1.0 (full) | no-op |
Quota signals are per-call, ephemeral — they are journaled but not persisted to a side table. EnforceQuota is called after Record so the row that triggered the signal is still present in cost_ledger for forensics.
This is the only enforcement path that fires for flat-rate calls: subscription users hit provider rate limits before they hit a $ ceiling. The error type is intentionally identical so *BudgetExceededError handlers do not need to know whether the cause was $ or quota.
Prompt-cache token flow
Cache token counts are provider-reported and surface as discounted ledger lines. The wire path:
- Provider returns usage — Anthropic emits
cache_read_input_tokens + cache_creation_input_tokens in the usage block (both non-streaming and message_start SSE events); OpenAI emits prompt_tokens_details.cached_tokens (no separate creation counter — caching is opaque on their side).
internal/llm/{anthropic,openai}.go parses the fields into Response.CachedInputToks + Response.CacheCreationToks.
internal/llm/middleware.go plumbs them through providerCaller / streamCaller into paymaster.CallResponse.CachedInputTokens + .CacheCreationTokens.
paymaster.Middleware records them into the matching cost_ledger columns alongside the rate-card snapshot. Cache reads bill at rate_cached_in_per_m; cache creates at rate_cache_write_per_m.
telemetry.LLMMiddleware stamps the same counts as gen_ai.usage.cached_input_tokens + gen_ai.usage.cache_creation_tokens on the llm.call span — see Tracing.
The 10% discount for Anthropic cache reads (a claude-haiku-4-5 cache read bills 0.10/Mvs1.00/M base input) lands automatically because the rate-card column is its own field; no per-row multiplier needed.
A regression earlier in this PR’s history shipped the parser without consuming the cache fields — every workspace recorded zero cached tokens and the discount never triggered. The cost_ledger.cached_input_tokens column has been present since v62 but was effectively dead until this wiring landed.
Rate-card snapshotting (Langfuse pattern)
Every metered row writes the rate_*_per_m columns from RateCard(provider, model) at the moment of write. When pricing.go is later edited (a provider repriced, a new model was added), historical rollups stay consistent: the old row continues to imply its old rate. New rollup queries can either trust cost_usd directly or recompute from rate_*_per_m * tokens to verify nothing has drifted.
Zero is the honest value for ollama/local (genuinely free) and for unknown-provider lookups (we couldn’t price). The columns are stored as nullable REAL (the v62 migration adds them without a NOT NULL / DEFAULT constraint), but the writer always supplies an explicit zero in those cases, so historical rows never observe NULL in practice — readers can treat the column as effectively populated.
Pricing — current rates
The canonical rate card lives in internal/paymaster/pricing.go. Adding a model is a one-line change. Prices are USD per 1,000,000 tokens and were verified against provider docs on 2026-04-30.
| Provider / Model | Input | Output | Cached input | Cache write |
|---|
| Anthropic | | | | |
claude-opus-4-7 | 5.00 | 25.00 | 0.50 | 6.25 |
claude-sonnet-4-6 | 3.00 | 15.00 | 0.30 | 3.75 |
claude-haiku-4-5 | 1.00 | 5.00 | 0.10 | 1.25 |
| OpenAI | | | | |
gpt-5.5 (alias gpt-5) | 4.00 | 24.00 | 0.40 | 4.00 |
gpt-5.4-mini (alias gpt-5-mini) | 0.75 | 4.50 | 0.075 | 0.75 |
gpt-5.4-nano (alias gpt-5-nano) | 0.10 | 0.40 | 0.01 | 0.10 |
o3-pro | 20.00 | 80.00 | 5.00 | 20.00 |
| Google | | | | |
gemini-2.5-pro | 2.50 | 15.00 | 0.625 | 2.50 |
gemini-2.5-flash | 0.10 | 0.40 | 0.025 | 0.10 |
gemini-2.5-flash-lite | 0.05 | 0.20 | 0.0125 | 0.05 |
| xAI | | | | |
grok-4.20 | 2.00 | 6.00 | 2.00 | 2.00 |
grok-4.1-fast | 0.20 | 0.50 | 0.20 | 0.20 |
| DeepSeek | | | | |
deepseek-chat (V3) | 0.252 | 0.378 | 0.0252 | 0.252 |
deepseek-reasoner | 0.70 | 2.50 | 0.07 | 0.70 |
| Mistral | | | | |
codestral-2508 | 0.30 | 0.90 | 0.30 | 0.30 |
| Local | | | | |
ollama/*, local/* | 0 | 0 | 0 | 0 |
Anthropic Opus 4.7 was repriced to $5/$25 from the old $15/$75 in early 2026. The Crewship rate card was updated on 2026-04-30; ledger rows written before that date carry the old rate as a snapshot in rate_*_per_m and will continue to read out at the historical price. New rows use the corrected rate.
When (provider, model) misses the table, lookup falls through to a per-provider ceiling in providerFallback rather than $0 — picking the most-expensive known tier means budgets warn correctly even on premium reasoning models, at the cost of a mild over-estimate for cheap unknown ones. Better to over-estimate than to silently bill $0.
The call path
paymaster.Middleware(next, j, db) wraps an LLMCaller. Order of operations:
- Enforce —
Check resolves every applicable budget; a hard/tiered breach short-circuits with *BudgetExceededError. The underlying call is never made, no ledger row is written, but budget.exceeded lands in the journal.
- Call — delegate to the inner caller.
- Record — fill
cost_usd via Estimate(provider, model, tokens) if the provider didn’t price the call inline, then INSERT the row and emit llm.call + cost.incurred.
Billing errors do NOT fail the call: if the provider succeeded the response is returned unchanged and the operator sees the audit gap in logs. If the call failed, Paymaster still attempts a partial-billing record so the trail isn’t lost.
See llm.middleware.go for the full stack (telemetry -> paymaster -> lookout -> raw).
Read endpoints
GET /api/v1/paymaster/spend/by-crew?range=7d — one row per crew (metered only).
GET /api/v1/paymaster/spend/by-agent/{crewId}?range=24h — per-agent rollup inside a crew (metered only).
GET /api/v1/paymaster/spend/by-mission/{missionId} — single-mission total (metered only).
GET /api/v1/paymaster/top-spenders?limit=10&range=7d — highest-cost scopes (metered only).
GET /api/v1/paymaster/subscriptions?range=30d — flat-rate credentials grouped by (plan, provider) with call count, token count, last-used timestamp. No $ figure — see “Billing modes” above for why.
Window syntax: range=1h|24h|7d|30d or explicit since=<RFC3339>&until=<RFC3339>. Default is 7 days.
The four spend/* endpoints filter WHERE billing_mode = 'metered' so flat-rate rows do not silently inflate metered totals. The Subscriptions panel is the only place flat-rate rows surface to the operator.
Internal write endpoint
The sidecar writes ledger rows over the IPC socket, never directly. The handler is POST /api/v1/internal/cost/record (auth: X-Internal-Token). Authoritative scope (workspace / crew / agent IDs) comes from the sidecar’s IPCConfig set at exec time, so an agent that captured the token still cannot forge cross-tenant attribution. Mirrors the /internal/journal/emit security model — see Internal IPC API.
Workspace isolation: the by-agent and by-mission handlers reject cross-tenant IDs with 404 (same shape as “not found”) so existence isn’t leaked. See crewBelongsToWorkspace / missionBelongsToWorkspace in internal/api/paymaster_handler.go.
CLI
crewship paymaster by-crew --range 7d
crewship paymaster by-agent backend-team --range 24h
crewship paymaster top --limit 5 --range 30d
Full reference: crewship paymaster.
Creating a budget
Budgets are inserted directly into budget_limits today — there is no dedicated API. The UI’s settings panel drives the same table. Example workspace-wide hard cap of $250/day:
INSERT INTO budget_limits (id, workspace_id, scope_kind, scope_id, window, limit_usd, mode, enabled)
VALUES ('bl_default_daily', 'ws_123', 'workspace', 'ws_123', 'day', 250.0, 'hard', 1);
Sidecar coverage of agent CLI calls
Paymaster wraps llm.Provider.Complete() for everything called from the Go side (summaries, Keeper, consolidation, quartermaster judge). For agent CLI calls (claude, gemini, cursor-agent, droid) the sidecar is the metering point: the proxy intercepts the outbound HTTPS request, parses the response body for usage tokens, harvests rate-limit headers, and async-POSTs a Call to POST /api/v1/internal/cost/record. The handler validates auth and calls Record + EnforceQuota.
What the sidecar can and cannot see:
| Traffic shape | Token counts | Quota signals |
|---|
| Non-streaming JSON (Anthropic / OpenAI / Google) | yes — per-provider body parser | yes |
Streaming (text/event-stream) | no — body is opaque without a parallel SSE parser | yes — headers are always present |
| Pinned-cert TLS CONNECT (Claude Code Max OAuth tunnel) | no — body is end-to-end encrypted | no |
For pinned-cert tunnels the sidecar emits a “credential was used” attribution row (billing_mode=flat_rate, no tokens, cost_confidence=unknown) so the Subscriptions panel still shows usage. We deliberately do not MITM with a custom CA — CLIs with pinned certs would reject it, and the honest answer is that the body is private.
Body buffering uses io.TeeReader with a 10 MB cap so streaming UX is preserved while a pathological upstream cannot OOM the sidecar. Reads beyond the cap pass through to the agent unmodified but are not parsed.
Look for the llm.call journal entry: if it exists, the call was metered. If you see exec.command with claude and no llm.call, the OAuth tunnel was used and the row will land as flat-rate.
Gotchas
- Soft budgets still emit entries. A soft budget at 120% still writes
budget.warning to the journal — the UI uses this to paint red. Don’t filter out “soft” when looking for “who was over”.
- Stream path is unmetered.
wrappedProvider.Stream() bypasses the full middleware (token counts arrive in the terminal message_delta event, which the sync CallResponse shape can’t carry). Streaming callers pay through orchestrator-level accounting that predates this package.
- Pricing is estimated.
pricing.go hard-codes per-token prices per (provider, model). When the provider ships a new model the estimate falls to zero until pricing is added — check internal/paymaster/pricing.go after any provider upgrade.