Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.crewship.ai/llms.txt

Use this file to discover all available pages before exploring further.

Paymaster

Paymaster owns cost accounting and budget enforcement. It is the read/write side of the cost_ledger and budget_limits tables (migration 52, extended by migration 62) and it ships as middleware on the LLM call stack. Every priced LLM round-trip produces exactly one cost_ledger row plus a llm.call journal entry; metered rows additionally emit cost.incurred, and either side can emit budget.warning / budget.exceeded when limits are hit. Paymaster separates two billing models that cannot share a single $ aggregate:
  • Metered — pay-per-token API-key calls. Real $ math, real budgets.
  • Flat-rate — subscription credentials (Anthropic Max, ChatGPT Plus, etc.). The $200/mo is paid up front; the marginal cost of one more token is $0 from our perspective.
Mixing them in the same KPI silently misleads operators, so the ledger row carries a billing_mode column and the UI keeps the two surfaces disjoint: KPIs and budgets show metered spend; subscriptions get a separate panel with call/token counts but no $ figure.

Data model

-- One row per LLM round-trip. Columns added in migration 62 marked [v62].
CREATE TABLE cost_ledger (
  id                    TEXT PRIMARY KEY,
  workspace_id          TEXT NOT NULL,
  crew_id               TEXT,
  agent_id              TEXT,
  mission_id            TEXT,
  provider              TEXT NOT NULL,
  model                 TEXT NOT NULL,
  input_tokens          INTEGER NOT NULL DEFAULT 0,
  output_tokens         INTEGER NOT NULL DEFAULT 0,
  cached_input_tokens   INTEGER NOT NULL DEFAULT 0,
  cache_creation_tokens INTEGER NOT NULL DEFAULT 0,
  cost_usd              REAL NOT NULL DEFAULT 0,
  tags                  TEXT NOT NULL DEFAULT '{}',
  ts                    TEXT NOT NULL,
  -- [v62] Billing-mode discriminator. 'metered' (default) | 'flat_rate'.
  billing_mode          TEXT NOT NULL DEFAULT 'metered',
  -- [v62] Live rate-limit signal harvested from upstream response headers.
  -- NULL when the call had no rate-limit headers (e.g. local Ollama, errors).
  quota_remaining_pct   REAL,
  quota_window          TEXT,             -- 'requests_per_min' | 'tokens_per_min' | 'input_tokens_per_min' | 'output_tokens_per_min' (most-restrictive)
  -- [v62] Display label for flat-rate rows. NULL for metered.
  subscription_plan     TEXT,
  -- [v62] Rate-card snapshot at write time (Langfuse pattern). A future
  -- pricing.go change cannot retroactively rewrite history.
  -- Stored as plain REAL (nullable, no DEFAULT); the writer always
  -- supplies a non-NULL value so the column is effectively populated.
  rate_input_per_m      REAL,
  rate_output_per_m     REAL,
  rate_cached_in_per_m  REAL,
  rate_cache_write_per_m REAL NOT NULL DEFAULT 0,
  -- [v62] Provenance label per row (Helicone pattern). UI never renders a
  -- $ figure without telling the operator how trustworthy it is.
  cost_confidence       TEXT NOT NULL DEFAULT 'estimate'
);

-- [v62] Partial index for the Subscriptions panel — covers only flat-rate rows
-- (most rows are 'metered'), so the index stays small and Subscriptions queries
-- avoid a full-ledger scan. The (workspace_id, billing_mode, ts DESC) shape
-- means "newest flat-rate row in this workspace" is an index-only seek.
CREATE INDEX idx_cost_billing_mode
  ON cost_ledger (workspace_id, billing_mode, ts DESC)
  WHERE billing_mode = 'flat_rate';

-- Optional per-scope limit. Walked hierarchically before every call.
CREATE TABLE budget_limits (
  id           TEXT PRIMARY KEY,
  workspace_id TEXT NOT NULL,
  scope_kind   TEXT NOT NULL CHECK (scope_kind IN ('workspace','crew','mission','agent')),
  scope_id     TEXT NOT NULL,
  window       TEXT NOT NULL CHECK (window IN ('hour','day','week','month','mission')),
  limit_usd    REAL NOT NULL,
  mode         TEXT NOT NULL CHECK (mode IN ('soft','hard','tiered')),
  enabled      INTEGER NOT NULL DEFAULT 1
);

Budget scopes

Budgets apply at four levels, narrowing from workspace down to agent. All applicable budgets are checked before every call; the most restrictive breach wins:
ScopeKindApplies when the call’s scope has…Example
workspace(always, when scope_id = workspace_id)“This workspace cannot spend more than $500/day.”
crewcrew_id set”backend-team is capped at $50/hour.”
missionmission_id set”MIS-42 must finish under $20 total.”
agentagent_id set”viktor cannot exceed $5/day.”
Windows: hour, day, week, month, and mission (window-less; sums the whole mission regardless of duration). Hour/day/week/month are calendar-aligned in UTC — dashboards reset on the hour, at 00:00 UTC, Monday 00:00 UTC, and the 1st respectively.

Enforcement modes

const (
    ModeSoft   EnforcementMode = "soft"   // never blocks, reports state=exceeded
    ModeHard   EnforcementMode = "hard"   // blocks at 100%, no warn tier
    ModeTiered EnforcementMode = "tiered" // warn at 80%, block at 100%
)
  • softEnforce never returns an error. Emits budget.warning when over 100% so the UI still paints it red. Use for cost-curious teams that want visibility without the platform pulling the plug.
  • hardBudgetExceededError at 100%. No warn band. Emits budget.exceeded.
  • tiered — emits budget.warning at 80% (no block) and budget.exceeded at 100% (block). The default for new budgets; gives operators a window to react.
A tiered budget ties one emit per tick to the state it’s in, so a once-warned call stays in the journal even if the subsequent call pushes it over the line.

Billing modes

Every cost_ledger row is tagged metered or flat_rate. The orchestrator picks the mode from the credential type before exec and signals it to the sidecar via CREWSHIP_BILLING_MODE (and CREWSHIP_SUBSCRIPTION_PLAN for flat-rate). The sidecar copies the values onto every POST /api/v1/internal/cost/record it sends to the server.
ModeWhencost_usdcost_confidenceBudgets evaluated?
meteredAPI-key credentials (Anthropic, OpenAI, Google, xAI, DeepSeek, Mistral)from Estimate (or provider-priced inline)precise or estimateyes
flat_rateSubscription credentials (Claude Max, ChatGPT Plus, GitHub Copilot, Cursor, etc.)forced to 0 in Recordforced to unknownno — Middleware skips Enforce
Record enforces the flat-rate invariants regardless of caller input — the LiteLLM-style “NULL beats fake-$0” pattern, adapted to SQLite’s NOT NULL. A flat-rate row is still a complete audit trail (which credential, which agent, which mission, when, how many tokens), it just refuses to invent a dollar figure. The cost.incurred journal entry is suppressed for flat-rate rows because no money was incurred. llm.call still emits, with the summary explicitly tagged (flat-rate · <plan>) so operators glancing at the timeline aren’t misled.

Cost confidence

Three values, written to every row:
  • precise — provider returned a usage block we trust verbatim (e.g. Anthropic non-streaming).
  • estimate — we computed it from pricing.go rates against parsed token counts.
  • unknown — flat-rate row, or we couldn’t parse the response body (streaming with no body buffer, opaque proxy).
The dashboard surfaces these as badges next to $ figures. Aggregate KPIs deliberately do not mix confidences — when a rollup spans multiple bands, the lowest band wins so the operator sees the floor of certainty, not the average.

Quota enforcement

The sidecar parses two header conventions on every upstream response:
  • anthropic-ratelimit-{requests,input-tokens,output-tokens}-{limit,remaining,reset}
  • x-ratelimit-{remaining-tokens,remaining-requests,reset-tokens} (OpenAI / xAI / DeepSeek)
The most restrictive window’s remaining / limit ratio is surfaced as a single fraction. Once parsed, the sidecar calls EnforceQuota (after Record):
func EnforceQuota(
    ctx context.Context,
    j journal.Emitter,
    scope Scope,
    remainingPct float64,   // 0.0–1.0; smallest of the parsed windows
    window QuotaWindow,     // requests_per_min | tokens_per_min | input_tokens_per_min | output_tokens_per_min — display only
    hadStatus429 bool,      // upstream sent 429 → authoritative "out"
) error
SignalEffect
hadStatus429 == trueemits budget.exceeded with reason: "quota_exhausted", returns *BudgetExceededError (same shape as the $-side so existing error handling keeps working)
remainingPct < 0.20emits budget.warning, returns nil (does not block)
remainingPct >= 0.20 or 0 (header missing) or 1.0 (full)no-op
Quota signals are per-call, ephemeral — they are journaled but not persisted to a side table. EnforceQuota is called after Record so the row that triggered the signal is still present in cost_ledger for forensics. This is the only enforcement path that fires for flat-rate calls: subscription users hit provider rate limits before they hit a $ ceiling. The error type is intentionally identical so *BudgetExceededError handlers do not need to know whether the cause was $ or quota.

Prompt-cache token flow

Cache token counts are provider-reported and surface as discounted ledger lines. The wire path:
  1. Provider returns usage — Anthropic emits cache_read_input_tokens + cache_creation_input_tokens in the usage block (both non-streaming and message_start SSE events); OpenAI emits prompt_tokens_details.cached_tokens (no separate creation counter — caching is opaque on their side).
  2. internal/llm/{anthropic,openai}.go parses the fields into Response.CachedInputToks + Response.CacheCreationToks.
  3. internal/llm/middleware.go plumbs them through providerCaller / streamCaller into paymaster.CallResponse.CachedInputTokens + .CacheCreationTokens.
  4. paymaster.Middleware records them into the matching cost_ledger columns alongside the rate-card snapshot. Cache reads bill at rate_cached_in_per_m; cache creates at rate_cache_write_per_m.
  5. telemetry.LLMMiddleware stamps the same counts as gen_ai.usage.cached_input_tokens + gen_ai.usage.cache_creation_tokens on the llm.call span — see Tracing.
The 10% discount for Anthropic cache reads (a claude-haiku-4-5 cache read bills 0.10/Mvs0.10/M vs 1.00/M base input) lands automatically because the rate-card column is its own field; no per-row multiplier needed. A regression earlier in this PR’s history shipped the parser without consuming the cache fields — every workspace recorded zero cached tokens and the discount never triggered. The cost_ledger.cached_input_tokens column has been present since v62 but was effectively dead until this wiring landed.

Rate-card snapshotting (Langfuse pattern)

Every metered row writes the rate_*_per_m columns from RateCard(provider, model) at the moment of write. When pricing.go is later edited (a provider repriced, a new model was added), historical rollups stay consistent: the old row continues to imply its old rate. New rollup queries can either trust cost_usd directly or recompute from rate_*_per_m * tokens to verify nothing has drifted. Zero is the honest value for ollama/local (genuinely free) and for unknown-provider lookups (we couldn’t price). The columns are stored as nullable REAL (the v62 migration adds them without a NOT NULL / DEFAULT constraint), but the writer always supplies an explicit zero in those cases, so historical rows never observe NULL in practice — readers can treat the column as effectively populated.

Pricing — current rates

The canonical rate card lives in internal/paymaster/pricing.go. Adding a model is a one-line change. Prices are USD per 1,000,000 tokens and were verified against provider docs on 2026-04-30.
Provider / ModelInputOutputCached inputCache write
Anthropic
claude-opus-4-75.0025.000.506.25
claude-sonnet-4-63.0015.000.303.75
claude-haiku-4-51.005.000.101.25
OpenAI
gpt-5.5 (alias gpt-5)4.0024.000.404.00
gpt-5.4-mini (alias gpt-5-mini)0.754.500.0750.75
gpt-5.4-nano (alias gpt-5-nano)0.100.400.010.10
o3-pro20.0080.005.0020.00
Google
gemini-2.5-pro2.5015.000.6252.50
gemini-2.5-flash0.100.400.0250.10
gemini-2.5-flash-lite0.050.200.01250.05
xAI
grok-4.202.006.002.002.00
grok-4.1-fast0.200.500.200.20
DeepSeek
deepseek-chat (V3)0.2520.3780.02520.252
deepseek-reasoner0.702.500.070.70
Mistral
codestral-25080.300.900.300.30
Local
ollama/*, local/*0000
Anthropic Opus 4.7 was repriced to $5/$25 from the old $15/$75 in early 2026. The Crewship rate card was updated on 2026-04-30; ledger rows written before that date carry the old rate as a snapshot in rate_*_per_m and will continue to read out at the historical price. New rows use the corrected rate.
When (provider, model) misses the table, lookup falls through to a per-provider ceiling in providerFallback rather than $0 — picking the most-expensive known tier means budgets warn correctly even on premium reasoning models, at the cost of a mild over-estimate for cheap unknown ones. Better to over-estimate than to silently bill $0.

The call path

paymaster.Middleware(next, j, db) wraps an LLMCaller. Order of operations:
  1. EnforceCheck resolves every applicable budget; a hard/tiered breach short-circuits with *BudgetExceededError. The underlying call is never made, no ledger row is written, but budget.exceeded lands in the journal.
  2. Call — delegate to the inner caller.
  3. Record — fill cost_usd via Estimate(provider, model, tokens) if the provider didn’t price the call inline, then INSERT the row and emit llm.call + cost.incurred.
Billing errors do NOT fail the call: if the provider succeeded the response is returned unchanged and the operator sees the audit gap in logs. If the call failed, Paymaster still attempts a partial-billing record so the trail isn’t lost. See llm.middleware.go for the full stack (telemetry -> paymaster -> lookout -> raw).

Read endpoints

  • GET /api/v1/paymaster/spend/by-crew?range=7d — one row per crew (metered only).
  • GET /api/v1/paymaster/spend/by-agent/{crewId}?range=24h — per-agent rollup inside a crew (metered only).
  • GET /api/v1/paymaster/spend/by-mission/{missionId} — single-mission total (metered only).
  • GET /api/v1/paymaster/top-spenders?limit=10&range=7d — highest-cost scopes (metered only).
  • GET /api/v1/paymaster/subscriptions?range=30d — flat-rate credentials grouped by (plan, provider) with call count, token count, last-used timestamp. No $ figure — see “Billing modes” above for why.
Window syntax: range=1h|24h|7d|30d or explicit since=<RFC3339>&until=<RFC3339>. Default is 7 days. The four spend/* endpoints filter WHERE billing_mode = 'metered' so flat-rate rows do not silently inflate metered totals. The Subscriptions panel is the only place flat-rate rows surface to the operator.

Internal write endpoint

The sidecar writes ledger rows over the IPC socket, never directly. The handler is POST /api/v1/internal/cost/record (auth: X-Internal-Token). Authoritative scope (workspace / crew / agent IDs) comes from the sidecar’s IPCConfig set at exec time, so an agent that captured the token still cannot forge cross-tenant attribution. Mirrors the /internal/journal/emit security model — see Internal IPC API. Workspace isolation: the by-agent and by-mission handlers reject cross-tenant IDs with 404 (same shape as “not found”) so existence isn’t leaked. See crewBelongsToWorkspace / missionBelongsToWorkspace in internal/api/paymaster_handler.go.

CLI

crewship paymaster by-crew --range 7d
crewship paymaster by-agent backend-team --range 24h
crewship paymaster top --limit 5 --range 30d
Full reference: crewship paymaster.

Creating a budget

Budgets are inserted directly into budget_limits today — there is no dedicated API. The UI’s settings panel drives the same table. Example workspace-wide hard cap of $250/day:
INSERT INTO budget_limits (id, workspace_id, scope_kind, scope_id, window, limit_usd, mode, enabled)
VALUES ('bl_default_daily', 'ws_123', 'workspace', 'ws_123', 'day', 250.0, 'hard', 1);

Sidecar coverage of agent CLI calls

Paymaster wraps llm.Provider.Complete() for everything called from the Go side (summaries, Keeper, consolidation, quartermaster judge). For agent CLI calls (claude, gemini, cursor-agent, droid) the sidecar is the metering point: the proxy intercepts the outbound HTTPS request, parses the response body for usage tokens, harvests rate-limit headers, and async-POSTs a Call to POST /api/v1/internal/cost/record. The handler validates auth and calls Record + EnforceQuota. What the sidecar can and cannot see:
Traffic shapeToken countsQuota signals
Non-streaming JSON (Anthropic / OpenAI / Google)yes — per-provider body parseryes
Streaming (text/event-stream)no — body is opaque without a parallel SSE parseryes — headers are always present
Pinned-cert TLS CONNECT (Claude Code Max OAuth tunnel)no — body is end-to-end encryptedno
For pinned-cert tunnels the sidecar emits a “credential was used” attribution row (billing_mode=flat_rate, no tokens, cost_confidence=unknown) so the Subscriptions panel still shows usage. We deliberately do not MITM with a custom CA — CLIs with pinned certs would reject it, and the honest answer is that the body is private. Body buffering uses io.TeeReader with a 10 MB cap so streaming UX is preserved while a pathological upstream cannot OOM the sidecar. Reads beyond the cap pass through to the agent unmodified but are not parsed. Look for the llm.call journal entry: if it exists, the call was metered. If you see exec.command with claude and no llm.call, the OAuth tunnel was used and the row will land as flat-rate.

Gotchas

  • Soft budgets still emit entries. A soft budget at 120% still writes budget.warning to the journal — the UI uses this to paint red. Don’t filter out “soft” when looking for “who was over”.
  • Stream path is unmetered. wrappedProvider.Stream() bypasses the full middleware (token counts arrive in the terminal message_delta event, which the sync CallResponse shape can’t carry). Streaming callers pay through orchestrator-level accounting that predates this package.
  • Pricing is estimated. pricing.go hard-codes per-token prices per (provider, model). When the provider ships a new model the estimate falls to zero until pricing is added — check internal/paymaster/pricing.go after any provider upgrade.