LLM Middleware

internal/llm/middleware.go composes the full call stack that every LLM request flows through:
telemetry  →  paymaster  →  lookout  →  (caching: future)  →  base provider
Each layer matches the paymaster.LLMCaller signature so they compose as plain function wrappers. The returned Provider preserves the original Name() for routing and fan-out.

Composition

import "github.com/crewship-ai/crewship/internal/llm"

base := anthropic.NewProvider(apiKey)
wrapped := llm.Middleware(base, journalEmitter, db)

// From this point, wrapped.Complete() flows through all layers.
resp, err := wrapped.Complete(ctx, llm.Request{
    Model: "claude-sonnet-4-5",
    Messages: []llm.Message{{Role: llm.RoleUser, Content: "..."}},
})
The request context MUST carry a lookout.Scope (set by the HTTP handler chain). Without it, paymaster rejects the call because WorkspaceID is empty — calls without a workspace are not billable.
ctx = lookout.WithScope(ctx, lookout.Scope{
    WorkspaceID: ws, CrewID: crew, AgentID: agent, MissionID: mission,
})

Layer order rationale

Layer order is deliberate and the comment in middleware.go is the source of truth. Getting it wrong produces subtle bugs that only surface under load:

1. Telemetry outermost

An SRE looking at a slow trace must see every contributor: budget check, guardrails, cache lookup, network hop. If telemetry sat inside paymaster, the trace would start at “provider call” and hide the time spent on enforcement, which is often where slowness lives.

2. Paymaster outside lookout

Load-bearing. A pre-call budget check must refuse an over-budget request before we’ve done any guardrail work — otherwise a workspace out of budget still pays in sanitization time. And the cost ledger row is written here, outside the guardrail layer, so sanitize latency is not counted toward “provider latency”.

3. Lookout inside paymaster

Load-bearing. Running Lookout INSIDE Paymaster means: if Lookout blocks the call, no cost_ledger row is written because next.Call is never invoked. A blocked call is not a billable call. If the order were reversed, Paymaster would record a ledger row for work that never reached the provider.

4. Caching (provider-side) below lookout

A future request-level cache layer would sit here. Anthropic and OpenAI prompt caching is handled wire-side today:
  • Anthropic: internal/llm/anthropic.go sends the anthropic-beta: prompt-caching-2024-07-31 header by default and stamps cache_control: ephemeral on the system prompt and the last tool definition (tool schemas are usually large and stable across turns, which makes this the single highest-leverage breakpoint). Response usage parses cache_read_input_tokens + cache_creation_input_tokens.
  • OpenAI: caching auto-activates for prompts ≥1024 tokens (Sept 2025). Response usage parses prompt_tokens_details.cached_tokens (there is no separate creation counter).
Both counts are plumbed through Response.CachedInputToks and Response.CacheCreationToks into paymaster.CallResponse, then onto the cost_ledger row and the OTel llm.call span. See Paymaster and Tracing for the downstream details.
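
As a rough sketch of the usage parsing described above: the JSON field names are the ones each provider returns, but CacheCounts, parseAnthropicUsage, parseOpenAIUsage, and the sample payloads are illustrative assumptions rather than the code in internal/llm.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CacheCounts mirrors the two Response fields named above.
type CacheCounts struct {
	CachedInputToks   int
	CacheCreationToks int
}

// parseAnthropicUsage reads both cache counters from an Anthropic usage object.
func parseAnthropicUsage(raw []byte) (CacheCounts, error) {
	var u struct {
		CacheRead     int `json:"cache_read_input_tokens"`
		CacheCreation int `json:"cache_creation_input_tokens"`
	}
	if err := json.Unmarshal(raw, &u); err != nil {
		return CacheCounts{}, err
	}
	return CacheCounts{CachedInputToks: u.CacheRead, CacheCreationToks: u.CacheCreation}, nil
}

// parseOpenAIUsage reads the single cached counter; OpenAI reports no
// creation count, so CacheCreationToks stays zero.
func parseOpenAIUsage(raw []byte) (CacheCounts, error) {
	var u struct {
		Details struct {
			Cached int `json:"cached_tokens"`
		} `json:"prompt_tokens_details"`
	}
	if err := json.Unmarshal(raw, &u); err != nil {
		return CacheCounts{}, err
	}
	return CacheCounts{CachedInputToks: u.Details.Cached}, nil
}

func main() {
	// Made-up payloads, trimmed to the fields that matter here.
	a, err := parseAnthropicUsage([]byte(`{"input_tokens":12,"cache_read_input_tokens":900,"cache_creation_input_tokens":100}`))
	fmt.Printf("%+v %v\n", a, err)
	o, err := parseOpenAIUsage([]byte(`{"prompt_tokens":2000,"prompt_tokens_details":{"cached_tokens":1024}}`))
	fmt.Printf("%+v %v\n", o, err)
}
```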

5. Base provider innermost

The innermost providerCaller unpacks the opaque CallRequest.Inputs back into a typed llm.Request. It trusts that guardrails have already scanned the prompt and paymaster has green-lit the spend.
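
The unpacking step might look roughly like this. Request and CallRequest are local stand-ins with only the fields needed for the illustration, and unpack is a hypothetical helper; only the error text is taken from this page.

```go
package main

import "fmt"

// Request stands in for llm.Request.
type Request struct {
	Model string
}

// CallRequest stands in for paymaster.CallRequest; only Inputs is modelled.
type CallRequest struct {
	Inputs any
}

// unpack recovers the typed request, failing fast on a wiring bug.
func unpack(cr CallRequest) (Request, error) {
	req, ok := cr.Inputs.(Request)
	if !ok {
		return Request{}, fmt.Errorf("inputs not llm.Request (got %T)", cr.Inputs)
	}
	return req, nil
}

func main() {
	_, err := unpack(CallRequest{Inputs: "just a string"})
	fmt.Println(err)

	req, err := unpack(CallRequest{Inputs: Request{Model: "claude-sonnet-4-5"}})
	fmt.Println(req.Model, err)
}
```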

Write-path order beyond LLM calls

The same principle shows up elsewhere:
  • Keeper: SecretStore -> Gatekeeper LLM -> Decision. Journal emit is outermost so even rejected requests are audited.
  • Harbormaster: Enqueue -> (optional sync poll) -> Decide. Journal emit fires on each state transition.
  • Hooks: Dispatch -> blocking handlers (sequential, stop on Block) -> non-blocking goroutines. The hook.fired entry lands regardless of outcome.

Stream() path

Provider.Stream() now runs through the same middleware stack as Complete. wrappedProvider.Stream() builds a per-call telemetry → paymaster → lookout chain around a streamCaller that closes over the handler:
var caller paymaster.LLMCaller = &streamCaller{p: w.base, handler: handler}
caller = lookoutCaller(caller, w.j)
caller = paymaster.Middleware(caller, w.j, w.db)
caller = telemetry.LLMMiddleware(caller)
The synchronous CallResponse the streamCaller returns carries the final token counts computed across the stream, which lets paymaster.Middleware write a normal cost_ledger row and lets telemetry.LLMMiddleware close out a normal llm.call span. Callers that pick Stream over Complete now pay, log, and guard identically — the older bypass (where streaming fell back to orchestrator-level accounting) is gone. lookoutCaller stays inside paymaster.Middleware for the same reason as the Complete path: a blocked input never reaches the provider, so no ledger row is written for work that never happened.

Custom composition

You almost never want this — use llm.Middleware — but the layers are individually exported so tests can swap them:
var caller paymaster.LLMCaller = providerCaller{p: base}
caller = lookoutCaller(caller, j)
caller = paymaster.Middleware(caller, j, db)
caller = telemetry.LLMMiddleware(caller)
If you rearrange the layers, make sure you understand the reasoning above first. A PR that moves paymaster inside lookout will be rejected.

Gotchas

  • Context scope is required. Without lookout.WithScope in ctx, the paymaster rejects every call. This is by design (unscoped calls are unbillable) but produces a confusing error in tests — always attach a scope.
  • Type assertion failures fail fast. providerCaller returns "inputs not llm.Request (got %T)" if something upstream handed it the wrong shape. This is always a wiring bug.
  • Stream runs the full stack. Stream and Complete now both flow through telemetry → paymaster → lookout, so picking one over the other no longer changes billing or guardrail coverage. The streaming streamCaller returns a "stream inputs not llm.Request (got %T)" error on a type-assertion mismatch — same fail-fast wiring-bug signal as the Complete path.
  • Journal emitter is shared. Both paymaster and lookout emit through the same journal.Emitter instance. A nil emitter would no-op silently — the production path always sets it.