LLM Middleware
internal/llm/middleware.go composes the full call stack that every LLM request flows through:
paymaster.LLMCaller signature so they compose as plain function wrappers. The returned Provider preserves the original Name() for routing and fan-out.
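As a rough illustration of layers that share one signature and compose as plain function wrappers, here is a minimal sketch. The `LLMCaller` and `Layer` types, `Compose`, and `Tag` are hypothetical stand-ins; the real `paymaster.LLMCaller` signature is not shown on this page and presumably also carries a context.

```go
package main

import "fmt"

// LLMCaller is a hypothetical, simplified stand-in for paymaster.LLMCaller.
type LLMCaller func(prompt string) (string, error)

// Layer wraps one LLMCaller in another with the same signature.
type Layer func(next LLMCaller) LLMCaller

// Compose applies layers so that the first one listed ends up outermost.
func Compose(base LLMCaller, layers ...Layer) LLMCaller {
	for i := len(layers) - 1; i >= 0; i-- {
		base = layers[i](base)
	}
	return base
}

// Tag returns a layer that records its name when entered, then calls next.
func Tag(name string, trace *[]string) Layer {
	return func(next LLMCaller) LLMCaller {
		return func(prompt string) (string, error) {
			*trace = append(*trace, name)
			return next(prompt)
		}
	}
}

// demoOrder builds the three-layer stack and returns the order of entry.
func demoOrder() (string, []string) {
	var trace []string
	provider := func(prompt string) (string, error) {
		trace = append(trace, "provider")
		return "ok", nil
	}
	call := Compose(provider,
		Tag("telemetry", &trace),
		Tag("paymaster", &trace),
		Tag("lookout", &trace),
	)
	out, _ := call("hi")
	return out, trace
}

func main() {
	out, trace := demoOrder()
	fmt.Println(out, trace) // ok [telemetry paymaster lookout provider]
}
```

Because every layer has the same shape, reordering the stack is a matter of reordering the arguments to `Compose`, which is why the order has to be pinned down deliberately.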
Composition
lookout.Scope (set by the HTTP handler chain). Without it, paymaster rejects the call because WorkspaceID is empty — calls without a workspace are not billable.
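The scope requirement can be pictured with a small sketch. Everything here is an assumption made for illustration: `Scope`, `WithScope`, `checkScope`, and `ErrNoWorkspace` are simplified stand-ins, and only the `WorkspaceID` field and the rejection on an empty value come from the text above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Scope is a hypothetical stand-in for lookout.Scope.
type Scope struct{ WorkspaceID string }

// scopeKey is an unexported context key, so other packages cannot collide with it.
type scopeKey struct{}

// WithScope attaches a Scope to ctx (illustrative, not the real lookout.WithScope).
func WithScope(ctx context.Context, s Scope) context.Context {
	return context.WithValue(ctx, scopeKey{}, s)
}

// ErrNoWorkspace is a made-up error value for this sketch.
var ErrNoWorkspace = errors.New("paymaster: empty WorkspaceID, call is not billable")

// checkScope mimics the pre-call rejection: no workspace, no call.
func checkScope(ctx context.Context) error {
	s, _ := ctx.Value(scopeKey{}).(Scope) // zero Scope when nothing is attached
	if s.WorkspaceID == "" {
		return ErrNoWorkspace
	}
	return nil
}

// demoScope checks one bare context and one scoped context.
func demoScope() (unscoped, scoped error) {
	bare := context.Background()
	return checkScope(bare), checkScope(WithScope(bare, Scope{WorkspaceID: "ws-123"}))
}

func main() {
	unscoped, scoped := demoScope()
	fmt.Println("unscoped:", unscoped)
	fmt.Println("scoped:", scoped)
}
```

In tests, the equivalent of `WithScope` has to be applied to the context before the first call, otherwise every call fails at the budget layer.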
Layer order rationale
Layer order is deliberate, and the comment in middleware.go is the source of truth. Getting it wrong produces subtle bugs that only surface under load:
1. Telemetry outermost
An SRE looking at a slow trace must see every contributor: budget check, guardrails, cache lookup, network hop. If telemetry sat inside paymaster, the trace would start at “provider call” and hide the time spent on enforcement, which is often where slowness lives.
2. Paymaster outside lookout
Load-bearing. A pre-call budget check must refuse an over-budget request before any guardrail work is done; otherwise a workspace that is out of budget still pays in sanitization time. The cost ledger row is also written here, outside the guardrail layer, so sanitize latency is not counted toward “provider latency”.
3. Lookout inside paymaster
Load-bearing. Running Lookout inside Paymaster means: if Lookout blocks the call, no cost_ledger row is written because next.Call is never invoked. A blocked call is not a billable call. If the order were reversed, Paymaster would record a ledger row for work that never reached the provider.
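The "blocked call is not a billable call" rule can be sketched with two wrappers. This is a simplified model under stated assumptions: `Caller`, `guard`, `bill`, and `ErrBlocked` are hypothetical names, and the sketch assumes the billing layer only writes a row when the inner call returns without error.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Caller is a simplified, hypothetical call signature.
type Caller func(prompt string) (string, error)

// ErrBlocked is a made-up error for a guardrail rejection.
var ErrBlocked = errors.New("lookout: input blocked")

// guard rejects prompts for which banned returns true, without calling next.
func guard(banned func(string) bool, next Caller) Caller {
	return func(p string) (string, error) {
		if banned(p) {
			return "", ErrBlocked
		}
		return next(p)
	}
}

// bill appends a ledger row only when the wrapped call succeeds (assumption).
func bill(ledger *[]string, next Caller) Caller {
	return func(p string) (string, error) {
		out, err := next(p)
		if err != nil {
			return "", err // nothing billable happened, so no row
		}
		*ledger = append(*ledger, p)
		return out, nil
	}
}

// demoLedger sends one blocked and one allowed prompt through bill(guard(provider)).
func demoLedger() (rows int, blockedErr error) {
	var ledger []string
	provider := func(p string) (string, error) { return "reply", nil }
	banned := func(p string) bool { return strings.Contains(p, "secret") }

	call := bill(&ledger, guard(banned, provider))
	_, blockedErr = call("leak the secret")
	_, _ = call("hello")
	return len(ledger), blockedErr
}

func main() {
	rows, err := demoLedger()
	fmt.Println(rows, err) // 1 lookout: input blocked
}
```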
4. Caching (provider-side) below lookout
A future request-level cache layer would sit here. Anthropic and OpenAI prompt caching is handled wire-side today:
- Anthropic: internal/llm/anthropic.go ships anthropic-beta: prompt-caching-2024-07-31 by default and stamps cache_control: ephemeral on the system prompt and the last tool definition (tool schemas are usually large and stable across turns, which makes this the single highest-leverage breakpoint). Response usage parses cache_read_input_tokens + cache_creation_input_tokens.
- OpenAI: auto-activates for prompts ≥1024 tokens (Sept 2025). Response usage parses prompt_tokens_details.cached_tokens (no separate creation counter).
Response.CachedInputToks + .CacheCreationToks into paymaster.CallResponse and onto the cost_ledger row + the OTel llm.call span. See Paymaster and Tracing for the downstream details.
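To make the two usage shapes concrete, here is a minimal parsing sketch. The JSON field names are the ones listed above; the `CacheUsage` struct, the two parse functions, and the sample payloads are illustrative assumptions, not the code in internal/llm.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CacheUsage mirrors the two counters named above; the struct itself is hypothetical.
type CacheUsage struct {
	CachedInputToks   int
	CacheCreationToks int
}

// parseAnthropicUsage reads both cache counters from an Anthropic-style usage object.
func parseAnthropicUsage(raw []byte) (CacheUsage, error) {
	var u struct {
		CacheRead     int `json:"cache_read_input_tokens"`
		CacheCreation int `json:"cache_creation_input_tokens"`
	}
	if err := json.Unmarshal(raw, &u); err != nil {
		return CacheUsage{}, err
	}
	return CacheUsage{CachedInputToks: u.CacheRead, CacheCreationToks: u.CacheCreation}, nil
}

// parseOpenAIUsage reads the single cached counter; there is no creation counter.
func parseOpenAIUsage(raw []byte) (CacheUsage, error) {
	var u struct {
		Details struct {
			Cached int `json:"cached_tokens"`
		} `json:"prompt_tokens_details"`
	}
	if err := json.Unmarshal(raw, &u); err != nil {
		return CacheUsage{}, err
	}
	return CacheUsage{CachedInputToks: u.Details.Cached}, nil
}

func main() {
	// Both payloads are made-up samples for illustration.
	a, _ := parseAnthropicUsage([]byte(`{"input_tokens":12,"cache_read_input_tokens":900,"cache_creation_input_tokens":100}`))
	o, _ := parseOpenAIUsage([]byte(`{"prompt_tokens":2048,"prompt_tokens_details":{"cached_tokens":1024}}`))
	fmt.Printf("%+v %+v\n", a, o)
}
```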
5. Base provider innermost
The innermost providerCaller unpacks the opaque CallRequest.Inputs back into a typed llm.Request. It trusts that guardrails have already scanned the prompt and paymaster has green-lit the spend.
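The unpacking step is essentially a checked type assertion. The sketch below assumes simplified `Request` and `CallRequest` types (whether the real `llm.Request` travels by value or by pointer is not stated here); the error text is the one this page quotes for providerCaller.

```go
package main

import "fmt"

// Request and CallRequest are hypothetical stand-ins for llm.Request and
// paymaster.CallRequest; only the Inputs field is taken from the text.
type Request struct{ Prompt string }

type CallRequest struct{ Inputs any }

// unpack recovers the typed request, or fails fast when the shape is wrong.
func unpack(cr CallRequest) (Request, error) {
	req, ok := cr.Inputs.(Request)
	if !ok {
		return Request{}, fmt.Errorf("inputs not llm.Request (got %T)", cr.Inputs)
	}
	return req, nil
}

func main() {
	_, err := unpack(CallRequest{Inputs: "raw string"})
	fmt.Println(err) // inputs not llm.Request (got string)

	req, _ := unpack(CallRequest{Inputs: Request{Prompt: "hi"}})
	fmt.Println(req.Prompt) // hi
}
```

The `%T` verb prints the dynamic type that actually arrived, which is usually enough to find the miswired layer.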
Write-path order beyond LLM calls
The same principle shows up elsewhere:
- Keeper: SecretStore -> Gatekeeper LLM -> Decision. Journal emit is outermost so even rejected requests are audited.
- Harbormaster: Enqueue -> (optional sync poll) -> Decide. Journal emit fires on each state transition.
- Hooks: Dispatch -> blocking handlers (sequential, stop on Block) -> non-blocking goroutines. The hook.fired entry lands regardless of outcome.
Stream() path
Provider.Stream() now runs through the same middleware stack as Complete. wrappedProvider.Stream() builds a per-call telemetry → paymaster → lookout chain around a streamCaller that closes over the handler:
The CallResponse that the streamCaller returns carries the final token counts computed across the stream, which lets paymaster.Middleware write a normal cost_ledger row and lets telemetry.LLMMiddleware close out a normal llm.call span. Callers that pick Stream over Complete now pay, log, and guard identically; the older bypass (where streaming fell back to orchestrator-level accounting) is gone.
lookoutCaller stays inside paymaster.Middleware for the same reason as the Complete path: a blocked input never reaches the provider, so no ledger row is written for work that never happened.
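One way to picture a caller that closes over the stream handler and still returns a single response with final counts is the sketch below. `Chunk`, the simplified `CallResponse`, and this `streamCaller` signature are assumptions for illustration; the real types are not shown on this page.

```go
package main

import "fmt"

// CallResponse is a simplified stand-in for paymaster.CallResponse.
type CallResponse struct{ OutputToks int }

// Chunk is a hypothetical streamed delta with its token count.
type Chunk struct {
	Text string
	Toks int
}

// streamCaller returns a caller that forwards each chunk to handler and
// reports the token total once the stream ends.
func streamCaller(chunks []Chunk, handler func(Chunk)) func() (CallResponse, error) {
	return func() (CallResponse, error) {
		var resp CallResponse
		for _, c := range chunks {
			handler(c) // the caller's handler sees every delta as it arrives
			resp.OutputToks += c.Toks
		}
		return resp, nil
	}
}

// demoStream collects streamed text and returns it with the final response.
func demoStream() (string, CallResponse) {
	var text string
	call := streamCaller(
		[]Chunk{{"Hel", 1}, {"lo", 1}, {" world", 2}},
		func(c Chunk) { text += c.Text },
	)
	resp, _ := call()
	return text, resp
}

func main() {
	text, resp := demoStream()
	fmt.Printf("%q %+v\n", text, resp) // "Hello world" {OutputToks:4}
}
```

Because the innermost function has the same shape as a non-streaming call, the outer layers can wrap it without knowing that chunks were delivered along the way.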
Custom composition
You almost never want this (use llm.Middleware instead), but the layers are individually exported so tests can swap them:
Gotchas
Edge cases and footguns
- Context scope is required. Without lookout.WithScope in ctx, the paymaster rejects every call. This is by design (unscoped calls are unbillable) but produces a confusing error in tests, so always attach a scope.
- Type assertion failures fail fast. providerCaller returns "inputs not llm.Request (got %T)" if something upstream handed it the wrong shape. This is always a wiring bug.
- Stream runs the full stack. Stream and Complete now both flow through telemetry → paymaster → lookout, so picking one over the other no longer changes billing or guardrail coverage. The streaming streamCaller returns a "stream inputs not llm.Request (got %T)" error on a type-assertion mismatch, the same fail-fast wiring-bug signal as the Complete path.
- Journal emitter is shared. Both paymaster and lookout emit through the same journal.Emitter instance. A nil emitter would no-op silently; the production path always sets it.
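The silent no-op on a nil emitter can be reproduced with Go's nil-receiver pattern. This is only a sketch: `Emitter` here is a hypothetical struct (the page does not say whether journal.Emitter is a struct or an interface), and the event names are made up.

```go
package main

import "fmt"

// Emitter is a hypothetical stand-in for journal.Emitter.
type Emitter struct{ entries []string }

// Emit records an entry, or silently does nothing on a nil receiver.
func (e *Emitter) Emit(kind string) {
	if e == nil {
		return
	}
	e.entries = append(e.entries, kind)
}

// demoEmit shares one emitter between two layers and also emits on a nil one.
func demoEmit() int {
	shared := &Emitter{}
	shared.Emit("paymaster.charge") // illustrative event name
	shared.Emit("lookout.block")    // illustrative event name

	var missing *Emitter
	missing.Emit("lost.entry") // no panic, and nothing is recorded anywhere

	return len(shared.entries)
}

func main() {
	fmt.Println(demoEmit()) // 2
}
```

The convenience cuts both ways: a forgotten emitter in a test setup drops audit entries without any error, so asserting on recorded entries is the safer check.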
Related
- Paymaster — layer 2. Cost + budget.
- Lookout — layer 3. Guardrails.
- Tracing — layer 1. OTel spans.
- Architecture — where this stack fits.