# LLM Middleware

> The unified LLM call stack: telemetry → paymaster → lookout → raw provider. Composition and layer order rationale.

`internal/llm/middleware.go` composes the full call stack that every LLM request flows through:

```
telemetry  →  paymaster  →  lookout  →  (caching: future)  →  base provider
```

Each layer matches the `paymaster.LLMCaller` signature so they compose as plain function wrappers. The returned `Provider` preserves the original `Name()` for routing and fan-out.

## Composition

```go theme={null}
import "github.com/crewship-ai/crewship/internal/llm"

base := anthropic.NewProvider(apiKey)
wrapped := llm.Middleware(base, journalEmitter, db)

// From this point, wrapped.Complete() flows through all layers.
resp, err := wrapped.Complete(ctx, llm.Request{
    Model: "claude-sonnet-4-5",
    Messages: []llm.Message{{Role: llm.RoleUser, Content: "..."}},
})
```

The request context MUST carry a `lookout.Scope` (set by the HTTP handler chain). Without it, paymaster rejects the call because `WorkspaceID` is empty -- calls without a workspace are not billable.

```go theme={null}
ctx = lookout.WithScope(ctx, lookout.Scope{
    WorkspaceID: ws, CrewID: crew, AgentID: agent, MissionID: mission,
})
```

## Layer order rationale

Layer order is deliberate and the comment in `middleware.go` is the source of truth. Getting it wrong produces subtle bugs that only surface under load:

### 1. Telemetry outermost

An SRE looking at a slow trace must see every contributor: budget check, guardrails, network hop, and eventually cache lookup. If telemetry sat inside the enforcement layers, the trace would start at the provider call and hide the time spent on enforcement, which is often where slowness lives.

### 2. Paymaster outside lookout

**Load-bearing.** A pre-call budget check must refuse an over-budget request before we've done any guardrail work -- otherwise a workspace out of budget still pays in sanitization time. And the cost ledger row is written here, outside the guardrail layer, so sanitize latency is not counted toward "provider latency".

### 3. Lookout inside paymaster

**Load-bearing.** Running Lookout INSIDE Paymaster means: **if Lookout blocks the call, no `cost_ledger` row is written because `next.Call` is never invoked.** A blocked call is not a billable call. If the order were reversed, Paymaster would record a ledger row for work that never reached the provider.

### 4. Caching (provider-side) below lookout

A future request-level cache layer would sit here. Anthropic and OpenAI prompt caching is handled wire-side today:

* **Anthropic** — `internal/llm/anthropic.go` ships `anthropic-beta: prompt-caching-2024-07-31` by default and stamps `cache_control: ephemeral` on the system prompt and the last tool definition (tool schemas are usually large and stable across turns — single highest-leverage breakpoint). Response usage parses `cache_read_input_tokens` + `cache_creation_input_tokens`.
* **OpenAI** — auto-activates for prompts ≥1024 tokens (Sept 2025). Response usage parses `prompt_tokens_details.cached_tokens` (no separate creation counter).

Both counts plumb through `Response.CachedInputToks` + `.CacheCreationToks` into `paymaster.CallResponse` and onto the `cost_ledger` row + the OTel `llm.call` span. See [Paymaster](/guides/paymaster#prompt-cache-token-flow) and [Tracing](/guides/tracing#prompt-cache-attributes) for the downstream details.
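As a rough illustration of the parsing step, the sketch below decodes the usage fields named above into one normalized shape. The JSON field names match the provider responses described here, but `cacheCounts`, `fromAnthropic`, and `fromOpenAI` are hypothetical helpers, and the sample payloads are made-up values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// anthropicUsage covers the Anthropic cache fields named above.
type anthropicUsage struct {
	InputTokens              int `json:"input_tokens"`
	CacheReadInputTokens     int `json:"cache_read_input_tokens"`
	CacheCreationInputTokens int `json:"cache_creation_input_tokens"`
}

// openAIUsage covers the OpenAI cached-token field named above.
type openAIUsage struct {
	PromptTokens        int `json:"prompt_tokens"`
	PromptTokensDetails struct {
		CachedTokens int `json:"cached_tokens"`
	} `json:"prompt_tokens_details"`
}

// cacheCounts is a hypothetical normalized shape; its field names follow the
// Response fields mentioned above.
type cacheCounts struct {
	CachedInputToks   int
	CacheCreationToks int
}

// fromAnthropic maps read and creation counters onto the normalized shape.
func fromAnthropic(raw []byte) (cacheCounts, error) {
	var u anthropicUsage
	if err := json.Unmarshal(raw, &u); err != nil {
		return cacheCounts{}, err
	}
	return cacheCounts{
		CachedInputToks:   u.CacheReadInputTokens,
		CacheCreationToks: u.CacheCreationInputTokens,
	}, nil
}

// fromOpenAI has no creation counter to map, so that field stays zero.
func fromOpenAI(raw []byte) (cacheCounts, error) {
	var u openAIUsage
	if err := json.Unmarshal(raw, &u); err != nil {
		return cacheCounts{}, err
	}
	return cacheCounts{CachedInputToks: u.PromptTokensDetails.CachedTokens}, nil
}

func main() {
	a, _ := fromAnthropic([]byte(`{"input_tokens":10,"cache_read_input_tokens":900,"cache_creation_input_tokens":120}`))
	o, _ := fromOpenAI([]byte(`{"prompt_tokens":2048,"prompt_tokens_details":{"cached_tokens":1024}}`))
	fmt.Printf("anthropic=%+v openai=%+v\n", a, o)
}
```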

### 5. Base provider innermost

The innermost `providerCaller` unpacks the opaque `CallRequest.Inputs` back into a typed `llm.Request`. It trusts that guardrails have already scanned the prompt and paymaster has green-lit the spend.
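The unpacking step is essentially a checked type assertion. A minimal sketch, with `Request`, `CallRequest`, and `unpack` as simplified hypothetical stand-ins for the real types and method:

```go
package main

import "fmt"

// Request and CallRequest are simplified stand-ins for llm.Request and
// paymaster.CallRequest.
type Request struct{ Model string }
type CallRequest struct{ Inputs any }

// unpack recovers the typed request from the opaque Inputs field and fails
// fast, naming the unexpected dynamic type, when the assertion does not hold.
func unpack(req CallRequest) (Request, error) {
	r, ok := req.Inputs.(Request)
	if !ok {
		return Request{}, fmt.Errorf("inputs not llm.Request (got %T)", req.Inputs)
	}
	return r, nil
}

func main() {
	_, err := unpack(CallRequest{Inputs: "not a request"})
	fmt.Println(err)

	r, _ := unpack(CallRequest{Inputs: Request{Model: "claude-sonnet-4-5"}})
	fmt.Println(r.Model)
}
```

Using the two-value form of the assertion turns a wiring mistake into an ordinary error instead of a panic.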

## Write-path order beyond LLM calls

The same principle shows up elsewhere:

* **Keeper**: SecretStore → Gatekeeper LLM → Decision. Journal emit is outermost so even rejected requests are audited.
* **Harbormaster**: Enqueue → (optional sync poll) → Decide. Journal emit fires on each state transition.
* **Hooks**: Dispatch → blocking handlers (sequential, stop on Block) → non-blocking goroutines. The `hook.fired` entry lands regardless of outcome.
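The Hooks ordering in the last bullet could be sketched as below. `Decision`, `Handler`, and `dispatch` are hypothetical, and two details are assumptions rather than documented behaviour: non-blocking handlers are skipped after a Block, and the sketch waits for the goroutines only to keep the example deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// Decision is a hypothetical handler verdict.
type Decision int

const (
	Allow Decision = iota
	Block
)

// Handler is a hypothetical hook handler.
type Handler func(event string) Decision

// dispatch runs blocking handlers in order and stops at the first Block,
// then fans out to non-blocking handlers, then journals in every case.
func dispatch(event string, blocking, async []Handler, journal func(entry string)) Decision {
	decision := Allow
	for _, h := range blocking {
		if h(event) == Block {
			decision = Block
			break // later blocking handlers never run
		}
	}
	if decision == Allow { // assumption: skip the fan-out after a Block
		var wg sync.WaitGroup
		for _, h := range async {
			wg.Add(1)
			go func(h Handler) {
				defer wg.Done()
				h(event)
			}(h)
		}
		wg.Wait() // waiting here is for the sketch only
	}
	journal("hook.fired") // lands regardless of outcome
	return decision
}

func main() {
	var journal []string
	deny := func(string) Decision { return Block }
	allow := func(string) Decision { return Allow }
	d := dispatch("pre_call", []Handler{allow, deny, allow}, nil, func(e string) {
		journal = append(journal, e)
	})
	fmt.Println(d == Block, journal)
}
```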

## Stream() path

`Provider.Stream()` now runs through the **same** middleware stack as `Complete`. `wrappedProvider.Stream()` builds a per-call `telemetry → paymaster → lookout` chain around a `streamCaller` that closes over the handler:

```go theme={null}
var caller paymaster.LLMCaller = &streamCaller{p: w.base, handler: handler}
caller = lookoutCaller(caller, w.j)
caller = paymaster.Middleware(caller, w.j, w.db)
caller = telemetry.LLMMiddleware(caller)
```

The synchronous `CallResponse` the `streamCaller` returns carries the final token counts computed across the stream, which lets `paymaster.Middleware` write a normal `cost_ledger` row and lets `telemetry.LLMMiddleware` close out a normal `llm.call` span. Callers that pick `Stream` over `Complete` now **pay, log, and guard identically** — the older bypass (where streaming fell back to orchestrator-level accounting) is gone.
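The idea of returning one synchronous response from a stream can be sketched like this. `Chunk`, `CallResponse`, and `streamCall` are hypothetical simplifications. Real provider chunks and the real `streamCaller` carry more than text and an output-token count.

```go
package main

import "fmt"

// Chunk is a hypothetical streamed delta.
type Chunk struct {
	Text         string
	OutputTokens int
}

// CallResponse is a simplified stand-in for the synchronous result.
type CallResponse struct {
	Text         string
	OutputTokens int
}

// streamCall forwards each chunk to handler as it arrives and returns a single
// response with the accumulated totals once the stream is closed.
func streamCall(chunks <-chan Chunk, handler func(Chunk)) CallResponse {
	var resp CallResponse
	for c := range chunks {
		handler(c) // the caller sees output incrementally
		resp.Text += c.Text
		resp.OutputTokens += c.OutputTokens
	}
	return resp // outer layers bill and trace from these totals
}

func main() {
	chunks := make(chan Chunk, 3)
	chunks <- Chunk{Text: "Hel", OutputTokens: 1}
	chunks <- Chunk{Text: "lo", OutputTokens: 1}
	chunks <- Chunk{Text: "!", OutputTokens: 1}
	close(chunks)

	resp := streamCall(chunks, func(c Chunk) { fmt.Print(c.Text) })
	fmt.Printf("\ntotal output tokens: %d\n", resp.OutputTokens)
}
```

Because the wrapper only returns after the stream ends, the outer layers can treat a streamed call the same way as a non-streamed one.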

`lookoutCaller` stays inside `paymaster.Middleware` for the same reason as the Complete path: a blocked input never reaches the provider, so no ledger row is written for work that never happened.

## Custom composition

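You almost never want this -- use `llm.Middleware` -- but the layers are separate wrappers, so tests can swap them individually. Note that `providerCaller` and `lookoutCaller` are unexported, so the composition below only compiles inside package `llm`: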

```go theme={null}
var caller paymaster.LLMCaller = providerCaller{p: base}
caller = lookoutCaller(caller, j)
caller = paymaster.Middleware(caller, j, db)
caller = telemetry.LLMMiddleware(caller)
```

If you rearrange the layers, make sure you understand the rationale above first. A PR that moves paymaster inside lookout will be rejected.

## Gotchas

<Accordion title="Edge cases and footguns">
  * **Context scope is required.** Without `lookout.WithScope` in ctx, the paymaster rejects every call. This is by design (unscoped calls are unbillable) but produces a confusing error in tests -- always attach a scope.
  * **Type assertion failures fail fast.** `providerCaller` returns `"inputs not llm.Request (got %T)"` if something upstream handed it the wrong shape. This is always a wiring bug.
  * **Stream runs the full stack.** `Stream` and `Complete` now both flow through `telemetry → paymaster → lookout`, so picking one over the other no longer changes billing or guardrail coverage. The streaming `streamCaller` returns a `"stream inputs not llm.Request (got %T)"` error on a type-assertion mismatch — same fail-fast wiring-bug signal as the Complete path.
  * **Journal emitter is shared.** Both paymaster and lookout emit through the same `journal.Emitter` instance. A nil emitter would no-op silently -- the production path always sets it.
</Accordion>

## Related

* [Tracing](/guides/tracing) -- layer 1. OTel spans.
* [Paymaster](/guides/paymaster) -- layer 2. Cost + budget.
* [Lookout](/guides/lookout) -- layer 3. Guardrails.
* [Architecture](/architecture#crew-journal-platform) -- where this stack fits.