# Tracing (OpenTelemetry)

> OpenTelemetry GenAI spans with W3C Trace Context propagation into journal entries.

The `internal/telemetry` package wires Crewship to OpenTelemetry distributed tracing. (For the user-facing crash-reporting opt-out flow see the [Telemetry guide](/guides/telemetry).)

It owns three concerns:

1. **Provider init** -- build an OTLP HTTP exporter or fall back to a no-op tracer when no endpoint is configured.
2. **Span builders** -- typed helpers that create agent, tool, and LLM spans with the attributes prescribed by the OTel GenAI Semantic Conventions (`gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.*`).
3. **Propagation + journal integration** -- W3C Trace Context injection/extraction over HTTP headers plus a resolver that feeds the journal package so every entry is stamped with the current `trace_id` / `span_id`.

## Zero-config safety

If `Init` is called with an empty endpoint and `OTEL_EXPORTER_OTLP_ENDPOINT` is unset, the returned shutdown function is a no-op and `otel.GetTracerProvider()` keeps returning the no-op provider. Spans are still created but never leave the process, so developers can run the binary without a collector and nothing breaks.

```go
// With an empty endpoint, Init returns a no-op shutdown and exports nothing.
shutdown, err := telemetry.Init(ctx, os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT"), "crewship")
if err != nil {
	return err
}
defer shutdown()
```

## Configuration

| Env var                       | Purpose                                                                         |
| ----------------------------- | ------------------------------------------------------------------------------- |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector URL. Accepts `host:port` (plaintext) or `http(s)://host:port[/path]`. |
| `CREWSHIP_VERSION`            | Stamped as `service.version` on the resource. Empty = `"dev"`.                  |

Endpoint resolution order:

1. The explicit `endpoint` argument to `Init` (most specific wins).
2. `OTEL_EXPORTER_OTLP_ENDPOINT`.
3. Empty string -> no-op tracer, nothing exported.

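The resolution order and the two accepted endpoint shapes can be sketched with the standard library alone. This is an illustrative sketch, not the package's actual code: `resolveEndpoint`, `parseEndpoint`, and `exporterTarget` are hypothetical names, and treating `http://` as plaintext is an assumption that follows from the scheme rather than something the source states.

```go
package main

import (
	"fmt"
	"strings"
)

// exporterTarget is a hypothetical struct describing where spans would go.
type exporterTarget struct {
	HostPort string // e.g. "collector:4318"
	Path     string // optional URL path, e.g. "/v1/traces"
	Insecure bool   // true means plaintext HTTP
}

// resolveEndpoint applies the documented order: the explicit argument wins,
// then the environment variable, then the empty string (no-op mode).
func resolveEndpoint(explicit string, getenv func(string) string) string {
	if explicit != "" {
		return explicit
	}
	return getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
}

// parseEndpoint splits a resolved endpoint into host, path, and TLS mode.
// ok is false for an empty endpoint, which means "do not export".
func parseEndpoint(raw string) (t exporterTarget, ok bool) {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return exporterTarget{}, false
	}
	rest := raw
	switch {
	case strings.HasPrefix(raw, "https://"):
		rest = strings.TrimPrefix(raw, "https://")
	case strings.HasPrefix(raw, "http://"):
		rest = strings.TrimPrefix(raw, "http://")
		t.Insecure = true // assumption: http:// is plaintext
	default:
		t.Insecure = true // bare host:port is plaintext per the table above
	}
	if i := strings.Index(rest, "/"); i >= 0 {
		t.HostPort, t.Path = rest[:i], rest[i:]
	} else {
		t.HostPort = rest
	}
	return t, true
}

func main() {
	env := func(string) string { return "collector:4318" }

	fromEnv, ok := parseEndpoint(resolveEndpoint("", env))
	fmt.Printf("%+v exporting=%v\n", fromEnv, ok)

	explicit, ok := parseEndpoint(resolveEndpoint("https://otel.example.com:4318/v1/traces", env))
	fmt.Printf("%+v exporting=%v\n", explicit, ok)
}
```

Prefix checks are used instead of `net/url` because `url.Parse("collector:4318")` reads `collector` as the URL scheme, which is not what a bare `host:port` means here.
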
The propagator is set to the W3C composite (TraceContext + Baggage) unconditionally so `Inject`/`Extract` work even in no-op mode. Without this, cross-process links are silently dropped when the exporter is disabled.

## Span types

```go
ctx, span := telemetry.StartLLMSpan(ctx, "anthropic", "claude-sonnet-4-5")
defer span.End()
// ... make the call ...
telemetry.RecordLLMUsage(span,
    resp.InputTokens,
    resp.OutputTokens,
    resp.CachedInputTokens,
    resp.CacheCreationTokens,
    resp.CostUSD)
```

| Helper                                                  | Span name      | Attributes                                                                                                                                                   |
| ------------------------------------------------------- | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `StartRoutineRunSpan(ctx, slug, runID, pipelineID)`     | `routine.run`  | `crewship.routine.slug`, `crewship.routine.run_id`, `crewship.routine.pipeline_id`                                                                           |
| `StartRoutineStepSpan(ctx, stepID, stepType, attempt)`  | `routine.step` | `crewship.routine.step.id`, `crewship.routine.step.type`, `crewship.routine.step.attempt`                                                                    |
| `StartAgentSpan(ctx, agentID, type, crewID, missionID)` | `agent.invoke` | `crewship.agent.id`, `crewship.agent.type`, `crewship.crew.id`, `crewship.mission.id`                                                                        |
| `StartToolSpan(ctx, toolName, argsHash, sideEffect)`    | `tool.execute` | `crewship.tool.name`, `crewship.tool.args_hash`, `crewship.tool.side_effect`                                                                                 |
| `StartLLMSpan(ctx, provider, model)`                    | `llm.call`     | `gen_ai.system`, `gen_ai.request.model`                                                                                                                      |
| `RecordLLMUsage(span, ...)`                             | --             | `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.cached_input_tokens`, `gen_ai.usage.cache_creation_tokens`, `gen_ai.cost.total_usd` |
| `RecordError(span, err)`                                | --             | Sets status=Error, records exception event                                                                                                                   |

The span tree on a typical routine invocation:

```
routine.run                       (one per top-level invocation)
└── routine.step                  (one per DSL step; attempt=0)
    └── agent.invoke              (one per agent_run step)
        └── llm.call              (one per provider round-trip)
```

`call_pipeline` steps nest the child run as a `routine.step` of the parent, so the trace tree mirrors the DSL composition instead of producing sibling top-level routine spans.

LLM spans are added by `telemetry.LLMMiddleware`, the outermost layer of the [LLM middleware stack](/guides/llm-middleware). Agent and routine spans are wired at their call sites: `RunAgent` in `internal/orchestrator/orchestrator_run.go` opens `agent.invoke`, while `runDSL` and `runStep` in `internal/pipeline/executor.go` open `routine.run` and `routine.step`.

## Prompt-cache attributes

`gen_ai.usage.cached_input_tokens` and `gen_ai.usage.cache_creation_tokens` carry provider-reported prompt-cache counts:

| Provider  | Source field                                                          | Notes                                                                                                                                                                                                   |
| --------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Anthropic | `usage.cache_read_input_tokens` + `usage.cache_creation_input_tokens` | Activated via `anthropic-beta: prompt-caching-2024-07-31` header (set by default). Cache reads bill at \~10% of base input rate. System prompt + last tool definition carry `cache_control: ephemeral`. |
| OpenAI    | `usage.prompt_tokens_details.cached_tokens`                           | Auto-activates for prompts ≥1024 tokens. No separate creation counter — caching is opaque on OpenAI's side.                                                                                             |
| Ollama    | --                                                                    | Field stays zero.                                                                                                                                                                                       |

Dashboards can compute a fleet-wide cache-hit ratio as `cached_input_tokens / input_tokens` without per-provider branching. See [LLM middleware](/guides/llm-middleware) and [Paymaster](/guides/paymaster) for the wire-side details and the cost-ledger columns.

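As a rough illustration of that ratio, the sketch below sums the two attributes over a set of LLM spans and divides. The `llmUsage` type and `cacheHitRatio` function are hypothetical; a real dashboard would run the equivalent aggregation in its own query language.

```go
package main

import "fmt"

// llmUsage mirrors two of the gen_ai.usage.* span attributes (hypothetical type).
type llmUsage struct {
	InputTokens       int64
	CachedInputTokens int64
}

// cacheHitRatio returns sum(cached_input_tokens) / sum(input_tokens),
// or 0 when no input tokens were recorded.
func cacheHitRatio(spans []llmUsage) float64 {
	var input, cached int64
	for _, u := range spans {
		input += u.InputTokens
		cached += u.CachedInputTokens
	}
	if input == 0 {
		return 0
	}
	return float64(cached) / float64(input)
}

func main() {
	fleet := []llmUsage{
		{InputTokens: 2000, CachedInputTokens: 1500},
		{InputTokens: 1200, CachedInputTokens: 1024},
		{InputTokens: 800}, // a provider that reports no cache counts stays at zero
	}
	fmt.Printf("cache-hit ratio: %.2f\n", cacheHitRatio(fleet))
}
```

Summing before dividing weights each call by its token count, so a few large prompts are not drowned out by many small ones.
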
## Journal integration

At startup call `telemetry.RegisterJournalResolver()` -- this registers a `journal.SetTraceResolver` callback that pulls the active span context from `ctx` and returns `(trace_id, span_id)`. Every journal entry written via `Writer.Emit` then carries both IDs:

```go
// internal/journal/emit.go
if e.TraceID == "" {
    if t, s, ok := traceFromContext(ctx); ok {
        e.TraceID, e.SpanID = t, s
    }
}
```

The `journal_entries` table indexes `trace_id` for efficient trace -> journal lookups:

```sql
CREATE INDEX idx_journal_trace ON journal_entries(trace_id) WHERE trace_id IS NOT NULL;
```

With a trace open in your collector, query the journal for `trace_id = <that trace's ID>` to get every operation recorded during that request, without grepping logs.

## Propagation across processes

The sidecar speaks HTTP to crewshipd. The server chain injects trace headers on every outbound IPC call, and the sidecar extracts them on every inbound request. Result: a trace started in the web handler extends through the sidecar's keeper evaluation call all the way to the gatekeeper LLM round-trip.

For external webhooks fired by [Hooks](/guides/hooks), the `http` handler calls `telemetry.Inject(ctx, req.Header)` before sending so the receiving system can correlate.

## Gotchas

* **HTTP exporter only.** The `otlptracegrpc` exporter pulls in gRPC, which bloats the build. HTTP is fine for the expected volume and plays nicer with corporate proxies.
* **Bare `host:port` is plaintext.** Use `https://collector:4318` if you need TLS -- a bare `host:port` makes the exporter call `WithInsecure()`.
* **Init is idempotent.** Re-calling it (tests, config reloads) swaps the global provider and shuts down the previous one cleanly.
* **No spans without explicit wiring.** Adding telemetry to a new code path means wrapping the relevant call with a `StartXxxSpan` helper. The package does not auto-instrument HTTP handlers -- spans are deliberate, not reflexive.
* **Don't stamp journal entries twice.** `Emit` fills `trace_id` only if the caller left it empty. If you construct an entry manually and set `TraceID` to something synthetic, the resolver won't overwrite it.

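The "Init is idempotent" behavior amounts to a swap-then-shutdown under a lock. The sketch below shows that pattern with a stand-in `provider` type. All names are hypothetical and the real `Init` may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
)

// provider is a stand-in for a tracer provider with a shutdown hook.
type provider struct {
	name   string
	closed bool
}

// Shutdown marks the provider closed; a real one would flush pending spans.
func (p *provider) Shutdown() { p.closed = true }

var (
	mu      sync.Mutex
	current *provider
)

// initProvider installs a new global provider and shuts down the previous
// one, so repeated calls (tests, config reloads) do not leak exporters.
func initProvider(name string) *provider {
	mu.Lock()
	defer mu.Unlock()

	prev := current
	current = &provider{name: name}
	if prev != nil {
		prev.Shutdown()
	}
	return current
}

func main() {
	first := initProvider("collector-a")
	second := initProvider("collector-b")
	fmt.Println(first.name, "closed:", first.closed)
	fmt.Println(second.name, "closed:", second.closed)
}
```

Installing the new provider before shutting down the old one means there is no window in which the global provider is unset.
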
## Related

* [LLM middleware](/guides/llm-middleware) -- outermost layer starts LLM spans.
* [Crew Journal](/guides/crew-journal) -- `trace_id` / `span_id` columns.
* OTel GenAI Semantic Conventions: [https://opentelemetry.io/docs/specs/semconv/gen-ai/](https://opentelemetry.io/docs/specs/semconv/gen-ai/)
