Tracing

The internal/telemetry package wires Crewship to OpenTelemetry distributed tracing. (For the user-facing crash-reporting opt-out flow see the Telemetry guide.) It owns three concerns:

Provider init — build an OTLP HTTP exporter or fall back to a no-op tracer when no endpoint is configured.
Span builders — typed helpers that create agent, tool, and LLM spans with the attributes prescribed by the OTel GenAI Semantic Conventions (gen_ai.system, gen_ai.request.model, gen_ai.usage.*).
Propagation + journal integration — W3C Trace Context injection/extraction over HTTP headers plus a resolver that feeds the journal package so every entry is stamped with the current trace_id / span_id.

Zero-config safety

If OTEL_EXPORTER_OTLP_ENDPOINT is unset and Init is called with an empty endpoint, Init returns a no-op shutdown function and otel.GetTracerProvider() keeps returning the no-op provider. Spans are still created but never leave the process, so developers can run the binary without a collector and nothing breaks.

shutdown, err := telemetry.Init(ctx, os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT"), "crewship")
if err != nil { return err }
defer shutdown()

Configuration

Env var	Purpose
`OTEL_EXPORTER_OTLP_ENDPOINT`	Collector URL. Accepts `host:port` (plaintext) or `http(s)://host:port[/path]`.
`CREWSHIP_VERSION`	Stamped as `service.version` on the resource. Empty = `"dev"`.

Endpoint resolution order:

1. The explicit endpoint argument to Init (most specific wins).
2. OTEL_EXPORTER_OTLP_ENDPOINT.
3. Empty string -> no-op tracer, nothing exported.

The propagator is set to the W3C composite (TraceContext + Baggage) unconditionally so Inject/Extract work even in no-op mode. Without this, cross-process links are silently dropped when the exporter is disabled.

Span types

ctx, span := telemetry.StartLLMSpan(ctx, "anthropic", "claude-sonnet-4-5")
defer span.End()
// ... make the call ...
telemetry.RecordLLMUsage(span,
    resp.InputTokens,
    resp.OutputTokens,
    resp.CachedInputTokens,
    resp.CacheCreationTokens,
    resp.CostUSD)

Helper	Span name	Attributes
`StartRoutineRunSpan(ctx, slug, runID, pipelineID)`	`routine.run`	`crewship.routine.slug`, `crewship.routine.run_id`, `crewship.routine.pipeline_id`
`StartRoutineStepSpan(ctx, stepID, stepType, attempt)`	`routine.step`	`crewship.routine.step.id`, `crewship.routine.step.type`, `crewship.routine.step.attempt`
`StartAgentSpan(ctx, agentID, type, crewID, missionID)`	`agent.invoke`	`crewship.agent.id`, `crewship.agent.type`, `crewship.crew.id`, `crewship.mission.id`
`StartToolSpan(ctx, toolName, argsHash, sideEffect)`	`tool.execute`	`crewship.tool.name`, `crewship.tool.args_hash`, `crewship.tool.side_effect`
`StartLLMSpan(ctx, provider, model)`	`llm.call`	`gen_ai.system`, `gen_ai.request.model`
`RecordLLMUsage(span, ...)`	—	`gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.cached_input_tokens`, `gen_ai.usage.cache_creation_tokens`, `gen_ai.cost.total_usd`
`RecordError(span, err)`	—	Sets status=Error, records exception event
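
StartToolSpan takes an argsHash rather than the raw arguments. How Crewship derives that value is not stated here; the sketch below shows one plausible approach, assuming the goal is a stable fingerprint that keeps argument values out of span attributes. The `argsHash` function and the SHA-256-over-JSON scheme are assumptions for illustration only.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// argsHash returns a hex SHA-256 digest of the JSON encoding of args.
// encoding/json writes map keys in sorted order, so two maps with the
// same contents produce the same digest regardless of insertion order.
func argsHash(args map[string]any) (string, error) {
	b, err := json.Marshal(args)
	if err != nil {
		return "", fmt.Errorf("encode tool args: %w", err)
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	h, err := argsHash(map[string]any{"path": "/tmp/report.txt", "mode": "read"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(h)
}
```

A digest like this lets you group identical tool calls in a trace backend without exposing the argument values themselves.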

The span tree on a typical routine invocation:

routine.run                       (one per top-level invocation)
└── routine.step                  (one per DSL step; attempt=0)
    └── agent.invoke              (one per agent_run step)
        └── llm.call              (one per provider round-trip)

call_pipeline steps nest the child run as a routine.step of the parent, so the trace tree mirrors the DSL composition rather than producing sibling top-level routine spans.

LLM spans are added by telemetry.LLMMiddleware, the outermost layer of the LLM middleware stack. Agent and routine spans are wired at their call sites: RunAgent in internal/orchestrator/orchestrator_run.go opens agent.invoke, while runDSL and runStep in internal/pipeline/executor.go open routine.run and routine.step.

Prompt-cache attributes

gen_ai.usage.cached_input_tokens and gen_ai.usage.cache_creation_tokens carry provider-reported prompt-cache counts:

Provider	Source field	Notes
Anthropic	`usage.cache_read_input_tokens` + `usage.cache_creation_input_tokens`	Activated via `anthropic-beta: prompt-caching-2024-07-31` header (set by default). Cache reads bill at ~10% of base input rate. System prompt + last tool definition carry `cache_control: ephemeral`.
OpenAI	`usage.prompt_tokens_details.cached_tokens`	Auto-activates for prompts ≥1024 tokens. No separate creation counter — caching is opaque on OpenAI’s side.
Ollama	—	Field stays zero.

Dashboards can compute a fleet-wide cache-hit ratio as cached_input_tokens / input_tokens without per-provider branching. See LLM middleware and Paymaster for the wire-side details and cost-ledger columns.
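
As a worked example of the ratio, the sketch below aggregates token counts across spans before dividing. Summing first weights each call by its size, whereas averaging per-span ratios would let small calls dominate. The `usage` type is hypothetical, and the sketch assumes input_tokens already includes the cached portion, which this page does not confirm.

```go
package main

import "fmt"

// usage mirrors two of the counters recorded on llm.call spans.
type usage struct {
	InputTokens       int64
	CachedInputTokens int64
}

// cacheHitRatio returns total cached input tokens divided by total
// input tokens. It returns 0 when no input tokens were recorded.
func cacheHitRatio(spans []usage) float64 {
	var cached, input int64
	for _, u := range spans {
		cached += u.CachedInputTokens
		input += u.InputTokens
	}
	if input == 0 {
		return 0
	}
	return float64(cached) / float64(input)
}

func main() {
	spans := []usage{
		{InputTokens: 1000, CachedInputTokens: 800}, // warm cache
		{InputTokens: 2000, CachedInputTokens: 0},   // cold cache
		{InputTokens: 0, CachedInputTokens: 0},      // provider reports nothing
	}
	fmt.Printf("%.3f\n", cacheHitRatio(spans))
}
```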

Journal integration

At startup, call telemetry.RegisterJournalResolver(). It registers a callback through journal.SetTraceResolver that pulls the active span context from ctx and returns (trace_id, span_id). Every journal entry written via Writer.Emit then carries both IDs:

// internal/journal/emit.go
if e.TraceID == "" {
    if t, s, ok := traceFromContext(ctx); ok {
        e.TraceID, e.SpanID = t, s
    }
}

The journal_entries table indexes trace_id for efficient trace -> journal lookups:

CREATE INDEX idx_journal_trace ON journal_entries(trace_id) WHERE trace_id IS NOT NULL;

From a trace you’ve opened in your collector, search the journal for entries whose trace_id matches that trace, and you get every operation recorded during that request without grepping logs.

Propagation across processes

The sidecar speaks HTTP to crewshipd. The server chain injects trace headers on every outbound IPC call, and the sidecar extracts them on every inbound request. As a result, a trace started in the web handler can cover the sidecar’s keeper evaluation call all the way to the gatekeeper LLM round-trip. For external webhooks fired by Hooks, the HTTP handler calls telemetry.Inject(ctx, req.Header) before sending so the receiving system can correlate.

Gotchas

HTTP exporter only. The gRPC exporter (otlptracegrpc) pulls in grpc, which balloons the build. HTTP is fine for the expected volume and plays nicer with corporate proxies.
Bare host:port is plaintext. Use https://collector:4318 if you need TLS. A bare host:port is configured with WithInsecure().
Init is idempotent. Re-calling it (tests, config reloads) swaps the global provider and shuts down the previous one cleanly.
No spans without explicit wiring. Adding telemetry to a new code path means wrapping the relevant call with a StartXxxSpan helper. The package does not auto-instrument HTTP handlers — spans are deliberate, not reflexive.
Don’t stamp journal entries twice. Emit fills trace_id only if the caller left it empty. If you construct an entry manually and set TraceID to something synthetic, the resolver won’t overwrite it.
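
The "Init is idempotent" behavior can be pictured as a swap under a mutex followed by shutting down the previous provider. The sketch below is a toy model under that assumption. The `provider` type and `initProvider` function are hypothetical, and the real Init may order or guard these steps differently.

```go
package main

import (
	"fmt"
	"sync"
)

// provider is a toy stand-in for a tracer provider.
type provider struct {
	name string
	once sync.Once
	down bool
}

// shutdown is safe to call more than once; only the first call acts.
func (p *provider) shutdown() {
	p.once.Do(func() { p.down = true })
}

var (
	mu      sync.Mutex
	current *provider // the globally installed provider, nil before init
)

// initProvider installs a new provider and shuts down the one it
// replaced. It returns the new provider and its shutdown function.
func initProvider(name string) (*provider, func()) {
	next := &provider{name: name}

	mu.Lock()
	prev := current
	current = next
	mu.Unlock()

	// Shut down outside the lock so a slow flush cannot block other callers.
	if prev != nil {
		prev.shutdown()
	}
	return next, next.shutdown
}

func main() {
	first, stopFirst := initProvider("first")
	second, stopSecond := initProvider("second")
	fmt.Println(first.down, second.down) // true false

	stopFirst() // already shut down by the second init, so this does nothing
	stopSecond()
	fmt.Println(first.down, second.down) // true true
}
```

Making shutdown safe to repeat matters here, because a deferred shutdown from an earlier Init call may still run after a later call has already replaced its provider.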

Related

LLM middleware: the outermost layer starts LLM spans.
Crew Journal: trace_id / span_id columns.
OTel GenAI Semantic Conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/
