Tracing
The internal/telemetry package wires Crewship to OpenTelemetry distributed tracing. (For the user-facing crash-reporting opt-out flow, see the Telemetry guide.)
It owns three concerns:
- Provider init: build an OTLP HTTP exporter, or fall back to a no-op tracer when no endpoint is configured.
- Span builders: typed helpers that create agent, tool, and LLM spans with the attributes prescribed by the OTel GenAI Semantic Conventions (gen_ai.system, gen_ai.request.model, gen_ai.usage.*).
- Propagation + journal integration: W3C Trace Context injection/extraction over HTTP headers, plus a resolver that feeds the journal package so every entry is stamped with the current trace_id/span_id.
Zero-config safety
If OTEL_EXPORTER_OTLP_ENDPOINT is unset and Init is called with an empty endpoint, the shutdown function is a no-op and otel.GetTracerProvider() keeps returning the noop provider. Spans still exist but never leave the process, so developers can run the binary without a collector and nothing breaks.
Configuration
| Env var | Purpose |
|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | Collector URL. Accepts host:port (plaintext) or http(s)://host:port[/path]. |
| CREWSHIP_VERSION | Stamped as service.version on the resource. Empty = "dev". |
The endpoint is resolved in this order of precedence:
- The explicit endpoint argument to Init (most specific wins).
- OTEL_EXPORTER_OTLP_ENDPOINT.
- Empty string -> no-op tracer, nothing exported.
Inject/Extract work even in no-op mode. If they did not, cross-process links would be silently dropped whenever the exporter is disabled.
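The precedence and plaintext rules above can be sketched in a few lines of Go. This is a minimal illustration, not the package's actual code: resolveEndpoint and endpointConfig are hypothetical names, and the sketch assumes that anything not starting with https:// is treated as plaintext.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// endpointConfig is a hypothetical summary of what an Init-like function
// needs to decide before building an exporter.
type endpointConfig struct {
	Endpoint string // raw endpoint; empty in no-op mode
	Insecure bool   // true when the connection should be plaintext
	NoOp     bool   // true when nothing is exported
}

// resolveEndpoint applies the documented precedence: the explicit argument
// wins, then OTEL_EXPORTER_OTLP_ENDPOINT, then no-op mode.
// getenv is injected so the lookup can be tested without touching the
// process environment.
func resolveEndpoint(explicit string, getenv func(string) string) endpointConfig {
	ep := strings.TrimSpace(explicit)
	if ep == "" {
		ep = strings.TrimSpace(getenv("OTEL_EXPORTER_OTLP_ENDPOINT"))
	}
	if ep == "" {
		return endpointConfig{NoOp: true}
	}
	// Assumption: bare host:port and http:// are plaintext; only https:// uses TLS.
	insecure := !strings.HasPrefix(ep, "https://")
	return endpointConfig{Endpoint: ep, Insecure: insecure}
}

func main() {
	fmt.Printf("%+v\n", resolveEndpoint("", os.Getenv))
	fmt.Printf("%+v\n", resolveEndpoint("collector:4318", os.Getenv))
	fmt.Printf("%+v\n", resolveEndpoint("https://collector:4318", os.Getenv))
}
```

Passing the environment lookup as a function keeps the precedence logic testable in isolation.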
Span types
| Helper | Span name | Attributes |
|---|---|---|
| StartRoutineRunSpan(ctx, slug, runID, pipelineID) | routine.run | crewship.routine.slug, crewship.routine.run_id, crewship.routine.pipeline_id |
| StartRoutineStepSpan(ctx, stepID, stepType, attempt) | routine.step | crewship.routine.step.id, crewship.routine.step.type, crewship.routine.step.attempt |
| StartAgentSpan(ctx, agentID, type, crewID, missionID) | agent.invoke | crewship.agent.id, crewship.agent.type, crewship.crew.id, crewship.mission.id |
| StartToolSpan(ctx, toolName, argsHash, sideEffect) | tool.execute | crewship.tool.name, crewship.tool.args_hash, crewship.tool.side_effect |
| StartLLMSpan(ctx, provider, model) | llm.call | gen_ai.system, gen_ai.request.model |
| RecordLLMUsage(span, ...) | n/a | gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cached_input_tokens, gen_ai.usage.cache_creation_tokens, gen_ai.cost.total_usd |
| RecordError(span, err) | n/a | Sets status=Error, records exception event |
call_pipeline steps nest the child run as a routine.step of the parent — the trace tree mirrors the DSL composition rather than producing sibling top-level routine spans.
LLM spans are added by telemetry.LLMMiddleware, the outermost layer of the LLM middleware stack. Agent + routine spans are wired at their call sites: internal/orchestrator/orchestrator_run.go:RunAgent opens agent.invoke; internal/pipeline/executor.go:runDSL + runStep open routine.run + routine.step.
Prompt-cache attributes
gen_ai.usage.cached_input_tokens and gen_ai.usage.cache_creation_tokens carry provider-reported prompt-cache counts:
| Provider | Source field | Notes |
|---|---|---|
| Anthropic | usage.cache_read_input_tokens + usage.cache_creation_input_tokens | Activated via anthropic-beta: prompt-caching-2024-07-31 header (set by default). Cache reads bill at ~10% of base input rate. System prompt + last tool definition carry cache_control: ephemeral. |
| OpenAI | usage.prompt_tokens_details.cached_tokens | Auto-activates for prompts ≥1024 tokens. No separate creation counter — caching is opaque on OpenAI’s side. |
| Ollama | — | Field stays zero. |
Because both attributes are normalized across providers, a cache-hit ratio can be computed as cached_input_tokens / input_tokens without per-provider branching. See LLM middleware and Paymaster for the wire-side details + cost-ledger columns.
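As a small worked example, the ratio could be computed like this. The llmUsage struct and cacheHitRatio function are hypothetical and only mirror the attribute names above. The sketch also assumes that input_tokens already includes the cached tokens, which is worth verifying against the middleware before relying on it.

```go
package main

import "fmt"

// llmUsage mirrors gen_ai.usage.input_tokens and
// gen_ai.usage.cached_input_tokens; the struct itself is hypothetical.
type llmUsage struct {
	InputTokens       int64
	CachedInputTokens int64
}

// cacheHitRatio returns cached_input_tokens / input_tokens.
// It returns 0 when there are no input tokens, which also covers
// providers such as Ollama where the cached field stays zero.
func cacheHitRatio(u llmUsage) float64 {
	if u.InputTokens <= 0 {
		return 0
	}
	return float64(u.CachedInputTokens) / float64(u.InputTokens)
}

func main() {
	u := llmUsage{InputTokens: 2000, CachedInputTokens: 500}
	fmt.Printf("cache hit ratio: %.2f\n", cacheHitRatio(u))
}
```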
Journal integration
At startup, call telemetry.RegisterJournalResolver(). This registers a journal.SetTraceResolver callback that pulls the active span context from ctx and returns (trace_id, span_id). Every journal entry written via Writer.Emit then carries both IDs.
The journal_entries table indexes trace_id for efficient trace -> journal lookups. Take the trace ID of a request, filter on trace_id = <that>, and you get every operation recorded during that request without grepping logs.
Propagation across processes
The sidecar speaks HTTP to crewshipd. The server chain injects trace headers on every outbound IPC call, and the sidecar extracts them on every inbound request. Result: a trace started in the web handler can cover the sidecar's keeper evaluation call all the way to the gatekeeper LLM round-trip as a single trace. For external webhooks fired by Hooks, the HTTP handler calls telemetry.Inject(ctx, req.Header) before sending so the receiving system can correlate.
Gotchas
- HTTP exporter only. The otlptrace-grpc exporter pulls in grpc, which balloons the build. HTTP is fine for expected volume and plays nicer with corporate proxies.
- Bare host:port is plaintext. Use https://collector:4318 if you need TLS; bare host:port endpoints call WithInsecure().
- Init is idempotent. Re-calling it (tests, config reloads) swaps the global provider and shuts down the previous one cleanly.
- No spans without explicit wiring. Adding telemetry to a new code path means wrapping the relevant call with a StartXxxSpan helper. The package does not auto-instrument HTTP handlers: spans are deliberate, not reflexive.
- Don't stamp journal entries twice. Emit fills trace_id only if the caller left it empty. If you construct an entry manually and set TraceID to something synthetic, the resolver won't overwrite it.
Related
- LLM middleware: outermost layer starts LLM spans.
- Crew Journal: trace_id/span_id columns.
- OTel GenAI Semantic Conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/