crewshipd exposes a Prometheus text-format endpoint at GET /metrics on the main HTTP port. It serves two groups of series: process gauges (uptime, memory, goroutines, WebSocket connections) and domain metrics — the counters and gauges an operator alerts on.

Authorization

/metrics is not public:
  • Requests whose true client IP is loopback (resolved with X-Forwarded-For taken into account) are always allowed. This covers the typical node-local Prometheus or sidecar scrape.
  • Remote scrapers must send Authorization: Bearer <token> matching the CREWSHIP_METRICS_TOKEN environment variable.
  • With no token configured, non-loopback requests get a 404.
# prometheus.yml
scrape_configs:
  - job_name: crewshipd
    scrape_interval: 30s
    authorization:
      credentials: <CREWSHIP_METRICS_TOKEN>
    static_configs:
      - targets: ["crewship.example.com:8080"]
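
The access rules above can be read as a small decision function. The following is a minimal sketch, not crewshipd's actual handler: the function name metrics_status is hypothetical, client_ip is assumed to be the already-resolved true client IP, and the 401 returned for a missing or wrong token is an assumption, since the page only states the 404 case.

```python
import hmac
import ipaddress
from typing import Optional


def metrics_status(client_ip: str, auth_header: Optional[str],
                   configured_token: Optional[str]) -> int:
    """Return the HTTP status a /metrics request would get under the documented rules."""
    # Loopback callers are always allowed, with or without a token.
    if ipaddress.ip_address(client_ip).is_loopback:
        return 200
    # No token configured: remote callers get a 404.
    if not configured_token:
        return 404
    # Remote callers must present "Authorization: Bearer <token>".
    prefix = "Bearer "
    if auth_header and auth_header.startswith(prefix):
        presented = auth_header[len(prefix):].encode()
        # Constant-time comparison avoids leaking the token through timing.
        if hmac.compare_digest(presented, configured_token.encode()):
            return 200
    # ASSUMPTION: the page does not say what a bad or missing token returns.
    return 401
```

For example, metrics_status("10.0.0.5", None, None) yields 404, while the same call from "127.0.0.1" yields 200.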

Process metrics

| Metric | Type | Description |
| --- | --- | --- |
| crewshipd_uptime_seconds | gauge | Time since crewshipd started |
| crewshipd_goroutines | gauge | Number of goroutines |
| crewshipd_memory_alloc_bytes | gauge | Bytes of allocated heap |
| crewshipd_memory_sys_bytes | gauge | Total bytes obtained from the OS |
| crewshipd_gc_runs_total | counter | Total GC runs |
| crewshipd_ws_connections | gauge | Active WebSocket connections |

Every series carries a hostname label.
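
To make the shape of these series concrete, here is a minimal sketch of reading samples from Prometheus text-format output. The sample text, the hostname value node-1, and the numbers are illustrative, not captured from a real crewshipd. The parser is deliberately simplified: it skips comment lines and does not handle timestamps or escaped quotes in label values.

```python
import re
from typing import Dict, List, Tuple

# Metric name, optional {labels}, then the value.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_samples(text: str) -> List[Tuple[str, Dict[str, str], float]]:
    """Return (name, labels, value) for each sample line in the text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue  # unsupported line shape (for example, a trailing timestamp)
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative input only; values are made up.
EXAMPLE = """\
# TYPE crewshipd_goroutines gauge
crewshipd_goroutines{hostname="node-1"} 42
crewshipd_ws_connections{hostname="node-1"} 3
"""
```

Calling parse_samples(EXAMPLE) returns two samples, each with a hostname entry in its label dictionary.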

Domain metrics

Assignments and queue

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| crewshipd_assignments | gauge | status | Assignments currently in each status. Statuses: pending, queued, running, completed, failed, cancelled; anything unrecognized folds into other. All label values are always emitted (zero-filled). |
| crewshipd_assignment_queue_depth | gauge | | QUEUED assignments across all crews |
| crewshipd_assignment_queue_crews | gauge | | Crews with at least one queued assignment |
| crewshipd_assignment_queue_depth_max | gauge | | Queued assignments in the most backlogged crew |

Queue depth is deliberately aggregated, not labeled per crew: crews are user-created and unbounded, and per-crew labels would grow the series set without limit. The three aggregates cover the alerting cases: total backlog growing, backlog spreading across crews, and a single crew wedged (depth_max climbing while depth is flat).
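
As an illustration of how the three aggregates relate to per-crew backlog, the sketch below derives them from a mapping of crew to queued-assignment count. The function name and input shape are hypothetical; crewshipd presumably computes these from database counts rather than from an in-memory dictionary.

```python
from typing import Dict, Mapping


def queue_aggregates(queued_by_crew: Mapping[str, int]) -> Dict[str, int]:
    """Collapse per-crew queue depths into the three bounded series."""
    # Only crews with at least one queued assignment count as backlogged.
    backlogged = [count for count in queued_by_crew.values() if count > 0]
    return {
        "crewshipd_assignment_queue_depth": sum(backlogged),
        "crewshipd_assignment_queue_crews": len(backlogged),
        # default=0 keeps the gauge defined when nothing is queued.
        "crewshipd_assignment_queue_depth_max": max(backlogged, default=0),
    }
```

However many crews exist, the output is always exactly three series, which is the point of aggregating instead of labeling per crew.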

Pipeline runs

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| crewshipd_pipeline_runs | gauge | status | Pipeline runs by status: queued, running, completed, failed, cancelled, dry_run, interrupted (+ other), zero-filled |
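
Both status gauges (assignments and pipeline runs) follow the same pattern: a fixed label set, unknown values folded into other, and zeros emitted for statuses with no rows. A minimal sketch of that pattern follows; the function name is hypothetical, and the exact-match, case-sensitive comparison is an assumption.

```python
from typing import Dict, Iterable, Sequence

ASSIGNMENT_STATUSES = ("pending", "queued", "running", "completed", "failed", "cancelled")
PIPELINE_STATUSES = ("queued", "running", "completed", "failed", "cancelled",
                     "dry_run", "interrupted")


def status_gauge(rows: Iterable[str], known: Sequence[str]) -> Dict[str, int]:
    """Count rows per status, zero-filled, with unknown statuses folded into 'other'."""
    # Start every known label (plus 'other') at zero so all series always exist.
    counts = {status: 0 for status in (*known, "other")}
    for status in rows:
        # ASSUMPTION: statuses are compared exactly, without case folding.
        key = status if status in known else "other"
        counts[key] += 1
    return counts
```

Zero-filling means an alert on a status value such as failed evaluates against 0 rather than against a missing series.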

Agent runs

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| crewshipd_agent_run_events_total | counter | event | Agent run lifecycle events from the unified journal (live + archived rows): started, completed, failed, cancelled, timeout |

Alert on failure rate with the usual counter recipe:
sum(rate(crewshipd_agent_run_events_total{event="failed"}[10m]))
/
sum(rate(crewshipd_agent_run_events_total{event="started"}[10m])) > 0.2
Journal retention pruning can shrink these counters; Prometheus rate() / increase() treat that as a normal counter reset.
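
To see what counter-reset handling means for a pruned counter, here is a simplified sketch of the rule Prometheus applies: any drop between consecutive samples is treated as a reset to zero. It omits the extrapolation Prometheus performs at the edges of the range window, and the sample values are illustrative.

```python
from typing import Sequence


def increase(samples: Sequence[float]) -> float:
    """Total growth across samples, treating any decrease as a counter reset."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter dropped: assume it restarted from zero, so the
            # whole current value counts as growth since the reset.
            total += current
    return total
```

With samples 100, 110, 40, 55 the result is 65: 10 before the drop, 40 attributed to the reset, and 15 after it. Under this rule the entire post-prune value is counted as growth, so a prune may appear as a one-off bump in increase() rather than as a negative rate.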

LLM cost (paymaster)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| crewshipd_llm_calls_total | counter | provider | LLM invocations recorded in the paymaster cost ledger |
| crewshipd_llm_cost_usd_total | counter | provider | Cumulative LLM spend in USD |

Provider label values are capped (overflow folds into provider="other") so the series set stays bounded. Spend-rate alert:
sum(increase(crewshipd_llm_cost_usd_total[1h])) > 5
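
One way to picture the provider cap is a fold that keeps the first N distinct providers and sums the rest under other. This is a sketch under stated assumptions: the cap value, the first-seen selection policy, the function name, and the provider names in the example are all hypothetical, since the page only says that overflow folds into provider="other".

```python
from typing import Dict, Iterable, Tuple

MAX_PROVIDERS = 3  # HYPOTHETICAL cap; the real limit is not documented here.


def cost_by_provider(ledger: Iterable[Tuple[str, float]],
                     max_providers: int = MAX_PROVIDERS) -> Dict[str, float]:
    """Sum spend per provider, folding providers beyond the cap into 'other'."""
    named: Dict[str, float] = {}
    overflow = 0.0
    for provider, cost_usd in ledger:
        # ASSUMPTION: the first max_providers distinct names keep their own label.
        if provider in named or len(named) < max_providers:
            named[provider] = named.get(provider, 0.0) + cost_usd
        else:
            overflow += cost_usd
    result = dict(named)
    if overflow:
        result["other"] = overflow
    return result
```

Whatever the real selection policy is, the effect is the same: at most max_providers + 1 series per metric, regardless of how many provider names appear in the ledger.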

Containers

| Metric | Type | Description |
| --- | --- | --- |
| crewshipd_containers_tracked | gauge | Crew containers registered with the stats collector |
| crewshipd_containers_reporting | gauge | Tracked containers with a collected stats sample. A cheap health proxy: tracked - reporting > 0 for more than a couple of poll intervals means a container is not answering stats |
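
The "tracked minus reporting" health proxy can be sketched as a check over recent polls. The function name is hypothetical, and the threshold of three consecutive polls is an assumed reading of "more than a couple of poll intervals".

```python
from typing import Sequence, Tuple


def containers_silent(polls: Sequence[Tuple[int, int]], min_polls: int = 3) -> bool:
    """polls holds (tracked, reporting) pairs, oldest first, one per poll interval.

    Returns True when the gap has been positive for the last min_polls polls.
    """
    if min_polls < 1 or len(polls) < min_polls:
        return False
    recent = polls[-min_polls:]
    # A single positive gap can be a container that has just started;
    # only a sustained gap suggests a container is not answering stats.
    return all(tracked - reporting > 0 for tracked, reporting in recent)
```

Requiring a sustained gap avoids firing on containers that were registered moments before their first stats sample arrived.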

Database

| Metric | Type | Description |
| --- | --- | --- |
| crewshipd_db_migration_version | gauge | Highest applied schema migration version. Compare across a fleet to catch a node running old code against a newer schema. |
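
The fleet comparison amounts to finding nodes whose reported version is below the fleet maximum. A minimal sketch, assuming the values have already been collected into a mapping keyed by hostname label (the function name, host names, and version numbers are illustrative):

```python
from typing import List, Mapping


def lagging_nodes(version_by_host: Mapping[str, int]) -> List[str]:
    """Return hosts reporting a migration version below the fleet maximum."""
    if not version_by_host:
        return []
    newest = max(version_by_host.values())
    return sorted(host for host, version in version_by_host.items()
                  if version < newest)
```

For a fleet reporting {"node-1": 42, "node-2": 41, "node-3": 42}, the result is ["node-2"].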

Freshness and cost

The DB-derived block is computed from indexed counts and cached for 15 seconds — scraping more often than that returns the same snapshot. At typical 15–60s scrape intervals this is invisible; it exists so a scraper retry storm (or an abusive client that got hold of the token) cannot turn /metrics into a query amplifier. For traces and OTLP export, see OTLP setup.
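
The 15-second snapshot behavior described above can be sketched as a small time-to-live cache. This is an illustration, not crewshipd's implementation: the class name and the injectable clock are hypothetical, and the sketch is single-threaded, so a real server would need a lock around the refresh.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class SnapshotCache(Generic[T]):
    """Recompute a value at most once per ttl_seconds; serve the snapshot otherwise."""

    def __init__(self, compute: Callable[[], T], ttl_seconds: float = 15.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[T] = None
        self._expires_at: Optional[float] = None

    def get(self) -> T:
        now = self._clock()
        # Refresh only on first use or once the snapshot has expired.
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value  # type: ignore[return-value]
```

However many scrapes arrive inside one 15-second window, compute runs once, which is what keeps a retry storm from multiplying database queries.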