Backup & Restore
Overview
Crewship’s backup system produces portable .tar.zst bundles that capture a workspace (or a single crew) in one file. Bundles are AGE-encrypted by default, carry a versioned manifest, and can be restored on any Crewship instance that speaks the same format version (N-2 compatibility guarantee). The whole subsystem is admin-only by design: every backup subcommand requires the OWNER or ADMIN role on the workspace, and the runner refuses MEMBER / VIEWER calls at both the CLI parsing layer and the server-side handler — defence in depth, not just a UI veneer.
The architectural choice that shapes everything else is “one bundle, one file”. Backups don’t depend on an external object store, don’t require a sidecar, and don’t shard across files. A .tar.zst is a single artefact an operator can scp to a backup host, hand to a customer for legal hold, or check into a private bucket, all without the rest of Crewship being available. Inside the tarball, the manifest is plaintext JSON (so crewship backup inspect can read it without the passphrase or AGE private key) and the payload is the encrypted SQLite snapshot plus any referenced workspace files. The manifest schema is forward-compatible: a bundle produced on format version N restores on servers at N, N+1, and N+2. When an older bundle lands on a newer server, the restoreBackfill hooks of the intervening migrations (where registered) run so that columns added since the snapshot are populated sanely instead of being left at SQL defaults.
Restores are intentionally cautious. --dry-run walks the entire restore plan (schema diff, row counts, blob deltas) and prints what would happen without committing any workspace or crew data to the destination database. Advisory locking (backup_locks table, per-workspace) prevents two concurrent runs from corrupting each other, and the lock row records the host + PID so crewship backup status can tell you who’s holding it. If a host crashes mid-restore, crewship backup unlock is the manual recovery: admin-only with a confirmation prompt, because clearing a real lock from another live process is how you trash a workspace.
When to use it
Backups are cheap to make and expensive to wish you’d made. The five canonical reasons to run crewship backup create:
- Before any destructive admin write. Before crewship admin reset-password, before a database migration on a binary upgrade, before a large schema-changing PR lands in prod: capture the current state first. The bundle is the one-command rollback if anything goes wrong.
- Disaster recovery / hot spare. Schedule a nightly crewship backup create --scope=workspace --passphrase-file … cron and ship the bundle to a separate host. If the primary disk dies, restoring onto a fresh binary is one crewship backup restore away, with no replication agent, no streaming WAL, and no extra moving parts.
- Workspace migration to another host. Moving a workspace from a dev VM to a prod host (or between two prod hosts) is exactly what bundles are for. Create on the source, scp the file, crewship backup restore --as-workspace <new-slug> on the destination, then provision crews. The --as-workspace rename avoids the “two acme workspaces colliding” hazard.
- Legal hold or compliance archive. Customer leaves; you need to keep their workspace state on cold storage for N years. One AGE-encrypted .tar.zst is a forever-readable artefact: no live database, no service required. inspect later proves the bundle’s contents without decrypting.
- Forensic snapshot before incident response. Suspected compromise of an admin account, or a “what was the state when X happened” investigation. backup create + immediate offsite copy preserves the audit trail before anyone (you, the attacker, the well-meaning oncall) starts changing things.
Skip backups for ephemeral development workspaces (the dev VM rebuilds them from seed scripts anyway) and single-agent throwaways with no user-visible state worth preserving — the bundle metadata cost is real (~few MB minimum) and not every workspace earns it.
Key concepts
| Term | What it means here |
|---|---|
| Bundle | A single .tar.zst artefact containing one MANIFEST (plaintext JSON), one payload.age (AGE-sealed tar of DB rows + workspace files), and one payload.sha256. The whole backup is this one file — portable, hashable, copy-able with scp. |
| Scope | workspace (the workspace row + every crew under it) or crew (a single crew + its agents). Crew bundles restore independently of their parent workspace, so a “move this crew to another workspace” is a viable migration path. |
| Format version | Integer in MANIFEST.format_version. Compatibility guarantee is N-2 — a bundle produced on format v3 restores on v3, v4, and v5 servers. Bumped only when a migration changes the bundle layout itself (not the schema rows inside). |
| restoreBackfill hook | Per-migration Go function in internal/backup/restore_backfill.go. Runs when a bundle predates a migration and the restoring server has columns the source didn’t. Pure ADD COLUMN migrations rely on the SQL DEFAULT; complex backfills (JSON shape changes, foreign keys) provide a hook so restored rows land sanely. |
| AGE encryption | The bundle’s payload is sealed with filippo.io/age. Default mode is passphrase (scrypt-derived key); --recipient age1… switches to X25519 public-key encryption for hand-offs. --no-encrypt produces a plaintext payload for test/CI use only. |
| Passphrase keyring | Opt-in cache at ~/.crewship/backup-keyring.enc (AES-256-GCM-encrypted under the host’s ENCRYPTION_KEY, as described under Passphrase keyring) so operators don’t retype the passphrase on every create / restore. Enabled via --use-keyring on create / restore. |
| Advisory lock | A row in the backup_locks table, primary-keyed on workspace ID. Taken before any DB dump or docker pause; released by defer Release() on the happy path. 1-hour TTL (DefaultLockTTL) so a crashed backup self-heals after the window. |
| refuseIfBackupInProgress guard | A shared middleware wired into the assignments, peer-query, and webhook handlers. Reads the lock state and refuses new agent runs while a backup is in progress, closing the TOCTOU window between ensureAgentsIdle (initial check) and docker pause (the actual freeze). |
| Dry-run restore | --dry-run on crewship backup restore. Decrypts, validates the manifest, replays the DB transaction, then rolls back. The only side effect is one backup.restore.dry_run audit row — distinct from backup.restore so auditors can tell “verified” apart from “actually restored”. |
| --as-workspace / --as-crew rename | Restore the bundle under a new slug instead of the original. Refuses to run the docker phase (container names derive from the slug) and tells the operator to crewship crew provision afterwards. Avoids “two acme workspaces colliding” during DR drills. |
| Retention sweep | crewship backup rotate --keep-last N --keep-days D. Per-workspace: it never touches another workspace’s bundles. At least one of the two flags must be positive, and both can be combined. --dry-run lists what would be deleted without touching disk. |
| Stale lock | A lock whose holder process is dead but whose row still exists. Detected when acquired_at + DefaultLockTTL < now(). Auto-released on next create; manually clearable via crewship backup unlock (admin-only; interactive confirmation, or --force in scripts). |
| Pre-migration snapshot | An automatic safety net distinct from the manual bundle system. Whenever crewship start detects pending migrations, database.SnapshotBeforeMigrate writes a VACUUM INTO copy of the live SQLite database to <dbpath>.pre-migrate-vN-to-vM-<UTC>.bak before any DDL runs. Last 10 snapshots are retained per database; opt out with CREWSHIP_SKIP_MIGRATION_BACKUP=1. Not a substitute for proper backups — it’s the binary upgrade safety net, not a disaster-recovery primitive. |
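The N-2 rule in the Format version row reduces to a small range check. The Python sketch below is illustrative only: can_restore is a hypothetical helper, not a Crewship function, and it assumes a server rejects bundles newer than its own format version (the text above only states the N-2 window).

```python
def can_restore(bundle_version: int, server_version: int, window: int = 2) -> bool:
    """Return True if a server at `server_version` accepts a bundle written at
    `bundle_version` under an N-`window` compatibility guarantee.

    Assumption: bundles newer than the server are rejected.
    """
    return server_version - window <= bundle_version <= server_version


# A v3 bundle restores on v3, v4, and v5 servers, but not on v6.
for server in (3, 4, 5, 6):
    print(server, can_restore(3, server))
```

Checking format_version this way before shipping a cold-storage bundle to a new host is cheaper than discovering the mismatch mid-restore.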
Usage
The whole backup surface is the crewship backup command group — nine subcommands that cover the create → verify → restore → rotate lifecycle. Each subcommand has a dedicated section below with full flag reference and copy-pasteable examples; this table is the entry point.
| Command | Purpose | Deep dive |
|---|---|---|
| crewship backup create | Produce a new bundle for a workspace or crew. | Create a workspace backup / Back up a single crew |
| crewship backup list | List bundles on disk. | List, inspect, verify |
| crewship backup inspect | Print a bundle’s manifest without decrypting the payload. | List, inspect, verify |
| crewship backup verify | Checksum the sealed payload against the manifest (no decryption, no DB writes). | List, inspect, verify |
| crewship backup restore | Restore a bundle (supports --dry-run). | Restore |
| crewship backup delete | Remove a single bundle (interactive confirm or --force). | Delete & rotate |
| crewship backup rotate | Retention sweep: --keep-last N / --keep-days D. | Delete & rotate |
| crewship backup status | Show the advisory lock state for the current workspace. | Lock semantics |
| crewship backup unlock | Release a stale lock owned by this host (admin-only, confirm required). | Lock semantics |
The minimum end-to-end loop is four commands: create to produce the bundle, verify to prove it’s not corrupt, restore --dry-run to prove the destination will accept it, then restore for real. Every other subcommand exists for retention (rotate), introspection (list, inspect, status), or recovery (unlock).
create and restore accept --use-keyring to cache and reuse the
workspace passphrase via ~/.crewship/backup-keyring.enc. See
Passphrase keyring below.
Bundle layout
crewship-<scope>-<slug>-<iso-ts>.tar.zst
├── MANIFEST (plaintext JSON, format_version, scope, checksums)
├── payload.age (AGE-sealed tar.zst of DB rows + workspace files)
└── payload.sha256 (SHA-256 of the sealed payload bytes)
- Scope: workspace (workspace row + all its crews) or crew (single crew + its agents).
- Encryption: AGE passphrase (default) or AGE X25519 recipient. --no-encrypt produces a plaintext payload for test / CI use.
- Default location: ~/.crewship/backups/ on the server, mode 0700.
- Naming: crewship-<scope>-<slug>-<iso-ts>.tar.zst. Collisions append -<hash8>.
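Scripts that sort or filter bundles often need to split the filename back into its parts. The following Python sketch infers the pattern from the example filenames on this page; parse_bundle_name is a hypothetical helper, and the assumption that the -<hash8> collision suffix is eight lowercase hex characters is not confirmed by the source.

```python
import re
from typing import Optional

# Pattern inferred from examples such as
# crewship-workspace-acme-2026-04-15T12-05-01Z.tar.zst
# Assumption: the collision suffix is eight lowercase hex characters.
BUNDLE_NAME = re.compile(
    r"^crewship-(?P<scope>workspace|crew)-(?P<slug>.+)-"
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}Z)"
    r"(?:-(?P<hash8>[0-9a-f]{8}))?\.tar\.zst$"
)


def parse_bundle_name(filename: str) -> Optional[dict]:
    """Split a bundle filename into scope, slug, timestamp, and optional hash."""
    match = BUNDLE_NAME.match(filename)
    return match.groupdict() if match else None


print(parse_bundle_name("crewship-crew-dev-team-2026-04-15T12-05-01Z.tar.zst"))
```

Slugs may themselves contain hyphens (dev-team, acme-dr), which is why the timestamp is matched by its fixed shape rather than by splitting on "-".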
Create a workspace backup
crewship backup create --scope=workspace
# Passphrase: ********
# Confirm passphrase: ********
# ✓ Backup created: /home/admin/.crewship/backups/crewship-workspace-acme-2026-04-15T12-05-01Z.tar.zst
The CLI prompts twice for a passphrase (to guard against typos) and confirms success with a row summarising scope, size, format version, and the SHA-256 of the sealed payload.
Non-interactive / CI
Supply the passphrase from a file:
crewship backup create --scope=workspace --passphrase-file /run/secrets/backup.pw
Or pipe a single line on stdin (falls back automatically when stdin is not a TTY and --passphrase-file is not set).
Asymmetric encryption
If the restoring party holds an AGE X25519 keypair, pass their public key instead of a shared secret:
crewship backup create \
--scope=workspace \
--recipient age1ql3z7hjy54pw3hyww5ayyfg7zqgvc7w3j2elw8zmrj2kg5sfn9aqmcac8p
--recipient, --passphrase-file, and --no-encrypt are mutually exclusive.
Back up a single crew
crewship backup create --scope=crew --crew dev-team
--crew accepts either a slug or a crew ID. Crew-scope bundles restore independently of their parent workspace.
List, inspect, verify
crewship backup list
# FILE SCOPE SIZE ENCRYPTED FORMAT CREATED_AT
# crewship-workspace-acme-2026-04-15T12-05-01Z… workspace 12.8 MiB yes v1 2026-04-15T12:05:01Z
crewship backup inspect ~/.crewship/backups/crewship-workspace-acme-…tar.zst
# { "format_version": 1, "scope": "workspace", "contents": { "workspace": { "slug": "acme", … } } }
crewship backup verify ~/.crewship/backups/crewship-workspace-acme-…tar.zst
# ✓ VALID — /home/admin/.crewship/backups/… (12.8 MiB)
inspect only reads the plaintext MANIFEST — it never touches the sealed payload, so no passphrase is needed. verify recomputes the SHA-256 of the sealed bytes against the manifest and fails if the bundle was truncated or tampered with. Neither decrypts.
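The integrity check that verify performs can be approximated outside Crewship. This Python sketch is not the CLI’s implementation: it assumes payload.age has already been extracted from the bundle and that the recorded digest is available as a hex string (the exact on-disk format of payload.sha256 and of the manifest checksum field is not specified here). sha256_file and payload_matches are hypothetical names.

```python
import hashlib
import hmac


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so memory use stays constant."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def payload_matches(payload_path: str, expected_hex: str) -> bool:
    """Compare the recomputed digest with the recorded one in constant time."""
    recorded = expected_hex.strip().lower().encode("ascii", "replace")
    return hmac.compare_digest(sha256_file(payload_path).encode("ascii"), recorded)
```

Because only the sealed bytes are hashed, the check needs no passphrase, matching the behaviour described above.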
Restore
crewship backup restore ~/.crewship/backups/crewship-workspace-acme-…tar.zst
# Passphrase: ********
# ✓ Restore complete — workspace=acme crews=4 rows=312 id=ws_abc123
The server rejects the restore if a workspace (or crew) with the same slug already exists. Override with --as-workspace <new-slug> or --as-crew <new-slug> to land the payload under a fresh identity:
crewship backup restore bundle.tar.zst --as-workspace acme-dr
# ⚠ Docker phase skipped (--as-workspace/--as-crew supplied).
# Provision the new crews with `crewship crew provision`.
When the bundle is landed under a new slug the docker phase is intentionally skipped — container names are derived from the slug, and renaming a live crew requires an explicit provision step.
Dry run
crewship backup restore bundle.tar.zst --dry-run
# ✓ Restore validation complete (dry-run; no workspace/crew data changes applied)
A dry-run decrypts the bundle, validates the manifest, replays the DB transaction, and then rolls back. The only side effect is a single backup.restore.dry_run row in the audit log — handy for proving a bundle is restorable before the real cutover. The distinct audit action lets auditors tell “verified” apart from “actually restored”.
Delete & rotate
crewship backup delete ~/.crewship/backups/old.tar.zst
# Delete backup …? [y/N] y
# ✓ Backup deleted
crewship backup rotate --keep-last 10 --keep-days 30 --dry-run
# Would delete 3 bundle(s):
# /home/admin/.crewship/backups/crewship-workspace-acme-2025-12-01…
# …
rotate applies retention per workspace; it never touches another workspace’s bundles. At least one of --keep-last N (bundles beyond the N newest are dropped) or --keep-days D (bundles older than D days are dropped) must be positive; both can be combined.
delete requires interactive confirmation, or --force in scripts / CI. The same rule applies to backup unlock.
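To reason about what a sweep would remove, the two retention rules can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the rotate implementation: it assumes a flag left at 0 is inactive, and that when both flags are set a bundle is dropped if either rule drops it. The source does not spell out how the rules combine, so confirm with --dry-run before relying on this. bundles_to_delete is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def bundles_to_delete(created_at, now, keep_last=0, keep_days=0):
    """Return indexes into `created_at` that a retention sweep would drop.

    Assumptions (not confirmed by the source): a flag left at 0 is inactive,
    and with both flags set a bundle is dropped if either rule drops it.
    """
    if keep_last <= 0 and keep_days <= 0:
        raise ValueError("at least one of keep_last / keep_days must be positive")
    newest_first = sorted(range(len(created_at)), key=lambda i: created_at[i], reverse=True)
    doomed = set()
    if keep_last > 0:
        doomed.update(newest_first[keep_last:])  # everything beyond the N newest
    if keep_days > 0:
        cutoff = now - timedelta(days=keep_days)
        doomed.update(i for i in newest_first if created_at[i] < cutoff)
    return sorted(doomed)


now = datetime(2026, 4, 15, tzinfo=timezone.utc)
created = [now - timedelta(days=d) for d in (1, 5, 40, 60)]
print(bundles_to_delete(created, now, keep_last=3, keep_days=30))  # [2, 3]
```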
Lock semantics
Each workspace holds at most one advisory backup lock at a time (table backup_locks, per-workspace PK). The lock:
- Is taken before any DB dump or docker pause and released by a deferred Release() on the happy path.
- Has a 1-hour TTL (DefaultLockTTL); a crashed backup self-heals after the window.
- Blocks concurrent backup create calls: the second caller gets HTTP 409 Conflict with an “another backup is already in progress” message.
- Blocks new agent runs via the shared refuseIfBackupInProgress guard wired into the assignments, peer-query, and webhook handlers. This closes the TOCTOU window between ensureAgentsIdle and docker pause.
Inspect or release the lock:
crewship backup status
# WORKSPACE ACQUIRED_BY ACQUIRED_AT EXPIRES_AT
# ws_abc123 admin@acme.io 2026-04-15T12:04:58Z 2026-04-15T13:04:58Z
crewship backup unlock --force
# ✓ Backup lock released.
backup unlock is an emergency escape hatch. Only use it when you can confirm no backup is actually running (e.g. the previous CLI session crashed and the 1 h TTL has not yet fired). Forcibly releasing a live backup’s lock will let a second backup start alongside it, and the two will race on the docker pause/unpause sequence.
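The TTL arithmetic behind the EXPIRES_AT column and the stale-lock rule (acquired_at + DefaultLockTTL < now()) can be sketched in Python. The names below are illustrative stand-ins, not Crewship identifiers; only the 1-hour TTL comes from the text above.

```python
from datetime import datetime, timedelta, timezone

# Mirrors the 1-hour DefaultLockTTL described above.
DEFAULT_LOCK_TTL = timedelta(hours=1)


def lock_expires_at(acquired_at: datetime, ttl: timedelta = DEFAULT_LOCK_TTL) -> datetime:
    """Moment at which the lock stops being honoured."""
    return acquired_at + ttl


def is_stale(acquired_at: datetime, now: datetime, ttl: timedelta = DEFAULT_LOCK_TTL) -> bool:
    """A lock is stale once acquired_at + ttl lies strictly in the past."""
    return lock_expires_at(acquired_at, ttl) < now


acquired = datetime(2026, 4, 15, 12, 4, 58, tzinfo=timezone.utc)
print(lock_expires_at(acquired).isoformat())  # 2026-04-15T13:04:58+00:00
```

A script that wraps backup unlock could use a check like this to refuse to proceed while the lock is still inside its TTL and the holder has not been confirmed dead.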
Examples
Nightly hot-spare backup with 14-day retention
A workspace on prod-server.example.com should produce a bundle every night, ship it to a separate backup host, and keep 14 days on disk locally as a fast-rollback safety net.
# /etc/systemd/system/crewship-backup.service
[Unit]
Description=Crewship nightly backup
After=crewshipd.service
[Service]
Type=oneshot
User=crewship
Environment=CREWSHIP_DATA_DIR=/var/lib/crewship
ExecStart=/usr/local/bin/crewship backup create \
--scope=workspace \
--passphrase-file=/etc/crewship/backup.pw
ExecStartPost=/usr/bin/rsync -a \
/var/lib/crewship/backups/ \
backup-host:/srv/crewship-backups/prod/
ExecStartPost=/usr/local/bin/crewship backup rotate \
--keep-last=14 --keep-days=14 --force
Paired with a crewship-backup.timer that fires at 02:37 daily (off-the-hour to avoid clustering with the rest of the fleet). The passphrase is read from --passphrase-file on every run, so the timer has nothing to redeliver; --use-keyring is left out because --passphrase-file takes precedence and the keyring would never be consulted. The rsync copies bundles without --remove-source-files, so the local copies survive as the fast-rollback set, and the rotate step runs after the rsync so local-disk pressure stays bounded even on long runs without a remote-side sweep.
Workspace migration to a new host
The acme workspace lives on crewship-old. You’re moving it to crewship-new to retire the old host. The destination already has its own workspaces, so a same-slug restore would conflict.
# On crewship-old:
crewship backup create --scope=workspace --passphrase-file backup.pw
# ✓ Backup created: /home/admin/.crewship/backups/crewship-workspace-acme-2026-05-14T03-12-00Z.tar.zst
# Ship it:
scp ~/.crewship/backups/crewship-workspace-acme-*.tar.zst crewship-new:/tmp/
# On crewship-new — pick a new slug because the host already has things:
crewship backup restore /tmp/crewship-workspace-acme-*.tar.zst \
--as-workspace=acme-migrated \
--passphrase-file=backup.pw
# ⚠ Docker phase skipped (--as-workspace supplied).
# Provision the new crews with `crewship crew provision`.
# Stand the crews up under the new slug:
crewship crew provision --workspace=acme-migrated --all
Once the new instance has parity (sanity-check via the UI), users get the new URL, and acme on the old host gets archived to cold storage before its workspace row is deleted.
DR drill with --dry-run
Every quarter the team proves the disaster-recovery bundle is actually restorable, without disturbing production state. Run the drill on a throwaway VM:
# 1. Copy the most recent prod bundle to the drill host:
scp prod:/var/lib/crewship/backups/crewship-workspace-acme-latest.tar.zst .
# 2. Inspect the manifest first (no passphrase needed):
crewship backup inspect crewship-workspace-acme-latest.tar.zst | jq '.format_version, .contents.workspace.slug'
# 1
# "acme"
# 3. Verify checksum end-to-end:
crewship backup verify crewship-workspace-acme-latest.tar.zst
# ✓ VALID — crewship-workspace-acme-latest.tar.zst (487 MiB)
# 4. Replay the restore inside a transaction, then roll back:
crewship backup restore crewship-workspace-acme-latest.tar.zst \
--dry-run \
--passphrase-file=backup.pw
# ✓ Restore validation complete (dry-run; no workspace/crew data changes applied)
A backup.restore.dry_run audit entry shows up in the workspace journal — auditors looking at “did anyone restore prod?” can tell drill runs apart from real restores by the action name.
API reference
The backup surface is CLI-first — every operation is reachable via crewship backup … and every flag the CLI accepts maps to an HTTP body field. The full HTTP schema lives at /api-reference/backup; a quick orientation:
| Method | Path | What it does |
|---|---|---|
| POST | /api/v1/backups | Create a bundle. Auth + OWNER/ADMIN. Same trust gate as the CLI; concurrent calls hit the advisory lock and the second returns 409. |
| GET | /api/v1/backups | List bundles on disk for the current workspace. |
| GET | /api/v1/backups/{filename}/manifest | Plaintext manifest only, equivalent to crewship backup inspect. No passphrase required. |
| POST | /api/v1/backups/{filename}/verify | SHA-256 integrity check of the sealed bytes against the manifest. Does not decrypt, so no passphrase is required. No DB writes. |
| POST | /api/v1/backups/{filename}/restore | Restore. Body accepts dry_run, as_workspace, as_crew. Holds the lock for the whole call. |
| DELETE | /api/v1/backups/{filename} | Delete a bundle. Idempotent: 404 is not an error in scripted use. |
| POST | /api/v1/backups/rotate | Retention sweep. Body accepts keep_last, keep_days, dry_run. |
| GET | /api/v1/backups/lock | Inspect the advisory lock state (used by crewship backup status). |
| DELETE | /api/v1/backups/lock | Force-release a stale lock. Requires force=true body field; never call from automation without a confirmation gate. |
All routes are mounted in internal/api/router_backup.go. The CLI talks to these directly when run against a remote host (--server flag) and falls back to in-process Go calls when run on the same machine as crewshipd — bypassing HTTP entirely for the host-shell use case. The two paths share the same handler functions, so flags work identically.
Webhook payloads for backup.created / backup.failed / backup.restored events are documented separately under Webhooks; metric emissions are documented under Metrics.
Streaming & memory bounds (large backups)
Restore and verify stream the sealed payload to a temp directory rather than buffering per-crew sections in a map[slug][]byte. Peak heap stays bounded by the zstd decoder window regardless of bundle size, so multi-GB restores run cleanly on small hosts. The extraction scratch directory is os.TempDir()/crewship-backup-… and is removed on Close(); a killed process leaves it behind for the next os.TempDir cleanup.
Passphrase keyring
The --use-keyring flag on create and restore caches the workspace
passphrase in ~/.crewship/backup-keyring.enc so scripts and repeat
rehearsals don’t re-prompt. The file is an AES-256-GCM-encrypted JSON
map keyed by workspace ID, using the same v1:<base64> envelope as the
credstore — without the host’s ENCRYPTION_KEY the contents are
unreadable even if the file leaks.
# First use on this workspace: prompts, then persists.
crewship backup create --scope=workspace --use-keyring
# Subsequent runs: silent, no prompt.
crewship backup create --scope=workspace --use-keyring
crewship backup restore bundle.tar.zst --use-keyring
Semantics worth knowing:
- No silent fallback on failure. Opening the keyring or reading an entry reports the real error and aborts; only ErrKeyringEntryNotFound (first use on this workspace) falls through to a prompt.
- Write failures are non-fatal on create. The bundle is already written when the keyring save runs; the CLI logs a warning and continues.
- Keyring is local to the operator’s host. --use-keyring always writes to ~/.crewship/ on the invoking machine. Even if a future remote bundle backend (S3 / GCS) is configured, the passphrase never travels with the bundle.
- Single-process mutex, not file-locked. Two concurrent CLI invocations against the same workspace are last-write-wins (the file is small and the failure mode is “one passphrase lost, never data corruption”). Filesystem-level locking is on the v0.2 roadmap.
- --passphrase-file takes precedence. When both flags are passed, the file wins and the keyring is not consulted.
Webhooks
Set CREWSHIP_BACKUP_WEBHOOK_URL (and CREWSHIP_BACKUP_WEBHOOK_SECRET)
on the server process to receive a signed POST for each backup lifecycle
event. Delivery is fire-and-forget from a goroutine — a slow or down
webhook never blocks the backup run.
export CREWSHIP_BACKUP_WEBHOOK_URL=https://hooks.example.com/crewship
export CREWSHIP_BACKUP_WEBHOOK_SECRET=$(openssl rand -hex 32)
Each event is JSON with the shape:
{
"event": "backup.created",
"timestamp": "2026-04-15T12:05:01Z",
"workspace_id": "ws_abc123",
"scope": "workspace",
"path": "/home/admin/.crewship/backups/...tar.zst",
"bytes": 13421772,
"payload_sha256": "…",
"error": ""
}
Events: backup.created, backup.failed, backup.restored.
Each request carries an X-Crewship-Signature: sha256=<hex> header —
HMAC-SHA256 over the raw body using CREWSHIP_BACKUP_WEBHOOK_SECRET.
Receivers must verify the signature (same scheme as Crewship’s
inbound webhooks; validate via webhook.ValidateHMAC after stripping
the sha256= prefix). The secret is required whenever URL is set —
sending a body unsigned would let any network listener forge events to
a downstream consumer that trusts the feed. URLs with userinfo or query
strings are redacted before ever appearing in logs / audit rows, so
basic-auth credentials or signed-URL tokens do not leak.
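A receiver-side check for the X-Crewship-Signature header might look like the following Python sketch. It is not Crewship’s webhook.ValidateHMAC; verify_signature is a hypothetical name, and the sketch assumes the HMAC key is the UTF-8 bytes of CREWSHIP_BACKUP_WEBHOOK_SECRET exactly as exported (the text above does not say whether the hex string is decoded first).

```python
import hashlib
import hmac


def verify_signature(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Check an X-Crewship-Signature header of the form 'sha256=<hex>'.

    Assumption: `secret` is the env var value as raw bytes, not hex-decoded.
    """
    prefix = "sha256="
    if not header_value.startswith(prefix):
        return False
    provided = header_value[len(prefix):].strip().lower()
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected.encode("ascii"), provided.encode("utf-8"))
```

The HMAC must be computed over the raw request body as received; re-serialising the parsed JSON first can change whitespace or key order and break the comparison.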
Metrics
GET /api/v1/admin/backups/metrics (instance OWNER only; see below)
returns a point-in-time snapshot of process-lifetime counters. The
numbers reset on restart — persistent observability belongs in the
audit log and its dashboards.
{
"created_total": 5,
"created_by_scope": { "workspace": 4, "crew": 1 },
"failed_total": 0,
"failed_by_reason": {},
"restored_total": 2,
"size_bytes_total": 67108864,
"duration_seconds_p50": 4.1,
"duration_seconds_p95": 12.7,
"duration_seconds_mean": 6.3,
"lock_held_seconds_by_workspace": { "ws_abc123": 0 }
}
Duration quantiles are approximated from an in-memory ring buffer —
fine for the dozens-to-hundreds of samples a single host accumulates
between restarts; not a general-purpose histogram. For long-horizon
reporting, ingest the backup.* rows from audit_log.
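The ring-buffer approach to quantiles can be illustrated with a bounded deque. This Python sketch is an assumption-laden stand-in: the real buffer capacity and the quantile method Crewship uses are not stated above, so the capacity of 256 and the nearest-rank rule here are arbitrary choices, and DurationWindow is a hypothetical name.

```python
import math
from collections import deque


class DurationWindow:
    """Fixed-size window of recent durations with nearest-rank quantiles."""

    def __init__(self, capacity: int = 256):  # capacity is an arbitrary choice
        self._samples = deque(maxlen=capacity)  # oldest sample is evicted first

    def observe(self, seconds: float) -> None:
        self._samples.append(seconds)

    def quantile(self, q: float) -> float:
        if not self._samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(q * len(ordered)))  # nearest-rank method
        return ordered[rank - 1]


window = DurationWindow(capacity=4)
for seconds in (1.0, 2.0, 3.0, 4.0, 5.0):
    window.observe(seconds)
print(window.quantile(0.5), window.quantile(0.95))  # 3.0 5.0
```

Sorting on every read is acceptable at the dozens-to-hundreds of samples mentioned above, and the bounded window is why such numbers drift from a true long-horizon percentile.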
Instance-scope backup
An instance-scope backup bundles every workspace on a Crewship host
plus the cross-workspace surfaces that make the install usable — the
credstore, the auth signing secret, and the instance identity
(instance_config.hostname). It is the disaster-recovery path for an
entire host, not a normal operational tool.
Key differences from workspace/crew scope:
- Access control. Gated by the CREWSHIP_OWNER_EMAIL env var (server-level OWNER), not workspace role. A workspace OWNER / ADMIN on their own is refused with HTTP 403.
- Rate limit. One instance backup per user per sliding hour. A runaway cron cannot DoS the host.
- Encryption is recipient-only. --passphrase-file is refused for this scope: the surface is too broad (every workspace’s secrets in one blob) to trust a brute-forceable passphrase. Callers must supply an AGE age1… X25519 public key and hold the matching private key offline.
- Cross-host restores force session-key rotation. The bundle records the source hostname; a restore onto a different target invalidates every existing JWE session to prevent source-host tokens from remaining valid after DR.
Full threat model, crypto chain, and operational checklist:
Security → Instance-Scope Backup Security.
Admin UI
A Backups tab lives in /admin for OWNER / ADMIN users. It wraps the
same REST endpoints the CLI drives and adds:
- A status banner for the advisory lock (who holds it, TTL remaining).
- Create / restore dialogs with passphrase input (no keyring: the keyring is a CLI-side convenience; the browser never sees ~/.crewship/).
- An inspect panel that renders the plaintext manifest without decrypting the payload (same as crewship backup inspect).
- A bundle list with size, scope, format version, and created-at, fetched via hooks/use-backups.ts.
The UI does not expose instance-scope operations — they remain
CLI + env-gated to reduce blast radius from a compromised admin session.
Known caveats (v0.2 roadmap)
The following items are intentionally deferred to v0.2. They do
not block production use but are worth knowing:
- preBackup / postBackup hooks: no user-defined hooks yet. If your workspace has services that need an app-level flush, run them manually before invoking crewship backup create.
- Remote backends (S3 / B2 / GCS): bundles live on the server’s local disk only. The storage layer is now abstracted behind a StorageOps interface so a future backend swap won’t require a second refactor of every call-site. Today: use scp / rclone / restic to ship bundles off-box, or stream a single bundle via GET /api/v1/admin/backups/download.
- Scheduled backups: no built-in scheduler. Wrap crewship backup create in cron or systemd.timer.
- Forward migration replay hooks. The plumbing is wired (migrations can register a per-version restoreBackfill function, and the restorer walks the applied ∖ manifest set in ascending order after the main transaction commits), but no migration registers a hook yet. A failed backfill surfaces as ErrRestoreBackfillFailed; the restored rows are visible but may be missing backfilled columns until an admin investigates.
- Cross-process keyring lock. The per-process mutex does not cover two concurrent CLI invocations racing on the same keyring file.
Common pitfalls
- Losing the passphrase = losing the data. AGE bundles are not recoverable without the passphrase (or X25519 private key). There is no master key, no support escape hatch. Store the passphrase in a separate trust zone from the bundle: a password manager on a different host, a sealed envelope, anything that doesn’t share a failure mode with the disk holding the .tar.zst.
- backup unlock --force on a live backup corrupts the bundle. The advisory lock exists precisely because two concurrent backups race on the docker pause / docker unpause sequence. Only ever clear a lock when you can prove its holder is dead (CLI session crashed, host rebooted) and the 1-hour TTL hasn’t fired yet. When in doubt, wait for the TTL.
- CREWSHIP_DATA_DIR mismatch silently targets the wrong database. Like the Admin CLI trap: if the server runs with a custom data dir but you invoke crewship backup … without the same env var, the CLI defaults to ~/.crewship and operates on an empty/separate database. Backups succeed but capture nothing useful; restores produce “workspace doesn’t exist” errors. Export the same CREWSHIP_DATA_DIR the server uses, or pass --data-dir.
- Stop the server before a restore against the same database. Restore takes the advisory lock but a running crewshipd may still hold open handles on tables the restorer wants to truncate. Symptoms range from “database is locked” SQLite errors to a half-applied restore that leaves the workspace in a non-bootable state. The lock guards against concurrent backups, not against the server’s own writes, so stop the service first.
- Format version drift breaks restores past N-2. A bundle written by format v5 will restore on v5, v6, and v7 servers; from v8 onward, the compatibility window has rolled past it. If you’re restoring from cold storage that’s been sitting for a year or more, check the manifest’s format_version against the destination first with crewship backup inspect.
- --recipient / --passphrase-file / --no-encrypt are mutually exclusive. Passing two raises a CLI error rather than silently picking one. A bundle encrypted with --recipient age1… will not decrypt with a passphrase, and vice versa; match the restore flag to the original create flag.
- Same-slug restore refuses by default. crewship backup restore will not overwrite an existing workspace or crew under the same slug. If the existing one is stale and you want to replace it, delete it first (crewship workspace delete / crewship crew delete); if you want both side-by-side, use --as-workspace=<new-slug> and then crewship crew provision.
- Disk space during restore is not bounded by the heap streaming guarantee. Peak RAM is fixed by the zstd window, but the extraction scratch directory at $TMPDIR/crewship-backup-* needs ~2× the bundle size on disk. Restoring a 4 GB bundle onto a host with 6 GB free in /tmp will fail mid-stream. Mount /tmp on the data volume or set TMPDIR to a larger filesystem.
- Scripted retries without --use-keyring block on the passphrase prompt. A non-TTY second invocation will wait indefinitely for stdin. Either pass --passphrase-file, pipe the passphrase via stdin redirect, or use --use-keyring so the first call seeds the cache.
- Bundles include workspace bind-mount files. User code in /workspace, agent memory markdown, anything mounted into the agent container: all of it is included in the bundle. AGE encryption is the only defence; treat a .tar.zst like any other backup of source code and credentials.
- /secrets is the only directory NEVER included. This is load-bearing: agents pull credentials from /secrets at runtime via Keeper, and including them in a portable bundle would defeat the SECRET-tier guarantee. Don’t add code that bypasses the secret-skipping filter without an equally rigorous out-of-band channel.
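The disk-space pitfall above lends itself to a preflight check before a large restore. The Python sketch below uses the ~2× rule of thumb from that bullet; has_scratch_space and preflight are hypothetical helpers, and the check relies on tempfile.gettempdir() honouring TMPDIR, which is assumed to match where the restorer places its scratch directory.

```python
import os
import shutil
import tempfile


def has_scratch_space(bundle_size: int, free_bytes: int, factor: float = 2.0) -> bool:
    """True when free space covers roughly `factor` times the bundle size."""
    return free_bytes >= bundle_size * factor


def preflight(bundle_path: str) -> bool:
    """Check the filesystem that the temp directory (and therefore TMPDIR) lives on."""
    free = shutil.disk_usage(tempfile.gettempdir()).free
    return has_scratch_space(os.path.getsize(bundle_path), free)


GIB = 1024 ** 3
print(has_scratch_space(4 * GIB, 6 * GIB))  # False: about 8 GiB needed, 6 GiB free
```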
Security notes
- The /secrets mount is never included in a bundle.
- The workspace bind mount (user code) and memory markdown files are included. AGE encryption of the payload is the only defence against leakage from physical bundle distribution, so treat bundles like any other backup of your source tree.
- Every backup.* event writes to audit_log with user, role, scope, sealed-payload SHA-256, and size. Dry-run restores write backup.restore.dry_run so auditors can tell rehearsals from real cutover.
Automatic pre-migration snapshots
Distinct from the manual crewship backup bundle system above:
Crewship auto-snapshots the SQLite DB before any pending migration
runs. This is the “binary upgrade went sideways, give me my data
back” safety net.
How it works
Every crewship start calls database.SnapshotBeforeMigrate before
database.Migrate. If any migrations are pending (entries in the
migrations[] slice with version > MAX(version) FROM _migrations),
the function:
- Resolves the SQLite file path from the connection DSN.
- Computes a target name: <dbpath>.pre-migrate-v<from>-to-v<to>-<UTC-RFC3339>.bak.
- Issues VACUUM INTO '<target>', SQLite’s hot-copy mechanism. Safer than a plain file copy because it serializes against concurrent writers and produces a defragmented, WAL-checkpointed snapshot.
- Chmods the snapshot to 0600.
- Prunes older snapshots, keeping the 10 most recent per database file.
On error the boot aborts before any migration runs — silently
continuing without a rollback point would defeat the entire purpose.
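The snapshot steps listed above can be reproduced with Python’s sqlite3 module for experimentation. This is a sketch of the technique, not database.SnapshotBeforeMigrate: the function name and signature are hypothetical, DSN parsing is skipped (the caller passes the file path directly), and the timestamp uses the compact form seen in the recovery example on this page rather than strict RFC 3339. VACUUM INTO needs SQLite 3.27 or newer.

```python
import glob
import os
import sqlite3
from datetime import datetime, timezone


def snapshot_before_migrate(db_path: str, from_v: int, to_v: int, keep: int = 10) -> str:
    """Write a VACUUM INTO copy next to db_path, chmod it 0600, prune old copies."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = f"{db_path}.pre-migrate-v{from_v}-to-v{to_v}-{stamp}.bak"
    # isolation_level=None keeps the connection in autocommit mode;
    # VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("VACUUM INTO ?", (target,))  # fails if target already exists
    finally:
        conn.close()
    os.chmod(target, 0o600)
    # Keep only the `keep` most recently modified snapshots for this database.
    snapshots = sorted(
        glob.glob(glob.escape(db_path) + ".pre-migrate-*.bak"),
        key=os.path.getmtime,
        reverse=True,
    )
    for stale in snapshots[keep:]:
        os.remove(stale)
    return target
```

Any exception raised here should abort the caller before migrations start, mirroring the abort-on-error behaviour described above.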
Opting out
CREWSHIP_SKIP_MIGRATION_BACKUP=1 crewship start
Skips the snapshot. Useful for CI environments where the DB is
ephemeral and snapshot I/O is just overhead.
Recovering from a bad migration
If crewship start succeeds, applies a migration, then exhibits
runtime errors that point at schema drift:
# Find the most recent pre-migration snapshot
ls -t ~/.crewship/crewship.db.pre-migrate-*.bak | head -1
# Stop crewship
crewship stop # or kill -TERM if needed
# Replace the current DB with the snapshot
mv ~/.crewship/crewship.db ~/.crewship/crewship.db.bad
cp ~/.crewship/crewship.db.pre-migrate-v66-to-v87-20260514T143012Z.bak \
~/.crewship/crewship.db
# Run with the PREVIOUS binary (the one that wrote the snapshot)
/path/to/older/crewship start
The snapshot is a complete SQLite file — no special tooling needed.
Tools like sqlite3 open it directly for inspection.
Limits of pre-migration snapshots vs manual backups
| Capability | Pre-migration snapshot | crewship backup create |
|---|---|---|
| Encrypted at rest | No (chmod 0600 only) | Yes (AGE) |
| Portable across hosts | Same machine only | Yes |
| Includes workspace files | No (DB only) | Yes (DB + bind-mount + memory) |
| Captures container state | No | Yes (snapshots running containers) |
| Cross-version restore | Snapshot binary only | Forward-compatible via restoreBackfill |
| Triggered automatically | Yes, on pending migration | No (manual or scheduled) |
| Retention | 10 newest per DB file | Operator-controlled via rotate |
Use both. Pre-migration snapshots cover binary upgrades; crewship backup covers DR, host migration, legal hold, and forensic
preservation. They are not interchangeable.
- Backup API reference: REST endpoint shapes for every crewship backup subcommand.
- Admin CLI: the host-shell write surface that backups protect against; always crewship backup create before running admin writes in prod.
- Migrations Catalog: what migration ran when, and which restoreBackfill hook (if any) will fire on a cross-version restore.
- Troubleshooting: recovering from a stuck migration or corrupt DB.
- Security → Audit log: where backup.created, backup.restored, and backup.restore.dry_run events land.