Backup & Restore

Overview

Crewship’s backup system produces portable .tar.zst bundles that capture a workspace (or a single crew) in one file. Bundles are AGE-encrypted by default, carry a versioned manifest, and can be restored on any Crewship instance that speaks the same format version (N-2 compatibility guarantee). The whole subsystem is admin-only by design: every backup subcommand requires the OWNER or ADMIN role on the workspace, and the runner refuses MEMBER / VIEWER calls at both the CLI parsing layer and the server-side handler — defence in depth, not just a UI veneer. The architectural choice that shapes everything else is “one bundle, one file”. Backups don’t depend on an external object store, don’t require a sidecar, and don’t shard across files. A .tar.zst is a single artefact an operator can scp to a backup host, hand to a customer for legal hold, or check into a private bucket — without the rest of Crewship being available. Inside the tarball, the manifest is plaintext JSON (so crewship backup inspect can read it without the AGE recipient key) and the payload is the encrypted SQLite snapshot plus any referenced workspace files. The forward-compatible manifest schema means a bundle produced on N can restore on N-1 and N-2; older bundles run their migration’s restoreBackfill hook so columns added since the snapshot are populated sanely instead of left at SQL defaults. Restores are intentionally cautious. --dry-run walks the entire restore plan — schema diff, row counts, blob deltas — and prints what would happen without writing a single byte to the destination database. Advisory locking (backup_locks table, per-workspace) prevents two concurrent restore runs from corrupting each other, and the lock file records the host + PID so crewship backup status can tell you who’s holding it. If a host crashes mid-restore, crewship backup unlock is the manual recovery — admin-only with a confirmation prompt, because clearing a real lock from another live process is how you trash a workspace.

When to use it

Backups are cheap to make and expensive to wish you’d made. The five canonical reasons to run crewship backup create:

Before any destructive admin write. Before crewship admin reset-password, before a database migration on a binary upgrade, before a large schema-changing PR lands in prod — capture the current state first. The bundle is the one-command rollback if anything goes wrong.
Disaster recovery / hot spare. Schedule a nightly crewship backup create --scope=workspace --passphrase-file … cron and ship the bundle to a separate host. If the primary disk dies, restoring onto a fresh binary is one crewship backup restore away — no replication agent, no streaming WAL, no extra moving parts.
Workspace migration to another host. Moving a workspace from a dev VM to a prod host (or between two prod hosts) is exactly what bundles are for. Create on the source, scp the file, crewship backup restore --as-workspace <new-slug> on the destination, then provision crews. The --as-workspace rename avoids the “two acme workspaces colliding” hazard.
Legal hold or compliance archive. Customer leaves; you need to keep their workspace state on cold storage for N years. One AGE-encrypted .tar.zst is a forever-readable artefact — no live database, no service required. inspect later proves the bundle’s contents without decrypting.
Forensic snapshot before incident response. Suspected compromise of an admin account, or a “what was the state when X happened” investigation. backup create + immediate offsite copy preserves the audit trail before anyone (you, the attacker, the well-meaning oncall) starts changing things.

Skip backups for ephemeral development workspaces (the dev VM rebuilds them from seed scripts anyway) and single-agent throwaways with no user-visible state worth preserving — the bundle metadata cost is real (~few MB minimum) and not every workspace earns it.

Key concepts

Term	What it means here
Bundle	A single `.tar.zst` artefact containing one `MANIFEST` (plaintext JSON), one `payload.age` (AGE-sealed tar of DB rows + workspace files), and one `payload.sha256`. The whole backup is this one file — portable, hashable, copy-able with `scp`.
Scope	`workspace` (the workspace row + every crew under it) or `crew` (a single crew + its agents). Crew bundles restore independently of their parent workspace, so a “move this crew to another workspace” is a viable migration path.
Format version	Integer in `MANIFEST.format_version`. Compatibility guarantee is N-2 — a bundle produced on format v3 restores on v3, v4, and v5 servers. Bumped only when a migration changes the bundle layout itself (not the schema rows inside).
`restoreBackfill` hook	Per-migration Go function in `internal/backup/restore_backfill.go`. Runs when a bundle predates a migration and the restoring server has columns the source didn’t. Pure `ADD COLUMN` migrations rely on the SQL DEFAULT; complex backfills (JSON shape changes, foreign keys) provide a hook so restored rows land sanely.
AGE encryption	The bundle’s payload is sealed with `filippo.io/age`. Default mode is passphrase (`scrypt`-derived key); `--recipient age1…` switches to X25519 public-key encryption for hand-offs. `--no-encrypt` produces a plaintext payload for test/CI use only.
Passphrase keyring	Opt-in cache at `~/.crewship/backup-keyring.enc` (AGE-encrypted with a single OS-keyring-stored key) so operators don’t retype the passphrase on every `rotate` / `verify`. Enabled via `--use-keyring` on `create` / `restore`.
Advisory lock	A row in the `backup_locks` table, primary-keyed on workspace ID. Taken before any DB dump or docker pause; released by `defer Release()` on the happy path. 1-hour TTL (`DefaultLockTTL`) so a crashed backup self-heals after the window.
`refuseIfBackupInProgress` guard	A shared middleware wired into the assignments, peer-query, and webhook handlers. Reads the lock state and refuses new agent runs while a backup is in progress — closes the TOCTOU window between `ensureAgentsIdle` (initial check) and `docker pause` (the actual freeze).
Dry-run restore	`--dry-run` on `crewship backup restore`. Decrypts, validates the manifest, replays the DB transaction, then rolls back. The only side effect is one `backup.restore.dry_run` audit row — distinct from `backup.restore` so auditors can tell “verified” apart from “actually restored”.
`--as-workspace` / `--as-crew` rename	Restore the bundle under a new slug instead of the original. Refuses to run the docker phase (container names derive from the slug) and tells the operator to `crewship crew provision` afterwards. Avoids “two `acme` workspaces colliding” during DR drills.
Retention sweep	`crewship backup rotate --keep-last N --keep-days D`. Per-workspace — never touches another workspace’s bundles. Both flags can be combined; both must be positive. `--dry-run` lists what would be deleted without touching disk.
Stale lock	A lock whose holder process is dead but whose row still exists. Detected when `acquired_at + DefaultLockTTL < now()`. Auto-released on next `create`; manually clearable via `crewship backup unlock --force` (admin-only, with confirmation).
Pre-migration snapshot	An automatic safety net distinct from the manual bundle system. Whenever `crewship start` detects pending migrations, `database.SnapshotBeforeMigrate` writes a `VACUUM INTO` copy of the live SQLite database to `<dbpath>.pre-migrate-vN-to-vM-<UTC>.bak` before any DDL runs. Last 10 snapshots are retained per database; opt out with `CREWSHIP_SKIP_MIGRATION_BACKUP=1`. Not a substitute for proper backups — it’s the binary upgrade safety net, not a disaster-recovery primitive.

Usage

The whole backup surface is the crewship backup command group — nine subcommands that cover the create → verify → restore → rotate lifecycle. Each subcommand has a dedicated section below with full flag reference and copy-pasteable examples; this table is the entry point.

Command	Purpose	Deep dive
`crewship backup create`	Produce a new bundle for a workspace or crew.	Create a workspace backup / Back up a single crew
`crewship backup list`	List bundles on disk.	List, inspect, verify
`crewship backup inspect`	Print a bundle’s manifest without decrypting the payload.	List, inspect, verify
`crewship backup verify`	Decrypt + checksum a bundle end-to-end (no DB writes).	List, inspect, verify
`crewship backup restore`	Restore a bundle (supports `--dry-run`).	Restore
`crewship backup delete`	Remove a single bundle (interactive confirm or `--force`).	Delete & rotate
`crewship backup rotate`	Retention sweep — `--keep-last N` / `--keep-days D`.	Delete & rotate
`crewship backup status`	Show the advisory lock state for the current workspace.	Lock semantics
`crewship backup unlock`	Release a stale lock owned by this host (admin-only, confirm required).	Lock semantics

The minimum end-to-end loop is four commands: create to produce the bundle, verify to prove it’s not corrupt, restore --dry-run to prove the destination will accept it, then restore for real. Every other subcommand exists for retention (rotate), introspection (list, inspect, status), or recovery (unlock).

create and restore accept --use-keyring to cache and reuse the workspace passphrase via ~/.crewship/backup-keyring.enc. See Passphrase keyring below.

Bundle layout

crewship-<scope>-<slug>-<iso-ts>.tar.zst
├── MANIFEST           (plaintext JSON, format_version, scope, checksums)
├── payload.age        (AGE-sealed tar.zst of DB rows + workspace files)
└── payload.sha256     (SHA-256 of the sealed payload bytes)

Scope: workspace (workspace row + all its crews) or crew (single crew + its agents).
Encryption: AGE passphrase (default) or AGE X25519 recipient. --no-encrypt produces a plaintext payload for test / CI use.
Default location: ~/.crewship/backups/ on the server, mode 0700.
Naming: crewship-<scope>-<slug>-<iso-ts>.tar.zst. Collisions append -<hash8>.

Create a workspace backup

crewship backup create --scope=workspace
# Passphrase: ********
# Confirm passphrase: ********
# ✓ Backup created: /home/admin/.crewship/backups/crewship-workspace-acme-2026-04-15T12-05-01Z.tar.zst

The CLI prompts twice for a passphrase (to guard against typos) and confirms success with a row summarising scope, size, format version, and the SHA-256 of the sealed payload.

Non-interactive / CI

Supply the passphrase from a file:

crewship backup create --scope=workspace --passphrase-file /run/secrets/backup.pw

Or pipe a single line on stdin (falls back automatically when stdin is not a TTY and --passphrase-file is not set).

Asymmetric encryption

If the restoring party holds an AGE X25519 keypair, pass their public key instead of a shared secret:

crewship backup create \
  --scope=workspace \
  --recipient age1ql3z7hjy54pw3hyww5ayyfg7zqgvc7w3j2elw8zmrj2kg5sfn9aqmcac8p

--recipient, --passphrase-file, and --no-encrypt are mutually exclusive.

Back up a single crew

crewship backup create --scope=crew --crew dev-team

--crew accepts either a slug or a crew ID. Crew-scope bundles restore independently of their parent workspace.

List, inspect, verify

crewship backup list
# FILE                                              SCOPE      SIZE     ENCRYPTED  FORMAT  CREATED_AT
# crewship-workspace-acme-2026-04-15T12-05-01Z…   workspace  12.8 MiB yes        v1      2026-04-15T12:05:01Z

crewship backup inspect ~/.crewship/backups/crewship-workspace-acme-…tar.zst
# { "format_version": 1, "scope": "workspace", "contents": { "workspace": { "slug": "acme", … } } }

crewship backup verify ~/.crewship/backups/crewship-workspace-acme-…tar.zst
# ✓ VALID — /home/admin/.crewship/backups/… (12.8 MiB)

inspect only reads the plaintext MANIFEST — it never touches the sealed payload, so no passphrase is needed. verify recomputes the SHA-256 of the sealed bytes against the manifest and fails if the bundle was truncated or tampered with. Neither decrypts.

Restore

crewship backup restore ~/.crewship/backups/crewship-workspace-acme-…tar.zst
# Passphrase: ********
# ✓ Restore complete — workspace=acme crews=4 rows=312 id=ws_abc123

The server rejects the restore if a workspace (or crew) with the same slug already exists. Override with --as-workspace <new-slug> or --as-crew <new-slug> to land the payload under a fresh identity:

crewship backup restore bundle.tar.zst --as-workspace acme-dr
# ⚠ Docker phase skipped (--as-workspace/--as-crew supplied).
#   Provision the new crews with `crewship crew provision`.

When the bundle is landed under a new slug the docker phase is intentionally skipped — container names are derived from the slug, and renaming a live crew requires an explicit provision step.

Dry run

crewship backup restore bundle.tar.zst --dry-run
# ✓ Restore validation complete (dry-run; no workspace/crew data changes applied)

A dry-run decrypts the bundle, validates the manifest, replays the DB transaction, and then rolls back. The only side effect is a single backup.restore.dry_run row in the audit log — handy for proving a bundle is restorable before the real cutover. The distinct audit action lets auditors tell “verified” apart from “actually restored”.

Delete & rotate

crewship backup delete ~/.crewship/backups/old.tar.zst
# Delete backup …? [y/N] y
# ✓ Backup deleted

crewship backup rotate --keep-last 10 --keep-days 30 --dry-run
# Would delete 3 bundle(s):
#   /home/admin/.crewship/backups/crewship-workspace-acme-2025-12-01…
#   …

rotate applies retention per workspace — it never touches another workspace’s bundles. Either --keep-last N (bundles above the N newest are dropped) or --keep-days D (bundles older than D days are dropped) must be positive; both can be combined. delete requires interactive confirmation, or --force in scripts / CI. The same rule applies to backup unlock.

Lock semantics

Each workspace holds at most one advisory backup lock at a time (table backup_locks, per-workspace PK). The lock:

Is taken before any DB dump or docker pause and released by a deferred Release() on the happy path.
Has a 1-hour TTL (DefaultLockTTL); a crashed backup self-heals after the window.
Blocks concurrent backup create calls — the second caller gets HTTP 409 Conflict with a “another backup is already in progress” message.
Blocks new agent runs via the shared refuseIfBackupInProgress guard wired into the assignments, peer-query, and webhook handlers. This closes the TOCTOU window between ensureAgentsIdle and docker pause.

Inspect or release the lock:

crewship backup status
# WORKSPACE  ACQUIRED_BY     ACQUIRED_AT           EXPIRES_AT
# ws_abc123  admin@acme.io   2026-04-15T12:04:58Z  2026-04-15T13:04:58Z

crewship backup unlock --force
# ✓ Backup lock released.

backup unlock is an emergency escape hatch. Only use it when you can confirm no backup is actually running (e.g. the previous CLI session crashed and the 1 h TTL has not yet fired). Forcibly releasing a live backup’s lock will let a second backup start alongside it, and the two will race on the docker pause/unpause sequence.

Examples

Nightly hot-spare backup with 14-day retention

A workspace on prod-server.example.com should produce a bundle every night, ship it to a separate backup host, and keep 14 days on disk locally as a fast-rollback safety net.

# /etc/systemd/system/crewship-backup.service
[Unit]
Description=Crewship nightly backup
After=crewshipd.service

[Service]
Type=oneshot
User=crewship
Environment=CREWSHIP_DATA_DIR=/var/lib/crewship
ExecStart=/usr/local/bin/crewship backup create \
  --scope=workspace \
  --passphrase-file=/etc/crewship/backup.pw \
  --use-keyring
ExecStartPost=/usr/bin/rsync -a --remove-source-files \
  /var/lib/crewship/backups/ \
  backup-host:/srv/crewship-backups/prod/
ExecStartPost=/usr/local/bin/crewship backup rotate \
  --keep-last=14 --keep-days=14 --force

Paired with a crewship-backup.timer that fires at 02:37 daily (off-the-hour to avoid clustering with the rest of the fleet). The --use-keyring flag means the passphrase is read once and cached — the timer doesn’t have to redeliver it. The rotate step runs after the rsync so local-disk pressure is bounded even on long runs without a remote-side sweep.

Workspace migration to a new host

The acme workspace lives on crewship-old. You’re moving it to crewship-new to retire the old host. The destination already has its own workspaces, so a same-slug restore would conflict.

# On crewship-old:
crewship backup create --scope=workspace --passphrase-file backup.pw
# ✓ Backup created: /home/admin/.crewship/backups/crewship-workspace-acme-2026-05-14T03-12-00Z.tar.zst

# Ship it:
scp ~/.crewship/backups/crewship-workspace-acme-*.tar.zst crewship-new:/tmp/

# On crewship-new — pick a new slug because the host already has things:
crewship backup restore /tmp/crewship-workspace-acme-*.tar.zst \
  --as-workspace=acme-migrated \
  --passphrase-file=backup.pw
# ⚠ Docker phase skipped (--as-workspace supplied).
#   Provision the new crews with `crewship crew provision`.

# Stand the crews up under the new slug:
crewship crew provision --workspace=acme-migrated --all

Once the new instance has parity (sanity-check via the UI), users get the new URL, and acme on the old host gets archived to cold storage before its workspace row is deleted.

DR drill with `--dry-run`

Every quarter the team proves the disaster-recovery bundle is actually restorable, without disturbing production state. Run the drill on a throwaway VM:

# 1. Copy the most recent prod bundle to the drill host:
scp prod:/var/lib/crewship/backups/crewship-workspace-acme-latest.tar.zst .

# 2. Inspect the manifest first (no passphrase needed):
crewship backup inspect crewship-workspace-acme-latest.tar.zst | jq '.format_version, .contents.workspace.slug'
# 1
# "acme"

# 3. Verify checksum end-to-end:
crewship backup verify crewship-workspace-acme-latest.tar.zst
# ✓ VALID — crewship-workspace-acme-latest.tar.zst (487 MiB)

# 4. Replay the restore inside a transaction, then roll back:
crewship backup restore crewship-workspace-acme-latest.tar.zst \
  --dry-run \
  --passphrase-file=backup.pw
# ✓ Restore validation complete (dry-run; no workspace/crew data changes applied)

A backup.restore.dry_run audit entry shows up in the workspace journal — auditors looking at “did anyone restore prod?” can tell drill runs apart from real restores by the action name.

API reference

The backup surface is CLI-first — every operation is reachable via crewship backup … and every flag the CLI accepts maps to an HTTP body field. The full HTTP schema lives at /api-reference/backup; a quick orientation:

Method	Path	What it does
`POST`	`/api/v1/backups`	Create a bundle. Auth + OWNER/ADMIN. Same trust gate as the CLI; concurrent calls hit the advisory lock and the second returns 409.
`GET`	`/api/v1/backups`	List bundles on disk for the current workspace.
`GET`	`/api/v1/backups/{filename}/manifest`	Plaintext manifest only — equivalent to `crewship backup inspect`. No passphrase required.
`POST`	`/api/v1/backups/{filename}/verify`	SHA-256 integrity check of the sealed bytes against the manifest. Does not decrypt — no passphrase required. No DB writes.
`POST`	`/api/v1/backups/{filename}/restore`	Restore. Body accepts `dry_run`, `as_workspace`, `as_crew`. Holds the lock for the whole call.
`DELETE`	`/api/v1/backups/{filename}`	Delete a bundle. Idempotent — 404 is not an error in scripted use.
`POST`	`/api/v1/backups/rotate`	Retention sweep. Body accepts `keep_last`, `keep_days`, `dry_run`.
`GET`	`/api/v1/backups/lock`	Inspect the advisory lock state (used by `crewship backup status`).
`DELETE`	`/api/v1/backups/lock`	Force-release a stale lock. Requires `force=true` body field — never call from automation without a confirmation gate.

All routes are mounted in internal/api/router_backup.go. The CLI talks to these directly when run against a remote host (--server flag) and falls back to in-process Go calls when run on the same machine as crewshipd — bypassing HTTP entirely for the host-shell use case. The two paths share the same handler functions, so flags work identically. Webhook payloads for backup.created / backup.restored / backup.restore.dry_run events are documented separately under Webhooks; metric emissions are documented under Metrics.

Streaming & memory bounds (large backups)

Restore and verify stream the sealed payload to a temp directory rather than buffering per-crew sections in a map[slug][]byte. Peak heap stays bounded by the zstd decoder window regardless of bundle size, so multi-GB restores run cleanly on small hosts. The extraction scratch directory is os.TempDir()/crewship-backup-… and is removed on Close(); a killed process leaves it behind for the next os.TempDir cleanup.

Passphrase keyring

The --use-keyring flag on create and restore caches the workspace passphrase in ~/.crewship/backup-keyring.enc so scripts and repeat rehearsals don’t re-prompt. The file is an AES-256-GCM-encrypted JSON map keyed by workspace ID, using the same v1:<base64> envelope as the credstore — without the host’s ENCRYPTION_KEY the contents are unreadable even if the file leaks.

# First use on this workspace: prompts, then persists.
crewship backup create --scope=workspace --use-keyring

# Subsequent runs: silent, no prompt.
crewship backup create --scope=workspace --use-keyring
crewship backup restore bundle.tar.zst --use-keyring

Semantics worth knowing:

No silent fallback on failure. Opening the keyring or reading an entry reports the real error and aborts; only ErrKeyringEntryNotFound (first use on this workspace) falls through to a prompt.
Write failures are non-fatal on create. The bundle is already written when the keyring save runs; the CLI logs a warning and continues.
Keyring is local to the operator’s host. --use-keyring always writes to ~/.crewship/ on the invoking machine — even if a future remote bundle backend (S3 / GCS) is configured, the passphrase never travels with the bundle.
Single-process mutex, not file-locked. Two concurrent CLI invocations against the same workspace are last-write-wins (the file is small and the failure mode is “one passphrase lost, never data corruption”). Filesystem-level locking is on the v0.2 roadmap.
--passphrase-file takes precedence. When both flags are passed, the file wins and the keyring is not consulted.

Webhooks

Set CREWSHIP_BACKUP_WEBHOOK_URL (and CREWSHIP_BACKUP_WEBHOOK_SECRET) on the server process to receive a signed POST for each backup lifecycle event. Delivery is fire-and-forget from a goroutine — a slow or down webhook never blocks the backup run.

export CREWSHIP_BACKUP_WEBHOOK_URL=https://hooks.example.com/crewship
export CREWSHIP_BACKUP_WEBHOOK_SECRET=$(openssl rand -hex 32)

Each event is JSON with the shape:

{
  "event":          "backup.created",
  "timestamp":      "2026-04-15T12:05:01Z",
  "workspace_id":   "ws_abc123",
  "scope":          "workspace",
  "path":           "/home/admin/.crewship/backups/...tar.zst",
  "bytes":          13421772,
  "payload_sha256": "…",
  "error":          ""
}

Events: backup.created, backup.failed, backup.restored. Each request carries an X-Crewship-Signature: sha256=<hex> header — HMAC-SHA256 over the raw body using CREWSHIP_BACKUP_WEBHOOK_SECRET. Receivers must verify the signature (same scheme as Crewship’s inbound webhooks; validate via webhook.ValidateHMAC after stripping the sha256= prefix). The secret is required whenever URL is set — sending a body unsigned would let any network listener forge events to a downstream consumer that trusts the feed. URLs with userinfo or query strings are redacted before ever appearing in logs / audit rows, so basic-auth credentials or signed-URL tokens do not leak.

Metrics

GET /api/v1/admin/backups/metrics (instance OWNER only; see below) returns a point-in-time snapshot of process-lifetime counters. The numbers reset on restart — persistent observability belongs in the audit log and its dashboards.

{
  "created_total":                5,
  "created_by_scope":             { "workspace": 4, "crew": 1 },
  "failed_total":                 0,
  "failed_by_reason":             {},
  "restored_total":               2,
  "size_bytes_total":             67108864,
  "duration_seconds_p50":         4.1,
  "duration_seconds_p95":         12.7,
  "duration_seconds_mean":        6.3,
  "lock_held_seconds_by_workspace": { "ws_abc123": 0 }
}

Duration quantiles are approximated from an in-memory ring buffer — fine for the dozens-to-hundreds of samples a single host accumulates between restarts; not a general-purpose histogram. For long-horizon reporting, ingest the backup.* rows from audit_log.

Instance-scope backup

An instance-scope backup bundles every workspace on a Crewship host plus the cross-workspace surfaces that make the install usable — the credstore, the auth signing secret, and the instance identity (instance_config.hostname). It is the disaster-recovery path for an entire host, not a normal operational tool. Key differences from workspace/crew scope:

Access control. Gated by the CREWSHIP_OWNER_EMAIL env var (server-level OWNER), not workspace role. A workspace OWNER / ADMIN on their own is refused with HTTP 403.
Rate limit. One instance backup per user per sliding hour. A runaway cron cannot DoS the host.
Encryption is recipient-only. --passphrase-file is refused for this scope — the surface is too broad (every workspace’s secrets in one blob) to trust a brute-forceable passphrase. Callers must supply an AGE age1… X25519 public key and hold the matching private key offline.
Cross-host restores force session-key rotation. The bundle records the source hostname; a restore onto a different target invalidates every existing JWE session to prevent source-host tokens from remaining valid after DR.

Full threat model, crypto chain, and operational checklist: Security → Instance-Scope Backup Security.

Admin UI

A Backups tab lives in /admin for OWNER / ADMIN users. It wraps the same REST endpoints the CLI drives and adds:

A status banner for the advisory lock (who holds it, TTL remaining).
Create / restore dialogs with passphrase input (no keyring — the keyring is a CLI-side convenience; the browser never sees ~/.crewship/).
An inspect panel that renders the plaintext manifest without decrypting the payload (same as crewship backup inspect).
A bundle list with size, scope, format version, and created-at, fetched via hooks/use-backups.ts.

The UI does not expose instance-scope operations — they remain CLI + env-gated to reduce blast radius from a compromised admin session.

Known caveats (v0.2 roadmap)

The following items are intentionally deferred to v0.2. They do not block production use but are worth knowing:

preBackup / postBackup hooks — no user-defined hooks yet. If your workspace has services that need an app-level flush, run them manually before invoking crewship backup create.
Remote backends (S3 / B2 / GCS) — bundles live on the server’s local disk only. The storage layer is now abstracted behind a StorageOps interface so a future backend swap won’t require a second refactor of every call-site. Today: use scp / rclone / restic to ship bundles off-box, or stream a single bundle via GET /api/v1/admin/backups/download.
Scheduled backups — no built-in scheduler. Wrap crewship backup create in cron or systemd.timer.
Forward migration replay hooks. The plumbing is wired — migrations can register a per-version restoreBackfill function and the restorer walks the applied ∖ manifest set in ascending order after the main transaction commits — but no migration registers a hook yet. A failed backfill surfaces as ErrRestoreBackfillFailed; the restored rows are visible but may be missing backfilled columns until an admin investigates.
Cross-process keyring lock. The per-process mutex does not cover two concurrent CLI invocations racing on the same keyring file.

Common pitfalls

Losing the passphrase = losing the data. AGE bundles are not recoverable without the passphrase (or X25519 private key). There is no master key, no support escape hatch. Store the passphrase in a separate trust zone from the bundle — a password manager on a different host, a sealed envelope, anything that doesn’t share a failure mode with the disk holding the .tar.zst.
backup unlock --force on a live backup corrupts the bundle. The advisory lock exists precisely because two concurrent backups race on the docker pause / docker unpause sequence. Only ever clear a lock when you can prove its holder is dead (CLI session crashed, host rebooted) and the 1-hour TTL hasn’t fired yet. When in doubt, wait for the TTL.
CREWSHIP_DATA_DIR mismatch silently targets the wrong database. Like the Admin CLI trap — if the server runs with a custom data dir but you invoke crewship backup … without the same env var, the CLI defaults to ~/.crewship and operates on an empty/separate database. Backups succeed but capture nothing useful; restores produce “workspace doesn’t exist” errors. Export the same CREWSHIP_DATA_DIR the server uses, or pass --data-dir.
Stop the server before a restore against the same database. Restore takes the advisory lock but a running crewshipd may still hold open handles on tables the restorer wants to truncate. Symptoms range from “database is locked” SQLite errors to a half-applied restore that leaves the workspace in a non-bootable state. The lock guards against concurrent backups, not against the server’s own writes — stop the service first.
Format version drift breaks restores past N-2. A bundle written by format v5 will restore on v5, v6, and v7 servers; v8 onward, the compatibility window has rolled past it. If you’re restoring from cold storage that’s been sitting for a year+, verify the manifest’s format_version against the destination first with crewship backup inspect.
--recipient / --passphrase-file / --no-encrypt are mutually exclusive. Passing two raises a CLI error rather than silently picking one. A bundle encrypted with --recipient age1… will not decrypt with a passphrase, and vice versa — match the restore flag to the original create flag.
Same-slug restore refuses by default. crewship backup restore will not overwrite an existing workspace or crew under the same slug. If the existing one is stale and you want to replace it, delete it first (crewship workspace delete / crewship crew delete); if you want both side-by-side, use --as-workspace=<new-slug> and then crewship crew provision.
Disk space during restore is not bounded by the heap streaming guarantee. Peak RAM is fixed by the zstd window, but the extraction scratch directory at $TMPDIR/crewship-backup-* needs ~2× the bundle size on disk. Restoring a 4 GB bundle onto a host with 6 GB free in /tmp will fail mid-stream. Mount /tmp on the data volume or set TMPDIR to a larger filesystem.
Scripted retries without --use-keyring deadlock on the passphrase prompt. A non-TTY second invocation will block waiting for stdin. Either pass --passphrase-file, pipe the passphrase via stdin redirect, or use --use-keyring so the first call seeds the cache.
Bundles include workspace bind-mount files. User code in /workspace, agent memory markdown, anything mounted into the agent container — all included in the bundle. AGE encryption is the only defence; treat a .tar.zst like any other backup of source code and credentials.
/secrets is the only directory NEVER included. This is load-bearing — agents pull credentials from /secrets at runtime via Keeper, and including them in a portable bundle would defeat the SECRET-tier guarantee. Don’t add code that bypasses the secret-skipping filter without an equally rigorous out-of-band channel.

Security notes

The /secrets mount is never included in a bundle.
The workspace bind mount (user code) and memory markdown files are included. AGE encryption of the payload is the only defence against leakage from physical bundle distribution — treat bundles like any other backup of your source tree.
Every backup.* event writes to audit_log with user, role, scope, sealed-payload SHA-256, and size. Dry-run restores write backup.restore.dry_run so auditors can tell rehearsals from real cutover.

Automatic pre-migration snapshots

Distinct from the manual crewship backup bundle system above: Crewship auto-snapshots the SQLite DB before any pending migration runs. This is the “binary upgrade went sideways, give me my data back” safety net.

How it works

Every crewship start calls database.SnapshotBeforeMigrate before database.Migrate. If any migrations are pending (rows in the migrations[] slice with version

MAX(version) FROM _migrations), the function:

Resolves the SQLite file path from the connection DSN.
Computes a target name: <dbpath>.pre-migrate-v<from>-to-v<to>-<UTC-RFC3339>.bak.
Issues VACUUM INTO '<target>' — SQLite’s hot-copy mechanism. Safer than a plain file copy because it serializes against concurrent writers and produces a defragmented, WAL-checkpointed snapshot.
Chmods the snapshot to 0600.
Prunes older snapshots, keeping the 10 most recent per database file.

On error the boot aborts before any migration runs — silently continuing without a rollback point would defeat the entire purpose.

Opting out

CREWSHIP_SKIP_MIGRATION_BACKUP=1 crewship start

Skips the snapshot. Useful for CI environments where the DB is ephemeral and snapshot I/O is just overhead.

Recovering from a bad migration

If crewship start succeeds, applies a migration, then exhibits runtime errors that point at schema drift:

# Find the most recent pre-migration snapshot
ls -t ~/.crewship/crewship.db.pre-migrate-*.bak | head -1

# Stop crewship
crewship stop  # or kill -TERM if needed

# Replace the current DB with the snapshot
mv ~/.crewship/crewship.db ~/.crewship/crewship.db.bad
cp ~/.crewship/crewship.db.pre-migrate-v66-to-v87-20260514T143012Z.bak \
   ~/.crewship/crewship.db

# Run with the PREVIOUS binary (the one that wrote the snapshot)
/path/to/older/crewship start

The snapshot is a complete SQLite file — no special tooling needed. Tools like sqlite3 open it directly for inspection.

Limits of pre-migration snapshots vs manual backups

Capability	Pre-migration snapshot	`crewship backup create`
Encrypted at rest	No (chmod 0600 only)	Yes (AGE)
Portable across hosts	Same machine only	Yes
Includes workspace files	No (DB only)	Yes (DB + bind-mount + memory)
Captures container state	No	Yes (snapshots running containers)
Cross-version restore	Snapshot binary only	Forward-compatible via `restoreBackfill`
Triggered automatically	Yes, on pending migration	No (manual or scheduled)
Retention	10 newest per DB file	Operator-controlled via `rotate`

Use both. Pre-migration snapshots cover binary upgrades; crewship backup covers DR, host migration, legal hold, and forensic preservation. They are not interchangeable.

Backup API reference — REST endpoint shapes for every crewship backup subcommand.
Admin CLI — the host-shell write surface that backups protect against; always crewship backup create before running admin writes in prod.
Migrations Catalog — what migration ran when, and which restoreBackfill hook (if any) will fire on a cross-version restore.
Troubleshooting — recovering from a stuck migration or corrupt DB.
Security → Audit log — where backup.created, backup.restored, and backup.restore.dry_run events land.

Get Started

Guides

Security

Configuration

Backup & Restore

Backup & Restore

Overview

When to use it

Key concepts

Usage

Bundle layout

Create a workspace backup

Non-interactive / CI

Asymmetric encryption

Back up a single crew

List, inspect, verify

Restore

Dry run

Delete & rotate

Lock semantics

Examples

Nightly hot-spare backup with 14-day retention

Workspace migration to a new host

DR drill with `--dry-run`

API reference

Streaming & memory bounds (large backups)

Passphrase keyring

Webhooks

Metrics

Instance-scope backup

Admin UI

Known caveats (v0.2 roadmap)

Common pitfalls

Security notes

Automatic pre-migration snapshots

How it works

Opting out

Recovering from a bad migration

Limits of pre-migration snapshots vs manual backups

Get Started

Guides

Security

Configuration

Documentation Index

​Backup & Restore

​Overview

​When to use it

​Key concepts

​Usage

​Bundle layout

​Create a workspace backup

​Non-interactive / CI

​Asymmetric encryption

​Back up a single crew

​List, inspect, verify

​Restore

​Dry run

​Delete & rotate

​Lock semantics

​Examples

​Nightly hot-spare backup with 14-day retention

​Workspace migration to a new host

​DR drill with --dry-run

​API reference

​Streaming & memory bounds (large backups)

​Passphrase keyring

​Webhooks

​Metrics

​Instance-scope backup

​Admin UI

​Known caveats (v0.2 roadmap)

​Common pitfalls

​Security notes

​Automatic pre-migration snapshots

​How it works

​Opting out

​Recovering from a bad migration

​Limits of pre-migration snapshots vs manual backups

​Related

Backup & Restore

Overview

When to use it

Key concepts

Usage

Bundle layout

Create a workspace backup

Non-interactive / CI

Asymmetric encryption

Back up a single crew

List, inspect, verify

Restore

Dry run

Delete & rotate

Lock semantics

Examples

Nightly hot-spare backup with 14-day retention

Workspace migration to a new host

DR drill with `--dry-run`

API reference

Streaming & memory bounds (large backups)

Passphrase keyring

Webhooks

Metrics

Instance-scope backup

Admin UI

Known caveats (v0.2 roadmap)

Common pitfalls

Security notes

Automatic pre-migration snapshots

How it works

Opting out

Recovering from a bad migration

Limits of pre-migration snapshots vs manual backups

Related