> ## Documentation Index
> Fetch the complete documentation index at: https://docs.crewship.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Backup & Restore

> Admin-only workspace and crew backup bundles — AGE-encrypted .tar.zst with forward-compatible manifest, advisory locking, retention rotation, and dry-run restores.

# Backup & Restore

## Overview

Crewship's backup system produces portable `.tar.zst` bundles that capture a workspace (or a single crew) in one file. Bundles are AGE-encrypted by default, carry a versioned manifest, and can be restored on any Crewship instance that speaks the same format version (N-2 compatibility guarantee). The whole subsystem is admin-only by design: every backup subcommand requires the **OWNER** or **ADMIN** role on the workspace, and the runner refuses MEMBER / VIEWER calls at both the CLI parsing layer and the server-side handler — defence in depth, not just a UI veneer.

The architectural choice that shapes everything else is "one bundle, one file". Backups don't depend on an external object store, don't require a sidecar, and don't shard across files. A `.tar.zst` is a single artefact an operator can `scp` to a backup host, hand to a customer for legal hold, or check into a private bucket — without the rest of Crewship being available. Inside the tarball, the manifest is plaintext JSON (so `crewship backup inspect` can read it without the AGE recipient key) and the payload is the encrypted SQLite snapshot plus any referenced workspace files. The forward-compatible manifest schema means a bundle produced on N can restore on N-1 and N-2; older bundles run their migration's `restoreBackfill` hook so columns added since the snapshot are populated sanely instead of left at SQL defaults.

Restores are intentionally cautious. `--dry-run` walks the entire restore plan — schema diff, row counts, blob deltas — and prints what *would* happen without writing a single byte to the destination database. Advisory locking (`backup_locks` table, per-workspace) prevents two concurrent `restore` runs from corrupting each other, and the lock file records the host + PID so `crewship backup status` can tell you who's holding it. If a host crashes mid-restore, `crewship backup unlock` is the manual recovery — admin-only with a confirmation prompt, because clearing a real lock from another live process is how you trash a workspace.

## When to use it

Backups are cheap to make and expensive to wish you'd made. The five canonical reasons to run `crewship backup create`:

* **Before any destructive admin write.** Before `crewship admin reset-password`, before a database migration on a binary upgrade, before a large schema-changing PR lands in prod — capture the current state first. The bundle is the one-command rollback if anything goes wrong.
* **Disaster recovery / hot spare.** Schedule a nightly `crewship backup create --scope=workspace --passphrase-file …` cron and ship the bundle to a separate host. If the primary disk dies, restoring onto a fresh binary is one `crewship backup restore` away — no replication agent, no streaming WAL, no extra moving parts.
* **Workspace migration to another host.** Moving a workspace from a dev VM to a prod host (or between two prod hosts) is exactly what bundles are for. Create on the source, scp the file, `crewship backup restore --as-workspace <new-slug>` on the destination, then provision crews. The `--as-workspace` rename avoids the "two `acme` workspaces colliding" hazard.
* **Legal hold or compliance archive.** Customer leaves; you need to keep their workspace state on cold storage for N years. One AGE-encrypted `.tar.zst` is a forever-readable artefact — no live database, no service required. `inspect` later proves the bundle's contents without decrypting.
* **Forensic snapshot before incident response.** Suspected compromise of an admin account, or a "what was the state when X happened" investigation. `backup create` + immediate offsite copy preserves the audit trail before anyone (you, the attacker, the well-meaning oncall) starts changing things.

Skip backups for **ephemeral development workspaces** (the dev VM rebuilds them from seed scripts anyway) and **single-agent throwaways** with no user-visible state worth preserving — the bundle metadata cost is real (\~few MB minimum) and not every workspace earns it.

## Key concepts

| Term                                      | What it means here                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| ----------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Bundle**                                | A single `.tar.zst` artefact containing one `MANIFEST` (plaintext JSON), one `payload.age` (AGE-sealed tar of DB rows + workspace files), and one `payload.sha256`. The whole backup is this one file — portable, hashable, copy-able with `scp`.                                                                                                                                                                                                                                               |
| **Scope**                                 | `workspace` (the workspace row + every crew under it) or `crew` (a single crew + its agents). Crew bundles restore independently of their parent workspace, so a "move this crew to another workspace" is a viable migration path.                                                                                                                                                                                                                                                              |
| **Format version**                        | Integer in `MANIFEST.format_version`. Compatibility guarantee is N-2 — a bundle produced on format v3 restores on v3, v4, and v5 servers. Bumped only when a migration changes the bundle layout itself (not the schema rows inside).                                                                                                                                                                                                                                                           |
| **`restoreBackfill` hook**                | Per-migration Go function — replay logic lives in `internal/backup/runner_restore.go` (`replayRestoreBackfills`). Runs when a bundle predates a migration and the restoring server has columns the source didn't. Pure `ADD COLUMN` migrations rely on the SQL DEFAULT; complex backfills (JSON shape changes, foreign keys) provide a hook so restored rows land sanely.                                                                                                                       |
| **AGE encryption**                        | The bundle's payload is sealed with `filippo.io/age`. Default mode is passphrase (`scrypt`-derived key); `--recipient age1…` switches to X25519 public-key encryption for hand-offs. `--no-encrypt` produces a plaintext payload for test/CI use only.                                                                                                                                                                                                                                          |
| **Passphrase keyring**                    | Opt-in cache at `~/.crewship/backup-keyring.enc` (AGE-encrypted with a single OS-keyring-stored key) so operators don't retype the passphrase on every `rotate` / `verify`. Enabled via `--use-keyring` on `create` / `restore`.                                                                                                                                                                                                                                                                |
| **Advisory lock**                         | A row in the `backup_locks` table, primary-keyed on workspace ID. Taken before any DB dump or docker pause; released by `defer Release()` on the happy path. 1-hour TTL (`DefaultLockTTL`) so a crashed backup self-heals after the window.                                                                                                                                                                                                                                                     |
| **`refuseIfBackupInProgress` guard**      | A shared middleware wired into the assignments, peer-query, and webhook handlers. Reads the lock state and refuses *new* agent runs while a backup is in progress — closes the TOCTOU window between `ensureAgentsIdle` (initial check) and `docker pause` (the actual freeze).                                                                                                                                                                                                                 |
| **Dry-run restore**                       | `--dry-run` on `crewship backup restore`. Decrypts, validates the manifest, replays the DB transaction, then rolls back. The only side effect is one `backup.restore.dry_run` audit row — distinct from `backup.restore` so auditors can tell "verified" apart from "actually restored".                                                                                                                                                                                                        |
| **`--as-workspace` / `--as-crew` rename** | Restore the bundle under a new slug instead of the original. Refuses to run the docker phase (container names derive from the slug) and tells the operator to `crewship crew provision` afterwards. Avoids "two `acme` workspaces colliding" during DR drills.                                                                                                                                                                                                                                  |
| **Retention sweep**                       | `crewship backup rotate --keep-last N --keep-days D`. Per-workspace — never touches another workspace's bundles. Both flags can be combined; both must be positive. `--dry-run` lists what would be deleted without touching disk.                                                                                                                                                                                                                                                              |
| **Stale lock**                            | A lock whose holder process is dead but whose row still exists. Detected when `acquired_at + DefaultLockTTL < now()`. Auto-released on next `create`; manually clearable via `crewship backup unlock --force` (admin-only, with confirmation).                                                                                                                                                                                                                                                  |
| **Pre-migration snapshot**                | An automatic safety net distinct from the manual bundle system. Whenever `crewship start` detects pending migrations, `database.SnapshotBeforeMigrate` writes a `VACUUM INTO` copy of the live SQLite database to `<dbpath>.pre-migrate-vN-to-vM-<UTC>.bak` before any DDL runs. Last 10 snapshots are retained per database; opt out with `CREWSHIP_SKIP_MIGRATION_BACKUP=1`. **Not a substitute** for proper backups — it's the binary upgrade safety net, not a disaster-recovery primitive. |

## Usage

The whole backup surface is the `crewship backup` command group — twelve subcommands that cover the create → verify → restore → rotate lifecycle (plus `metrics`, `download`, and `self-test`). The core nine are tabulated below with full flag reference and copy-pasteable examples; `metrics` is covered under [Metrics](#metrics), `download` under [Known caveats](#known-caveats-v02-roadmap). This table is the entry point.

| Command                   | Purpose                                                                 | Deep dive                                                                                                 |
| ------------------------- | ----------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| `crewship backup create`  | Produce a new bundle for a workspace or crew.                           | [Create a workspace backup](#create-a-workspace-backup) / [Back up a single crew](#back-up-a-single-crew) |
| `crewship backup list`    | List bundles on disk.                                                   | [List, inspect, verify](#list-inspect-verify)                                                             |
| `crewship backup inspect` | Print a bundle's manifest without decrypting the payload.               | [List, inspect, verify](#list-inspect-verify)                                                             |
| `crewship backup verify`  | Decrypt + checksum a bundle end-to-end (no DB writes).                  | [List, inspect, verify](#list-inspect-verify)                                                             |
| `crewship backup restore` | Restore a bundle (supports `--dry-run`).                                | [Restore](#restore)                                                                                       |
| `crewship backup delete`  | Remove a single bundle (interactive confirm or `--force`).              | [Delete & rotate](#delete--rotate)                                                                        |
| `crewship backup rotate`  | Retention sweep — `--keep-last N` / `--keep-days D`.                    | [Delete & rotate](#delete--rotate)                                                                        |
| `crewship backup status`  | Show the advisory lock state for the current workspace.                 | [Lock semantics](#lock-semantics)                                                                         |
| `crewship backup unlock`  | Release a stale lock owned by this host (admin-only, confirm required). | [Lock semantics](#lock-semantics)                                                                         |

The minimum end-to-end loop is four commands: `create` to produce the bundle, `verify` to prove it's not corrupt, `restore --dry-run` to prove the destination will accept it, then `restore` for real. Every other subcommand exists for retention (`rotate`), introspection (`list`, `inspect`, `status`), or recovery (`unlock`).

<Note>
  `create` and `restore` accept `--use-keyring` to cache and reuse the
  workspace passphrase via `~/.crewship/backup-keyring.enc`. See
  [Passphrase keyring](#passphrase-keyring) below.
</Note>

## Bundle layout

```
crewship-<scope>-<slug>-<iso-ts>.tar.zst
├── MANIFEST           (plaintext JSON, format_version, scope, checksums)
├── payload.age        (AGE-sealed tar.zst of DB rows + workspace files)
└── payload.sha256     (SHA-256 of the sealed payload bytes)
```

* **Scope:** `workspace` (workspace row + all its crews) or `crew` (single crew + its agents).
* **Encryption:** AGE passphrase (default) or AGE X25519 recipient. `--no-encrypt` produces a plaintext payload for test / CI use.
* **Default location:** `~/.crewship/backups/` on the server, mode `0700`.
* **Naming:** `crewship-<scope>-<slug>-<iso-ts>.tar.zst`. Collisions append `-<hash8>`.

## Create a workspace backup

```bash theme={null}
crewship backup create --scope=workspace
# Passphrase: ********
# Confirm passphrase: ********
# ✓ Backup created: /home/admin/.crewship/backups/crewship-workspace-acme-2026-04-15T12-05-01Z.tar.zst
```

The CLI prompts twice for a passphrase (to guard against typos) and confirms success with a row summarising scope, size, format version, and the SHA-256 of the sealed payload.

### Non-interactive / CI

Supply the passphrase from a file:

```bash theme={null}
crewship backup create --scope=workspace --passphrase-file /run/secrets/backup.pw
```

Or pipe a single line on stdin (falls back automatically when stdin is not a TTY and `--passphrase-file` is not set).

### Asymmetric encryption

If the restoring party holds an AGE X25519 keypair, pass their public key instead of a shared secret:

```bash theme={null}
crewship backup create \
  --scope=workspace \
  --recipient age1ql3z7hjy54pw3hyww5ayyfg7zqgvc7w3j2elw8zmrj2kg5sfn9aqmcac8p
```

`--recipient`, `--passphrase-file`, and `--no-encrypt` are mutually exclusive.

## Back up a single crew

```bash theme={null}
crewship backup create --scope=crew --crew dev-team
```

`--crew` accepts either a slug or a crew ID. Crew-scope bundles restore independently of their parent workspace.

## List, inspect, verify

```bash theme={null}
crewship backup list
# FILE                                              SCOPE      SIZE     ENCRYPTED  FORMAT  CREATED_AT
# crewship-workspace-acme-2026-04-15T12-05-01Z…   workspace  12.8 MiB yes        v1      2026-04-15T12:05:01Z

crewship backup inspect ~/.crewship/backups/crewship-workspace-acme-…tar.zst
# { "format_version": 1, "scope": "workspace", "contents": { "workspace": { "slug": "acme", … } } }

crewship backup verify ~/.crewship/backups/crewship-workspace-acme-…tar.zst
# ✓ VALID — /home/admin/.crewship/backups/… (12.8 MiB)
```

`inspect` only reads the plaintext MANIFEST — it never touches the sealed payload, so no passphrase is needed. `verify` recomputes the SHA-256 of the sealed bytes against the manifest and fails if the bundle was truncated or tampered with. Neither decrypts.

## Restore

```bash theme={null}
crewship backup restore ~/.crewship/backups/crewship-workspace-acme-…tar.zst
# Passphrase: ********
# ✓ Restore complete — workspace=acme crews=4 rows=312 id=ws_abc123
```

The server rejects the restore if a workspace (or crew) with the same slug already exists. Override with `--as-workspace <new-slug>` or `--as-crew <new-slug>` to land the payload under a fresh identity:

```bash theme={null}
crewship backup restore bundle.tar.zst --as-workspace acme-dr
# ⚠ Docker phase skipped (--as-workspace/--as-crew supplied).
#   Provision the new crews with `crewship crew provision`.
```

When the bundle is landed under a new slug the docker phase is intentionally skipped — container names are derived from the slug, and renaming a live crew requires an explicit provision step.

### Dry run

```bash theme={null}
crewship backup restore bundle.tar.zst --dry-run
# ✓ Restore validation complete (dry-run; no workspace/crew data changes applied)
```

A dry-run decrypts the bundle, validates the manifest, replays the DB transaction, and then **rolls back**. The only side effect is a single `backup.restore.dry_run` row in the audit log — handy for proving a bundle is restorable before the real cutover. The distinct audit action lets auditors tell "verified" apart from "actually restored".

## Delete & rotate

```bash theme={null}
crewship backup delete ~/.crewship/backups/old.tar.zst
# Delete backup …? [y/N] y
# ✓ Backup deleted

crewship backup rotate --keep-last 10 --keep-days 30 --dry-run
# Would delete 3 bundle(s):
#   /home/admin/.crewship/backups/crewship-workspace-acme-2025-12-01…
#   …
```

`rotate` applies retention per workspace — it never touches another workspace's bundles. Either `--keep-last N` (bundles above the N newest are dropped) or `--keep-days D` (bundles older than D days are dropped) must be positive; both can be combined.

`delete` requires interactive confirmation, or `--force` in scripts / CI. The same rule applies to `backup unlock`.

## Lock semantics

Each workspace holds at most one **advisory backup lock** at a time (table `backup_locks`, per-workspace PK). The lock:

* Is taken before any DB dump or docker pause and released by a deferred `Release()` on the happy path.
* Has a **1-hour TTL** (`DefaultLockTTL`); a crashed backup self-heals after the window.
* Blocks concurrent `backup create` calls — the second caller gets HTTP **409 Conflict** with a "another backup is already in progress" message.
* Blocks **new agent runs** via the shared `refuseIfBackupInProgress` guard wired into the assignments, peer-query, and webhook handlers. This closes the TOCTOU window between `ensureAgentsIdle` and `docker pause`.

Inspect or release the lock:

```bash theme={null}
crewship backup status
# WORKSPACE  ACQUIRED_BY     ACQUIRED_AT           EXPIRES_AT
# ws_abc123  admin@acme.io   2026-04-15T12:04:58Z  2026-04-15T13:04:58Z

crewship backup unlock --force
# ✓ Backup lock released.
```

<Warning>
  `backup unlock` is an emergency escape hatch. Only use it when you can confirm no backup is actually running (e.g. the previous CLI session crashed and the 1 h TTL has not yet fired). Forcibly releasing a live backup's lock will let a second backup start alongside it, and the two will race on the docker pause/unpause sequence.
</Warning>

## Examples

### Nightly hot-spare backup with 14-day retention

A workspace on `prod-server.example.com` should produce a bundle every night, ship it to a separate backup host, and keep 14 days on disk locally as a fast-rollback safety net.

```bash theme={null}
# /etc/systemd/system/crewship-backup.service
[Unit]
Description=Crewship nightly backup
After=crewshipd.service

[Service]
Type=oneshot
User=crewship
Environment=CREWSHIP_DATA_DIR=/var/lib/crewship
ExecStart=/usr/local/bin/crewship backup create \
  --scope=workspace \
  --passphrase-file=/etc/crewship/backup.pw \
  --use-keyring
ExecStartPost=/usr/bin/rsync -a --remove-source-files \
  /var/lib/crewship/backups/ \
  backup-host:/srv/crewship-backups/prod/
ExecStartPost=/usr/local/bin/crewship backup rotate \
  --keep-last=14 --keep-days=14 --force
```

Paired with a `crewship-backup.timer` that fires at 02:37 daily (off-the-hour to avoid clustering with the rest of the fleet). The `--use-keyring` flag means the passphrase is read once and cached — the timer doesn't have to redeliver it. The `rotate` step runs after the rsync so local-disk pressure is bounded even on long runs without a remote-side sweep.

### Workspace migration to a new host

The `acme` workspace lives on `crewship-old`. You're moving it to `crewship-new` to retire the old host. The destination already has its own workspaces, so a same-slug restore would conflict.

```bash theme={null}
# On crewship-old:
crewship backup create --scope=workspace --passphrase-file backup.pw
# ✓ Backup created: /home/admin/.crewship/backups/crewship-workspace-acme-2026-05-14T03-12-00Z.tar.zst

# Ship it:
scp ~/.crewship/backups/crewship-workspace-acme-*.tar.zst crewship-new:/tmp/

# On crewship-new — pick a new slug because the host already has things:
crewship backup restore /tmp/crewship-workspace-acme-*.tar.zst \
  --as-workspace=acme-migrated \
  --passphrase-file=backup.pw
# ⚠ Docker phase skipped (--as-workspace supplied).
#   Provision the new crews with `crewship crew provision`.

# Stand the crews up under the new slug:
crewship crew provision --workspace=acme-migrated --all
```

Once the new instance has parity (sanity-check via the UI), users get the new URL, and `acme` on the old host gets archived to cold storage before its workspace row is deleted.

### DR drill with `--dry-run`

Every quarter the team proves the disaster-recovery bundle is actually restorable, without disturbing production state. Run the drill on a throwaway VM:

```bash theme={null}
# 1. Copy the most recent prod bundle to the drill host:
scp prod:/var/lib/crewship/backups/crewship-workspace-acme-latest.tar.zst .

# 2. Inspect the manifest first (no passphrase needed):
crewship backup inspect crewship-workspace-acme-latest.tar.zst | jq '.format_version, .contents.workspace.slug'
# 1
# "acme"

# 3. Verify checksum end-to-end:
crewship backup verify crewship-workspace-acme-latest.tar.zst
# ✓ VALID — crewship-workspace-acme-latest.tar.zst (487 MiB)

# 4. Replay the restore inside a transaction, then roll back:
crewship backup restore crewship-workspace-acme-latest.tar.zst \
  --dry-run \
  --passphrase-file=backup.pw
# ✓ Restore validation complete (dry-run; no workspace/crew data changes applied)
```

A `backup.restore.dry_run` audit entry shows up in the workspace journal — auditors looking at "did anyone restore prod?" can tell drill runs apart from real restores by the action name.

## API reference

The backup surface is CLI-first — every operation is reachable via `crewship backup …` and every flag the CLI accepts maps to an HTTP body field. The full HTTP schema lives at [`/api-reference/backup`](/api-reference/backup); a quick orientation:

| Method   | Path                            | What it does                                                                                                                                                                                                                                 |
| -------- | ------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `POST`   | `/api/v1/admin/backups`         | Create a bundle. **Auth + OWNER/ADMIN.** Same trust gate as the CLI; concurrent calls hit the [advisory lock](#lock-semantics) and the second returns 409.                                                                                   |
| `GET`    | `/api/v1/admin/backups`         | List bundles on disk for the current workspace.                                                                                                                                                                                              |
| `GET`    | `/api/v1/admin/backups/inspect` | Plaintext manifest only — equivalent to `crewship backup inspect`. Takes the bundle as a query param; no passphrase required.                                                                                                                |
| `GET`    | `/api/v1/admin/backups/verify`  | SHA-256 integrity check of the sealed bytes against the manifest. **Does not decrypt** — no passphrase required. No DB writes.                                                                                                               |
| `POST`   | `/api/v1/admin/backups/restore` | Restore. Body accepts `dry_run`, `as_workspace`, `as_crew`, and `replace` (v3 DR wipe-and-replace of existing workspace contents — workspace-scope only, incompatible with `as_workspace`/`as_crew`). **Holds the lock for the whole call.** |
| `DELETE` | `/api/v1/admin/backups`         | Delete a bundle. Idempotent — 404 is not an error in scripted use.                                                                                                                                                                           |
| `POST`   | `/api/v1/admin/backups/rotate`  | Retention sweep. Body accepts `keep_last`, `keep_days`, `dry_run`.                                                                                                                                                                           |
| `GET`    | `/api/v1/admin/backups/status`  | Inspect the advisory lock state (used by `crewship backup status`).                                                                                                                                                                          |
| `DELETE` | `/api/v1/admin/backups/status`  | Force-release a stale lock. Requires `force=true` body field — **never** call from automation without a confirmation gate.                                                                                                                   |

All routes are mounted in `internal/api/router_admin.go`. The CLI talks to these directly when run against a remote host (`--server` flag) and falls back to in-process Go calls when run on the same machine as `crewshipd` — bypassing HTTP entirely for the host-shell use case. The two paths share the same handler functions, so flags work identically.

Webhook payloads for `backup.created` / `backup.restored` / `backup.failed` events are documented separately under [Webhooks](#webhooks); metric emissions are documented under [Metrics](#metrics). (Dry-run restores deliberately fire no webhook — they are not real restores.)

## Streaming & memory bounds (large backups)

Restore and verify stream the sealed payload to a temp directory rather than buffering per-crew sections in a `map[slug][]byte`. Peak heap stays bounded by the zstd decoder window regardless of bundle size, so multi-GB restores run cleanly on small hosts. The extraction scratch directory is `os.TempDir()/crewship-backup-…` and is removed on `Close()`; a killed process leaves it behind for the next `os.TempDir` cleanup.

## Passphrase keyring

The `--use-keyring` flag on `create` and `restore` caches the workspace
passphrase in `~/.crewship/backup-keyring.enc` so scripts and repeat
rehearsals don't re-prompt. The file is an AES-256-GCM-encrypted JSON
map keyed by workspace ID, using the same `v1:<base64>` envelope as the
credstore — without the host's `ENCRYPTION_KEY` the contents are
unreadable even if the file leaks.

```bash theme={null}
# First use on this workspace: prompts, then persists.
crewship backup create --scope=workspace --use-keyring

# Subsequent runs: silent, no prompt.
crewship backup create --scope=workspace --use-keyring
crewship backup restore bundle.tar.zst --use-keyring
```

Semantics worth knowing:

* **No silent fallback on failure.** Opening the keyring or reading an
  entry reports the real error and aborts; only `ErrKeyringEntryNotFound`
  (first use on this workspace) falls through to a prompt.
* **Write failures are non-fatal** on `create`. The bundle is already
  written when the keyring save runs; the CLI logs a warning and
  continues.
* **Keyring is local to the operator's host.** `--use-keyring` always
  writes to `~/.crewship/` on the invoking machine — even if a future
  remote bundle backend (S3 / GCS) is configured, the passphrase never
  travels with the bundle.
* **Single-process mutex, not file-locked.** Two concurrent CLI
  invocations against the same workspace are last-write-wins (the file
  is small and the failure mode is "one passphrase lost, never data
  corruption"). Filesystem-level locking is on the v0.2 roadmap.
* **`--passphrase-file` takes precedence.** When both flags are passed,
  the file wins and the keyring is not consulted.

## Webhooks

Set `CREWSHIP_BACKUP_WEBHOOK_URL` (and `CREWSHIP_BACKUP_WEBHOOK_SECRET`)
on the server process to receive a signed POST for each backup lifecycle
event. Delivery is fire-and-forget from a goroutine — a slow or down
webhook never blocks the backup run.

```bash theme={null}
export CREWSHIP_BACKUP_WEBHOOK_URL=https://hooks.example.com/crewship
export CREWSHIP_BACKUP_WEBHOOK_SECRET=$(openssl rand -hex 32)
```

Each event is JSON with the shape:

```json theme={null}
{
  "event":          "backup.created",
  "timestamp":      "2026-04-15T12:05:01Z",
  "workspace_id":   "ws_abc123",
  "scope":          "workspace",
  "path":           "/home/admin/.crewship/backups/...tar.zst",
  "bytes":          13421772,
  "payload_sha256": "…",
  "error":          ""
}
```

Events: `backup.created`, `backup.failed`, `backup.restored`.

Each request carries an `X-Crewship-Signature: sha256=<hex>` header —
HMAC-SHA256 over the raw body using `CREWSHIP_BACKUP_WEBHOOK_SECRET`.
Receivers **must** verify the signature (same scheme as Crewship's
inbound webhooks; validate via `webhook.ValidateHMAC` after stripping
the `sha256=` prefix). The secret is required whenever `URL` is set —
sending a body unsigned would let any network listener forge events to
a downstream consumer that trusts the feed. URLs with userinfo or query
strings are redacted before ever appearing in logs / audit rows, so
basic-auth credentials or signed-URL tokens do not leak.

## Metrics

`GET /api/v1/admin/backups/metrics` (instance OWNER only; see below)
returns a point-in-time snapshot of process-lifetime counters. The
numbers reset on restart — persistent observability belongs in the
audit log and its dashboards.

```json theme={null}
{
  "created_total":                5,
  "created_by_scope":             { "workspace": 4, "crew": 1 },
  "failed_total":                 0,
  "failed_by_reason":             {},
  "restored_total":               2,
  "size_bytes_total":             67108864,
  "duration_seconds_p50":         4.1,
  "duration_seconds_p95":         12.7,
  "duration_seconds_mean":        6.3,
  "lock_held_seconds_by_workspace": { "ws_abc123": 0 }
}
```

Duration quantiles are approximated from an in-memory ring buffer —
fine for the dozens-to-hundreds of samples a single host accumulates
between restarts; not a general-purpose histogram. For long-horizon
reporting, ingest the `backup.*` rows from `audit_log`.

## Instance-scope backup

An instance-scope backup bundles **every workspace** on a Crewship host
plus the cross-workspace surfaces that make the install usable — the
credstore, the auth signing secret, and the instance identity
(`instance_config.hostname`). It is the disaster-recovery path for an
entire host, not a normal operational tool.

Key differences from workspace/crew scope:

* **Access control.** Gated by the `CREWSHIP_OWNER_EMAIL` env var
  (server-level OWNER), not workspace role. A workspace OWNER / ADMIN
  on their own is refused with HTTP 403.
* **Rate limit.** One instance backup per user per sliding hour. A
  runaway cron cannot DoS the host.
* **Encryption is recipient-only.** `--passphrase-file` is refused for
  this scope — the surface is too broad (every workspace's secrets in
  one blob) to trust a brute-forceable passphrase. Callers must supply
  an AGE `age1…` X25519 public key and hold the matching private key
  offline.
* **Cross-host restores force session-key rotation.** The bundle records
  the source hostname; a restore onto a different target invalidates
  every existing JWE session to prevent source-host tokens from
  remaining valid after DR.

Full threat model, crypto chain, and operational checklist:
[Security → Instance-Scope Backup Security](/security/backup-instance-security).

## Admin UI

A Backups tab lives in `/admin` for OWNER / ADMIN users. It wraps the
same REST endpoints the CLI drives and adds:

* A status banner for the advisory lock (who holds it, TTL remaining).
* Create / restore dialogs with passphrase input (no keyring — the
  keyring is a CLI-side convenience; the browser never sees
  `~/.crewship/`).
* An inspect panel that renders the plaintext manifest without
  decrypting the payload (same as `crewship backup inspect`).
* A bundle list with size, scope, format version, and created-at,
  fetched via `hooks/use-backups.ts`.

The UI does **not** expose instance-scope operations — they remain
CLI + env-gated to reduce blast radius from a compromised admin session.

## Known caveats (v0.2 roadmap)

<Note>
  The following items are intentionally deferred to v0.2. They do
  **not** block production use but are worth knowing.
</Note>

* **`preBackup` / `postBackup` hooks** — no user-defined hooks yet. If
  your workspace has services that need an app-level flush, run them
  manually before invoking `crewship backup create`.
* **Remote backends (S3 / B2 / GCS)** — bundles live on the server's
  local disk only. The storage layer is now abstracted behind a
  `StorageOps` interface so a future backend swap won't require a
  second refactor of every call-site. Today: use `scp` / `rclone` /
  `restic` to ship bundles off-box, or stream a single bundle via
  `GET /api/v1/admin/backups/download`.
* **Scheduled backups** — no built-in scheduler. Wrap `crewship backup
  create` in `cron` or `systemd.timer`.
* **Forward migration replay hooks.** The plumbing is wired —
  migrations can register a per-version `restoreBackfill` function and
  the restorer walks the `applied ∖ manifest` set in ascending order
  after the main transaction commits — but no migration registers a
  hook yet. A failed backfill surfaces as `ErrRestoreBackfillFailed`;
  the restored rows are visible but may be missing backfilled columns
  until an admin investigates.
* **Cross-process keyring lock.** The per-process mutex does not cover
  two concurrent CLI invocations racing on the same keyring file.

## Common pitfalls

<Warning>
  **Losing the passphrase = losing the data.** AGE bundles are not recoverable without the passphrase (or X25519 private key). There is no master key, no support escape hatch. Store the passphrase in a separate trust zone from the bundle — a password manager on a different host, a sealed envelope, anything that doesn't share a failure mode with the disk holding the `.tar.zst`.
</Warning>

<AccordionGroup>
  <Accordion title="backup unlock --force on a live backup corrupts the bundle">
    The advisory lock exists precisely because two concurrent backups race on the `docker pause` / `docker unpause` sequence. Only ever clear a lock when you can *prove* its holder is dead (CLI session crashed, host rebooted) and the 1-hour TTL hasn't fired yet. When in doubt, wait for the TTL.
  </Accordion>

  <Accordion title="CREWSHIP_DATA_DIR mismatch silently targets the wrong database">
    Like the [Admin CLI](/guides/admin-cli) trap — if the server runs with a custom data dir but you invoke `crewship backup …` without the same env var, the CLI defaults to `~/.crewship` and operates on an empty/separate database. Backups succeed but capture nothing useful; restores produce "workspace doesn't exist" errors. Export the same `CREWSHIP_DATA_DIR` the server uses (there is no `--data-dir` flag — the env var is the only override).
  </Accordion>

  <Accordion title="Stop the server before a restore against the same database">
    Restore takes the advisory lock but a running `crewshipd` may still hold open handles on tables the restorer wants to truncate. Symptoms range from "database is locked" SQLite errors to a half-applied restore that leaves the workspace in a non-bootable state. The lock guards against concurrent *backups*, not against the server's own writes — stop the service first.
  </Accordion>

  <Accordion title="Format version drift breaks restores past N-2">
    A bundle written by format v5 will restore on v5, v6, and v7 servers; v8 onward, the compatibility window has rolled past it. If you're restoring from cold storage that's been sitting for a year+, verify the manifest's `format_version` against the destination first with `crewship backup inspect`.
  </Accordion>

  <Accordion title="--recipient / --passphrase-file / --no-encrypt are mutually exclusive">
    Passing two raises a CLI error rather than silently picking one. A bundle encrypted with `--recipient age1…` will not decrypt with a passphrase, and vice versa — match the restore flag to the original create flag.
  </Accordion>

  <Accordion title="Same-slug restore refuses by default">
    `crewship backup restore` will not overwrite an existing workspace or crew under the same slug. If the existing one is stale and you want to replace it, delete it first (a crew via `crewship crew delete`; a workspace must be removed out-of-band — there is no workspace-delete command, see the [workspaces API note](/api-reference/workspaces)); if you want both side-by-side, use `--as-workspace=<new-slug>` and then `crewship crew provision`.
  </Accordion>

  <Accordion title="Disk space during restore is not bounded by the heap streaming guarantee">
    Peak RAM is fixed by the zstd window, but the extraction scratch directory at `$TMPDIR/crewship-backup-*` needs \~2× the bundle size on disk. Restoring a 4 GB bundle onto a host with 6 GB free in `/tmp` will fail mid-stream. Mount `/tmp` on the data volume or set `TMPDIR` to a larger filesystem.
  </Accordion>

  <Accordion title="Scripted retries without --use-keyring deadlock on the passphrase prompt">
    A non-TTY second invocation will block waiting for stdin. Either pass `--passphrase-file`, pipe the passphrase via stdin redirect, or use `--use-keyring` so the first call seeds the cache.
  </Accordion>

  <Accordion title="Bundles include workspace bind-mount files">
    User code in `/workspace`, agent memory markdown, anything mounted into the agent container — all included in the bundle. AGE encryption is the only defence; treat a `.tar.zst` like any other backup of source code and credentials.
  </Accordion>

  <Accordion title="/secrets is the only directory NEVER included">
    This is load-bearing — agents pull credentials from `/secrets` at runtime via Keeper, and including them in a portable bundle would defeat the SECRET-tier guarantee. Don't add code that bypasses the secret-skipping filter without an equally rigorous out-of-band channel.
  </Accordion>
</AccordionGroup>

## Security notes

* The `/secrets` mount is **never** included in a bundle.
* The workspace bind mount (user code) and memory markdown files are included. AGE encryption of the payload is the only defence against leakage from physical bundle distribution — treat bundles like any other backup of your source tree.
* Every `backup.*` event writes to `audit_log` with user, role, scope, sealed-payload SHA-256, and size. Dry-run restores write `backup.restore.dry_run` so auditors can tell rehearsals from real cutover.

## Automatic pre-migration snapshots

Distinct from the manual `crewship backup` bundle system above:
Crewship auto-snapshots the SQLite DB **before** any pending migration
runs. This is the "binary upgrade went sideways, give me my data
back" safety net.

### How it works

Every `crewship start` calls
`database.SnapshotBeforeMigrate` before `database.Migrate`. If any
migrations are pending (rows in the `migrations[]` slice with version

> `MAX(version) FROM _migrations`), the function:

1. Resolves the SQLite file path from the connection DSN.
2. Computes a target name:
   `<dbpath>.pre-migrate-v<from>-to-v<to>-<UTC-RFC3339>.bak`.
3. Issues `VACUUM INTO '<target>'` — SQLite's hot-copy mechanism. Safer
   than a plain file copy because it serializes against concurrent
   writers and produces a defragmented, WAL-checkpointed snapshot.
4. Chmods the snapshot to `0600`.
5. Prunes older snapshots, keeping the 10 most recent per database file.

On error the boot aborts before any migration runs — silently
continuing without a rollback point would defeat the entire purpose.

### Opting out

```bash theme={null}
CREWSHIP_SKIP_MIGRATION_BACKUP=1 crewship start
```

Skips the snapshot. Useful for CI environments where the DB is
ephemeral and snapshot I/O is just overhead.

### Recovering from a bad migration

If `crewship start` succeeds, applies a migration, then exhibits
runtime errors that point at schema drift:

<Steps>
  <Step title="Find the most recent pre-migration snapshot">
    ```bash theme={null}
    ls -t ~/.crewship/crewship.db.pre-migrate-*.bak | head -1
    ```
  </Step>

  <Step title="Stop crewship">
    ```bash theme={null}
    crewship stop  # or kill -TERM if needed
    ```
  </Step>

  <Step title="Replace the current DB with the snapshot">
    ```bash theme={null}
    mv ~/.crewship/crewship.db ~/.crewship/crewship.db.bad
    cp ~/.crewship/crewship.db.pre-migrate-v66-to-v87-20260514T143012Z.bak \
       ~/.crewship/crewship.db
    ```
  </Step>

  <Step title="Run with the PREVIOUS binary (the one that wrote the snapshot)">
    ```bash theme={null}
    /path/to/older/crewship start
    ```
  </Step>
</Steps>

The snapshot is a complete SQLite file — no special tooling needed.
Tools like `sqlite3` open it directly for inspection.

### Limits of pre-migration snapshots vs manual backups

| Capability               | Pre-migration snapshot    | `crewship backup create`                 |
| ------------------------ | ------------------------- | ---------------------------------------- |
| Encrypted at rest        | No (chmod 0600 only)      | Yes (AGE)                                |
| Portable across hosts    | Same machine only         | Yes                                      |
| Includes workspace files | No (DB only)              | Yes (DB + bind-mount + memory)           |
| Captures container state | No                        | Yes (snapshots running containers)       |
| Cross-version restore    | Snapshot binary only      | Forward-compatible via `restoreBackfill` |
| Triggered automatically  | Yes, on pending migration | No (manual or scheduled)                 |
| Retention                | 10 newest per DB file     | Operator-controlled via `rotate`         |

**Use both.** Pre-migration snapshots cover binary upgrades; `crewship
backup` covers DR, host migration, legal hold, and forensic
preservation. They are not interchangeable.

## Related

* [Backup API reference](/api-reference/backup) — REST endpoint shapes for every `crewship backup` subcommand.
* [Admin CLI](/guides/admin-cli) — the host-shell write surface that backups protect against; always `crewship backup create` before running admin writes in prod.
* [Migrations Catalog](/guides/migrations) — what migration ran when, and which `restoreBackfill` hook (if any) will fire on a cross-version restore.
* [Troubleshooting](/guides/troubleshooting) — recovering from a stuck migration or corrupt DB.
* [Security → Audit log](/security/audit) — where `backup.created`, `backup.restored`, and `backup.restore.dry_run` events land.