Backup & Restore
Overview
Crewship’s backup system produces portable .tar.zst bundles that capture a workspace (or a single crew) in one file. Bundles are AGE-encrypted by default, carry a versioned manifest, and can be restored on any Crewship instance that supports the bundle’s format version (N-2 compatibility guarantee). The whole subsystem is admin-only by design: every backup subcommand requires the OWNER or ADMIN role on the workspace, and the runner refuses MEMBER / VIEWER calls at both the CLI parsing layer and the server-side handler. This is defence in depth, not just a UI veneer.
The architectural choice that shapes everything else is “one bundle, one file”. Backups don’t depend on an external object store, don’t require a sidecar, and don’t shard across files. A .tar.zst is a single artefact an operator can scp to a backup host, hand to a customer for legal hold, or check into a private bucket, all without the rest of Crewship being available. Inside the tarball, the manifest is plaintext JSON (so crewship backup inspect can read it without the AGE recipient key) and the payload is the encrypted SQLite snapshot plus any referenced workspace files. The forward-compatible manifest schema means a bundle produced on format version N restores on servers at N, N+1, and N+2; older bundles run their migration’s restoreBackfill hook so columns added since the snapshot are populated sanely instead of left at SQL defaults.
Restores are intentionally cautious. --dry-run walks the entire restore plan (schema diff, row counts, blob deltas) and prints what would happen without committing any change to the destination database. Advisory locking (backup_locks table, per-workspace) prevents two concurrent restore runs from corrupting each other, and the lock row records the host + PID so crewship backup status can tell you who’s holding it. If a host crashes mid-restore, crewship backup unlock is the manual recovery. It is admin-only with a confirmation prompt, because clearing a real lock from another live process is how you trash a workspace.
When to use it
Backups are cheap to make and expensive to wish you’d made. The five canonical reasons to run crewship backup create:
- Before any destructive admin write. Before crewship admin reset-password, before a database migration on a binary upgrade, before a large schema-changing PR lands in prod: capture the current state first. The bundle is the one-command rollback if anything goes wrong.
- Disaster recovery / hot spare. Schedule a nightly crewship backup create --scope=workspace --passphrase-file … cron and ship the bundle to a separate host. If the primary disk dies, restoring onto a fresh binary is one crewship backup restore away: no replication agent, no streaming WAL, no extra moving parts.
- Workspace migration to another host. Moving a workspace from a dev VM to a prod host (or between two prod hosts) is exactly what bundles are for. Create on the source, scp the file, crewship backup restore --as-workspace <new-slug> on the destination, then provision crews. The --as-workspace rename avoids the “two acme workspaces colliding” hazard.
- Legal hold or compliance archive. Customer leaves; you need to keep their workspace state on cold storage for N years. One AGE-encrypted .tar.zst is a forever-readable artefact: no live database, no service required. inspect later proves the bundle’s contents without decrypting.
- Forensic snapshot before incident response. Suspected compromise of an admin account, or a “what was the state when X happened” investigation. backup create + immediate offsite copy preserves the audit trail before anyone (you, the attacker, the well-meaning oncall) starts changing things.
Key concepts
| Term | What it means here |
|---|---|
| Bundle | A single .tar.zst artefact containing one MANIFEST (plaintext JSON), one payload.age (AGE-sealed tar of DB rows + workspace files), and one payload.sha256. The whole backup is this one file — portable, hashable, copy-able with scp. |
| Scope | workspace (the workspace row + every crew under it) or crew (a single crew + its agents). Crew bundles restore independently of their parent workspace, so a “move this crew to another workspace” is a viable migration path. |
| Format version | Integer in MANIFEST.format_version. Compatibility guarantee is N-2 — a bundle produced on format v3 restores on v3, v4, and v5 servers. Bumped only when a migration changes the bundle layout itself (not the schema rows inside). |
| restoreBackfill hook | Per-migration Go function; the replay logic lives in internal/backup/runner_restore.go (replayRestoreBackfills). Runs when a bundle predates a migration and the restoring server has columns the source didn’t. Pure ADD COLUMN migrations rely on the SQL DEFAULT; complex backfills (JSON shape changes, foreign keys) provide a hook so restored rows land sanely. |
| AGE encryption | The bundle’s payload is sealed with filippo.io/age. Default mode is passphrase (scrypt-derived key); --recipient age1… switches to X25519 public-key encryption for hand-offs. --no-encrypt produces a plaintext payload for test/CI use only. |
| Passphrase keyring | Opt-in cache at ~/.crewship/backup-keyring.enc (AES-256-GCM-encrypted under the host’s ENCRYPTION_KEY; see Passphrase keyring) so operators don’t retype the passphrase on every create / restore. Enabled via --use-keyring on create / restore. |
| Advisory lock | A row in the backup_locks table, primary-keyed on workspace ID. Taken before any DB dump or docker pause; released by defer Release() on the happy path. 1-hour TTL (DefaultLockTTL) so a crashed backup self-heals after the window. |
| refuseIfBackupInProgress guard | A shared middleware wired into the assignments, peer-query, and webhook handlers. Reads the lock state and refuses new agent runs while a backup is in progress, which closes the TOCTOU window between ensureAgentsIdle (initial check) and docker pause (the actual freeze). |
| Dry-run restore | --dry-run on crewship backup restore. Decrypts, validates the manifest, replays the DB transaction, then rolls back. The only side effect is one backup.restore.dry_run audit row — distinct from backup.restore so auditors can tell “verified” apart from “actually restored”. |
| --as-workspace / --as-crew rename | Restore the bundle under a new slug instead of the original. Refuses to run the docker phase (container names derive from the slug) and tells the operator to crewship crew provision afterwards. Avoids “two acme workspaces colliding” during DR drills. |
| Retention sweep | crewship backup rotate --keep-last N --keep-days D. Per-workspace — never touches another workspace’s bundles. Both flags can be combined; both must be positive. --dry-run lists what would be deleted without touching disk. |
| Stale lock | A lock whose holder process is dead but whose row still exists. Detected when acquired_at + DefaultLockTTL < now(). Auto-released on next create; manually clearable via crewship backup unlock --force (admin-only, with confirmation). |
| Pre-migration snapshot | An automatic safety net distinct from the manual bundle system. Whenever crewship start detects pending migrations, database.SnapshotBeforeMigrate writes a VACUUM INTO copy of the live SQLite database to <dbpath>.pre-migrate-vN-to-vM-<UTC>.bak before any DDL runs. Last 10 snapshots are retained per database; opt out with CREWSHIP_SKIP_MIGRATION_BACKUP=1. Not a substitute for proper backups — it’s the binary upgrade safety net, not a disaster-recovery primitive. |
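The N-2 rule in the Format version row can be sketched as a simple window check. This is an illustration only: the function names and the COMPAT_WINDOW constant are hypothetical, and it assumes a server never accepts a bundle newer than its own format version. Only the MANIFEST.format_version field comes from the text above.

```python
import json

# Assumption: "N-2" means a server at format version S accepts bundles
# produced at S, S-1, or S-2, and rejects anything newer than S.
COMPAT_WINDOW = 2  # hypothetical constant name


def bundle_format_version(manifest_json: str) -> int:
    """Read format_version from the plaintext MANIFEST JSON."""
    return int(json.loads(manifest_json)["format_version"])


def can_restore(bundle_version: int, server_version: int) -> bool:
    """Return True when the bundle falls inside the server's N-2 window."""
    return server_version - COMPAT_WINDOW <= bundle_version <= server_version
```

With these assumptions a v3 bundle is accepted by v3, v4, and v5 servers and refused by v6, which matches the example in the table.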
Usage
The whole backup surface is the crewship backup command group: twelve subcommands that cover the create → verify → restore → rotate lifecycle (plus metrics, download, and self-test). The core nine are tabulated below with full flag reference and copy-pasteable examples; metrics is covered under Metrics, download under Known caveats. This table is the entry point.
| Command | Purpose | Deep dive |
|---|---|---|
| crewship backup create | Produce a new bundle for a workspace or crew. | Create a workspace backup / Back up a single crew |
| crewship backup list | List bundles on disk. | List, inspect, verify |
| crewship backup inspect | Print a bundle’s manifest without decrypting the payload. | List, inspect, verify |
| crewship backup verify | Checksum the sealed payload against the manifest (no decryption, no DB writes). | List, inspect, verify |
| crewship backup restore | Restore a bundle (supports --dry-run). | Restore |
| crewship backup delete | Remove a single bundle (interactive confirm or --force). | Delete & rotate |
| crewship backup rotate | Retention sweep: --keep-last N / --keep-days D. | Delete & rotate |
| crewship backup status | Show the advisory lock state for the current workspace. | Lock semantics |
| crewship backup unlock | Release a stale lock owned by this host (admin-only, confirm required). | Lock semantics |
The typical lifecycle: create to produce the bundle, verify to prove it’s not corrupt, restore --dry-run to prove the destination will accept it, then restore for real. Every other subcommand exists for retention (rotate), introspection (list, inspect, status), or recovery (unlock).
create and restore accept --use-keyring to cache and reuse the
workspace passphrase via ~/.crewship/backup-keyring.enc. See
Passphrase keyring below.
Bundle layout
- Scope: workspace (workspace row + all its crews) or crew (single crew + its agents).
- Encryption: AGE passphrase (default) or AGE X25519 recipient. --no-encrypt produces a plaintext payload for test / CI use.
- Default location: ~/.crewship/backups/ on the server, mode 0700.
- Naming: crewship-<scope>-<slug>-<iso-ts>.tar.zst. Collisions append -<hash8>.
Create a workspace backup
Non-interactive / CI
Supply the passphrase from a file with --passphrase-file (the CLI prompts for one when --passphrase-file is not set).
Asymmetric encryption
If the restoring party holds an AGE X25519 keypair, pass their public key via --recipient instead of a shared secret. --recipient, --passphrase-file, and --no-encrypt are mutually exclusive.
Back up a single crew
--crew accepts either a slug or a crew ID. Crew-scope bundles restore independently of their parent workspace.
List, inspect, verify
inspect only reads the plaintext MANIFEST — it never touches the sealed payload, so no passphrase is needed. verify recomputes the SHA-256 of the sealed bytes against the manifest and fails if the bundle was truncated or tampered with. Neither decrypts.
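The integrity check that verify performs can be sketched as a streaming SHA-256 comparison. This is a minimal sketch, not Crewship’s implementation: the function name and chunk size are hypothetical, and it assumes the expected digest is available as a hex string (for example from the bundle’s payload.sha256 entry).

```python
import hashlib
import hmac
from typing import BinaryIO


def sealed_payload_matches(payload: BinaryIO, expected_hex: str,
                           chunk_size: int = 1 << 20) -> bool:
    """Hash the sealed (still encrypted) bytes and compare to the recorded digest.

    The payload is read in fixed-size chunks, so memory use stays flat
    regardless of bundle size. Nothing is decrypted.
    """
    digest = hashlib.sha256()
    for chunk in iter(lambda: payload.read(chunk_size), b""):
        digest.update(chunk)
    # Constant-time comparison; both sides are ASCII hex strings.
    return hmac.compare_digest(digest.hexdigest(), expected_hex.strip().lower())
```

A truncated or modified payload produces a different digest, so the function returns False without ever needing the passphrase.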
Restore
Pass --as-workspace <new-slug> or --as-crew <new-slug> to land the payload under a fresh identity.
Dry run
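A dry run decrypts the bundle, validates the manifest, replays the DB transaction, then rolls back. The only side effect is one backup.restore.dry_run row in the audit log, which is handy for proving a bundle is restorable before the real cutover. The distinct audit action lets auditors tell “verified” apart from “actually restored”.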
Delete & rotate
rotate applies retention per workspace — it never touches another workspace’s bundles. Either --keep-last N (bundles above the N newest are dropped) or --keep-days D (bundles older than D days are dropped) must be positive; both can be combined.
delete requires interactive confirmation, or --force in scripts / CI. The same rule applies to backup unlock.
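The retention rules can be sketched as a pure selection function. This is an illustration under stated assumptions: the function name and the (name, created_at) tuple shape are hypothetical, and it assumes that when both flags are given a bundle is dropped if either rule selects it. The real rotate command may combine the rules differently.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


def bundles_to_delete(bundles: List[Tuple[str, datetime]],
                      keep_last: Optional[int] = None,
                      keep_days: Optional[int] = None,
                      now: Optional[datetime] = None) -> List[str]:
    """Return bundle names to drop, newest first, for ONE workspace."""
    if keep_last is None and keep_days is None:
        raise ValueError("pass keep_last, keep_days, or both")
    for value in (keep_last, keep_days):
        if value is not None and value <= 0:
            raise ValueError("retention values must be positive")
    now = now or datetime.now(timezone.utc)
    newest_first = sorted(bundles, key=lambda b: b[1], reverse=True)
    doomed = set()
    if keep_last is not None:
        # Everything beyond the N newest bundles is dropped.
        doomed.update(name for name, _ in newest_first[keep_last:])
    if keep_days is not None:
        # Everything older than D days is dropped.
        cutoff = now - timedelta(days=keep_days)
        doomed.update(name for name, created in newest_first if created < cutoff)
    return [name for name, _ in newest_first if name in doomed]
```

Returning the list instead of deleting files mirrors what --dry-run reports: the caller can print it or act on it.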
Lock semantics
Each workspace holds at most one advisory backup lock at a time (table backup_locks, per-workspace PK). The lock:
- Is taken before any DB dump or docker pause and released by a deferred Release() on the happy path.
- Has a 1-hour TTL (DefaultLockTTL); a crashed backup self-heals after the window.
- Blocks concurrent backup create calls: the second caller gets HTTP 409 Conflict with a “another backup is already in progress” message.
- Blocks new agent runs via the shared refuseIfBackupInProgress guard wired into the assignments, peer-query, and webhook handlers. This closes the TOCTOU window between ensureAgentsIdle and docker pause.
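The lock behaviour listed above (one row per workspace, a TTL after which a dead holder’s row is reclaimed, a refusal for the second caller) can be sketched with SQLite. This is a sketch, not the Go implementation: the column names, the holder string format, and the function name are assumptions. Only the backup_locks table name, the per-workspace primary key, and the 1-hour TTL come from the text.

```python
import sqlite3
import time
from typing import Optional

DEFAULT_LOCK_TTL_SECONDS = 3600  # mirrors the documented 1-hour DefaultLockTTL

# Hypothetical column layout; the primary key enforces one lock per workspace.
SCHEMA = """
CREATE TABLE IF NOT EXISTS backup_locks (
    workspace_id TEXT PRIMARY KEY,
    holder       TEXT NOT NULL,
    acquired_at  REAL NOT NULL
)
"""


def try_acquire_lock(conn: sqlite3.Connection, workspace_id: str,
                     holder: str, now: Optional[float] = None) -> bool:
    """Take the workspace lock; return False if a live lock already exists."""
    now = time.time() if now is None else now
    try:
        with conn:  # one transaction: commit on success, roll back on error
            # Reclaim a stale lock: acquired_at + TTL < now.
            conn.execute(
                "DELETE FROM backup_locks "
                "WHERE workspace_id = ? AND acquired_at + ? < ?",
                (workspace_id, DEFAULT_LOCK_TTL_SECONDS, now),
            )
            conn.execute(
                "INSERT INTO backup_locks (workspace_id, holder, acquired_at) "
                "VALUES (?, ?, ?)",
                (workspace_id, holder, now),
            )
    except sqlite3.IntegrityError:
        return False  # a server would answer this case with 409 Conflict
    return True
```

Letting the primary-key constraint reject the second insert keeps the check and the write in one atomic step, so two callers cannot both believe they hold the lock.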
Examples
Nightly hot-spare backup with 14-day retention
A workspace on prod-server.example.com should produce a bundle every night, ship it to a separate backup host, and keep 14 days on disk locally as a fast-rollback safety net.
The job runs from a systemd timer, crewship-backup.timer, that fires at 02:37 daily (off-the-hour to avoid clustering with the rest of the fleet). The --use-keyring flag means the passphrase is read once and cached, so the timer doesn’t have to redeliver it. The rotate step runs after the rsync so local-disk pressure is bounded even on long runs without a remote-side sweep.
Workspace migration to a new host
The acme workspace lives on crewship-old. You’re moving it to crewship-new to retire the old host. The destination already has its own workspaces, so a same-slug restore would conflict.
acme on the old host gets archived to cold storage before its workspace row is deleted.
DR drill with --dry-run
Every quarter the team proves the disaster-recovery bundle is actually restorable, without disturbing production state. Run the drill on a throwaway VM:
A backup.restore.dry_run audit entry shows up in the workspace journal. Auditors looking at “did anyone restore prod?” can tell drill runs apart from real restores by the action name.
API reference
The backup surface is CLI-first: every operation is reachable via crewship backup … and every flag the CLI accepts maps to an HTTP body field. The full HTTP schema lives at /api-reference/backup; a quick orientation:
| Method | Path | What it does |
|---|---|---|
| POST | /api/v1/admin/backups | Create a bundle. Auth + OWNER/ADMIN. Same trust gate as the CLI; concurrent calls hit the advisory lock and the second returns 409. |
| GET | /api/v1/admin/backups | List bundles on disk for the current workspace. |
| GET | /api/v1/admin/backups/inspect | Plaintext manifest only, equivalent to crewship backup inspect. Takes the bundle as a query param; no passphrase required. |
| GET | /api/v1/admin/backups/verify | SHA-256 integrity check of the sealed bytes against the manifest. Does not decrypt, so no passphrase is required. No DB writes. |
| POST | /api/v1/admin/backups/restore | Restore. Body accepts dry_run, as_workspace, as_crew, and replace (v3 DR wipe-and-replace of existing workspace contents; workspace-scope only, incompatible with as_workspace/as_crew). Holds the lock for the whole call. |
| DELETE | /api/v1/admin/backups | Delete a bundle. Idempotent: 404 is not an error in scripted use. |
| POST | /api/v1/admin/backups/rotate | Retention sweep. Body accepts keep_last, keep_days, dry_run. |
| GET | /api/v1/admin/backups/status | Inspect the advisory lock state (used by crewship backup status). |
| DELETE | /api/v1/admin/backups/status | Force-release a stale lock. Requires force=true body field; never call from automation without a confirmation gate. |
These routes are registered in internal/api/router_admin.go. The CLI talks to them directly when run against a remote host (--server flag) and falls back to in-process Go calls when run on the same machine as crewshipd, bypassing HTTP entirely for the host-shell use case. The two paths share the same handler functions, so flags work identically.
Webhook payloads for backup.created / backup.restored / backup.failed events are documented separately under Webhooks; metric emissions are documented under Metrics. (Dry-run restores deliberately fire no webhook — they are not real restores.)
Streaming & memory bounds (large backups)
Restore and verify stream the sealed payload to a temp directory rather than buffering per-crew sections in a map[slug][]byte. Peak heap stays bounded by the zstd decoder window regardless of bundle size, so multi-GB restores run cleanly on small hosts. The extraction scratch directory is os.TempDir()/crewship-backup-… and is removed on Close(); a killed process leaves it behind for the next os.TempDir cleanup.
Passphrase keyring
The --use-keyring flag on create and restore caches the workspace
passphrase in ~/.crewship/backup-keyring.enc so scripts and repeat
rehearsals don’t re-prompt. The file is an AES-256-GCM-encrypted JSON
map keyed by workspace ID, using the same v1:<base64> envelope as the
credstore — without the host’s ENCRYPTION_KEY the contents are
unreadable even if the file leaks.
- No silent fallback on failure. Opening the keyring or reading an entry reports the real error and aborts; only ErrKeyringEntryNotFound (first use on this workspace) falls through to a prompt.
- Write failures are non-fatal on create. The bundle is already written when the keyring save runs; the CLI logs a warning and continues.
- Keyring is local to the operator’s host. --use-keyring always writes to ~/.crewship/ on the invoking machine. Even if a future remote bundle backend (S3 / GCS) is configured, the passphrase never travels with the bundle.
- Single-process mutex, not file-locked. Two concurrent CLI invocations against the same workspace are last-write-wins (the file is small and the failure mode is “one passphrase lost, never data corruption”). Filesystem-level locking is on the v0.2 roadmap.
- --passphrase-file takes precedence. When both flags are passed, the file wins and the keyring is not consulted.
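The precedence rules above can be sketched as a small resolution function. Everything here is illustrative: resolve_passphrase, KeyringEntryNotFound (a stand-in for ErrKeyringEntryNotFound), and the keyring_get / prompt callables are hypothetical names, and stripping the trailing newline from the passphrase file is an assumption.

```python
from typing import Callable, Optional


class KeyringEntryNotFound(Exception):
    """Stand-in for ErrKeyringEntryNotFound: no entry for this workspace yet."""


def resolve_passphrase(passphrase_file: Optional[str],
                       use_keyring: bool,
                       keyring_get: Callable[[], str],
                       prompt: Callable[[], str]) -> str:
    """Pick the passphrase source in the documented order."""
    # 1. --passphrase-file wins; the keyring is not consulted at all.
    if passphrase_file is not None:
        with open(passphrase_file, encoding="utf-8") as fh:
            return fh.read().rstrip("\n")  # assumption: trailing newline dropped
    # 2. Keyring, if requested. Only "entry not found" falls through;
    #    any other error propagates and aborts the command.
    if use_keyring:
        try:
            return keyring_get()
        except KeyringEntryNotFound:
            pass
    # 3. Interactive prompt as the last resort.
    return prompt()
```

Catching only the not-found case is the point of the first bullet: a corrupt or unreadable keyring should stop the run instead of silently degrading to a prompt.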
Webhooks
Set CREWSHIP_BACKUP_WEBHOOK_URL (and CREWSHIP_BACKUP_WEBHOOK_SECRET)
on the server process to receive a signed POST for each backup lifecycle
event. Delivery is fire-and-forget from a goroutine — a slow or down
webhook never blocks the backup run.
Event types: backup.created, backup.failed, backup.restored.
Each request carries an X-Crewship-Signature: sha256=<hex> header —
HMAC-SHA256 over the raw body using CREWSHIP_BACKUP_WEBHOOK_SECRET.
Receivers must verify the signature (same scheme as Crewship’s
inbound webhooks; validate via webhook.ValidateHMAC after stripping
the sha256= prefix). The secret is required whenever URL is set —
sending a body unsigned would let any network listener forge events to
a downstream consumer that trusts the feed. URLs with userinfo or query
strings are redacted before ever appearing in logs / audit rows, so
basic-auth credentials or signed-URL tokens do not leak.
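On the receiving side, the signature check described above can be sketched as follows. The function name is hypothetical and the request handling around it is left out; only the X-Crewship-Signature header format (sha256= followed by hex) and HMAC-SHA256 over the raw body come from the text.

```python
import hashlib
import hmac

SIGNATURE_PREFIX = "sha256="


def verify_signature(raw_body: bytes, header_value: str, secret: bytes) -> bool:
    """Validate an X-Crewship-Signature header against the raw request body."""
    if not header_value.startswith(SIGNATURE_PREFIX):
        return False
    received = header_value[len(SIGNATURE_PREFIX):].strip().lower()
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids leaking how many
    # leading characters matched.
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))
```

The HMAC must be computed over the bytes exactly as received, before any JSON parsing or re-serialisation, otherwise whitespace or key-order changes break the match.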
Metrics
GET /api/v1/admin/backups/metrics (instance OWNER only; see below)
returns a point-in-time snapshot of process-lifetime counters. The
numbers reset on restart — persistent observability belongs in the
audit log and its dashboards.
For history that survives restarts, query the backup.* rows from audit_log.
Instance-scope backup
An instance-scope backup bundles every workspace on a Crewship host plus the cross-workspace surfaces that make the install usable — the credstore, the auth signing secret, and the instance identity (instance_config.hostname). It is the disaster-recovery path for an
entire host, not a normal operational tool.
Key differences from workspace/crew scope:
- Access control. Gated by the CREWSHIP_OWNER_EMAIL env var (server-level OWNER), not workspace role. A workspace OWNER / ADMIN on their own is refused with HTTP 403.
- Rate limit. One instance backup per user per sliding hour. A runaway cron cannot DoS the host.
- Encryption is recipient-only. --passphrase-file is refused for this scope: the surface is too broad (every workspace’s secrets in one blob) to trust a brute-forceable passphrase. Callers must supply an AGE age1… X25519 public key and hold the matching private key offline.
- Cross-host restores force session-key rotation. The bundle records the source hostname; a restore onto a different target invalidates every existing JWE session to prevent source-host tokens from remaining valid after DR.
Admin UI
A Backups tab lives in /admin for OWNER / ADMIN users. It wraps the
same REST endpoints the CLI drives and adds:
- A status banner for the advisory lock (who holds it, TTL remaining).
- Create / restore dialogs with passphrase input (no keyring: the keyring is a CLI-side convenience; the browser never sees ~/.crewship/).
- An inspect panel that renders the plaintext manifest without decrypting the payload (same as crewship backup inspect).
- A bundle list with size, scope, format version, and created-at, fetched via hooks/use-backups.ts.
Known caveats (v0.2 roadmap)
- preBackup/postBackup hooks: no user-defined hooks yet. If your workspace has services that need an app-level flush, run them manually before invoking crewship backup create.
- Remote backends (S3 / B2 / GCS): bundles live on the server’s local disk only. The storage layer is now abstracted behind a StorageOps interface so a future backend swap won’t require a second refactor of every call-site. Today: use scp / rclone / restic to ship bundles off-box, or stream a single bundle via GET /api/v1/admin/backups/download.
- Scheduled backups: no built-in scheduler. Wrap crewship backup create in cron or systemd.timer.
- Forward migration replay hooks. The plumbing is wired (migrations can register a per-version restoreBackfill function, and the restorer walks the applied ∖ manifest set in ascending order after the main transaction commits), but no migration registers a hook yet. A failed backfill surfaces as ErrRestoreBackfillFailed; the restored rows are visible but may be missing backfilled columns until an admin investigates.
- Cross-process keyring lock. The per-process mutex does not cover two concurrent CLI invocations racing on the same keyring file.
Common pitfalls
backup unlock --force on a live backup corrupts the bundle
Releasing the lock while its holder is still running lets a second backup or new agent runs interleave with the holder’s docker pause / docker unpause sequence. Only ever clear a lock when you can prove its holder is dead (CLI session crashed, host rebooted) and the 1-hour TTL hasn’t fired yet. When in doubt, wait for the TTL.
CREWSHIP_DATA_DIR mismatch silently targets the wrong database
If the server runs with CREWSHIP_DATA_DIR set and you invoke crewship backup … without the same env var, the CLI defaults to ~/.crewship and operates on an empty/separate database. Backups succeed but capture nothing useful; restores produce “workspace doesn’t exist” errors. Export the same CREWSHIP_DATA_DIR the server uses (there is no --data-dir flag; the env var is the only override).
Stop the server before a restore against the same database
A running crewshipd may still hold open handles on tables the restorer wants to truncate. Symptoms range from “database is locked” SQLite errors to a half-applied restore that leaves the workspace in a non-bootable state. The lock guards against concurrent backups, not against the server’s own writes, so stop the service first.
Format version drift breaks restores past N-2
A bundle whose format version is more than two behind the destination server falls outside the N-2 guarantee. Check the bundle’s format_version against the destination first with crewship backup inspect.
--recipient / --passphrase-file / --no-encrypt are mutually exclusive
A bundle created with --recipient age1… will not decrypt with a passphrase, and vice versa; match the restore flag to the original create flag.
Same-slug restore refuses by default
crewship backup restore will not overwrite an existing workspace or crew under the same slug. If the existing one is stale and you want to replace it, delete it first (a crew via crewship crew delete; a workspace must be removed out-of-band, since there is no workspace-delete command; see the workspaces API note); if you want both side-by-side, use --as-workspace=<new-slug> and then crewship crew provision.
Disk space during restore is not bounded by the heap streaming guarantee
The extraction scratch directory $TMPDIR/crewship-backup-* needs ~2× the bundle size on disk. Restoring a 4 GB bundle onto a host with 6 GB free in /tmp will fail mid-stream. Mount /tmp on the data volume or set TMPDIR to a larger filesystem.
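A pre-flight check for the scratch-space requirement could look like the sketch below. The function names and the SCRATCH_FACTOR constant are hypothetical; the only figure taken from the text is the roughly 2× multiplier, and the check is a wrapper an operator might run before a restore, not something the CLI is stated to do.

```python
import os
import shutil
import tempfile
from typing import Optional

SCRATCH_FACTOR = 2  # the "~2x the bundle size" figure from the text


def required_scratch_bytes(bundle_size: int) -> int:
    """Estimate scratch space needed to extract a bundle of the given size."""
    return SCRATCH_FACTOR * bundle_size


def has_restore_headroom(bundle_path: str, scratch_dir: Optional[str] = None) -> bool:
    """Return True if the scratch filesystem has room for the extraction."""
    # tempfile.gettempdir() honours TMPDIR, like the scratch path in the text.
    scratch_dir = scratch_dir or tempfile.gettempdir()
    needed = required_scratch_bytes(os.path.getsize(bundle_path))
    return shutil.disk_usage(scratch_dir).free >= needed
```

For the example above, a 4 GB bundle needs about 8 GB of scratch space, so 6 GB free in /tmp is not enough.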
Scripted retries without --use-keyring deadlock on the passphrase prompt
Use --passphrase-file, pipe the passphrase via stdin redirect, or use --use-keyring so the first call seeds the cache.
Bundles include workspace bind-mount files
Everything under /workspace, agent memory markdown, anything mounted into the agent container: all of it is included in the bundle. AGE encryption is the only defence; treat a .tar.zst like any other backup of source code and credentials.
/secrets is the only directory NEVER included
Secrets are mounted at /secrets at runtime via Keeper, and including them in a portable bundle would defeat the SECRET-tier guarantee. Don’t add code that bypasses the secret-skipping filter without an equally rigorous out-of-band channel.
Security notes
- The /secrets mount is never included in a bundle.
- The workspace bind mount (user code) and memory markdown files are included. AGE encryption of the payload is the only defence against leakage from physical bundle distribution, so treat bundles like any other backup of your source tree.
- Every backup.* event writes to audit_log with user, role, scope, sealed-payload SHA-256, and size. Dry-run restores write backup.restore.dry_run so auditors can tell rehearsals from real cutover.
Automatic pre-migration snapshots
Distinct from the manual crewship backup bundle system above:
Crewship auto-snapshots the SQLite DB before any pending migration
runs. This is the “binary upgrade went sideways, give me my data
back” safety net.
How it works
Every crewship start calls
database.SnapshotBeforeMigrate before database.Migrate. If any
migrations are pending (rows in the migrations[] slice with version >
MAX(version) FROM _migrations), the function:
- Resolves the SQLite file path from the connection DSN.
- Computes a target name: <dbpath>.pre-migrate-v<from>-to-v<to>-<UTC-RFC3339>.bak.
- Issues VACUUM INTO '<target>', SQLite’s hot-copy mechanism. Safer than a plain file copy because it serializes against concurrent writers and produces a defragmented, WAL-checkpointed snapshot.
- Chmods the snapshot to 0600.
- Prunes older snapshots, keeping the 10 most recent per database file.
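The steps above can be sketched in a few lines. This is an illustration, not the Go database.SnapshotBeforeMigrate function: the Python function name, its parameters, and the use of file modification time to order snapshots for pruning are assumptions. It also takes the database path directly instead of resolving it from a DSN.

```python
import glob
import os
import sqlite3
from datetime import datetime, timezone

KEEP_SNAPSHOTS = 10  # the documented retention count


def snapshot_before_migrate(db_path: str, from_version: int, to_version: int,
                            keep: int = KEEP_SNAPSHOTS) -> str:
    """Write a VACUUM INTO copy of db_path and prune old snapshots."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    target = f"{db_path}.pre-migrate-v{from_version}-to-v{to_version}-{stamp}.bak"

    # Autocommit mode: VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        # VACUUM INTO fails if the target already exists, so it never
        # overwrites an earlier snapshot. Quotes in the path are escaped.
        conn.execute("VACUUM INTO '" + target.replace("'", "''") + "'")
    finally:
        conn.close()
    os.chmod(target, 0o600)

    # Keep the new snapshot plus the (keep - 1) most recent older ones.
    pattern = glob.escape(db_path) + ".pre-migrate-*.bak"
    older = sorted((p for p in glob.glob(pattern) if p != target),
                   key=os.path.getmtime, reverse=True)
    for stale in older[max(keep - 1, 0):]:
        os.remove(stale)
    return target
```

Excluding the fresh snapshot from the prune list guarantees that a retention sweep can never delete the copy that was just taken.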
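Opting out
Set CREWSHIP_SKIP_MIGRATION_BACKUP=1 in the server’s environment to skip the pre-migration snapshot.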
Recovering from a bad migration
If crewship start succeeds, applies a migration, then exhibits
runtime errors that point at schema drift:
The snapshot is an ordinary SQLite database file, so sqlite3 can open it directly for inspection.
Limits of pre-migration snapshots vs manual backups
| Capability | Pre-migration snapshot | crewship backup create |
|---|---|---|
| Encrypted at rest | No (chmod 0600 only) | Yes (AGE) |
| Portable across hosts | Same machine only | Yes |
| Includes workspace files | No (DB only) | Yes (DB + bind-mount + memory) |
| Captures container state | No | Yes (snapshots running containers) |
| Cross-version restore | Snapshot binary only | Forward-compatible via restoreBackfill |
| Triggered automatically | Yes, on pending migration | No (manual or scheduled) |
| Retention | 10 newest per DB file | Operator-controlled via rotate |
Pre-migration snapshots cover rollback after a binary upgrade; crewship backup covers DR, host migration, legal hold, and forensic
preservation. They are not interchangeable.
Related
- Backup API reference: REST endpoint shapes for every crewship backup subcommand.
- Admin CLI: the host-shell write surface that backups protect against; always crewship backup create before running admin writes in prod.
- Migrations Catalog: what migration ran when, and which restoreBackfill hook (if any) will fire on a cross-version restore.
- Troubleshooting: recovering from a stuck migration or corrupt DB.
- Security → Audit log: where backup.created, backup.restored, and backup.restore.dry_run events land.