# Orchestration

> Multi-agent missions with task dependencies, retry loops, deadlock detection, and circuit breakers.

Crewship's orchestration system manages multi-agent missions through the `MissionEngine` (`internal/orchestrator/mission.go`). It handles task scheduling, dependency resolution, failure recovery, and cross-crew coordination.

<Note>
  "Orchestration" here means the **engine subsystem** in `internal/orchestrator/`, not a navigable page. After the Plan/Run/Build/System IA refactor, the user-facing surfaces are split: [Routines](/guides/routines) for reusable recipes, **Issues** for the work-item tracker, [Inbox](/guides/inbox) for your actionable feed, and [Activity](/guides/activity) for the live trace canvas. The legacy `/orchestration` route now soft-redirects to `/activity`.
</Note>

## Mission Lifecycle

```
PLANNING --> IN_PROGRESS --> REVIEW --> COMPLETED
                |              |
                v              v
             FAILED        CANCELLED
```

A mission progresses through these states:

1. **PLANNING** -- Mission created, tasks defined (or waiting for the Lead to plan)
2. **IN\_PROGRESS** -- Tasks being scheduled and executed
3. **REVIEW** -- All tasks finished (none failed). The mission enters review before final completion, allowing humans to inspect results
4. **COMPLETED** -- Mission accepted after review
5. **FAILED** -- A task failed and could not recover, or deadlock/timeout detected
6. **CANCELLED** -- Manually stopped by user or system

<Note>
  The REVIEW state is inserted between IN\_PROGRESS and COMPLETED. When all tasks reach a terminal state (COMPLETED, FAILED, or SKIPPED) and none have failed, the mission transitions to REVIEW rather than directly to COMPLETED. If any task failed, the mission transitions to FAILED instead.
</Note>
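
The completion rule above can be sketched as a small pure function. This is an illustration only: `nextMissionStatus` and `isTerminal` are hypothetical names, not functions from `internal/orchestrator`, and the sketch assumes any retries have already been exhausted by the time a task counts as FAILED.

```go
package main

import "fmt"

// TaskStatus uses the task state names from this page.
type TaskStatus string

const (
	TaskRunning   TaskStatus = "RUNNING"
	TaskCompleted TaskStatus = "COMPLETED"
	TaskFailed    TaskStatus = "FAILED"
	TaskSkipped   TaskStatus = "SKIPPED"
)

// isTerminal reports whether a task has reached COMPLETED, FAILED, or SKIPPED.
func isTerminal(s TaskStatus) bool {
	return s == TaskCompleted || s == TaskFailed || s == TaskSkipped
}

// nextMissionStatus is a hypothetical helper: it returns the mission status
// implied by the current task statuses.
func nextMissionStatus(tasks []TaskStatus) string {
	if len(tasks) == 0 {
		return "IN_PROGRESS" // no tasks yet: the Lead planning phase applies
	}
	anyFailed := false
	for _, s := range tasks {
		if !isTerminal(s) {
			return "IN_PROGRESS" // work remains, keep looping
		}
		if s == TaskFailed {
			anyFailed = true
		}
	}
	if anyFailed {
		return "FAILED"
	}
	return "REVIEW" // never straight to COMPLETED
}

func main() {
	fmt.Println(nextMissionStatus([]TaskStatus{TaskCompleted, TaskSkipped})) // REVIEW
	fmt.Println(nextMissionStatus([]TaskStatus{TaskCompleted, TaskFailed}))  // FAILED
	fmt.Println(nextMissionStatus([]TaskStatus{TaskCompleted, TaskRunning})) // IN_PROGRESS
}
```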

## The Mission Engine

The `MissionEngine` is the central orchestrator. Its behaviour is governed by a handful of fixed parameters:

| Parameter                        | Value                  | Source                                  |
| -------------------------------- | ---------------------- | --------------------------------------- |
| Polling interval                 | 3 seconds              | `time.NewTicker(3 * time.Second)`       |
| Circuit breaker threshold        | 3 consecutive failures | `circuitBreakerThreshold = 3`           |
| Mission timeout                  | 2 hours                | `missionTimeoutDefault = 2 * time.Hour` |
| Max result summary               | 8,000 chars            | `maxResultSummaryLen = 8000`            |
| Max brief total                  | 32,000 bytes           | `maxBriefTotalLen = 32000`              |
| Per-dependency output truncation | 4,000 chars            | `maxDepOutputLen = 4000`                |

### Mission Loop

The `runMissionLoop` function runs as a goroutine for each active mission. Every 3 seconds it:

```
1. Check mission status (still IN_PROGRESS?)
2. Lead planning phase: if 0 tasks, dispatch Lead to create plan
3. Schedule ready tasks (dependencies met, status PENDING)
4. Check mission completion (all tasks done?)
5. Detect deadlocks (all tasks BLOCKED with no progress)
```

### Restart Durability

Mission loops are in-memory goroutines, but missions survive server restarts. At boot, right after orphaned-run recovery, the server scans the database for missions still in `IN_PROGRESS` and re-attaches an orchestration loop to each one. Tasks the previous process never dispatched are picked up on the next tick, and `BLOCKED` tasks whose dependencies have already completed are reset to `PENDING`. No operator action is needed after a restart or crash. Note that the 2-hour mission timeout restarts from the moment of re-attach.

## Task States

```
PENDING --> RUNNING --> COMPLETED
   ^          |             |
   |          v             v
   +------ FAILED    AWAITING_APPROVAL ---> COMPLETED
              |                         |
              v                         v
           BLOCKED                    FAILED (rejected)
           SKIPPED
```

| State               | Description                                                                         |
| ------------------- | ----------------------------------------------------------------------------------- |
| `PENDING`           | Ready to be scheduled                                                               |
| `RUNNING`           | Currently being executed by an agent                                                |
| `COMPLETED`         | Finished successfully                                                               |
| `FAILED`            | Execution failed                                                                    |
| `BLOCKED`           | Waiting for dependent tasks to complete                                             |
| `AWAITING_APPROVAL` | Task completed but held for human review before proceeding                          |
| `SKIPPED`           | Task was intentionally skipped (counts as terminal, does not cause mission failure) |

<Note>
  SKIPPED tasks are treated as terminal alongside COMPLETED and FAILED when checking mission completion. A skipped task does not block downstream dependencies and does not cause mission failure.
</Note>

## Token Budget Calculation

The orchestrator allocates system prompt space using a token budget system defined in `internal/tokenutil`:

| Constant                | Value  | Description                                             |
| ----------------------- | ------ | ------------------------------------------------------- |
| `MaxSystemPromptTokens` | 32,000 | Total conservative budget for the system prompt         |
| `ConversationBudgetPct` | 60%    | Percentage of remaining budget for conversation history |
| `MemoryBudgetPct`       | 40%    | Percentage of remaining budget for agent memory         |

The allocation works as follows:

```
1. Estimate base system prompt tokens
2. remaining = MaxSystemPromptTokens - baseTokens (min 2,000)
3. convTokenBudget  = remaining * 60 / 100
4. memTokenBudget   = remaining * 40 / 100
5. Inject conversation history (up to convTokenBudget)
6. Inject memory context (up to memTokenBudget)
```

After conversation and memory injection, the orchestrator appends two further context blocks, in this order: lead crew context (for LEAD agents), then peer communication context (for crew AGENT members).
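
As a worked example of the split, the arithmetic in steps 2 to 4 can be written as below. The constants take their values from the table above; `splitBudget` and `minRemainingTokens` are hypothetical names, and treating "min 2,000" as a floor on the remaining budget is an assumption of this sketch.

```go
package main

import "fmt"

const (
	MaxSystemPromptTokens = 32000
	ConversationBudgetPct = 60
	MemoryBudgetPct       = 40
	minRemainingTokens    = 2000 // assumed name for the floor in step 2
)

// splitBudget returns the conversation and memory token budgets left over
// after the base system prompt has been accounted for.
func splitBudget(baseTokens int) (convTokenBudget, memTokenBudget int) {
	remaining := MaxSystemPromptTokens - baseTokens
	if remaining < minRemainingTokens {
		remaining = minRemainingTokens
	}
	convTokenBudget = remaining * ConversationBudgetPct / 100
	memTokenBudget = remaining * MemoryBudgetPct / 100
	return convTokenBudget, memTokenBudget
}

func main() {
	conv, mem := splitBudget(12000)
	fmt.Println(conv, mem) // 12000 8000

	// A very large base prompt still leaves the 2,000-token floor to split.
	conv, mem = splitBudget(31500)
	fmt.Println(conv, mem) // 1200 800
}
```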

## Mission Brief Construction

When an agent is dispatched for a mission task, the `buildMissionBrief` function constructs a context prompt with five sections. The total brief is capped at 32,000 bytes (`maxBriefTotalLen`); if the cap is exceeded, the brief is truncated with a note.

### 1. IMPORTANT Preamble

Only included when dependency outputs exist. Instructs the agent not to ask clarifying questions:

```
IMPORTANT: You are part of a multi-agent mission pipeline.
Previous tasks have already been completed and their outputs are provided below.
DO NOT ask for additional information or clarification -- everything you need is in this prompt.
Use the dependency outputs below as your input and execute your task immediately.
```

### 2. \[MISSION]

Mission title, goal, and a DAG overview listing all tasks with their status markers:

* `+` COMPLETED
* `>` IN\_PROGRESS
* `x` FAILED
* ` ` PENDING/BLOCKED

### 3. \[INPUT FROM PREVIOUS TASKS]

Outputs from completed dependency tasks, injected **before** the assignment so agents read context first. When a task produced a structured handoff block, only the handoff summary, artifacts, and confidence are included, which keeps the brief concise. Otherwise the full result summary is included, truncated to 4,000 characters per dependency.

### 4. \[YOUR ASSIGNMENT]

The specific task title, description, and iteration number (if this is a retry).

### 5. \[OUTPUT FORMAT]

Structured handoff instructions requiring the agent to produce a `---HANDOFF---` block with summary, confidence, and artifacts.
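
To make the section order and the two caps concrete, here is a minimal sketch of assembling a brief. It is not the real `buildMissionBrief`: the `buildBrief` signature, the `depOutput` type, the shortened preamble, and the wording of the truncation note are all assumptions, and the sketch counts bytes for both caps.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const (
	maxBriefTotalLen = 32000 // cap on the whole brief, in bytes
	maxDepOutputLen  = 4000  // cap per dependency output
)

// truncationNote is placeholder wording, not the engine's actual note.
const truncationNote = "\n[brief truncated]"

// depOutput is a hypothetical carrier for one dependency's result.
type depOutput struct {
	TaskTitle string
	Output    string
}

// truncateBytes cuts s to at most max bytes without splitting a UTF-8 rune.
func truncateBytes(s string, max int) string {
	if len(s) <= max {
		return s
	}
	cut := max
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

// buildBrief joins the five sections in order and applies both caps.
func buildBrief(mission, assignment, outputFormat string, deps []depOutput) string {
	var b strings.Builder
	if len(deps) > 0 { // the preamble only appears when dependency outputs exist
		b.WriteString("IMPORTANT: You are part of a multi-agent mission pipeline.\n\n")
	}
	b.WriteString("[MISSION]\n" + mission + "\n\n")
	if len(deps) > 0 {
		b.WriteString("[INPUT FROM PREVIOUS TASKS]\n")
		for _, d := range deps {
			b.WriteString(d.TaskTitle + ":\n" + truncateBytes(d.Output, maxDepOutputLen) + "\n")
		}
		b.WriteString("\n")
	}
	b.WriteString("[YOUR ASSIGNMENT]\n" + assignment + "\n\n")
	b.WriteString("[OUTPUT FORMAT]\n" + outputFormat + "\n")

	brief := b.String()
	if len(brief) > maxBriefTotalLen {
		brief = truncateBytes(brief, maxBriefTotalLen-len(truncationNote)) + truncationNote
	}
	return brief
}

func main() {
	deps := []depOutput{{TaskTitle: "develop", Output: strings.Repeat("z", 10000)}}
	brief := buildBrief("Ship login page", "Write tests", "End with a ---HANDOFF--- block", deps)
	fmt.Println(len(brief) <= maxBriefTotalLen, strings.Count(brief, "z"))
}
```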

## Lead Planning Phase

When a mission starts with 0 tasks, the engine dispatches the Lead agent to create a plan. The Lead uses its crew context to understand available agents and creates tasks via the sidecar `/mission/create` endpoint.

```
Mission (0 tasks)
    |
    v
Lead agent dispatched (LEAD role, with sidecar)
    |
    v
Lead creates tasks via curl to localhost:9119/mission/create
    |
    v
Mission engine detects new tasks -> begins scheduling
```

### LeadPlanning Flag

The `DispatchRequest` includes a `LeadPlanning` flag that tells the API layer to dispatch the agent as a LEAD with sidecar access. This is essential because Lead agents need access to the mission management API (`/mission/create`, `/mission/{id}`) to define tasks, while regular AGENT tasks skip the sidecar for security.

### TOCTOU Prevention

A time-of-check-to-time-of-use race is prevented by inserting a sentinel `missionState` into the `active` map before loading the mission from the database. The `planningDispatched` flag on the mission state prevents re-dispatching the Lead if it is still working. This flag is only set to `true` after `dispatchLeadPlanning` succeeds.
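
The sentinel idea can be sketched as follows. The `missionState` type, `active` map, and `planningDispatched` flag are named in the text above; the `engine` struct, `claim`, and `dispatchLeadOnce` are hypothetical stand-ins, and the sketch assumes `dispatchLeadOnce` is only called from the mission's own loop goroutine.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNotReady is a made-up failure used only for the demonstration.
var errNotReady = errors.New("container not ready")

// missionState is a stand-in for the engine's per-mission state.
type missionState struct {
	planningDispatched bool
}

type engine struct {
	mu     sync.Mutex
	active map[string]*missionState
}

func newEngine() *engine {
	return &engine{active: make(map[string]*missionState)}
}

// claim inserts a sentinel state for missionID before any slow work
// (such as a database load) and reports whether this caller owns the mission.
func (e *engine) claim(missionID string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if _, exists := e.active[missionID]; exists {
		return false // another caller already started this mission
	}
	e.active[missionID] = &missionState{}
	return true
}

// dispatchLeadOnce runs dispatch unless planning was already dispatched.
// The flag flips only after dispatch succeeds, so a failed attempt is
// retried on a later tick.
func (e *engine) dispatchLeadOnce(missionID string, dispatch func() error) error {
	e.mu.Lock()
	st, ok := e.active[missionID]
	already := ok && st.planningDispatched
	e.mu.Unlock()
	if !ok || already {
		return nil
	}
	if err := dispatch(); err != nil {
		return err
	}
	e.mu.Lock()
	st.planningDispatched = true
	e.mu.Unlock()
	return nil
}

func main() {
	e := newEngine()
	fmt.Println(e.claim("m-1"), e.claim("m-1")) // true false

	calls := 0
	fail := func() error { calls++; return errNotReady }
	succeed := func() error { calls++; return nil }
	_ = e.dispatchLeadOnce("m-1", fail)    // flag stays false
	_ = e.dispatchLeadOnce("m-1", succeed) // flag becomes true
	_ = e.dispatchLeadOnce("m-1", succeed) // skipped
	fmt.Println(calls) // 2
}
```

Checking and inserting under the same lock is what closes the race: a second starter sees the sentinel even while the first is still loading the mission.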

### Scaling Rules

The Lead agent follows complexity-based scaling rules injected via the system prompt:

| Complexity | Agents | Tool Calls | Duration | Tokens |
| ---------- | ------ | ---------- | -------- | ------ |
| SIMPLE     | 1      | 3-10       | \~5 min  | \~10K  |
| MEDIUM     | 1-2    | 10-15      | \~15 min | \~50K  |
| COMPLEX    | 2-4    | 15+        | \~30 min | \~100K |

## Workflow Templates

Four built-in workflow templates are defined in `internal/orchestrator/workflow.go`:

<Tabs>
  <Tab title="Sequential">
    Tasks execute one after another in order.

    ```
    step-1 --> step-2 --> step-3
    ```
  </Tab>

  <Tab title="Parallel">
    All tasks run simultaneously, then results are aggregated by the Lead.

    ```
    task-a --|
    task-b --|---> aggregate (Lead)
    task-c --|
    ```
  </Tab>

  <Tab title="Dev-Test Loop">
    Developer writes code, tester reviews. On failure, loops back to developer (max 3 iterations). This is the **Ralph Loop pattern**.

    ```
    develop <---(failure)--- test
       |                      ^
       +----(success)---------+
    (max 3 iterations each)
    ```
  </Tab>

  <Tab title="Pipeline">
    Sequential preparation, parallel work streams, final aggregation.

    ```
    prepare --> work-a --|
            --> work-b --|---> finalize (Lead)
    ```
  </Tab>
</Tabs>

## The Ralph Loop Pattern

The `LoopController` (`internal/orchestrator/loop.go`) manages task retry logic:

1. When a task fails and has `max_iterations > 1`, the controller increments the iteration counter and resets the task to `PENDING`
2. For loop-back patterns (dev-test-loop), when a downstream task fails, the upstream task is reset to restart the cycle
3. Previous failure context from the progress log is injected so the agent learns from mistakes

The `ShouldRetry` method checks if a failed task has remaining iterations. If yes, it resets the task:

* Status back to `PENDING`
* Iteration counter incremented
* All execution fields cleared (`assignment_id`, `result_summary`, `error_message`, `started_at`, `completed_at`, `duration_ms`)

The `RetryLoopBack` method handles the upstream reset pattern: when a downstream task (e.g., "test") fails, it checks the dependency chain. If an upstream task (e.g., "develop") has remaining iterations, that task is reset to PENDING and the failed downstream task is set to BLOCKED -- ready to run again once the upstream completes.

<Warning>
  Tasks without `max_iterations` set (or `max_iterations <= 1`) are never retried. A failed task without retry configuration causes the mission to fail.
</Warning>
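
A minimal sketch of the retry check and reset, under stated assumptions: the `task` struct below is a simplified stand-in for the real task record, the iteration counter is assumed to start at 1, and `shouldRetry` is an illustrative function rather than the `LoopController` method itself.

```go
package main

import (
	"fmt"
	"time"
)

// task holds only the fields this sketch needs.
type task struct {
	Status        string
	Iteration     int // assumed to start at 1
	MaxIterations int
	AssignmentID  string
	ResultSummary string
	ErrorMessage  string
	StartedAt     *time.Time
	CompletedAt   *time.Time
	DurationMs    int64
}

// shouldRetry resets a failed task for another attempt and returns true,
// or returns false when no iterations remain.
func shouldRetry(t *task) bool {
	if t.Status != "FAILED" || t.MaxIterations <= 1 || t.Iteration >= t.MaxIterations {
		return false
	}
	t.Status = "PENDING"
	t.Iteration++
	// Clear every execution field so the next run starts clean.
	t.AssignmentID, t.ResultSummary, t.ErrorMessage = "", "", ""
	t.StartedAt, t.CompletedAt = nil, nil
	t.DurationMs = 0
	return true
}

func main() {
	now := time.Now()
	t := &task{Status: "FAILED", Iteration: 1, MaxIterations: 3, ErrorMessage: "tests failed", StartedAt: &now}
	for shouldRetry(t) {
		fmt.Printf("retrying, iteration %d\n", t.Iteration)
		t.Status = "FAILED" // pretend the retry failed as well
	}
	fmt.Println("final status:", t.Status) // final status: FAILED
}
```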

## Task Approval Gate

The `checkApprovalGate` function determines whether a completed task should be held for human review. The gate evaluates three inputs:

1. **Explicit flag** -- if `approval_required = 1` on the task, it is always held
2. **Confidence threshold** -- the agent's self-reported confidence from the handoff block
3. **Escalation config** -- per-crew configuration with tiered thresholds

### Escalation Config

Each crew can define an `escalation_config` JSON object with three thresholds:

```json theme={null}
{
  "auto_approve_threshold": 0.9,
  "notify_threshold": 0.7,
  "require_approval_below": 0.5
}
```

| Threshold                | Behavior                                                                    |
| ------------------------ | --------------------------------------------------------------------------- |
| `auto_approve_threshold` | Confidence at or above this value: auto-approve (task goes to COMPLETED)    |
| `notify_threshold`       | Confidence below this value: send a `confidence.low` WebSocket notification |
| `require_approval_below` | Confidence below this value: hold the task in AWAITING\_APPROVAL            |

The evaluation order is:

1. If confidence >= `auto_approve_threshold`, return COMPLETED
2. If `approval_required` is explicitly set, return AWAITING\_APPROVAL
3. If no config or no confidence data, return COMPLETED
4. If confidence \< `require_approval_below`, return AWAITING\_APPROVAL
5. If confidence \< `notify_threshold`, send notification but return COMPLETED
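
The evaluation order can be sketched as a single function. This mirrors the five steps exactly as listed, including checking the auto-approve threshold before the explicit flag. The `checkGate` signature, the `escalationConfig` struct, and the `notify` callback are assumptions made for illustration, not the real `checkApprovalGate` API.

```go
package main

import "fmt"

// escalationConfig mirrors the JSON object shown above.
type escalationConfig struct {
	AutoApproveThreshold float64 `json:"auto_approve_threshold"`
	NotifyThreshold      float64 `json:"notify_threshold"`
	RequireApprovalBelow float64 `json:"require_approval_below"`
}

// checkGate returns the status a finished task should move to.
// cfg or confidence may be nil when no config or confidence data exists.
func checkGate(cfg *escalationConfig, confidence *float64, approvalRequired bool, notify func()) string {
	// 1. High confidence auto-approves.
	if cfg != nil && confidence != nil && *confidence >= cfg.AutoApproveThreshold {
		return "COMPLETED"
	}
	// 2. The explicit flag holds the task.
	if approvalRequired {
		return "AWAITING_APPROVAL"
	}
	// 3. Nothing to evaluate against.
	if cfg == nil || confidence == nil {
		return "COMPLETED"
	}
	// 4. Very low confidence requires a human.
	if *confidence < cfg.RequireApprovalBelow {
		return "AWAITING_APPROVAL"
	}
	// 5. Moderately low confidence notifies but still completes.
	if *confidence < cfg.NotifyThreshold {
		notify()
	}
	return "COMPLETED"
}

func main() {
	cfg := &escalationConfig{AutoApproveThreshold: 0.9, NotifyThreshold: 0.7, RequireApprovalBelow: 0.5}
	notify := func() { fmt.Println("notify: confidence.low") }

	for _, c := range []float64{0.95, 0.6, 0.4} {
		conf := c
		fmt.Println(c, checkGate(cfg, &conf, false, notify))
	}
}
```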

### Approving or Rejecting Tasks

The `ApproveTask` method transitions a task from AWAITING\_APPROVAL:

* **Approved**: task moves to COMPLETED, dependent BLOCKED tasks are unblocked
* **Rejected**: task moves to FAILED, all downstream dependent tasks are recursively failed with reason "upstream task rejected"

Approval requires a `userID` for the audit trail. The approval status (`APPROVED` or `REJECTED`), approver, timestamp, and evaluation notes are persisted on the task.

<Note>
  When a task is held in AWAITING\_APPROVAL, the mission engine sends an `approval.required` WebSocket message to the workspace so dashboards can display a badge or notification.
</Note>

## Circular Dependency Detection

The `ValidateDAG` method checks all mission tasks for:

1. **References to nonexistent task IDs** -- any `depends_on` entry that does not match an existing task ID causes validation to fail
2. **Circular dependencies** -- detected using Kahn's algorithm (topological sort)

DAG validation runs before the mission loop begins scheduling, preventing tasks from being dispatched into an unresolvable dependency graph. The error message reports the number of tasks involved in the cycle: `"circular dependency detected: N tasks involved in cycle"`.

<Accordion title="Kahn's Algorithm internals">
  The implementation builds an adjacency list and computes in-degrees for each task:

  ```
  1. Initialize in-degree for each task based on depends_on count
  2. Enqueue all tasks with in-degree 0 (no dependencies)
  3. For each dequeued task, decrement in-degree of tasks that depend on it
  4. If a task's in-degree reaches 0, enqueue it
  5. If visited count != total tasks, a cycle exists
  ```
</Accordion>
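
For readers who want to see both checks in code, here is a self-contained sketch of Kahn's algorithm over a `depends_on` map. The `validateDAG` signature and the wording of the missing-reference error are assumptions; only the cycle message format comes from the text above. In this sketch, N counts every task that could not be ordered, which includes tasks downstream of a cycle.

```go
package main

import "fmt"

// validateDAG takes a map from task ID to the IDs it depends on.
func validateDAG(dependsOn map[string][]string) error {
	inDegree := make(map[string]int, len(dependsOn))
	dependents := make(map[string][]string) // reverse edges: dep -> tasks waiting on it

	for id, deps := range dependsOn {
		inDegree[id] = len(deps)
		for _, dep := range deps {
			if _, ok := dependsOn[dep]; !ok {
				return fmt.Errorf("task %q depends on nonexistent task %q", id, dep)
			}
			dependents[dep] = append(dependents[dep], id)
		}
	}

	// Start with every task that has no dependencies.
	var queue []string
	for id, n := range inDegree {
		if n == 0 {
			queue = append(queue, id)
		}
	}

	visited := 0
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		visited++
		for _, next := range dependents[id] {
			inDegree[next]--
			if inDegree[next] == 0 {
				queue = append(queue, next)
			}
		}
	}

	if visited != len(dependsOn) {
		return fmt.Errorf("circular dependency detected: %d tasks involved in cycle", len(dependsOn)-visited)
	}
	return nil
}

func main() {
	ok := map[string][]string{"prepare": nil, "work-a": {"prepare"}, "finalize": {"work-a"}}
	fmt.Println(validateDAG(ok)) // <nil>

	cyclic := map[string][]string{"develop": {"test"}, "test": {"develop"}}
	fmt.Println(validateDAG(cyclic)) // circular dependency detected: 2 tasks involved in cycle
}
```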

## Deadlock Detection

The mission engine detects a deadlock when all remaining tasks are `BLOCKED` and no task is `RUNNING`, `PENDING`, or `AWAITING_APPROVAL`. The detection logic:

1. If any task is PENDING, RUNNING, or AWAITING\_APPROVAL -- not deadlocked (progress is still possible)
2. COMPLETED, SKIPPED, and FAILED tasks are terminal -- they cannot contribute to progress
3. If all non-terminal tasks are BLOCKED -- deadlock confirmed
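
The three rules reduce to a short scan over task statuses. This is a sketch with a hypothetical `isDeadlocked` helper; it uses the state names from the Task States table, where `RUNNING` is the state of a task that is currently executing.

```go
package main

import "fmt"

// isDeadlocked reports whether every non-terminal task is BLOCKED.
func isDeadlocked(statuses []string) bool {
	blocked := 0
	for _, s := range statuses {
		switch s {
		case "PENDING", "RUNNING", "AWAITING_APPROVAL":
			return false // progress is still possible
		case "BLOCKED":
			blocked++
		}
		// COMPLETED, SKIPPED, and FAILED are terminal and are ignored.
	}
	// With no BLOCKED tasks the mission is simply finished, not deadlocked.
	return blocked > 0
}

func main() {
	fmt.Println(isDeadlocked([]string{"COMPLETED", "FAILED", "BLOCKED"}))  // true
	fmt.Println(isDeadlocked([]string{"COMPLETED", "RUNNING", "BLOCKED"})) // false
	fmt.Println(isDeadlocked([]string{"COMPLETED", "SKIPPED"}))            // false
}
```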

When a deadlock is detected:

1. The mission is marked as `FAILED`
2. A `mission_deadlock` progress event is emitted
3. All `AWAITING_APPROVAL` tasks are failed with "mission timed out"

## Circuit Breaker

The circuit breaker tracks consecutive failures per agent. After 3 consecutive failures (`circuitBreakerThreshold`), the agent is considered unhealthy and tasks are not dispatched to it.
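
A consecutive-failure counter of this kind might look like the sketch below. Only the threshold value comes from this page. The `circuitBreaker` type and its methods are hypothetical, and resetting the counter on a success is an assumption, since this page does not describe how an agent becomes healthy again.

```go
package main

import (
	"fmt"
	"sync"
)

const circuitBreakerThreshold = 3

// circuitBreaker counts consecutive failures per agent ID.
type circuitBreaker struct {
	mu       sync.Mutex
	failures map[string]int
}

func newCircuitBreaker() *circuitBreaker {
	return &circuitBreaker{failures: make(map[string]int)}
}

// record notes the outcome of one run. A success clears the streak
// (assumed behavior); a failure extends it.
func (c *circuitBreaker) record(agentID string, succeeded bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if succeeded {
		delete(c.failures, agentID)
		return
	}
	c.failures[agentID]++
}

// isOpen reports whether the agent should be skipped when dispatching.
func (c *circuitBreaker) isOpen(agentID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.failures[agentID] >= circuitBreakerThreshold
}

func main() {
	cb := newCircuitBreaker()
	cb.record("tester", false)
	cb.record("tester", false)
	fmt.Println(cb.isOpen("tester")) // false
	cb.record("tester", false)
	fmt.Println(cb.isOpen("tester")) // true
}
```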

## CooldownManager

The `CooldownManager` (`internal/orchestrator/failover.go`) handles rate limit detection and credential cooldown. When an agent run fails due to a rate limit, the associated credential is placed in a cooldown period to avoid hammering the provider.

### Rate Limit Detection

The `IsRateLimitError` function inspects the stderr output of a failed run. It reports a rate limit only when the exit code is 1 and the output matches, case-insensitively, any of the known patterns listed below.

<Accordion title="Rate-limit patterns">
  | Pattern              | Example                         |
  | -------------------- | ------------------------------- |
  | `rate limit`         | "Rate limit exceeded"           |
  | `rate_limit`         | "rate\_limit\_error"            |
  | `429`                | "HTTP 429"                      |
  | `too many requests`  | "Too many requests"             |
  | `quota exceeded`     | "Quota exceeded for model"      |
  | `insufficient_quota` | "insufficient\_quota"           |
  | `billing_hard_limit` | "billing\_hard\_limit\_reached" |
</Accordion>
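
A detection function along these lines can be sketched as below. The pattern list is copied from the table; the lowercase function name, its signature, and the use of plain substring matching are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// rateLimitPatterns holds the lowercase patterns from the table above.
var rateLimitPatterns = []string{
	"rate limit",
	"rate_limit",
	"429",
	"too many requests",
	"quota exceeded",
	"insufficient_quota",
	"billing_hard_limit",
}

// isRateLimitError requires exit code 1 and a case-insensitive
// substring match against any known pattern.
func isRateLimitError(exitCode int, stderr string) bool {
	if exitCode != 1 {
		return false
	}
	lower := strings.ToLower(stderr)
	for _, p := range rateLimitPatterns {
		if strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isRateLimitError(1, "Error: Rate limit exceeded, retry later")) // true
	fmt.Println(isRateLimitError(2, "HTTP 429"))                                // false
	fmt.Println(isRateLimitError(1, "permission denied"))                       // false
}
```

Note that a bare substring such as `429` can also match unrelated text (for example a port number or an ID), which is a known trade-off of simple pattern matching.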

### Cooldown Behavior

When a rate limit is detected:

1. `MarkCooldown(credentialID, 5*time.Minute)` places the credential in a 5-minute cooldown
2. `IsInCooldown(credentialID)` returns true during this period, causing the orchestrator to skip that credential
3. `ClearExpired()` removes stale entries

The cooldown is per-credential, not per-agent. If an agent has multiple credentials assigned, only the rate-limited credential is paused -- the orchestrator can fall back to an alternate credential.
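
The three methods can be sketched with a mutex-guarded map of expiry times. The method names and the 5-minute duration come from the list above. The struct fields, the constructor, the `cooldownPeriod` constant, and the injectable `manualClock` (included only so the example is deterministic) are assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const cooldownPeriod = 5 * time.Minute

// manualClock is a hypothetical test clock that only moves when told to.
type manualClock struct{ t time.Time }

func (c *manualClock) Now() time.Time          { return c.t }
func (c *manualClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// CooldownManager tracks, per credential ID, when its cooldown ends.
type CooldownManager struct {
	mu    sync.Mutex
	until map[string]time.Time
	now   func() time.Time
}

func NewCooldownManager(now func() time.Time) *CooldownManager {
	return &CooldownManager{until: make(map[string]time.Time), now: now}
}

// MarkCooldown pauses the credential for duration d starting now.
func (m *CooldownManager) MarkCooldown(credentialID string, d time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.until[credentialID] = m.now().Add(d)
}

// IsInCooldown reports whether the credential is still paused.
func (m *CooldownManager) IsInCooldown(credentialID string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	until, ok := m.until[credentialID]
	return ok && m.now().Before(until)
}

// ClearExpired drops entries whose cooldown has ended.
func (m *CooldownManager) ClearExpired() {
	m.mu.Lock()
	defer m.mu.Unlock()
	now := m.now()
	for id, until := range m.until {
		if !now.Before(until) {
			delete(m.until, id)
		}
	}
}

func main() {
	clock := &manualClock{t: time.Now()}
	m := NewCooldownManager(clock.Now)

	m.MarkCooldown("cred-primary", cooldownPeriod)
	fmt.Println(m.IsInCooldown("cred-primary"), m.IsInCooldown("cred-backup")) // true false

	clock.Advance(cooldownPeriod)
	fmt.Println(m.IsInCooldown("cred-primary")) // false
	m.ClearExpired()
}
```

In production code the constructor would typically receive `time.Now`; keying the map by credential rather than agent is what lets the orchestrator fall back to an alternate credential.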

## Progress Logging

The `ProgressWriter` (`internal/orchestrator/progress.go`) appends structured JSONL events to a per-mission progress file at `data/crews/{crewSlug}/missions/{traceID}/progress.jsonl`.

### Event Types

Each event includes a UTC timestamp.

<Accordion title="Progress event reference">
  | Event              | Fields                        | When                                        |
  | ------------------ | ----------------------------- | ------------------------------------------- |
  | `mission_started`  | `mission_id`                  | Mission loop begins                         |
  | `task_started`     | `task_id`, `agent`, `title`   | Task dispatched to agent                    |
  | `task_COMPLETED`   | `task_id`, `agent`, `summary` | Task finished successfully                  |
  | `task_FAILED`      | `task_id`, `agent`, `error`   | Task execution failed                       |
  | `task_retry`       | `task_id`, `agent`            | LoopController resets a task for retry      |
  | `mission_deadlock` | `mission_id`                  | All tasks BLOCKED with no progress          |
  | `mission_REVIEW`   | `mission_id`                  | All tasks terminal, mission entering review |
  | `mission_timeout`  | `mission_id`                  | Mission exceeded 2-hour timeout             |
</Accordion>

The progress file is append-only, and agents can read it during retry iterations to understand what happened in previous attempts (the Ralph Loop "external state" pattern).

The `BuildProgressContext` method formats the JSONL into a human-readable text block suitable for injection into an agent's system prompt.

## Structured Handoff

Agents produce structured handoff data at the end of tasks:

```
---HANDOFF---
summary: Created the REST API endpoints for user management
confidence: high
artifacts: internal/api/users.go, internal/api/users_test.go
---END HANDOFF---
```

The `parseHandoff` function extracts this structure from agent output. Both `summary` and `confidence` are required for a valid handoff -- partial blocks are treated as unparsed.

The confidence value (`low`, `medium`, `high`) feeds into the approval gate. There it is parsed as a float and compared against the thresholds in the crew's escalation config to decide whether the task auto-approves or requires human review.
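
Extracting the block can be sketched as below. This is not the real `parseHandoff`: the `parseHandoffBlock` name, the `handoff` struct, and the tolerant `key: value` line parsing are assumptions. The sketch does follow the stated rule that a block missing `summary` or `confidence` is treated as unparsed.

```go
package main

import (
	"fmt"
	"strings"
)

// handoff is a hypothetical container for the parsed fields.
type handoff struct {
	Summary    string
	Confidence string
	Artifacts  []string
}

// parseHandoffBlock finds the handoff block in agent output and parses its
// "key: value" lines. ok is false for missing or partial blocks.
func parseHandoffBlock(output string) (h *handoff, ok bool) {
	const startMarker, endMarker = "---HANDOFF---", "---END HANDOFF---"

	start := strings.Index(output, startMarker)
	if start < 0 {
		return nil, false
	}
	body := output[start+len(startMarker):]
	end := strings.Index(body, endMarker)
	if end < 0 {
		return nil, false
	}

	h = &handoff{}
	for _, line := range strings.Split(body[:end], "\n") {
		key, value, found := strings.Cut(line, ":")
		if !found {
			continue
		}
		value = strings.TrimSpace(value)
		switch strings.ToLower(strings.TrimSpace(key)) {
		case "summary":
			h.Summary = value
		case "confidence":
			h.Confidence = value
		case "artifacts":
			for _, a := range strings.Split(value, ",") {
				if a = strings.TrimSpace(a); a != "" {
					h.Artifacts = append(h.Artifacts, a)
				}
			}
		}
	}
	if h.Summary == "" || h.Confidence == "" {
		return nil, false // partial blocks count as unparsed
	}
	return h, true
}

func main() {
	output := `Work finished.
---HANDOFF---
summary: Created the REST API endpoints for user management
confidence: high
artifacts: internal/api/users.go, internal/api/users_test.go
---END HANDOFF---`

	if h, ok := parseHandoffBlock(output); ok {
		fmt.Println(h.Confidence, len(h.Artifacts)) // high 2
	}
}
```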

## Cross-Crew Missions

Mission tasks can reference agents from connected crews. The system auto-routes assignments to the correct crew container. Crew connections must be established by workspace admins before use.

<Note>
  Crew-to-crew handoff with critique exchange (e.g. backend crew hands a draft to a testing crew for review) is on the v0.2 roadmap.
</Note>

## Sidecar API for Orchestration

Lead agents interact with the orchestration system through the sidecar proxy at `localhost:9119`:

| Endpoint              | Method | Description                        |
| --------------------- | ------ | ---------------------------------- |
| `/assign`             | POST   | Assign a task to a crew member     |
| `/results/{id}`       | GET    | Poll for assignment result         |
| `/query`              | POST   | Ask a crew member a quick question |
| `/standup`            | GET    | Get crew standup summary           |
| `/escalate`           | POST   | Escalate an issue to humans        |
| `/mission/create`     | POST   | Create a multi-task mission        |
| `/mission/{id}`       | GET    | Check mission status               |
| `/mission/{id}/start` | POST   | Start a mission                    |
| `/mission/templates`  | GET    | List available workflow templates  |

## What's Next

* [Agent memory](/guides/agent-memory) -- persistent agent memory across sessions
* [Scheduling](/guides/scheduling) -- cron-based automated agent runs