Keeper Reviews — admin panel workflow

The Admin → Keeper reviews panel surfaces every Keeper Phase 2 decision the platform has logged. Four sub-tabs correspond to the four F4 evaluators (PRD §6 F4.1–4.4); each row is one evaluator decision with the full LLM prompt + response retained for forensic inspection.

The panel is a read-only audit log, not an action queue — there is no approve/reject button in the panel itself. Operator actions (confirming a lifecycle transition, approving a lesson) happen in the inbox, which is where the evaluators route their blocking rows. See “Operator action” in the decision tables below.

This page is the operator playbook — how to read the panel, what to do with each row, when to override an evaluator decision. For the per-endpoint API reference see Admin API → Keeper Phase 2; for the gate that controls auto-apply behaviour see Autonomy + self-learning.

What the panel shows

┌─ Keeper Reviews ──────────────────────────────────────┐
│ [Skill Review F4.1] [Behavior F4.2] [Memory Health   │
│  F4.3] [Negative Learning F4.4]   ⟲ Refresh           │
│                                                       │
│ ─── Skill Review (3 pending) ──────────────────────── │
│ Anna · ops · ALLOW · risk 2 · 2026-05-21 04:03        │
│ → Skill still actively used (4 invocations in 30d…)   │
│                                                       │
│ Bob · ops · DENY · risk 7 · 2026-05-21 04:03          │
│ → Failures dominating (3 of last 5); recommend stale  │
│                                                       │
│ Carol · qa · ESCALATE · risk 5 · 2026-05-21 04:03     │
│ → Low usage + recent assignment; needs manager review │
└───────────────────────────────────────────────────────┘

The backend reads from one source of truth, GET /api/v1/admin/keeper/requests?limit=200, and the panel filters client-side by request_type. The panel pulls a wide window (200 rows) so all four sub-tabs render from a single round-trip; if the workload outgrows that ceiling, a server-side ?request_type= filter can be added (not currently warranted).
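As a rough illustration of that client-side split, the sketch below groups one fetched window by request_type and derives the per-tab counts. The request_type values and the row shape are assumptions made for the example, not the platform's confirmed schema.

```python
from collections import defaultdict

# Hypothetical request_type values; the real identifiers may differ.
TABS = ["skill_review", "behavior", "memory_health", "negative_learning"]


def split_by_tab(rows):
    """Group one fetched window of rows by request_type."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.get("request_type")].append(row)
    # Return every tab, including empty ones, so badge counts are always defined.
    return {tab: grouped.get(tab, []) for tab in TABS}


rows = [
    {"request_type": "skill_review", "decision": "ALLOW"},
    {"request_type": "skill_review", "decision": "DENY"},
    {"request_type": "behavior", "decision": "DENY"},
]
tabs = split_by_tab(rows)
counts = {tab: len(items) for tab, items in tabs.items()}
```

Because the grouping runs on rows already in memory, switching tabs needs no further request.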

The four sub-tabs

F4.1 Skill Review

Cron-fires daily 03:00 UTC. Every skill with at least one enabled assignment on a live agent gets evaluated against:

last_used_at (when did any agent last invoke it)
assignments (how many agents have it assigned)
stats.invocations_30d + stats.failures_30d
failure_snippets (last 3 failure excerpts, if any)

Skills nobody has assigned are skipped: there is no workspace to bill the review’s LLM call to and no workspace inbox to notify, so a review would burn spend on an outcome no one sees. The next assignment picks the skill back up on the following daily sweep. The review call itself is billed to the first (alphabetical) workspace using the skill. The evaluator returns one of:

Decision	What it means	Operator action
ALLOW	Skill is fine. Keeps `lifecycle_state='active'`, bumps the verify clock.	Nothing — review later if the rationale changes.
DENY	Failures dominating, or no recent usage. Recommends `stale` (still callable but flagged) or `archived` (hidden from new assignments). Routes a blocking inbox row to the assigned agents’ workspace.	Open the row, read the failure snippets, decide: confirm the lifecycle transition via inbox approve, or push back via inbox reject.
ESCALATE	The LLM is uncertain — risk score is high but no clear DENY signal. Manager-targeted inbox row.	Read the LLM rationale, make the call. Often the right answer is “this is a manual call I should make based on context the LLM doesn’t have.”
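The eligibility and billing rule described before the decision table can be sketched as follows. The field names (`assignments`, `enabled`, `agent_live`, `workspace`) are hypothetical, and reading "workspace using the skill" as "workspace with an enabled assignment on a live agent" is an assumption.

```python
def review_targets(skills):
    """Pick skills eligible for the daily review and the workspace billed for each."""
    targets = []
    for skill in skills:
        # Only enabled assignments on live agents count toward eligibility.
        workspaces = sorted(
            {a["workspace"] for a in skill["assignments"] if a["enabled"] and a["agent_live"]}
        )
        if not workspaces:
            continue  # nobody to bill or notify, so the sweep skips this skill
        # Bill the first workspace in alphabetical order.
        targets.append({"skill": skill["name"], "bill_to": workspaces[0]})
    return targets


skills = [
    {"name": "deploy", "assignments": [
        {"workspace": "qa", "enabled": True, "agent_live": True},
        {"workspace": "ops", "enabled": True, "agent_live": True},
    ]},
    {"name": "legacy-report", "assignments": []},
    {"name": "triage", "assignments": [
        {"workspace": "qa", "enabled": False, "agent_live": True},
    ]},
]
targets = review_targets(skills)
```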

F4.2 Behavior

Sampled post-tool-call. The orchestrator’s EventPostToolCall hook fires on a per-crew rate (default 1-in-5) and runs the tool-call shape through the evaluator. Each row carries:

tool_name (shell_exec, http_post, file_write, etc.)
tool_args_snippet (truncated args — secrets scrubbed before storage)
decision + risk_score
should_block (true only when crew’s behavior_mode='block' AND decision is DENY)

For behavior_mode='warn' (the default) every DENY here is a non-blocking inbox notification — the agent’s NEXT tool call proceeds. The operator’s job is to skim the warns, spot patterns (“agent X keeps trying to call rm -rf”), and either:

Tighten the crew’s autonomy_level (crewship policy set --crew <slug> --level strict)
Update the agent’s PERSONA.md to explicitly forbid the pattern (operator-only edit)
Flip the crew to behavior_mode='block' once you trust the signal enough that interrupts are worth it
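The sampling and blocking rules above can be sketched as below. The counter-based sampler is an assumption about how a 1-in-5 rate might be implemented (the real hook may sample differently), and the class and function names are hypothetical.

```python
class BehaviorSampler:
    """Counter-based 1-in-N sampler with one counter per crew (illustrative only)."""

    def __init__(self, sample_every=5):
        self.sample_every = sample_every
        self._counts = {}

    def should_sample(self, crew_id):
        # Increment the crew's counter and sample every Nth tool call.
        n = self._counts.get(crew_id, 0) + 1
        self._counts[crew_id] = n
        return n % self.sample_every == 0


def should_block(behavior_mode, decision):
    # Blocking requires both the crew's block mode and a DENY decision.
    return behavior_mode == "block" and decision == "DENY"


sampler = BehaviorSampler(sample_every=5)
sampled = [sampler.should_sample("crew-a") for _ in range(10)]
```

With the default mode of warn, `should_block` is always false, which matches the rule that a DENY in warn mode only produces a notification.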

F4.3 Memory Health

Cron-fires daily 03:30 UTC (30-min offset from skill review to avoid both LLM cron jobs hammering at the same wall-clock time). Each row carries a HealthSnapshot:

reachability_pct (what fraction of memory rows have at least one inbound reference)
stalest_entry_days (oldest unreferenced row’s age)
contradiction_count (rows in memory_relations with kind='refutes')

Decision matrix:

Snapshot signal	Likely decision	Auto-action
Reachability > 80%, stalest < 90 days, 0 contradictions	ALLOW	None
Reachability < 60% OR stalest > 365 days	DENY	`auto_consolidate=true` → triggers `consolidator.Run` for the workspace
Mixed signals, contradictions > 0	ESCALATE	Blocking inbox — operator reads which rows contradict each other and decides which is canonical

auto_consolidate=true runs the consolidator without operator approval — it’s a maintenance action, not a policy decision. The journal records every consolidator run for postmortem.
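For reading snapshots quickly, the decision matrix can be approximated as a plain function. This is only a sketch: the real decision comes from the LLM evaluator, the function name is hypothetical, and the choices to check the DENY thresholds first and to send every in-between snapshot to ESCALATE are assumptions the table does not spell out.

```python
def likely_memory_decision(reachability_pct, stalest_entry_days, contradiction_count):
    """Approximate the F4.3 matrix; not the evaluator's actual logic."""
    # Assumption: DENY thresholds win over any other signal.
    if reachability_pct < 60 or stalest_entry_days > 365:
        return {"decision": "DENY", "auto_consolidate": True}
    if reachability_pct > 80 and stalest_entry_days < 90 and contradiction_count == 0:
        return {"decision": "ALLOW", "auto_consolidate": False}
    # Mixed signals or any contradictions go to the operator.
    return {"decision": "ESCALATE", "auto_consolidate": False}


healthy = likely_memory_decision(92, 30, 0)
sparse = likely_memory_decision(55, 30, 0)
conflicted = likely_memory_decision(85, 30, 2)
```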

F4.4 Negative Learning

Fires after a guardrail trip, run failure, or explicit “log this lesson” trigger. Each row asks: should this failure produce a kind='negative' lesson in the agent’s lessons.md? The decision interacts with self_learning_enabled (v106 — see Autonomy + self-learning):

Decision	self_learning=1	self_learning=0 (default)
ALLOW	Lesson auto-applies. Row shows `decision=ALLOW`, `write_lesson=true`, `lesson_id=…`. Nothing for operator to do.	Lesson queued in inbox with `self_learning_gate=off` marker. Row still shows `decision=ALLOW`. Operator approves via inbox; lesson lands then.
DENY	Lesson discarded (“Curator dropped the failure as transient/noise”). Blocking inbox row so operator can see signals that disappeared — auditable failure-feedback path.	Same.
ESCALATE	Blocking inbox row for operator decision.	Same.

Important UX trap to avoid: an ALLOW row for an agent with self_learning=0 will look like the lesson auto-applied (because the response says write_lesson=true). It DIDN’T — check the agent’s lessons.md to confirm, and check the inbox for the pending approval. The audit-row’s payload_json contains "self_learning_gate": "off" when this gate fired.
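One way to avoid that trap when scanning rows in bulk is to classify each F4.4 row from its decision and its payload_json marker. The sketch below assumes payload_json is a JSON string on the row; the function name and the returned labels are hypothetical.

```python
import json


def lesson_state(row):
    """Classify an F4.4 row by what actually happened to the lesson."""
    decision = row.get("decision")
    if decision == "DENY":
        return "discarded"
    if decision == "ESCALATE":
        return "awaiting_operator_decision"
    payload = json.loads(row.get("payload_json") or "{}")
    # The gate marker means the lesson is queued in the inbox, not written yet.
    if payload.get("self_learning_gate") == "off":
        return "pending_approval"
    return "applied"


gated = {"decision": "ALLOW", "payload_json": '{"self_learning_gate": "off", "write_lesson": true}'}
auto = {"decision": "ALLOW", "payload_json": '{"write_lesson": true, "lesson_id": "l-1"}'}
```

Note that both example rows carry write_lesson=true; only the gate marker tells them apart.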

Row detail — what to read

Click any row to open the detail sheet. Fields you’ll care about:

Agent + crew context

agent_name, crew_name, agent_id (linkable to the agent canvas), crew_id (linkable to the crew canvas). Helpful for “which agent / crew is generating most of these reviews” patterns.

Decision triple

decision (ALLOW/DENY/ESCALATE) + reason (one-line LLM summary) + risk_score (1-10 integer the evaluator self-assessed). Risk above 7 is rare and worth investigating even when the decision is ALLOW — it usually means the LLM found context that’s worth the operator seeing.

Full LLM round-trip

ollama_prompt (full prompt the evaluator sent — includes the request body + the system prompt + any context the evaluator built) and ollama_raw_response (raw JSON the LLM returned, before the decision-parser normalised it). Both are stored verbatim so a postmortem can reconstruct exactly what the evaluator saw and what it answered. Use these to debug “the LLM made the wrong call here” cases. Common failure modes:

The prompt didn’t include enough context — operator updates the evaluator’s prompt template (engineering work)
The LLM mis-parsed the request — operator switches the F3 aux model slot to a stronger model temporarily
The decision-parser tripped on an edge case — operator files a bug against the F4 normaliser

Secrets handling

The tool_args_snippet and failure_snippet fields pass through redactSecrets() (app/(dashboard)/admin/utils.ts) before render. API keys, OpenAI tokens, JWTs, Authorization: Bearer … headers etc. get replaced with [REDACTED:type] markers. If you see [REDACTED:openai-key] in a snippet, that’s the panel doing its job — the underlying audit row also has the secret scrubbed (the scrubber from Layer 3 ran on ingestion).
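To make the marker format concrete, here is a minimal Python sketch of pattern-based redaction. It is not the panel's redactSecrets() (which lives in the TypeScript file named above); the patterns and the `bearer` and `jwt` type labels are assumptions chosen for illustration.

```python
import re

# Hypothetical patterns and labels; the real redaction rules may differ.
PATTERNS = [
    ("bearer", re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+")),
    ("jwt", re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")),
    ("openai-key", re.compile(r"\bsk-[A-Za-z0-9_-]{20,}")),
]


def redact(text):
    """Replace anything that looks like a secret with a [REDACTED:type] marker."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text


header = redact("curl -H 'Authorization: Bearer abc.def.ghi' https://example.test")
key = redact("OPENAI_API_KEY=sk-" + "a" * 24)
```

The bearer pattern runs first so that a JWT carried in an Authorization header is reported once, as a bearer token.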

Operator workflows

“Triaging the morning backlog”

After overnight cron runs, the panel typically has 10–50 pending rows across the four tabs. Suggested sweep order:

1. F4.4 Negative Learning first: these are blocking inbox rows that hold the agent’s mission until you approve or reject. Anything that has been sitting for more than 12 hours is hurting the agent’s velocity.
2. F4.1 Skill Review: DENY rows recommend lifecycle transitions. Confirm or push back; the assigned agents pick up the new state on their next invocation.
3. F4.3 Memory Health: ESCALATE rows need eyes on the contradicting rows. For DENY rows, auto_consolidate has already run without you; nothing to do.
4. F4.2 Behavior: warn-mode rows are for pattern-spotting; you don’t action them individually, you look for repeated patterns and tighten policy.
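The sweep order above can be applied to an exported window with a small sort. In this sketch the request_type values and the `created_at` field (an ISO 8601 string with a UTC offset) are assumptions, and the 12-hour flag is applied to every row for simplicity even though the guidance above only calls it out for F4.4.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical request_type values, listed in the suggested sweep order.
SWEEP_ORDER = ["negative_learning", "skill_review", "memory_health", "behavior"]


def morning_sweep(rows, now, stale_after=timedelta(hours=12)):
    """Order rows by sweep priority (oldest first within a tab) and flag stale ones."""
    rank = {name: i for i, name in enumerate(SWEEP_ORDER)}
    ordered = sorted(
        rows,
        key=lambda r: (rank.get(r["request_type"], len(SWEEP_ORDER)), r["created_at"]),
    )
    return [
        {**r, "stale": now - datetime.fromisoformat(r["created_at"]) > stale_after}
        for r in ordered
    ]


now = datetime(2026, 5, 21, 16, 0, tzinfo=timezone.utc)
rows = [
    {"request_type": "behavior", "created_at": "2026-05-21T03:10:00+00:00"},
    {"request_type": "negative_learning", "created_at": "2026-05-21T10:00:00+00:00"},
    {"request_type": "skill_review", "created_at": "2026-05-21T03:00:00+00:00"},
    {"request_type": "negative_learning", "created_at": "2026-05-21T02:00:00+00:00"},
]
swept = morning_sweep(rows, now)
```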

“Why did Anna’s skill get archived overnight?”

1. Open Admin → Keeper Reviews → Skill Review
2. Filter by agent_name=Anna (browser Cmd+F; the panel doesn’t have search yet, tracked as a UI nice-to-have)
3. Click the matching row
4. Read ollama_prompt to see the stats the evaluator received
5. Read ollama_raw_response to see the LLM’s full rationale
6. If the decision was wrong, revert by setting lifecycle_state back via SQL (no UI override yet; tracked as a UI follow-up) and either:
   - Update the skill’s failure-tolerance threshold in the evaluator prompt template (eng work)
   - Pin the skill in the Keeper reviews panel so future reviews can’t archive it

“I think the behavior monitor is too noisy”

Symptoms: every other row in F4.2 is DENY for the same tool_name and the agent’s chat is unaffected (warn mode). The evaluator is over-firing. Fixes in order of preference:

1. Update the agent’s PERSONA.md to explain the legitimate use case for that tool; the next F4.2 evaluation reads the persona and is less likely to DENY
2. Lower the F4.2 sampling rate (the default is every 5th tool call per crew, tuned via the SetSampleEvery hook in internal/keeper/behaviorhook; not yet exposed as a CLI flag)
3. Switch the F3 Behavior aux slot to a stronger model that’s less prone to false positives

If symptoms persist, file an issue — the F4.2 prompt template may need a tightening pass.
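To check the "every other row is DENY for the same tool_name" symptom against exported rows rather than by eye, a per-tool DENY share is enough. This is a sketch; it assumes each F4.2 row exposes `tool_name` and `decision` as described earlier, and the function name is hypothetical.

```python
from collections import Counter


def deny_share_by_tool(behavior_rows):
    """Return {tool_name: fraction of sampled rows that were DENY}."""
    total = Counter(r["tool_name"] for r in behavior_rows)
    denied = Counter(r["tool_name"] for r in behavior_rows if r["decision"] == "DENY")
    return {tool: denied[tool] / total[tool] for tool in total}


rows = [
    {"tool_name": "shell_exec", "decision": "DENY"},
    {"tool_name": "shell_exec", "decision": "ALLOW"},
    {"tool_name": "shell_exec", "decision": "DENY"},
    {"tool_name": "shell_exec", "decision": "ALLOW"},
    {"tool_name": "http_post", "decision": "ALLOW"},
]
shares = deny_share_by_tool(rows)
```

A tool whose share sits near 0.5 or higher across many rows is a candidate for the fixes listed above.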

Auto-refresh + filters

The panel does NOT auto-refresh; the operator clicks Refresh to pull the latest 200 rows. This is intentional: F4 cron decisions land in bursts (daily 03:00 + 03:30 UTC) and a polling refresh would generate API noise without buying anything. The decision is reversible; if operators ask for live updates a 60-second polling refresh can land.

Sub-tab counts are live within the 200-row window, and switching tabs is instant (client-side filter). The count badge next to each sub-tab label reads the filtered row count.

Keyboard navigation

The sub-tab strip is WAI-ARIA compliant:

ArrowLeft / ArrowRight cycle through the four tabs
Home jumps to the first tab; End jumps to the last
Tab from inside the panel exits to the next form control (roving tabIndex)
Active tab carries aria-selected="true" for screen readers
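The key handling listed above reduces to a small index calculation, sketched here as a pure function. Wrapping from the last tab back to the first is an assumption drawn from the word "cycle"; the function name is hypothetical, and the real panel implements this in its front-end code.

```python
def next_tab(current, key, tab_count=4):
    """Return the index of the tab that should receive focus after a key press."""
    if key == "ArrowRight":
        return (current + 1) % tab_count  # wraps from the last tab to the first
    if key == "ArrowLeft":
        return (current - 1) % tab_count  # wraps from the first tab to the last
    if key == "Home":
        return 0
    if key == "End":
        return tab_count - 1
    return current  # any other key leaves the selection unchanged


after_right = next_tab(3, "ArrowRight")
after_left = next_tab(0, "ArrowLeft")
```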

What’s NOT in the panel (tracked follow-ups)

Per-row override action — currently no “force ALLOW” / “force DENY” button. Overrides go through SQL or via re-running the evaluator with adjusted inputs. UI override = future PR.
Free-text search — browser Cmd+F is the only search today. Server-side filtering would need ?request_type=&q= API extension.
Auto-refresh polling — see above; deliberate.
Bulk action — no “approve all pending lesson proposals” — every approval is per-row by design (audit trail).
Export — no CSV download. Use GET /api/v1/admin/keeper/requests?limit=200&offset=N directly + jq for now.
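Until an export feature exists, paging through limit and offset can be scripted. The sketch below keeps the HTTP call behind a hypothetical `fetch_page` callable, which you would implement with your own client and credentials, and it assumes each page comes back as a plain list of rows, which the endpoint's actual response shape may not match.

```python
def export_all(fetch_page, limit=200):
    """Collect every row by advancing offset until a short page comes back."""
    rows, offset = [], 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        rows.extend(page)
        if len(page) < limit:
            return rows  # a short (or empty) page means there is nothing left
        offset += limit


# Stand-in for the real endpoint: 450 fake rows served in slices.
fake_rows = [{"id": i} for i in range(450)]


def fake_fetch(limit, offset):
    return fake_rows[offset:offset + limit]


exported = export_all(fake_fetch)
```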

Related

Admin API → Keeper Phase 2: per-endpoint reference for the four evaluator routes
Autonomy + self-learning: the policy + self_learning gates these decisions flow through
Inbox guide: where DENY + ESCALATE rows surface for operator approval
Keeper guide: Phase 1 (access/execute) decisions on the same panel
