How To Set Up An Agent Production Engineer
A complete operating document for building an automated production engineer: Linear-first triage, bounded evidence specialists, durable automation, approval gates, redacted operational memory, and careful production DB read access.
Decision In One Page
Consolidate production engineering work into one authoritative agent suite centered on .agents/skills/prode-triage/SKILL.md. The suite replaces the split prod-agent* skills with a single evidence-first process, bounded specialists, a durable runner contract, redaction controls, and an operational-memory layer.
| Problem | Production tickets were easy for agents to approach from the wrong layer: code first, one evidence source, stale memory, duplicated artifacts, or incomplete handoffs. |
|---|---|
| Decision | Make prode-triage the coordinator and bind it to specialist routing, a durable runner contract, PRODE Wiki memory, redaction controls, and Linear-first progress. |
| Invariant | Current production evidence is required before implementation. Operational memory can route investigation, but it cannot prove the current incident. |
| Owner Model | The coordinator owns the ticket and final decision; specialists own bounded evidence; the runner owns durable automation; humans own production-write approval and PR merge. |
| Acceptance | A reviewer can reproduce the evidence path from Linear, specialist outputs, evidence bundle, PR handoff, and post-merge monitor without relying on hidden chat context. |
What the suite does
Fetches Linear and live production evidence first, routes specialists, builds an evidence bundle, classifies, implements only when confidence and approvals allow, opens a PR, updates Linear, monitors after merge, and writes candidate learnings.
What the suite must not do
Treat old wiki memory as proof, read broad source code before evidence, mutate production without approval, leak customer or financial data, duplicate Linear/PR artifacts, or merge PRs by itself.
Linear and live evidence precede source investigation, fix planning, and branch work.
The coordinator inventories MCPs, connectors, CLIs, and skills before routing specialists.
Production writes, PR merges, and unsafe operational actions require fresh human approval.
Wiki pages guide query selection and negative controls, but never count as proof.
Setup Contract
An Agent Production Engineer is not just a prompt. It is a controlled operating environment with a coordinator skill, read-only evidence tools, durable state, redaction controls, and explicit human approval gates. If any part is missing, the agent should degrade to investigation or block instead of pretending it can safely fix production.
| Entry point | .agents/skills/prode-triage/SKILL.md is the normative workflow. Repo entrypoints such as AGENTS.md and CLAUDE.md should point to it rather than duplicating rules. |
|---|---|
| Secrets | Use secure local or platform secret storage for LINEAR_API_KEY, GITHUB_TOKEN, SENTRY_AUTH_TOKEN, AWS credentials, and production read-replica credentials. Never write secret values to prompts, docs, artifacts, git config, or issue text. |
| Tool map | The coordinator inventories Linear, Sentry, AWS/CloudWatch, GitHub, DB read-replica, vendor, support, release, browser, emulator, and code-review tools before routing specialists. |
| Runner state | Automated operation requires a durable store for leases, phase transitions, idempotency namespaces, artifact pointers, PR URLs, approval records, and terminal outcomes. |
| Human gates | Production writes, deploy changes, merges, Done/Cancelled Linear transitions, risky data boundaries, and broad telemetry suppression require explicit fresh approval. |
Minimum viable agent
- Can fetch Linear tickets and comments.
- Can read linked Sentry and production evidence.
- Can write Linear progress and open PRs.
- Cannot write production or merge PRs.
Minimum viable runner
- Has ticket leases and idempotency keys.
- Persists every phase transition.
- Stores redacted artifact pointers.
- Resumes instead of duplicating work.
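The lease bullet above can be illustrated with a small sketch. It is an in-memory stand-in, not the runner's actual store: the `LeaseTable` class, the runner names, and the ticket ID `PRODE-1` are all hypothetical, and a real runner would keep leases in the durable store described in the runner reference.

```python
from typing import Dict, Tuple


class LeaseTable:
    """In-memory stand-in for a durable lease store (illustrative only)."""

    def __init__(self) -> None:
        # ticket id -> (owner, expiry as seconds on an arbitrary clock)
        self._leases: Dict[str, Tuple[str, float]] = {}

    def acquire(self, ticket: str, owner: str, now: float, ttl: float) -> bool:
        """Grant the lease unless another owner holds an unexpired one."""
        holder = self._leases.get(ticket)
        if holder is not None:
            held_by, expires_at = holder
            if held_by != owner and expires_at > now:
                return False
        self._leases[ticket] = (owner, now + ttl)
        return True


leases = LeaseTable()
first = leases.acquire("PRODE-1", "runner-a", now=0.0, ttl=60.0)
second = leases.acquire("PRODE-1", "runner-b", now=30.0, ttl=60.0)
after_expiry = leases.acquire("PRODE-1", "runner-b", now=61.0, ttl=60.0)
```

Here the second runner is refused while the first lease is live and succeeds only after it expires, which is what lets a restarted run resume a ticket without two runners working on it at once.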
Minimum viable safety
- Production DB is read-only by credential.
- Specialists run with read-only tools.
- Redaction lint checks wiki paths.
- Approvals expire and are action-specific.
Non-negotiable setup rule
If a required evidence tool is unavailable, the run records the missing tool and confidence impact. It must not silently substitute a weaker source and continue as though the same evidence standard was met.
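One way to make this rule mechanical is to have the run record every unavailable tool together with its stated confidence impact. The sketch below assumes a hypothetical `ToolInventory` type and illustrative tool names; none of these identifiers come from the suite itself.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ToolInventory:
    """Tracks which evidence tools a run has and which ones it lacks."""

    available: Set[str]
    # missing tool -> note on how its absence limits confidence
    missing: Dict[str, str] = field(default_factory=dict)

    def require(self, tool: str, impact: str) -> bool:
        """Return True if the tool is available; otherwise record the gap."""
        if tool in self.available:
            return True
        self.missing[tool] = impact
        return False


inventory = ToolInventory(available={"linear", "sentry"})
has_sentry = inventory.require("sentry", "cannot confirm the exception signature")
has_cloudwatch = inventory.require("cloudwatch", "cannot correlate 5xx logs")
```

The run artifact can then carry `inventory.missing` verbatim, so a reviewer sees which evidence standard was not met instead of a silently substituted source.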
Canonical Files To Install
The setup should keep the operating contract in git so humans and agents read the same rules. Local chat context is not an acceptable source of truth for production engineering.
| File | Purpose | Setup requirement |
|---|---|---|
| .agents/skills/prode-triage/SKILL.md | Coordinator workflow, triggers, Linear requirements, evidence gates, and handoff templates. | Must be the only normative top-level workflow. |
| .agents/skills/prode-triage/references/specialists.md | Specialist role definitions, tool boundaries, input bundle, and output shape. | Must be read whenever a ticket needs source-specific investigation. |
| .agents/skills/prode-triage/references/automation-runner.md | Durable runner contract: leases, state machine, idempotency, audit log, approvals, and post-merge monitor. | Required before any automated execution. |
| AGENTS.md / CLAUDE.md | Repo-level routing so production tickets invoke the PRODE workflow. | Should point to the skill instead of copying sections that can drift. |
| docs/prode-wiki/ | Reviewed operational memory: failure modes, query recipes, and negative controls. | Must distinguish memory from current evidence. |
| docs/learnings/inbox.md | Candidate learnings from completed runs. | Automation writes candidates here, not directly into durable wiki pages. |
| scripts/prode-wiki-redaction-lint.sh | Prevents common sensitive artifact formats and risky content from entering the wiki. | Must run in CI for wiki and learning paths. |
Architecture
The suite is deliberately split into one coordinator skill, specialist references, durable automation rules, and a separate operational-memory ADR.
prode-triage is shown as a coordinator boundary, not as a node fanning out to peers. Inputs flow into it; controlled artifacts flow out.
The dashed wiki arrow is deliberate: operational memory can influence routing, but only the evidence bundle can support classification.
Coordinator
Owns the ticket, Linear updates, classification, implementation, PR, handoff, and final decision.
Specialists
Return bounded evidence. They do not mutate Linear, GitHub, AWS, Sentry state, files, or production data.
Runner
Makes automation retry-safe through leases, idempotency keys, artifacts, and explicit terminal states.
Evidence-First Workflow
The most important design choice is the ordering. Agents do not start from code. They start from the ticket and production evidence, then consult memory, then route specialists.
The workflow now includes tool/MCP inventory before wiki and specialist routing, matching the suite contract.
The red lane names non-happy exits so automation does not collapse every ticket into "fix and PR".
Fetch
Read Linear and linked production evidence. No source-code investigation yet.
Route
Consult PRODE Wiki as memory, then route the minimum required specialists.
Prove
Build a live evidence bundle with classification, confidence, and gaps.
Deliver
Implement, verify, review, open PR, update Linear, monitor after merge.
Ticket runbook
For each PRODE ticket: fetch Linear; inventory tools; consult PRODE Wiki; route specialists; persist specialist outputs; classify with confidence; update Linear; only then read code, create a worktree, implement, verify, review, open a PR, write the handoff, await human merge, monitor, and propose learning.
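The runbook ordering can be expressed as a fixed sequence that refuses out-of-order progress. This is a minimal sketch: the step identifiers are hypothetical names derived from the runbook sentence above, and the canonical state names remain those in the runner reference.

```python
from typing import List, Optional

# Step names are illustrative labels for the runbook sentence, in order.
RUNBOOK: List[str] = [
    "fetch_linear", "inventory_tools", "consult_wiki", "route_specialists",
    "persist_specialist_outputs", "classify", "update_linear",
    "read_code", "create_worktree", "implement", "verify", "review",
    "open_pr", "write_handoff", "await_human_merge", "monitor",
    "propose_learning",
]


def next_step(completed: List[str]) -> Optional[str]:
    """Return the next runbook step, or None when the run is finished."""
    if completed != RUNBOOK[: len(completed)]:
        raise ValueError("steps were completed out of runbook order")
    if len(completed) == len(RUNBOOK):
        return None
    return RUNBOOK[len(completed)]
```

With this shape, `read_code` only becomes the next step after `classify` and `update_linear` are recorded, and a run that claims to have read code first is rejected outright.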
Specialist Routing
Specialists are bounded roles. They gather evidence from one source or one path, return structured findings, and stop. The coordinator owns the ticket.
| Specialist | Phase | Use when | Boundary |
|---|---|---|---|
| Sentry | Before code | Sentry issue, crash, trace tags, production exception. | Read-only Sentry. |
| AWS / CloudWatch | Before code | 5xxs, ECS/RDS/ELB/Batch, log correlation, infra symptoms. | Read-only AWS allowlist. |
| Vendor / Integration | Before code | Webhook, provider, payment, KYC, banking, broker, notification failures. | No provider-side mutation. |
| Product / Support | Before code | Screenshots, support reports, user-visible symptoms, expected behaviour. | Ticket-linked context only. |
| Release / Deploy | Before code | Deploy spikes, app versions, feature flags, Sentry releases. | Read-only release metadata. |
| DB Read-Replica | Before code | Production state must be confirmed to classify or reproduce. | Configured read-only SELECT only. |
| Reproduction | After evidence source known | Behaviour can be safely reproduced outside production writes. | Local, staging, emulator, browser, or read-only API paths. |
| Code Path | After evidence bundle | Root cause needs source-code grounding before implementation. | Source reads only; no edits. |
| Review | After diff | PR readiness needs independent challenge. | Diff review only unless explicitly delegated. |
How specialists know which tools to read
Specialists do not discover tools broadly. The coordinator first inventories available MCP tools, connectors, CLIs, and skills, records the chosen tool map in the run artifact, then passes each specialist only the read-only tools relevant to its evidence source.
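The scoping described above could look like the following sketch, in which the coordinator filters its inventory before handing tools to a specialist. The `Tool` type and every tool name are assumptions made for illustration, not the suite's canonical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tool:
    name: str        # hypothetical tool identifier
    source: str      # evidence source the tool reads from
    read_only: bool  # whether the tool can mutate anything


def tools_for_specialist(source: str, inventory: List[Tool]) -> List[str]:
    """Pass a specialist only the read-only tools for its evidence source."""
    return [t.name for t in inventory if t.source == source and t.read_only]


inventory = [
    Tool("sentry.get_issue", "sentry", read_only=True),
    Tool("sentry.resolve_issue", "sentry", read_only=False),
    Tool("cloudwatch.query_logs", "aws", read_only=True),
]
sentry_tools = tools_for_specialist("sentry", inventory)
```

The Sentry specialist receives only `sentry.get_issue` here: the mutating tool and the tools for other evidence sources never reach it.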
SELECT queries only.
Automation Runner
The runner is the durable control plane. It prevents duplicate comments, branches, PRs, and half-finished hidden-context work.
The bands compress the full runner enum into operating phases; the canonical state names remain in the runner reference.
Automation may await review and monitor after merge, but merge and production-write approvals remain human-owned gates.
Durable by design
- Ticket lease
- Idempotency namespace
- Persisted phase state
- Audit log and artifact pointers
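A rough sketch of how an idempotency namespace prevents duplicate side effects on retry follows. The key layout, the `EffectLog` class, and the phase and action names are assumptions; the in-memory dictionary stands in for the durable store.

```python
import hashlib
from typing import Callable, Dict, List


def idempotency_key(ticket: str, phase: str, action: str) -> str:
    """Derive a stable key for one side effect of one run phase."""
    raw = "\x1f".join((ticket, phase, action))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class EffectLog:
    """In-memory stand-in for persisted effect results (illustrative only)."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def run_once(self, key: str, effect: Callable[[], str]) -> str:
        """Run the effect the first time; return the stored result on retry."""
        if key not in self._results:
            self._results[key] = effect()
        return self._results[key]


calls: List[str] = []


def post_comment() -> str:
    calls.append("posted")
    return "comment-1"


log = EffectLog()
key = idempotency_key("PRODE-1", "classified", "linear_comment")
first = log.run_once(key, post_comment)
retry = log.run_once(key, post_comment)
```

The retry returns the stored artifact pointer without posting again, which is the behaviour that prevents duplicate comments, branches, and PRs.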
Stops instead of guessing
- Confidence below 70%
- Evidence missing
- Approval required
- Specialists materially conflict
Evidence Bundle Contract
The evidence bundle is the handoff between investigation and implementation. The agent cannot implement until this bundle exists, confidence is at least 70%, and required approvals are resolved.
| Bundle field | Required content | Why it matters |
|---|---|---|
| Linear snapshot | Title, labels, status, comments, attachments, reporter context, linked resources. | Prevents the agent from solving a stale or partial version of the ticket. |
| Tool map | Available MCPs, connectors, CLIs, skills, blocked tools, and selected routes. | Shows whether the run had the right evidence sources. |
| Operational memory | Wiki pages read, skipped, stale, or absent; query recipes and negative controls reused. | Explains routing decisions without converting memory into proof. |
| Specialist outputs | Required/skipped status, sources, findings, correlation IDs, confidence, gaps, next pivot. | Makes investigation reproducible by a reviewer. |
| Classification | Allowed classification, severity, confidence, root-cause hypothesis, and evidence gaps. | Determines whether the run may fix, block, no-code, vendor-route, or escalate. |
| Approval state | N/A, pending, approved, approver, timestamp, exact action, expiry. | Prevents implicit approval from leaking across risky actions. |
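The approval-state row implies that an approval is valid only for its exact action and only until it expires. A minimal sketch of that check, assuming a hypothetical `ApprovalRecord` shape and illustrative action strings:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ApprovalRecord:
    approver: str
    action: str            # the exact action that was approved
    approved_at: datetime
    expires_at: datetime


def approval_covers(record: ApprovalRecord, action: str, now: datetime) -> bool:
    """An approval counts only for its exact action and only before expiry."""
    return record.action == action and record.approved_at <= now < record.expires_at


granted = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
record = ApprovalRecord(
    approver="oncall-lead",          # hypothetical approver
    action="sentry_resolve_issue",   # hypothetical action string
    approved_at=granted,
    expires_at=granted + timedelta(hours=1),
)
```

Matching on the exact action string is what keeps an approval for one risky step from being reused for a different one.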
Implementation may start when
The bundle exists, required specialists have returned or been explicitly skipped with reason, confidence is at least 70%, approvals are satisfied, and the fix scope is narrow enough for a PR.
Implementation must stop when
Evidence is missing, tool access is blocked, specialists conflict materially, confidence is below 70%, production mutation is required, or the next action belongs to a human, vendor, or incident process.
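The start and stop conditions above can be collapsed into one gate that returns every reason implementation is blocked. This is a sketch under assumptions: the `EvidenceBundle` fields shown are a hypothetical subset of the bundle contract, and only the 70% threshold is taken directly from the text.

```python
from dataclasses import dataclass
from typing import Dict, List

CONFIDENCE_THRESHOLD = 0.70  # the 70% gate stated above


@dataclass
class EvidenceBundle:
    confidence: float
    # specialist -> "returned", "skipped" (reason recorded elsewhere), or "pending"
    specialist_status: Dict[str, str]
    approvals_satisfied: bool
    specialists_conflict: bool = False
    needs_production_mutation: bool = False


def implementation_blockers(bundle: EvidenceBundle) -> List[str]:
    """Return every reason implementation must not start; empty means go."""
    blockers: List[str] = []
    if bundle.confidence < CONFIDENCE_THRESHOLD:
        blockers.append("confidence below 70%")
    pending = sorted(
        name for name, status in bundle.specialist_status.items()
        if status == "pending"
    )
    if pending:
        blockers.append("specialists pending: " + ", ".join(pending))
    if not bundle.approvals_satisfied:
        blockers.append("approvals unresolved")
    if bundle.specialists_conflict:
        blockers.append("specialists materially conflict")
    if bundle.needs_production_mutation:
        blockers.append("production mutation required")
    return blockers
```

Returning the full list, instead of stopping at the first failure, gives the Linear update a complete account of why the run blocked.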
PRODE Wiki: Memory, Not Evidence
ADR-001 introduces a Karpathy-style LLM-readable wiki for operational memory. It helps route the investigation, but live systems remain canonical.
Wiki pages can supply known failure modes, query recipes, and negative controls, but they stay outside classification evidence.
Current Linear, Sentry, AWS, DB, vendor, release, support, and reproduction facts feed the bundle and decision.
Hard boundary
The wiki can say "this resembles an old request-aborted pattern; check these tags and negative controls." It cannot say "this ticket is request-aborted noise because a wiki page says similar tickets were noise."
Safety And Redaction
The suite is intentionally conservative because Linear, PRs, run artifacts, and wiki pages are all potential leak surfaces.
Allowed in handoffs
- Request IDs
- Sentry issue IDs
- Releases and versions
- Log groups
- Redacted source names
Never include
- Secrets, tokens, cookies, auth headers
- Customer contact details
- Payment identifiers and KYC fields
- Copied DB rows
- Full request, response, or vendor payloads
Controls added
The design includes a wiki-specific .gitignore, the docs/learnings/inbox.md learning path, scripts/prode-wiki-redaction-lint.sh, and the prode-wiki-redaction GitHub Actions workflow.
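To illustrate the kind of check a redaction lint performs, here is a small Python sketch. It does not reproduce scripts/prode-wiki-redaction-lint.sh: the three patterns are assumptions chosen for illustration, and the exact identifier patterns to block are still listed under Open Questions.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Illustrative patterns only; a real lint would need a reviewed pattern set.
PATTERNS: Dict[str, Pattern[str]] = {
    "auth_header": re.compile(r"\bauthorization\s*:\s*\S+", re.IGNORECASE),
    "bearer_token": re.compile(r"\bbearer\s+[a-z0-9._\-]{20,}", re.IGNORECASE),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def lint(text: str) -> List[Tuple[int, str]]:
    """Return (line number, pattern name) for every suspicious line."""
    findings: List[Tuple[int, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

A handoff line that contains only a Sentry issue ID, a release, and a log group passes, while a pasted header or customer email address is reported with its line number.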
Tool And Permission Model
Production engineering agents should be powerful at reading and conservative at mutating. Tool access is granted by role, phase, and evidence source rather than by convenience.
| Capability | Coordinator | Specialists | Automation runner |
|---|---|---|---|
| Linear read/write | Reads and writes progress/handoff. | Read supplied context only; no writes. | Writes idempotent run-owned sections and comments. |
| Sentry | Reads and records evidence; mutations require approval. | Sentry specialist reads issue/events/tags only. | Never resolves or ignores issues without approval. |
| AWS / CloudWatch | Read-only evidence with verified account and region. | AWS specialist uses read allowlist only. | Blocks or requests approval for writes. |
| Production DB | Read-only narrow checks when needed. | DB specialist uses configured SELECT-only path. | Blocks non-SELECT, locks, migrations, backfills, and side effects. |
| GitHub / PR | Creates and updates PR after evidence gate. | No branch, commit, push, or PR mutation. | Creates or resumes one run-owned branch/PR. |
| Merge / deploy | Never without explicit human instruction. | Blocked. | Blocked unless an explicit approval record exists; default is await human review. |
Production DB Read Access
Production DB read access is part of the workflow, but it is not a blank cheque. The DB specialist confirms narrow production state only when needed to classify, reproduce, or explain a backend symptom.
Connection
Use a physically read-only role when available. If unknown, state the gap before querying.
Query shape
SELECT only, narrow predicates, explicit LIMIT, no SELECT *.
Output
Prefer counts, statuses, timestamps, and aggregate checks. Do not copy rows into handoffs.
Stop condition
If a write, lock, migration, backfill, side-effecting function, FOR UPDATE, or production
mutation seems necessary, stop and request explicit human approval through the coordinator.
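The query-shape rules and the stop condition can be pre-checked before a query reaches the read replica. The sketch below is a string heuristic, not a SQL parser, and it does not replace read-only credentials; the keyword list and the example table name are assumptions.

```python
import re
from typing import List

# Keywords that suggest a write, DDL, or lock; the list is illustrative.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|alter|drop|create|truncate|grant|lock|call|copy)\b"
)


def query_violations(sql: str) -> List[str]:
    """Return the rules a query breaks; an empty list means it may run."""
    stripped = sql.strip().rstrip(";").strip()
    lowered = stripped.lower()
    problems: List[str] = []
    if ";" in stripped:
        problems.append("multiple statements")
    if not lowered.startswith("select"):
        problems.append("not a SELECT")
    if re.search(r"\bselect\s+\*", lowered):
        problems.append("SELECT * is not allowed")
    if not re.search(r"\blimit\s+\d+\b", lowered):
        problems.append("missing explicit LIMIT")
    if re.search(r"\bfor\s+(update|share)\b", lowered):
        problems.append("row locking clause")
    if FORBIDDEN.search(lowered):
        problems.append("write or DDL keyword")
    return problems
```

Any non-empty result should stop the DB specialist and go back to the coordinator as an approval request, not be worked around by rewriting the query.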
Rollout Plan
Phase 1: Canonical suite
Make prode-triage authoritative, retain specialist and runner references, update repo entrypoints, and remove or wrap old prod-agent* skills.
Phase 2: Safety substrate
.gitignore, redaction lint script, and GitHub Actions workflow.
Phase 3: Specialist expansion
Phase 4: Automation runner
Phase 5: PRODE Wiki
index.md, and keep promotion
behind review.
Phase 6: Hardening
Acceptance Checklist
Treat the Agent Production Engineer as ready only when these checks can be demonstrated in a dry run or low-risk ticket. A pretty diagram is not enough; the setup must survive retries, missing tools, and review.
sentry-issue labels reliably invoke the skill.
SELECT queries only.
Open Questions
- Is production DB access physically constrained to read-only credentials in every runner environment?
- Which exact customer/account/entity/KYC identifier patterns should redaction lint block?
- Where should full evidence bundles live when they are too sensitive for git but need durable retention?
- Who owns the review cadence for active PRODE Wiki pages?
- Which CI status should be required before merging PRs that touch wiki or learning paths?
- Which MCP/tool names should be canonical in each host environment?
Source Documents
Read the full ADRs when you want the complete detail: