Agent Production Engineer Setup Guide

Decision In One Page

Consolidate production engineering work into one authoritative agent suite centered on `.agents/skills/prode-triage/SKILL.md`. The suite replaces the split `prod-agent*` skills with a single evidence-first process, bounded specialists, a durable runner contract, redaction controls, and an operational-memory layer.

Problem	Production tickets were easy for agents to approach from the wrong layer: code first, one evidence source, stale memory, duplicated artifacts, or incomplete handoffs.
Decision	Make `prode-triage` the coordinator and bind it to specialist routing, a durable runner contract, PRODE Wiki memory, redaction controls, and Linear-first progress.
Invariant	Current production evidence is required before implementation. Operational memory can route investigation, but it cannot prove the current incident.
Owner Model	The coordinator owns the ticket and final decision; specialists own bounded evidence; the runner owns durable automation; humans own production-write approval and PR merge.
Acceptance	A reviewer can reproduce the evidence path from Linear, specialist outputs, evidence bundle, PR handoff, and post-merge monitor without relying on hidden chat context.

What the suite does

Fetches Linear and live production evidence first, routes specialists, builds an evidence bundle, classifies, implements only when confidence and approvals allow, opens a PR, updates Linear, monitors after merge, and writes candidate learnings.

What the suite must not do

Treat old wiki memory as proof, read broad source code before evidence, mutate production without approval, leak customer or financial data, duplicate Linear/PR artifacts, or merge PRs by itself.

Evidence gate

Linear and live evidence precede source investigation, fix planning, and branch work.

Tool gate

The coordinator inventories MCPs, connectors, CLIs, and skills before routing specialists.

Approval gate

Production writes, PR merges, and unsafe operational actions require fresh human approval.

Memory gate

Wiki pages guide query selection and negative controls, but never count as proof.

Setup Contract

An Agent Production Engineer is not just a prompt. It is a controlled operating environment with a coordinator skill, read-only evidence tools, durable state, redaction controls, and explicit human approval gates. If any part is missing, the agent should degrade to investigation or block instead of pretending it can safely fix production.

Entry point	`.agents/skills/prode-triage/SKILL.md` is the normative workflow. Repo entrypoints such as `AGENTS.md` and `CLAUDE.md` should point to it rather than duplicating rules.
Secrets	Use secure local or platform secret storage for `LINEAR_API_KEY`, `GITHUB_TOKEN`, `SENTRY_AUTH_TOKEN`, AWS credentials, and production read-replica credentials. Never write secret values to prompts, docs, artifacts, git config, or issue text.
Tool map	The coordinator inventories Linear, Sentry, AWS/CloudWatch, GitHub, DB read-replica, vendor, support, release, browser, emulator, and code-review tools before routing specialists.
Runner state	Automated operation requires a durable store for leases, phase transitions, idempotency namespaces, artifact pointers, PR URLs, approval records, and terminal outcomes.
Human gates	Production writes, deploy changes, merges, Done/Cancelled Linear transitions, risky data boundaries, and broad telemetry suppression require explicit fresh approval.

Minimum viable agent

Can fetch Linear tickets and comments.
Can read linked Sentry and production evidence.
Can write Linear progress and open PRs.
Cannot write production or merge PRs.

Minimum viable runner

Has ticket leases and idempotency keys.
Persists every phase transition.
Stores redacted artifact pointers.
Resumes instead of duplicating work.

Minimum viable safety

Production DB is read-only by credential.
Specialists run with read-only tools.
Redaction lint checks wiki and learning paths.
Approvals expire and are action-specific.

Non-negotiable setup rule

If a required evidence tool is unavailable, the run records the missing tool and confidence impact. It must not silently substitute a weaker source and continue as though the same evidence standard was met.
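
The rule above can be sketched as a small inventory step. This is a minimal illustration, not the suite's implementation: the tool names, the `ToolInventory` type, and the `inventory_tools` function are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical tool names; the canonical names per host environment are an open question.
REQUIRED_TOOLS = {"linear", "sentry", "cloudwatch"}


@dataclass
class ToolInventory:
    available: set[str]
    missing: set[str] = field(default_factory=set)
    notes: list[str] = field(default_factory=list)


def inventory_tools(available: set[str], required: set[str]) -> ToolInventory:
    """Record missing required tools explicitly instead of substituting a weaker source."""
    missing = required - available
    notes = [
        f"missing tool: {name}; state the confidence impact in the run record"
        for name in sorted(missing)
    ]
    return ToolInventory(available=set(available), missing=missing, notes=notes)


# Example: CloudWatch is unavailable, so the gap is recorded rather than hidden.
inv = inventory_tools({"linear", "sentry"}, REQUIRED_TOOLS)
```

The point of the design is that a missing tool produces a visible record that the coordinator must account for when it states confidence.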

Canonical Files To Install

The setup should keep the operating contract in git so humans and agents read the same rules. Local chat context is not an acceptable source of truth for production engineering.

File	Purpose	Setup requirement
`.agents/skills/prode-triage/SKILL.md`	Coordinator workflow, triggers, Linear requirements, evidence gates, and handoff templates.	Must be the only normative top-level workflow.
`.agents/skills/prode-triage/references/specialists.md`	Specialist role definitions, tool boundaries, input bundle, and output shape.	Must be read whenever a ticket needs source-specific investigation.
`.agents/skills/prode-triage/references/automation-runner.md`	Durable runner contract: leases, state machine, idempotency, audit log, approvals, and post-merge monitor.	Required before any automated execution.
`AGENTS.md` / `CLAUDE.md`	Repo-level routing so production tickets invoke the PRODE workflow.	Should point to the skill instead of copying sections that can drift.
`docs/prode-wiki/`	Reviewed operational memory: failure modes, query recipes, and negative controls.	Must distinguish memory from current evidence.
`docs/learnings/inbox.md`	Candidate learnings from completed runs.	Automation writes candidates here, not directly into durable wiki pages.
`scripts/prode-wiki-redaction-lint.sh`	Prevents common sensitive artifact formats and risky content from entering the wiki.	Must run in CI for wiki and learning paths.

Architecture

The suite is deliberately split into one coordinator skill, specialist references, durable automation rules, and a separate operational-memory ADR.

Diagram 1: Suite Components

Correctness check

prode-triage is shown as a coordinator boundary, not as a node fanning out to peers. Inputs flow into it; controlled artifacts flow out.

Important boundary

The dashed wiki arrow is deliberate: operational memory can influence routing, but only the evidence bundle can support classification.

Coordinator

Owns the ticket, Linear updates, classification, implementation, PR, handoff, and final decision.

Specialists

Return bounded evidence. They do not mutate Linear, GitHub, AWS, Sentry state, files, or production data.

Runner

Makes automation retry-safe through leases, idempotency keys, artifacts, and explicit terminal states.

Evidence-First Workflow

The most important design choice is the ordering. Agents do not start from code. They start from the ticket and production evidence, inventory the available tools, then consult memory, then route specialists.

Diagram 2: Ticket To Handoff

Correction made

The workflow now includes tool/MCP inventory before wiki and specialist routing, matching the suite contract.

Stop paths are first-class

The red lane names non-happy exits so automation does not collapse every ticket into "fix and PR".

Fetch

Read Linear and linked production evidence. No source-code investigation yet.

Route

Inventory tools, consult PRODE Wiki as memory, then route the minimum required specialists.

Prove

Build a live evidence bundle with classification, confidence, and gaps.

Deliver

Implement, verify, review, open PR, update Linear, monitor after merge.

Ticket runbook

For each PRODE ticket: fetch Linear; inventory tools; consult PRODE Wiki; route specialists; persist specialist outputs; classify with confidence; update Linear; only then read code, create a worktree, implement, verify, review, open a PR, write the handoff, await human merge, monitor, and propose learning.
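
One way to make the runbook ordering checkable is to encode the steps as an ordered enum and refuse to enter a step until every earlier one has completed. The sketch below is an assumption-laden illustration: the `Phase` names are invented for this example and are not the canonical runner states, and stop paths are not modelled.

```python
from enum import IntEnum, auto


class Phase(IntEnum):
    """Runbook steps in order. Names are illustrative, not the canonical runner states."""
    FETCH_LINEAR = auto()
    INVENTORY_TOOLS = auto()
    CONSULT_WIKI = auto()
    ROUTE_SPECIALISTS = auto()
    PERSIST_SPECIALIST_OUTPUTS = auto()
    CLASSIFY = auto()
    UPDATE_LINEAR = auto()
    READ_CODE = auto()
    CREATE_WORKTREE = auto()
    IMPLEMENT = auto()
    VERIFY = auto()
    REVIEW = auto()
    OPEN_PR = auto()
    WRITE_HANDOFF = auto()
    AWAIT_HUMAN_MERGE = auto()
    MONITOR = auto()
    PROPOSE_LEARNING = auto()


def can_enter(phase: Phase, completed: set[Phase]) -> bool:
    """A phase may start only when every earlier phase has completed."""
    return all(earlier in completed for earlier in Phase if earlier < phase)


# Everything up to and including the Linear update is the evidence side of the gate.
evidence_done = {p for p in Phase if p <= Phase.UPDATE_LINEAR}
```

With this shape, `can_enter(Phase.READ_CODE, {Phase.FETCH_LINEAR})` is false, which is the "only then read code" rule expressed as a check.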

Specialist Routing

Specialists are bounded roles. They gather evidence from one source or one path, return structured findings, and stop. The coordinator owns the ticket.

Specialist	Phase	Use when	Boundary
Sentry	Before code	Sentry issue, crash, trace tags, production exception.	Read-only Sentry.
AWS / CloudWatch	Before code	5xxs, ECS/RDS/ELB/Batch, log correlation, infra symptoms.	Read-only AWS allowlist.
Vendor / Integration	Before code	Webhook, provider, payment, KYC, banking, broker, notification failures.	No provider-side mutation.
Product / Support	Before code	Screenshots, support reports, user-visible symptoms, expected behaviour.	Ticket-linked context only.
Release / Deploy	Before code	Deploy spikes, app versions, feature flags, Sentry releases.	Read-only release metadata.
DB Read-Replica	Before code	Production state must be confirmed to classify or reproduce.	Configured read-only `SELECT` only.
Reproduction	After evidence source known	Behaviour can be safely reproduced outside production writes.	Local, staging, emulator, browser, or read-only API paths.
Code Path	After evidence bundle	Root cause needs source-code grounding before implementation.	Source reads only; no edits.
Review	After diff	PR readiness needs independent challenge.	Diff review only unless explicitly delegated.

How specialists know which tools to read

Specialists do not discover tools broadly. The coordinator first inventories available MCP tools, connectors, CLIs, and skills, records the chosen tool map in the run artifact, then passes each specialist only the read-only tools relevant to its evidence source.

Linear: ticket snapshot, labels, comments, attachments, status.

Sentry: issue details, representative events, tags, releases, breadcrumbs.

AWS: STS identity, CloudWatch logs, ECS/RDS/ELB/Batch read-only evidence.

Vendor: provider status, webhook IDs, idempotency, retries, correlation IDs.

Release: deploy windows, Sentry releases, GitHub commits, feature flags.

DB: configured read-only SQL path, narrow SELECT queries only.

Reproduction: local, staging, emulator, browser, or read-only API paths.

Code path: local source reads only after the evidence bundle exists.
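
The per-specialist tool hand-off described above could look roughly like the following. The tool identifiers and the `SPECIALIST_TOOLS` mapping are hypothetical; real names depend on the MCPs, connectors, and CLIs in each host environment.

```python
# Hypothetical read-only tool identifiers, grouped by specialist.
SPECIALIST_TOOLS: dict[str, set[str]] = {
    "sentry": {"sentry.read_issue", "sentry.read_events"},
    "aws": {"aws.sts_identity", "aws.read_logs"},
    "db": {"db.select_readonly"},
}


def tools_for(specialist: str, inventory: set[str]) -> set[str]:
    """Give a specialist only the inventoried tools on its own read-only allowlist."""
    allowed = SPECIALIST_TOOLS.get(specialist, set())
    return allowed & inventory


# Example inventory: includes a write-capable tool that no specialist should receive.
inventory = {"sentry.read_issue", "db.select_readonly", "github.create_pr"}
sentry_tools = tools_for("sentry", inventory)
```

Intersecting an allowlist with the inventory means a specialist never receives a tool merely because it happens to be installed.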

Automation Runner

The runner is the durable control plane. It prevents duplicate comments, branches, PRs, and half-finished hidden-context work.

Diagram 3: Runner State Bands

State grouping

The bands compress the full runner enum into operating phases; the canonical state names remain in the runner reference.

Human boundary

Automation may await review and monitor after merge, but merge and production-write approvals remain human-owned gates.

Durable by design

Ticket lease
Idempotency namespace
Persisted phase state
Audit log and artifact pointers
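
As a rough illustration of how an idempotency namespace keeps retries from duplicating side effects, the sketch below derives a stable key per ticket, phase, and action and replays the recorded result on retry. The key format, the `InMemoryStore` class, and the ticket ID are assumptions; a real runner would back this with the durable store.

```python
import hashlib
from typing import Callable


def idempotency_key(ticket_id: str, phase: str, action: str) -> str:
    """Derive a stable key so a retried run can find its earlier side effect."""
    raw = f"{ticket_id}:{phase}:{action}".encode()
    return hashlib.sha256(raw).hexdigest()[:16]


class InMemoryStore:
    """Stand-in for the durable store; nothing here survives a process restart."""

    def __init__(self) -> None:
        self.done: dict[str, str] = {}

    def run_once(self, key: str, effect: Callable[[], str]) -> str:
        if key in self.done:
            return self.done[key]  # resume: reuse the recorded artifact pointer
        result = effect()
        self.done[key] = result
        return result


calls: list[str] = []


def post_comment() -> str:
    calls.append("posted")
    return "comment-1"  # hypothetical artifact pointer


store = InMemoryStore()
key = idempotency_key("PRODE-1", "classify", "linear-comment")
first = store.run_once(key, post_comment)
second = store.run_once(key, post_comment)  # retry: no second comment is posted
```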

Stops instead of guessing

Confidence below 70%
Evidence missing
Approval required
Specialists materially conflict

Evidence Bundle Contract

The evidence bundle is the handoff between investigation and implementation. The agent cannot implement until this bundle exists, confidence is at least 70%, and required approvals are resolved.

Bundle field	Required content	Why it matters
Linear snapshot	Title, labels, status, comments, attachments, reporter context, linked resources.	Prevents the agent from solving a stale or partial version of the ticket.
Tool map	Available MCPs, connectors, CLIs, skills, blocked tools, and selected routes.	Shows whether the run had the right evidence sources.
Operational memory	Wiki pages read, skipped, stale, or absent; query recipes and negative controls reused.	Explains routing decisions without converting memory into proof.
Specialist outputs	Required/skipped status, sources, findings, correlation IDs, confidence, gaps, next pivot.	Makes investigation reproducible by a reviewer.
Classification	Allowed classification, severity, confidence, root-cause hypothesis, and evidence gaps.	Determines whether the run may fix, block, no-code, vendor-route, or escalate.
Approval state	N/A, pending, approved, approver, timestamp, exact action, expiry.	Prevents implicit approval from leaking across risky actions.
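
The approval-state row can be read as a record that is valid only for one exact action and only inside its time window. A minimal sketch, assuming a hypothetical `ApprovalRecord` shape and action identifier format:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ApprovalRecord:
    action: str            # the exact action approved; identifier format is hypothetical
    approver: str
    approved_at: datetime
    expires_at: datetime


def approval_covers(record: ApprovalRecord, action: str, now: datetime) -> bool:
    """An approval applies only to its exact action and only before it expires."""
    return record.action == action and record.approved_at <= now < record.expires_at


granted = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
record = ApprovalRecord(
    action="sentry:resolve-issue",
    approver="reviewer-a",
    approved_at=granted,
    expires_at=granted + timedelta(hours=1),
)
```

Matching on the exact action string is what stops an approval for one risky step from being reused for another.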

Implementation may start when

The bundle exists, required specialists have returned or been explicitly skipped with reason, confidence is at least 70%, approvals are satisfied, and the fix scope is narrow enough for a PR.

Implementation must stop when

Evidence is missing, tool access is blocked, specialists conflict materially, confidence is below 70%, production mutation is required, or the next action belongs to a human, vendor, or incident process.
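
The start and stop conditions above can be expressed as a single gate that returns its reasons for stopping. This is a sketch under assumptions: the `EvidenceBundle` fields are simplified stand-ins for the bundle contract, and judgement calls such as "fix scope is narrow enough" or "the next action belongs to a human" are not modelled.

```python
from dataclasses import dataclass, field

CONFIDENCE_THRESHOLD = 0.70  # the 70% bar stated in this section


@dataclass
class EvidenceBundle:
    confidence: float
    pending_specialists: list[str] = field(default_factory=list)
    missing_evidence: list[str] = field(default_factory=list)
    blocked_tools: list[str] = field(default_factory=list)
    specialists_conflict: bool = False
    approvals_satisfied: bool = True
    needs_production_mutation: bool = False


def stop_reasons(bundle: EvidenceBundle) -> list[str]:
    """Return every reason implementation must not start; an empty list means go."""
    reasons = []
    if bundle.confidence < CONFIDENCE_THRESHOLD:
        reasons.append("confidence below 70%")
    if bundle.pending_specialists:
        reasons.append("required specialists neither returned nor skipped with reason")
    if bundle.missing_evidence:
        reasons.append("evidence missing")
    if bundle.blocked_tools:
        reasons.append("tool access blocked")
    if bundle.specialists_conflict:
        reasons.append("specialists materially conflict")
    if not bundle.approvals_satisfied:
        reasons.append("approval required")
    if bundle.needs_production_mutation:
        reasons.append("production mutation required")
    return reasons


def may_implement(bundle: EvidenceBundle) -> bool:
    return not stop_reasons(bundle)
```

Returning the full list of reasons, rather than a bare boolean, gives the runner something concrete to write into Linear when it stops.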

PRODE Wiki: Memory, Not Evidence

ADR-001 introduces a Karpathy-style LLM-readable wiki for operational memory. It helps route the investigation, but live systems remain canonical.

Diagram 4: Evidence Versus Memory

Memory lane

Wiki pages can supply known failure modes, query recipes, and negative controls, but they stay outside classification evidence.

Evidence lane

Current Linear, Sentry, AWS, DB, vendor, release, support, and reproduction facts feed the bundle and decision.

Hard boundary

The wiki can say "this resembles an old request-aborted pattern; check these tags and negative controls." It cannot say "this ticket is request-aborted noise because a wiki page says similar tickets were noise."
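
One possible way to enforce this boundary in code is to tag every finding with its lane and let classification see only the evidence lane. The `Lane` and `Finding` types and the sample findings are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    EVIDENCE = "evidence"  # current facts from live systems
    MEMORY = "memory"      # wiki pages, query recipes, negative controls


@dataclass(frozen=True)
class Finding:
    source: str
    lane: Lane
    summary: str


def classification_inputs(findings: list[Finding]) -> list[Finding]:
    """Only current evidence may support classification; memory is filtered out."""
    return [f for f in findings if f.lane is Lane.EVIDENCE]


def routing_hints(findings: list[Finding]) -> list[Finding]:
    """Memory may still shape which specialists and queries are chosen."""
    return [f for f in findings if f.lane is Lane.MEMORY]


findings = [
    Finding("wiki", Lane.MEMORY, "resembles an old request-aborted pattern"),
    Finding("sentry", Lane.EVIDENCE, "current events carry the tags the wiki suggested checking"),
]
```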

Safety And Redaction

The suite is intentionally conservative because Linear, PRs, run artifacts, and wiki pages are all potential leak surfaces.

Allowed in handoffs

Request IDs
Sentry issue IDs
Releases and versions
Log groups
Redacted source names

Never include

Secrets, tokens, cookies, auth headers
Customer contact details
Payment identifiers and KYC fields
Copied DB rows
Full request, response, or vendor payloads

Controls added

The design includes a wiki-specific .gitignore, the docs/learnings/inbox.md learning path, scripts/prode-wiki-redaction-lint.sh, and the prode-wiki-redaction GitHub Actions workflow.
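
To show the kind of check a redaction lint performs, here is a small Python sketch. It is not the contents of `scripts/prode-wiki-redaction-lint.sh`; the three patterns are illustrative assumptions, and the exact identifier patterns to block remain an open question.

```python
import re

# Illustrative patterns only; a real lint would need platform-specific additions.
PATTERNS = {
    "email address": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "bearer token": re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/-]{8,}=*"),
    "aws access key id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def lint(text: str) -> list[tuple[int, str]]:
    """Return (line number, label) for every line that matches a sensitive pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


# Fake sample: a request ID is allowed, contact details and auth headers are not.
sample = (
    "request_id=abc123\n"
    "contact: jane@example.com\n"
    "Authorization: Bearer abcdefgh12345678"
)
result = lint(sample)
```

A CI job would fail the build whenever `lint` returns a non-empty list for a file under the wiki or learning paths.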

Tool And Permission Model

Production engineering agents should be powerful at reading and conservative at mutating. Tool access is granted by role, phase, and evidence source rather than by convenience.

Capability	Coordinator	Specialists	Automation runner
Linear read/write	Reads and writes progress/handoff.	Read supplied context only; no writes.	Writes idempotent run-owned sections and comments.
Sentry	Reads and records evidence; mutations require approval.	Sentry specialist reads issue/events/tags only.	Never resolves or ignores issues without approval.
AWS / CloudWatch	Read-only evidence with verified account and region.	AWS specialist uses read allowlist only.	Blocks or requests approval for writes.
Production DB	Read-only narrow checks when needed.	DB specialist uses configured `SELECT`-only path.	Blocks non-SELECT, locks, migrations, backfills, and side effects.
GitHub / PR	Creates and updates PR after evidence gate.	No branch, commit, push, or PR mutation.	Creates or resumes one run-owned branch/PR.
Merge / deploy	Never without explicit human instruction.	Blocked.	Blocked unless an explicit approval record exists; the default is to await human review.

Production DB Read Access

Production DB read access is part of the workflow, but it is not a blank cheque. The DB specialist confirms narrow production state only when needed to classify, reproduce, or explain a backend symptom.

Connection

Use a physically read-only role when available. If it is unknown whether the role is read-only, state that gap before querying.

Query shape

SELECT only, narrow predicates, explicit LIMIT, no SELECT *.

Output

Prefer counts, statuses, timestamps, and aggregate checks. Do not copy rows into handoffs.

Stop condition

If a write, lock, migration, backfill, side-effecting function, FOR UPDATE, or production mutation seems necessary, stop and request explicit human approval through the coordinator.
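
The query shape and stop condition can be approximated with a text-level pre-check before a query is sent. This is a sketch, not a safety guarantee: a regex check cannot detect side-effecting functions and does not replace read-only credentials. The keyword list and the `payments` table in the example are assumptions.

```python
import re

# Keywords that suggest a write, lock, or schema change (illustrative, not exhaustive).
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|alter|drop|create|truncate|grant|lock|call|copy)\b",
    re.IGNORECASE,
)


def check_query(sql: str) -> list[str]:
    """Return the reasons a query breaks the read-only query shape; empty means OK."""
    problems = []
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        problems.append("multiple statements")
    if not re.match(r"(?i)select\b", stripped):
        problems.append("not a SELECT")
    if re.search(r"(?i)select\s+\*", stripped):
        problems.append("SELECT * is not allowed")
    if not re.search(r"(?i)\bwhere\b", stripped):
        problems.append("missing WHERE predicate")
    if not re.search(r"(?i)\blimit\s+\d+\b", stripped):
        problems.append("missing explicit LIMIT")
    if FORBIDDEN.search(stripped):  # also catches FOR UPDATE via "update"
        problems.append("write, lock, or side-effecting keyword")
    return problems


# An aggregate check in the preferred output style: counts by status, no copied rows.
ok_query = (
    "SELECT status, count(*) FROM payments "
    "WHERE created_at > now() - interval '1 hour' "
    "GROUP BY status LIMIT 20"
)
ok_result = check_query(ok_query)
```

When `check_query` returns any problem, the specialist stops and routes the request through the coordinator instead of rewriting the query until it passes.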

Rollout Plan

Phase 1: Canonical suite

Keep prode-triage authoritative, retain specialist and runner references, update repo entrypoints, and remove or wrap old prod-agent* skills.

Phase 2: Safety substrate

Add the learning inbox, wiki .gitignore, redaction lint script, and GitHub Actions workflow.

Phase 3: Specialist expansion

Keep Sentry/AWS first-class and add vendor, support, release, DB, reproduction, code-path, and review specialists.

Phase 4: Automation runner

Implement durable state, leases, idempotency, evidence bundles, approval gates, post-merge monitoring, and candidate-learning writes.

Phase 5: PRODE Wiki

Seed query recipes and failure modes, register active pages in index.md, and keep promotion behind review.

Phase 6: Hardening

Confirm physically read-only DB credentials, add platform-specific redaction patterns, and define where full evidence bundles live outside git.

Acceptance Checklist

Treat the Agent Production Engineer as ready only when these checks can be demonstrated in a dry run or low-risk ticket. A pretty diagram is not enough; the setup must survive retries, missing tools, and review.

Triggering: PRODE tickets and sentry-issue labels reliably invoke the skill.

Tool inventory: The run records available and missing MCPs/connectors/CLIs before specialists run.

Evidence gate: Source reads and implementation do not begin before Linear and required live evidence.

Specialists: Sentry, AWS, vendor, support, release, DB, reproduction, code-path, and review roles have bounded outputs.

DB safety: Production SQL uses read-only credentials and narrow SELECT queries only.

Idempotency: Retrying a run does not duplicate Linear comments, branches, PRs, or specialist artifacts.

Approval gates: Risky actions pause with exact approval records and expiry.

Redaction: Handoffs and wiki paths reject secrets, direct identifiers, raw rows, and full payloads.

Review: The PR explains evidence, fix, tests, residual risk, and does not claim more than the bundle proves.

Monitoring: Post-merge monitoring is read-only and records resolved, still recurring, or inconclusive.

Open Questions

Is production DB access physically constrained to read-only credentials in every runner environment?
Which exact customer/account/entity/KYC identifier patterns should redaction lint block?
Where should full evidence bundles live when they are too sensitive for git but need durable retention?
Who owns the review cadence for active PRODE Wiki pages?
Which CI status should be required before merging PRs that touch wiki or learning paths?
Which MCP/tool names should be canonical in each host environment?

Source Documents

Read the full ADRs when you want the complete detail: