# ADR-002: Consolidate PRODE Engineering into an Evidence-First Agent Suite

## Status

Proposed

## Date

2026-07-01

## Context

Production engineering work in this repository is high-risk because it touches live customer behaviour, production observability, production infrastructure, vendor integrations, financial data, KYC/compliance surfaces, and production database state.

Historically, agent support for this work was split across multiple smaller Claude skill wrappers and references. That split made it easy for agents to start at the wrong layer or skip a step:

- reading code before pulling the Linear ticket
- treating a Sentry issue title as the diagnosis
- using one evidence source when the ticket required several
- writing incomplete Linear handoffs
- duplicating branches, PRs, and comments on retry
- conflating "known pattern" with proof of the current production incident
- treating production DB read access as an ordinary debugging tool
- over-trusting hidden chat context instead of durable run artifacts

The suite now needs to support two operating modes:

1. **Interactive engineer-agent mode:** a human asks an agent to triage, fix, verify, and hand off a PRODE ticket.
2. **Automated runner mode:** an automated service executes the same workflow durably, with idempotency, explicit terminal states, approval gates, specialist routing, review, PR creation, Linear handoff, and post-merge monitoring.

Both modes need the same core contract: production evidence first, implementation second.

## Decision

Create a single PRODE engineering agent suite centered on `.agents/skills/prode-triage/SKILL.md`.

The suite consists of:

- one authoritative top-level skill: `.agents/skills/prode-triage/SKILL.md`
- one specialist protocol: `.agents/skills/prode-triage/references/specialists.md`
- one automation runner contract: `.agents/skills/prode-triage/references/automation-runner.md`
- repo-level agent entrypoints in `AGENTS.md` and `CLAUDE.md`
- a redaction and learning substrate:
  - `docs/learnings/inbox.md`
  - `docs/prode-wiki/.gitignore`
  - `scripts/prode-wiki-redaction-lint.sh`
  - `.github/workflows/prode-wiki-redaction.yml`
- an operational-memory design captured separately in [ADR-001: Introduce a PRODE Wiki for Production Engineering Operational Memory](./ADR-001-prode-wiki-operational-memory.md)

The old split `.claude/skills/prod-agent*` suite is superseded by this consolidated suite. Compatibility can be provided by a symlink or wrapper only if it points agents back to the authoritative `.agents/skills/prode-triage/` implementation.

## Primary Principle

The suite is evidence-first.

For any matching production engineering ticket, agents must fetch the Linear ticket and linked production evidence before reading source code, proposing a fix, or editing files.

Evidence includes, when relevant:

- Linear ticket content, labels, comments, screenshots, attachments, status, and linked resources
- Sentry issue details, representative events, tags, stack frames, breadcrumbs, request context, releases, affected users, and recurrence shape
- AWS/CloudWatch/ECS/RDS/ELB/Batch evidence, using narrow time windows and read-only commands
- vendor/integration evidence, such as provider status, webhook delivery, idempotency, retries, and provider correlation IDs
- product/support context, such as user-visible symptoms, screenshots, customer timestamps, and expected behaviour
- release/deploy correlation, such as app versions, Sentry releases, backend deploy windows, feature flags, and suspect commits
- configured production read-only DB checks, using narrow `SELECT` queries only
- safe reproduction evidence from local, staging, emulator, browser, or read-only API paths

Source-code investigation starts only after the evidence gate.

## Scope

The suite applies when:

- a Linear ticket has prefix `PRODE-`
- a Linear ticket carries the `sentry-issue` label
- a Sentry production issue under `islamicfinanceguru-limited` needs diagnosis
- a CloudWatch, ECS, RDS, AWS Batch, ELB, API, vendor, or production DB signal needs diagnosis
- the user explicitly asks to reuse the PRODE production engineering process

The suite does not apply to ordinary feature work, non-production bug fixes, repo hygiene, or exploratory refactors unless explicitly requested.

## Architecture

### 1. Authoritative Skill

`.agents/skills/prode-triage/SKILL.md` is the primary human-readable workflow. It defines:

- triggers
- vocabulary
- MCP/connector discovery
- Linear requirements
- Sentry requirements
- AWS/CloudWatch requirements
- PRODE Wiki operational-memory consultation
- workflow steps
- classification vocabulary
- Linear progress and comment templates
- branch/worktree rules
- code investigation gates
- implementation rules
- verification rules
- review requirements
- PR and Linear handoff requirements

All repo-level agent instructions should point to this skill rather than duplicating its content.

### 2. Specialist Protocol

`.agents/skills/prode-triage/references/specialists.md` defines bounded investigation roles.

Specialists do not own the ticket. They do scoped evidence work and return a structured output. The coordinator owns classification, Linear updates, code changes, PRs, and final decisions.

Current specialist types:

| Specialist | Phase | Purpose |
| --- | --- | --- |
| Sentry | before code | establish what production Sentry actually observed |
| AWS / CloudWatch | before code | correlate backend and infrastructure symptoms |
| Vendor / Integration | before code | determine external provider or webhook involvement |
| Product / Support Context | before code | convert free-form reports into bounded investigation windows |
| Release / Deploy Correlation | before code | test deploy/regression/version hypotheses |
| DB Read-Replica | before code | confirm narrow production state through read-only SQL |
| Reproduction | after evidence source is known | safely reproduce outside production writes |
| Code Path | after evidence bundle exists | map production evidence to source files and functions |
| Review | after diff exists | challenge whether the fix follows from the evidence |

Specialists must receive a narrow input bundle and a tool allowlist from the coordinator. They must not discover broad new tools on their own, mutate Linear/GitHub/files/AWS/Sentry/production data, or edit code unless explicitly delegated a disjoint review/fix scope.
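As an illustration, the narrow input bundle and tool allowlist could be modelled as below. This is a minimal sketch: the `SpecialistRequest` type, the `authorize_tool_call` helper, the tool names, and the ticket number are hypothetical and are not taken from `specialists.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecialistRequest:
    """Narrow input bundle handed from the coordinator to one specialist."""
    specialist: str             # e.g. "sentry" (spelling is illustrative)
    ticket_id: str              # Linear ticket the work is scoped to
    question: str               # the bounded question to answer
    tool_allowlist: frozenset   # read-only tools chosen by the coordinator


def authorize_tool_call(request: SpecialistRequest, tool: str) -> None:
    """Raise if a specialist reaches for a tool outside its allowlist."""
    if tool not in request.tool_allowlist:
        raise PermissionError(
            f"{request.specialist} specialist may not call {tool!r}; "
            "the coordinator owns tool discovery"
        )


request = SpecialistRequest(
    specialist="sentry",
    ticket_id="PRODE-123",      # hypothetical ticket number
    question="What did production Sentry observe for this issue?",
    tool_allowlist=frozenset({"sentry.get_issue", "sentry.list_events"}),
)
authorize_tool_call(request, "sentry.get_issue")  # allowed, returns None
```

A frozen dataclass keeps the bundle immutable once handed over, so a specialist cannot widen its own allowlist mid-run.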

### 3. Automation Runner Contract

`.agents/skills/prode-triage/references/automation-runner.md` defines the durable control plane for automated execution.

The runner must provide:

- ticket-level leases
- idempotency namespaces
- durable phase state
- retry-safe activities
- explicit terminal states
- audit logs
- persisted artifacts
- tool policy
- human approval gates
- evidence bundle contract
- feedback-loop limits
- post-merge monitoring

The runner is the only automated component allowed to mutate Linear, GitHub, branches, worktrees, or run artifacts. Specialists are read-only unless the coordinator explicitly delegates a disjoint review/fix scope.

### 4. Operational Memory

ADR-001 defines the PRODE Wiki as operational memory.

The PRODE Wiki may be consulted after Linear fetch and before specialist routing. It can suggest known failure modes, query recipes, negative controls, and approval boundaries.

The PRODE Wiki is not evidence. It must not raise classification confidence by itself, and it must not appear in the classification evidence section.

Candidate learnings from PRODE runs are written to `docs/learnings/inbox.md` with `destination: prode-wiki`. Durable wiki pages require review, source links, redaction checks, metadata, and staleness handling.

## Workflow

The intended flow is:

```text
Linear ticket
  -> tool/MCP inventory
  -> PRODE Wiki operational-memory read, if available
  -> specialist routing
  -> evidence bundle
  -> classification
  -> Linear progress update
  -> code path investigation
  -> implementation
  -> verification
  -> review
  -> PR
  -> Linear handoff
  -> human review and merge
  -> post-merge monitor
  -> candidate learning
```

Automated runner states:

```text
QUEUED
PRECHECK
FETCH_LINEAR
READ_PRODE_WIKI
ROUTE_SPECIALISTS
WAIT_FOR_EVIDENCE
CLASSIFY
BLOCKED_NEEDS_INFO
APPROVAL_REQUIRED
CREATE_WORKTREE
IMPLEMENT
VERIFY
REVIEW
OPEN_PR
LINEAR_HANDOFF
AWAIT_HUMAN_REVIEW
POST_MERGE_MONITOR
WRITE_CANDIDATE_LEARNING
RUN_COMPLETE
NO_CODE_CHANGE
NOISE_NO_FIX
VENDOR_OR_INFRA
CANNOT_AUTO_FIX
ESCALATED_INCIDENT
SUPERSEDED
FAILED_RETRYABLE
FAILED_FINAL
```

The runner may only enter implementation after the evidence bundle exists, confidence is at least 70%, and required approvals are satisfied.
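The gate above can be sketched as a small pure function. The state names come from the runner state list; the `RunState` type, the approval labels, and the mapping from each failed condition to a specific state are assumptions made for illustration, not part of `automation-runner.md`.

```python
from dataclasses import dataclass, field
from typing import Optional

CONFIDENCE_THRESHOLD = 0.70  # "at least 70%" from this ADR


@dataclass
class RunState:
    evidence_bundle: Optional[dict] = None
    confidence: float = 0.0
    required_approvals: set = field(default_factory=set)
    granted_approvals: set = field(default_factory=set)


def next_state_before_implementation(run: RunState) -> str:
    """Return the state the runner should enter instead of jumping to IMPLEMENT."""
    if run.evidence_bundle is None:
        return "WAIT_FOR_EVIDENCE"      # assumed mapping: no bundle yet
    if run.confidence < CONFIDENCE_THRESHOLD:
        return "BLOCKED_NEEDS_INFO"     # assumed mapping: ask for more evidence
    if not run.required_approvals <= run.granted_approvals:
        return "APPROVAL_REQUIRED"      # at least one approval is outstanding
    return "CREATE_WORKTREE"            # all three conditions hold
```

Keeping the gate free of side effects makes it easy to re-evaluate after a retry or a resumed run.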

## Evidence Bundle Contract

Before implementation, every run must have an evidence bundle that records:

- Linear ticket snapshot
- operational memory consulted, or none
- required/skipped status for every specialist
- specialist outputs
- classification
- severity
- confidence
- evidence gaps
- human approval state

Operational memory may explain routing and query selection only. It is not classification evidence.
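One possible shape for the bundle is sketched below. The field names, the specialist identifier spellings, and the `bundle_problems` helper are hypothetical; the split into pre-code and later specialists follows the phase column of the specialist table, with Reproduction placed in the later group as an assumption.

```python
from dataclasses import dataclass, field

PRE_CODE_SPECIALISTS = (
    "sentry", "aws-cloudwatch", "vendor-integration",
    "product-support", "release-deploy", "db-read-replica",
)
LATER_SPECIALISTS = ("reproduction", "code-path", "review")
SPECIALISTS = PRE_CODE_SPECIALISTS + LATER_SPECIALISTS


@dataclass
class EvidenceBundle:
    linear_snapshot: dict
    operational_memory: list    # wiki pages consulted, [] for none
    specialist_status: dict     # name -> "required" | "skipped"
    specialist_outputs: dict    # name -> structured specialist output
    classification: str
    severity: str
    confidence: float           # 0.0 to 1.0
    evidence_gaps: list = field(default_factory=list)
    approval_state: str = "not-required"


def bundle_problems(bundle: EvidenceBundle) -> list:
    """List reasons the bundle is not yet complete enough for implementation."""
    problems = []
    for name in SPECIALISTS:
        status = bundle.specialist_status.get(name)
        if status not in ("required", "skipped"):
            problems.append(f"{name}: status must be required or skipped")
        elif (status == "required" and name in PRE_CODE_SPECIALISTS
              and name not in bundle.specialist_outputs):
            problems.append(f"{name}: required but no output recorded")
    if not 0.0 <= bundle.confidence <= 1.0:
        problems.append("confidence must be between 0 and 1")
    return problems
```

Operational memory is recorded in its own field rather than alongside specialist outputs, which keeps it visibly separate from classification evidence.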

## Classification

Allowed classifications are intentionally constrained:

- `bug`
- `regression`
- `noise:user-cancel`
- `noise:request-aborted`
- `noise:validation`
- `noise:platform`
- `vendor-incident`
- `infra-capacity`
- `needs-investigation`

If confidence is below 70%, the suite must stop implementation and ask for more evidence.

Noise classifications require narrow predicates and negative controls. A known cause is not automatically noise. A user-facing failure in auth, login, payments, onboarding, money movement, compliance, or KYC is not noise merely because the cause is understood.
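A minimal sketch of how these rules might be checked mechanically follows. The classification labels are the ones listed above; the `classification_problems` helper, its keyword arguments, and the domain identifier spellings are hypothetical. Flagging every noise label in a risky domain for coordinator review is a deliberately conservative reading of the rule, not something this ADR mandates.

```python
ALLOWED_CLASSIFICATIONS = frozenset({
    "bug", "regression",
    "noise:user-cancel", "noise:request-aborted",
    "noise:validation", "noise:platform",
    "vendor-incident", "infra-capacity", "needs-investigation",
})

# Domains where a user-facing failure is not noise just because it is understood.
RISKY_DOMAINS = frozenset({
    "auth", "login", "payments", "onboarding",
    "money-movement", "compliance", "kyc",
})


def classification_problems(label, *, has_narrow_predicate=False,
                            has_negative_control=False,
                            user_facing_domain=None):
    """Return reasons a proposed classification should not be accepted as-is."""
    if label not in ALLOWED_CLASSIFICATIONS:
        return [f"unknown classification: {label}"]
    problems = []
    if label.startswith("noise:"):
        if not has_narrow_predicate:
            problems.append("noise requires a narrow predicate")
        if not has_negative_control:
            problems.append("noise requires a negative control")
        if user_facing_domain in RISKY_DOMAINS:
            problems.append(
                f"user-facing {user_facing_domain} failure: "
                "a known cause alone does not make it noise"
            )
    return problems
```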

## Tooling and MCP Discovery

The coordinator inventories available MCP tools, plugins, connectors, CLIs, and skills before routing specialists.

Preferred mappings:

| Evidence source | Preferred route |
| --- | --- |
| Linear | Linear MCP/app first |
| Sentry | Sentry MCP/plugin first |
| AWS / CloudWatch | AWS CLI or AWS MCP with the read allowlist |
| Product/support | Linear attachments/comments first, then linked support/Slack/Notion/Drive/CRM sources |
| Release/deploy | Sentry releases, Linear releases, GitHub/CI/deploy metadata, feature flags |
| Vendor/integration | provider dashboards/APIs/status pages/webhook logs/application logs |
| DB read-replica | configured production read-only SQL path |
| Code/review | local filesystem and git after evidence gate |

Tool discovery is coordinator-owned. Specialists receive only the relevant read-only tool allowlist.

If a mapped tool is unavailable, the suite must not silently substitute a weaker source. It must record the missing tool, confidence impact, and safe fallback.
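The "no silent substitution" rule can be pictured as a resolution step that always returns an explicit record. In this sketch the `ToolResolution` type, the `resolve_tool` helper, and the tool names are hypothetical, and the default `"unassessed"` impact is a placeholder the coordinator would replace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolResolution:
    source: str                         # evidence source, e.g. "sentry"
    tool: Optional[str]                 # tool actually used, None if missing
    missing_tool: Optional[str] = None
    confidence_impact: Optional[str] = None
    safe_fallback: Optional[str] = None


def resolve_tool(source, preferred, available, *,
                 safe_fallback=None, confidence_impact="unassessed"):
    """Resolve the preferred tool or return an explicit gap record."""
    if preferred in available:
        return ToolResolution(source=source, tool=preferred)
    # Never swap in a weaker source here: record the gap for the coordinator.
    return ToolResolution(
        source=source,
        tool=None,
        missing_tool=preferred,
        confidence_impact=confidence_impact,
        safe_fallback=safe_fallback,
    )
```

Because the fallback is only recorded and never invoked by this function, choosing to use it remains an explicit coordinator decision.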

## Data and Redaction Rules

The suite is allowed to summarize production evidence. It is not allowed to leak sensitive material into git, Linear, PRs, comments, docs, prompts, or run artifacts.

Never include:

- secrets
- tokens
- cookies
- authorization headers
- customer email, phone, or name
- account or entity IDs unless explicitly approved and redacted
- balances
- payment identifiers
- KYC fields
- full request/response bodies
- copied DB rows
- full vendor payloads
- complete stack traces containing sensitive payloads

Linear and PR text may include:

- request IDs
- Sentry issue IDs
- releases
- versions
- log groups
- redacted source names

The redaction substrate includes:

- `docs/prode-wiki/.gitignore`, which allowlists curated Markdown and blocks common raw/export formats
- `scripts/prode-wiki-redaction-lint.sh`
- `.github/workflows/prode-wiki-redaction.yml`
- existing gitleaks checks

Gitleaks is not sufficient for PRODE Wiki safety. The PRODE-specific lint is mandatory for wiki and learning paths, but it is still a backstop, not proof of safety.
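To make the idea of a pattern-based backstop concrete, here is a minimal Python sketch. It does not reproduce `scripts/prode-wiki-redaction-lint.sh`; the `PATTERNS` table and the `find_redaction_hits` helper are hypothetical, and the three patterns cover only a small part of the "never include" list.

```python
import re

# Illustrative patterns only; a real lint would need platform-specific rules.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "authorization-header": re.compile(r"(?i)\bauthorization\s*:\s*\S+"),
    "cookie-header": re.compile(r"(?i)\b(?:set-)?cookie\s*:\s*\S+"),
}


def find_redaction_hits(text: str) -> list:
    """Return (line_number, pattern_name) pairs for suspicious lines."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


sample = "Sentry issue 123\nAuthorization: Bearer abc\nmail user@example.com"
print(find_redaction_hits(sample))
```

A check like this reports only line numbers and pattern names, never the matched value, so the lint output does not itself leak the material it found.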

## Production DB Read Access

Production DB read access is configured and available for this workflow.

The canonical DB investigation boundary lives in `.agents/skills/prode-triage/references/specialists.md`.

The core rules are:

- use a physically read-only role when available
- if the current connection is not known to be read-only, state that gap before querying
- `SELECT` only
- prefer read replicas
- no writes
- no locks
- no migrations
- no backfills
- no advisory lock calls
- no side-effecting functions
- no `FOR UPDATE`
- no stacked statements
- narrow predicates based on persisted evidence
- explicit `LIMIT` on row-returning queries
- no `SELECT *`
- prefer counts, statuses, timestamps, and aggregate checks
- do not copy row output into Linear, PRs, wiki pages, or run summaries

If a write, lock, migration, backfill, or production mutation seems necessary, the suite must stop and request explicit human approval through the coordinator.
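Several of the rules above can be approximated by a textual pre-check before a query is sent. The sketch below is an assumption-laden illustration: `check_readonly_query` is a hypothetical helper, the keyword list is not exhaustive, the `pg_advisory` prefix assumes PostgreSQL (which this ADR does not state), and the table and column names in the usage line are invented. It is stricter than the rules in two ways: it requires `LIMIT` on every query and rejects anything that does not start with `SELECT`. It is not a SQL parser and does not replace a physically read-only role.

```python
import re

# Whole-word matches only, so columns like updated_at or created_at pass.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|alter|drop|create|truncate|grant|revoke|"
    r"lock|copy|call|vacuum|pg_advisory\w*)\b",
    re.IGNORECASE,
)


def check_readonly_query(sql: str) -> list:
    """Return rule violations found in the query text (empty list means pass)."""
    problems = []
    stripped = sql.strip().rstrip(";").strip()
    lowered = stripped.lower()
    if not lowered.startswith("select"):
        problems.append("only SELECT statements are allowed")
    if ";" in stripped:
        problems.append("stacked statements are not allowed")
    if re.search(r"\bselect\s+\*", lowered):
        problems.append("SELECT * is not allowed")
    if FORBIDDEN.search(stripped):
        problems.append("write, lock, or side-effecting keyword found")
    if not re.search(r"\blimit\s+\d+\b", lowered):
        problems.append("explicit LIMIT is required")
    return problems


query = "SELECT status, updated_at FROM transfers WHERE request_id = 'r1' LIMIT 5"
print(check_readonly_query(query))
```

The check errs toward false positives (for example, a semicolon inside a string literal is rejected), which matches the rule that anything unusual should stop and go back to the coordinator.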

## Linear and PR Handoff

Linear is the canonical progress log.

For every routed ticket, the suite must update or append the run-owned `## Agent Progress` section and create required triage and final handoff comments.

Required handoff content:

- Sentry evidence or access blocker
- AWS evidence when relevant
- other evidence when relevant
- classification
- severity
- confidence
- root cause
- decision
- fix plan
- verification results or blockers
- PR link when present
- next step

For `sentry-issue` tickets, Sentry fields are mandatory unless Sentry access is blocked and recorded.

PRs are the delivery unit. The runner may open and update PRs; humans review and merge. The runner must never merge a PR or move a Linear ticket to Done/Cancelled without fresh explicit approval.

## Automation Safety

Automated runs must be safe to retry.

All mutation activities must check for an existing run-owned object before creating a new one:

- Linear progress section
- Linear comments
- branch/worktree
- PR
- candidate learning

Retries must not duplicate Linear comments, branches, PRs, or specialist findings.
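The check-before-create rule can be sketched with a deterministic key per run-owned object. Everything here is hypothetical: the `idempotency_key` format, the `RunOwnedObjects` class, and the in-memory dictionary, which stands in for whatever durable store or remote lookup the runner actually uses.

```python
import hashlib


def idempotency_key(ticket_id: str, kind: str, slot: str = "default") -> str:
    """Derive a stable key so a retry finds the object an earlier attempt made."""
    raw = f"{ticket_id}:{kind}:{slot}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


class RunOwnedObjects:
    """In-memory stand-in for a durable registry of run-owned objects."""

    def __init__(self):
        self._objects = {}
        self.create_calls = 0

    def ensure(self, ticket_id, kind, create, slot="default"):
        """Return the existing object, or create and record it exactly once."""
        key = idempotency_key(ticket_id, kind, slot)
        if key in self._objects:
            return self._objects[key]   # retry path: reuse, do not duplicate
        self.create_calls += 1
        obj = create()
        self._objects[key] = obj
        return obj


store = RunOwnedObjects()
first = store.ensure("PRODE-123", "pr", lambda: {"url": "pr-1"})
retry = store.ensure("PRODE-123", "pr", lambda: {"url": "pr-2"})
print(first is retry, store.create_calls)
```

In a real runner the lookup would have to consult the external system (for example, a marker in the Linear comment or PR body), since a process-local dictionary does not survive a crash.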

The runner must stop at explicit terminal states rather than silently continuing when:

- confidence is too low
- evidence is missing
- credentials are unavailable
- production approval is required
- specialist outputs conflict materially
- tests fail because of the agent diff
- merge conflicts need product judgment
- a human or newer PR supersedes the run
- the fix crosses product, migration, data, vendor, or infrastructure boundaries the runner may not own

## Review Strategy

Every fix should receive a critical review pass before PR handoff.

The reviewer should challenge:

- whether diagnosis follows from current evidence
- whether operational memory was used only for routing
- whether code scope is minimal
- whether risky domains and approval gates were respected
- whether telemetry/noise handling has a negative control
- whether tests cover the production payload shape or the blocker is documented
- whether Linear/PR output is redacted

Review findings must cite files and lines when reviewing code.

## Alternatives Considered

### Keep the Split `prod-agent*` Skills

Pros:

- lower migration cost
- smaller individual files
- familiar to existing Claude workflows

Cons:

- duplicated rules drift
- agents may choose the wrong entrypoint
- production evidence gates are easier to bypass
- hard to add automation-runner durability coherently
- hard to share specialist routing and redaction policy

Rejected. Production engineering needs one authoritative workflow.

### Use Only Human-Run Checklists

Pros:

- simple
- no automated runner design needed
- less infrastructure

Cons:

- not retry-safe
- hidden chat context is lost
- handoffs vary by agent
- automation cannot safely resume
- duplicated Linear/PR artifacts are more likely

Rejected. The stated goal includes an automated runner that executes the same workflow durably, which human-run checklists alone cannot provide.

### Let Specialists Own Full Ticket Execution

Pros:

- specialists could independently investigate and fix
- simpler coordinator at first glance

Cons:

- unclear ownership
- higher mutation risk
- harder to enforce Linear and PR idempotency
- harder to enforce evidence gates
- harder to prevent duplicate branches and comments

Rejected. Specialists are bounded and read-only; the coordinator owns the run.

### Treat the PRODE Wiki as Evidence

Pros:

- faster classification for known patterns
- less live querying

Cons:

- stale memory could misclassify active incidents
- source systems remain canonical
- historical similarity is not proof
- agents would over-trust prior patterns

Rejected. The wiki is operational memory only.

### Depend Only on Gitleaks for Redaction

Pros:

- already present
- catches many secret-shaped values

Cons:

- does not reliably detect PII, financial data, KYC fields, or business identifiers
- can be bypassed locally
- existing configuration is not sufficient for PRODE Wiki content

Rejected. PRODE-specific redaction lint and human review are required.

## Consequences

### Positive

- one canonical production engineering workflow
- fewer stale or competing agent entrypoints
- evidence-first discipline becomes explicit
- specialists become reusable and bounded
- automated runs become resumable and auditable
- Linear handoffs become consistent
- PR creation and post-merge monitoring are part of the same lifecycle
- production DB read access is usable without normalizing unsafe broad queries
- operational memory can accumulate without replacing evidence
- redaction policy has actual guardrail files behind it

### Negative

- larger skill surface area
- more process for simple tickets
- more documentation to keep current
- runner implementation must respect a non-trivial state machine
- redaction lint can produce false positives and false negatives
- physical read-only DB enforcement remains an open operational dependency until it is confirmed
- PRODE Wiki pages can become stale if review metadata is ignored

### Mitigations

- use the smallest set of specialists needed for each ticket
- keep specialists read-only and bounded
- keep Linear as the canonical progress log
- keep operational memory out of classification evidence
- require the 70% confidence threshold before implementation
- require explicit approval for risky domains
- maintain redaction lint and wiki `.gitignore`
- revisit the suite after the first five PRODE tickets have run through it

## Implementation Plan

### Phase 1: Canonical Suite

- keep `.agents/skills/prode-triage/SKILL.md` as the only authoritative PRODE skill
- keep `references/specialists.md` as the specialist protocol
- keep `references/automation-runner.md` as the runner contract
- point `AGENTS.md` and `CLAUDE.md` to the consolidated suite
- remove or wrap old `.claude/skills/prod-agent*` entrypoints

### Phase 2: Safety Substrate

- create `docs/learnings/inbox.md`
- create `docs/prode-wiki/.gitignore`
- create `scripts/prode-wiki-redaction-lint.sh`
- create `.github/workflows/prode-wiki-redaction.yml`
- document that gitleaks is not sufficient for PRODE Wiki redaction

### Phase 3: Specialist Expansion

- keep Sentry and AWS specialists as first-class evidence branches
- add vendor/integration, product/support, release/deploy, DB read-replica, reproduction, code-path, and review specialists
- keep coordinator-owned tool discovery and specialist allowlists
- require structured specialist outputs

### Phase 4: Automation Runner

- implement durable state, leases, idempotency namespaces, retry-safe activities, and terminal outcomes
- implement `READ_PRODE_WIKI`
- implement `WRITE_CANDIDATE_LEARNING`
- enforce approval gates
- persist evidence bundles and audit logs
- implement post-merge monitoring

### Phase 5: PRODE Wiki

- implement ADR-001
- seed high-value query recipes and failure modes
- require active pages to be registered in `docs/prode-wiki/index.md`
- keep candidate learning promotion behind review

### Phase 6: Hardening

- confirm production DB access is physically read-only
- add platform-specific redaction patterns for customer/account/entity/KYC identifiers
- add an incident response runbook for leaked sensitive data
- evaluate whether markdown retrieval is sufficient or whether indexed retrieval needs a future ADR

## Open Questions

1. Is production DB access physically constrained to read-only credentials in every environment the runner will use?
2. What are the exact customer/account/entity/KYC identifier patterns the redaction lint should block?
3. Where should full evidence bundles live when they are too sensitive for git but need durable retention?
4. What is the owner and review cadence for active PRODE Wiki pages?
5. Which CI status should be required before merging PRs that touch `docs/prode-wiki/**` or `docs/learnings/**`?
6. Which MCP/tool names should be treated as canonical in each host environment?

## Acceptance Criteria

This ADR is implemented when:

- old `prod-agent*` entrypoints are removed or redirect to `prode-triage`
- `AGENTS.md` and `CLAUDE.md` name `prode-triage` as the authoritative workflow
- `prode-triage` includes MCP/tool discovery, PRODE Wiki consultation, specialist routing, classification, implementation, verification, review, PR, Linear handoff, and candidate-learning steps
- `specialists.md` defines every specialist with phase, boundaries, required checks, output shape, and completion criteria
- `automation-runner.md` defines durable state, idempotency, tool policy, approval gates, evidence bundle, feedback limits, and post-merge monitoring
- production DB read-only boundaries are documented and enforced at least by workflow, with physical read-only credentials confirmed or tracked as an open operational dependency
- Linear/PR redaction rules are explicit
- `docs/learnings/inbox.md` exists
- `docs/prode-wiki/.gitignore` exists
- `scripts/prode-wiki-redaction-lint.sh` exists and passes locally
- `.github/workflows/prode-wiki-redaction.yml` exists
- ADR-001 covers the PRODE Wiki operational-memory layer

## Related Files

- `.agents/skills/prode-triage/SKILL.md`
- `.agents/skills/prode-triage/references/specialists.md`
- `.agents/skills/prode-triage/references/automation-runner.md`
- `AGENTS.md`
- `CLAUDE.md`
- `docs/decisions/ADR-001-prode-wiki-operational-memory.md`
- `docs/learnings/inbox.md`
- `docs/prode-wiki/.gitignore`
- `scripts/prode-wiki-redaction-lint.sh`
- `.github/workflows/prode-wiki-redaction.yml`

## Decision Owner

Production engineering maintainers for this repository.

## Review Notes

Review this ADR after the first five PRODE tickets are run through the consolidated suite.

The review should answer:

- Did agents obey the evidence-first gate?
- Did specialists improve or degrade investigation quality?
- Did Linear comments contain enough evidence without leaking sensitive data?
- Did retries avoid duplicate comments, branches, PRs, and candidate learnings?
- Did the runner stop appropriately at approval gates?
- Did DB read access remain narrow and safe?
- Did the PRODE Wiki improve routing without being treated as evidence?
- Which parts of the suite were too heavy for low-risk tickets?