Agent production engineering

Build an evidence-first agent that investigates production tickets, prepares safe fixes, and stops before irreversible actions.

Core model

The suite is a ticket-solving system, not a single prompt. It separates orchestration, evidence gathering, memory, implementation, and approval so each run can be reviewed.

Coordinator

Owns the Linear ticket, decides what evidence is needed, and chooses stop, handoff, or PR.

Specialists

Read bounded sources such as Sentry, AWS, the DB replica, vendors, releases, and code paths.

Evidence bundle

Combines facts, source links, timestamps, confidence, gaps, classification, and approval state.

PRODE Wiki

Stores operational memory that routes future investigations but never replaces current evidence.

Approval gates

Block production writes, risky deploys, broad Sentry suppression, PR merge, and final ticket closure until a human approves.
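
The approval gate can be pictured as an allow-list check in front of every side-effecting action. The following is a minimal Python sketch, not the suite's implementation: the action names, the `Approval` record, and the 30-minute freshness window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Actions this page lists as human-owned. The string names are hypothetical.
GATED_ACTIONS = frozenset({
    "production_write",
    "risky_deploy",
    "broad_sentry_suppression",
    "pr_merge",
    "ticket_closure",
})

# Assumption: an approval counts as "fresh" for 30 minutes.
APPROVAL_TTL = timedelta(minutes=30)


@dataclass(frozen=True)
class Approval:
    action: str            # the exact action the human approved
    approved_by: str
    approved_at: datetime


def is_allowed(action: str, approval: Optional[Approval], now: datetime) -> bool:
    """Return True if the action may proceed right now."""
    if action not in GATED_ACTIONS:
        return True        # ungated actions (for example, reads) pass through
    if approval is None:
        return False
    same_action = approval.action == action
    fresh = timedelta(0) <= now - approval.approved_at <= APPROVAL_TTL
    return same_action and fresh
```

An approval for one action does not carry over to another, and an old approval expires, so each gated step needs its own recent sign-off.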

Get started with a prompt

Paste this into a coding agent to start a PRODE run. It gives the agent the right shape before it touches code: read Linear, inventory tools, gather live evidence, then decide.

Starter prompt

Use for a dry run, ticket investigation, or runner implementation task.

You are running the PRODE production engineering workflow.

Start from the Linear ticket, not the source code.

1. Read the Linear ticket, labels, comments, attachments, linked Sentry/AWS/request IDs, and timestamps.
2. Inventory available tools: Linear, Sentry, AWS/CloudWatch, GitHub, production DB read replica, vendor/support sources, browser/emulator, and repo skills.
3. Read only relevant PRODE Wiki pages. Treat them as routing memory, not evidence.
4. Route read-only specialists for the evidence sources required by the ticket.
5. Build one evidence bundle: sources, facts, confidence, gaps, classification, and approval state.
6. If confidence is below 70%, evidence is missing, specialists conflict, or approval is needed, stop and write the handoff.
7. Only after the evidence gate may you read source code, patch the smallest safe path, verify, ask for review, open a PR, and update Linear.

Never perform production writes, deploy-risk changes, broad Sentry suppression, Linear Done/Cancelled transitions, or PR merge without fresh explicit human approval.

Install the coordinator workflow at .agents/skills/prode-triage/SKILL.md.
Connect read-only evidence tools for Linear, Sentry, AWS/CloudWatch, GitHub, and the production DB read replica.
Run one low-risk ticket in dry-run mode and check that the evidence bundle is reproducible from artifacts.
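
Step 6 of the starter prompt is the evidence gate. A minimal Python sketch of that rule follows. The 70% threshold and the four stop conditions come from the prompt; the `GateInput` type, its field names, and the reason strings are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Step 6 of the starter prompt: stop when confidence is below 70%.
CONFIDENCE_THRESHOLD = 0.70


@dataclass
class GateInput:
    confidence: float                                # 0.0 to 1.0
    gaps: List[str] = field(default_factory=list)    # missing evidence
    specialists_conflict: bool = False
    approval_needed: bool = False


def evidence_gate(bundle: GateInput) -> List[str]:
    """Return the reasons to stop; an empty list means the gate is open."""
    reasons = []
    if bundle.confidence < CONFIDENCE_THRESHOLD:
        reasons.append("confidence below threshold")
    if bundle.gaps:
        reasons.append("evidence missing: " + ", ".join(bundle.gaps))
    if bundle.specialists_conflict:
        reasons.append("specialists conflict")
    if bundle.approval_needed:
        reasons.append("approval needed")
    return reasons
```

Returning the reasons rather than a bare boolean lets the same result feed the handoff note when the run stops.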

How the run works

The runner owns durability. The coordinator owns the ticket decision. Specialists answer bounded evidence questions. Humans approve irreversible actions.

Intake

Lease the Linear ticket, read the ticket snapshot, and collect linked production evidence.

Investigate

Inventory tools, read relevant memory, and route read-only specialists.

Decide

Build the evidence bundle, set confidence, classify, and choose stop, handoff, or PR.

Deliver

Patch only after the evidence gate, verify, open a PR, update Linear, and monitor after human merge.

Source-code reads are not the first move. They begin only when current evidence identifies the likely system, path, or failure mode.
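
The four phases above can be modelled as a small state machine that refuses to skip ahead. This is a sketch under stated assumptions: the `STOPPED` terminal state, the transition table, and the choice to allow source reads only in Deliver are illustrative, not taken from the runner.

```python
from enum import Enum


class Phase(Enum):
    INTAKE = "intake"
    INVESTIGATE = "investigate"
    DECIDE = "decide"
    DELIVER = "deliver"
    STOPPED = "stopped"    # hypothetical terminal state for stop or handoff


# Each phase may move forward one step or stop; nothing skips ahead.
ALLOWED = {
    Phase.INTAKE: {Phase.INVESTIGATE, Phase.STOPPED},
    Phase.INVESTIGATE: {Phase.DECIDE, Phase.STOPPED},
    Phase.DECIDE: {Phase.DELIVER, Phase.STOPPED},
    Phase.DELIVER: {Phase.STOPPED},
    Phase.STOPPED: set(),
}


def advance(current: Phase, target: Phase) -> Phase:
    """Return the new phase, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def may_read_source(phase: Phase) -> bool:
    """Assumption: source-code reads are permitted only in Deliver."""
    return phase is Phase.DELIVER
```

For example, `advance(Phase.INTAKE, Phase.DELIVER)` raises `ValueError`, so a run cannot reach patching without passing through Investigate and Decide.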

Agent developer tools

Build the suite from small, inspectable pieces. Each piece has one job and a clear boundary.

Coordinator skill

Owns the ticket decision, evidence gate, classification, PR handoff, and Linear updates.

Specialist catalogue

Defines read-only roles for Sentry, AWS, DB, vendor, support, release, reproduction, code path, and review.

Automation runner

Persists leases, phase state, idempotency keys, approval waits, artifact pointers, and post-merge monitoring.

PRODE Wiki

Stores reviewed operational memory that can route investigation but cannot prove the current incident.

Redaction controls

Keep secrets, direct identifiers, copied rows, payloads, and customer-sensitive data out of git and Linear.

Verification harness

Checks that a run is reproducible from Linear, specialist outputs, evidence bundle, PR, and monitor result.
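
A verification harness can start as a completeness check over the run's artifact pointers. The sketch below assumes hypothetical key names and assumes that PR and monitor artifacts are expected only when the run actually opened a PR.

```python
from typing import Dict, List, Optional

# Artifact kinds named in the verification-harness description.
# The key names are hypothetical.
REQUIRED_ARTIFACTS = ("linear_snapshot", "specialist_outputs", "evidence_bundle")

# Assumption: these are expected only when the run ended in a PR.
PR_ARTIFACTS = ("pr_url", "monitor_result")


def missing_artifacts(pointers: Dict[str, Optional[str]], opened_pr: bool) -> List[str]:
    """List the expected artifacts that have no pointer recorded."""
    expected = REQUIRED_ARTIFACTS + (PR_ARTIFACTS if opened_pr else ())
    return [name for name in expected if not pointers.get(name)]
```

An empty result means every expected artifact has a pointer. It does not prove the artifacts are consistent with each other, which a fuller harness would also need to check.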

Machine-readable resources

The agent should not rely on conversation memory. Give it structured resources it can read and write in every run.

Tool inventory

Available MCPs, connectors, CLIs, credentials, blocked tools, and selected routes for this ticket.

Evidence bundle

Current sources, facts, confidence, gaps, classification, and approval state.

Specialist output

Question asked, sources queried, findings, confidence, gaps, and next pivot.

Runner state

Lease, phase, idempotency namespace, artifact pointers, approval records, PR URL, and terminal outcome.

Learning inbox

Redacted candidate learnings from completed runs. Humans review before promotion to the PRODE Wiki.

Plain-text docs

Keep ADRs and workflow files readable by agents without requiring browser-only context.
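
As one possible shape for these resources, the evidence bundle can be a plain record that round-trips through JSON, so every run reads and writes the same file instead of relying on conversation memory. The field list follows the descriptions on this page; the class names, default values, and JSON layout are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Fact:
    statement: str
    source_link: str
    observed_at: str       # ISO 8601 timestamp


@dataclass
class EvidenceBundle:
    ticket_id: str
    facts: List[Fact] = field(default_factory=list)
    confidence: float = 0.0
    gaps: List[str] = field(default_factory=list)
    classification: str = "unclassified"       # hypothetical default
    approval_state: str = "not_requested"      # hypothetical default


def to_json(bundle: EvidenceBundle) -> str:
    """Serialize with sorted keys so diffs between runs stay stable."""
    return json.dumps(asdict(bundle), indent=2, sort_keys=True)


def from_json(text: str) -> EvidenceBundle:
    data = json.loads(text)
    data["facts"] = [Fact(**item) for item in data["facts"]]
    return EvidenceBundle(**data)
```

The same pattern would extend to the tool inventory, specialist output, and runner state.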

Specialists

Specialists are not mini-coordinators. They answer one evidence question using read-only tools and return a structured result.

Sentry

Read issues, events, tags, releases, breadcrumbs, and recurrence signals.

AWS and CloudWatch

Read logs, metrics, service health, queue age, task state, and account identity.

Production DB read

Run narrow, read-only aggregate checks. No row copying, locks, writes, migrations, or backfills.

Vendor and support

Check provider status, webhook IDs, support reports, screenshots, expected behavior, and customer-safe summaries.

Release and reproduction

Compare deploy windows, feature flags, app versions, local reproduction, staging, browser, and emulator evidence.

Code path and review

Read source after the evidence bundle exists. Review the final diff before handoff.

End-to-end examples

Each example has the same rhythm: intake, evidence, decision, outcome. Not every run should produce a PR.

Tags: Sentry, PR

Backend 500 to PR

Sentry shows onboarding API 500s after a deploy.

Read Linear, Sentry issue, release, timestamps, and request IDs.
Route Sentry, AWS, release, and DB read specialists.
Evidence points to one backend path and a release-correlated null case.
Code path read begins after the bundle. The agent patches, tests, reviews, opens a PR, and updates Linear.

Outcome: PR opened. Human owns merge.

Tags: AWS, Infra

Queue alarm to infra handoff

CloudWatch creates a ticket for queue saturation.

Read alarm metadata, service name, time window, and related Sentry noise.
AWS specialist checks queue age, task health, and recent deploy correlation.
Evidence shows capacity exhaustion, not an application regression.
The runner writes an infra handoff and stops before AWS or Terraform mutation.

Outcome: Escalated to infrastructure. No code change.

Tags: DB read, Approval

DB state check to approval pause

A user-visible count disagrees with backend totals.

Read Linear, affected cohort, expected behavior, and request IDs.
DB specialist runs aggregate SELECT checks using a read-only role.
Evidence proves stale derived rows. The code fix is narrow, but remediation needs a production backfill.
The agent may open a code PR but pauses any data correction for exact human approval.

Outcome: PR plus approval request. Production write blocked.

Tags: No-code, Noise

Expected abort noise to safe stop

A burst of request-aborted errors appears after mobile clients close a long-running screen.

Read Linear, Sentry tags, affected endpoint, release, and platform split.
Wiki suggests a prior abort pattern, so specialists run negative controls.
Sentry and AWS show client disconnects without 5xx increase or deploy correlation.
The coordinator writes a no-code explanation and stops. It does not suppress Sentry automatically.

Outcome: Safe stop. Optional telemetry change needs approval.

Standards and safety rules

These rules are the operating contract. They make automation useful without letting it become unsafe.

Evidence before code

Current production evidence must exist before source investigation and implementation.

Memory is not proof

PRODE Wiki pages can route the investigation, but only current evidence can classify the ticket.

Read-only by default

Sentry, AWS, vendor, support, release, browser, and DB specialists run with read-only boundaries.

Human approval gates

Production writes, deploy-risk changes, broad telemetry suppression, PR merge, and Linear Done/Cancelled are human-owned.

Production DB access is read-only. If a query would write, lock, backfill, migrate, call a side-effecting function, or require FOR UPDATE, the run stops and asks for approval.
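
A pre-flight check can catch the obvious cases before a query reaches the replica. The sketch below is a conservative keyword heuristic, and the keyword list is an assumption. It cannot detect side-effecting functions, and it complements a read-only database role rather than replacing one.

```python
import re

# Keywords that suggest a write, lock, or schema change. Heuristic only.
BLOCKED = re.compile(
    r"\b(insert|update|delete|merge|truncate|alter|drop|create|"
    r"grant|revoke|lock|call|copy|vacuum)\b",
    re.IGNORECASE,
)


def needs_approval(sql: str) -> bool:
    """Return True if the run should stop and ask a human first."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        return True        # more than one statement
    if not re.match(r"(?i)(select|with)\b", statement):
        return True        # anything that is not a plain read
    # Catches FOR UPDATE as well, because it contains the word "update".
    return bool(BLOCKED.search(statement))
```

The check errs toward stopping: a harmless query that mentions a blocked word in a string literal is flagged too, which costs a human review rather than risking a write.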

Decision records

This page is the front door. The ADRs remain the source of detailed rationale and tradeoffs.

ADR-001

PRODE Wiki Operational Memory
Explains the Karpathy-style operational memory layer and why memory cannot replace current evidence.

ADR-002

PRODE Engineering Agent Suite
Explains the coordinator, specialists, runner contract, safety gates, and rollout decision.

Acceptance checks

Use these checks before calling the suite ready.

Dry run

A low-risk ticket produces a reproducible evidence bundle without reading source code first.

Idempotency

Retrying a run does not duplicate Linear comments, branches, PRs, specialist artifacts, or candidate learnings.

Safety

Production writes, merge, deploy-risk actions, Sentry suppression, and Linear Done/Cancelled all pause for exact human approval.

Review

A reviewer can reconstruct the run from Linear, tool inventory, specialist outputs, evidence bundle, PR, and monitor result.
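
The idempotency check can be exercised with a deterministic key per side effect. This is a minimal sketch: the key layout (ticket, phase, action) and the in-memory `SideEffectLog` are assumptions standing in for whatever durable store the runner uses.

```python
import hashlib
from typing import Callable, Dict


def idempotency_key(ticket_id: str, phase: str, action: str) -> str:
    """Derive a stable key; the unit-separator avoids ambiguous joins."""
    raw = "\x1f".join((ticket_id, phase, action))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class SideEffectLog:
    """In-memory stand-in for the runner's durable store (hypothetical)."""

    def __init__(self) -> None:
        self._done: Dict[str, str] = {}

    def run_once(self, key: str, effect: Callable[[], str]) -> str:
        """Run the effect the first time; return the recorded result after."""
        if key not in self._done:
            self._done[key] = effect()
        return self._done[key]
```

A retry that reuses the same key gets the recorded result back (for example, the existing comment or PR reference) instead of creating a duplicate.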