
Incident Postmortem Template: A Blameless Approach

Your service went down. Users were affected. The team scrambled, fixed it, and now everyone wants to move on. But if you skip the postmortem, you'll be fixing the same class of issue six months from now. A good postmortem turns an outage into a lasting improvement.

What is a postmortem?

A postmortem (also called an incident review or retrospective) is a structured analysis of what went wrong, why, and what to do about it. The goal is learning, not punishment.

The "blameless" part matters. When people fear blame, they hide information. They leave details out of the timeline. They describe what happened in vague terms to avoid implicating themselves or teammates. Blameless doesn't mean no accountability — it means focusing on systems and processes rather than individual mistakes.

If a junior engineer can bring down production with a single command, the problem isn't the junior engineer — it's that a single command can bring down production.

The template

Here's a postmortem structure that works for incidents of any size. Scale the depth to match the severity — a 5-minute blip needs a paragraph, not a 10-page report.

1. Incident summary

Start with the facts. Anyone reading this section should understand what happened without reading further:

  • Date and time — When did the incident start and end? Include timezone.
  • Duration — Total time from first user impact to full resolution.
  • Impact — What was affected? How many users? What functionality was broken? Quantify: "12% of API requests returned 503 for 47 minutes" is better than "some users experienced errors."
  • Severity — Use your incident severity scale (SEV-1 through SEV-4 or whatever your team uses).
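The summary fields above map naturally onto a small record. As a minimal sketch (the class and field names are illustrative, not part of any tool), keeping timestamps timezone-aware makes the duration calculation trivial:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class IncidentSummary:
    started: datetime   # first user impact (timezone-aware)
    resolved: datetime  # full resolution
    impact: str         # quantified, e.g. "12% of API requests returned 503"
    severity: str       # your team's scale, e.g. "SEV-2"

    def duration_minutes(self) -> int:
        # Total time from first user impact to full resolution.
        return int((self.resolved - self.started).total_seconds() // 60)

summary = IncidentSummary(
    started=datetime(2024, 3, 14, 14, 23, tzinfo=timezone.utc),
    resolved=datetime(2024, 3, 14, 15, 10, tzinfo=timezone.utc),
    impact="12% of API requests returned 503",
    severity="SEV-2",
)
print(summary.duration_minutes())  # 47
```

Deriving the duration from the two timestamps, rather than recording it separately, removes one field that can drift out of sync.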

2. Timeline

Build a minute-by-minute timeline from detection to resolution. Include:

  • Detection — When and how was the incident first noticed? Monitoring alert, customer report, internal engineer?
  • Response — When was the on-call paged? When did they start investigating?
  • Diagnosis — Key moments in the investigation. What hypotheses were tested? What was the "aha" moment?
  • Mitigation — What was done to stop the bleeding? Rollback, config change, scaling up?
  • Resolution — When was the root cause actually fixed (vs. mitigated)?

Write the timeline in UTC or your team's standard timezone. Be specific: "14:23 — Alert fired: API error rate exceeded 5% threshold" not "Around 2pm an alert went off."
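If you assemble the timeline from scattered sources (alerts, Slack, deploy logs), storing entries as timestamped tuples and rendering them in one pass keeps the formatting consistent. A sketch, with invented example entries:

```python
from datetime import datetime, timezone

# Each entry: (timestamp, phase, description) — phases mirror the list above.
timeline = [
    (datetime(2024, 3, 14, 14, 23, tzinfo=timezone.utc), "detection",
     "Alert fired: API error rate exceeded 5% threshold"),
    (datetime(2024, 3, 14, 14, 25, tzinfo=timezone.utc), "response",
     "On-call paged; began investigating"),
    (datetime(2024, 3, 14, 14, 48, tzinfo=timezone.utc), "mitigation",
     "Rolled back deploy; error rate recovering"),
]

def render(entries):
    # Sort chronologically and format each line as "HH:MM — description" in UTC.
    return "\n".join(f"{ts:%H:%M} — {desc}" for ts, _phase, desc in sorted(entries))

print(render(timeline))
```

Sorting before rendering means responders can append entries in whatever order they recall them.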

3. Root cause analysis

Go beyond the surface. The root cause is rarely "a server crashed" — it's why the server crashed and why the crash caused user impact. Use the 5 Whys technique:

  • Why did users get errors? — The API returned 503s.
  • Why was the API returning 503s? — The database connection pool was exhausted.
  • Why was the pool exhausted? — A query was holding connections open for 30+ seconds.
  • Why was the query so slow? — A missing index caused a full table scan on a table that grew past 10M rows last week.
  • Why wasn't the missing index caught? — We don't have automated slow-query alerting or index coverage checks in CI.

The fifth "why" is usually where the actionable fix lives. The first "why" is a symptom; the last one is the systemic gap.
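The chain above is just an ordered list from symptom to systemic gap, which is one way to capture it in a postmortem tool or script (purely illustrative):

```python
# Ordered question/answer pairs, from symptom (first) to systemic gap (last).
five_whys = [
    ("Why did users get errors?", "The API returned 503s."),
    ("Why was the API returning 503s?", "The database connection pool was exhausted."),
    ("Why was the pool exhausted?", "A query held connections open for 30+ seconds."),
    ("Why was the query so slow?", "A missing index caused a full table scan."),
    ("Why wasn't the missing index caught?", "No slow-query alerting or index checks in CI."),
]

# The last answer is where the actionable fix usually lives.
systemic_gap = five_whys[-1][1]
print(systemic_gap)
```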

4. What went well

This section is often skipped, but it matters. Recognizing what worked reinforces good practices:

  • Monitoring detected the issue within 2 minutes
  • On-call responded within 5 minutes of the page
  • The rollback procedure worked as documented
  • Status page was updated within 10 minutes

5. What went wrong

Be honest about what failed beyond the root cause:

  • The runbook for this service was outdated
  • It took 20 minutes to identify which service was the source of errors
  • The database dashboard didn't show connection pool metrics by default
  • We didn't update the status page until 25 minutes in

6. Action items

Every action item needs an owner and a deadline. Without both, action items are wishes.

  • Immediate — Add the missing database index (owner: Alex, done in the fix)
  • This week — Set up slow-query alerting for all production databases (owner: Sam, by Friday)
  • This sprint — Add connection pool metrics to the default database dashboard (owner: Jordan, sprint 24)
  • Backlog — Evaluate adding index coverage checks to CI pipeline (owner: Taylor, next planning)
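The owner-and-deadline rule is easy to lint mechanically. A minimal sketch (the dictionary shape and field names are assumptions, not any particular tracker's schema):

```python
def incomplete_items(items):
    # An action item without both an owner and a deadline is a wish.
    return [it["title"] for it in items
            if not it.get("owner") or not it.get("deadline")]

items = [
    {"title": "Add missing database index", "owner": "Alex", "deadline": "done"},
    {"title": "Set up slow-query alerting", "owner": "Sam", "deadline": "Friday"},
    {"title": "Evaluate index coverage checks in CI", "owner": "Taylor", "deadline": None},
]
print(incomplete_items(items))  # ['Evaluate index coverage checks in CI']
```

Running a check like this when a postmortem is filed catches wishes before they reach the backlog.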

Tips for effective postmortems

  • Write within 48 hours — Memories fade fast. The timeline becomes fuzzy. Slack messages get buried. Do it while it's fresh.
  • Include all responders — Everyone who touched the incident should review the postmortem. Different perspectives catch blind spots.
  • Focus on systems, not people — "The deployment pipeline didn't run integration tests" not "Dave forgot to run tests." Systems should prevent human error, not depend on humans being perfect.
  • Follow up on action items — Review open postmortem action items in your weekly team meeting. An action item that never gets done is a future incident waiting to happen.
  • Share broadly — Postmortems are most valuable when other teams can learn from them. Publish internally. For customer-facing incidents, publish a summary on your status page.

Built-in postmortem support

PulseAPI includes structured postmortem fields directly on each incident — summary, timeline, root cause, and action items. When you resolve an incident, the postmortem is attached to the same record your team and your users can reference. No separate document to maintain, no context lost in a Google Doc nobody can find later.

The best postmortem is the one that actually gets written. Keep the barrier low, tie it to the incident record, and make follow-through visible.
