Overview
Post-mortem analysis examines software incidents, outages, and failures after they occur to understand what happened, why it happened, and how to prevent similar issues. This retrospective investigation process originated in medical practice and aviation safety before becoming standard practice in software engineering, popularized by site reliability teams at companies such as Google and Amazon.
The analysis creates a detailed incident timeline, identifies contributing factors, determines root causes through investigation techniques like the Five Whys, and produces actionable remediation items. Organizations conduct post-mortems for incidents that significantly impact users, violate service level objectives (SLOs), or expose systemic weaknesses even without customer impact.
Post-mortem analysis differs from incident response. Incident response focuses on detection, triage, and resolution during an active incident. Post-mortem analysis occurs after resolution when teams can investigate thoroughly without time pressure. The analysis examines not just the technical failure but also gaps in monitoring, documentation, testing, and communication that allowed the incident to occur or delayed resolution.
The core value emerges from organizational learning rather than individual blame. Teams treat incidents as opportunities to improve systems, processes, and practices. Well-executed post-mortems identify latent issues before they cause larger failures, document institutional knowledge about system behavior under stress, and create feedback loops that continuously improve reliability.
```ruby
# Example incident severity classification
class IncidentSeverity
  CRITICAL = {
    sev: 1,
    criteria: "Complete service outage affecting all users",
    response_time: "Immediate",
    post_mortem_required: true
  }

  HIGH = {
    sev: 2,
    criteria: "Major functionality degraded for significant user percentage",
    response_time: "Within 15 minutes",
    post_mortem_required: true
  }

  MEDIUM = {
    sev: 3,
    criteria: "Minor functionality affected or small user subset impacted",
    response_time: "Within 1 hour",
    post_mortem_required: false # Optional based on learning value
  }
end
```
Organizations typically require post-mortems for severity 1 and 2 incidents, with optional analysis for severity 3 incidents that reveal systemic issues or near-misses that could have escalated.
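That decision rule can be captured in a small helper; the method name and keyword flags are illustrative, not part of any standard API.

```ruby
# Decide whether an incident requires a post-mortem under the policy above:
# severity 1-2 always does; severity 3 only when it exposes a systemic
# issue or was a near-miss that could have escalated.
def post_mortem_required?(severity, systemic_issue: false, near_miss: false)
  return true if severity <= 2
  systemic_issue || near_miss
end
```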
Key Principles
Post-mortem analysis operates on several fundamental principles that distinguish effective analysis from superficial incident reviews. The blameless principle prohibits attributing incidents to individual error or negligence. System design creates conditions where mistakes become incidents, and effective analysis examines why those conditions existed rather than who made the mistake. This principle enables honest discussion about failures without fear of punishment.
The focus on systems over individuals recognizes that production incidents result from complex interactions between software, infrastructure, processes, and organizations. A database query causing an outage reflects not just a poorly optimized query but also inadequate query review processes, missing performance testing, insufficient monitoring, and capacity planning gaps. The analysis traces contributing factors across multiple layers rather than stopping at the proximate cause.
Timeliness ensures analysis occurs while details remain fresh in participants' memories. Organizations typically complete post-mortems within three to seven days after incident resolution. Delays reduce accuracy as participants forget subtle details, conversations, and decision-making context that influenced the incident timeline.
Action items transform learning into improvement. Each post-mortem identifies specific, actionable tasks to address root causes and contributing factors. These items receive owners, target completion dates, and tracking mechanisms. Post-mortems without actionable outcomes waste effort documenting problems without driving improvement.
The shared document principle requires writing comprehensive incident documentation accessible to the entire engineering organization. This documentation becomes institutional knowledge, preventing other teams from repeating similar mistakes and informing architecture decisions based on real operational experience. Teams reference historical post-mortems when designing systems or debugging new incidents.
```ruby
# Post-mortem document structure
class PostMortem
  attr_reader :incident_id, :timeline, :root_causes,
              :contributing_factors, :action_items
  # Writable so workflow tooling can populate them after creation
  attr_accessor :severity, :start_time, :end_time, :impact

  def initialize(incident_id)
    @incident_id = incident_id
    @timeline = []
    @root_causes = []
    @contributing_factors = []
    @action_items = []
  end

  def add_timeline_event(timestamp, description, actor)
    @timeline << {
      timestamp: timestamp,
      description: description,
      actor: actor,
      evidence: nil # Logs, screenshots, metrics
    }
  end

  def add_root_cause(description, category)
    @root_causes << {
      description: description,
      category: category,    # :code, :config, :capacity, :process
      detection_method: nil, # How identified during analysis
      why_chain: []          # Five Whys progression
    }
  end

  def add_action_item(description, owner, priority, target_date)
    @action_items << {
      description: description,
      owner: owner,
      priority: priority, # :critical, :high, :medium
      target_date: target_date,
      prevents_recurrence: false,
      improves_detection: false,
      improves_response: false
    }
  end
end
```
The distinction between root causes and contributing factors clarifies the analysis. Root causes directly led to the incident - removing a root cause would have prevented the incident entirely. Contributing factors made the incident more likely or more severe but did not directly cause it. A deployment process that skips integration tests is a contributing factor; the specific bug in deployed code is the root cause.
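The counterfactual test ("would removing it have prevented the incident?") can be made explicit when recording findings. A minimal sketch, not part of the `PostMortem` class above:

```ruby
# Record a finding with the counterfactual test made explicit:
# if removing it would have prevented the incident, it is a root cause;
# otherwise it is a contributing factor.
def classify_finding(description, removal_would_have_prevented:)
  {
    description: description,
    kind: removal_would_have_prevented ? :root_cause : :contributing_factor
  }
end
```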
Implementation Approaches
The structured post-mortem meeting approach brings together incident participants for a facilitated discussion. The facilitator guides the group through incident timeline reconstruction, root cause analysis, and action item identification. This approach works well for complex incidents requiring multiple perspectives and creates shared understanding across teams. The meeting typically lasts 60-90 minutes and includes the incident commander, responding engineers, on-call engineers, and relevant stakeholders.
The facilitator starts by establishing psychological safety, explicitly stating the blameless principle and focusing discussion on systems and processes. The group reconstructs the incident timeline chronologically using chat logs, monitoring data, deployment records, and participant memory. Each timeline entry includes a timestamp, description, the actor or system involved, and supporting evidence. The group identifies decision points where responders chose between different approaches and documents the information available at each decision point.
After completing the timeline, the group identifies what went well during response - effective debugging techniques, helpful monitoring, clear communication. This positive framing balances the problem-focused analysis and identifies practices to preserve. The group then analyzes what went poorly, focusing on gaps in detection, unclear escalation paths, missing runbooks, or misleading metrics.
The asynchronous written approach creates post-mortem documents without requiring simultaneous participation. The incident commander drafts the initial document including timeline, impact summary, and preliminary root cause analysis. Team members review and comment asynchronously, adding details, corrections, and perspectives. This approach accommodates distributed teams across time zones and allows deeper individual reflection. The document undergoes several review iterations before publishing.
```ruby
# Asynchronous post-mortem workflow
class PostMortemWorkflow
  def self.create_draft(incident, commander)
    post_mortem = PostMortem.new(incident.id)
    post_mortem.severity = incident.severity
    post_mortem.start_time = incident.detected_at
    post_mortem.end_time = incident.resolved_at

    # Commander drafts initial timeline from incident logs
    incident.response_log.each do |entry|
      post_mortem.add_timeline_event(
        entry.timestamp,
        entry.action,
        entry.responder
      )
    end

    # Request input from participants
    incident.participants.each do |participant|
      notify_for_review(participant, post_mortem)
    end

    post_mortem
  end

  def self.incorporate_feedback(post_mortem, comment)
    case comment.section
    when :timeline
      post_mortem.timeline.insert(
        comment.position,
        comment.event_details
      )
    when :root_cause
      post_mortem.add_root_cause(
        comment.cause_description,
        comment.category
      )
    end
  end
end
```
The hybrid approach combines structured meetings with written documentation. The incident commander drafts the timeline and initial analysis. The team holds a 45-minute meeting to review the draft, discuss root causes, and identify action items. After the meeting, the commander updates the document with meeting outcomes and publishes it for broader review. This approach balances thoroughness with efficiency.
The Five Whys technique drills into root causes by repeatedly asking why an issue occurred. Start with the incident symptom and ask why it happened. Take that answer and ask why again. Continue for five iterations or until reaching a systemic issue that action items can address. The technique prevents stopping at surface-level causes.
```ruby
# Five Whys analysis example
def five_whys_analysis(symptom)
  chain = [symptom]
  # Example progression:
  chain << "Database queries exceeded connection pool limit"
  chain << "New feature created N+1 query pattern"
  chain << "Code review did not identify query performance issue"
  chain << "Code review checklist does not include database query review"
  chain << "No standard checklist exists for reviewing database changes"
  # Actionable root cause:
  # "Create database change review checklist and make it required for PRs affecting models"
  chain
end
```
The Fishbone (Ishikawa) diagram approach categorizes contributing factors into predefined categories: people, process, technology, and environment. This structured categorization ensures comprehensive analysis across different dimensions and prevents fixating on a single factor type. Teams draw a diagram with the incident as the "head" and major categories as "bones" extending from the spine, then add specific factors to each category.
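A minimal sketch of that categorization, with the category list mirroring the four bones named above:

```ruby
# Group contributing factors under the four fishbone categories.
# Unknown categories raise so nothing silently falls off the diagram.
FISHBONE_CATEGORIES = %i[people process technology environment].freeze

def fishbone(incident, factors)
  diagram = { head: incident }
  FISHBONE_CATEGORIES.each { |category| diagram[category] = [] }
  factors.each do |factor|
    category = factor.fetch(:category)
    unless FISHBONE_CATEGORIES.include?(category)
      raise ArgumentError, "unknown category: #{category}"
    end
    diagram[category] << factor.fetch(:description)
  end
  diagram
end
```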
Common Patterns
The blameless post-mortem pattern creates psychological safety by explicitly prohibiting blame and focusing on systemic improvements. Documents avoid passive voice constructions that obscure responsibility ("the database was misconfigured" becomes "the deployment script misconfigured the database") while simultaneously avoiding blame ("engineer X misconfigured the database" becomes "the deployment script lacked validation for database configuration"). This framing acknowledges human actions while examining why systems allowed errors to propagate.
```ruby
# Blameless language transformation
class BlamelessLanguage
  # Transform passive constructions
  def self.clarify_agency(passive_text)
    # "The wrong configuration was deployed"
    # becomes
    # "The deployment automation deployed the wrong configuration"
    identify_actor_and_action(passive_text)
  end

  # Transform blame constructions
  def self.systemize(blame_text)
    # "Engineer deployed bad code"
    # becomes
    # "The deployment process allowed untested code to reach production"
    identify_process_gap(blame_text)
  end

  def self.identify_actor_and_action(text)
    # Extract what happened and what system/process was involved
    # Focus on automation, tools, or processes rather than individuals
  end

  def self.identify_process_gap(text)
    # Identify what control, check, or process should have prevented the issue
  end

  # `private` has no effect on class methods; hide the helpers explicitly
  private_class_method :identify_actor_and_action, :identify_process_gap
end
```
The learning review pattern publishes completed post-mortems organization-wide and discusses notable incidents in engineering meetings. Teams present post-mortems to share learnings, especially when incidents reveal patterns applicable to other systems. This pattern transforms individual team incidents into organizational learning and builds institutional knowledge about system reliability patterns.
The near-miss analysis pattern applies post-mortem analysis to incidents that did not impact users but could have under different circumstances. A database reaching 85% capacity might not cause an immediate outage but represents a near-miss worth analyzing. These analyses catch issues before they become customer-impacting incidents and demonstrate proactive reliability work.
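A sketch of a near-miss trigger for capacity readings like the one above; the 80% review threshold is an assumed policy value, not a standard.

```ruby
# Flag a resource reading as a near-miss worth analyzing: high enough to
# signal risk, but short of an outage.
def near_miss?(utilization_pct, review_threshold: 80, outage_threshold: 100)
  utilization_pct >= review_threshold && utilization_pct < outage_threshold
end
```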
The pattern recognition approach identifies recurring incident types and creates specific remediation tracks. After conducting multiple post-mortems, teams notice patterns: deployment-related incidents cluster around configuration management, database incidents relate to capacity planning, and API failures stem from circuit breaker configuration. Recognizing these patterns enables systematic improvements that address whole categories of incidents rather than individual occurrences.
```ruby
# Pattern recognition across incidents
class IncidentPatternAnalyzer
  def analyze_patterns(post_mortems)
    patterns = {}
    post_mortems.group_by { |pm| pm.primary_category }.each do |category, incidents|
      patterns[category] = {
        frequency: incidents.count,
        common_causes: extract_common_causes(incidents),
        systemic_issues: identify_systemic_issues(incidents),
        recommended_improvements: []
      }

      # If the same category appears frequently, identify a systemic fix
      if incidents.count >= 3
        patterns[category][:recommended_improvements] =
          suggest_systemic_improvements(category, incidents)
      end
    end
    patterns
  end

  private

  def extract_common_causes(incidents)
    incidents.flat_map(&:root_causes)
             .group_by { |cause| cause[:category] }
             .transform_values(&:count)
             .sort_by { |_, count| -count }
  end

  def identify_systemic_issues(incidents)
    # Look for gaps appearing across multiple incidents:
    # missing monitoring, inadequate testing, unclear documentation
    incidents.flat_map(&:contributing_factors)
             .select { |factor| factor[:type] == :systemic_gap }
             .group_by { |factor| factor[:gap_category] }
             .select { |_, factors| factors.count >= 2 }
  end
end
```
The action item tracking pattern creates a standardized process for tracking post-mortem action items to completion. Each action item receives an owner, priority, target date, and tracking ticket. Teams review outstanding action items weekly and report completion rates as a reliability metric. This pattern ensures post-mortems drive actual improvements rather than creating documentation that teams ignore.
The pre-mortem pattern inverts post-mortem analysis by conducting it before launching new systems or features. Teams imagine the new system has failed catastrophically and work backward to identify what could have caused the failure. This prospective analysis identifies risks and mitigation strategies during design and development rather than after production failures.
Practical Examples
A deployment incident demonstrates comprehensive post-mortem analysis. On March 15, 2024, a mobile API experienced complete service disruption for 23 minutes affecting 100% of mobile users. The incident began at 14:37 UTC when automated deployment tooling deployed version 2.8.0 to production. Request success rate dropped from 99.9% to 0% within two minutes.
```ruby
# Incident timeline reconstruction
incident_timeline = [
  {
    time: "14:37:00 UTC",
    event: "Deployment pipeline initiated rollout of v2.8.0",
    actor: "Automated deployment system",
    evidence: "Deployment logs show successful image push"
  },
  {
    time: "14:37:30 UTC",
    event: "First production pod running v2.8.0",
    actor: "Kubernetes",
    evidence: "Pod event logs show container started successfully"
  },
  {
    time: "14:38:45 UTC",
    event: "Error rate alerts triggered for mobile-api service",
    actor: "Monitoring system",
    evidence: "PagerDuty incident #4521 created"
  },
  {
    time: "14:39:10 UTC",
    event: "On-call engineer acknowledged alert",
    actor: "Engineer A",
    evidence: "PagerDuty acknowledgment timestamp"
  },
  {
    time: "14:40:30 UTC",
    event: "Engineer identified 100% error rate from new pods",
    actor: "Engineer A",
    evidence: "Slack message in #incidents channel"
  },
  {
    time: "14:42:15 UTC",
    event: "Rollback initiated to v2.7.5",
    actor: "Engineer A",
    evidence: "Deployment command in terminal history"
  },
  {
    time: "14:52:00 UTC",
    event: "Rollback complete, traffic recovering",
    actor: "Kubernetes",
    evidence: "Pod metrics showing successful requests"
  },
  {
    time: "15:00:00 UTC",
    event: "Service fully recovered, incident resolved",
    actor: "Engineer A",
    evidence: "Error rate returned to baseline"
  }
]
```
The root cause analysis revealed that version 2.8.0 included a database migration adding a new column with a NOT NULL constraint but no default value. The migration ran successfully in staging where the database contained only test data. In production, the migration attempted to add the NOT NULL column to 15 million existing rows without default values, causing the migration to fail. The application code assumed the column existed and crashed when the schema did not match expectations.
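This failure mode has a well-known safe alternative: add the column as nullable, backfill existing rows, then add the constraint. A sketch expressing that plan as ordered SQL steps (table and column names hypothetical; in practice the backfill runs in batches):

```ruby
# Safe rollout of a NOT NULL column on a populated table, as an ordered
# plan of SQL steps: add nullable, backfill, then constrain.
def safe_not_null_plan(table, column, type, default_sql)
  [
    "ALTER TABLE #{table} ADD COLUMN #{column} #{type};",
    "UPDATE #{table} SET #{column} = #{default_sql} WHERE #{column} IS NULL;",
    "ALTER TABLE #{table} ALTER COLUMN #{column} SET NOT NULL;"
  ]
end
```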
The Five Whys analysis progressed: (1) Why did the service crash? The code referenced a database column that did not exist. (2) Why did the column not exist? The database migration failed to add it. (3) Why did the migration fail? The migration added a NOT NULL column without a default value to a table with existing data. (4) Why did this pass staging testing? The staging database had no existing rows in the affected table. (5) Why did staging not have production-like data? The staging data refresh process only copied schema, not data.
```ruby
# Action items from deployment incident
action_items = [
  {
    description: "Update migration tooling to require explicit default values for new NOT NULL columns",
    owner: "Platform team",
    priority: :critical,
    target_date: "2024-03-22",
    prevents_recurrence: true
  },
  {
    description: "Implement staging data refresh to include production-scale anonymized data",
    owner: "DevOps team",
    priority: :high,
    target_date: "2024-04-05",
    prevents_recurrence: true
  },
  {
    description: "Add pre-deployment migration validation check that runs migrations against production-scale data copy",
    owner: "Platform team",
    priority: :high,
    target_date: "2024-03-29",
    prevents_recurrence: true
  },
  {
    description: "Create runbook for rollback procedures during deployment incidents",
    owner: "Engineer A",
    priority: :medium,
    target_date: "2024-03-20",
    improves_response: true
  }
]
```
A capacity incident illustrates different analysis considerations. On April 3, 2024, the background job processing system experienced severe queue backlog, with job processing time increasing from 2 minutes average to over 4 hours. The incident lasted 6 hours and affected asynchronous features including email delivery, report generation, and data exports.
Investigation revealed that job processing capacity had remained static at 50 workers while job volume grew 300% over three months. Monitoring tracked queue depth but not processing velocity or wait time, so the gradual degradation went unnoticed until the queue had grown to millions of jobs. The incident resolved after engineers manually scaled the worker pool to 200, which cleared the backlog over the six-hour incident window.
Root cause analysis identified inadequate capacity planning and monitoring gaps. The system lacked autoscaling configuration, capacity projections, or alerts on processing velocity degradation. Action items included implementing queue-based autoscaling, creating capacity planning reviews for high-growth features, and adding monitoring for job wait time and processing velocity.
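The queue-based autoscaling action item can be sketched as a simple sizing rule; the throughput figures, bounds, and the rule itself are illustrative (the 200-worker ceiling matches the manual scale-up described above):

```ruby
# Size the worker pool from queue arrival rate and per-worker throughput,
# clamped to fixed bounds.
def desired_workers(jobs_per_minute, jobs_per_worker_per_minute, min: 10, max: 200)
  needed = (jobs_per_minute.to_f / jobs_per_worker_per_minute).ceil
  needed.clamp(min, max)
end
```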
A third-party dependency incident demonstrates external failure analysis. On May 12, 2024, a payment processing integration failed for 45 minutes affecting checkout completion. The payment gateway experienced an outage returning 503 errors for all requests. The application lacked circuit breaker patterns, so it continued sending requests and waiting for responses, exhausting connection pools and blocking other operations.
The post-mortem examined not just the external outage (outside the team's control) but the system's response to that outage. Root causes included missing circuit breakers, no timeout configuration for external service calls, and insufficient redundancy in payment options. Action items focused on defensive programming patterns, implementing circuit breakers, adding fallback payment processors, and improving graceful degradation.
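A minimal circuit breaker along the lines of that action item might look like the following; the thresholds, the string-based "open" signal, and the injectable clock are illustrative choices, not a library API:

```ruby
# Minimal circuit breaker sketch: open after consecutive failures, fail
# fast while open, and allow a retry after a cooldown (half-open).
class CircuitBreaker
  def initialize(failure_threshold: 3, cooldown_seconds: 30, clock: -> { Time.now.to_f })
    @failure_threshold = failure_threshold
    @cooldown_seconds = cooldown_seconds
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def call
    raise "circuit open" if open?
    begin
      result = yield
      @failures = 0 # any success closes the circuit
      result
    rescue StandardError
      @failures += 1
      @opened_at = @clock.call if @failures >= @failure_threshold
      raise
    end
  end

  def open?
    return false if @opened_at.nil?
    if @clock.call - @opened_at >= @cooldown_seconds
      # Cooldown elapsed: half-open, let the next call through
      @opened_at = nil
      @failures = 0
      false
    else
      true
    end
  end
end
```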
Tools & Ecosystem
Post-mortem documentation platforms provide structured templates, collaboration features, and historical incident archives. PagerDuty's postmortem feature integrates with its incident management platform, automatically populating incident metadata including timeline, responders, and resolution time. The platform supports collaborative editing, action item tracking, and organization-wide incident pattern analysis.
```ruby
# Integration with incident management tools
class IncidentManager
  def create_post_mortem_draft(incident_id)
    incident = fetch_incident(incident_id)
    draft = {
      incident_id: incident.id,
      severity: incident.severity,
      detected_at: incident.created_at,
      resolved_at: incident.resolved_at,
      responders: incident.responders.map(&:name),
      timeline: extract_timeline_from_logs(incident),
      affected_services: incident.impacted_services,
      customer_impact: calculate_customer_impact(incident)
    }

    # Push to document management system
    PostMortemRepository.create(draft)
  end

  private

  def extract_timeline_from_logs(incident)
    incident.response_log.map do |entry|
      {
        timestamp: entry.created_at,
        action: entry.action,
        actor: entry.responder.name,
        notes: entry.notes
      }
    end
  end
end
```
Confluence and Notion serve as common post-mortem repositories using page templates. Teams create page templates with standard sections: incident summary, timeline, root cause analysis, action items. The wiki format facilitates cross-referencing between incidents and searching historical post-mortems. These platforms lack specialized incident management features but integrate with existing documentation systems.
Jeli focuses on learning from incidents through in-depth investigation and cognitive psychology research. The platform emphasizes understanding decision-making under uncertainty rather than just identifying technical root causes. Jeli captures detailed timelines including what information was available at each decision point and what assumptions responders made.
GitHub Issues and Jira provide action item tracking. Teams create tracking tickets for each post-mortem action item, linking tickets to the post-mortem document. This integration enables tracking action item completion rates across incidents and identifying common remediation patterns. Teams report action item velocity as a reliability metric.
```ruby
# Action item tracking integration
class ActionItemTracker
  def create_tracking_tickets(post_mortem)
    post_mortem.action_items.each do |item|
      ticket = create_ticket(
        title: item[:description],
        assignee: item[:owner],
        priority: map_priority(item[:priority]),
        due_date: item[:target_date],
        labels: ["post-mortem", "incident-#{post_mortem.incident_id}"]
      )
      link_to_post_mortem(ticket, post_mortem)
      if item[:prevents_recurrence]
        ticket.add_label("prevents-recurrence")
      end
    end
  end

  def track_completion_rate
    items = ActionItem.where("created_at > ?", 90.days.ago)
    completed = items.where(status: "closed").count
    total = items.count
    {
      # Guard against division by zero when no items exist in the window
      completion_rate: total.zero? ? 0.0 : (completed.to_f / total * 100).round(2),
      overdue: items.where("due_date < ? AND status != ?", Date.today, "closed").count,
      avg_time_to_close: calculate_avg_time_to_close(items)
    }
  end
end
```
Log aggregation systems including Elasticsearch, Splunk, and Datadog provide historical data for timeline reconstruction. These systems enable searching application logs, system metrics, and infrastructure events by timestamp to build accurate incident timelines. Query capabilities allow correlating events across multiple services to understand cascading failures.
Distributed tracing systems like Jaeger and Honeycomb track request flows through microservice architectures. During post-mortem analysis, engineers examine traces from failing requests to identify exactly where errors occurred, what services were involved, and how long each operation took. Trace data supplements logs with request-level context.
Observability platforms combine metrics, logs, and traces into unified incident investigation interfaces. New Relic, Datadog, and Grafana provide pre-built incident timelines showing metric anomalies, deployment events, and alert triggers. These platforms reduce time spent gathering data for post-mortem analysis.
Common Pitfalls
Stopping at proximate causes rather than systemic issues produces ineffective post-mortems that do not prevent recurrence. A post-mortem that concludes "the engineer deployed untested code" stops at the proximate cause (the deployment) without examining why untested code reached deployment. Effective analysis asks why the deployment process allowed untested code, why testing coverage did not catch the issue, and why code review did not identify problems.
```ruby
# Shallow vs deep root cause analysis
class RootCauseDepth
  # Shallow analysis - stops at proximate cause
  def shallow_analysis
    {
      conclusion: "Engineer deployed code with a bug",
      action_items: ["Engineers should test code before deploying"]
    }
  end

  # Deep analysis - examines systemic issues
  def deep_analysis
    {
      proximate_cause: "Deployment contained uncaught exception in error handler",
      contributing_factors: [
        "Error handling code path not covered by automated tests",
        "Code review did not identify missing test coverage",
        "Staging environment did not trigger the error condition",
        "No alerts for error handler failures"
      ],
      systemic_issues: [
        "No requirement for test coverage on error handling paths",
        "Code review checklist does not include error path testing",
        "Staging environment does not replicate production error conditions",
        "Monitoring gaps in error handler execution"
      ],
      action_items: [
        "Add test coverage requirement for all error handling code",
        "Update code review checklist with error path verification",
        "Configure staging to replay production error conditions",
        "Implement monitoring for error handler execution and failures"
      ]
    }
  end
end
```
Blame culture destroys post-mortem value by making participants defensive and preventing honest analysis. When organizations punish people for incidents, engineers hide information, downplay severity, and avoid documenting uncomfortable truths. Post-mortems become exercises in blame deflection rather than learning. Comments like "the engineer should have known better" or "this was obviously a mistake" introduce blame even in supposedly blameless post-mortems.
Delaying the post-mortem lets participant memories fade and incident details blur. Conducting post-mortems four weeks after an incident produces incomplete timelines, forgotten context, and superficial analysis. Teams should complete post-mortems within one week, while details remain clear and motivation for improvement remains high.
Action items without clear ownership, deadlines, or tracking never complete. Post-mortems listing vague action items like "improve monitoring" or "add more tests" without specific owners and dates generate documentation that teams ignore. Effective action items specify exactly what will be done, who will do it, and by when, with tracking mechanisms ensuring accountability.
Excessive action items overwhelm teams and prevent focus on high-impact improvements. Post-mortems identifying 20 action items scatter effort across many small changes rather than addressing core systemic issues. Effective post-mortems prioritize 3-5 critical action items that address root causes and demonstrably reduce incident likelihood.
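A simple way to enforce that focus is to rank items and cap the list; the priority-to-rank mapping is an assumed convention:

```ruby
# Rank action items by priority and cap the list so effort stays focused
# on the highest-impact fixes.
PRIORITY_RANK = { critical: 0, high: 1, medium: 2, low: 3 }.freeze

def focus_action_items(items, limit: 5)
  items.sort_by { |item| PRIORITY_RANK.fetch(item[:priority]) }.first(limit)
end
```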
```ruby
# Action item quality evaluation
class ActionItemQuality
  # Poor action items - vague, no owner, no deadline
  def poor_action_items
    [
      "Improve monitoring",
      "Add more tests",
      "Better documentation needed"
    ]
  end

  # Good action items - specific, owned, dated
  def good_action_items
    [
      {
        description: "Add alert for database connection pool utilization exceeding 80%",
        owner: "Platform team - Engineer B",
        target_date: Date.new(2024, 3, 22),
        success_criteria: "Alert fires when connection pool reaches 80% and pages on-call",
        prevents_recurrence: true
      },
      {
        description: "Create integration test suite for payment processing error scenarios",
        owner: "Payments team - Engineer C",
        target_date: Date.new(2024, 3, 29),
        success_criteria: "Test suite covers timeout, 503, and malformed response scenarios",
        prevents_recurrence: true
      }
    ]
  end
end
```
Over-focusing on technical details while ignoring process and organizational factors produces narrow analysis. Incidents often result from communication breakdowns, unclear ownership, documentation gaps, or organizational pressures that encourage speed over safety. Post-mortems examining only code and configuration miss important contributing factors in how teams work.
Conducting post-mortems only for customer-impacting incidents misses learning opportunities. Near-misses that did not affect users reveal systemic weaknesses before they cause major incidents. An alert firing for database capacity but resolving itself before impact represents a near-miss worth analyzing even without customer impact.
Generic action items that could apply to any incident indicate shallow analysis. Action items like "improve communication" or "better testing" apply broadly but do not address specific incident characteristics. Effective action items tie directly to identified root causes and contributing factors.
Reference
Post-Mortem Document Template
| Section | Required | Description |
|---|---|---|
| Incident Summary | Yes | One-paragraph overview of what happened, impact, and duration |
| Incident Metadata | Yes | Severity, start time, end time, affected services, detection method |
| Impact Assessment | Yes | User impact quantification, business metrics affected, SLO violations |
| Timeline | Yes | Chronological event list with timestamps, actions, actors, evidence |
| Root Cause Analysis | Yes | Direct causes whose removal would have prevented the incident |
| Contributing Factors | Yes | Conditions that made incident more likely or more severe |
| What Went Well | Yes | Effective practices during detection and response |
| What Went Poorly | Yes | Gaps in detection, response, or system design |
| Action Items | Yes | Specific improvements with owners, dates, tracking |
| Related Incidents | Optional | Similar past incidents or near-misses |
Root Cause Categories
| Category | Description | Example |
|---|---|---|
| Code Defect | Bug in application code | Null pointer exception, logic error, race condition |
| Configuration Error | Incorrect system configuration | Wrong environment variable, misconfigured load balancer |
| Capacity Insufficient | Resource exhaustion | Database connection pool exhausted, disk full |
| Dependency Failure | External service or library failure | Third-party API outage, library bug |
| Process Gap | Missing or inadequate process | Skipped testing step, incomplete code review |
| Operational Error | Manual operation mistake | Wrong command executed, wrong server targeted |
| Design Flaw | Architectural limitation | Single point of failure, missing redundancy |
| Monitoring Gap | Issue not detected or alerted | Missing metric, misconfigured alert threshold |
Action Item Priorities
| Priority | Criteria | Target Completion |
|---|---|---|
| Critical | Prevents recurrence of severity 1-2 incident | Within 1 week |
| High | Significantly reduces incident likelihood or improves detection | Within 2-4 weeks |
| Medium | Incremental improvement to reliability or response | Within 1-2 months |
| Low | Nice-to-have improvement or long-term goal | Within 3+ months |
Post-Mortem Meeting Agenda
| Time | Activity | Facilitator Actions |
|---|---|---|
| 0-5 min | Introduction and psychological safety | State blameless principle, set expectations |
| 5-30 min | Timeline reconstruction | Guide chronological review, fill gaps, note decision points |
| 30-40 min | What went well | Identify effective practices to preserve |
| 40-55 min | Root cause analysis | Guide Five Whys, distinguish root causes from symptoms |
| 55-75 min | Action item identification | Ensure specific, owned, dated items with clear success criteria |
| 75-90 min | Review and next steps | Confirm document owner, publication timeline, follow-up meeting |
Five Whys Progression
| Level | Question Pattern | Goal |
|---|---|---|
| Why 1 | Why did the symptom occur? | Identify immediate technical cause |
| Why 2 | Why did that cause exist? | Identify system condition enabling cause |
| Why 3 | Why was that condition present? | Identify process or design gap |
| Why 4 | Why does that gap exist? | Identify organizational or architectural factor |
| Why 5 | Why was that factor not addressed? | Identify actionable systemic improvement |
Incident Impact Metrics
| Metric | Measurement | Use |
|---|---|---|
| User Impact Percentage | Percentage of users affected | Severity classification |
| Duration | Time from detection to resolution | Response effectiveness |
| Time to Detect | Delay from incident start to detection | Monitoring effectiveness |
| Time to Mitigate | Time from detection to impact reduction | Response speed |
| MTTR | Mean Time To Recovery across incidents | Overall reliability trend |
| SLO Burn Rate | Rate of error budget consumption | Incident priority |
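The timing metrics above can be computed directly from incident timestamps; this sketch takes `Time` objects and reports durations in seconds:

```ruby
# Compute timing metrics from incident timestamps.
def impact_metrics(started_at:, detected_at:, mitigated_at:, resolved_at:)
  {
    time_to_detect: detected_at - started_at,   # monitoring effectiveness
    time_to_mitigate: mitigated_at - detected_at, # response speed
    duration: resolved_at - detected_at           # detection to resolution
  }
end

# Mean Time To Recovery across a set of incidents, in seconds.
def mttr(incidents)
  return 0.0 if incidents.empty?
  incidents.sum { |i| i[:resolved_at] - i[:detected_at] } / incidents.length
end
```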
Post-Mortem Quality Checklist
| Criteria | Evaluation Question |
|---|---|
| Blameless | Does document avoid attributing incident to individual failure? |
| Timeline Complete | Does timeline include detection, response actions, and resolution? |
| Root Causes Identified | Are root causes systemic issues rather than symptoms? |
| Contributing Factors Listed | Are conditions that worsened incident documented? |
| Action Items Specific | Does each action item have owner, date, and success criteria? |
| Learning Captured | Does document provide value to teams working on similar systems? |
| Evidence Linked | Are timeline events supported by logs, metrics, or screenshots? |