Overview
Post-mortem analysis examines software incidents, outages, and failures after they occur to understand what happened, why it happened, and how to prevent similar issues. This retrospective investigation process originated in medical practice and aviation safety before becoming standard practice in software engineering, popularized by site reliability teams at companies such as Google and Amazon.
The analysis creates a detailed incident timeline, identifies contributing factors, determines root causes through investigation techniques like the Five Whys, and produces actionable remediation items. Organizations conduct post-mortems for incidents that significantly impact users, violate service level objectives (SLOs), or expose systemic weaknesses even without customer impact.
Post-mortem analysis differs from incident response. Incident response focuses on detection, triage, and resolution during an active incident. Post-mortem analysis occurs after resolution when teams can investigate thoroughly without time pressure. The analysis examines not just the technical failure but also gaps in monitoring, documentation, testing, and communication that allowed the incident to occur or delayed resolution.
The core value emerges from organizational learning rather than individual blame. Teams treat incidents as opportunities to improve systems, processes, and practices. Well-executed post-mortems identify latent issues before they cause larger failures, document institutional knowledge about system behavior under stress, and create feedback loops that continuously improve reliability.
```ruby
# Example incident severity classification
class IncidentSeverity
  CRITICAL = {
    sev: 1,
    criteria: "Complete service outage affecting all users",
    response_time: "Immediate",
    post_mortem_required: true
  }

  HIGH = {
    sev: 2,
    criteria: "Major functionality degraded for significant user percentage",
    response_time: "Within 15 minutes",
    post_mortem_required: true
  }

  MEDIUM = {
    sev: 3,
    criteria: "Minor functionality affected or small user subset impacted",
    response_time: "Within 1 hour",
    post_mortem_required: false # Optional based on learning value
  }
end
```
Organizations typically require post-mortems for severity 1 and 2 incidents, with optional analysis for severity 3 incidents that reveal systemic issues or near-misses that could have escalated.
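That decision rule can be captured in a small helper; the method name and keyword flags are illustrative, not part of any standard API.

```ruby
# Decide whether an incident requires a post-mortem under the policy above:
# severity 1-2 always does; severity 3 only when it exposes a systemic
# issue or was a near-miss that could have escalated.
def post_mortem_required?(severity, systemic_issue: false, near_miss: false)
  return true if severity <= 2
  systemic_issue || near_miss
end
```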
Key Principles
Post-mortem analysis operates on several fundamental principles that distinguish effective analysis from superficial incident reviews. The blameless principle prohibits attributing incidents to individual error or negligence. System design creates conditions where mistakes become incidents, and effective analysis examines why those conditions existed rather than who made the mistake. This principle enables honest discussion about failures without fear of punishment.
The focus on systems over individuals recognizes that production incidents result from complex interactions between software, infrastructure, processes, and organizations. A database query causing an outage reflects not just a poorly optimized query but also inadequate query review processes, missing performance testing, insufficient monitoring, and capacity planning gaps. The analysis traces contributing factors across multiple layers rather than stopping at the proximate cause.
Timeliness ensures analysis occurs while details remain fresh in participants' memories. Organizations typically complete post-mortems within three to seven days after incident resolution. Delays reduce accuracy as participants forget subtle details, conversations, and decision-making context that influenced the incident timeline.
Action items transform learning into improvement. Each post-mortem identifies specific, actionable tasks to address root causes and contributing factors. These items receive owners, target completion dates, and tracking mechanisms. Post-mortems without actionable outcomes waste effort documenting problems without driving improvement.
The shared document principle requires writing comprehensive incident documentation accessible to the entire engineering organization. This documentation becomes institutional knowledge, preventing other teams from repeating similar mistakes and informing architecture decisions based on real operational experience. Teams reference historical post-mortems when designing systems or debugging new incidents.
```ruby
# Post-mortem document structure
class PostMortem
  attr_reader :incident_id, :timeline, :root_causes,
              :contributing_factors, :action_items
  # Writable so workflow tooling can populate them after creation
  attr_accessor :severity, :start_time, :end_time, :impact

  def initialize(incident_id)
    @incident_id = incident_id
    @timeline = []
    @root_causes = []
    @contributing_factors = []
    @action_items = []
  end

  def add_timeline_event(timestamp, description, actor)
    @timeline << {
      timestamp: timestamp,
      description: description,
      actor: actor,
      evidence: nil # Logs, screenshots, metrics
    }
  end

  def add_root_cause(description, category)
    @root_causes << {
      description: description,
      category: category,    # :code, :config, :capacity, :process
      detection_method: nil, # How identified during analysis
      why_chain: []          # Five Whys progression
    }
  end

  def add_action_item(description, owner, priority, target_date)
    @action_items << {
      description: description,
      owner: owner,
      priority: priority, # :critical, :high, :medium
      target_date: target_date,
      prevents_recurrence: false,
      improves_detection: false,
      improves_response: false
    }
  end
end
```
The distinction between root causes and contributing factors clarifies the analysis. Root causes directly led to the incident - removing a root cause would have prevented the incident entirely. Contributing factors made the incident more likely or more severe but did not directly cause it. A deployment process that skips integration tests is a contributing factor; the specific bug in deployed code is the root cause.
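The counterfactual test ("would removing it have prevented the incident?") can be made explicit when recording findings. A minimal sketch, not part of the `PostMortem` class above:

```ruby
# Record a finding with the counterfactual test made explicit:
# if removing it would have prevented the incident, it is a root cause;
# otherwise it is a contributing factor.
def classify_finding(description, removal_would_have_prevented:)
  {
    description: description,
    kind: removal_would_have_prevented ? :root_cause : :contributing_factor
  }
end
```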
Implementation Approaches
The structured post-mortem meeting approach brings together incident participants for a facilitated discussion. The facilitator guides the group through incident timeline reconstruction, root cause analysis, and action item identification. This approach works well for complex incidents requiring multiple perspectives and creates shared understanding across teams. The meeting typically lasts 60-90 minutes and includes the incident commander, responding engineers, on-call engineers, and relevant stakeholders.
The facilitator starts by establishing psychological safety, explicitly stating the blameless principle and focusing discussion on systems and processes. The group reconstructs the incident timeline chronologically using chat logs, monitoring data, deployment records, and participant memory. Each timeline entry includes a timestamp, description, the actor or system involved, and supporting evidence. The group identifies decision points where responders chose between different approaches and documents the information available at each decision point.
After completing the timeline, the group identifies what went well during response - effective debugging techniques, helpful monitoring, clear communication. This positive framing balances the problem-focused analysis and identifies practices to preserve. The group then analyzes what went poorly, focusing on gaps in detection, unclear escalation paths, missing runbooks, or misleading metrics.
The asynchronous written approach creates post-mortem documents without requiring simultaneous participation. The incident commander drafts the initial document including timeline, impact summary, and preliminary root cause analysis. Team members review and comment asynchronously, adding details, corrections, and perspectives. This approach accommodates distributed teams across time zones and allows deeper individual reflection. The document undergoes several review iterations before publishing.
```ruby
# Asynchronous post-mortem workflow
class PostMortemWorkflow
  def self.create_draft(incident, commander)
    post_mortem = PostMortem.new(incident.id)
    post_mortem.severity = incident.severity
    post_mortem.start_time = incident.detected_at
    post_mortem.end_time = incident.resolved_at

    # Commander drafts initial timeline from incident logs
    incident.response_log.each do |entry|
      post_mortem.add_timeline_event(
        entry.timestamp,
        entry.action,
        entry.responder
      )
    end

    # Request input from participants
    incident.participants.each do |participant|
      notify_for_review(participant, post_mortem)
    end

    post_mortem
  end

  def self.incorporate_feedback(post_mortem, comment)
    case comment.section
    when :timeline
      post_mortem.timeline.insert(
        comment.position,
        comment.event_details
      )
    when :root_cause
      post_mortem.add_root_cause(
        comment.cause_description,
        comment.category
      )
    end
  end
end
```
The hybrid approach combines structured meetings with written documentation. The incident commander drafts the timeline and initial analysis. The team holds a 45-minute meeting to review the draft, discuss root causes, and identify action items. After the meeting, the commander updates the document with meeting outcomes and publishes it for broader review. This approach balances thoroughness with efficiency.
The Five Whys technique drills into root causes by repeatedly asking why an issue occurred. Start with the incident symptom and ask why it happened. Take that answer and ask why again. Continue for five iterations or until reaching a systemic issue that action items can address. The technique prevents stopping at surface-level causes.
```ruby
# Five Whys analysis example
def five_whys_analysis(symptom)
  chain = [symptom]
  # Example progression:
  chain << "Database queries exceeded connection pool limit"
  chain << "New feature created N+1 query pattern"
  chain << "Code review did not identify query performance issue"
  chain << "Code review checklist does not include database query review"
  chain << "No standard checklist exists for reviewing database changes"
  # Actionable root cause:
  # "Create database change review checklist and make it required for PRs affecting models"
  chain
end
```
The Fishbone (Ishikawa) diagram approach categorizes contributing factors into predefined categories: people, process, technology, and environment. This structured categorization ensures comprehensive analysis across different dimensions and prevents fixating on a single factor type. Teams draw a diagram with the incident as the "head" and major categories as "bones" extending from the spine, then add specific factors to each category.
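A minimal sketch of that categorization, with the category list mirroring the four bones named above:

```ruby
# Group contributing factors under the four fishbone categories.
# Unknown categories raise so nothing silently falls off the diagram.
FISHBONE_CATEGORIES = %i[people process technology environment].freeze

def fishbone(incident, factors)
  diagram = { head: incident }
  FISHBONE_CATEGORIES.each { |category| diagram[category] = [] }
  factors.each do |factor|
    category = factor.fetch(:category)
    unless FISHBONE_CATEGORIES.include?(category)
      raise ArgumentError, "unknown category: #{category}"
    end
    diagram[category] << factor.fetch(:description)
  end
  diagram
end
```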
Common Patterns
The blameless post-mortem pattern creates psychological safety by explicitly prohibiting blame and focusing on systemic improvements. Documents avoid passive voice constructions that obscure responsibility ("the database was misconfigured" becomes "the deployment script misconfigured the database") while simultaneously avoiding blame ("engineer X misconfigured the database" becomes "the deployment script lacked validation for database configuration"). This framing acknowledges human actions while examining why systems allowed errors to propagate.
```ruby
# Blameless language transformation
class BlamelessLanguage
  # Transform passive constructions
  def self.clarify_agency(passive_text)
    # "The wrong configuration was deployed"
    # becomes
    # "The deployment automation deployed the wrong configuration"
    identify_actor_and_action(passive_text)
  end

  # Transform blame constructions
  def self.systemize(blame_text)
    # "Engineer deployed bad code"
    # becomes
    # "The deployment process allowed untested code to reach production"
    identify_process_gap(blame_text)
  end

  def self.identify_actor_and_action(text)
    # Extract what happened and what system/process was involved
    # Focus on automation, tools, or processes rather than individuals
  end

  def self.identify_process_gap(text)
    # Identify what control, check, or process should have prevented the issue
  end

  # `private` has no effect on class methods; hide the helpers explicitly
  private_class_method :identify_actor_and_action, :identify_process_gap
end
```
The learning review pattern publishes completed post-mortems organization-wide and discusses notable incidents in engineering meetings. Teams present post-mortems to share learnings, especially when incidents reveal patterns applicable to other systems. This pattern transforms individual team incidents into organizational learning and builds institutional knowledge about system reliability patterns.
The near-miss analysis pattern applies post-mortem analysis to incidents that did not impact users but could have under different circumstances. A database reaching 85% capacity might not cause an immediate outage but represents a near-miss worth analyzing. These analyses catch issues before they become customer-impacting incidents and demonstrate proactive reliability work.
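A sketch of a near-miss trigger for capacity readings like the one above; the 80% review threshold is an assumed policy value, not a standard.

```ruby
# Flag a resource reading as a near-miss worth analyzing: high enough to
# signal risk, but short of an outage.
def near_miss?(utilization_pct, review_threshold: 80, outage_threshold: 100)
  utilization_pct >= review_threshold && utilization_pct < outage_threshold
end
```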
The pattern recognition approach identifies recurring incident types and creates specific remediation tracks. After conducting multiple post-mortems, teams notice patterns: deployment-related incidents cluster around configuration management, database incidents relate to capacity planning, and API failures stem from circuit breaker configuration. Recognizing these patterns enables systematic improvements that address whole categories of incidents rather than individual occurrences.
```ruby
# Pattern recognition across incidents
class IncidentPatternAnalyzer
  def analyze_patterns(post_mortems)
    patterns = {}
    post_mortems.group_by { |pm| pm.primary_category }.each do |category, incidents|
      patterns[category] = {
        frequency: incidents.count,
        common_causes: extract_common_causes(incidents),
        systemic_issues: identify_systemic_issues(incidents),
        recommended_improvements: []
      }

      # If the same category appears frequently, identify a systemic fix
      if incidents.count >= 3
        patterns[category][:recommended_improvements] =
          suggest_systemic_improvements(category, incidents)
      end
    end
    patterns
  end

  private

  def extract_common_causes(incidents)
    incidents.flat_map(&:root_causes)
             .group_by { |cause| cause[:category] }
             .transform_values(&:count)
             .sort_by { |_, count| -count }
  end

  def identify_systemic_issues(incidents)
    # Look for gaps appearing across multiple incidents:
    # missing monitoring, inadequate testing, unclear documentation
    incidents.flat_map(&:contributing_factors)
             .select { |factor| factor[:type] == :systemic_gap }
             .group_by { |factor| factor[:gap_category] }
             .select { |_, factors| factors.count >= 2 }
  end
end
```
The action item tracking pattern creates a standardized process for tracking post-mortem action items to completion. Each action item receives an owner, priority, target date, and tracking ticket. Teams review outstanding action items weekly and report completion rates as a reliability metric. This pattern ensures post-mortems drive actual improvements rather than creating documentation that teams ignore.
The pre-mortem pattern inverts post-mortem analysis by conducting it before launching new systems or features. Teams imagine the new system has failed catastrophically and work backward to identify what could have caused the failure. This prospective analysis identifies risks and mitigation strategies during design and development rather than after production failures.
Practical Examples
A deployment incident demonstrates comprehensive post-mortem analysis. On March 15, 2024, a mobile API experienced complete service disruption for 23 minutes affecting 100% of mobile users. The incident began at 14:37 UTC when automated deployment tooling deployed version 2.8.0 to production. Request success rate dropped from 99.9% to 0% within two minutes.
```ruby
# Incident timeline reconstruction
incident_timeline = [
  {
    time: "14:37:00 UTC",
    event: "Deployment pipeline initiated rollout of v2.8.0",
    actor: "Automated deployment system",
    evidence: "Deployment logs show successful image push"
  },
  {
    time: "14:37:30 UTC",
    event: "First production pod running v2.8.0",
    actor: "Kubernetes",
    evidence: "Pod event logs show container started successfully"
  },
  {
    time: "14:38:45 UTC",
    event: "Error rate alerts triggered for mobile-api service",
    actor: "Monitoring system",
    evidence: "PagerDuty incident #4521 created"
  },
  {
    time: "14:39:10 UTC",
    event: "On-call engineer acknowledged alert",
    actor: "Engineer A",
    evidence: "PagerDuty acknowledgment timestamp"
  },
  {
    time: "14:40:30 UTC",
    event: "Engineer identified 100% error rate from new pods",
    actor: "Engineer A",
    evidence: "Slack message in #incidents channel"
  },
  {
    time: "14:42:15 UTC",
    event: "Rollback initiated to v2.7.5",
    actor: "Engineer A",
    evidence: "Deployment command in terminal history"
  },
  {
    time: "14:52:00 UTC",
    event: "Rollback complete, traffic recovering",
    actor: "Kubernetes",
    evidence: "Pod metrics showing successful requests"
  },
  {
    time: "15:00:00 UTC",
    event: "Service fully recovered, incident resolved",
    actor: "Engineer A",
    evidence: "Error rate returned to baseline"
  }
]
```
The root cause analysis revealed that version 2.8.0 included a database migration adding a new column with a NOT NULL constraint but no default value. The migration ran successfully in staging where the database contained only test data. In production, the migration attempted to add the NOT NULL column to 15 million existing rows without default values, causing the migration to fail. The application code assumed the column existed and crashed when the schema did not match expectations.
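This failure mode has a well-known safe alternative: add the column as nullable, backfill existing rows, then add the constraint. A sketch expressing that plan as ordered SQL steps (table and column names hypothetical; in practice the backfill runs in batches):

```ruby
# Safe rollout of a NOT NULL column on a populated table, as an ordered
# plan of SQL steps: add nullable, backfill, then constrain.
def safe_not_null_plan(table, column, type, default_sql)
  [
    "ALTER TABLE #{table} ADD COLUMN #{column} #{type};",
    "UPDATE #{table} SET #{column} = #{default_sql} WHERE #{column} IS NULL;",
    "ALTER TABLE #{table} ALTER COLUMN #{column} SET NOT NULL;"
  ]
end
```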
The Five Whys analysis progressed: (1) Why did the service crash? The code referenced a database column that did not exist. (2) Why did the column not exist? The database migration failed to add it. (3) Why did the migration fail? The migration added a NOT NULL column without a default value to a table with existing data. (4) Why did this pass staging testing? The staging database had no existing rows in the affected table. (5) Why did staging not have production-like data? The staging data refresh process only copied schema, not data.
```ruby
# Action items from deployment incident
action_items = [
  {
    description: "Update migration tooling to require explicit default values for new NOT NULL columns",
    owner: "Platform team",
    priority: :critical,
    target_date: "2024-03-22",
    prevents_recurrence: true
  },
  {
    description: "Implement staging data refresh to include production-scale anonymized data",
    owner: "DevOps team",
    priority: :high,
    target_date: "2024-04-05",
    prevents_recurrence: true
  },
  {
    description: "Add pre-deployment migration validation check that runs migrations against production-scale data copy",
    owner: "Platform team",
    priority: :high,
    target_date: "2024-03-29",
    prevents_recurrence: true
  },
  {
    description: "Create runbook for rollback procedures during deployment incidents",
    owner: "Engineer A",
    priority: :medium,
    target_date: "2024-03-20",
    improves_response: true
  }
]
```
A capacity incident illustrates different analysis considerations. On April 3, 2024, the background job processing system experienced severe queue backlog, with job processing time increasing from 2 minutes average to over 4 hours. The incident lasted 6 hours and affected asynchronous features including email delivery, report generation, and data exports.
Investigation revealed that job processing capacity had remained static at 50 workers while job volume grew 300% over three months. Monitoring tracked queue depth but not processing velocity or wait time, so the gradual degradation went unnoticed until the queue had grown to millions of jobs. The incident resolved after engineers manually scaled the worker pool to 200, which cleared the backlog over the six-hour incident window.
Root cause analysis identified inadequate capacity planning and monitoring gaps. The system lacked autoscaling configuration, capacity projections, or alerts on processing velocity degradation. Action items included implementing queue-based autoscaling, creating capacity planning reviews for high-growth features, and adding monitoring for job wait time and processing velocity.
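The queue-based autoscaling action item can be sketched as a simple sizing rule; the throughput figures, bounds, and the rule itself are illustrative (the 200-worker ceiling matches the manual scale-up described above):

```ruby
# Size the worker pool from queue arrival rate and per-worker throughput,
# clamped to fixed bounds.
def desired_workers(jobs_per_minute, jobs_per_worker_per_minute, min: 10, max: 200)
  needed = (jobs_per_minute.to_f / jobs_per_worker_per_minute).ceil
  needed.clamp(min, max)
end
```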
A third-party dependency incident demonstrates external failure analysis. On May 12, 2024, a payment processing integration failed for 45 minutes affecting checkout completion. The payment gateway experienced an outage returning 503 errors for all requests. The application lacked circuit breaker patterns, so it continued sending requests and waiting for responses, exhausting connection pools and blocking other operations.
The post-mortem examined not just the external outage (outside the team's control) but the system's response to that outage. Root causes included missing circuit breakers, no timeout configuration for external service calls, and insufficient redundancy in payment options. Action items focused on defensive programming patterns, implementing circuit breakers, adding fallback payment processors, and improving graceful degradation.
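A minimal circuit breaker along the lines of that action item might look like the following; the thresholds, the string-based "open" signal, and the injectable clock are illustrative choices, not a library API:

```ruby
# Minimal circuit breaker sketch: open after consecutive failures, fail
# fast while open, and allow a retry after a cooldown (half-open).
class CircuitBreaker
  def initialize(failure_threshold: 3, cooldown_seconds: 30, clock: -> { Time.now.to_f })
    @failure_threshold = failure_threshold
    @cooldown_seconds = cooldown_seconds
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def call
    raise "circuit open" if open?
    begin
      result = yield
      @failures = 0 # any success closes the circuit
      result
    rescue StandardError
      @failures += 1
      @opened_at = @clock.call if @failures >= @failure_threshold
      raise
    end
  end

  def open?
    return false if @opened_at.nil?
    if @clock.call - @opened_at >= @cooldown_seconds
      # Cooldown elapsed: half-open, let the next call through
      @opened_at = nil
      @failures = 0
      false
    else
      true
    end
  end
end
```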
Tools & Ecosystem
Post-mortem documentation platforms provide structured templates, collaboration features, and historical incident archives. PagerDuty's postmortem feature integrates with its incident management platform, automatically populating incident metadata including timeline, responders, and resolution time. The platform supports collaborative editing, action item tracking, and organization-wide incident pattern analysis.
```ruby
# Integration with incident management tools
class IncidentManager
  def create_post_mortem_draft(incident_id)
    incident = fetch_incident(incident_id)
    draft = {
      incident_id: incident.id,
      severity: incident.severity,
      detected_at: incident.created_at,
      resolved_at: incident.resolved_at,
      responders: incident.responders.map(&:name),
      timeline: extract_timeline_from_logs(incident),
      affected_services: incident.impacted_services,
      customer_impact: calculate_customer_impact(incident)
    }

    # Push to document management system
    PostMortemRepository.create(draft)
  end

  private

  def extract_timeline_from_logs(incident)
    incident.response_log.map do |entry|
      {
        timestamp: entry.created_at,
        action: entry.action,
        actor: entry.responder.name,
        notes: entry.notes
      }
    end
  end
end
```
Confluence and Notion serve as common post-mortem repositories using page templates. Teams create page templates with standard sections: incident summary, timeline, root cause analysis, action items. The wiki format facilitates cross-referencing between incidents and searching historical post-mortems. These platforms lack specialized incident management features but integrate with existing documentation systems.
Jeli focuses on learning from incidents through in-depth investigation and cognitive psychology research. The platform emphasizes understanding decision-making under uncertainty rather than just identifying technical root causes. Jeli captures detailed timelines including what information was available at each decision point and what assumptions responders made.
GitHub Issues and Jira provide action item tracking. Teams create tracking tickets for each post-mortem action item, linking tickets to the post-mortem document. This integration enables tracking action item completion rates across incidents and identifying common remediation patterns. Teams report action item velocity as a reliability metric.
```ruby
# Action item tracking integration
class ActionItemTracker
  def create_tracking_tickets(post_mortem)
    post_mortem.action_items.each do |item|
      ticket = create_ticket(
        title: item[:description],
        assignee: item[:owner],
        priority: map_priority(item[:priority]),
        due_date: item[:target_date],
        labels: ["post-mortem", "incident-#{post_mortem.incident_id}"]
      )
      link_to_post_mortem(ticket, post_mortem)
      if item[:prevents_recurrence]
        ticket.add_label("prevents-recurrence")
      end
    end
  end

  def track_completion_rate
    items = ActionItem.where("created_at > ?", 90.days.ago)
    completed = items.where(status: "closed").count
    total = items.count
    {
      # Guard against division by zero when no items exist in the window
      completion_rate: total.zero? ? 0.0 : (completed.to_f / total * 100).round(2),
      overdue: items.where("due_date < ? AND status != ?", Date.today, "closed").count,
      avg_time_to_close: calculate_avg_time_to_close(items)
    }
  end
end
```
Log aggregation systems including Elasticsearch, Splunk, and Datadog provide historical data for timeline reconstruction. These systems enable searching application logs, system metrics, and infrastructure events by timestamp to build accurate incident timelines. Query capabilities allow correlating events across multiple services to understand cascading failures.
Distributed tracing systems like Jaeger and Honeycomb track request flows through microservice architectures. During post-mortem analysis, engineers examine traces from failing requests to identify exactly where errors occurred, what services were involved, and how long each operation took. Trace data supplements logs with request-level context.
Observability platforms combine metrics, logs, and traces into unified incident investigation interfaces. New Relic, Datadog, and Grafana provide pre-built incident timelines showing metric anomalies, deployment events, and alert triggers. These platforms reduce time spent gathering data for post-mortem analysis.
Common Pitfalls
Stopping at proximate causes rather than systemic issues produces ineffective post-mortems that do not prevent recurrence. A post-mortem that concludes "the engineer deployed untested code" stops at the proximate cause (the deployment) without examining why untested code reached deployment. Effective analysis asks why the deployment process allowed untested code, why testing coverage did not catch the issue, and why code review did not identify problems.
```ruby
# Shallow vs deep root cause analysis
class RootCauseDepth
  # Shallow analysis - stops at proximate cause
  def shallow_analysis
    {
      conclusion: "Engineer deployed code with a bug",
      action_items: ["Engineers should test code before deploying"]
    }
  end

  # Deep analysis - examines systemic issues
  def deep_analysis
    {
      proximate_cause: "Deployment contained uncaught exception in error handler",
      contributing_factors: [
        "Error handling code path not covered by automated tests",
        "Code review did not identify missing test coverage",
        "Staging environment did not trigger the error condition",
        "No alerts for error handler failures"
      ],
      systemic_issues: [
        "No requirement for test coverage on error handling paths",
        "Code review checklist does not include error path testing",
        "Staging environment does not replicate production error conditions",
        "Monitoring gaps in error handler execution"
      ],
      action_items: [
        "Add test coverage requirement for all error handling code",
        "Update code review checklist with error path verification",
        "Configure staging to replay production error conditions",
        "Implement monitoring for error handler execution and failures"
      ]
    }
  end
end
```
Blame culture destroys post-mortem value by making participants defensive and preventing honest analysis. When organizations punish people for incidents, engineers hide information, downplay severity, and avoid documenting uncomfortable truths. Post-mortems become exercises in blame deflection rather than learning. Comments like "the engineer should have known better" or "this was obviously a mistake" introduce blame even in supposedly blameless post-mortems.
Delaying the post-mortem lets participant memories fade and incident details blur. Conducting post-mortems four weeks after an incident produces incomplete timelines, forgotten context, and superficial analysis. Teams should complete post-mortems within one week, while details remain clear and motivation for improvement remains high.
Action items without clear ownership, deadlines, or tracking never complete. Post-mortems listing vague action items like "improve monitoring" or "add more tests" without specific owners and dates generate documentation that teams ignore. Effective action items specify exactly what will be done, who will do it, and by when, with tracking mechanisms ensuring accountability.
Excessive action items overwhelm teams and prevent focus on high-impact improvements. Post-mortems identifying 20 action items scatter effort across many small changes rather than addressing core systemic issues. Effective post-mortems prioritize 3-5 critical action items that address root causes and demonstrably reduce incident likelihood.
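A simple way to enforce that focus is to rank items and cap the list; the priority-to-rank mapping is an assumed convention:

```ruby
# Rank action items by priority and cap the list so effort stays focused
# on the highest-impact fixes.
PRIORITY_RANK = { critical: 0, high: 1, medium: 2, low: 3 }.freeze

def focus_action_items(items, limit: 5)
  items.sort_by { |item| PRIORITY_RANK.fetch(item[:priority]) }.first(limit)
end
```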
```ruby
# Action item quality evaluation
class ActionItemQuality
  # Poor action items - vague, no owner, no deadline
  def poor_action_items
    [
      "Improve monitoring",
      "Add more tests",
      "Better documentation needed"
    ]
  end

  # Good action items - specific, owned, dated
  def good_action_items
    [
      {
        description: "Add alert for database connection pool utilization exceeding 80%",
        owner: "Platform team - Engineer B",
        target_date: Date.new(2024, 3, 22),
        success_criteria: "Alert fires when connection pool reaches 80% and pages on-call",
        prevents_recurrence: true
      },
      {
        description: "Create integration test suite for payment processing error scenarios",
        owner: "Payments team - Engineer C",
        target_date: Date.new(2024, 3, 29),
        success_criteria: "Test suite covers timeout, 503, and malformed response scenarios",
        prevents_recurrence: true
      }
    ]
  end
end
```
Over-focusing on technical details while ignoring process and organizational factors produces narrow analysis. Incidents often result from communication breakdowns, unclear ownership, documentation gaps, or organizational pressures that encourage speed over safety. Post-mortems examining only code and configuration miss important contributing factors in how teams work.
Conducting post-mortems only for customer-impacting incidents misses learning opportunities. Near-misses that did not affect users reveal systemic weaknesses before they cause major incidents. An alert firing for database capacity but resolving itself before impact represents a near-miss worth analyzing even without customer impact.
Generic action items that could apply to any incident indicate shallow analysis. Action items like "improve communication" or "better testing" apply broadly but do not address specific incident characteristics. Effective action items tie directly to identified root causes and contributing factors.
Reference
Post-Mortem Document Template
| Section | Required | Description |
|---|---|---|
| Incident Summary | Yes | One-paragraph overview of what happened, impact, and duration |
| Incident Metadata | Yes | Severity, start time, end time, affected services, detection method |
| Impact Assessment | Yes | User impact quantification, business metrics affected, SLO violations |
| Timeline | Yes | Chronological event list with timestamps, actions, actors, evidence |
| Root Cause Analysis | Yes | Direct causes whose removal would have prevented the incident |
| Contributing Factors | Yes | Conditions that made incident more likely or more severe |
| What Went Well | Yes | Effective practices during detection and response |
| What Went Poorly | Yes | Gaps in detection, response, or system design |
| Action Items | Yes | Specific improvements with owners, dates, tracking |
| Related Incidents | Optional | Similar past incidents or near-misses |
Root Cause Categories
| Category | Description | Example |
|---|---|---|
| Code Defect | Bug in application code | Null pointer exception, logic error, race condition |
| Configuration Error | Incorrect system configuration | Wrong environment variable, misconfigured load balancer |
| Capacity Insufficient | Resource exhaustion | Database connection pool exhausted, disk full |
| Dependency Failure | External service or library failure | Third-party API outage, library bug |
| Process Gap | Missing or inadequate process | Skipped testing step, incomplete code review |
| Operational Error | Manual operation mistake | Wrong command executed, wrong server targeted |
| Design Flaw | Architectural limitation | Single point of failure, missing redundancy |
| Monitoring Gap | Issue not detected or alerted | Missing metric, misconfigured alert threshold |
Action Item Priorities
| Priority | Criteria | Target Completion |
|---|---|---|
| Critical | Prevents recurrence of severity 1-2 incident | Within 1 week |
| High | Significantly reduces incident likelihood or improves detection | Within 2-4 weeks |
| Medium | Incremental improvement to reliability or response | Within 1-2 months |
| Low | Nice-to-have improvement or long-term goal | Within 3+ months |
Post-Mortem Meeting Agenda
| Time | Activity | Facilitator Actions |
|---|---|---|
| 0-5 min | Introduction and psychological safety | State blameless principle, set expectations |
| 5-30 min | Timeline reconstruction | Guide chronological review, fill gaps, note decision points |
| 30-40 min | What went well | Identify effective practices to preserve |
| 40-55 min | Root cause analysis | Guide Five Whys, distinguish root causes from symptoms |
| 55-75 min | Action item identification | Ensure specific, owned, dated items with clear success criteria |
| 75-90 min | Review and next steps | Confirm document owner, publication timeline, follow-up meeting |
Five Whys Progression
| Level | Question Pattern | Goal |
|---|---|---|
| Why 1 | Why did the symptom occur? | Identify immediate technical cause |
| Why 2 | Why did that cause exist? | Identify system condition enabling cause |
| Why 3 | Why was that condition present? | Identify process or design gap |
| Why 4 | Why does that gap exist? | Identify organizational or architectural factor |
| Why 5 | Why was that factor not addressed? | Identify actionable systemic improvement |
Incident Impact Metrics
| Metric | Measurement | Use |
|---|---|---|
| User Impact Percentage | Percentage of users affected | Severity classification |
| Duration | Time from detection to resolution | Response effectiveness |
| Time to Detect | Delay from incident start to detection | Monitoring effectiveness |
| Time to Mitigate | Time from detection to impact reduction | Response speed |
| MTTR | Mean Time To Recovery across incidents | Overall reliability trend |
| SLO Burn Rate | Rate of error budget consumption | Incident priority |
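The timing metrics above can be computed directly from incident timestamps; this sketch takes `Time` objects and reports durations in seconds:

```ruby
# Compute timing metrics from incident timestamps.
def impact_metrics(started_at:, detected_at:, mitigated_at:, resolved_at:)
  {
    time_to_detect: detected_at - started_at,   # monitoring effectiveness
    time_to_mitigate: mitigated_at - detected_at, # response speed
    duration: resolved_at - detected_at           # detection to resolution
  }
end

# Mean Time To Recovery across a set of incidents, in seconds.
def mttr(incidents)
  return 0.0 if incidents.empty?
  incidents.sum { |i| i[:resolved_at] - i[:detected_at] } / incidents.length
end
```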
Post-Mortem Quality Checklist
| Criteria | Evaluation Question |
|---|---|
| Blameless | Does document avoid attributing incident to individual failure? |
| Timeline Complete | Does timeline include detection, response actions, and resolution? |
| Root Causes Identified | Are root causes systemic issues rather than symptoms? |
| Contributing Factors Listed | Are conditions that worsened incident documented? |
| Action Items Specific | Does each action item have owner, date, and success criteria? |
| Learning Captured | Does document provide value to teams working on similar systems? |
| Evidence Linked | Are timeline events supported by logs, metrics, or screenshots? |