Post Mortem

1. Initial Statement

  • This is a blameless post mortem. Our goal is to better understand the system we operate. We follow the Retrospective Prime Directive:

    “Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”

  • We will not speculate on “would have…”, “should have…”, “could have…”. These are counterfactual statements: they refer to the system as imagined rather than the system as it actually is, which is our focus.

2. Timeline of the Incident

  • This includes every action we took and when (ideally with links to Slack), the effects we observed (specific metrics, configuration changes, etc. are preferred over subjective narratives), the investigation paths we followed, and the resolutions that were considered.
    • When was the incident detected, and by whom?
    • How was it discovered? (e.g., automated monitoring, manual detection, customer notified us)
    • When service was satisfactorily restored
    • Events that contributed to the incident
    • Troubleshooting paths and actions taken
    • What did our external communications with the customer look like?
    • Perspectives of the participants

2023-04-30

  • Incident start?

2023-08-30

  • Incident end?

3. Factors that Contributed to the Incident

  • Talk about the most important contributing factors, try to understand them, and discover others (use brainstorming or “infinite hows”).
  • All perspectives should be included and respected. Nobody should be permitted to argue with a contributing factor somebody else has identified.
  • The goal is not to engage in Convergent Behavior such as trying to identify one or more Root Causes.
  • It is not about finding the single root cause; it is about several factors and the interactions between them.
  • In a complex system, incidents are often the result of multiple, interacting factors. Focusing solely on root causes can lead to a narrow view that misses important systemic issues.
  • Sort the contributing factors into categories such as the following:

Design Decisions

  • Example: Lack of Redundancy - Critical components of the system might not have redundancy, causing a single point of failure.

Remediation

  • Example: Delayed Patch Implementation - A known vulnerability might have been patched too late, allowing an incident to occur.

Discovering There Was a Problem

  • Example: Late Detection by Monitoring Systems - The monitoring systems in place may have failed to detect the issue promptly.

Operational Practices

  • Example: Lack of Clear Procedures - The absence of clear operational procedures might have led to confusion or incorrect actions during the incident.

Technical Failures

  • Example: Integration Issues - Poor integration with other systems or services might have caused unexpected problems.

Communication Breakdowns

  • Example: Inaccurate Information Dissemination - Incorrect or outdated information might have been shared, causing more issues.

4. Corrective Actions that Will Be Made Top Priorities After the Meeting

  • Brainstorming
  • Choosing the best potential actions, either to prevent recurrence or to enable faster detection and recovery.
  • The goal is to identify the smallest number of incremental steps to achieve the desired outcomes, as opposed to “big bang” changes, which not only take longer to implement, but delay the improvements we need.
  • Every corrective action must be assigned to an owner and given a deadline. If a corrective action is not a top priority when the meeting is over, then it is not a corrective action (this ensures problems are addressed and prevents us from generating a list of good ideas that are never implemented).

Corrective actions (Owner / Deadline 30 days)

4.1 List of Lower Priority Items

  • Create a list of lower-priority items and assign an owner to each. If a similar incident happens again, we can use this list as a starting point for future countermeasures.

5. Incident Metrics and Their Organizational Impact

Examples of Incident Metrics

  • Event severity: relates to impact on the service and customers
  • Total downtime: how long were customers unable to use the service to any degree?
  • Time to detect: how long until we knew there was a problem?
  • Time to resolve: how long after we knew there was a problem did it take us to solve it?
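  • A minimal sketch of how the time-based metrics above can be derived from the incident timeline in section 2 (Python, with hypothetical timestamps used purely for illustration, not real incident data):

      from datetime import datetime

      # Hypothetical timestamps for illustration only; take the real values
      # from the incident timeline in section 2.
      incident_start = datetime.fromisoformat("2023-04-30T10:00:00+00:00")
      detected_at    = datetime.fromisoformat("2023-04-30T10:25:00+00:00")
      resolved_at    = datetime.fromisoformat("2023-04-30T12:10:00+00:00")

      time_to_detect  = detected_at - incident_start   # how long until we knew there was a problem
      time_to_resolve = resolved_at - detected_at      # how long from detection to resolution
      total_downtime  = resolved_at - incident_start   # customer-facing outage window

      print(f"Time to detect:  {time_to_detect}")      # 0:25:00
      print(f"Time to resolve: {time_to_resolve}")     # 1:45:00
      print(f"Total downtime:  {total_downtime}")      # 2:10:00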

Bibliography

This material is greatly inspired by: