Learning from Post-Mortems
Building on last week’s focus on defining incident severity and priority levels, this week we’ll cover post-mortems: what they are, why you should use them, and when they are useful. Let’s jump in.
A post-mortem is a written record of a past incident in which the team discusses what went wrong, what could have been done better, and what steps will be taken to prevent the incident from happening again. It breaks down the series of events that led to the incident, with the end goal of identifying the root cause. Post-mortems typically include a timeline of the events as they occurred, along with descriptions of who was notified, when they were notified, and what actions they took to resolve the issue. They also document the impact the incident had on customers and on the business.
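To make the parts of a post-mortem concrete, here is a minimal sketch of one as a data structure. The class and field names (PostMortem, TimelineEntry, follow_ups, and so on) are hypothetical and chosen only for illustration; they are not taken from any particular template or tool.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    """One event in the incident: when it happened, who acted, what they did."""
    at: datetime
    who: str
    action: str


@dataclass
class PostMortem:
    """Hypothetical container for the sections described above."""
    title: str
    customer_impact: str = ""
    business_impact: str = ""
    root_cause: str = ""
    timeline: list = field(default_factory=list)    # list of TimelineEntry
    follow_ups: list = field(default_factory=list)  # steps to prevent a repeat

    def record(self, at, who, action):
        """Add an event and keep the timeline in chronological order."""
        self.timeline.append(TimelineEntry(at, who, action))
        self.timeline.sort(key=lambda entry: entry.at)

    def missing_sections(self):
        """Return the names of sections that have not been filled in yet."""
        sections = {
            "customer_impact": self.customer_impact,
            "business_impact": self.business_impact,
            "root_cause": self.root_cause,
            "timeline": self.timeline,
            "follow_ups": self.follow_ups,
        }
        return [name for name, value in sections.items() if not value]
```

A helper like missing_sections can act as a simple checklist before the document is shared: if it returns anything, the write-up is not finished yet.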
It’s important to remember that post-mortems are not meant to assign blame to anyone. The goal is to identify process failures, not the failures of individuals. There’s no benefit to pointing fingers at specific people when you’re working together on a team. Instead, focus on identifying how the processes broke down, and what can be done to modify them to prevent the issue from happening again.
The ultimate goal of a post-mortem is to manage risk by preventing the same incident from happening again. By carrying out a post-mortem, the team seeks to learn what went wrong, identify the root cause that led to the incident, and add or modify processes so that it is far less likely to recur. Teams get better through continuous improvement, and post-mortems are one of the most effective ways for a team to collectively learn from its mistakes. By documenting the post-mortem, you’re creating a written record of the incident for others to learn from. You’ll be able to refer to it later, and new team members will be able to look back at previous incidents to learn how and why certain processes evolved at your company.
So when should you perform a post-mortem? You should typically carry one out shortly after the incident has been resolved, ideally about 24 to 48 hours afterward. The details of the incident will still be fresh in the minds of those who responded, so document it as quickly as you can to capture as much detail as possible. Post-mortems are usually only carried out for high-severity incidents, such as a SEV 1 or SEV 2, but this may vary depending on your team’s or company’s processes.
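If your team tracks incidents in a tool, the rule of thumb above could be encoded roughly as follows. This is only a sketch under stated assumptions: severities are represented as integers (1 for SEV 1, and so on), the 24 to 48 hour guidance is read as a target window that opens 24 hours after resolution and closes at 48 hours, and the names needs_post_mortem and review_window are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: only SEV 1 and SEV 2 incidents require a post-mortem.
# Adjust this set to match your own team's or company's process.
POST_MORTEM_SEVERITIES = {1, 2}

# Assumption: the target window opens 24h and closes 48h after resolution.
WINDOW_OPENS = timedelta(hours=24)
WINDOW_CLOSES = timedelta(hours=48)


def needs_post_mortem(severity):
    """Return True if an incident of this severity should get a post-mortem."""
    return severity in POST_MORTEM_SEVERITIES


def review_window(resolved_at):
    """Return (earliest, latest) target times for holding the post-mortem."""
    return resolved_at + WINDOW_OPENS, resolved_at + WINDOW_CLOSES
```

Keeping the severity set and the window as named constants makes it easy to change the policy in one place if your process differs.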
There is a Post-Mortem Template available on the Resources & Templates page for you to use and customize for your team. Next time there is a major incident at your company, make sure you’re documenting it through a post-mortem. If there’s no process in place for one already, suggest adding one so that your team can learn from its mistakes and raise the engineering bar at your company.