Writing a public postmortem regarding an outage is essential to maintaining transparency and accountability when things go wrong in a service or system.
A postmortem analyzes and documents an incident: what happened, why it happened (the root causes), how it was resolved, and what measures will be taken to prevent similar issues in the future.
As an incident handler, you might be asked to write a postmortem report that can be shared with senior executives, staff, or even customers so they understand what happened and how the issue was resolved.
To help you with this, here's a step-by-step guide on how to write an effective postmortem.
When you write an incident report about a service outage, answer the questions your customers are likely to ask. Your report should show that your organization knows what it's doing and can handle the situation well.
Let's look at the report structure and what points you should mention.
The postmortem report structure is made up of the following parts:
- Title and Introduction
- Issue summary
- Timeline of events
- Root cause analysis
- Impact and Mitigation
- Contributing factors
- Resolution and recovery
- Corrective and preventative measures
- Lessons learned
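The structure above can be turned into a reusable skeleton so that every report starts from the same outline. The following is a minimal sketch, not a prescribed tool: the section names follow the list above, while the function name `render_skeleton` and the `TODO` placeholder are assumptions made for illustration.

```python
from typing import Optional

# Section names follow the report structure listed above.
SECTIONS = [
    "Title and Introduction",
    "Issue summary",
    "Timeline",
    "Root cause analysis",
    "Impact and Mitigation",
    "Contributing factors",
    "Resolution and recovery",
    "Corrective and preventative measures",
    "Lessons learned",
]


def render_skeleton(title: str, notes: Optional[dict] = None) -> str:
    """Return a plain-text postmortem skeleton with one block per section."""
    notes = notes or {}
    lines = [title, ""]
    for section in SECTIONS:
        lines.append(f"{section}:")
        # Use any text already collected for this section, else a placeholder.
        lines.append(notes.get(section, "TODO"))
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_skeleton(
        "Postmortem: Service Outage on [Date]",
        {"Issue summary": "Short summary goes here."},
    ))
```

Starting from a fixed skeleton makes it harder to forget a section when writing under time pressure.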
Title and Introduction:
Start with a clear and concise title that reflects the incident, for example, "Postmortem: Service Outage on [Date]."
Begin the postmortem with an introduction acknowledging the incident and its impact on users.
Issue Summary:
Follow the introduction with an issue summary that is short and to the point, about five sentences long. Include the issue duration and the time zone, since you may have customers all over the world. The time zone helps customers who are trying to correlate a problem on their side with your outage.
Next, describe the outage impact, for example the error responses that most users received and how severe the impact was at its peak.
Timeline of Events:
Present a chronological sequence of events leading up to and during the outage.
Include timestamps for each event to give readers a clear understanding of the timeline.
The first entry should include the details that ultimately caused the outage, and the final entry is when 100% of the traffic was back online.
If your outage were to span multiple days, include the date stamp.
This part is where you can get users interested in reading the postmortem. Make the tone casual so that users can enjoy the story-like way the incident is explained.
For example, "ABC experiences a surge of pager notifications, encountering numerous HTTP 500 errors from all clusters."
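The timestamp guidance in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `format_timeline` and `outage_duration` and the sample events are hypothetical, and all times are rendered in UTC so that readers in any region can convert them. The date is included only when the events span more than one calendar day.

```python
from datetime import datetime, timezone


def format_timeline(events):
    """Format (timestamp, description) pairs as chronological timeline lines.

    Timestamps must be timezone-aware; they are rendered in UTC.
    """
    ordered = sorted(events, key=lambda e: e[0])
    utc_times = [ts.astimezone(timezone.utc) for ts, _ in ordered]
    # Show the date only when the outage spans more than one calendar day.
    multi_day = len({t.date() for t in utc_times}) > 1
    fmt = "%Y-%m-%d %H:%M UTC" if multi_day else "%H:%M UTC"
    return [
        f"{t.strftime(fmt)} - {desc}"
        for t, (_, desc) in zip(utc_times, ordered)
    ]


def outage_duration(events):
    """Return the time between the first and last timeline entries."""
    times = [ts for ts, _ in events]
    return max(times) - min(times)


if __name__ == "__main__":
    # Hypothetical events, invented purely for illustration.
    events = [
        (datetime(2024, 1, 10, 14, 51, tzinfo=timezone.utc),
         "Configuration push begins"),
        (datetime(2024, 1, 10, 14, 53, tzinfo=timezone.utc),
         "Pagers fire: HTTP 500 errors from all clusters"),
        (datetime(2024, 1, 10, 15, 36, tzinfo=timezone.utc),
         "100% of traffic back online"),
    ]
    for line in format_timeline(events):
        print(line)
    print("Duration:", outage_duration(events))
```

The duration computed here is the same figure the issue summary should quote, which keeps the two sections consistent.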
Root Cause Analysis (RCA):
Identify and explain the root cause(s) of the outage. It might be a software bug, hardware failure, human error, or a combination of factors. Be thorough and avoid blaming individuals; instead, focus on systemic issues and process improvements.
Explain the failures encountered while you tried to correct the root cause. You can use the 5 Whys technique to identify the root cause.
The 5 Whys technique works as follows:
- Initiating Analysis: Start by identifying the incident or issue to examine.
- Find the Initial Cause: Begin by asking, "Why did this happen?" Investigate the underlying factor that contributed to the issue.
- Going Deeper: Then ask, "Why did that happen?" Keep looking for what led to that cause.
- Recurring "Why" Queries: Keep asking "Why?" of each answer, working through the responses until the core problem is revealed.
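The chain of questions above can be recorded as a simple data structure so the reasoning is visible in the report. This is a minimal sketch, assuming the team has already agreed on the answers; the function name `five_whys` and the sample answers are hypothetical and only loosely follow the skipped-testing example used in this guide.

```python
def five_whys(problem, answers):
    """Pair each successive "Why?" question with its answer.

    `answers` is the ordered list of answers the team arrived at; the last
    one is treated as the candidate root cause.
    """
    chain = []
    current = problem
    for answer in answers:
        chain.append((f"Why did '{current}' happen?", answer))
        current = answer  # the next "Why?" is asked about this answer
    root_cause = answers[-1] if answers else None
    return chain, root_cause


if __name__ == "__main__":
    # Hypothetical incident, invented purely for illustration.
    chain, root_cause = five_whys(
        "users received HTTP 500 errors",
        [
            "a faulty configuration reached production",
            "the configuration push skipped testing",
            "the deploy process allows pushes without a test stage",
        ],
    )
    for question, answer in chain:
        print(question)
        print("  Because", answer)
    print("Candidate root cause:", root_cause)
```

The technique does not require exactly five questions; the chain stops when the answer points at a systemic issue that can be fixed.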
Do not sugarcoat any facts that might have caused the outage. For example, if any configuration push skipped testing and went straight to production, mention the cause in detail.
Impact and Mitigation:
Detail the outage's impact on users, customers, and any downstream services. Describe the immediate actions taken to mitigate the issue and restore services to normal.
Contributing Factors:
If any contributing factors exacerbated the outage, address them. These could include a lack of monitoring, insufficient redundancy, or outdated processes.
Resolution and Recovery:
Discuss how the incident was handled, the teams involved, and the communication process during the outage. Highlight the steps taken to resolve the issue and restore service functionality.
This section is generally divided into three paragraphs:
- The first paragraph should mention when and how internal monitoring systems alerted engineers that there was likely a problem.
- The second paragraph should discuss how engineers noticed that problem and tried to roll back.
- The final paragraph should talk about how engineers finally restored service.
Corrective and Preventative Measures:
This section is an itemized list of how to prevent this type of failure in the future and some critical thinking about what you can do better next time to handle these issues.
Outline the specific actions that will be taken to prevent similar incidents from happening in the future. Include concrete plans for improving monitoring, enhancing redundancy, or implementing new processes.
You can also mention details about the organization's systems, such as how engineers were alerted through the monitoring tool. This helps customers understand the issue better.
Lessons Learned:
Share any key takeaways and lessons learned from the incident, such as what went well and what went wrong during incident handling.
Include any additional technical details, graphs, or charts that support the postmortem findings.
Best practices for writing postmortems
No Blame Game: Instead of blaming people for incidents, blameless reviews focus on understanding what happened without pointing fingers.
Gather Data in a Commonly Accessible Location: During incident investigations, ensure that all team members collect relevant data in a centralized and commonly accessible location, such as a shared document or message feed.
See the Big Picture: Incidents usually involve multiple factors, and it's essential to look at the whole situation to find the root causes.
Encourage Honesty: Create a culture where people can admit mistakes without fear.
Open Communication: Let team members share their errors for valuable insights.
Automate the postmortem process: Automating postmortem generation significantly reduces the time spent copying and pasting incident data from various sources.
You can use Zenduty's postmortem template feature to add the relevant details, enabling incident handlers to start analyzing the incident promptly.
Learn from Previous Incidents: Living postmortems serve as an ongoing knowledge base. You can refer to past incidents, review the discussions, and learn from previous experiences, ensuring continuous improvement.
Include stats/live graphs: Postmortems are more than static data snapshots. With live charts, responders can interactively explore data trends over different time intervals or isolate specific metrics to gain a contextual understanding of the incident's progression.
Make it easy to find later: It's essential to ensure that the findings included in your postmortems are easy to locate to help team members investigate future incidents or write a runbook down the road.
Identify and include tags: Use descriptive tags and titles in your incidents and postmortems for easy searching. Relying solely on incident IDs or dates might not be sufficient, especially if you want to explore specific failure modes of a particular service. You can quickly find the information you need by tagging postmortems with relevant service names.
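To illustrate why tags make postmortems easier to find than incident IDs or dates alone, here is a minimal in-memory sketch. It does not represent any particular tool's search feature: the class `PostmortemIndex`, its methods, and the sample titles and tags are all hypothetical.

```python
from collections import defaultdict


class PostmortemIndex:
    """In-memory index mapping tags to postmortem titles."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, title, tags):
        """Register a postmortem under each of its tags (case-insensitive)."""
        for tag in tags:
            self._by_tag[tag.lower()].add(title)

    def find(self, *tags):
        """Return titles carrying every given tag, sorted alphabetically."""
        if not tags:
            return []
        matches = [self._by_tag.get(t.lower(), set()) for t in tags]
        return sorted(set.intersection(*matches))


if __name__ == "__main__":
    # Hypothetical postmortems, invented purely for illustration.
    index = PostmortemIndex()
    index.add("Checkout outage after config push",
              ["checkout-service", "config-push"])
    index.add("Checkout latency from database failover",
              ["checkout-service", "database"])
    print(index.find("checkout-service"))
    print(index.find("checkout-service", "database"))
```

Combining a service tag with a failure-mode tag narrows the results to the incidents most relevant to the one under investigation.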
Incident Postmortem Example: https://sre.google/sre-book/example-postmortem/
The above example shared by Google helps us understand how to write an incident postmortem.
To learn more about blameless postmortems and how they help organizations, check out the resources below:
The chapter by Google's SRE team provides a comprehensive understanding of what postmortems are, offers best practices, and highlights how cultivating a blameless culture can enhance an organization's incident management process.