Incidents and accidents can occur in various domains, from information technology and cybersecurity breaches to workplace accidents and transportation mishaps.
When faced with such incidents, it becomes crucial to conduct a thorough analysis to understand the underlying causes and implications.
Incident analysis goes beyond problem-solving; it offers valuable insights into preventing future occurrences and improving systems and processes.
Let’s dive deeper into incident analysis and how it benefits an organization.
What is an incident?
An incident, as defined by the National Institute of Standards and Technology (NIST), is a circumstance that actually or potentially jeopardizes an information system's confidentiality, integrity, or availability.
People often confuse incidents with events.
Events occur constantly as part of normal operation, whereas an incident involves suspicious or unwanted activity.
Event: User logs into the system with authorized access.
Incident: External agent gains access to the internal system through compromised permissions.
Now that you know what an incident is, let's look at why post-incident analysis is important for organizations.
The Power of Incident Analysis: Reasons to Prioritize It
Incident analysis is used to determine the following:
- What happened during an outage
- Who was affected
- What system components were involved
- How the issue was resolved
There are many different approaches to incident analysis. However, its fundamental components often include the following:
- Collecting information on the event
- Analyzing the information
- Drawing insights from the data
- Improving resilience for the future
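The questions and components above can be captured in a simple record. The following is a minimal Python sketch, not a reference to any particular tool: the `IncidentRecord` class, its field names, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    """Hypothetical record holding the answers incident analysis looks for."""
    summary: str                                              # what happened
    started_at: datetime
    resolved_at: datetime
    affected_users: list[str] = field(default_factory=list)   # who was affected
    components: list[str] = field(default_factory=list)       # components involved
    resolution: str = ""                                      # how it was resolved
    lessons: list[str] = field(default_factory=list)          # insights from the data

    def duration(self) -> timedelta:
        """Time between the start of the incident and its resolution."""
        return self.resolved_at - self.started_at


# Made-up example values.
record = IncidentRecord(
    summary="Checkout page returned errors",
    started_at=datetime(2024, 1, 10, 9, 0),
    resolved_at=datetime(2024, 1, 10, 9, 45),
    affected_users=["web customers"],
    components=["payment-api", "load-balancer"],
    resolution="Rolled back the latest deployment",
)
print(record.duration())  # 0:45:00
```

Keeping the same fields for every incident makes it easier to compare incidents later.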
Benefits of incident analysis
Finding the root cause is just the beginning; incident analysis brings multiple benefits to organizations.
Here are a few ways it adds value and enhances performance:
Helps understand how a system works:
Incident analysis allows engineers to delve deeper into how a system operates under various conditions. Identifying patterns, root causes, and potential vulnerabilities helps engineers understand the system's behavior.
Identify Patterns and Trends:
Incident analysis helps in identifying patterns and trends within a system. This way, engineers can spot recurring issues and identify common factors that may have gone unnoticed.
Prepares for Future Surprises:
SREs can improve their ability to handle unforeseen situations by looking back at previous incidents. Future surprises will rarely match earlier occurrences exactly, but engineers can still use post-incident analysis to create strategies, processes, and backup plans that reduce risk and support a more effective response.
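As a rough illustration of spotting recurring issues, the sketch below counts how often each component shows up across past incidents. The incident IDs, component names, and the `recurring_components` helper are hypothetical.

```python
from collections import Counter

# Hypothetical past incidents: (incident id, component involved).
past_incidents = [
    ("INC-1", "database"),
    ("INC-2", "load-balancer"),
    ("INC-3", "database"),
    ("INC-4", "cache"),
    ("INC-5", "database"),
]


def recurring_components(incidents, min_count=2):
    """Return components that appear in at least `min_count` incidents."""
    counts = Counter(component for _, component in incidents)
    return {name: n for name, n in counts.items() if n >= min_count}


print(recurring_components(past_incidents))  # {'database': 3}
```

A real analysis would group by more than one attribute (time of day, release, team), but the counting idea stays the same.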
Incident analysis techniques
Incident analysis involves various techniques and methodologies to identify, understand, and resolve incidents.
Here are some commonly used post-incident analysis techniques:
Tripod Beta Method
The Tripod Beta method is a systematic approach used to analyze incidents and understand the underlying causes.
The analysis process involves three steps:
- Identifying the sequence of events that led to the consequences of the incident.
- Identifying the barriers that were supposed to prevent or stop this sequence of events from unfolding.
- Identifying the specific reasons for the failure of each broken barrier, which are categorized into human failures (active failures), aspects of the working environment (preconditions), and latent failures in the organization.
The analysis draws on human error theory to determine why the barriers failed.
It investigates the specific errors or mistakes made by individuals, the failures in the working environment that contributed to these errors, and the latent failures within the organization that allowed these conditions to persist.
The analysis is visualized using a "tree" diagram, representing the incident mechanism and its relationships, providing a clear overview of the events and their connections.
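The three steps can be sketched as a small data model. This is an illustrative Python sketch only: the class names, the `broken_barriers` and `render_tree` helpers, and the example incident are hypothetical, and a real Tripod Beta tree is normally drawn as a diagram rather than printed as text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BarrierFailure:
    """Reasons a barrier failed, split into the three cause categories."""
    active_failure: str   # human failure
    precondition: str     # working-environment aspect
    latent_failure: str   # organizational weakness


@dataclass
class Barrier:
    name: str
    failure: Optional[BarrierFailure] = None  # None means the barrier held


# Step 1: the sequence of events that led to the consequences (made up).
events = ["Config change deployed", "Service restarted with bad config", "Outage"]

# Step 2: the barriers that were supposed to stop that sequence.
barriers = [
    Barrier("Code review", BarrierFailure(
        active_failure="Reviewer approved without checking the config diff",
        precondition="Review done under release-day time pressure",
        latent_failure="No policy requiring a second reviewer for config changes",
    )),
    Barrier("On-call alerting"),  # held, so it has no failure attached
]


def broken_barriers(all_barriers):
    """Step 3 starts from the barriers that actually failed."""
    return [b for b in all_barriers if b.failure is not None]


def render_tree(event_chain, all_barriers):
    """Return a plain-text version of the incident tree."""
    lines = [" -> ".join(event_chain)]
    for barrier in broken_barriers(all_barriers):
        lines.append(f"  Broken barrier: {barrier.name}")
        lines.append(f"    Active failure: {barrier.failure.active_failure}")
        lines.append(f"    Precondition: {barrier.failure.precondition}")
        lines.append(f"    Latent failure: {barrier.failure.latent_failure}")
    return "\n".join(lines)


print(render_tree(events, barriers))
```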
Unraveling Root Causes: The Power of the 5 Whys
The 5 Whys accident investigation technique is a simple approach that helps us understand why something went wrong or why a problem occurred.
Here's how the technique works:
- Identify the incident or issue to analyze.
- Ask, "Why did this happen?" and explore the underlying root cause.
- Then, inquire, "Why did that happen?" and explore more to discover the true cause.
- Ask "Why?" repeatedly, then go further into each response to find the core issue.
Remember, you do not have to stop at five questions. You can ask fewer or more, depending on the situation.
After identifying the primary contributing factors, consider what steps can be performed to stop future occurrences of this type.
The key is to keep asking "Why?" to uncover the underlying causes and not just focus on the surface-level reasons.
To help you better understand the 5 Whys accident investigation technique, consider the following example:
Scenario: Your company's website is down.
Why did it happen?
Ans: It ran out of memory.
Why did it run out of memory?
Ans: Because the configuration was incorrect.
Why was the configuration incorrect?
Ans: The site administrator made a mistake.
Why did the site administrator make a mistake?
Ans: Because the development team did not provide clear instructions.
Why did the development team not provide clear instructions?
Ans: Because they assumed the administrator would know how to configure it correctly.
So, this is how the technique works. For every cause of the problem, you have to define a countermeasure.
For instance, Cause 1: The website ran out of memory.
Countermeasure 1: Get the site up and running as soon as possible, and so on for each remaining cause.
Note: The 5 Whys accident analysis technique cannot be applied to all kinds of issues. It works for simple or moderately difficult problems and might not be a good option for resolving complex IT incidents.
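The website example can be written down as a chain of answers, each paired with a countermeasure. This is a minimal Python sketch: only the first countermeasure comes from the example above, and the remaining countermeasures and both helper functions are illustrative assumptions.

```python
# Each entry pairs the answer to one "Why?" with a countermeasure.
# Countermeasures 2 to 5 are illustrative assumptions.
five_whys = [
    ("The website ran out of memory",
     "Get the site up and running as soon as possible"),
    ("The configuration was incorrect",
     "Correct the memory settings"),
    ("The site administrator made a mistake",
     "Add a review step for configuration changes"),
    ("The development team did not provide clear instructions",
     "Write a configuration guide"),
    ("They assumed the administrator would know how to configure it",
     "Make handover documentation part of each release"),
]


def root_cause(chain):
    """Treat the answer to the last "Why?" as the root cause."""
    return chain[-1][0]


def report(problem, chain):
    """Build a plain-text summary of the whole chain."""
    lines = [f"Problem: {problem}"]
    for depth, (answer, countermeasure) in enumerate(chain, start=1):
        lines.append(f"Why {depth}: {answer}")
        lines.append(f"  Countermeasure {depth}: {countermeasure}")
    return "\n".join(lines)


print(report("The company's website is down", five_whys))
print("Root cause:", root_cause(five_whys))
```

Because the chain is just a list, it can hold fewer or more than five entries.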
Root Cause Analysis (RCA) Technique
The primary goal of RCA is to identify the root cause of a problem.
RCA involves drawing a diagram that displays the relationships between the causes of an event. This visual representation helps in understanding the contributing factors and their connections.
The Root Cause Analysis diagram distinguishes between three types of causes:
- Immediate Causes
- Underlying Causes
- Root Causes
Each cause type provides insights into different levels of the problem.
SREs use the "Why?" question to move through the causes in the diagram.
The diagram created during RCA forms a Cause-Consequence tree. This tree structure resembles an Event tree, displaying the causal relationships between various factors and their consequences.
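One way to picture the cause tree is as nested nodes, each tagged with its cause type. The sketch below is an assumption-laden illustration: the `Cause` class and `find_root_causes` helper are hypothetical, and the example tree reuses the website outage from the 5 Whys section.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    description: str
    kind: str  # "immediate", "underlying", or "root"
    because: list["Cause"] = field(default_factory=list)  # answers to "Why?"


def find_root_causes(cause):
    """Walk the tree by following each "Why?" and collect the root causes."""
    found = [cause.description] if cause.kind == "root" else []
    for deeper in cause.because:
        found.extend(find_root_causes(deeper))
    return found


tree = Cause("Website ran out of memory", "immediate", [
    Cause("Configuration was incorrect", "underlying", [
        Cause("No clear configuration instructions were provided", "root"),
    ]),
])

print(find_root_causes(tree))
# ['No clear configuration instructions were provided']
```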
Infinite Hows: Unlocking Effective Problem-solving
The infinite hows technique is a problem-solving approach that improves on the traditional 5 Whys technique.
Instead of asking "why," it encourages asking "how" to explore the problem and its causes in more detail.
The infinite hows technique emphasizes exploring symptoms in-depth.
Let's say a company experiences a network outage, and the IT team wants to quickly analyze and fix the problem.
Instead of asking, "Why did the network outage occur?", you can apply the infinite hows technique.
Here’s an example:
- Initiate the investigation by asking, "How did the network outage impact different services and departments?"
- To understand the root causes, ask, "How did the network infrastructure handle the increased traffic load? How were network devices configured to handle such a load?"
- Study specific symptoms by asking, "How did the outage affect specific applications or devices? How did it impact user experience?"
Continue this technique until you find the root cause and then apply a countermeasure for it.
3x5 Why Analysis: Exploring Comprehensive Problem Resolution
The 3x5 Why technique runs three complete iterations of a 5 Whys style analysis on the same problem, allowing for a more comprehensive exploration of the problem and its underlying causes.
Iteration 1-Specific Category: The first iteration focuses on the specific question of why the problem occurred. It aims to identify the immediate cause or root factor contributing to the issue.
Iteration 2-Detection Category: The second iteration digs into the detection aspect of the problem. It addresses the question of why the problem was overlooked, highlighting potential gaps in quality control or access control that may have contributed to the issue.
Iteration 3-Systemic Category: The final iteration takes a systemic approach by examining the systems in place that allowed the problem to occur. It targets larger organizational or process-oriented challenges that might have played a role in the problem's occurrence.
By categorizing the analysis into specific, detection, and systemic aspects, the 3x5 Why Analysis provides a more comprehensive understanding of the problem, enabling organizations to identify multiple contributing factors.
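The three iterations can be kept side by side in one structure. In this hypothetical Python sketch, each category holds its own chain of "Why?" answers; the chains are made up and shortened to three answers each for brevity.

```python
# One chain of "Why?" answers per category (made-up, shortened example).
three_by_five = {
    "specific": [
        "The service crashed under load",
        "The connection pool was exhausted",
        "The pool size was never tuned for peak traffic",
    ],
    "detection": [
        "Nobody noticed the pool filling up",
        "There was no alert on pool usage",
        "Pool metrics were not exported to monitoring",
    ],
    "systemic": [
        "Services can launch without capacity review",
        "The launch checklist has no capacity item",
        "Nobody owns the launch checklist",
    ],
}


def deepest_causes(analysis):
    """Return the final answer of each iteration, keyed by category."""
    return {category: answers[-1] for category, answers in analysis.items()}


for category, cause in deepest_causes(three_by_five).items():
    print(f"{category}: {cause}")
```

Reading the three results together shows why the problem occurred, why it was missed, and what allowed it.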
Blameless Postmortems
Blameless postmortems are a useful way to analyze incidents, no matter which technique you use.
The technique creates a safe environment where everyone can openly discuss the incident analysis without fear of blame.
In these discussions, the focus is on understanding the situation and the decisions made rather than pointing fingers.
This encourages transparency, and participants willingly share details about what happened, what they observed, and their expectations.
By combining blameless postmortems with other analysis techniques, one can find the real causes of incidents and come up with effective solutions.
Visualizing Cause-and-Effect: Ishikawa or Fishbone Diagrams
Ishikawa diagrams, also known as fishbone diagrams, visually depict the potential causes of a specific event or problem.
They help identify a variety of possible causes and structure brainstorming sessions by classifying ideas into relevant groups.
The diagram features:
- The incident as the fish's head
- Causes as the fish bones
- Major contributing factors as ribs branching off the backbone, with sub-branches representing root causes at various levels
For instance, suppose a company's sales dropped by 30%.
Step 1: Identify the problem
The problem is the 30% drop in sales. The company relies on three channels:
- Direct website traffic
- Sales through calls
- Sales through e-commerce website
Step 2: Identify the categories or causes
In our instance, the major categories are people, call prompts and technology.
(Note: Limit yourself to fewer than 10 categories to keep the diagram manageable.)
Step 3: Identify the actual causes
Here we dive deeper into each category and discover what caused the problem.
For the people category, the possible causes are new salespeople unable to close deals or salespeople struggling to follow up. Next, identify the sub-causes of each cause.
Continue this way for each category, then draw a diagram representing the incident, categories, causes, and sub-causes.
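Before drawing the diagram, the sales example can be held in a nested structure. This is a rough Python sketch under stated assumptions: the dictionary layout and the `render` helper are hypothetical, and the empty entries are left for the brainstorming session to fill in.

```python
# Head of the fish is the problem; each category is a rib.
# Empty dicts and lists are placeholders to fill in while brainstorming.
fishbone = {
    "problem": "Sales dropped by 30%",
    "categories": {
        "People": {
            "New salespeople unable to close deals": [],
            "Salespeople struggling to follow up": [],
        },
        "Call prompts": {},
        "Technology": {},
    },
}


def render(diagram):
    """Return an indented text outline of the fishbone diagram."""
    lines = [f"Problem (head): {diagram['problem']}"]
    for category, causes in diagram["categories"].items():
        lines.append(f"  Category (rib): {category}")
        for cause, sub_causes in causes.items():
            lines.append(f"    Cause: {cause}")
            for sub in sub_causes:
                lines.append(f"      Sub-cause: {sub}")
    return "\n".join(lines)


print(render(fishbone))
```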
Mastering Kepner-Tregoe: Enhancing Decision-making and Problem-solving
The Kepner-Tregoe technique is a structured approach to gathering, prioritizing, and evaluating information to solve problems effectively.
Here's how the Kepner-Tregoe approach works:
- Problem Analysis: First, you identify and describe the problem. You gather all the necessary information and understand how the problem affects things.
- Decision Analysis: In this step, you develop different solutions or options to solve the problem. Then, you compare these options by looking at their advantages, disadvantages, risks, and benefits. This helps you understand which solution is the best.
- Potential Problem Analysis: The next step is to plan ahead and look for any potential issues or dangers that might arise when you implement the preferred solution. This lets you take preventive measures before those problems occur.
- Solution Analysis: You select the best solution based on the previous steps. You put the chosen solution into action, monitor how it's working, and evaluate its effectiveness.
By following the Kepner-Tregoe method, you can approach problems in a logical and organized way.
This method promotes critical thinking, assists in evaluating possibilities, and directs you towards sound decisions.
Many organizations and professionals worldwide apply the technique to sharpen their problem-solving abilities and get better results.
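The comparison in the Decision Analysis step can be made more concrete with a weighted score. The sketch below is one possible way to do it, not the method's prescribed procedure: the criteria, weights, options, and scores are all made-up assumptions.

```python
# Hypothetical criteria weights (higher means more important).
weights = {"benefit": 5, "risk": 3, "cost": 2}

# Hypothetical scores from 1 (poor) to 10 (good) for each option.
# For "risk" and "cost", a higher score means lower risk or lower cost.
options = {
    "Restart the service": {"benefit": 4, "risk": 8, "cost": 9},
    "Roll back the release": {"benefit": 8, "risk": 6, "cost": 6},
    "Patch forward": {"benefit": 9, "risk": 2, "cost": 4},
}


def weighted_score(scores, criteria_weights):
    """Sum each criterion score multiplied by its weight."""
    return sum(criteria_weights[c] * scores[c] for c in criteria_weights)


def rank(candidates, criteria_weights):
    """Return option names ordered from best to worst weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], criteria_weights),
        reverse=True,
    )


for name in rank(options, weights):
    print(name, weighted_score(options[name], weights))
```

The ranking is only as good as the weights and scores the team agrees on, so it supports the discussion rather than replacing it.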
Causal Mapping: Navigating Relationships Strategically
Causal mapping is a technique to determine why things go wrong and how different factors contribute to IT incidents.
It helps us understand the causes and effects of incidents in a clear and organized way.
Here's how causal mapping works:
- Incident Analysis: When an IT incident happens, we use causal mapping to investigate and understand what went wrong. In this step, analysts gather information about the incident, like what happened, when, and how it impacted the system.
- Identifying Factors: Causal mapping helps us identify contributing factors, such as technical issues, engineering mistakes, process gaps, or even external events. This step involves connecting the different factors and showing how they are interrelated. We analyze whether one thing led to another or whether multiple factors combined to cause the incident.
- Preventing Future Incidents: With this knowledge, we can take preventive measures to stop the same situations from recurring. We might upgrade systems, alter practices, or offer further training to address the root causes. The map helps us design targeted actions to fix the issues and prevent future incidents.
- Communicating and Documenting: Causal maps are visual representations that help us explain the incident and its causes to others. They make it easier for the SREs involved to understand the problem. Causal maps also act as a record of the incident, so we can refer back to them in the future and learn from past experiences.
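A causal map can be treated as a directed graph in which each edge points from a cause to its effect. The following Python sketch uses a made-up incident, and the `root_factors` helper is a hypothetical name for finding the factors that nothing else in the map causes.

```python
# Each key is a factor; its list holds the effects it leads to (made-up example).
causal_map = {
    "No renewal reminder": ["Expired TLS certificate"],
    "Expired TLS certificate": ["API requests failed"],
    "Alert routed to wrong team": ["Slow response"],
    "API requests failed": ["Customer-facing outage"],
    "Slow response": ["Customer-facing outage"],
}


def root_factors(graph):
    """Return the factors that are never the effect of another factor."""
    effects = {effect for targets in graph.values() for effect in targets}
    return sorted(cause for cause in graph if cause not in effects)


print(root_factors(causal_map))
# ['Alert routed to wrong team', 'No renewal reminder']
```

Here two independent factors combine to produce one outage, which is the kind of relationship a single chain of "Why?" questions can miss.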
Incident analysis aims to uncover what an incident means for an IT organization, along with the factors behind the issue and what we can learn from it, with a focus on improving service reliability. Instead of assigning blame for an incident, a positive approach such as a blameless postmortem focuses on understanding how the issue occurred.
By using effective techniques, we can detect issues earlier, prevent future occurrences, and reduce recovery time. Choose techniques that work for your team and prioritize learning and service reliability.