Incident Response - how great companies do it
An incident response plan (IRP) is a predefined course of action that tells IT teams how to respond to critical IT events efficiently. As modern applications grow in scale and complexity, more people work on more interdependent systems. Consequently, the question is not if a system will fail, but when, and how best to respond. Part of what makes systems more complex is that customers now demand 24x7 availability from businesses; reliability has become a competitive edge in an always-on world.
Critical events are high-pressure, high-stakes situations, and incident response processes help relieve some of that pressure. A well-rounded incident response plan ensures that members of incident response teams can organize all facets of the response (stakeholder communication, public response, etc.) while also focusing on resolving the issue. The primary goals of an efficient team should be to limit damage, reduce MTTR, and minimize lost revenue.
The four aspects of incident management to keep in mind while formulating an IRP are: detect, acknowledge, triage and learn.
Detection- Identifying the issue is the first step; communicating it effectively is the second. Teams that detect incidents early usually resolve them before they snowball into a high-level catastrophe. Not every obscure incident can be given individual attention, which is where log data comes in. A good log management application is essential to the tool kit of every production team. Log data can be analyzed after an incident to find out whether there were minor indicators before something crashed, and those indicators can then be tracked in the future. Log monitoring can also be integrated with a good alert management application to send immediate alerts when there are irregularities in the data.
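As a rough illustration of spotting irregularities in log data, the sketch below counts ERROR lines per minute and flags any minute whose count is well above the preceding baseline. The log line layout (`timestamp LEVEL message`), the function names and the default thresholds are all assumptions made for this example; a real log management or alerting product will have its own query and alert rules.

```python
from collections import Counter
from statistics import mean, pstdev


def error_counts_per_minute(log_lines):
    """Count ERROR lines per minute.

    Assumes each line looks like '2024-01-01T10:05:12 ERROR message'
    (a hypothetical layout chosen for this sketch).
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split(" ", 2)
        if len(parts) < 2:
            continue  # skip lines that do not match the assumed layout
        timestamp, level = parts[0], parts[1]
        minute = timestamp[:16]  # 'YYYY-MM-DDTHH:MM'
        # Record quiet minutes too, so they count toward the baseline.
        counts[minute] += 1 if level == "ERROR" else 0
    return dict(sorted(counts.items()))


def find_irregular_minutes(counts, baseline_size=5, sigmas=3.0, min_errors=3):
    """Return minutes whose error count exceeds the rolling baseline.

    A minute is flagged when its count is above both `min_errors` and
    mean + `sigmas` * standard deviation of the previous `baseline_size`
    minutes. All three defaults are illustrative, not recommendations.
    """
    minutes = list(counts)
    flagged = []
    for i in range(baseline_size, len(minutes)):
        baseline = [counts[m] for m in minutes[i - baseline_size:i]]
        threshold = max(min_errors, mean(baseline) + sigmas * pstdev(baseline))
        if counts[minutes[i]] > threshold:
            flagged.append(minutes[i])
    return flagged
```

In practice the flagged minutes would be handed to an alert management tool rather than returned as a list, and the baseline would usually account for daily or weekly traffic patterns.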
Acknowledgment- Teams that detect issues quickly but cannot get the message across to the right people fail to thrive. According to a SysAid survey titled The Future of ITSM, 92% of people in IT departments feel uninvolved in their organization’s DevOps activities due to a lack of proper communication between teams. Notification chatter from different quarters of an organization during a major incident can make processes chaotic and slow. Breaking down team silos through effective communication is as relevant as ever, especially for incident management. Establishing communication protocols ensures that information is relayed through predetermined channels to principal stakeholders, business leaders and affected customers. This puts them at ease and lets core technology team members focus on fixing the issue at hand without hindrance. An alert management platform helps streamline the communication process and notify the right subject matter experts at the right time.
Resolution (triage)- Write, run and constantly update the game plan. Large companies like Netflix and Amazon regularly test the stability of their systems by deliberately injecting chaos. This helps teams stay fresh and improve their response times. Do not let your documentation get stale, and hold refresher sessions when new members join the team so that everyone is aware of the plan. Effective game plans should also have:
- Incident roles: The importance of this cannot be stressed enough. Predefined incident roles such as incident commander and communication lead ensure that no contradictory decisions are made during the storm, which bolsters stakeholder trust and improves team cohesion.
- Well-defined escalation policies: Depending on the system being monitored, automated escalation policies need to be defined for notifying one or several people. Ideally these policies are customizable and can easily be configured with on-call schedules so that the right person is reached as quickly as possible.
- Task templates: Templates ensure that every member of the team, even in the absence of more experienced senior guidance, has an idea of how to get started.
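To make the idea of an automated escalation policy more concrete, here is a minimal sketch of one expressed as data, with a helper that works out who should have been notified after an alert has gone unacknowledged for a given time. The class names, the service name, the role names and the delays are all hypothetical; alert management platforms each have their own configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without acknowledgment before this step fires
    notify: tuple       # roles or people to notify at this step


@dataclass(frozen=True)
class EscalationPolicy:
    service: str
    steps: tuple

    def targets_due(self, minutes_unacknowledged):
        """Return everyone who should have been notified by now, in order."""
        due = []
        for step in sorted(self.steps, key=lambda s: s.after_minutes):
            if step.after_minutes <= minutes_unacknowledged:
                for target in step.notify:
                    if target not in due:
                        due.append(target)
        return due


# Hypothetical policy for an imaginary service; the values are illustrative.
EXAMPLE_POLICY = EscalationPolicy(
    service="checkout-api",
    steps=(
        EscalationStep(0, ("primary-on-call",)),
        EscalationStep(15, ("secondary-on-call",)),
        EscalationStep(30, ("incident-commander", "engineering-manager")),
    ),
)
```

Calling `EXAMPLE_POLICY.targets_due(20)` would include the primary and secondary on-call but not yet the incident commander. Keeping the policy as plain data makes it easy to review and to tie to on-call schedules.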
Learn- Blameless post-mortems help teams understand the root cause of major incidents without finding fault with individual members. Blamelessness encourages people to come forward and talk about their mistakes without fear of repercussions. This fosters an internal culture of trust and learning.
Google’s famed SRE book lists the types of triggers that warrant a detailed study:
- User-visible downtime or degradation beyond a certain threshold
- Data loss of any kind
- On-call engineer intervention (release rollback, rerouting of traffic, etc.)
- A resolution time above some threshold
- A monitoring failure (which usually implies manual incident discovery)
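The triggers above can be turned into a simple checklist that runs when an incident is closed. The sketch below is one possible encoding; the `Incident` fields and the default thresholds are assumptions for illustration, since the actual threshold values are left to each organization.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # All field names are hypothetical, chosen for this sketch.
    user_visible_downtime_minutes: float
    data_loss: bool
    on_call_intervention: bool
    resolution_minutes: float
    detected_by_monitoring: bool


def postmortem_triggers(incident, downtime_threshold=5.0, resolution_threshold=60.0):
    """Return the reasons this incident warrants a post-mortem.

    An empty list means none of the triggers fired. Both thresholds are
    illustrative defaults, not recommended values.
    """
    reasons = []
    if incident.user_visible_downtime_minutes > downtime_threshold:
        reasons.append("user-visible downtime beyond threshold")
    if incident.data_loss:
        reasons.append("data loss")
    if incident.on_call_intervention:
        reasons.append("on-call engineer intervention")
    if incident.resolution_minutes > resolution_threshold:
        reasons.append("resolution time above threshold")
    if not incident.detected_by_monitoring:
        reasons.append("monitoring failure (manual discovery)")
    return reasons
```

Returning the list of reasons, rather than a bare yes or no, gives the post-mortem a starting point for discussion.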
A pioneer of blameless post-mortems, Etsy encourages organizations to document everything. Discussion sessions during post-mortems bring out new perspectives and ideas on how best to implement changes that strengthen stability. To this end, Etsy created a tool called Morgue to help production teams with the documentation process. It allows teams to add information about the event along with graphs, timelines, log data and communication screenshots.
Incident management runbooks are unique to every organization; influencing factors include the tools in use, the number of teams and the type of business. No one-size-fits-all plan exists, so organizations need to be open to change and willing to constantly learn and upgrade. Document everything and unify data from all services for the broadest possible overview.