Defining your Sev-1s

One of the first things to figure out when your team is formulating its incident management process is how to describe, in words, what a Sev-1 (your highest incident priority) looks like. “Website doesn’t work” is certainly not enough. “Website is up but a key resource (e.g., a CSS file) is missing, rendering the website unusable” is still not enough. “A single page on the website is 404’ing” is not a major incident, but it could be a minor one.

Sev-1s are incidents where the level of impact is such that the company will go out of business if you don’t fix the issue soon. To make that concrete, start by defining the company’s commitments. Every view of the website will complete in X ms. It will return a status code in the 2XX range (or <500 or whatever). Describe whatever commitments you want to make to the customers of that website, and then measure those. Then count the number of times any one of those commitments is violated. That count of violations drives how severe an incident is. (In other words: define SLOs, measure SLIs, and examine the ratio of requests that violate one or more SLOs out of all requests received.)
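As a rough illustration of that ratio, here is a minimal Python sketch. The `Request` record, the 500 ms latency budget, and the "anything below 500 is fine" status rule are all assumptions chosen for the example; substitute whatever commitments you actually make.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical shape, for illustration only)."""
    latency_ms: float
    status_code: int


# Assumed commitments (SLOs); pick your own numbers.
MAX_LATENCY_MS = 500.0   # "every view completes in X ms"
FIRST_BAD_STATUS = 500   # "status code is < 500"


def violates_slo(req: Request) -> bool:
    """True if the request breaks at least one commitment."""
    too_slow = req.latency_ms > MAX_LATENCY_MS
    server_error = req.status_code >= FIRST_BAD_STATUS
    return too_slow or server_error


def violation_ratio(requests: Sequence[Request]) -> float:
    """Share of requests that violate one or more SLOs (0.0 if no traffic)."""
    if not requests:
        return 0.0
    violations = sum(1 for req in requests if violates_slo(req))
    return violations / len(requests)


sample = [
    Request(latency_ms=120, status_code=200),  # fine
    Request(latency_ms=900, status_code=200),  # too slow
    Request(latency_ms=80, status_code=503),   # server error
    Request(latency_ms=100, status_code=404),  # fine under the "< 500" rule
]
print(violation_ratio(sample))
```

A request that breaks two commitments at once still counts as a single violating request here, which matches the "one or more SLOs" wording above.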

Depending on the site, it might be necessary to slice things up more, measuring all of those things by route or service or whatever, but I think the right approach is always to define the commitments explicitly first, with actual numbers. There can always be other key metrics beyond the basics. For example, if the site sells things, counting the value of failed transactions might be important too, so you can understand how much a problem is costing you in actual dollars. It really depends on thinking about the service being provided first and working back from there to the commitments you want to make to customers. But again, use actual numbers that measure the customer experience. From there, you can take the most important metrics, aggregate the number of violations, and formulate your severities/priorities for different ranges.
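One way to picture the last step is the sketch below, which slices violations by route, maps each ratio onto a severity range, and totals the value of failed transactions. The threshold values, the severity labels below Sev-1, and the input shapes are hypothetical; the right ranges depend entirely on your own commitments.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Assumed ranges: (minimum violation ratio, severity), checked highest first.
SEVERITY_THRESHOLDS = [
    (0.50, "Sev-1"),
    (0.10, "Sev-2"),
    (0.01, "Sev-3"),
]


def severity_for(ratio: float) -> Optional[str]:
    """Map a violation ratio to a severity, or None if below every range."""
    for minimum, label in SEVERITY_THRESHOLDS:
        if ratio >= minimum:
            return label
    return None


def severity_by_route(
    events: Iterable[Tuple[str, bool]],
) -> Dict[str, Optional[str]]:
    """events are (route, violated) pairs; returns a severity per route."""
    totals: Dict[str, int] = defaultdict(int)
    violations: Dict[str, int] = defaultdict(int)
    for route, violated in events:
        totals[route] += 1
        if violated:
            violations[route] += 1
    return {
        route: severity_for(violations[route] / totals[route])
        for route in totals
    }


def failed_transaction_value(
    transactions: Iterable[Tuple[int, bool]],
) -> int:
    """transactions are (amount_in_cents, succeeded) pairs."""
    return sum(amount for amount, succeeded in transactions if not succeeded)


events = [
    ("/checkout", True), ("/checkout", True), ("/checkout", False),
    ("/home", False), ("/home", False), ("/home", False), ("/home", False),
]
print(severity_by_route(events))

orders = [(1999, True), (500, False), (2500, False)]
print(failed_transaction_value(orders))
```

Amounts are kept in integer cents to avoid floating-point rounding when adding up money. In practice you would probably also require a minimum number of requests per route before assigning a severity, so that a single failed request on a rarely used page does not register as a Sev-1.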

Vishwa Krishnakumar
