Sprint planning - How to prioritize urgent production issues?

Engineers on small teams wear a lot of hats while working on a product. Once a sprint is planned and under way, it becomes hard to prioritize and deal with issues that arise in production. This not only makes sprints harder to plan but also reduces accountability.

How do you tackle this problem and make sure your engineering team does not burn out at the same time?

Let's list a few characteristics of such an engineering team that are quite common across the board.

Team size of 5-10 people
Frequent on-call rotations
About 2 urgent/blocker issues on a typical day
When things go south, 6-8 issues per day
Sev-01 alerts that arrive as customer feedback to the support team

Some reasons why these urgent/blocker issues arise across companies:

Legacy software with fewer tests
Buggy mobile application pending new app release
Lack of admin tools

Off the bat, the following suggestions are usually brought to the table.

Keep things as they are: assign multiple devs to daily shifts, and reduce how often each dev rotates as hiring scales
One dev for the whole day - work on admin tools in the meantime
One dev for a whole week - work on admin tools in the meantime

Every team is different and prefers to work differently, so it is important to offer a range of solutions to pick from.

The SuperMan Rotation

Define a new role within the team, which we like to call “Superman”. You can choose a support hero name of your liking. Assign Superman a long on-call shift, such as a week or the full duration of a sprint.

The goal here is to minimize disruption to both the support team and the development team: the support engineer has somebody to lean on, and the rest of the development team can work free from distractions.

Additionally, explore the idea of having Superman work on admin tools when not overburdened by the support role. If sinking 20% of your dev resources into that seems like too much, Superman can work off the regular backlog instead, with the understanding that their delivery may be less consistent.

Always consult your engineering team, find out what they are comfortable with, and decide on the duration after that.

The key here is to set realistic expectations for your team. The team is likely diverse in the amount of experience each engineer has. For the time being, the newer or less experienced engineers may not be able to solve every problem independently.

Superman's job entails triage, asking for more information when needed, and ensuring that support teams know the engineering team has acknowledged the issue/request. Set SLAs based on the history of response or acknowledgment times. In all likelihood, smaller teams don’t have a resolution time SLA, whether because of outstanding issues, bugs, or some other reason, so try to set one to the best of your ability.

Conduct a meeting every two weeks to discuss any outstanding issues and give the support team a chance to talk about which bugs are causing the most customer pain, so you can prioritize what to fix first. Recognize patterns in the bugs and, over time, prioritize the most critical areas.

Draw a clear distinction between a bug and an incident. This will help you invest more time in product fixes and hardening!

Bug: A defect found by a tester and accepted by a developer is called a bug. The process of rectifying the bugs in a system is called bug fixing.
Incident: An incident is an unplanned interruption. When the operational status of an activity turns from working to failed and causes the system to behave in an unplanned manner, it is an incident. A single problem can cause more than one incident, each of which should be resolved as soon as possible.

Adhoc Tasks

Ad hoc tasks are typically defined as work items that business users (who interact with dashboard spaces) can create extemporaneously that are not initially part of a modeled process flow. Ad hoc tasks can be within the context of a process (subtask), outside the context of a process (follow-on task), or just a one-time task. When an ad-hoc task is a subtask to a modeled task, it must be completed before the parent task can be considered “complete”. The creation of ad hoc tasks indicates that the related process flows should be evaluated for efficiency.

Add an ad-hoc tasks ticket to each sprint and story-point it based on the average effort/time spent on such work over the last 3 sprints, adjusted for delivery history, holidays, or code freezes.
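
As a rough illustration of sizing that ad-hoc allocation, the sketch below averages the points spent on ad-hoc work over the last three sprints and scales the result. The function name, the sample history, and the single `adjustment` multiplier (standing in for delivery history, holidays, or a code freeze) are assumptions made for the example, not part of any tool.

```python
from statistics import mean


def adhoc_buffer_points(past_adhoc_points, adjustment=1.0):
    """Suggest story points to reserve for ad-hoc work in the next sprint.

    past_adhoc_points: points spent on ad-hoc work per sprint, oldest first.
    adjustment: hypothetical multiplier, e.g. 0.8 for a sprint shortened by
                holidays, or 1.2 ahead of a risky release.
    """
    if not past_adhoc_points:
        raise ValueError("need at least one past sprint")
    if adjustment <= 0:
        raise ValueError("adjustment must be positive")
    baseline = mean(past_adhoc_points[-3:])  # only the last 3 sprints count
    return round(baseline * adjustment)


if __name__ == "__main__":
    # Made-up history: ad-hoc points spent in the four most recent sprints.
    history = [5, 8, 6, 7]
    print(adhoc_buffer_points(history))
    print(adhoc_buffer_points(history, adjustment=0.8))
```

Rounding to whole points keeps the estimate in the same units the team already uses; a team that tracks hours could drop the rounding.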

Any urgent/blocker issue coming in from the chat then falls into one of two buckets. A task that takes less than 2 hours is added to that ad-hoc ticket. A task that takes more than 2 hours requires either a bug to be opened, which can be handled in the current sprint, or a story, which has to go through prioritization and lands in the next sprint at best.
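
The routing rule above can be sketched as a small function. This is a minimal sketch under stated assumptions: the `UrgentIssue` type, the `is_defect` flag used to tell a bug from a story, and the returned labels are all hypothetical, and a task estimated at exactly 2 hours is treated as the larger kind because the text does not say which side it falls on.

```python
from dataclasses import dataclass

ADHOC_LIMIT_HOURS = 2  # threshold taken from the text


@dataclass
class UrgentIssue:
    title: str
    estimate_hours: float
    is_defect: bool  # assumption: True for broken behavior, False for new work


def route(issue):
    """Return where an urgent issue should be tracked."""
    if issue.estimate_hours < ADHOC_LIMIT_HOURS:
        return "adhoc-ticket"        # logged against the sprint's ad-hoc ticket
    if issue.is_defect:
        return "bug-current-sprint"  # a bug is opened and handled in the sprint
    return "story-next-sprint"       # a story goes through prioritization


if __name__ == "__main__":
    for issue in [
        UrgentIssue("Reset a stuck order", 0.5, True),
        UrgentIssue("Checkout fails on old app version", 6, True),
        UrgentIssue("Bulk refund screen for support", 16, False),
    ]:
        print(issue.title, "->", route(issue))
```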

Review ad-hoc tasks plus all the bugs after every 2 sprints and put them in a prioritization matrix.

Then lay everything out in front of the group and create clusters. Where tickets are similar, gradually replace them with actions that reduce or avoid recurrence of those issues. Each action gets the same priority as the pain it addresses, along with an estimate of its complexity. At the end of the process, add 3 high-impact, low-effort actions to the tech queue.
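
One way to make the final selection step concrete is shown below. It is only a sketch: the 1-5 scoring scale, the `min_impact` and `max_effort` cut-offs, and the sample actions are assumptions for illustration, and a team would substitute whatever scores its own prioritization matrix produces.

```python
def pick_quick_wins(actions, limit=3, min_impact=4, max_effort=2):
    """Pick high-impact, low-effort actions for the tech queue.

    actions: iterable of (name, impact, effort) tuples scored 1-5,
             where a higher impact and a lower effort are better.
    """
    candidates = [a for a in actions if a[1] >= min_impact and a[2] <= max_effort]
    # Highest impact first; among equal impact, lowest effort first.
    candidates.sort(key=lambda a: (-a[1], a[2]))
    return [name for name, _impact, _effort in candidates[:limit]]


if __name__ == "__main__":
    # Made-up actions derived from clustered tickets.
    actions = [
        ("Add retry to payment webhook", 5, 2),
        ("Rewrite legacy billing module", 5, 5),
        ("Admin tool for refunds", 4, 2),
        ("Add tests for login flow", 4, 1),
        ("Alert on queue backlog", 5, 1),
    ]
    print(pick_quick_wins(actions))
```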

In practice, the sprint gets divided as follows: 30% tech queue, 60% business priorities, and 10% ad-hoc tasks.

Note: Bugs eat into both the tech queue and business priorities.
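
To see what that split means for a given sprint, the arithmetic can be worked through in a few lines. The 40-point capacity and the bucket names are assumptions for illustration; only the 30/60/10 percentages come from the text.

```python
# Percentages from the text; the bucket names are illustrative.
SPRINT_SPLIT = {
    "tech_queue": 0.30,
    "business_priorities": 0.60,
    "adhoc_tasks": 0.10,
}


def split_capacity(total_points):
    """Divide a sprint's story-point capacity across the three buckets."""
    return {
        bucket: round(total_points * share, 1)
        for bucket, share in SPRINT_SPLIT.items()
    }


if __name__ == "__main__":
    # Hypothetical team capacity of 40 story points per sprint.
    for bucket, points in split_capacity(40).items():
        print(f"{bucket}: {points}")
```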

Prioritize what you want to achieve. At an early stage, it is more important to get your features up and running, so focus on that and don’t shy away from taking a hit on consistency.

The cost approach

You must be feeling a little overwhelmed now. Don’t worry, you are not alone. Getting things under control is a long process, but a necessary one. How do you go from putting out multiple fires a day to just a few a month?

Being able to show measurable impact to the business is important in helping you prioritize and deal with urgent issues raised by the customer.

For example, assume that each call into customer support about issue “XYZ” costs $5. Measure how many calls come in for that issue and how much time an engineer needs to resolve it, and you get a real dollar impact on the business.
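
The calculation can be sketched as follows. Only the $5 per call comes from the example above; the call volume, engineer hours, and hourly rate are made-up inputs, and the function name is hypothetical.

```python
def issue_cost(calls, cost_per_call, engineer_hours, hourly_rate):
    """Estimate the dollar impact of one recurring issue over a period."""
    support_cost = calls * cost_per_call            # cost of handling the calls
    engineering_cost = engineer_hours * hourly_rate  # cost of engineer time spent
    return {
        "support": support_cost,
        "engineering": engineering_cost,
        "total": support_cost + engineering_cost,
    }


if __name__ == "__main__":
    # Assumed inputs: 120 calls in a month at $5 each, plus 6 engineer
    # hours at a hypothetical $60/hour spent resolving those cases.
    print(issue_cost(calls=120, cost_per_call=5, engineer_hours=6, hourly_rate=60))
```

Running the same calculation per issue gives a ranked list of what each recurring problem costs, which is easier to defend in prioritization than a raw ticket count.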

Conduct a meeting twice a month with the customer support team to get an idea of the trends in complaints. Use this intelligence to push fixes for prioritized issues before they turn into production issues. This is easier said than done, as some product managers just want to focus on their feature delivery while others are more collaborative and willing to work on resolving these issues.

What do bigger teams do differently that you can adopt?

Incident alerting tools like Zenduty flag the issue to the on-call engineer before it reaches the customer support team, via Slack, MS Teams, email, phone call, etc., through a user-defined escalation policy. The on-call engineer then acknowledges the incident, usually within a minute, and a triage or investigation follows.
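
The idea behind an escalation policy can be illustrated with a generic sketch. This is not Zenduty's configuration format or API: the delays, the target names, and the `who_to_notify` helper are assumptions chosen to show how an unacknowledged incident reaches more people over time.

```python
# Hypothetical policy: (minutes without acknowledgment, who gets alerted).
ESCALATION_POLICY = [
    (0, "on-call engineer"),
    (5, "secondary on-call"),
    (15, "engineering manager"),
]


def who_to_notify(minutes_unacknowledged, policy=ESCALATION_POLICY):
    """Return everyone who should have been alerted by this point."""
    return [
        target
        for delay, target in policy
        if minutes_unacknowledged >= delay
    ]


if __name__ == "__main__":
    for minutes in (0, 7, 20):
        print(minutes, "min:", who_to_notify(minutes))
```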

During the triage, if a familiar incident comes up, the on-call engineer begins the tasks required to resolve it. If it is something they aren’t familiar with, they assign an engineer with the relevant domain knowledge, who is then alerted by the Zenduty platform to help out with the incident. More often than not, this process shields the rest of the team from getting involved.

An end-to-end incident management tool like Zenduty helps reduce the Mean Time To Resolve (MTTR) and Mean Time To Acknowledge (MTTA) significantly.
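
For teams that want to track these two metrics themselves, they can be computed from incident timestamps. The sketch below assumes each incident is a plain dictionary with `triggered`, `acknowledged`, and `resolved` datetimes, which is an illustrative shape rather than any tool's export format, and it measures MTTR from the trigger time; some teams measure it from acknowledgment instead.

```python
from datetime import datetime, timedelta


def mean_duration(durations):
    """Average a sequence of timedelta values."""
    durations = list(durations)
    return sum(durations, timedelta()) / len(durations)


def mtta_mttr(incidents):
    """Return (MTTA, MTTR) for a non-empty list of incident records."""
    mtta = mean_duration(i["acknowledged"] - i["triggered"] for i in incidents)
    mttr = mean_duration(i["resolved"] - i["triggered"] for i in incidents)
    return mtta, mttr


if __name__ == "__main__":
    # Made-up incidents for illustration.
    start = datetime(2024, 1, 1, 9, 0)
    incidents = [
        {"triggered": start,
         "acknowledged": start + timedelta(minutes=1),
         "resolved": start + timedelta(minutes=30)},
        {"triggered": start,
         "acknowledged": start + timedelta(minutes=3),
         "resolved": start + timedelta(minutes=50)},
    ]
    mtta, mttr = mtta_mttr(incidents)
    print("MTTA:", mtta)
    print("MTTR:", mttr)
```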

After this, a Root Cause Analysis (RCA) meeting is conducted to understand the problem more deeply. The team tries to figure out how it happened, what the fix was, whether it is a bigger problem than it appeared on the surface, whether more work is needed, and how to prevent such issues in the future, e.g., by adding test coverage or communicating changes to customer support.

The outcome of the meeting is then used to create task templates on the Zenduty platform: step-by-step guides on how to deal with similar incidents in the future. Finally, a postmortem is created. It summarizes the discussion points of the meeting, keeps a record of them, and shares information about the incident with the team and management.

Smaller teams have also started adopting Zenduty as a tool to help them improve their MTTA and MTTR. You can also try the platform for free and automate most of the workflows mentioned above.

Do give us a follow on LinkedIn if you found the blog helpful.

Aman
