Building a world-class service is as much about maintaining software as it is about developing it. On-call engineers are typically responsible for ensuring the reliability and availability of your service, i.e., your reputation and source of revenue. Robust on-call schedules ensure that the right people are ready to respond during times of crisis.
Many organizations continue to depend on on-call schedules and incident response processes that cause stress, anxiety, or panic among employees. In the long run, they lose great employees to ineffective on-call strategies, which ultimately drives up turnover. Protecting your employees from unnecessary drama and noise has to be a priority, alongside supporting growth and maintainability.
Some common organizational mistakes are:
- Lack of flexibility: Incidents happen when you least expect them, so there will be times when the schedule has to change to accommodate employees who are unavailable due to personal emergencies. A flexible schedule keeps employees happier and makes the team more adaptable to different kinds of situations.
- Knowledge silos: Ownership of different aspects of a service is divided across teams with limited internal communication. When on-call teams lack subject matter expertise, engineers end up tackling issues they have no experience with, a sure-fire recipe for disaster.
- Leaning on operations: This mistake is common in organizations that scale too quickly without a plan for sustaining that growth. Relying on a single team to maintain the infrastructure leads to pager fatigue and burnout.
- Work-life imbalance: Employees are not encouraged to maintain a healthy work-life balance, which leaves them stressed and unhappy. Balanced employees do more productive work and make faster decisions in high-intensity situations.
An effective on-call schedule values employees' time and happiness while staying focused on functionality. A team with an effective on-call strategy views every event as a learning experience. Here are some best practices:
- Have primary and secondary responders, followed by higher-level escalation points such as managers. Define escalation policies for contingencies like family emergencies or a responder sleeping through their alerts. Everyone is human, and individuals are going to miss notifications on occasion.
- Write playbooks that future responders can study to understand an incident's impact and the steps taken to resolve it. Accuracy is key when writing these. Hold a 30-minute review meeting every week to find ways to improve your incident response plans.
- Use the right tools. Adopt automated monitoring tools that unify the notifications coming from all of your alerting tools. This helps your engineers cut down on noise and focus on making quick decisions.
- Listen to your engineers. Smart employees tend to be strong-willed and opinionated, and because they are the ones who work most closely with the infrastructure, their insights are valuable.
- Encourage employees to maintain a healthy work-life balance; you are responsible for their well-being. Ultimately, an employee's performance depends on their happiness. Spread awareness about the importance of mental health days and of channeling energy into hobbies.
- Embrace DevOps as a philosophy rather than just attaching DevOps titles to traditional Ops. DevOps encourages shared responsibility and knowledge within organizations. Internal collaboration reduces friction within teams, improves communication and visibility, and reduces overall stress. Not everyone is an expert on DevOps ideology, and not everyone will be well versed in every piece of software in your system. Even so, support each other as a team, help channel personal growth, overcome challenges, and ultimately build great tech.
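
The escalation idea from the list above (primary, then secondary, then a manager) can be made concrete with a small sketch. This is only an illustration under stated assumptions: the `EscalationLevel` type, the `paged_responders` function, the role names, and the timeout values are all hypothetical and are not taken from any particular paging tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    responder: str        # hypothetical role or person name
    ack_timeout_min: int  # minutes to wait for an acknowledgement before escalating


# Illustrative policy only; the timeouts are assumptions, not recommendations.
POLICY = [
    EscalationLevel("primary", 5),
    EscalationLevel("secondary", 10),
    EscalationLevel("manager", 15),
]


def paged_responders(policy, minutes_unacknowledged, unavailable=frozenset()):
    """Return everyone who should have been paged so far for an unacknowledged alert.

    Responders listed in `unavailable` (for example, someone handling a family
    emergency) are skipped immediately instead of waiting out their timeout.
    """
    paged = []
    deadline = 0  # minutes after which the next level gets paged
    for level in policy:
        if level.responder in unavailable:
            continue
        if minutes_unacknowledged < deadline:
            break
        paged.append(level.responder)
        deadline += level.ack_timeout_min
    return paged
```

With this sketch, an alert that has gone unacknowledged for five minutes reaches both the primary and the secondary responder, and marking the primary as unavailable routes the alert straight to the secondary.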
Further reading:
- [On-Calliday: A guide to unsucking your on-call experience](https://about.gitlab.com/blog/2017/06/14on-calliday-unsucking-your-on-call-experience/)
- [Being an On-Call Engineer: A Google SRE Perspective](https://research.google/pubs/pub44813/)
- [Rob Ewaschuk’s Philosophy on Alerting](https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit)