“Being on-call is a critical duty that many operations and engineering teams must undertake to keep their services reliable and available. However, there are several pitfalls in the organization of on-call rotations and responsibilities that can lead to serious consequences for the services and the teams if not avoided.
This chapter describes the primary tenets of the approach to on-call that Google’s Site Reliability Engineers (SREs) have developed over the years, and explains how that approach has led to reliable services and sustainable workload over time.” - Google SRE book
On-call schedules are a necessary evil when maintaining the reliability of your services. Being on-call can be a daunting experience; even when things are going well, it can leave you feeling drained and exhausted. The culture of DevOps and SRE fosters rapid deployment and continuous integration, which creates a need for quick remediation of incidents when they occur. Organizational initiatives such as flexible working hours and remote-friendly offices help make the on-call experience more bearable. There are also several measures that you can take yourself to make the experience better:
Optimize your environment
Make sure your home is a well-organized on-call environment, with all of your devices in easily accessible places. Even if you are woken from deep sleep, your brain shouldn’t have to waste time figuring out where everything is. This will improve your response time when notifications interrupt your sleep.
Clean up your notifications
Cleaning up your notifications is an effective tactic for on-call, since notifications will be coming at you from one or more sources. Set up customized notification rules based on the severity of incidents and your SLAs.
Urgent notifications from family members can still be let through by the rules you’ve set, which reduces chaos during stressful moments.
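As a rough illustration of such rules, the sketch below routes each notification to a channel based on its severity and lets personal messages through only from an allowlist. The severity levels, channel names, sender names and the `route` function are all hypothetical and are not tied to any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Notification:
    source: str    # "monitoring" or "personal" (illustrative categories)
    severity: str  # "critical", "high" or "low" (illustrative levels)
    sender: str


# Hypothetical allowlist of personal contacts that may interrupt you.
ALLOWED_PERSONAL_SENDERS = {"partner", "parent"}

# Hypothetical mapping from incident severity to how loudly you are notified.
CHANNEL_BY_SEVERITY = {
    "critical": "phone_call",
    "high": "sms",
    "low": "email_digest",
}


def route(n: Notification) -> Optional[str]:
    """Return the channel to use, or None to silence the notification."""
    if n.source == "personal":
        # Only allowlisted personal senders get through during a shift.
        return "phone_call" if n.sender in ALLOWED_PERSONAL_SENDERS else None
    # Unknown severities fall back to the quietest channel.
    return CHANNEL_BY_SEVERITY.get(n.severity, "email_digest")
```

With rules like these, a critical monitoring alert rings your phone, a low-severity alert waits in a digest, and a message from a non-allowlisted personal contact is silenced until the shift ends.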
Personal care
On-call shifts can stress you out, even when everything is going according to plan. Prioritize your mental health: take mini-breaks whenever you can, stretch, grab a snack or get some fresh air. Night shifts can throw a spanner into your sleep schedule, but regular sleep is vital. Without proper sleep your brain gradually slows down, which can cause you to make more mistakes and ultimately hurts your response efficiency.
Disrupted routines can also lead to stress eating, which harms your physical well-being in the long run. Make sure to get enough rest, sunlight and exercise to help you find your zen during on-call schedules.
Have your own incident management procedure
In addition to your organization’s incident management plan, have your own plan for contingencies. An incident management plan is only as strong as the sum of all its parts.
Keep a personal checklist or questionnaire to tick off as you go through the motions. After an incident, focus on the end-user experience and work backward: quickly identify what went wrong, and note the time to detection and time to resolution for future post-mortem meetings. A big part of making the on-call experience better is building more resilient infrastructure by learning from our experiences.
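The two timings mentioned above are simple to compute once you have noted three timestamps. The sketch below is one possible way to do it; the function name is hypothetical, and it assumes that both durations are measured from the moment the incident started, which is only one of several conventions in use.

```python
from datetime import datetime


def incident_durations(started: datetime, detected: datetime, resolved: datetime) -> dict:
    """Return time to detection and time to resolution, in minutes.

    Assumption: both durations are measured from the incident start.
    """
    return {
        "time_to_detection_min": (detected - started).total_seconds() / 60,
        "time_to_resolution_min": (resolved - started).total_seconds() / 60,
    }


# Example with made-up timestamps for a single incident.
timings = incident_durations(
    started=datetime(2024, 1, 10, 2, 0),
    detected=datetime(2024, 1, 10, 2, 12),
    resolved=datetime(2024, 1, 10, 3, 5),
)
print(timings)  # {'time_to_detection_min': 12.0, 'time_to_resolution_min': 65.0}
```

Recording these numbers in your personal checklist right after the incident saves you from reconstructing the timeline from memory before the post-mortem.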
Set up the right tools
Using the right tools goes a long way toward keeping your focus on the issues that need to be prioritized. Zenduty is a state-of-the-art incident management platform that gives you a unified view of all the notifications from your infrastructure, so that every member of the team is aware of an incident no matter where they are.
The platform also provides incident analytics to help you prepare for post-mortem meetings and refine your incident management plans. With Zenduty you can reduce your team’s MTTR while helping them focus on the alerts that matter.
Custom escalation routes can be defined before an incident based on work schedules, along with predefined incident roles, so that everyone knows who is working on what and chaos is avoided.
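To make the idea of an escalation route concrete, here is a generic sketch of a time-based policy. It is not Zenduty's API or configuration format; the `EscalationStep` type, the role names and the delays are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the incident has gone unacknowledged
    notify: str         # role to notify at this step


# Hypothetical policy, defined before any incident happens.
POLICY = (
    EscalationStep(0, "primary on-call"),
    EscalationStep(10, "secondary on-call"),
    EscalationStep(25, "engineering manager"),
)


def who_to_notify(minutes_unacknowledged: int,
                  policy: Sequence[EscalationStep] = POLICY) -> List[str]:
    """Return every role that should have been notified by now, in order."""
    ordered = sorted(policy, key=lambda step: step.after_minutes)
    return [s.notify for s in ordered if s.after_minutes <= minutes_unacknowledged]


print(who_to_notify(12))  # ['primary on-call', 'secondary on-call']
```

Because the policy is plain data, it can be reviewed and adjusted whenever work schedules change, without touching the logic that applies it.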
Zenduty supports 100+ services, including Sentry, Freshdesk, AWS CloudWatch, New Relic and Zabbix.
Ultimately, it all comes down to your physical and mental well-being; prioritizing it will make on-call much easier to manage. A perfect system is not one that never fails, but one that provides a sustainable environment for response and restoration.