This article explains why you should spend the next hour setting up alerting and on-call management systems for your startup.
If you’re a startup that moves fast and breaks things, you need to know when things are broken before your customers find out so that you don’t lose out on those precious early customers and revenue. While your customers might be raving to their friends about your service, you do not want them also talking about the frequent glitches in your service.
Remember - it’s okay if your service breaks in your initial phase. When you allow yourself to build imperfect systems, you start to work differently - faster, more ambitiously. Your users understand that. Heck, everybody knows you’re working long hours, coding features, talking to your customers, managing ops, and constantly thinking of engagement and revenue, while at the same time dealing with a shortage of engineers who can invest time in writing and running comprehensive test cases and building a reliable delivery pipeline. Having said that, you need to resolve incidents quickly. This means setting up alerts for whenever your services crash and finding the right people to fix the bugs, pronto.
Here are a few alerting tools you absolutely must set up right now to deal with unexpected service glitches, if you have not already done so (most of them are either free or light on your pocketbook):
- If you’re on the AWS stack, set up CloudWatch alarms to monitor your server resources. For GCP, use Stackdriver alerts. If your servers are low on memory and do not have autoscaling enabled, you need to be alerted pronto so that you can scale manually.
- If you’re shipping mobile apps, install an error monitoring system like Crashlytics (Freemium), Firebase (Freemium), Sentry (Freemium), Rollbar (Freemium) or Bugsnag (Freemium), so that you get app crash alerts immediately.
- A customer support and management service like Intercom (Paid), Freshdesk (Freemium) or Zendesk (Paid), so that users can easily raise descriptive support queries.
- An analytics tool like Google Analytics (Free), Mixpanel (Freemium) or Intercom for cohort analysis and mapping customer journeys.
- GitHub Issues (Paid), Bitbucket issues (Freemium), Jira (Paid) or Asana (Freemium) for bug tracking.
- Uptime monitoring tools like Pingdom (Paid) or UptimeRobot (Freemium).
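To make the idea behind the uptime monitors above concrete, here is a minimal sketch in Python of what such a check does conceptually. It is not how Pingdom or UptimeRobot are implemented; the function names `is_up` and `should_alert`, the 5-second timeout, and the three-consecutive-failures rule are all illustrative assumptions.

```python
import urllib.error
import urllib.request


def is_up(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx/3xx status within `timeout` seconds.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # DNS failures, refused connections, timeouts and HTTP 4xx/5xx
        # responses (raised as HTTPError by urlopen) all count as "down".
        return False


def should_alert(recent_checks, failures_needed=3):
    """Alert only after `failures_needed` consecutive failed checks.

    `recent_checks` is a list of booleans, oldest first. Requiring several
    failures in a row (an assumed policy) avoids paging on a single blip.
    """
    if len(recent_checks) < failures_needed:
        return False
    return not any(recent_checks[-failures_needed:])
```

In practice you would run `is_up` on a schedule from a machine outside your own infrastructure and feed the results to `should_alert`; a hosted monitor does this for you from multiple locations.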
It is practical to assume that after you set up the above alerting systems, you will not have all of their dashboards open all the time. Most of these tools send alerts by email, which might take a while to spot. Also, downtimes can occur in the middle of the night. Who will resolve an incident outside of office hours? Here’s where an incident management system and an on-call rotation system come in.
Incident Management Fundamentals
What is incident management? What are on-call rotations? What are escalation policies?
An incident is an event that is not part of normal operations and that disrupts operational processes. An incident may involve the failure of a feature or service that should have been delivered, or some other type of operational failure.
An on-call rotation (or on-call schedule) is an agreement between team members stipulating that at least one person will be available at any given time (day or night) to receive incident alerts, and that it will be the responsibility of those on-call engineers to fix the incident. Typically, engineers in a team “rotate” every day or week; some rotate between day and night, some between weekdays and weekends. An engineer is said to be “on-call”, “on-duty” or “on-page” whenever it’s their turn in the rotation and they are the primary point of contact for the alerts (or primary on-call).
An escalation policy is an arrangement that says that if the primary on-call engineer does not respond to the incident within a certain period of time (5–10 minutes), then the system should alert another team member (or a backup/secondary on-call schedule) who may or may not be on primary call. This ensures redundancy in the system and allows you to “escalate” an incident whenever your primary line of defense is unavailable or unreachable. Escalations can span multiple levels (L1 - Primary on-call schedule, L2 - Backup on-call schedule and senior engineer, L3 - CXO).
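As a rough illustration of how a fixed rotation resolves to a person, the sketch below computes the primary on-call engineer for a given date. The function name `primary_on_call`, the engineer names, and the start date are hypothetical; real scheduling tools also handle overrides, time zones and split day/night shifts, which this ignores.

```python
from datetime import date


def primary_on_call(engineers, rotation_start, today, shift_days=7):
    """Return who is primary on-call on `today` in a fixed-length rotation.

    `engineers` is the ordered rotation list and `rotation_start` is the
    first day of the first engineer's shift. `shift_days=7` gives a weekly
    rotation; `shift_days=1` gives a daily one.
    """
    if today < rotation_start:
        raise ValueError("today is before the rotation started")
    shifts_elapsed = (today - rotation_start).days // shift_days
    return engineers[shifts_elapsed % len(engineers)]


# Hypothetical three-person team rotating weekly from Monday 2024-01-01.
team = ["asha", "ben", "chen"]
start = date(2024, 1, 1)
current = primary_on_call(team, start, date(2024, 1, 10))  # second week of the rotation
```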
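The escalation logic described above can be sketched as a small data structure plus one function. This is an illustration only, not any vendor's API: `EscalationLevel`, `levels_to_notify`, the responder names and the 10-minute waits (picked from the 5–10 minute range mentioned above) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    name: str
    responders: tuple
    wait_minutes: int  # time this level has to acknowledge before escalating


def levels_to_notify(policy, minutes_unacknowledged):
    """Return every level that should have been alerted by now.

    The first level is alerted immediately; each later level is alerted once
    all earlier levels have used up their acknowledgement windows.
    """
    notified = []
    deadline = 0  # minute at which the next level gets alerted
    for level in policy:
        if minutes_unacknowledged < deadline:
            break
        notified.append(level)
        deadline += level.wait_minutes
    return notified


# Hypothetical three-level policy mirroring the L1/L2/L3 example above.
policy = [
    EscalationLevel("L1", ("primary-on-call",), 10),
    EscalationLevel("L2", ("backup-on-call", "senior-engineer"), 10),
    EscalationLevel("L3", ("cxo",), 10),
]
```

With this policy, an incident left unacknowledged for 12 minutes has reached L1 and L2 but not yet L3. An incident management tool runs the equivalent of this check continuously and stops escalating as soon as someone acknowledges.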
What incident management and on-call tool should you use?
Zenduty is your best option for managing incidents and on-call scheduling. It serves as a single pane of glass for alerts from all your monitoring tools. Zenduty allows you to create flexible on-call schedules and escalation policies and dispatches downtime alerts across multiple channels like email, SMS, phone call, Slack and MS Teams, making sure you never miss an incident. Its context-rich alerts help you zero in on potential root causes within seconds. Advanced service analytics and context help you identify services that are frequently down, as well as the people on your team who can fix a service downtime quickly.
Zenduty is Freemium and will remain that way until you reach Series-A level growth. It currently has over 40 integrations (including most of the above) and is adding 5–10 new integrations every week. Sign up here!
How should startups do on-call?
Being on-call is tough, especially if you do not have a geographically diverse team but have geographically diverse end users. Downtimes can occur anytime. Nobody likes being woken up in the middle of the night, or not being able to go to the movies, to loud, noisy places, or to places with potentially low network connectivity. Therefore, setting up an on-call rotation should be a well-thought-out process. Here are a few tips:
- Founders should always be on-call, either at the primary level (1–2 weeks per month) or at the secondary level of escalation (2–3 weeks per month). Your startup is your baby, and you should face your customer issues head-on. This will not only help you get valuable face time with your customers but also help you understand the gaps in your growth story. As a technical co-founder, you will be able to measure your technical debt and plan your next releases more prudently. Not to mention, you will continue to inspire your team to move faster.
- If your engineering team has fewer than 10 people (including the CTO), weekly on-calls (one engineer per week as primary) will work well. Zenduty allows you to add schedule restrictions to weekly rotations and to have multiple schedule layers. For larger teams (with slightly more mature stacks), the weekly rotation would also serve well, except you might need multiple engineers on-call simultaneously, covering all services.
- Make sure every engineer in your team is on-call at least once every 2 months. This will give them valuable exposure to the entire stack, foster teamwork, and instill a sense of ownership in them.
- More often than not, the engineer on-call may not have the answer to a problem. It’s okay to wake up somebody in your team in the middle of the night if the issue is critical. Zenduty allows on-call engineers to rope in people with answers, dispatches information about the incident to them via email, SMS and phone calls, and brings them into the triaging process.
- Empathize with your on-call folks. It’s a tough spot to be in - alerts firing, users complaining, servers on fire. It can feel like juggling a gazillion balls. Empathizing with your on-call engineers can go a long way in building a great workplace and boosting morale. Give them the day off or the freedom to come in late for work if they spent hours last night resolving an incident at 3am. It’s hard work!
- Talk to your users! When a major disruption occurs, immediately communicate with the affected users (via email or social media) about the issue and outline the steps you’re taking to make sure such issues do not recur. This will not only help you cultivate a sense of trust and loyalty amongst your user base but also communicate to your users that you are on the path to building a reliable service that they can depend on.
- Your post-mortem should be absolutely blameless. Analyze the root causes of your incidents and set a timeline to fix them permanently. That’s it. Some engineers might break the code every now and then, and that’s okay. Building a great and exciting company and having the freedom to break things is why people join startups in the first place.
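The tips above on weekly rotations and on every engineer being on-call at least once every 2 months interact, and a little arithmetic shows how. The sketch below assumes a fair rotation where everyone takes equal turns and treats "2 months" as roughly 8 weeks; both the function names and that 8-week figure are assumptions made for illustration.

```python
def weeks_between_shifts(team_size, on_call_per_week=1):
    """Average weeks between an engineer's shifts in a fair weekly rotation."""
    if team_size < 1 or on_call_per_week < 1:
        raise ValueError("team_size and on_call_per_week must be positive")
    return team_size / on_call_per_week


def meets_exposure_target(team_size, on_call_per_week=1, max_weeks=8):
    """True if everyone is on-call at least once every `max_weeks` weeks.

    `max_weeks=8` approximates "once every 2 months" (an assumption).
    """
    return weeks_between_shifts(team_size, on_call_per_week) <= max_weeks


six_person_team = meets_exposure_target(6)                    # one primary per week
ten_person_team = meets_exposure_target(10)                   # one primary per week
ten_with_pairs = meets_exposure_target(10, on_call_per_week=2)  # primary + secondary
```

Under these assumptions a six-person team meets the target with a single weekly primary, while a ten-person team only meets it once two people are on-call each week, for example by counting the secondary level as on-call exposure.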
And finally, be Zen! No matter what happens, treat on-call as a challenge rather than a crisis. Remember - incidents do not destroy companies, but the lack of will to resolve them does. As engineers in a startup, you should “move fast and break things”, but also “fix things after you break them”. That’s how you build great startups.