I was out there in sunny Austin this February, speaking at Civo Navigate 2024. The event was jam-packed with amazing talks, and it was great meeting so many people with long and fascinating careers in engineering and Site Reliability.

I had the privilege of meeting Bob Lee, who currently leads DevOps at Twingate, a cloud-based service that provides secure remote access and is poised to replace VPNs. Bob has had a long career spanning over 25 years at orgs like MariaDB and Fairwinds, along with long stints at healthcare leaders like Abbott and Olive.

Naturally, I was curious about the nitty-gritty of his career in reliability and had a few questions in store for him.

Q1) You’ve been in the tech industry for over 25 years. A career this long is bound to have its ups and downs, but there must have been ‘aha!’ moments where you’ve been truly glad to be doing what you do. Could you share a few such instances?

A) Scaling is now something we take for granted.

In the past, smaller sites had to worry about the "Slashdot effect", where a massive increase in traffic overloads a site. However, adding capacity no longer involves waiting days or weeks for someone to physically configure and add another server to a rack.

The hyperscalers have made it remarkably easy for hobbyists and startups to get started and handle large surges of traffic.

The disadvantage? Costs are skyrocketing.

I would not be surprised to see more of the larger companies move back to on-premises data centers.

I have been fortunate enough to witness the evolution in infrastructure deployment models over the years.

From standalone servers to virtualization with VMware, containers, and container orchestration with Kubernetes, the landscape has continuously evolved. I'm looking forward to working more with edge computing and anticipating what the next trend will be. Hopefully, it will involve quantum computing.


Q2) You spent some time leading Site Reliability teams at MariaDB — one of the most popular open source relational databases. What were the stakes like when managing reliability for massive enterprise clients, where even a small hiccup could potentially result in data loss for your customers?

  • How did your team approach fault tolerance?
  • What did the SLO structure at your team look like?

A) Data can either make or break a company.

Losing or corrupting a client's data can have significant financial implications. To mitigate this risk, we ensure that we maintain multiple backup copies and replicas running in different zones or regions.

When a database experiences downtime, it affects virtually everything downstream as well. Maintenance windows can be tricky to navigate. It's not just about finding the best time for your organization but also for your clients and their users.

Coordination with clients is essential to determine when they can begin testing their applications.

For fault tolerance, we ensure that enterprise databases have at least one primary and two replicas across three zones. Running the databases on Kubernetes enables us to scale up and out quickly.

Additionally, we ensure that only one primary or replica instance is running on a node. This approach minimizes downtime if a node needs to be replaced. Our Kubernetes clusters are deployed in regions closest to the customer and on the cloud provider of their choice, whether GCP or AWS.
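For a concrete sense of what that placement policy can look like on Kubernetes, here is a minimal sketch (illustrative only, not MariaDB's actual manifests; the labels and app name are hypothetical) of pod scheduling constraints that spread database pods across zones and forbid two instances on the same node, expressed as a Python dict mirroring the standard pod spec fields:

```python
# Illustrative only: scheduling constraints that spread a database's pods
# across availability zones and keep at most one instance per node.
# Label names/values are hypothetical; field names follow the Kubernetes pod spec.
db_pod_scheduling = {
    # Spread pods evenly across zones (at most a difference of 1 between zones).
    "topologySpreadConstraints": [{
        "maxSkew": 1,
        "topologyKey": "topology.kubernetes.io/zone",
        "whenUnsatisfiable": "DoNotSchedule",
        "labelSelector": {"matchLabels": {"app": "mariadb"}},
    }],
    # Forbid two database pods (primary or replica) from sharing a node.
    "affinity": {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [{
                "topologyKey": "kubernetes.io/hostname",
                "labelSelector": {"matchLabels": {"app": "mariadb"}},
            }],
        }
    },
}
```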

Given our presence on multiple cloud providers and regions, we closely monitor the status pages of each provider. This distributed setup minimizes the impact if a cloud provider experiences an issue.

In terms of our SLO (Service Level Objective) structure, we have multiple tiers with different SLAs, all guaranteeing at least 99.95% uptime.
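To put 99.95% in concrete terms, the error budget works out to roughly 21 minutes of downtime in a 30-day month; a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope error budget for a 99.95% uptime commitment.
slo = 0.9995
minutes_per_month = 30 * 24 * 60  # 43,200 minutes in a 30-day month

error_budget = (1 - slo) * minutes_per_month
print(f"Allowed downtime: {error_budget:.1f} minutes/month")  # ~21.6 minutes
```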

We utilize Prometheus and have developed our custom exporter to monitor various metrics such as latency, throughput, error rates, disk space, query performance, and open connections.
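As a rough illustration of what a custom exporter involves (this is not the team's actual exporter; the metric names and the stats function below are hypothetical placeholders), the prometheus_client library makes it straightforward to expose such metrics over HTTP for Prometheus to scrape:

```python
# Minimal sketch of a custom Prometheus exporter using prometheus_client.
# Metric names and fetch_db_stats() are hypothetical placeholders.
import random
import time

from prometheus_client import Gauge, start_http_server

query_latency = Gauge("db_query_latency_seconds", "Average query latency")
open_connections = Gauge("db_open_connections", "Currently open connections")
disk_free_bytes = Gauge("db_disk_free_bytes", "Free disk space on the data volume")


def fetch_db_stats():
    """Stand-in for querying the database's status tables."""
    return {
        "latency": random.uniform(0.001, 0.05),
        "connections": random.randint(1, 200),
        "disk_free": random.randint(10**9, 10**11),
    }


if __name__ == "__main__":
    start_http_server(9101)  # metrics served at http://<host>:9101/metrics
    while True:
        stats = fetch_db_stats()
        query_latency.set(stats["latency"])
        open_connections.set(stats["connections"])
        disk_free_bytes.set(stats["disk_free"])
        time.sleep(15)
```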

We maintain a global team of DBAs available 24/7 to address any issues. When an issue arises, a ticket and alert are generated for the DBA team. If necessary, an SRE (Site Reliability Engineer) is alerted. While most issues originate from the client side, we provide assistance in tuning their settings and queries to resolve them.

🎧
Tune in to find out what reliability means to the modern consumer, why SREs make excellent decision-makers, and the current state of observability with the Co-Founder and CTO of Last Nine!

Q3) Whenever we talk to the ‘build fast, break fast’ teams out there, they mention ‘shifting left’ as a remedy to ensure reliability during phases that require high feature velocity. What does shifting left mean to you and how would you approach such a transformation at an organization with long-established engineering and DevOps practices?

A) To me, 'shifting left' means initiating automation and testing early in the development process.

Instead of waiting until staging or production to set up Infrastructure as Code (IaC) or CI/CD pipelines, begin in the development phase from the outset.

Plan ahead to accommodate potential traffic growth of 2x, 5x, or even 10x in the coming months.

It's crucial that security, performance, and scalability considerations are consistent across lower environments and production. Automation should be standardized across all environments, with only variable adjustments for machine types and cluster sizes as necessary.

To implement this approach in a long-established engineering and DevOps practice, I would advocate for incremental changes.

  • Identify areas that would benefit the most from early quality checks, such as adding unit tests for critical functions (a toy example follows this list).
  • Begin integrating automation and tools like static code analyzers, linters, and IaC into existing workflows.
  • Gradually incorporate unit tests into the CI/CD pipeline. Prioritize the developer experience, ensuring that the new processes and tools enhance rather than hinder development workflows.
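As a toy example of that first step, a unit test pinned to a critical function is cheap to write, runs locally, and can slot into every CI/CD stage unchanged. The pricing helper here is hypothetical, purely to show the shape of the test:

```python
# Toy shift-left example: a unit test for a hypothetical critical function,
# runnable locally and in CI (e.g. `python -m pytest`).
import pytest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical business-critical pricing helper."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


def test_apply_discount_happy_path():
    assert apply_discount(100.0, 20) == 80.0


def test_apply_discount_rejects_invalid_percent():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150)
```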

Q4) You currently lead DevOps at Twingate — an org making Zero Trust Network Access more secure and accessible. Competing with age-old VPNs that most enterprises are familiar with, the error margins must be slim to instill trust within these giants. What does your observability stack look like to help you be on top of your operations at all times?

  • How do you conduct handoffs between on-call engineers during complex incidents?
  • Any unique steps or processes your team at Twingate has adopted to streamline your incident management or production operations?

A) I'm confident that once someone tries Twingate compared to their age-old VPNs, they'll want to convert. Having been at many companies using a combination of VPNs, jump hosts, and even BeyondCorp at Google, I can attest that Twingate is by far the easiest to set up, maintain, and use.

Similar to a database, it's crucial for network access to remain up.

We target 99.99% uptime and run our infrastructure globally across multiple regions and zones to enhance performance and reliability.

Observability Stack:

Our observability stack includes Prometheus, Alertmanager, GCP alert policies, Grafana, ELK for logs and APM, and Sentry for error tracking.

Issues are posted to #ops in Slack, while Opsgenie handles on-call scheduling and notifications. We're also exploring Zenduty for its promising features. All of this is automated with Terraform and Flux.
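To give a flavour of how that wiring typically fits together (a sketch only, not Twingate's actual configuration; the channel, matchers, and keys are placeholders), an Alertmanager-style routing tree that posts every alert to Slack and pages on-call for critical ones could look like this, expressed here as a Python dict:

```python
# Sketch of an Alertmanager-style routing tree: every alert lands in Slack,
# critical alerts additionally page the on-call schedule. All values are
# hypothetical placeholders; in practice this would be rendered via Terraform.
alertmanager_config = {
    "route": {
        "receiver": "slack-ops",  # default: post to #ops
        "group_by": ["alertname", "service"],
        "routes": [
            {"matchers": ["severity = critical"], "receiver": "oncall-pager"},
        ],
    },
    "receivers": [
        {
            "name": "slack-ops",
            "slack_configs": [{"channel": "#ops", "api_url": "<slack-webhook-url>"}],
        },
        {
            "name": "oncall-pager",
            "opsgenie_configs": [{"api_key": "<opsgenie-api-key>"}],
        },
    ],
}
```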

How do we operate?

We operate a 24/7 on-call rotation with three shifts per 24-hour period, ensuring that each engineer is on call for only eight hours per day.

Given our global presence and engineers in multiple time zones, our shifts occur during the day rather than in the middle of the night.

Runbooks are attached to all our alerts, and we hold monthly debriefs to discuss alerts and reduce non-actionable or irrelevant alerts to alleviate alert fatigue.
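Attaching a runbook is usually as simple as an annotation on the alert rule itself. A hedged sketch of a Prometheus-style rule (the alert name, expression, threshold, and URL are all hypothetical), again shown as a Python dict:

```python
# Sketch of a Prometheus alerting rule with a runbook attached via annotation.
# The alert name, expression, threshold, and URL are hypothetical.
relay_latency_alert = {
    "alert": "RelayLatencyHigh",
    "expr": "histogram_quantile(0.99, sum(rate(relay_latency_seconds_bucket[5m])) by (le)) > 0.5",
    "for": "10m",
    "labels": {"severity": "warning"},
    "annotations": {
        "summary": "p99 relay latency above 500ms for 10 minutes",
        "runbook_url": "https://internal.example.com/runbooks/relay-latency",
    },
}
```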

Alerts to on-call engineers:

In the event of an issue, the on-call engineer will escalate, opening a dedicated Slack channel and Zoom meeting. Every team member has a role, whether it's investigating the issue, maintaining the timeline in Notion, or managing communications to keep the status page, AEs, and support team updated.

🎧
"Tech is Easy, People are Hard" — with Suresh. Do you agree? Watch the full episode with Suresh Khemka here!

Q5) Any amusing near-miss incidents or war room stories from your time at these orgs that you’d like to share with our readers?

A) 1. Pre-Hyperscaler Hosting Challenges

Before hyperscalers gained widespread popularity, I worked for a large healthcare company that hosted our site at one of the many managed hosting providers available at the time.

2. Website Meltdown

We encountered a significant incident involving a recall on our baby formula. As I was preparing to leave work, I noticed our website becoming unresponsive. This was back when people still relied heavily on the evening news.

Around 5 pm, the recall was announced on the news, leading to a sudden surge in traffic to our website. Visitors rushed to our site to check if their product's lot number was affected by the recall.

I empathized with the people who couldn't check if their baby formula was affected by the recall. The influx of visitors overwhelmed our servers and network, creating a DDoS-like scenario.

Despite working overnight and adding more servers and capacity, we couldn't resolve the issue. It wasn't until the morning that we realized we had saturated the data center's network.

Despite efforts to scale up on the backend, the bottleneck was the incoming bandwidth entering the data center. We had to remove some of the heavier functionality of our site, such as the community forums, and temporarily migrate to another data center with more capacity.

We were already in the process of seeking a new data center and hosting provider. However, this incident highlighted the critical importance of our hosting infrastructure, leading to an increase in our hosting budget.

3. Lessons Learned

The lesson learned is that you can outgrow your hosting provider. Unfortunately, at that time, moving to AWS was not feasible, as they only offered S3 and EC2 services. It would have been interesting to see how modern infrastructure would have managed that level of traffic back then.

With today's advancements, cloud load balancers could efficiently handle the traffic, and Kubernetes would automatically scale the pods and nodes to accommodate the surge.

The journey of scalability and reliability is never-ending, but with Bob's insights, you're well on your way.

If you're fascinated by reliability and the intricate process of digital recovery from downtime, check out our podcast, Incidentally Reliable, where veterans from Amazon, Walmart, BookMyShow, and other leading organizations share their experiences, challenges, and success stories!

Available on all your favourite platforms