At a recent SRE Meetup in Bangalore, we had the pleasure of meeting Akshay Deshpande. During our conversation, Akshay, who manages a Performance/Observability Engineering team at Smarsh, discussed his passion for observability and his constant drive to improve the field.

Smarsh helps companies gain valuable insights from their communication data, enabling them to proactively identify potential regulatory and reputational risks before they escalate.

Akshay Deshpande, who manages the Performance/Observability Engineering team at Smarsh

Akshay's enthusiasm for the space was contagious, and we were eager to learn more about his insights and observations. Here's what he had to share.

Q1) What were the biggest challenges you faced in building and managing the Observability team at Smarsh, an organization that primarily deals with large financial enterprises with little room for error?

A) About two years ago, our company went through a big change in how we operate.

We shifted from the old way of "We build, you handle it" to a more team-oriented approach of "We build, we own."

The "We" in this context shifted from a "Dev vs. SRE" mindset to an "Engineers First" mindset.

This meant that everyone, including developers and Site Reliability Engineers (SREs), had a stake in how well our systems ran. It wasn't just about creating the product anymore – developers started sharing the responsibility of making sure it worked smoothly by taking turns being on call.

At the same time, our SREs moved beyond just fixing problems to making sure our systems were solid from the get-go by integrating their insights into the product development process.

Q2) You implemented automated sizing of Storm clusters — an effort that helped cut over $1M in yearly spend. Would you like to share how the need for this solution arose and how you homed in on it?

A) It was a team effort with three of us.

We faced a challenge with our Storm cluster, where all topologies ran on the same big EC2 instances, regardless of their size. So we tweaked the topology tags in Storm to include a vm_type parameter. This allowed teams to right-size their topologies, like switching from r6i.4xlarge to c6i.large instances.
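
To make that concrete, here's a minimal sketch (in Java) of what tagging a topology with a vm_type hint might look like at submission time. The vm_type key and the automation that reads it are specific to our setup, so treat the names here as assumptions rather than a stock Storm feature.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class TaggedTopologySubmitter {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... spouts and bolts wired up here ...

        Config conf = new Config();
        // Hypothetical tag: internal sizing automation reads this hint and
        // places the topology on the matching EC2 instance type, so small
        // topologies no longer ride on r6i.4xlarge workers by default.
        conf.put("vm_type", "c6i.large");

        StormSubmitter.submitTopology("archive-ingest", conf, builder.createTopology());
    }
}
```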

It saved us a lot!

While the engineering solution was straightforward, promoting the idea across the organization and encouraging teams to adopt the change proved to be the most challenging part.

I had to create cost-saving visualization dashboards on Datadog, which demonstrated the financial benefits for the organization (each team's current spend vs. post-adoption spend). Once they saw the numbers, they were all in!


Q3) You’re an active advocate for OpenTelemetry and have helped your org avoid lock-in with traditional o11y tools. What do you feel about the current o11y landscape that’s getting busier every day, and what do you want to see next?

A) I have a bit of a controversial take on this.

I'm a big fan of OpenTelemetry and how it's made things more standard for observability. But lately, it seems like MELT (metrics, events, logs, traces) is becoming almost like a religion.

I think we need to remember that one size doesn't fit all when it comes to solving observability problems.

Tracing has become the new “hammer,” and we're trying to hit every bolt with it. But in my opinion, tracing really shines when you're passing context across different services and doing distributed tracing.

If you're just using it within a small, simple microservice, it might not add much value. After all, we already have metrics for keeping an eye on individual services.
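
To illustrate the distinction, here's a small OpenTelemetry (Java) sketch of the cross-service case: a span is started and its context is injected into outgoing headers via the W3C trace-context propagator so the downstream service can continue the trace. The class, span, and scope names are made up for the example.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;

import java.util.HashMap;
import java.util.Map;

public class CheckoutClient {
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("checkout-client"); // hypothetical scope name

    public void callPaymentsService() {
        Span span = tracer.spanBuilder("checkout.call-payments").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // Inject the current trace context (W3C traceparent header) into the
            // outgoing request so the downstream service can join the same trace.
            Map<String, String> headers = new HashMap<>();
            W3CTraceContextPropagator.getInstance()
                    .inject(Context.current(), headers, Map::put);
            // ... send the HTTP request with `headers` attached (client omitted) ...
        } finally {
            span.end();
        }
    }
}
```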

What's next on my radar?

Well, I'm really eager to see a new development in the o11y space, something I'm actually cooking up right now and pretty stoked about.

Imagine if developers could treat MELT components like one big event, sort of like a giant JSON document. Then, depending on what you need and how you want to use it later, you could slot it into the right part of the MELT components. I think this could really help with the cost issues we're facing in o11y right now.
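
As a rough, entirely hypothetical sketch of that idea: a single JSON-like "wide event" that carries trace, metric, and log fields together, which a downstream router could fan out to whichever MELT backend suits how each field will be queried.

```java
import java.time.Instant;
import java.util.Map;

public class WideEventSketch {
    public static void main(String[] args) {
        // One event document with trace, metric, and log fields side by side;
        // a routing stage could later send duration_ms to a metrics store,
        // trace_id to the tracing backend, and the log line to log storage.
        Map<String, Object> wideEvent = Map.of(
                "timestamp", Instant.now().toString(),
                "service", "archive-ingest",                 // hypothetical service
                "name", "message.archived",
                "trace_id", "4bf92f3577b34da6a3ce929d0e0e4736",
                "duration_ms", 42,                           // could become a latency metric
                "attributes", Map.of("tenant", "acme-corp", "channel", "email"),
                "log", "archived 1 message for tenant acme-corp"
        );
        System.out.println(wideEvent);
    }
}
```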

It's something I'm diving into because it's directly tied to the challenges I'm tackling at work. And hey, I'm optimistic that the o11y landscape will settle down and become more standardized over the next couple of years.

Q4) Can you describe some of the most common performance bottlenecks you've encountered in cloud applications, and what’s your approach for identifying their root causes?

A) Cloud applications are interesting, because scale is the cause of (and solution to) all our problems.

It's not magic.

When we rely solely on scaling to solve performance issues, we often find ourselves facing financial operations (FinOps) challenges.

But hey, both areas have their place and work together for a reason, right?

In my experience, one of the most common issues with application performance isn't performance itself, but rather inefficient scaling.

In containerised cloud applications, operational efficiency is most critical. What I mean by that is: “Is my application spending the least possible $$ on infra and serviceability without breaking the SLOs?”

The solution I tend toward is production feedback. The feedback loop has to be as short as possible. Engineers love momentum.

Startups are popular because they remove corporate processes and increase momentum.

🎯
Measure ➡️ Plot/Forecast ➡️ Course correct ➡️ Measure

I am a fan of learning from production and improving. Lower-environment testing is great, but unknown unknowns only come from production. Correcting scaling parameters based on how the application performs in production has direct cost benefits, and I can't stop ranting about it.
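
As a toy illustration of that loop (not anything we run verbatim), here's a sketch that measures production signals, checks them against an assumed SLO, and course-corrects the scaling parameters; the MetricSource and Scaler interfaces are placeholders standing in for whatever metrics pipeline and autoscaler you actually use.

```java
public class ScalingFeedbackLoop {
    // Placeholders for a real metrics pipeline and autoscaler integration.
    interface MetricSource { double p99LatencyMs(); double avgCpuUtilization(); }
    interface Scaler { int currentReplicas(); void setReplicas(int replicas); }

    private static final double LATENCY_SLO_MS = 250.0; // assumed SLO
    private static final double CPU_TARGET = 0.60;      // assumed efficiency target

    // Measure -> course correct -> measure again on the next cycle.
    static void reconcile(MetricSource metrics, Scaler scaler) {
        int replicas = scaler.currentReplicas();
        if (metrics.p99LatencyMs() > LATENCY_SLO_MS) {
            // Production says the SLO is at risk: scale out.
            scaler.setReplicas(replicas + 1);
        } else if (metrics.avgCpuUtilization() < CPU_TARGET && replicas > 1) {
            // Comfortably inside the SLO and under-utilized: scale in to save cost.
            scaler.setReplicas(replicas - 1);
        }
        // Otherwise hold steady and measure again.
    }
}
```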

Q5) As someone who has managed and mentored engineers, particularly ones just entering Site Reliability and Performance Engineering, how can organizations ease onboarding for younger engineers onto their observability and performance engineering teams — a domain that often throws you into knee-deep work right off the bat?

A) I have to acknowledge that I haven't done a great job of improving the onboarding process myself.

While we do have onboarding protocols with clear one-week, one-month, and three-month directives, these tend to change very quickly as the org grows and evolves.

I encourage new members to get their hands dirty.
  • Cut that first PR. Get involved in that first incident.
  • And more importantly don’t be shy about asking for help. Everyone here wants to see you succeed.
  • Use those Slack channels—they're like gold mines for expertise. And forget about DMs; the group chat's where it's at.

It might not work for every company, but it's been a game-changer for us.

Q6) What’s the most interesting war room story you’d like to share with our readers?

A) It wasn't just another project; it was a rapid migration task that kept us on our toes.

Though I can't reveal the specifics due to NDA, we had to upgrade a migration service to handle over 1000 transactions per second (TPS).

At the start, we were only managing around 60 TPS, but within three weeks, we hit over 1000 TPS. It was a tough journey, but incredibly fulfilling.

What made it enjoyable was that every decision was driven by data.

Our process was straightforward:

📑
Build ➡️ Deploy ➡️ Measure ➡️ Change ➡️ Repeat

I still cherish the energy the team had and the clarity that data-driven decisions brought to the entire effort.

With that, we wrap up our conversation about observability, onboarding processes for new engineers, and more.

If you're fascinated by reliability and the intricate process of recovering from downtime, check out our podcast, Incidentally Reliable, where veterans from Amazon, Walmart, BookMyShow, and other leading organizations share their experiences, challenges, and success stories!

If you're looking to streamline your incident management process, Zenduty can improve your MTTA and MTTR by at least 60%. With our platform, engineers receive timely alerts, reducing fatigue and boosting productivity.

Sign up for a free trial today!