Reliable system operation depends on a strong Site Reliability Engineering (SRE) team. However, every organization faces a critical strategic decision: cultivate SRE talent internally, or hire from the external talent pool?

Both approaches possess distinct advantages and disadvantages, each impacting the composition, skillset, and overall effectiveness of the SRE team.

We reached out to some experts to get insights into their preferred approaches for hiring SREs, and their responses offer valuable guidance for any organization in search of a skilled Site Reliability Engineer.

Hiring SREs: Tips and Qualities to Look For

When we asked the experts, a common recommendation was to prioritize internal talent for the SRE role.

If there's interest within the organization, identifying team members who are eager to move into an SRE position is a valuable first step.

Let's check what others have to say on this:

Bartek Antoniak

Bartek is the Head of Cloud Engineering at VirtusLab, bringing extensive experience in software development, architectural design, and leading remote engineering teams in complex enterprise environments.

When asked about hiring SREs, he describes an experience from one customer engagement.

Building SRE Capability From the Ground Up

The relatively small engineering team built and instantiated the platform, and eventually we moved them to the next customer engagement.

Initially, we tried to hand this over to the existing IT ops team; however, it didn't go well. As a result, I had to hire and train people to operate and continuously improve the platform.

Time-Consuming but Efficient and Organized

It was a difficult and time-consuming activity, but eventually we got to the point where it was efficient and self-organizing. It involved a large number of knowledge-sharing activities, fire drills, travel, etc.

The most important aspect is to make this decision at a relatively early stage of development. You need to invest more in runbooks, developer guides, and establishing processes around monitoring, release management, disaster recovery, etc.

On Promoting SREs Internally

This happens organically, as some people naturally gravitate towards this kind of work (it's always good to do this kind of assessment in the existing team). However, not everyone on the team is interested in or capable of doing this; some would rather focus more on engineering work.

It's understandable because it requires a different mindset.

In the End, Internal Promotion Alone Can't Fill the Gap

I don't think it's fully possible; you will have to hire someone anyway to fill the gap, especially someone who will manage this team and establish measurements (SLA, SLO, etc.).

The biggest advantage in this scenario is that people already know their way around and can spread that knowledge more effectively through pair programming and similar practices.

Key Takeaway:

To summarize, I'd start internally and fill the gaps over time through external hiring.

Steve Fenton

Steve Fenton is an Octonaut at Octopus Deploy. He's known as a Software Punk, author, programming architect, pragmatist/abstractionist, and a generalizing generalist.

Unless there's a big skills gap, I'd look at the existing team for folks to lead on SRE or similar roles. I've always found someone on a team who wants to grow beyond their role.

Luca Galante

Luca serves as VP, Product & Growth at Humanitec. He enjoys sharing his insights on DevOps, Platform Engineering, and Cloud-native topics.

Luca agrees with Steve on the approach to hiring SREs.

It does depend on what the skills gap might be. But you're usually going to be better off with internal moves there.

Someone who is already familiar with your systems and ways of working will have an easier time stepping up to that plate and piecing things together.

That said, I definitely don't discount the power of bringing in outside knowledge and ideas. Even just the impact of a fresh set of eyes on things can help a lot.

Key Takeaway:

Don't discount the power of bringing in outside knowledge and ideas. Even just the impact of a fresh set of eyes on things can help a lot.

Viraj Patel

Viraj currently holds the position of Senior Vice President at Axis Bank and has previously worked with renowned organizations such as Flipkart and BookMyShow.

In the 4th episode of our "Incidentally Reliable" podcast, we discussed how BookMyShow enabled developers to transition into SRE roles. Viraj shares his perspective in that episode.

Pawel Rusakiewicz

Pawel holds the role of Engineering Manager at Nobl9 and has a strong passion for the entire spectrum of software development.

According to Pawel, it depends on a number of factors, such as:

Organization Model:

  • Type
  • Number of Products

Team Structure:

  • Current Structure
  • Skillset
  • Identified Gaps

Infrastructure/Architecture:

  • Level of Complexity

SRE Model:

  • Implemented Model
  • Training Affordability
  • Preferred Experience Level

SRE Availability:

  • Regional/Time Zone Considerations
  • Ease of Finding SREs

Other Factors:

  • Product Complexity
  • Additional Factors

Key Takeaway:

In general, the approach should be adjusted to the surrounding reality as much as possible; there is no single rule of thumb here.

Omkar Kadam

Omkar is a Lead DevOps Engineer at Cactus Communications, where he shares insights and guidance from his cloud transformation journey.

We spoke with him about Platform Engineering, DevOps, and SRE.

When asked about hiring SREs, he says there is a quote he appreciates:

Success is a matter of will or skill.

In practical terms, the decision boils down to this: if your organization already has the needed talent and no significant skill gaps, it's advisable to focus on fostering the career growth of your internal team members.

On the other hand, if there are substantial skill gaps and agility and speed are paramount, I lean towards bringing in external subject matter experts.

His Perspective on Hiring Leaders Externally:

It's crucial for leaders to stay hands-on, understanding the internal workings and groundwork. This sentiment is widely shared in the industry.

Key Takeaway:

Striking a balance between these two approaches is essential, and I don't advocate for one over the other.

Manoj Sebastian

Manoj is a technology leader who has worked with brands such as Flipkart, Atlassian, and Intuit, among many others.

In the 3rd episode of our "Incidentally Reliable" podcast, we spoke about various aspects of SRE and reliability.

One key topic was determining when organizations should begin hiring SREs and the essential qualities to consider in candidates.

As organizations grow, reliability becomes non-negotiable.

In startups, even without a dedicated reliability team, having someone to mentor the squad in reliability practices is important.

When hiring Site Reliability Engineers (SREs), experience matters. These pros bring tools know-how and architectural smarts from previous roles.

When someone on our team wants to switch to an SRE role, we don't just look at their experience. We also check for "SRE instincts."

Instead of focusing on specific tasks they've done, we ask questions and observe how they think about system design.

Do they consider:

  • Reliability: How can we make this system withstand any unexpected hiccups?
  • Rate limiting: Can we prevent overload and keep things running smoothly?
  • Fallbacks: What happens if something goes wrong? Do we have a backup plan?

If they're thinking about these things, even without direct SRE experience, it shows they have the right mindset for the job. We can then help them develop their skills and become great SREs!
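To make two of these design questions concrete, here is a minimal Python sketch of rate limiting combined with a fallback. It is an illustration only, not something taken from the podcast: the names `TokenBucket` and `call_with_fallback` are hypothetical, and the token-bucket approach is just one common way to limit request rates.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def call_with_fallback(bucket: TokenBucket, primary: Callable, fallback: Callable):
    """Run `primary` if the limiter allows it; use `fallback` when throttled or on error."""
    if not bucket.allow():
        return fallback()
    try:
        return primary()
    except Exception:
        # The primary path failed, so degrade gracefully instead of propagating.
        return fallback()
```

A candidate with the instincts described above might propose something along these lines unprompted, for example serving a cached response through `fallback` when the live dependency is overloaded or down.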

Key Takeaway:

Building a great SRE team means finding people who can craft reliable systems, have an eye for good design, and are masters of the SRE toolkit. This goes for both internal and external engineers.

That wraps up the insights from our expert discussions.

When it comes to hiring SREs, the crucial factor is evaluating "SRE crafts and instincts," whether within the internal team or when looking externally.

If you're fascinated by reliability and the intricate process of digital recovery from downtime, check out our podcast, Incidentally Reliable, where veterans from Amazon, Walmart, BookMyShow, and other leading organizations share their experiences, challenges, and success stories!

Did we overlook any aspects in this discussion? Share your stories and approaches with us—we're eager to hear from you!