The pressure to constantly innovate and release new features can often clash with the need for a stable and reliable product.

While testing time is sometimes cut back temporarily to hit a high feature velocity, reliability doesn't have to be an afterthought.

We reached out to industry experts to gather their insights on ensuring reliability during phases that demand high feature velocity. Here's what they had to say.

Steve Fenton

Steve Fenton is an Octonaut at Octopus Deploy. He's known as a Software Punk, author, programming architect, pragmatist/abstractionist, and a generalizing generalist.

He cites research from DORA and SlashData, indicating that following technical practices can enhance both throughput and stability simultaneously. This suggests that there's no need to compromise one for the other.

If stability is compromised for speed, it indicates a problem with the current approach.

It's okay to "break things" in the sense of trying an idea and finding it doesn't help you achieve your goal. It's not so much about bringing services to their knees with every deployment.

💡
Key Takeaway: Throughput and stability both enhance your ability to run more experiments with your software. That's the kind of "move fast" that's really desirable.

Jordan Chernev

Jordan is a Sr. Director of Platform Engineering at Wayfair, with experience in building high-performing teams and systems.

When asked about reliability vs. feature velocity, he shares one of his most important strategies.

I've seen it used at scale as a production readiness scorecard or checklist that needs to be satisfied before a new service gets deployed to production. You pay for the reliability upfront, so expect an initial “slowdown” as a one-time start-up cost that will pay you dividends in the long run.

Example: https://sre.google/sre-book/launch-checklist/
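To keep that one-time start-up cost from recurring on every launch, parts of such a scorecard can be automated. Below is a minimal sketch in Python, assuming hypothetical checks (has_runbook, has_slo_alerts, has_rollback_plan) that stand in for queries against your own tooling such as monitoring, on-call, and CI systems.

```python
# Minimal sketch of an automated production-readiness check.
# The individual checks are hypothetical placeholders; real checks would
# query your own tooling (monitoring, on-call, CI) instead of returning
# hard-coded values.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReadinessCheck:
    name: str
    check: Callable[[], bool]


def has_runbook() -> bool:
    return True  # e.g. verify a runbook page exists for the service


def has_slo_alerts() -> bool:
    return True  # e.g. verify latency and error-rate alerts are configured


def has_rollback_plan() -> bool:
    return False  # e.g. verify the pipeline defines a rollback step


CHECKLIST = [
    ReadinessCheck("Runbook published", has_runbook),
    ReadinessCheck("SLO alerts configured", has_slo_alerts),
    ReadinessCheck("Rollback plan defined", has_rollback_plan),
]


def main() -> int:
    failed = False
    for item in CHECKLIST:
        passed = item.check()
        failed = failed or not passed
        print(f"[{'PASS' if passed else 'FAIL'}] {item.name}")
    return 1 if failed else 0  # a non-zero exit code blocks the deploy in CI


if __name__ == "__main__":
    raise SystemExit(main())
```

Run as a pipeline step, a script like this blocks promotion until every item passes, which is exactly the upfront payment Jordan describes.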

💡
Key Takeaway: Develop a checklist outlining the tasks and processes required for your project. Once the checklist is finalized, work on creating templates and automating these tasks wherever possible. This will streamline your workflow, making it easier to follow and reducing the likelihood of errors or oversights.

Scott Hiland

Scott is an Operations and Infrastructure expert and a security architect with a background in culture change, infrastructure, and integration planning.

He says,

Toil reduction through automation is one of the best technical practices to follow as it aids consistency.

When it comes to checklists, Scott holds a different view.

He believes checklists can become problematic, as teams may manipulate, ignore, or sideline them, particularly when unexpected events arise.

Instead, we should focus on building automation and standardized paths.

Building automation and templated paths to production that test go-live capabilities against agreed-upon service level objectives is more effective and efficient, and it shortens feedback loops.
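As a rough illustration of such a templated path, the sketch below gates go-live on agreed service level objectives. It assumes hypothetical helpers fetch_error_rate and fetch_p99_latency_ms in place of real queries to a metrics backend, and the thresholds are example values only.

```python
# Sketch of an SLO gate in a templated path to production.
# fetch_error_rate() and fetch_p99_latency_ms() are hypothetical stand-ins
# for queries against a metrics backend (Prometheus, Datadog, etc.).

SLO_MAX_ERROR_RATE = 0.01      # example budget: at most 1% failed requests
SLO_MAX_P99_LATENCY_MS = 300   # example budget: p99 latency under 300 ms


def fetch_error_rate(service: str) -> float:
    return 0.004  # placeholder value


def fetch_p99_latency_ms(service: str) -> float:
    return 210.0  # placeholder value


def go_live_allowed(service: str) -> bool:
    """Allow promotion only if the candidate release meets the agreed SLOs."""
    error_rate = fetch_error_rate(service)
    p99_latency = fetch_p99_latency_ms(service)
    return error_rate <= SLO_MAX_ERROR_RATE and p99_latency <= SLO_MAX_P99_LATENCY_MS


if __name__ == "__main__":
    service = "checkout"  # hypothetical service name
    print("promote" if go_live_allowed(service) else "hold and investigate")
```

Because the gate runs automatically on every release, developers get feedback from the pipeline itself rather than from someone working through a checklist.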
💡
Key Takeaway: The best approach is to collaborate with developers to build or enhance production processes. Relying solely on checklists creates a barrier between platform teams and developers.

Sandor Szuecs

Sandor is a Teapot Engineer at Zalando SE. He loves talking about reliable approaches, load balancing, and everything around Kubernetes.

He shares his current approach:

We use blue-green deployments to switch traffic as quickly or slowly as feature teams prefer, along with visibility and automation triggering alerts if error rates spike.

Clusters are categorized into channels, such as 'beta' for test clusters and 'stable' for production clusters.

We follow a dev->alpha->beta->stable channel hierarchy, with dev serving as infrastructure test clusters and a playground, and alpha as infrastructure production clusters.

We also track metrics by version to monitor any increases in memory usage, latency, error rates, etc.
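A hedged sketch of that kind of blue-green switch, with an automatic rollback when per-version error rates spike, might look like the following. set_traffic_split and error_rate_for are hypothetical stand-ins for a load balancer API and a version-labelled metrics query, not Zalando's actual tooling.

```python
import time

# Sketch of a blue-green traffic shift with an automated rollback trigger.
# set_traffic_split() and error_rate_for() are hypothetical; in practice they
# would call an ingress/load balancer API and a metrics backend whose series
# are labelled by version, so blue and green can be compared directly.

ERROR_RATE_THRESHOLD = 0.02  # roll back if the new version exceeds 2% errors


def set_traffic_split(green_percent: int) -> None:
    print(f"routing {green_percent}% of traffic to green")


def error_rate_for(version: str) -> float:
    return 0.003  # placeholder value


def blue_green_rollout(step_percent: int = 20, pause_seconds: int = 60) -> bool:
    """Shift traffic to the green (new) version in steps, rolling back on errors."""
    for green in range(step_percent, 101, step_percent):
        set_traffic_split(green)
        time.sleep(pause_seconds)  # let per-version metrics accumulate
        if error_rate_for("green") > ERROR_RATE_THRESHOLD:
            set_traffic_split(0)  # send all traffic back to blue
            print("error rate spiked; rolled back to blue")
            return False
    return True


if __name__ == "__main__":
    blue_green_rollout()
```

Feature teams can tune step_percent and pause_seconds to switch traffic as quickly or as slowly as they prefer.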

💡
Key Takeaway: The deployment strategy should include blue-green deployments, clusters organized into channels, version-based metrics monitoring, and stack sets for holistic management that goes beyond the deployment procedure itself.

Kingdon Barrett

Kingdon is an Open-source Support Engineer at FluxCD, DX. His research primarily focuses on exploring innovative methods for building reliable, redundant, cost-effective, and efficient systems.

According to Kingdon,

If your services are instrumented, you can add canaries to prevent a bad deploy from going out to every downstream consumer at once. Monitor the key performance indicators for the types of failures you are most concerned about, and when they start to appear in a new deploy, it can be rolled back before 15-20% of clients even reach it.

If the metrics are reliable and you use a load tester (the way Flagger canaries are described in the documentation), you can even prevent failures before 1% of users see them, using pre-rollout hooks.
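The progression Kingdon describes can be sketched roughly as follows. This illustrates the idea rather than Flagger's actual API; pre_rollout_load_test, route_to_canary, and kpi_breached are hypothetical stand-ins for a pre-rollout load-test hook, traffic routing, and KPI queries against instrumented services.

```python
# Illustrative sketch of progressive canary analysis; not Flagger itself.
CANARY_STEPS = [1, 5, 10, 15]  # percent of clients exposed to the new version


def pre_rollout_load_test() -> bool:
    # Hypothetical hook: run synthetic load against the canary before any real
    # traffic reaches it, so failures can surface before 1% of users see them.
    return True


def route_to_canary(percent: int) -> None:
    print(f"exposing {percent}% of clients to the canary")


def kpi_breached(version: str) -> bool:
    # Placeholder: check the instrumented KPIs you care most about
    # (error rate, latency, saturation) for this version.
    return False


def progressive_canary() -> bool:
    if not pre_rollout_load_test():
        return False  # never expose real users to a release that fails under load
    for percent in CANARY_STEPS:
        route_to_canary(percent)
        if kpi_breached("canary"):
            route_to_canary(0)  # roll back before most clients are affected
            return False
    route_to_canary(100)  # promote once every step passes
    return True


if __name__ == "__main__":
    progressive_canary()
```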

💡
Key Takeaway: Using instrumented services and reliable metrics, along with canaries, helps prevent deployment failures before they affect users.

David A. Symons

David is passionate about assisting businesses in their journey to initiate, develop, and expand, particularly in collaboration with developers and engineers. His expertise lies in navigating the realms of containers, cloud technology, and Kaizen methodologies.

David keeps it succinct and simply recommends the "Shift Left" approach. The technique involves integrating testing and quality assurance earlier in the software development lifecycle, typically starting from the initial stages of design and coding.

This approach aims to identify and address potential issues and defects as early as possible, thereby reducing the likelihood of costly fixes later in the development process.
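As a toy example of shifting a check left, the hypothetical parse_discount function below ships with a unit test that runs on every commit (for instance under pytest), so an out-of-range value is rejected at coding time rather than discovered in production.

```python
# Toy illustration of shift-left testing; parse_discount() is a hypothetical
# function used only for this example.

def parse_discount(value: str) -> float:
    """Parse a discount percentage, rejecting values outside 0-100."""
    discount = float(value)
    if not 0 <= discount <= 100:
        raise ValueError(f"discount out of range: {discount}")
    return discount


def test_parse_discount_rejects_out_of_range():
    # Runs in CI on every commit, long before the code reaches production.
    try:
        parse_discount("150")
    except ValueError:
        return
    raise AssertionError("expected an out-of-range discount to be rejected")
```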

Breaking Down Team Overlap: Platform Engineering, DevOps, and SRE

We also spoke about the roles Platform Engineering, DevOps, and SRE play in reliability and feature releases.

According to Scott, “I don't agree that mature SRE, DevOps, and Platform Engineering teams shouldn't have full overlap. The divisions should be business lines, not technical lines.

DevOps and SRE encompass practices that platform engineering teams must also become proficient in. It's one of the reasons why a strong platform engineering team is not easy to build.

It's like Maslow's hierarchy of needs. For platform engineers, you start at technical proficiency, grow into operational efficiency, and eventually move on to a fully realized developer experience.”
💡
Key Takeaway: While automating parts of the checklist with code that validates features during deployment can be beneficial, relying solely on a static checklist can hinder progress.

When a platform team becomes overly focused on checklists, they risk neglecting user needs and hindering continuous improvement. This "wall" of manual tasks becomes a burden for developers and lacks adaptability to changing needs.

And, that wraps up the insights from our expert discussions.

If you're fascinated by reliability and the intricate process of recovering from downtime, check out our podcast - Incidentally Reliable - where veterans from Amazon, Walmart, BookMyShow, and other leading organizations share their experiences, challenges, and success stories!

Did we overlook any aspects in this discussion? Share your stories and approaches with us—we're eager to hear from you!