Site Reliability Engineer – Necessity or Luxury?

Since its founding in 1998, Google has consistently reshaped the digital space by launching some of the most impactful applications, such as Google Docs, Gmail, and Google Maps. So, when it pioneered a movement in application development and operations called Site Reliability Engineering (SRE), it generated massive interest in the industry, leading to high demand for Site Reliability Engineers.

SRE has gained significant momentum recently because of its ability to end the long-running battles between operations and developers while boosting accountability, reliability, and innovation. However, many IT managers are still unsure about the true value of a Site Reliability Engineer and the benefits of adding one to their teams. So, let us dive into the concept of Site Reliability Engineering and find out whether hiring a Site Reliability Engineer is a necessity or just another trending concept in the industry.

What is SRE?

Ben Treynor, Vice-President of Engineering at Google, defines Site Reliability Engineering simply as “what happens when a software engineer is tasked with what used to be called operations”. However, this simple idea has the potential to solve one of the biggest conflicts faced by IT managers.

  • Dev vs Ops – Conflict

Typically, the conflict arises because developers focus primarily on making new features available to the general public and seeing them through to success, whereas the operations team's primary aim is to ensure that new features do not break the system. This often leads to a constant tug of war: operations tries to stall many of the new features being pushed, while developers keep looking for new and creative ways around the roadblocks that operations puts up.

Devs vs Ops Battle Solved by SRE

  • Site Reliability Engineer to the Rescue

This conflict can be resolved by applying a simple mathematical rule that removes subjectivity about which features can be released and when. In addition, the role of the Site Reliability Engineer is introduced to ensure product reliability.

How SREs Work

It is widely accepted that most applications do not achieve 100% uptime. This is where the SRE team is instrumental in setting up a Service-Level Agreement, commonly known as an SLA. The SLA sets the benchmark for the reliability that end-users can expect from a system. (In Google's own SRE practice, the internal target is called a Service-Level Objective, or SLO, and the SLA is the external commitment built on top of it.) Based on the SLA, an error budget is determined. For example, if the SLA is set at 98%, then the remaining 2% is the error budget. An error budget, as the name suggests, is the maximum amount of outages and errors that is allowed over a given period.
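The arithmetic behind an error budget can be sketched in a few lines of Python. This is only an illustration, assuming the target is measured as time-based availability over a 30-day window; the function name `error_budget_minutes` and the window length are assumptions made for the example, not something prescribed by SRE practice.

```python
def error_budget_minutes(target_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, allowed by an availability target.

    Assumption: the target is time-based availability over a fixed window.
    """
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    # The error budget is whatever the target leaves over, e.g. 2% for 98%.
    return window_minutes * (100 - target_percent) / 100


if __name__ == "__main__":
    # For the 98% example: 2% of a 30-day window.
    print(error_budget_minutes(98.0))
```

Under these assumptions, a 98% target leaves roughly 864 minutes (about 14.4 hours) of allowed downtime in a 30-day window, while a 99.9% target leaves only about 43 minutes.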

Such a system gives the development team significant autonomy, because they are free to spend the error budget: as long as the product is running smoothly with no or negligible errors, the developers can add new features whenever they want. On the other hand, if the error budget set by the SLA has been fully spent or exceeded, all new launches are temporarily frozen until errors fall back to a level within the budget. As a result, SREs and developers alike have an incentive to reduce errors and make the product reliable.
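The release rule described above can be expressed as a small check. The sketch below is hypothetical: the class `ErrorBudget`, its field names, and the choice to count failed requests (rather than minutes of downtime) are all assumptions made for illustration, and real teams define their own measurements and freeze policies.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Tracks failures against the budget implied by a reliability target."""

    target_percent: float   # reliability target, e.g. 98.0
    total_requests: int     # requests served in the current window
    failed_requests: int    # requests that failed in the same window

    def allowed_failures(self) -> float:
        # The share of requests that may fail without breaching the target.
        return self.total_requests * (100 - self.target_percent) / 100

    def remaining(self) -> float:
        return self.allowed_failures() - self.failed_requests

    def releases_allowed(self) -> bool:
        # Launches are frozen once the budget is fully spent or exceeded.
        return self.remaining() > 0


if __name__ == "__main__":
    budget = ErrorBudget(target_percent=98.0, total_requests=10_000, failed_requests=150)
    print(budget.remaining(), budget.releases_allowed())
```

With a 98% target and 10,000 requests, 200 failures are allowed; at 150 failures releases continue, and at 200 or more they are frozen. The decision comes from the numbers rather than from a negotiation between teams, which is the point of the approach.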

What Does a Site Reliability Engineer Bring?

Although Site Reliability Engineering (SRE) is still at a nascent stage, some of the biggest technology giants like Dropbox, Reddit, GitHub, and Netflix are already using it to their advantage, particularly during the design and development phases of large systems. This is because Site Reliability Engineers:

Merits of a Site Reliability Engineer

  • Innovate

Site Reliability Engineers are intensely focused on reliability, even though they are aware that not all the products they work on will be launched in the market. Hence, they constantly look for new and innovative approaches to solving issues rather than taking a narrow view of the end product alone. This increases the possibility of creating disruptive products.

  • Automate

SREs write and manage code that supports the system while also automating processes to maintain and improve reliability. Automation eliminates laborious and time-consuming work such as manual reprogramming, thus reducing the pressure on the in-house team. SREs use automation to identify and address flaws without human intervention.

  • Collaborate

Site Reliability Engineers bring a collaborative mindset, which is particularly critical during system or development failures. Moreover, they help the development process move more quickly without violating the rules of system administration. In addition, they are excellent communicators who unite the team around solutions and away from the divisive habits that are prevalent in many teams.

  • Have Multiple Skills

The presence of an SRE is particularly valuable because the role combines the skills of a developer and a systems administrator. This reduces the conflicts that arise during development and troubleshooting from differing methodologies and ideals. Site Reliability Engineers bring the perspectives of both developers and systems administrators to create a balanced operational system.

SRE vs DevOps – Difference

The functions of SRE and DevOps may seem similar to many, as both are methodologies that address the needs of an organization’s production operations management. However, the difference between the two concepts is important to understand.

A DevOps team typically raises an issue and sends it back to the developers for a solution; an SRE, on the other hand, not only identifies issues but can also solve some of them. As a result, SREs tend to be less conservative than DevOps teams when making changes to the production environment, because they are usually more confident and capable of making quick changes and updates while keeping production stable.

Therefore, while DevOps focuses strictly on maintaining a stable production environment, SREs put equal emphasis on maintaining a stable production environment and on improving the efficiency and performance of operations.

Conclusion

A Site Reliability Engineer’s main responsibilities overlap with day-to-day DevOps work: developing software with limited technical debt, with a greater emphasis on fixing production issues. SREs play a vital role between development and operations teams, ensuring timely releases that do not disrupt operations and that meet the operations team’s specifications.

In fast-digitizing enterprises, building reliable software efficiently is central to organizational operations and offerings. Site Reliability Engineers support the sustainable growth of a digital enterprise by monitoring performance, incident response, and application availability. The role serves to centralize accountability for these functions. Hence, the value Site Reliability Engineers offer is arguably tied to the maturity of an organization’s digital transformation.