In 2021, Is A Site Reliability Engineer A Necessity Or A Luxury?

Take a look through the tech job sites, and a large number of firms are hiring site reliability engineers for good money. From Apple to Netflix and JP Morgan, down to high-growth fintechs and many smaller firms, any firm with a key online presence is recruiting specialists to keep it that way. 

Site reliability engineers work within an SRE model and are responsible for, drum roll, site reliability engineering (generally known as SRE). The term was coined at Google in 2003 to describe the work of making the search engine company’s already large-scale sites and services more reliable, efficient, and scalable.


According to the 2020 DevOps Skill Survey, SRE adoption grew from 10% in 2019 to 15% in 2020, a pace that looks set to accelerate as SREs become a common part of the digital landscape. In the UK, typical roles offer around £60-65K in remuneration but, depressingly, the profession is 98.4% male.

Where Google led, many followed. Now, most firms with a substantial online presence use the wizardry of SRE to keep things functioning and drive business output growth in response to digital innovation. In wider terms, the role supports companies developing automated solutions for operational issues, including:

  • service monitoring,
  • error budget management and disaster response,
  • performance and capacity planning.

In the modern digital business environment, the SRE works between development and operations, in concert with core DevOps roles. Their combined aim is to achieve continuous service delivery, infrastructure automation and other key elements beyond keeping the lights on.

Today’s SREs collaborate closely with a range of key roles, including product developers, to ensure that IT solutions support availability, performance, security, and maintainability across the business. They also work with release engineers to create an efficient software delivery pipeline.

Of course, firms without SRE in-house are now hiring, which explains the gold rush of highly paid career opportunities. This answers the original question: if all these firms are hiring them, SREs are very much a necessity in the evolving IT landscape.

SREs Help Solve IT Managers’ Greatest Issues 

The rush to deliver new services and features to clients or customers puts constant pressure on developers, DevOps and related roles. They have to maintain the cadence of updates without breaking products and their supporting services. In the tug of war between the roles, the SRE’s main tasks for a company’s sites and services are to:

  • Deliver consistent performance, uptime and availability.
  • Ensure site security and redundancy.
  • Develop ways to detect issues early.
  • Use a measurement framework to track reliability.
  • Use automation to reduce hands-on management.
  • Maintain a comprehensive understanding of current and future needs.

That is a lot of hats for one role to wear. Various elements are often managed through collaboration with DevOps engineers and other roles or are shared out while the SRE takes ownership of the portfolio and reports to the IT leadership. 

The key north star goal of an SRE is to:

“Perform continuing reliability analysis of existing infrastructure, focused on removing performance bottlenecks while optimising the infrastructure and workflows to deliver operational resilience and long-term digital growth.”

They achieve this goal in a number of ways through their own expertise and collaboration with engineers, customers, and product owners. 

SREs help set the uptime and availability targets through service-level agreements or indicators. For SREs, the key tool is the “error budget”, which strikes the right balance between the need for feature development and availability.

The automation effort is common across many business areas, with the SRE tasked with reducing mundane IT tasks across maintenance and operations. Automating these and providing dashboard overviews enables engineers and other roles to focus on critical tasks and strategic-level planning.

Those strategies are based on the technology and services currently in use, and those coming from vendors or in-house development across future product cycles. The use of AI and machine learning to detect and predict errors helps with the early identification of issues and with taking preventative measures.

Site Reliability Engineers Are The Voice Of Reason

When taking the middle ground between developers and operations, SREs use logic and evidence to build their arguments. These help resolve issues by implementing a mathematical formula to cut out subjectivity around releases.

A service-level objective (SLO) sets the benchmark for the reliability of a system as experienced by end-users. The error budget follows from that objective. For example, 99.9% is a familiar service uptime target, leaving an error budget of 0.1%. That split could shift to 98%/2% depending on the value and volume of work, the throughput, and the tolerance of the users.
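To put those percentages in concrete terms, here is a minimal Python sketch that converts an uptime target into the downtime it allows. The function name `allowed_downtime_minutes` and the 30-day window are assumptions made for illustration; real teams pick their own measurement window.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an uptime target leaves over the window.

    Hypothetical helper: the 30-day default window is an assumption.
    """
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    # The error budget is whatever the target does not promise.
    error_budget_fraction = 1 - target_percent / 100
    return window_minutes * error_budget_fraction


if __name__ == "__main__":
    for target in (99.9, 98.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime over 30 days leaves about {minutes:.1f} minutes of downtime")
```

On these assumptions, a 99.9% target leaves roughly 43 minutes of downtime in a 30-day month, while a 98% target leaves about 864 minutes, or a little over 14 hours.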

Developers can build and update their products and deploy them within that error budget. As long as the product’s errors stay within the budget, they are free to add new features at a pace that suits the business.

Conversely, when the error budget is exceeded, other updates or launches are frozen until the number of errors is reduced, with all developer effort focused on the required fixes. This gives developers the incentive to reduce errors and improve reliability across all stages of a product life-cycle.
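The freeze rule described above can be sketched as a simple check. This is an illustrative Python example, not a standard tool: the names `BudgetStatus` and `check_error_budget` are hypothetical, and it assumes reliability is measured as failed requests against total requests over some window.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float   # failures the SLO tolerates in this window
    consumed_fraction: float  # share of the budget used so far (1.0 = all of it)
    releases_frozen: bool     # True once the budget has been exceeded


def check_error_budget(slo_percent: float, total_requests: int,
                       failed_requests: int) -> BudgetStatus:
    """Compare observed failures with the error budget implied by the SLO."""
    # Rounded to absorb floating-point noise in the subtraction.
    allowed = round(total_requests * (1 - slo_percent / 100), 6)
    if allowed > 0:
        consumed = failed_requests / allowed
    else:
        # No budget at all: any failure exceeds it.
        consumed = 0.0 if failed_requests == 0 else float("inf")
    return BudgetStatus(allowed, consumed, consumed > 1.0)


if __name__ == "__main__":
    # Assumed sample numbers: one million requests against a 99.9% SLO.
    for failures in (400, 1500):
        status = check_error_budget(99.9, 1_000_000, failures)
        action = "freeze releases" if status.releases_frozen else "keep shipping"
        print(f"{failures} failures: {status.consumed_fraction:.0%} of budget used, {action}")
```

The point of expressing the rule this way is that the release decision falls out of the numbers rather than out of a debate between developers and operations.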

The Tools Of A Site Reliability Engineer

As with any technology role, the SRE has a bag of tricks to help deliver the strategy for success and ensure its goals are met. 

Innovation

Many IT and DevOps roles are focused on the day-to-day issues in managing projects through to launch and successful operations. That can leave little time for looking at business or service innovation, which is where the SRE can bring new ideas and solutions to the table by looking beyond the individual product. Because their focus is aligned with digital business goals, the chances of delivering disruptive products to market improve.

Automation 

While early-generation SREs wrote and maintained code that supported production systems, the latest generation of digital natives look to automate processes wherever possible to maintain and improve reliability. They understand the business environment better and can focus on the metrics that matter to the business, not just the IT teams.

Collaboration

Collaboration is a key part of the SRE’s mindset, helping to smooth both the transition to SRE models and the response to system or development failures. SREs have strong communication skills and the ability to win over business leaders who might not see the benefits of site reliability engineering, as well as IT traditionalists who might prefer legacy and existing services.

Broad Skill Set

Part systems administrator, part developer, the successful SRE needs to understand both sides of that equation and the conflicts that arise during development and operations. Being able to troubleshoot problems that stem from differing methodologies is all part of the SRE’s job, using their perspective to create a balanced operational system.

The Difference Between SRE And DevOps

While many of these attributes might seem similar to a DevOps role, the functions of an SRE differ in several important ways. Understanding these is key to getting the best from SRE models, teams and the wider business.

In a traditional environment, DevOps can raise an issue and send it back to the developers to solve. However, the SRE can identify the issue and advise on a solution to speed up the resolution. As SREs are typically more confident and capable of making quick changes or suggestions, this helps create a stable developer, production and operational environment.

Concluding Thoughts

The Site Reliability Engineer’s role may be a 21st-century creation, but it is built on established skills and knowledge from development and production environments. While some companies may shy away from trendy roles and technologies, SREs play a key role in business improvement and building better IT services.

They play a vital role between development and operations, ensuring timely releases that avoid disrupting operations and meet the operations team’s specifications. As the demand for those new features or services increases, the role of the SRE becomes more important, and as more firms become digital businesses, SREs support sustainable growth.
