Are Site Reliability Engineers (SREs) needed?

Site Reliability Engineering (SRE), defined simply as “what happens when a software engineer is tasked with what used to be called operations”, sounds like a natural advancement of DevOps: blending software development with the broader operations of an organisation. Since Ben Treynor established this definition at Google in 2003, SRE has been on the rise and embraced by industry leaders, driving demand for this specialism. So, with Site Reliability Engineers commanding a median salary of $140k [1], is hiring them necessary, or just trendy?

A Site Reliability Engineer’s main responsibilities dovetail with day-to-day DevOps: developing software with limited technical debt, with greater emphasis on fixing production issues. They play a vital role between development and operations teams, ensuring that timely releases do not disrupt operations and that they meet the operations team’s specifications.

The remit of a Site Reliability Engineer is to:

  • Manage and improve the software development life cycle.
  • Support system design, capacity planning and launch reviews.
  • Maintain, support and monitor live services for availability, latency and overall health.
  • Advise on and design improvements to system reliability.
  • Provide defensible incident response.
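
To make the monitoring item in the remit above more concrete, here is a minimal Python sketch of how availability and latency might be summarised for a live service. Everything in it is an illustrative assumption rather than a standard: the Request record, the function names, and the thresholds (99% availability, 500 ms at the 95th percentile) are hypothetical, and a real team would normally take these figures from its monitoring platform and its own service level objectives.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    """One observed request (hypothetical record for this sketch)."""
    latency_ms: float
    ok: bool


def availability(requests):
    """Return the fraction of requests that succeeded."""
    if not requests:
        raise ValueError("no requests to summarise")
    return sum(r.ok for r in requests) / len(requests)


def latency_percentile(requests, pct):
    """Return the nearest-rank percentile of request latency."""
    if not requests:
        raise ValueError("no requests to summarise")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(r.latency_ms for r in requests)
    rank = ceil(pct / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def is_healthy(requests, min_availability=0.99, max_p95_ms=500.0):
    """Apply two assumed thresholds: availability and p95 latency."""
    return (
        availability(requests) >= min_availability
        and latency_percentile(requests, 95) <= max_p95_ms
    )
```

Calling is_healthy on a recent window of requests gives a single yes/no signal that could feed an alert or a launch review; which thresholds are appropriate depends on what the organisation has promised its users.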

In a digitised enterprise, building reliable software efficiently is central to an organisation’s operations and offerings. Site Reliability Engineers support sustainable growth by monitoring performance, incident response and application availability. The role serves to centralise accountability for these functions. The value Site Reliability Engineers offer is arguably tied to the maturity of an organisation’s digital transformation.