What is Site Reliability Engineering?

By | September 30, 2021

According to HOMETHODOLOGY, Site Reliability Engineering (SRE for short) is a service management model developed by Google. It treats the development and operation of large distributed systems as closely linked, and its practices are a concrete implementation of the DevOps philosophy.

Site Reliability Engineers bridge development and operations by applying a software engineering mindset to system administration problems. They divide their work about equally between operational and development tasks. The ideal candidate for an SRE position is therefore either a software engineer with a solid background in system administration or a highly qualified system administrator with experience in programming and automation.

SREs examine the resilience and weaknesses of the systems in operation in order to find opportunities for optimization and scaling. Finding ways to make the systems simpler to operate has the same priority. Experience gained from operating large IT infrastructures flows directly into their further development.

Part of the concept is that SREs spend no more than half of their time on operational work. Violating this rule is considered a sign of poor implementation.

Origin

The SRE concept became successful and well known through Google, which practiced it long before making the principles public. It shares the process improvement goals of the DevOps philosophy. The term DevOps was coined around 2008 and stands for a corporate culture of cross-team collaboration: all parties should be aligned on the same vision and share responsibility for success.

SRE and DevOps

Before DevOps spread, development and operations teams worked independently, each with its own goals and specifications. DevOps took hold in many companies so that these teams could communicate better and work together more smoothly.

DevOps and SRE both aim to close the gap between software development and software operation and to improve the release cycle in complex distributed systems. DevOps defines what the results should look like; it stands for a cultural change within a company. SRE turns the theoretical DevOps approach into a workflow with suitable methods and tools.

SRE includes the ongoing automation of manual tasks as well as continuous integration and continuous delivery. SREs take responsibility for operational reliability and automation throughout the entire infrastructure life cycle and monitor the deployment and operation of releases.

The 5 basic principles of the DevOps philosophy and their implementation by SRE

  1. Dismantle organizational silos

Large companies have a complex organizational structure with a large number of teams that often work separately in “silos”. Each team has a different view of the whole, which leads to inefficiency. The task of DevOps and SREs is to better coordinate the teams with one another towards the overarching goals and towards a common vision.

  2. See failure as part of the process

DevOps assumes that failure is part of the process and that teams can learn from it. SREs ensure that errors and outages do not become too frequent. To do this, they use two measures to weigh failures against the release of new versions: service level indicators (SLIs) and service level objectives (SLOs). An SLI measures an aspect of service behavior over time, such as the failure rate. An SLO is the target value that such a metric, for example uptime or response time, must meet; it can also form part of a service level agreement.

From the SRE perspective, the business and IT sides need a shared understanding in order to set suitable service level objectives and service level indicators. Each violation leads to the objectives being reassessed and optimized.

The SRE guidelines encourage radical changes within certain limits. SREs have a risk budget to test these limits and thus potentially innovate faster. SRE quantifies this acceptable risk as the “error budget”. When the error budget is exhausted, the focus shifts from feature development to improving reliability. This balances availability against further development.
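
The error budget arithmetic can be sketched in a few lines of Python. This is only an illustration under assumed numbers: the function name `error_budget`, the 99.9% availability target, and the request counts are hypothetical and do not come from the article or from Google's guidance.

```python
def error_budget(slo_target: float, total_events: int, failed_events: int) -> dict:
    """Compare a measured SLI with an SLO and report the remaining error budget.

    slo_target is the required share of good events, e.g. 0.999 (assumed value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")

    # SLI: share of events that succeeded in the measurement window.
    sli = (total_events - failed_events) / total_events
    # Error budget: number of failures the SLO tolerates in the same window.
    allowed_failures = (1.0 - slo_target) * total_events
    remaining = allowed_failures - failed_events

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        "budget_exhausted": remaining <= 0,
    }


# Hypothetical window: one million requests, 400 of which failed.
report = error_budget(slo_target=0.999, total_events=1_000_000, failed_events=400)
print(report)
```

With these assumed figures, roughly 1,000 failures are tolerated and about 600 remain, so the team could keep shipping changes. Once `budget_exhausted` becomes true, the focus would shift to reliability work.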

  3. Implement changes in quick, small steps

Like DevOps, SRE promotes continuous improvement through small and frequent development steps. With short iteration cycles, any negative effects are less serious, and low-risk improvements can easily be tested (automatically where possible) and implemented.

  4. Use common tools and automation

Incompatibility and integration issues between technologies from different vendors, eras and use cases can create silos even in a DevOps environment. SRE introduces uniform technologies and broad access to information across the various IT teams, and it demands that all teams working on the same service use the same technologies.

SRE follows the principle of automating manual tasks that are repetitive and reactive and do not bring lasting improvement. Automation should free up capacity for work that brings long-term benefits.

  5. Base reliability on measurement data

Setting suitable reliability goals depends on context and is a challenge for both DevOps and SREs. SREs ensure that all levels in the company agree on how to measure reliability and what to do if the measured value does not meet the requirements.

The key DevOps metrics are the number of deployments per unit of time, the lead time from commit to release, the number of failed deployments, and the time required to recover.
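
These four metrics can be computed from a simple deployment log. The following Python sketch assumes a hypothetical record type `Deployment` and a helper `delivery_metrics`; both names, the field layout, and the sample timestamps are invented for illustration and do not reflect any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    released_at: datetime
    failed: bool = False
    # Set only for failed deployments once service was restored (assumption).
    recovered_at: Optional[datetime] = None


def delivery_metrics(deployments: list, period_days: int) -> dict:
    """Summarize deployment frequency, lead time, failures and recovery time."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    lead_times = [d.released_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [
        d.recovered_at - d.released_at for d in failures if d.recovered_at is not None
    ]

    return {
        "deployments_per_day": len(deployments) / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / len(lead_times),
        "failed_deployments": len(failures),
        "mean_recovery_time": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }


# Hypothetical one-week log with two deployments, one of which failed.
t0 = datetime(2021, 9, 1, 9, 0)
log = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(
        t0 + timedelta(days=1),
        t0 + timedelta(days=1, hours=8),
        failed=True,
        recovered_at=t0 + timedelta(days=1, hours=9),
    ),
]
metrics = delivery_metrics(log, period_days=7)
print(metrics)
```

Averages are used here for brevity; a team might prefer medians or percentiles, which are less sensitive to a single slow release.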

The basis for SRE is the service level objective (SLO) and the service level indicator (SLI). SREs use these metrics to decide whether or not a release containing a change will go live.
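
One way such a go/no-go decision could look is sketched below in Python. The `Objective` type, the `release_allowed` function, the metric names, and the target values are all hypothetical. Treating a missing measurement as a violation is a design assumption of this sketch, not something the article prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    name: str
    target: float
    # True for metrics like availability, False for metrics like latency.
    higher_is_better: bool = True


def release_allowed(objectives: list, measurements: dict) -> tuple:
    """Return (allowed, violations) by checking each measured SLI against its SLO."""
    violations = []
    for obj in objectives:
        value = measurements.get(obj.name)
        if value is None:
            # Assumption: without data we cannot show the SLO is met, so block.
            violations.append(f"{obj.name}: no measurement")
            continue
        met = value >= obj.target if obj.higher_is_better else value <= obj.target
        if not met:
            violations.append(f"{obj.name}: measured {value}, target {obj.target}")
    return (not violations, violations)


# Hypothetical SLOs and current SLI readings.
objectives = [
    Objective("availability", 0.999),
    Objective("p95_latency_ms", 300.0, higher_is_better=False),
]
allowed, violations = release_allowed(
    objectives, {"availability": 0.9995, "p95_latency_ms": 340.0}
)
print(allowed, violations)
```

In this assumed reading, availability meets its objective but latency does not, so the release would be held back until the latency objective is met or renegotiated.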
