When you run an eCommerce business, sooner or later a client may tell you that the site is down. You contact your hosting provider. They look at their charts and see a surge in customer activity, but the server got through the peak and now performs well. It might have been a DDoS attack or a coincidence. Some time later the server slows down again, and page speed drops until requests time out completely. You can try to explain it by a peak in requests that “broke” the server, but the true cause remains hidden.

We will not analyze diagrams from a real production system here. During a crash, engineers rarely have time to capture them, and a server that has run out of CPU may be unable to log errors or report metrics at all.

In this article, we will talk about SRE, an approach to building services around monitoring so that they are less prone to sudden crashes.

What is SRE?

As Google, the originator of the approach, explains, Site Reliability Engineering (SRE):

‘is what you get when you treat operations as if it’s a software problem’.
Google

Reliability needs to be managed, and SRE experts look after service availability, latency, performance, and capacity. Their job is a combination not found elsewhere in the IT industry. It is close to Ops in that site reliability engineers keep high-priority, revenue-critical systems up and running despite natural disasters or human error. It differs from Ops in that software is treated as the primary tool for managing and maintaining systems. The job requires skills from both disciplines to drive reliability and performance across multiple projects.

Background behind SRE

Back in the day, the common practice in web development was that businesses came up with a concept, architects designed solutions, and developers wrote code. Someone tested the product, someone delivered it to the end user, and at the end of the release a lonely business owner waited for a cost-effective product that had been designed, developed, and tested to match the initial concept. The process lacked coordination and cooperation.

Google experts explain why the DevOps movement appeared. Developers wanted to write code and release new features quickly. They shipped their code to the Ops team, which was responsible for running it. Tension between the two teams emerged because their priorities differed. DevOps culture arose to link the two teams and to build products with both performance and application needs in mind. That is how DevOps, a word combining development and operations, came about. With the DevOps approach in place, developers write code, DevOps engineers turn the system described as code into actual systems, and the code is delivered. CI/CD practices are introduced to implement DevOps. However, the DevOps philosophy remained an abstract approach without explicit instructions.

Eventually, a need arose for one more concept that could simultaneously support the infrastructure, handle monitoring, resolve incidents, and even deal with code delivery. That is when Google started developing and expanding the SRE approach.

SRE evolved into an implementation of the DevOps principles, focused on measuring and achieving reliability through engineering and operations work. SRE prescribes how to succeed in the various DevOps areas.

The difference between DevOps and SRE

In their video, Seth and Liz, Google DevOps and SRE experts, explain that Site Reliability Engineering (SRE) and Development & Operations (DevOps) do not compete but are more like close friends. The two disciplines overlap and can extend each other. An SRE practitioner should understand both methods in order to apply them effectively and not swing from one extreme to the other.

In the table below, the five pillars of DevOps are mapped to their corresponding SRE practices:

*DevOps*	*SRE*
Reduce organization silos	Share ownership with developers by using the same tools and techniques across the stack
Accept failure as normal	Have a formula for balancing accidents and failures against new releases
Implement gradual change	Encourage moving quickly by reducing costs of failure
Leverage tooling & automation	Encourages “automating this year’s job away” and minimizing manual systems work to focus on efforts that bring long-term value to the system
Measure everything	Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.

Originally published on Google

Read more about the difference on our Partner’s page.

What are the benefits of DevOps for a business?

Developers build high-performing tools and services;
Operations teams get best practices after cooperating with developers;
Each product is accompanied by its application programming interface (API) document.

Areas that benefit the most from applying SRE:

Projects where service availability, latency, or performance is critical;
Change management and capacity planning;
Monitoring and emergency response.

Key components for the best SRE practices

The SRE discipline sets the service availability targets and measures this availability based on the inputs provided by engineers, product owners, and customers. It ensures that everyone agrees on how to measure availability, and what to do when availability falls out of specification. This approach adds shared responsibility across the company from VPs to programmers.

Service Level Indicators (SLIs) and Service Level Objectives (SLOs)

Site reliability engineers work with stakeholders to agree on Service Level Indicators (SLIs) and Service Level Objectives (SLOs), the core SRE metrics.

SLIs are metrics measured over time, such as request latency, throughput in requests per second, or failures per request.
SLOs are targets, agreed with all stakeholders, that define what counts as success for an SLI within a time window (like “this quarter” or “this month”).
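
To make the distinction concrete, here is a minimal Python sketch of how two SLIs might be computed from a window of requests and compared with an SLO. The `Request` type, the function names, and the sample data are all hypothetical and chosen only for illustration; a real monitoring system would read these values from its own metrics store.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # True if the request succeeded


def availability_sli(requests):
    """SLI: fraction of requests that succeeded."""
    if not requests:
        return 1.0  # assumption: an empty window counts as fully available
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests, threshold_ms):
    """SLI: fraction of requests served faster than threshold_ms."""
    if not requests:
        return 1.0  # same assumption as above
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


def meets_slo(sli, slo):
    """An SLO is met when the measured SLI is at or above the agreed target."""
    return sli >= slo


# Hypothetical measurement window of five requests
window = [
    Request(latency_ms=20, ok=True),
    Request(latency_ms=35, ok=True),
    Request(latency_ms=80, ok=True),
    Request(latency_ms=90, ok=False),
    Request(latency_ms=45, ok=True),
]

print("availability SLI:", availability_sli(window))
print("latency SLI (< 50 ms):", latency_sli(window, 50))
print("meets a 99.99% SLO:", meets_slo(availability_sli(window), 0.9999))
```

The SLI is what is measured; the SLO is the number it is compared against. Which requests count and how long the window lasts are exactly the points stakeholders need to agree on.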

Service Level Agreement (SLA)

Although not part of the daily routine of SRE, an SLA is an agreement between a service provider and a service consumer on the level of service availability that both sides accept.

Risk and error budgets

An error budget is one more concept central to SRE. SRE treats risk as normal: pushing a system’s stability to 100% is counterproductive. Google experts say:
Users typically won’t notice extreme availability (like 99.999999%) because the quality of their experience is dominated by less reliable components (…). The SRE discipline quantifies this acceptable risk as an “error budget.” When error budgets are depleted, the focus shifts from feature development to improving reliability.
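
As a rough illustration, an error budget can be translated into the amount of downtime a service may accumulate over a window. The sketch below assumes a time-based availability SLO and a 30-day window; the function names are hypothetical.

```python
def error_budget(slo):
    """Error budget as a fraction: the share of time (or requests) allowed to fail."""
    return 1.0 - slo


def allowed_downtime_minutes(slo, window_days):
    """Downtime the budget permits over the window, in minutes."""
    return error_budget(slo) * window_days * 24 * 60


# Assumed targets, for illustration only
for slo in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(slo, 30)
    print(f"SLO {slo:.2%}: about {minutes:.1f} minutes of downtime per 30 days")
```

Under these assumptions, a 99% target leaves roughly 432 minutes per 30 days, 99.9% leaves about 43 minutes, and 99.99% leaves a little over 4 minutes, which shows how quickly each extra “nine” shrinks the room for change.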

Toil and automation

An important concept in the SRE discipline is toil: manual, repetitive operational work that could be automated. The discipline aims to reduce toil by focusing on the “engineering” component of Site Reliability Engineering. When SREs find tasks that can be automated, they engineer a solution to prevent that toil in the future.

Customer Reliability Engineering (CRE)

Customer Reliability Engineering, or CRE, is the final key to SRE practices. A company that has successfully applied SRE can teach SRE practices to its customers and service consumers to help them make their services reliable and available, and keep productivity high with less toil and more realistic expectations.

How does SRE find a tradeoff between speed and stability?

SRE sets out specific practices for finding a tradeoff between the speed of code delivery and system reliability:

It defines the acceptable level of service. Example: latency < 50 ms.
It determines the share of requests that must meet that level. Example: the Service Level Objective (SLO) is set to 99.99%.
It admits that 100% quality cannot be achieved and that an error budget should be allowed for. Example: Error Budget = 1 – SLO.
A new service or product can be changed or released to production as frequently as required until the rate of bad requests surpasses the error budget.
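
The four points above can be sketched as a simple release gate. This is only an illustration under stated assumptions: a request counts as “bad” when it fails or takes 50 ms or longer, the SLO is 99.99%, and the names `is_bad`, `bad_request_rate`, and `can_release` are hypothetical rather than part of any real tool.

```python
LATENCY_LIMIT_MS = 50  # acceptable service level from the example above
SLO = 0.9999           # share of requests that must be good


def is_bad(latency_ms, ok):
    """A request is bad if it failed or was not served in under the limit."""
    return (not ok) or latency_ms >= LATENCY_LIMIT_MS


def bad_request_rate(total_requests, bad_requests):
    """Observed share of bad requests in the current window."""
    if total_requests == 0:
        return 0.0  # assumption: no traffic means nothing has been spent
    return bad_requests / total_requests


def can_release(total_requests, bad_requests, slo=SLO):
    """Releases continue while the bad-request rate stays within 1 - SLO.

    Note: a rate sitting exactly on the boundary is subject to
    floating-point rounding, so the examples below avoid that case.
    """
    error_budget = 1.0 - slo
    return bad_request_rate(total_requests, bad_requests) <= error_budget


# Hypothetical window of one million requests
print(can_release(1_000_000, 50))   # 0.005% bad, within the 0.01% budget
print(can_release(1_000_000, 150))  # 0.015% bad, budget exceeded
```

Once `can_release` turns false, the team's focus moves from shipping features to restoring reliability until the window rolls over or the rate recovers.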

Who needs SRE?

SRE techniques can be useful for small, midsize, and large businesses alike, as the discipline gives plenty of practical advice for projects where system reliability is the first priority.

In small companies, hiring an SRE expert helps to dissect errors, fix the elements that might fail during a system crash, and slow down releases when the error budget runs out.

In enterprises with established engineering processes, SRE can be useful as well. Even if such a company has a dedicated operations department and practices DevOps, it may need to change its engineering structure, and SRE’s explicit approach will help to achieve this.

Closing

Google is exceptionally good at operating giant systems, and SRE is the set of practices and methods that helped it develop such a highly effective technology culture.

Based on the experience of Google and other companies like IBM and Yandex, Simtech Development is introducing SRE practices into its processes. Our DevOps and SRE experts have long been working under SLAs. They have established SLIs and SLOs to work against and to manage the level of risk. Our experts are now extending the culture of keeping projects reliable with the SRE approach and teaching our customers how to apply SRE to their projects. Simtech Development provides training in SRE practices to customers on the Enterprise Cloud Hosting plan upon their specific request. Our SRE practitioners will research and develop SLIs / SLOs for the Enterprise + clients to resolve ongoing issues cooperatively and improve overall performance.


Site Reliability Engineering: What Is SRE and Should eCommerce Rely On It?