Site Reliability Engineer – the concept of reliability

Date: 20/07/2021 | Category: IT Governance & Service Management

In this article Fabio Mora, Software Engineer, Agile Coach, DevOps expert and author, delves into some more practical and technical aspects of the Site Reliability Engineer profession and some fundamental concepts, in particular that of reliability.

Reliability

Site Reliability Engineering (SRE) means working on the most important functionality of a system: reliability, a “feature” that precedes any other. To illustrate its importance, imagine that you need to use a service whose operation depends, in whole or in part, on computer systems, electronics, telecommunications and related industries. For any online service, reliability is critical.

The assistant that wakes you up in the morning and streams your favorite radio station may not be essential, but the smartphone that lets you interact with relatives and friends and manage documents and appointments is definitely critical to the quality of your daily life.

If your smartphone is “not available” and apps such as your bank account, Google, social networks and Wikipedia do not work, your routine could quickly run into problems. With varying degrees of criticality, these are, under the hood, very sophisticated platforms that interact with each other, balance themselves and often consist of millions or even billions of lines of code, along with a great many hardware devices.

The functions that these devices and apps offer correspond to possibilities in the real world. They should therefore be kept efficient and responsive for those who use them, with a quality of service that lives up to their needs. This is called reliability.
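To make the idea of a quality of service that lives up to the needs a little more concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that reliability is approximated by availability, that is, the fraction of requests served successfully. The function names and the 99.9% target are hypothetical and do not come from the article.

```python
def availability(successful: int, total: int) -> float:
    """Return the fraction of requests that were served successfully.

    Illustrative assumption: a period with no traffic counts as fully available.
    """
    if successful < 0 or total < 0 or successful > total:
        raise ValueError("expected 0 <= successful <= total")
    if total == 0:
        return 1.0
    return successful / total


def meets_target(successful: int, total: int, target: float = 0.999) -> bool:
    """Check the measured availability against a (hypothetical) target."""
    return availability(successful, total) >= target


if __name__ == "__main__":
    # Example figures, invented for illustration only.
    print(f"availability: {availability(9_990, 10_000):.4f}")
    print(f"meets target: {meets_target(9_990, 10_000)}")
```

A real service would usually combine several such measures (latency, error rate, durability) rather than rely on a single ratio.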

The immensity of the system

To illustrate: downloading a file from your Drive may appear to be a simple gesture, but behind it lies a very long chain of events. From the mobile radio network, the data flow travels encapsulated and encrypted through optical fiber and transoceanic cables that carry it within milliseconds to a remote datacenter, and back. In turn, these infrastructures depend on data links that let them communicate with each other, on network services and hardware and, even further upstream, on the energy and gas networks.

The same is true of everything from the POS terminals used to pay in-store to ticketing services, to railway, motorway, aeronautical and civil signaling networks, to remote surgery and medical diagnoses in the cloud. It is also true of the logistics behind each package delivered by courier, the work of the “riders” and the temperature-controlled chains used to transport food and drugs. All of these are pivotal platforms for entire sectors and for the quality of personal life: industry, communication, education, marketing, media, health, public administration, democratic processes, almost the entire service sector and beyond.

Possible drawbacks

There are many things that can go wrong. First, systems become inherently unstable over time. Because of their incredible complexity, they tend to break down, and continuous work is needed to prevent this. Work on the systems and their updates must not be carried out only when “accidents”, or events of an exceptional nature, occur; it must be part of business as usual.

When they are part of business as usual, these activities can prevent inertia, obsolescence and technical debt. These are all demons that threaten not only the quality of the services, but also the possibility of continuing to introduce changes to them. The job of the SREs is to keep the systems stable, while that of the programmers is to write and maintain the functionalities of products, with continuous software releases. Each release, therefore, could introduce new errors and new complexity.

Value of SRE

Reliability is the upstream feature of any system. However, it is also a difficult feature to communicate because, when it is present, it can easily be taken for granted. For the same reason, it is difficult to always give this theme the importance it deserves. To correct this small cognitive bias, the roles and organizational structures of the SREs are often autonomous with respect to the Software Engineers, who instead build the products.

The value attributed to the SRE, therefore, is to keep these products stable on their systems: error-free, maintainable and usable for the user, no matter what is happening. The value that an SRE offers to their organization and to the users of its products is, ultimately, that of guaranteeing the stability of production systems, the maintainability of the software and a high level of service quality. All this regardless of external conditions, be it traffic peaks or continuous releases of new features.

Fabio Mora

Fabio Mora is a freelance programmer and Agile coach who is enthusiastic about Extreme Programming and Linux. Passionate about open source, economics and everything related to mathematics and data science, he first founded a web agency and then worked at eBay as a software engineer. He loves music, sound engineering and science communication.
