Let’s talk about the role of the Site Reliability Engineer with Fabio Mora, freelance programmer, Agile coach, and Extreme Programming and Linux enthusiast. In this article we look at why building robust systems is fundamental and how to move forward when incidents occur.
Building robust systems
A large share of problems can be solved by spending money or by relying on a few highly skilled professionals, but this approach is not always cheap, nor is it always safe from a business point of view.
A system that depends on many manual operations is hard to keep stable, because it is more prone to failures and accidents. So much so that it becomes unsustainable in the long term, with the risk of bringing stress and boredom to the team. However, by fixing small errors as you go and maintaining the system continuously, you can still get a great result.
Shift left strategy
It is often more convenient to start working on a product with a technical kick-off, in which to reflect on and plan the components of the system and its software architecture. From here you can move from a monolith to an event bus or microservices architecture, design the right APIs and protocols, study the patterns of use, and identify the costs of the necessary resources.
This technique is called the shift left strategy: SRE (Site Reliability Engineer) support starts with the design, not once user traffic has already arrived. From the design it continues to the implementation of the product. Consciously deploying resources and people is crucial, whether they are computers or software engineering time. This cross-functional design activity is fundamental and continuous, and occupies most of an SRE's day.
In fact, in most cases, when faced with “poorly designed” but already implemented systems (e.g. if users can already browse the site), the only convenient solution is to redesign the bad parts and rewrite them from scratch.
An SRE approaches these tasks in a programming-oriented way, to “apply computer science to computer science”: bringing stronger competence in this field than others have is the value that distinguishes this person.
In particular, knowledge of operating systems, network protocols, release pipelines and software quality control is essential, as is knowing how to move confidently between telemetry and alarm systems. Defining SLOs (Service Level Objectives), managing complexity and working in emergencies require transversal knowledge: it involves writing software that can be used to govern other software systems and, no less important, improving it in small steps to keep it stable and simple.
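To make the idea of an SLO concrete, here is a minimal sketch in Python of an availability objective and its error budget. The class name `Slo`, the service name and the 99.9% target are illustrative assumptions, not something taken from a specific monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical availability SLO measured over a window of requests."""

    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed

    def allowed_failures(self, total_requests: int) -> int:
        """Number of failed requests the error budget tolerates."""
        return int(round(total_requests * (1.0 - self.target)))

    def budget_remaining(self, total_requests: int, failed_requests: int) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        allowed = self.allowed_failures(total_requests)
        if allowed == 0:
            # No budget at all: any failure overspends it.
            return 1.0 if failed_requests == 0 else float("-inf")
        return 1.0 - failed_requests / allowed


if __name__ == "__main__":
    slo = Slo(name="checkout-availability", target=0.999)  # assumed example values
    print(slo.allowed_failures(1_000_000))
    print(slo.budget_remaining(1_000_000, 250))
```

A team could use a number like this to decide whether to keep shipping features or to spend the next iteration on stability, which is one way of improving the system in small steps.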
Interaction between SRE and other Team Members
The interaction between the SRE and the other figures of the Development Team is therefore not a trivial reallocation of roles, but a different way of thinking about product governance. The balance is in fact positive for both sides, provided the goal is to keep operations responsibilities to a minimum.
The way to do this: create ‘continuous’ systems that allow constant and secure change in large architectures, such as clusters, applications, databases or networks. From wherever you are: in the office in front of a workstation and four screens, or waiting at an airport gate with a mobile phone connection. This must be pursued so that a huge system can become manageable by a small team of ordinary people. All the while, we must keep asking ourselves whether the investment in SRE time makes sense, as this is an ongoing, strategic activity.
Another way of designing stable systems is to design with teams on standard, distributed platforms, which are easier to manage and provide mobility for professionals between teams. Think of an airline for example: often the fleet of aircraft and their configurations are similar, so that ground staff can train on a few, very precise machine types. A standard system and a landscape of finished products to maintain helps to create a norm and enables one to move from one team to another within the same organisation.
This strategy may not have a direct effect on users, but it serves to keep systems simple, and is the key to improvement. A good SRE is first of all a good Software Engineer: he must be fond of solving heterogeneous and complex problems. This means he should know what a computer does at a low level and know its operating system, be able to interact with developers and have a good flair for building automation and removing bottlenecks.
The SRE acts as the glue between the technical team and the production systems, facilitating the understanding of complexity that developers do not immediately understand because they are experts in other fields. It is not a risk-free exercise though, as the need to be quick must be combined with the need to transfer knowledge to your colleagues. When something works and you understand it individually, there is an impulse, an urgency to keep simplifying: you can’t stop at the first implementation, but you need to continuously look for simpler technical solutions.
In general, on a day-to-day basis, a development team may not urgently need to know the details of a cloud, but when the need arises, it must have the tools to do so. Moreover, people and teams change, and in IT, to solve a big problem, it is not enough to add people to a group. Instead you need long on-boarding processes, shadowing and pair-programming procedures before a person can be productive.
The SRE is not the only one responsible for the product, though: the main responsibility lies with the development team. In fact, developers must also be included in the on-call rotation, so they can see and use their product. No one likes to be woken up at night or to walk out of a movie theatre to open a laptop and connect to the network to keep the system running.
The product is always the responsibility of the development team, who know the details and drive the decisions. To be ready, you need product diagnostics tools, you need to look at graphs and alarms, but above all you need a lot of practice and prevention with simulations and random failures on which to experiment and learn (so-called Chaos Engineering).
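As a rough illustration of practising with random failures, the following Python sketch wraps a dependency so that a configurable fraction of calls fails on purpose, and then exercises a simple retry policy against it. The names `with_random_failures`, `call_with_retries` and `InjectedFailure` are hypothetical helpers written for this example; real Chaos Engineering tools operate on infrastructure rather than on a single function.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFailure(Exception):
    """Raised by the experiment itself, never by the real dependency."""


def with_random_failures(func: Callable[[], T], failure_rate: float,
                         rng: random.Random) -> Callable[[], T]:
    """Wrap func so that roughly failure_rate of the calls fail on purpose."""
    def wrapper() -> T:
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated outage")
        return func()
    return wrapper


def call_with_retries(func: Callable[[], T], attempts: int) -> T:
    """Call func up to `attempts` times, re-raising the last injected failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFailure as error:
            last_error = error
    raise last_error


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so the experiment is repeatable
    flaky_ping = with_random_failures(lambda: "pong", failure_rate=0.3, rng=rng)
    try:
        print(call_with_retries(flaky_ping, attempts=5))
    except InjectedFailure:
        print("all attempts failed: time to look at graphs and alarms")
```

Seeding the random generator is a deliberate choice here: a failure scenario that can be replayed is much easier to learn from than one that cannot.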
When an incident occurs, the company's NOC (Network Operations Center, the personnel who monitor the global network 24/7) is also involved in the SRE landscape. They don’t just look at graphs, but have much more proactive tasks. They have to facilitate communication, open videoconference bridges, track uptime and events, automate reporting and alerting procedures, as well as initiate diagnoses and direct the team. For this they can rely on the help of chatbots and machine learning.
When an incident occurs there are two objectives:
- Minimise the impact as soon as possible
- Prevent it from happening again
The classic flow of incident management involves:
- Managing and mitigating the event (troubleshooting)
- Finding the root cause (triage)
- Dealing with an impact removal phase (mitigation)
- Dealing with a post-mortem process
- Checking the consolidation and evolution of the system (long-term fix)
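The flow above can be sketched as a small state machine that only lets an incident move forward one phase at a time. This is a simplified Python illustration: the `Phase` and `Incident` names are assumptions made for the example, and real incidents often loop back to earlier phases.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Phases in the order listed in the article."""

    TROUBLESHOOTING = 1
    TRIAGE = 2
    MITIGATION = 3
    POST_MORTEM = 4
    LONG_TERM_FIX = 5


class Incident:
    """A hypothetical incident record that tracks which phases were reached."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.TROUBLESHOOTING
        self.history = [Phase.TROUBLESHOOTING]

    def advance(self) -> Phase:
        """Move to the next phase; the last phase cannot be advanced."""
        if self.phase is Phase.LONG_TERM_FIX:
            raise ValueError("incident is already in its final phase")
        self.phase = Phase(self.phase + 1)
        self.history.append(self.phase)
        return self.phase


if __name__ == "__main__":
    incident = Incident("checkout latency spike")  # assumed example title
    while incident.phase is not Phase.LONG_TERM_FIX:
        print(incident.advance().name)
```

Keeping the history makes it easy to see afterwards whether a phase such as the post-mortem was actually reached or quietly skipped.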
The indispensable tool of an SRE is therefore the Post-Mortem, i.e. the report one writes when an incident has happened or something has gone wrong.
A totally transparent and honest process is needed, otherwise it is worthless. It is not an accusatory process, but a reconstruction of the timeline of events, critical moments, causes, effects, lost data and the restoration work generated, including where we were lucky and where we were not.
The aim is to find the error that triggered the event and come up with a practical solution to ensure that the same family of problems does not occur again and, consequently, to adjust procedures. At the end of the post-mortem you get a list of bugs, tracked and to be fixed. This is the engine of continuous improvement: explaining what went wrong must be part of the SRE culture.
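One possible shape for such a report, sketched in Python: a timeline of events plus a list of tracked action items. The field names, the ticket identifiers and the sample data are all hypothetical; any real post-mortem template will carry more detail (impact, lost data, lucky breaks).

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEvent:
    at: datetime
    description: str


@dataclass
class ActionItem:
    description: str
    ticket: str  # hypothetical bug-tracker identifier
    done: bool = False


@dataclass
class PostMortem:
    title: str
    timeline: List[TimelineEvent] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def ordered_timeline(self) -> List[TimelineEvent]:
        """Events sorted by time, regardless of the order they were written down."""
        return sorted(self.timeline, key=lambda event: event.at)

    def duration_minutes(self) -> float:
        """Minutes between the first and the last recorded event."""
        events = self.ordered_timeline()
        if len(events) < 2:
            return 0.0
        return (events[-1].at - events[0].at).total_seconds() / 60

    def open_items(self) -> List[ActionItem]:
        """Action items that still need to be fixed."""
        return [item for item in self.action_items if not item.done]


if __name__ == "__main__":
    report = PostMortem(title="Checkout outage")  # invented sample data
    report.timeline.append(TimelineEvent(datetime(2024, 1, 10, 10, 45), "Service restored"))
    report.timeline.append(TimelineEvent(datetime(2024, 1, 10, 10, 0), "First alert fired"))
    report.action_items.append(ActionItem("Add alert on queue depth", "BUG-1"))
    print(report.duration_minutes(), len(report.open_items()))
```

Note that the structure records what happened and what must change, not who was at fault, in line with the non-accusatory spirit of the process.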
It is also important to size the workload in a sustainable way: ideally you should deal with only a few incidents per shift, as you need the right balance between working hours and rest. This is not only a matter of self-respect and balance, but also of safety, efficiency and responsibility towards the team. A tired person cannot find the attention and concentration that this profession requires, which makes it much easier to make mistakes. The same principle applies to pilots, drivers, doctors and other professions where the human factor is taken seriously.
To sum up: post-mortems, retrospectives and long-term corrective thinking are the three fundamental tools for putting an improvement process into practice. Otherwise, you will have no way to learn from your mistakes.
Fabio Mora on LinkedIn