Chances are you’ve heard of site reliability engineering (SRE), although you might not fully understand what it is. You might also be unaware of how important it is for blockchain networks. In this article, we’ll shed some light on this crucial area of the Blockdaemon business.

In a nutshell, SRE creates an environment of rapid, safe delivery.

At Blockdaemon, our SRE team allows us to balance development cadence with reliability. They reduce toil, provide protection and let us scale with confidence.

As blockchains continue to gain importance, SRE will play an increasing role in our organization and the industry as a whole.

Today we’ll take a deeper look at SRE, specifically covering:

  • An Introduction to SRE
  • Why SRE is Vital for Healthy Blockchains
  • How Blockdaemon Does SRE

By the end of this article, we hope you’ll understand SRE’s role and importance. It’s critical not only to Blockdaemon, but to blockchains in general.

1. An Introduction to SRE

As the name suggests, SRE (site reliability engineering) is a discipline that engineers reliability into software systems. It does this through automation, the elimination of toil and risk reduction.

It was pioneered at Google in 2003 before spreading to other organizations. The Google SRE book is considered the subject-matter bible. It’s worth checking out if you’re interested in a thorough deep-dive.

In the old days, infrastructure and operations teams were forced to take a reactive approach to issues. Traditional teams had the best intentions, but not the tools or time to get ahead of the game. Most IT functions would predominantly respond to issues and ‘firefight’.

SRE Introduces a DevOps approach to blockchain reliability

Nowadays, simply reacting to issues is less acceptable than ever. By the time you detect an issue, it may well be too late. This is especially true with blockchains, where failures have bigger impacts and are felt much more quickly. 

SRE uses software engineering principles to develop lean, risk-based solutions that increase speed and remove complexity. 

With a huge emphasis on observability and automation, SRE marks a shift away from old-school reactive ops teams.

SRE ensures blockchain ops teams are fully aligned with the Agile/DevOps culture of fast release cadence. It gives greater control and a proactive approach to issues by utilising ‘error budgets’.

You can think of an error budget as the acceptable level of faults within a system. If the number and severity of faults stay within your error budget, your users will be OK. Once you go over your error budget though, the user experience becomes unacceptable.
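To make the idea concrete, here is a minimal sketch of an error budget expressed as allowed downtime. It assumes an availability target (for example 99.9%) measured over a 30-day window; the function names, the target and the window are illustrative assumptions, not a description of how Blockdaemon calculates its budgets.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total downtime (in minutes) an availability target allows over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; a negative value means overspent."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% target with 20 minutes of downtime so far.
    print(round(error_budget_minutes(0.999), 1))    # 43.2 minutes allowed in 30 days
    print(round(budget_remaining(0.999, 20.0), 2))  # 0.54 of the budget is left
```

A team working this way can keep shipping while the remaining fraction is comfortably positive, and slow down to focus on reliability as it approaches zero.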

SRE-informed design and engineering allow organisations to build scalable, reliable, fault-tolerant systems

SRE uses automation to eliminate human errors and introduce consistency throughout the ops process. This is in combination with rapid incident response through a robust observability stack and on-call service.

It’s also important to note that SRE is fully focussed on issue prevention, not finger-pointing. When issues are detected, blameless incident postmortems are carried out to analyse data and avoid recurrences.

2. Why SRE is Vital for Healthy Blockchains

Due to its distributed nature, blockchain technology has an intrinsic level of resilience. After all, the ledger exists in multiple places at the same time. You might think, therefore, that individual node uptime is less important.

In fact, with blockchains, uptime is as important as ever, if not more so.

Consider that:

  • Blockchain service providers have high customer demand and equally high expectations for node uptime and resilient protocol clusters
  • Running a ‘bad node’ can lead to significant financial and reputational consequences
  • Blockchain organizations are expanding their customer base. They need to focus on rapid scaling, not unnecessary maintenance

Maintaining node uptime is a non-trivial matter. 

Blockchain nodes are tough to manage, especially as your landscape grows. Blockchain node software is not always designed with maintenance in mind.

Single nodes can be a challenge. As you add more nodes, that challenge grows rapidly. Constant firefighting will quickly consume all of your time.

Implemented correctly, SRE practices greatly reduce this maintenance effort and give engineering teams vital breathing space.  

It’s not just about ongoing maintenance, though. SRE can also play a significant part in node deployment.

Take Blockdaemon for example; we use NodeQ to rapidly deliver customer nodes.

NodeQ is our internal orchestration and automation tool/API. It is a business-critical piece of software and a key component of our success. Our SRE team ensures NodeQ is always operating well within the error budget. 

3. How Blockdaemon Does SRE

At Blockdaemon, we have a core SRE team that ensures the reliability of shared services such as Kubernetes. They also keep our production systems up and running.

Because of this, they span all our verticals (NodeQ, protocols, product, infrastructure and Ubiquity) and own the observability stack (Grafana Cloud). 

They work hand-in-hand with our other teams to support and maintain our production blockchain infrastructure and applications.

It’s not just our core team. Blockdaemon project development teams also employ SRE methodology.

Due to our wide range of supported protocols, Blockdaemon uses a squad model for solution and software development.

Blockdaemon squads consist of expert engineers who look after families of protocols. These squads also follow SRE practices, according to the SRE pyramid shown above.

Our squads use SRE across our development projects to:

  • Remove toil
  • Define SLIs and SLOs
  • Respond to incidents
  • Manage capacity
  • Perform postmortems
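As an illustration of the "Define SLIs and SLOs" step, the sketch below computes a simple latency SLI (the share of requests served within a threshold) and compares it with an SLO target. The `Slo` class, the 500 ms threshold and the 99% target are hypothetical values chosen for the example, not Blockdaemon's actual objectives.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Slo:
    """A service level objective: the minimum acceptable value of an SLI."""
    name: str
    target: float  # required fraction of good events, e.g. 0.99


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """SLI: fraction of requests served within threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, slo: Slo) -> bool:
    """True when the measured SLI is at or above the objective."""
    return sli >= slo.target


if __name__ == "__main__":
    # Hypothetical samples: three fast requests and one slow one.
    samples = [120.0, 180.0, 240.0, 900.0]
    sli = latency_sli(samples, threshold_ms=500.0)
    print(sli, meets_slo(sli, Slo(name="rpc-latency", target=0.99)))
```

Keeping the SLI (what is measured) separate from the SLO (what is promised) means the target can be tightened or relaxed without changing how the measurement is taken.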

Our protocol engineers ensure our infrastructure is built with the right principles in mind: it should be observable, reliable and ultimately supportable in our production environment.

We have aligned our toolsets to allow a greater level of observability across our infrastructure.

As well as the golden signals (latency, traffic, errors and saturation), we also look for blockchain-specific metrics. These include block height, blocks behind and connected peers.

Watching these metrics alerts us before nodes go ‘unhealthy’ and allows a proactive maintenance approach. Ultimately, this keeps our nodes healthy and active, and prevents slashing penalties.
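A check on the blockchain-specific metrics mentioned above might look something like the following sketch. The `NodeMetrics` fields mirror block height, blocks behind and connected peers; the thresholds (5 blocks, 3 peers) and all names are assumptions made for illustration, and in practice the values would come from whatever the observability stack scrapes from each node.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeMetrics:
    block_height: int      # latest block this node has synced
    network_height: int    # latest block reported by a reference source
    connected_peers: int   # peers the node is currently connected to


def blocks_behind(metrics: NodeMetrics) -> int:
    """How far the node lags the network; never negative."""
    return max(0, metrics.network_height - metrics.block_height)


def health_warnings(metrics: NodeMetrics,
                    max_blocks_behind: int = 5,
                    min_peers: int = 3) -> List[str]:
    """Return early-warning messages; an empty list means no warning fired."""
    warnings: List[str] = []
    lag = blocks_behind(metrics)
    if lag > max_blocks_behind:
        warnings.append(f"node is {lag} blocks behind (limit {max_blocks_behind})")
    if metrics.connected_peers < min_peers:
        warnings.append(
            f"only {metrics.connected_peers} connected peers (minimum {min_peers})"
        )
    return warnings


if __name__ == "__main__":
    # Hypothetical reading: the node lags by 20 blocks and has a single peer.
    for message in health_warnings(NodeMetrics(1000, 1020, 1)):
        print(message)
```

Setting the thresholds tighter than the point at which a node is actually considered unhealthy is what turns these checks into early warnings rather than outage reports.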

Effective automation allows us to onboard new protocols quickly, and rapidly expand customer node estates. 

Reducing toil through automation allows us to react quickly to market demand and frequent protocol updates. This is critical with blockchains where:

  1. Node software is constantly evolving
  2. New protocols appear regularly
  3. Network security depends on all nodes being ‘good’ and healthy
  4. Within some protocols, penalties can be incurred if your nodes are not up to date

SRE is a vital component of the Blockdaemon approach and blockchains in general. It allows blockchain organisations to stay ahead of the curve and remain in control of their node estate.

Interested in Joining Our Team?

Are you interested in joining a world-class blockchain engineering team? Check out our careers page for more information on our current opportunities.