If you run a blockchain node it’s important to monitor its performance. A single node is relatively easy to keep on top of. When you start adding more nodes, monitoring soon becomes a challenge.
Uptime has always been important for blockchain technology. The whole solution falls apart if too many nodes are down. However, the move to proof-of-stake (PoS) protocols has raised the stakes for uptime even further.
Maintaining node uptime is vital in PoS networks. If your PoS validator node goes down, you could end up being slashed and losing staked tokens.
To keep your nodes active, you need to make sure they’re healthy. The only way you can do this is by actively monitoring them.
If you’re only running a single node type, monitoring can be relatively trivial. You can keep track of the pertinent metrics and alert criteria… if you know what they are.
When you start adding and diversifying protocols and node types, things become more complicated.
As a leading node operator, Blockdaemon monitors tens of thousands of nodes. In fact, we probably monitor more nodes than anyone else.
In this post we’re going to share 3 secrets that nobody mentions when it comes to node monitoring:
- You can’t assume the out-of-the-box alert criteria are the most important
- Each protocol requires you to monitor different alert criteria
- Even different node types require different alert criteria
If you’re running a variety of nodes, these secrets could be the difference between success and failure.
It’s Easy to Tell When a Node is Down – Instead, You Need Advance Warning of Failure
Before we get started let’s briefly touch on why exactly you need to monitor, and what you need to look out for.
Certain metrics are easy to spot. For example, it’s easy to tell when a node is down. But by this point, it’s too late. That cat’s already out of the bag.
You need to get ahead of the game. Anticipate and prevent failures before they happen. Because of this, you need to specifically monitor alert criteria.
A Stitch in Time Saves Nine
As stated earlier, maintaining node uptime is vital in PoS networks. Obviously, downtime restricts your access and limits your potential activities, but it could also hit your pocket: downtime stops you from generating income and can even result in slashing penalties.
It goes without saying, then, that you want to avoid downtime wherever possible.
Naturally, issues with your nodes are easier to fix when they’re identified early. If you want to be proactive, you need to monitor the predictive metrics, the alert criteria. These are the metrics that let you identify and prevent node failures before they happen.
3 Secrets of Effective Blockchain Node Monitoring
1. You can’t assume the out-of-the-box alert criteria are the most important
As we’ve mentioned above, monitoring a single node type can be relatively straightforward. There’s only so much you need to pay attention to and it shouldn’t take up too much of your time.
Plus, the developer documentation will let you know how best to monitor your infrastructure, right?
Well, not always. We’ve found the out-of-the-box alert criteria can actually be quite misleading, especially when you start ramping up your node infrastructure.
Example – Tendermint Block Production Metrics
Many Tendermint-based validators (e.g. Cosmos, Terra) report block production metrics over extremely large moving windows.
This means transient spikes may go unnoticed if your alert threshold is too high.
However, a lower threshold could prevent alerts from subsiding. With a large enough window, the metric will take days to drop below that threshold.
To properly alert (and prevent false positives), you need to:
- Keep an instantaneous running diff
- Watch for transient spikes
- Average those over a more reasonable window
- Determine your own suitable threshold
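The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production alerting rule: it assumes you can sample a cumulative missed-block count at regular intervals, and the class name, window size and threshold are all hypothetical values you would tune for your own network.

```python
from collections import deque


class MissedBlockAlert:
    """Alert on recent missed blocks using diffs of a cumulative counter.

    Assumption: `total_missed` is a running total sampled at a fixed interval.
    The window size and threshold are illustrative, not recommended values.
    """

    def __init__(self, window=5, threshold=2.0):
        self.window = deque(maxlen=window)  # most recent per-sample diffs
        self.threshold = threshold          # average misses per sample that triggers an alert
        self.last_total = None

    def observe(self, total_missed):
        """Record the latest sample and return True while the alert should fire."""
        if self.last_total is not None:
            # Instantaneous running diff; clamp at 0 in case the counter resets.
            self.window.append(max(0, total_missed - self.last_total))
        self.last_total = total_missed
        if not self.window:
            return False
        # Average the diffs over a short window instead of the protocol's huge one.
        return sum(self.window) / len(self.window) >= self.threshold


alert = MissedBlockAlert(window=3, threshold=2.0)
samples = [100, 100, 100, 106, 106, 106, 106]  # a burst of 6 missed blocks
states = [alert.observe(s) for s in samples]
print(states)
```

Because the window here is short, the alert fires as soon as the burst shows up and clears a few samples after the node recovers, rather than lingering for days.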
Blockdaemon Advice: Read beyond the documentation. Reach out to the communities to get real-world experience. Find out which metrics really matter and how to sensibly monitor them.
2. Each protocol requires you to monitor different alert criteria
You may have been running a certain protocol successfully and decide that now is the time to diversify.
You’ve maintained healthy nodes and you’ve got your head around the alert criteria – you know what to look out for. The problem is, your new protocol probably has different alert criteria.
Example – Polkadot v Solana
If you’re monitoring a Polkadot validator node, you will want to keep an eye on CPU utilization, which should be well under 100%.
With Solana, by contrast, one CPU core is always at 100%, so CPU utilization tells you very little. Instead, you need to monitor vote rate and leader slot success rate.
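One way to keep these differences explicit is a per-protocol rule table. The sketch below is only an illustration: the metric names, the rule format and every threshold are hypothetical, and it assumes you already collect these values from your own exporters.

```python
# Hypothetical per-protocol rules: metric name -> (kind, limit).
# "max" alerts when the value rises above the limit,
# "min" alerts when the value falls below it.
RULES = {
    "polkadot": {
        "cpu_percent": ("max", 85.0),
    },
    "solana": {
        # CPU is deliberately absent: one core pinned at 100% is normal here.
        "vote_rate": ("min", 0.95),
        "leader_slot_success_rate": ("min", 0.90),
    },
}


def check(protocol, metrics):
    """Return a list of alert messages for one node's metric snapshot."""
    alerts = []
    for name, (kind, limit) in RULES[protocol].items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name}: metric missing")
        elif kind == "max" and value > limit:
            alerts.append(f"{name}: {value} is above {limit}")
        elif kind == "min" and value < limit:
            alerts.append(f"{name}: {value} is below {limit}")
    return alerts


solana_ok = check("solana", {"cpu_percent": 100.0, "vote_rate": 0.99,
                             "leader_slot_success_rate": 0.97})
polkadot_hot = check("polkadot", {"cpu_percent": 100.0})
print(solana_ok, polkadot_hot)
```

The same 100% CPU reading passes silently for the Solana node and raises an alert for the Polkadot node, which is exactly the distinction you want your monitoring to encode.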
Blockdaemon Advice: As you add new protocols, learn their specific intricacies. Don’t assume that metrics are the same across blockchain networks.
3. Even different node types require different alert criteria
Somewhat surprisingly, even within the same protocol, different node types can have different alert criteria.
Example – Stellar Full History Node v Stellar Validator Node
A Stellar full history node will constantly consume disk, whereas a validator won’t, since (hopefully) you have enabled ledger pruning on your validator.
For a full history node, a drive that is almost full does not indicate an error; it just means it needs more space.
On the other hand, for a validator that only needs a limited history to continue to function, the ledger data should never consume more disk space than expected.
If you find yourself blindly resizing a validator disk, you’ve probably ignored a potentially severe problem. Even worse, you’re needlessly increasing the cost of running that machine or instance.
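To make the contrast concrete, here is a minimal sketch of how the same disk reading could lead to different actions. The node type labels, the 85% fill ratio and the expected-size budget are all hypothetical; you would set them from your own observations of each node type.

```python
def disk_action(node_type, used_gb, capacity_gb, expected_max_gb=None):
    """Decide what a disk usage reading means for a given node type.

    Assumptions: `node_type` is "full_history" or "validator" (labels chosen
    for this example), and `expected_max_gb` is your own budget for how much
    ledger data a pruned validator should ever hold.
    """
    if node_type == "full_history":
        # Steady growth is normal, so a nearly full drive just needs more space.
        return "expand_disk" if used_gb / capacity_gb >= 0.85 else "ok"
    if node_type == "validator":
        if expected_max_gb is None:
            raise ValueError("validators need an expected_max_gb budget")
        # Growth past the budget hints that pruning is not working: investigate.
        return "investigate" if used_gb > expected_max_gb else "ok"
    raise ValueError(f"unknown node type: {node_type}")


print(disk_action("full_history", used_gb=900, capacity_gb=1000))
print(disk_action("validator", used_gb=300, capacity_gb=1000, expected_max_gb=200))
```

Note that the validator check ignores how full the drive is. A validator at 30% of a large disk can still be in trouble if it holds more data than it should.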
Blockdaemon Advice: Pay attention to your different node types. As with protocols, you need to learn their specifics. Alert criteria can vary even within protocols.
How to Prevent Node Failure & Protect Your Investment
As we’ve discussed, monitoring is key to maintaining healthy nodes.
By paying attention to metrics, you can get ahead of the game. You can adopt a proactive, rather than reactive, approach.
Monitoring gets more and more difficult as you add more nodes to your landscape. There comes a point when it’s just not feasible to track everything. You need to get smarter and focus on the metrics that matter.
Some of these important metrics are universal, whereas some are important to specific blockchains or even node types. It’s vital that you understand not only which to monitor for each node, but how to interpret the results.
Remember, the same metrics can mean very different things depending on the node you are monitoring.
To prevent node failures, you need to focus on the critical metrics for your infrastructure: the node-specific alert criteria.
You Could Benefit From Outsourcing Your Blockchain Infrastructure & Monitoring
If you’re managing a significant number of nodes, it might be more beneficial to move your infrastructure and monitoring to a third party.
Blockdaemon is the perfect choice
We are the leading independent blockchain infrastructure provider:
- We manage 500+ validating nodes
- We support 40+ protocols
- We have launched 23K+ nodes
- $10B+ in assets are staked using our infrastructure
If node health is a concern, outsourcing to experts is the best option. Not only do we take away the infrastructure headache, but we also allow you to focus on your core business activities.