On Tuesday 14th September, Solana went down. A flood of transactions overwhelmed the chain and temporarily ground the network to a halt.
While the outage was relatively short, it was acutely felt by the community. Many services that rely on Solana suffered downtime, including exchanges, NFT projects and staking providers.
Fortunately, the community rallied and restored the heartbeat of Solana. Since the restart, we have seen a strong recovery. dApps, block explorers and validators are running smoothly again.
As a Solana validator, Blockdaemon is committed to the health and success of the Solana blockchain. We actively participated in successfully rebooting the blockchain.
Through this outage, we have all learned three key lessons:
1. The Importance of Community Collaboration
The blockchain industry relies on the active participation of millions around the world. This participation is not coordinated by any central actor. Each entity is independent, yet dependent on everyone else. There is no clearer example than what happened with Solana.
Fixing the problem relied on the collective efforts of engineers, network validators, community managers and more, from all walks of life. Some of these were volunteers who love the project. Others have a significant stake in seeing Solana succeed.
Being able to collaborate with the community on various social channels was key. By staying active in the Solana Validator Discord, we were able to send and receive updates in real time. This is what allowed us to be one of the first validators (and RPC nodes) ready for the network restart.
The lesson here is clear. This level of active participation is important in blockchain communities.
Validators can too easily remain anonymous and simply carry out their on-chain duties. However, being proactive has major advantages. We become aware of problems much faster than others who take a “set it and forget it” approach. This comes from having an active seat at the table.
2. Reaction Speed is Vital
Blockchain is the industry that never sleeps. Our 24/7 node monitoring means we’re prepared to respond rapidly to network changes. When the network restart took place, our response time was around 30 minutes.
The lesson is that predicting whether a black swan event will hit a network is virtually impossible. Reacting fast when a problem does occur is ultimately more important.
We cannot control the former, but we can control the latter. Our rapid reaction time comes from our engineering excellence, internal processes and ability to be nimble.
3. Have a Plan, and Get it Done
The quick fix and restart can be attributed to a few factors.
Firstly, we had a plan. Our dedicated engineers spent considerable effort developing and implementing a failover plan. We monitor for exactly this reason and we’re ready to react if things go bad.
Secondly, we take a hands-on, get-it-done startup approach. Our world-class engineering teams deliver manual failovers when necessary. This means our engineers manually intervene in failover procedures using our internally documented runbooks.
Both Blockdaemon and our data center providers monitor Solana nodes. This includes our validator nodes and a dozen or so RPC nodes. The ongoing lesson, we firmly believe, is that planning ahead matters. Having a procedure in place for events such as this is key to minimizing downtime.
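For readers who want a concrete picture, the kind of node monitoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Blockdaemon's actual tooling: it polls the Solana JSON-RPC `getHealth` method and flags a node for manual attention after several failed checks in a row. The function names, the threshold of three consecutive failures and the timeout are all hypothetical choices made for the example.

```python
import json
import urllib.request
from typing import Sequence


def build_health_request() -> bytes:
    """Encode a JSON-RPC 2.0 call to Solana's getHealth method."""
    return json.dumps({"jsonrpc": "2.0", "id": 1, "method": "getHealth"}).encode()


def is_healthy(response: dict) -> bool:
    """A healthy node answers getHealth with result "ok"; anything else counts as unhealthy."""
    return response.get("result") == "ok"


def should_alert(history: Sequence[bool], threshold: int = 3) -> bool:
    """Return True when the last `threshold` checks all failed.

    The threshold is an assumed value; it avoids paging an engineer
    for a single dropped request.
    """
    if len(history) < threshold:
        return False
    return not any(history[-threshold:])


def check_node(url: str, timeout: float = 5.0) -> bool:
    """Query one RPC node; network errors and malformed replies count as unhealthy."""
    request = urllib.request.Request(
        url,
        data=build_health_request(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return is_healthy(json.load(resp))
    except (OSError, ValueError):
        return False
```

A scheduler would call `check_node` for each node at a fixed interval, append the result to that node's history and hand over to an engineer with the runbook whenever `should_alert` returns True. Keeping the final decision with a person matches the manual failover approach described above.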
In the end, Blockdaemon restarted all of our nodes without any excess downtime. We were at the front line of the network restart.
We are extremely proud to be part of a community that bands together in times of difficulty.
We offer nodes on over 40 protocols, and we are committed to every single one. We actively support the individual communities we are part of.
By looking back at what happened, we can all learn lessons. This helps us, the world’s largest independent blockchain infrastructure provider, deliver the best service in the industry.