On April 30, 2022, Solana suffered a network outage. Block production ceased after a flood of incoming transactions drove more than 100 Gbps of traffic to individual nodes, a problem exacerbated by a validator bug that caused more than 500 GB of memory consumption.
A full write-up of the incident was published by the Solana Foundation, and can be found here.
At this critical moment, Blockdaemon engineers worked tirelessly to support our customers, while helping restore the network.
It’s been a week since the blockchain network was successfully restored. This was aided in no small part by the efforts of Blockdaemon’s Solana experts.
In this post, we’ll focus on how Solana’s outage played out over seven hours. We’ll also take a look at the important role Blockdaemon engineers played in recovering the network’s health.
Solana’s Mainnet Beta Cluster Stops Block Production
Saturday, April 30, 20:30 UTC
On Saturday evening, Blockdaemon’s engineers noticed that a large spike of network traffic hit our Solana nodes.
However, Blockdaemon’s infrastructure was not alone in this regard. Six million inbound transactions per second flooded the Solana network, clocking over 100 Gbps of traffic at individual nodes.
This surge of transactions stemmed from one overwhelming source: bot activity.
Bots were attempting to win an NFT mint with a fixed floor price. This perversely incentivized bots to spam the network with transactions, in an attempt to claim the mint. Given Solana’s low fees, the cost of trying to win the mint did not outweigh the potential upside of winning. These NFTs were being minted by the Candy Machine program.
Ultimately, the bot spam worsened block propagation, leading to an increased number of forks. More forks caused an extreme increase in memory requirements, until Solana validators eventually crashed. This stalled network consensus.
Here’s how Blockdaemon responded to the incident…
Blockdaemon Closed All Mainnet Solana RPC Nodes
Saturday, April 30
Blockdaemon’s engineers responded rapidly to the situation. To prevent RPC servers from running out of memory and crashing, the team decided to shut down all mainnet Solana RPC nodes. The decision came after block production had nearly halted and block confirmation had ceased. Shutting the nodes down also prevented Blockdaemon’s clients from retrieving stale information, such as incorrect account balances and other out-of-date on-chain data.
However, Blockdaemon’s validator nodes were kept online until block production completely halted. This allowed the team to capture the latest blocks leading up to the halt, and to independently verify the blockchain state.
From the outset, Blockdaemon’s engineers looked to mitigate as much bad traffic as possible.
Blockdaemon Helped Snapshot Creation & Distribution
Saturday, April 30
With Solana’s consensus, the heartbeat of the blockchain, stopped, the validator community had to manually agree on the latest slot and bank hash before the chain halted completely. The community also had to manually agree on the restart instructions.
By manually agreeing on the bank hash of the restart snapshot, the community ensured that:
- All validators have the exact same revision of the blockchain state
- No fraud attempts were made to modify the state (e.g. no one could mint new SOL out of thin air)
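The agreement check described above can be sketched in a few lines. This is only an illustration, not the tooling the validator community actually used: the operator names and hash strings are made up, and the function names are hypothetical. The idea is simply that every operator reports the (slot, bank hash) pair it computed locally, and the restart proceeds only when all reports are identical.

```python
from collections import Counter


def agreed_restart_point(reports):
    """Return the (slot, bank_hash) pair all operators agree on, or None.

    `reports` maps an operator name to the (slot, bank_hash) pair that
    operator computed from its own local ledger.
    """
    distinct = set(reports.values())
    return next(iter(distinct)) if len(distinct) == 1 else None


def dissenting_operators(reports):
    """Return the operators whose report differs from the most common one."""
    if not reports:
        return {}
    most_common_pair, _ = Counter(reports.values()).most_common(1)[0]
    return {name: pair for name, pair in reports.items() if pair != most_common_pair}


# Hypothetical reports: the slot is the one from this incident,
# the operator names and hash strings are placeholders.
reports = {
    "operator-a": (131973970, "EXAMPLE_HASH_1"),
    "operator-b": (131973970, "EXAMPLE_HASH_1"),
    "operator-c": (131973970, "EXAMPLE_HASH_2"),
}

print(agreed_restart_point(reports))   # None while any operator disagrees
print(dissenting_operators(reports))   # shows who needs to re-check their state
```

A single mismatching hash is enough to block agreement, which is the point: a state that differs by even one account produces a different bank hash.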
Blockdaemon’s engineers played a pivotal role in both Solana snapshot creation and distribution. This snapshot was the key to the Mainnet-Beta Cluster Restart.
Blockdaemon engineers worked closely with the Solana validator community to create the snapshot.
Together, they decided on a slot and shred version to roll the chain back to. The chosen slot was 131973970, the last optimistically confirmed slot before block production halted.
The snapshot creation process underwent a lot of scrutiny from the community. The state from which the blockchain was to restart was independently verified by dozens of Solana validator operators.
Since time was of the essence, Blockdaemon engineers proactively began the snapshot process before the community had agreed on a slot number, which accelerated recovery efforts.
We also helped other validators come online quickly by providing a fast mirror for restart-relevant snapshots. Blockdaemon was then featured in the community-created cluster restart runbook.
To distribute the snapshot to our fellow validators, Blockdaemon provided a public download for the restart snapshot, following our snapshot management procedure. Other validators soon followed.
Based on the number of downloads this garnered, we know that many other validators used Blockdaemon’s snapshot download to restart. Blockdaemon’s URL-based download method is faster and more reliable than Solana’s default p2p downloads.
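To make the URL-based approach concrete, here is a minimal sketch of a streamed HTTP download that verifies the file as it arrives. It is not Blockdaemon’s actual procedure: the function names are hypothetical, and it assumes the publisher shares a SHA-256 digest alongside the snapshot URL, which this post does not state.

```python
import hashlib
import urllib.request

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large snapshots never sit fully in memory


def copy_and_hash(source, destination, chunk_size=CHUNK_SIZE):
    """Copy a binary stream to `destination` and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        destination.write(chunk)
        digest.update(chunk)
    return digest.hexdigest()


def download_snapshot(url, path, expected_sha256):
    """Download `url` to `path` and fail loudly if the digest does not match.

    `expected_sha256` is an assumption of this sketch: a digest published
    out of band by whoever hosts the snapshot.
    """
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        actual = copy_and_hash(response, out)
    if actual != expected_sha256:
        raise ValueError(f"digest mismatch: expected {expected_sha256}, got {actual}")
    return path
```

Hashing while copying means the file is read only once, which matters for snapshots that run to many gigabytes. A checksum only confirms the download is intact; the validator still checks the bank hash agreed on by the community when it loads the snapshot.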
Without Blockdaemon’s efforts, Solana could have taken longer to successfully restart.
A full look at Blockdaemon’s Solana snapshot management is shared here.
Solana’s Mainnet Beta Cluster Successfully Restarts
Solana’s Mainnet Beta Cluster successfully restarted, seven hours after the incident began.
This was achieved once 80% of the network stake reached consensus, which took several hours. Even so, it marked the fastest Solana restart to date, a testament to the close, quality collaboration between active validators.
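The 80% threshold is a simple stake-weighted sum, sketched below. The validator names and stake amounts are invented for illustration, and the function is hypothetical rather than part of Solana’s software. It uses integer arithmetic so that a value sitting exactly on the threshold is not lost to floating-point rounding.

```python
def restart_threshold_met(stakes, online, threshold_percent=80):
    """Return True once the online validators hold enough of the total stake.

    `stakes` maps validator name -> staked amount (integers).
    `online` is an iterable of validator names that have rejoined.
    """
    total = sum(stakes.values())
    if total <= 0:
        raise ValueError("total stake must be positive")
    # Use a set so a validator listed twice is only counted once.
    online_stake = sum(stakes[name] for name in set(online) if name in stakes)
    return online_stake * 100 >= threshold_percent * total


# Hypothetical stake distribution.
stakes = {"validator-a": 50, "validator-b": 30, "validator-c": 20}

print(restart_threshold_met(stakes, ["validator-a", "validator-c"]))  # 70% online
print(restart_threshold_met(stakes, ["validator-a", "validator-b"]))  # 80% online
```

Because the sum is weighted by stake rather than by node count, a handful of large validators returning quickly moves the total far more than many small ones, which is why coordination among operators sets the pace of a restart.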
Because of the steps that our engineers took, our RPC nodes were among the first to begin running again.
Block production recommenced, and consensus resumed as normal.
Blockdaemon’s Proactive Solana Engineering
During the seven hours it took to restore the network, our engineers rose to the occasion. Working together with the community, they helped restore the blockchain.
Solana is a decentralized ecosystem. As such, close collaboration with fellow validators, without a centralized coordinating party, was essential throughout this entire process.
Blockdaemon’s engineers were on the front lines, acting as motivated operators that helped lead and coordinate the restoration process.
It is also important to highlight the engineering team’s swift time-to-respond.
Blockdaemon’s Solana team were aware of the impending incident before it happened, and were online for the entire seven hours, until Solana and all of Blockdaemon’s infrastructure had been restored.
The result was a smooth transition back to a fully functioning, high-performance blockchain and a return to normal service for Blockdaemon’s clients.