Building a more robust deployment process to prevent future meltdowns

I believe Compound needs a more robust deployment process if it is to stand the test of time. The recent COMP distribution bug, and our inability to fix it for the next 7 days, highlight the gaps in the current process. Improving the code review and testing process is a must, but it is still insufficient. Any stable software system needs the ability to quickly roll back changes and to roll out updates in a phased manner. No amount of code review and testing can compensate for the lack of either.

  1. Rollback capability - it currently takes at least 7 days to make changes to the protocol. I know there are good reasons for this process, but we also need the ability to quickly deploy fixes in emergencies. As such, I propose that we create a separate process for rolling back changes. To prevent abuse, we can put different parameters around the process, such as an elevated minimum number of votes, a maximum time elapsed since the deploy, a list of versions that are eligible to be rolled back to (if possible), etc. IMO, as engineers, we are responsible for ensuring a minimum level of safety in the systems we build, and I don’t think we will ever get there with the 7-day delay constraint. It’s like building a major bridge in San Francisco and not making it earthquake-proof. Safety needs to come first.
  2. Phased rollout capability - a standard practice in traditional software that I believe we can implement on the blockchain to make Compound more robust. Similar to how A/B tests are implemented, we can require all major code changes to branch users (into new code vs. old code) based on a predefined schedule (e.g. Day 0 - 1%, Day 2 - 5%, Day 7 - 10%, etc.). This gives us time to monitor changes and deploy fixes, and it limits the impact of bugs to the users already on the new code.
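
To make the phased rollout idea concrete, here is a rough sketch in Python (not Solidity, and not a proposal for actual contract code). It assumes users are bucketed by a hash of their address and uses the example schedule above; the names `ROLLOUT_SCHEDULE`, `rollout_percent`, `bucket`, `uses_new_code` and the `salt` value are all hypothetical.

```python
import hashlib

# Example schedule from the post: (days since deploy, percent of users on new code).
# Assumed to be sorted by day; the values are illustrative only.
ROLLOUT_SCHEDULE = [(0, 1), (2, 5), (7, 10)]


def rollout_percent(days_since_deploy, schedule=ROLLOUT_SCHEDULE):
    """Return the percent of users that should be on the new code path."""
    percent = 0  # before the deploy, nobody is on the new code
    for start_day, pct in schedule:
        if days_since_deploy >= start_day:
            percent = pct
    return percent


def bucket(address, salt="upgrade-1"):
    """Map an address to a stable bucket in [0, 100).

    The salt (hypothetical) lets each upgrade pick a different set of
    early users instead of always exposing the same ones first.
    """
    digest = hashlib.sha256(f"{salt}:{address.lower()}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def uses_new_code(address, days_since_deploy):
    """True if this address is routed to the new code on the given day."""
    return bucket(address) < rollout_percent(days_since_deploy)
```

Because each address keeps the same bucket for the whole rollout, a user who is moved to the new code stays on it as the percentage grows, which keeps monitoring and debugging simpler.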

Robert mentioned on Discord recently that Compound is decentralized to “ensure that the protocol can run for 100 years”. That’s the future I want to see, and I believe we need to engineer a more robust system to realize that dream. We are lucky that the recent bug only affected COMP distribution. It could have been much worse, and the story would have been that decentralization caused the protocol to last only 3 years. Let’s learn from this and ensure our future is the former and not the latter.


Related thread:
More Rigorous Process On Reviewing Large Code Changes (RE: Comp Bug 9/29/21) - Governance Process - Compound Community Forum

Additional ideas:

  • Test environment
  • Proposal simulation on real data
  • Extra audit incentivization
  • Greater test coverage
  • More granular changes
  • Formal verification

Rollback capability would generally increase the complexity of an upgrade, sometimes non-trivially. However, I still think we should push for it wherever possible.

I feel pausability is also a critical mitigation step that could have completely prevented this issue. Adding it in the right places is a must in my opinion, especially on the COMP distribution, which is extra functionality not core to the protocol.


Agreed that rollback capabilities would increase complexity. Perhaps the first step is to create a quicker deployment process (with more votes required than normal) without restricting the change to rollbacks. It seems the related thread is suggesting the same.
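
As a rough illustration of what the rules for such a quicker process could look like, here is a small Python sketch. Everything in it is an assumption for discussion: the `FastTrackRules` fields, the function name, and the vote numbers in the usage lines are hypothetical and do not reflect Compound's actual governance parameters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FastTrackRules:
    # Elevated vote threshold for the quick path (hypothetical value set by governance).
    fast_track_votes: int
    # Optional guard from the original post: only allow the quick path within
    # this many days of the deploy being fixed. None means no such restriction.
    max_days_since_deploy: Optional[int] = None


def qualifies_for_fast_track(for_votes, rules, days_since_deploy=None):
    """Return True if a proposal may skip the normal delay under these rules."""
    if for_votes < rules.fast_track_votes:
        return False
    if rules.max_days_since_deploy is None:
        return True  # quick path is not tied to a recent deploy
    # With a time limit configured, the caller must say how old the deploy is.
    return (
        days_since_deploy is not None
        and days_since_deploy <= rules.max_days_since_deploy
    )


# Illustrative numbers only.
open_rules = FastTrackRules(fast_track_votes=800_000)
rollback_rules = FastTrackRules(fast_track_votes=800_000, max_days_since_deploy=14)
```

The optional time limit is what separates the two variants discussed in this thread: leaving it unset gives a general quick path, while setting it keeps the quick path limited to recent deploys.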

I also think that adding distributeSupplierComp, distributeBorrowerComp, and claimComp to the set of functions that the existing Pause Guardian can control is the simplest, lowest-risk change we can make right now to protect ourselves from future bugs related to COMP distributions. If others agree, I can put a PR together.
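
To show the behavior I have in mind, here is a toy Python model (not the Solidity PR itself). It assumes the guardian can pause but only the admin can unpause; that rule, the `CompDistributionModel` class, and its method names are assumptions of this sketch. Only the three function names come from the suggestion above.

```python
class ActionPaused(Exception):
    """Raised when a paused COMP distribution action is called."""


class CompDistributionModel:
    # The three functions proposed above as pausable.
    PAUSABLE = ("distributeSupplierComp", "distributeBorrowerComp", "claimComp")

    def __init__(self, admin, pause_guardian):
        self.admin = admin
        self.pause_guardian = pause_guardian
        self.paused = {action: False for action in self.PAUSABLE}
        self.comp_accrued = {}  # holder -> COMP owed (toy accounting)

    def set_paused(self, caller, action, state):
        if action not in self.paused:
            raise ValueError(f"{action} is not pausable")
        if caller not in (self.admin, self.pause_guardian):
            raise PermissionError("only pause guardian and admin can pause")
        # Assumption: the guardian is a one-way emergency switch.
        if not state and caller != self.admin:
            raise PermissionError("only admin can unpause")
        self.paused[action] = state

    def claim_comp(self, holder):
        if self.paused["claimComp"]:
            raise ActionPaused("claimComp is paused")
        # Pay out whatever has accrued and reset the balance.
        return self.comp_accrued.pop(holder, 0)
```

With a guard like this in place, a distribution bug could be contained by pausing claimComp right away while a fix goes through the normal governance process, and the accrued balances stay untouched in the meantime.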