cETH Price Feed Incident: Post-Mortem

Overview

OpenZeppelin, in its role as a security advisor to the Compound DAO, has conducted a post-mortem to provide a summary of the incident that temporarily disrupted the price feed to the Compound V2 ETH market. This incident started with the execution of Proposal 117 to upgrade the oracle mechanism and ended with the execution of Proposal 119 which resolved the issue by reversing the upgrade. We’ll share a retrospective on what contributed to the issue, how the response was handled and then suggest recommendations to prevent it from occurring again and improve security going forward.

Background

Compound, like many lending protocols, relies on an oracle pricing feed to ensure that its markets remain healthy and up-to-date. The oracle system used by Compound is based on ChainLink price feeds but uses anchors from Uniswap pools to ensure that prices don’t fall outside an anchor-defined range. Compound has been using this open oracle system since June 2021 when it moved away from Coinbase price feeds as part of Proposal 47.

The open oracle system was first built to use pricing from Uniswap v2 but since the release of Uniswap v3, more liquidity has been moving to those pools instead. For that reason, an upgrade was needed to update the anchors to use Uniswap v3 for the most accurate pricing. These changes were developed by ChainLink in coordination with GFX Labs in late 2021. The PR then received 3 separate security audits from Dedaub, ABDK and OpenZeppelin in which only minor security issues were raised and were mostly addressed.

In May 2022, the upgrade changes were proposed in the Compound forum and discussed by the community. After incorporating community feedback, the implementation contract was deployed by ChainLink on August 5th and submitted for an upgrade by GFX Labs on August 22nd as Proposal 177. It passed quorum with unanimous approval from all active voters and became available to execute on August 30th.

Incident Timeline

The following covers the timeline of events from start to end of the incident. Some of these event descriptions are taken from forum updates that were posted as the incident occured.

  1. On Aug 30th at 6:20 PM UTC, Compound Proposal 117 was executed to upgrade the Oracle system to the new implementation.
  2. At 6:38 PM UTC, the proposal executor notified Compound developers in a closed channel that all calls coming from the Comptroller to getUnderlyingPrice for the cETH market were reverting. All other cToken markets appeared to be unaffected. Compound Labs, ChainLink, GFX Labs and OpenZeppelin teams were immediately notified and jumped into a war room.
  3. At around 6:47 PM UTC, the proposal executor reported that because cETH does not have an underlying() method assumed to be present in every cToken contract by the new oracle implementation, the getUnderlyingPrice function returns empty bytes that cannot be decoded and the call reverts. This source of the issue was verified by Compound Labs and OpenZeppelin shortly afterwards. This left withdrawals and liquidations in the cETH market effectively frozen while the issue remained in place.
  4. At 7:14 PM UTC, a fix proposal, Proposal 119, was submitted by GFX Labs to revert the upgrade and return to using the V2 Oracle contract. Like all proposals, it could only be executed after a 7 day governance process. Around the same time, Compound Labs issued Twitter and Discord posts notifying users of the price feed issue while the war room participants continued to investigate the issue further to determine the impact on existing users.
  5. At 10:15 PM UTC, after confirming the incident status with other war room participants, OpenZeppelin issued a more detailed update that reflected events up to this point and informed users that the primary issue was a temporary denial of service for the cETH market and that other cToken markets on Compound V2 and all of V3 remained functional. This update also highlighted that while user funds were not presently at risk, users would need to carefully monitor their positions to ensure they weren’t unexpectedly liquidated once the price feed was restored a week later.
  6. The next day, Aug 31st at 5:47 PM UTC, a proactive proposal to unpause liquidations, Proposal 121, was submitted by Arr00 in coordination with OpenZeppelin and another forum update was shared. This was done to address potential risks to some downstream proposals, namely Index Coop, that did not have the ability to manage their positions while the cETH markets were frozen and so could be liquidated unfairly when the fix was executed. While this was only an edge case in the event that the Ether price dropped significantly, the Pause Guardian Multi-sig decided to be prepared to pause liquidations for roughly 24 hours after the fix if unfair liquidations were set to occur. Several other downstream protocols were checked and confirmed to not be at risk.
  7. Later at 12:53 AM UTC, Gauntlet canceled Proposal 118 for Risk parameter updates to avoid non-critical changes while the fix was pending, with plans to deploy again after the incident was resolved.
  8. Starting on Aug 31st, OpenZeppelin deployed monitoring alerts of the cETH market that were shared with war room participants to check for abnormal activity and liquidation health. The Pause Guardian also prepared countermeasures in the unlikely event that the protocol was somehow exploited while the price feeds remained frozen.
  9. On Sep 6th, after monitoring the situation over the weekend, the Pause Guardian Multisig and OpenZeppelin agreed that pausing before the fix would be unnecessary as Ether prices were significantly higher than they were prior to the disruption, making unfair liquidations a non-issue.
  10. On Sep 7th at 1:18am UTC, the fix proposal was executed and the price feeds were restored. After verifying that the cETH markets had returned to normal, the Unpause Liquidations Proposal was canceled.

Issue Source

As mentioned in the timeline, the source of the issue came from an integration issue between the new oracle contract and the cETH market. Unlike all other cToken contracts, cETH does not support the underlying function since it depends on native ETH and not on an underlying asset. As a result, when calling getUnderlyingPrice for cETH, the function returns empty bytes that cannot be decoded and the call reverts. In short, this produced a denial of service whenever the oracle tried to retrieve the ETH price.

This integration bug meant that the cETH market was effectively frozen without the ability to supply or borrow from it. Every user that entered the cETH market (by either minting cETH or calling enterMarkets) could not query the ETH price for any borrow, redeem or liquidation actions. This also meant that any users holding cETH were not liquidatable, representing a further risk for the protocol until the issue was resolved. The simplest immediate solution was to roll back to the prior oracle implementation. Work is now underway to fix the issue in the v3 upgrade with plans to redeploy it after it goes through greater scrutiny and integration testing.

Retrospective

With the issue resolved, it now becomes important to analyze the factors that contributed to this failure with a focus on process over people. Many reputable parties were involved in the events leading up to the incident so we must consider how Compound contributors, OpenZeppelin and the community as a whole can learn from this incident. It’s also important to acknowledge what went well leading up to and during the incident in addition to what did not before finally considering what changes should be made.

What went well?

  • 3 separate audits were solicited and performed well ahead of the upgrade
  • The changes did have a test suite that checked ETH price reporting, even though mocked contracts still left gaps
  • Proposal details were posted on the forum with ample time for community review and the authors addressed community feedback that came up
  • The issue was identified quickly and a fix proposal was submitted within an hour.
  • Proactive measures were taken by the Pause Guardian to protect against edge cases
  • Monitoring was used to alert for suspicious activity until the fix proposal was executed

Overall, the incident was successfully resolved with a prompt fix submitted, given the 7-day governance constraint, with no direct loss of funds. While ETH markets were favorable to Compound during the incident, proactive countermeasures were also taken to protect the protocol if liquidations became a risk. A lot of community members came forward to assist during the incident and the collective support made a big difference in protecting the protocol while the fix proposal was being passed.

What could have gone better?

  • The tests for checking ETH price reporting did not fully utilize a live mainnet fork to check if all cToken markets would operate successfully after the upgrade and instead utilized mocked contract calls that made assumptions about the functions present.
  • Three security audits did not catch the bug. The code changes within the PR used an interface for CErc20 that assumed the presence of underlying in all given contracts which made it difficult for the bug to be identified with the scope of these audits.
  • Proposal simulations using Compound’s repository tooling, while often stated as a community best practice, were not used to check if the markets would perform as expected. There was not a clear expectation of who should perform these simulations within the community and some users have reported difficulty using them.
  • While the Incident Response turnaround time was generally good, it could have been improved with a planned health check of the upgrade immediately after it was executed.
  • The Pause Guardian was able to prepare proactive measures to address potential edge cases during the incident but some of these measures had to first be discussed and understood. This response time could have been reduced with more preparation.
  • The 7-day governance process for new changes added a great deal of friction when attempting to revert the upgrade. A rollback feature would have reduced the downtime of the cETH market to minutes instead of days.

Overall, it’s clear that several failures occurred in the pre-deployment process that must be addressed for all Compound proposals going forward. While the incident response was largely successful, there are also areas here that can be improved upon given what’s been learned.

It’s also important to note that while no loss of funds occurred with this specific incident, these types of integration issues could result in far more lasting damage. This should be taken as an opportunity to reflect on what can be done to improve security to the point that these issues are incapable of finding their way into production.

What should be done differently going forward?

With this incident now resolved, there are areas where both OpenZeppelin and the Compound community can improve. The lack of complete simulation testing, complex code dependencies and unclear security processes ultimately led to a failure in catching this bug in pre-deployment. While the incident response was successful in resolving the issue, we also see a need to improve here as well. However, before these issues can be addressed, we need to first clarify the security role for OpenZeppelin, proposal authors and the community at large.

While OpenZeppelin is responsible for auditing all Compound upgrades, we see this function as the last line of defense for Compound, not the first and only defensive measure. Security first starts with the initial design of a proposal change, then its development, testing, simulation and then finally an audit as a final step before submission. Monitoring and incident response capabilities after deployment are also a crucial consideration to plan for in each proposal. This holistic approach to security throughout the development lifecycle is already well established in web2 security circles and has been more recently discussed for web3. As Compound’s Security Partner, OpenZeppelin should still be involved in the security of these other stages of the proposal lifecycle but we cannot do so alone.

Based on what the OpenZeppelin team has learned during this incident, as well as what we’ve experienced over this past year, we recommend that the Compound community establish clearly documented security policies for all proposal changes going forward. We think that the formation of a formal working group, made up of prominent community developers, would be the best forum for these changes to be discussed and implemented. We’ve proposed this idea to several community stakeholders already and will be preparing a more formal proposal to share in the coming weeks. We expect that OpenZeppelin will take a leading role in this group but we will require participation and feedback from other community members to ensure new policies are successful.

As a starting point, we recommend the following security policies be considered:

  1. Make Proposal Simulations an Explicit Requirement for Upgrades. All upgrade proposals should be required to perform comprehensive simulation testing prior to being submitted for a governance vote. This could be done by either using Compound’s custom simulation tools or other testing frameworks that can fully simulate changes on the live mainnet. OpenZeppelin will verify sufficient simulations were performed as part of its proposal audits going forward.
  2. Implement a change management process to ensure that code deployed on-chain always matches the versions of the Compound protocol repository to maintain a high level of quality and readability. This is where automated testing and security tooling should be run in preparation for an audit and final community review.
  3. Explore how proposal upgrades could be rolled back more quickly for a limited time following an upgrade’s execution without going through 7-day governance. While the goal should always be to avoid rollbacks from ever being necessary again, the option to quickly rollback would greatly reduce the damage of upgrade issues and could be done without compromising protocol decentralization.
  4. Improve the protocol’s response capabilities through incident response training conducted by OpenZeppelin with the Pause Guardian. This initiative was already in its early stages prior to this incident and its importance is all the more apparent.

OpenZeppelin stands ready to help Compound implement all of these changes. We are also conducting an internal post-mortem to improve our own audit practices so that we do a better job of catching integration issues outside the immediate scope of PR changes such as this one. However, we don’t believe any protocol can safely rely on audits alone or any single security method and that’s why we ask the Compound community to consider these new security policies and a supporting working group as crucial requirements for protecting the protocol.

Conclusion

Compound is one of the longest running DeFi lending protocols with billions of dollars in TVL and many downstream protocols, products and users that depend on its markets. Downtime due to upgrade issues cannot be tolerated going forward and so this incident must be used as an opportunity for Compound to improve its quality assurance process for proposal upgrades as well as its incident response.

This incident was preventable and can be prevented going forward with clearly defined security policies that should start with mandatory simulations for all proposal upgrades. OpenZeppelin will be taking the lead in implementing these changes but we will need the community’s support to ensure that security is improved at every step of the development process leading up to deployment and beyond. A defense-in-depth strategy, coordinated by a working group that creates explicit security policies for all proposal upgrades, is the best path forward.

We hope this post-mortem helps Compound and other protocols improve their security and better protect the open economy. We’re invite the community to share further feedback and to be on the look out for a more detailed proposal to create a developer working group.

13 Likes

Nice to see this post-mortem published :muscle:

Since the first post in Protocol Development on this forum describes a recommended process for contributing to the protocol, I would just add that ultimately governance participants need to enforce that these processes are followed.

Compound III provides significantly better tooling in this regard, so the solution to (1) and (2) might be to move in that direction. A challenge for (3) is maybe finding a general-purpose undo mechanism, but I agree that it’s a great idea to explore.

4 Likes

Thanks @jared. I agree that the post you linked is a great starting point and might just need some minor updates since it was published in 2020. The next step could be a more living checklist for each proposal to verify that each step was performed to be verified by community participants. Preferably, I think this should be done through codebase PRs to keep everything in one place.

Compound III codebase tooling is definitely superior to V2 so this is another reason to push for a migration to the new markets.

1 Like

Any good shadow fork solutions out there for devs?

1 Like

Great post-mortem @cylon, it’s very comprehensive!

I think the most important action item is the formation for a formal working group overseeing/orchestrating those operations. This is the main missing thing and major improvement, which can also set a good standard for other protocols making Compound leading the governance ops practices.

In addition, I would recommend to make explicit adding pre and post checks with governance proposal. Simulations may help but on-chain conditions can change blocks by blocks. We can leverage atomicity to avoid to perform an upgrade if a condition is not anymore true or the changes break core protocol functionalities.

In the specific case, a small borrow tx in the proposal could have spot the issue and revert the proposal passing.

Looking forward for your feedback.

2 Likes