Bye-Bye Burst Balance: 2-Minute Fix For Your AWS RDS Woes

TL;DR: Don't use GP2 SSD for your RDS. For almost all cases, GP3 is a better choice for the same price.
Anish Dutta
15 Dec 2024
3 min read
In mid-April 2024, we discovered that Kaffein had suddenly become extremely slow and unresponsive. All important features stopped working. Users couldn't send messages in private or group rooms, post new content (Stories and Pulses), or receive notifications. In short, it was a system-wide failure unlike any we had encountered before.

Our initial investigation of the analytics data indicated that the failure wasn't triggered by a sudden influx of new users. While there was a steady increase in overall user activity, this growth was not substantial enough to cause such a severe system failure.

Further investigation revealed that most DB writes were failing, and only a handful of reads were going through.

Something was clearly wrong with our database. Was it bad code, or was our cloud service provider experiencing an outage?

We were using a single-replica MariaDB instance hosted on AWS RDS. After confirming that AWS was operating normally, we analyzed our RDS monitoring data. Everything seemed fine except for the burst balance, which was at 0.

We had no idea what burst balance was or whether it mattered.

Burst Balance during April 2024
Note: The red circle indicates where the burst balance was recovered manually by upgrading and downgrading the DB instance back to back.
According to Google's Gemini, RDS burst balance provides temporary I/O performance boosts beyond the baseline to handle short-lived demand spikes. It's like extra credit that allows for higher I/O operations for a short period, replenishing over time. Depletion limits I/O performance until the balance is restored, potentially impacting application performance.
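In hindsight, an alarm on this metric would have warned us long before it hit 0. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. RDS publishes the metric to CloudWatch as `BurstBalance` in the `AWS/RDS` namespace; the instance identifier, SNS topic ARN, and 50% threshold are hypothetical placeholders, not values from our setup.

```python
def burst_balance_alarm_params(db_instance_id, sns_topic_arn, threshold_percent=50.0):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"{db_instance_id}-low-burst-balance",
        "Namespace": "AWS/RDS",
        "MetricName": "BurstBalance",  # reported as a percentage, 0 to 100
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Minimum",
        "Period": 300,            # evaluate 5-minute windows
        "EvaluationPeriods": 3,   # require 3 consecutive low windows
        "Threshold": threshold_percent,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_burst_balance_alarm(db_instance_id, sns_topic_arn):
    """Create the alarm. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        **burst_balance_alarm_params(db_instance_id, sns_topic_arn)
    )
```

Calling `create_burst_balance_alarm("example-db", "arn:aws:sns:us-east-1:123456789012:example-topic")` would notify the topic once the balance stays under 50% for 15 minutes. Both arguments are made-up examples.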
Simply put, your DB can only perform a certain number of IOPS. GP2 (General Purpose) SSD provides a baseline of 3 IOPS per GB of storage, with a floor of 100 IOPS. If you're using a 20GB GP2 volume for your RDS, like we were, 3 * 20 gives only 60, so your baseline is the 100 IOPS floor. Exceeding this baseline reduces the burst balance. When the burst balance reaches 0, IOPS are capped at the baseline; anything beyond that queues up, latency climbs, and requests start timing out.

A zero burst balance explained the terrible experience Kaffein users were having. But how did we get there? Some DB queries must have been using excessive IOPS. More importantly, how could we restore normal functionality by increasing the burst balance back to 100?
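To make the arithmetic concrete, here is a rough Python sketch of the GP2 credit model. It assumes the figures AWS documents for GP2 volumes under 1TB: 3 IOPS per GB with a 100 IOPS floor, bursts of up to 3000 IOPS, and a bucket of 5.4 million I/O credits against which the burst balance percentage is measured. The function names are illustrative, and real workloads are burstier than a constant rate, so treat the results as estimates.

```python
GP2_IOPS_PER_GB = 3
GP2_MIN_BASELINE_IOPS = 100
GP2_BURST_IOPS = 3000
GP2_CREDIT_BUCKET = 5_400_000  # I/O credits in a full bucket (burst balance = 100)


def gp2_baseline_iops(size_gb):
    """Baseline IOPS for a GP2 volume under 1TB."""
    return max(GP2_MIN_BASELINE_IOPS, GP2_IOPS_PER_GB * size_gb)


def seconds_until_empty(size_gb, sustained_iops):
    """Seconds a full bucket lasts at a constant load; None if it never drains."""
    baseline = gp2_baseline_iops(size_gb)
    demand = min(sustained_iops, GP2_BURST_IOPS)  # the volume cannot exceed burst IOPS
    if demand <= baseline:
        return None  # at or below baseline, no credits are spent
    return GP2_CREDIT_BUCKET / (demand - baseline)


def seconds_to_refill(size_gb):
    """Seconds for an empty bucket to refill if the volume sits idle."""
    return GP2_CREDIT_BUCKET / gp2_baseline_iops(size_gb)
```

Under these assumptions, a 20GB volume pushed at a steady 3000 IOPS empties a full bucket in roughly half an hour, and an empty bucket needs about 15 hours of near-idle time to refill.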
We accidentally discovered that upgrading and downgrading the RDS instance type restored the burst balance to 100. This brought Kaffein back to normal. After reviewing and debugging our backend code, we found and fixed the buggy DB queries that had been draining the burst balance. So the problem was solved, at least for the time being.
In the following months, our understanding of burst balance influenced our entire backend design. We avoided long periods of batch processing, which could quickly drain the burst balance. Instead, we broke up processing into smaller chunks and handled them gradually throughout the day, minimizing the impact on the burst balance.

A few months later, in November 2024, we noticed the burst balance steadily declining. We optimized queries, added secondary indexes to the DB, and cached more data in Redis to reduce the I/O load on RDS. However, the decline continued and we were back to square one.

Once again, we needed to find a solution for the 0 burst balance issue. Previous methods were no longer effective.
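The chunked processing described above can be sketched as follows. This is an illustration of the idea rather than our production code: `handle` stands in for whatever writes a batch to the DB, and the chunk size and pause are hypothetical knobs you would tune against your own burst balance graph.

```python
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_gradually(
    items: Sequence[T],
    size: int,
    pause_seconds: float,
    handle: Callable[[Sequence[T]], None],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Handle items in small batches, pausing between batches so the
    volume can earn back I/O credits. Returns the number of batches."""
    batches = 0
    for batch in chunked(items, size):
        if batches:
            sleep(pause_seconds)  # no pause before the first batch
        handle(batch)
        batches += 1
    return batches
```

The `sleep` parameter is injectable only so the pacing can be tested without waiting; in a real job the pauses would more likely come from a scheduler or queue than from an in-process sleep.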
What we discovered next was nothing short of an anti-climax.
GP2 vs GP3 pricing for MariaDB on 15th December, 2024
With GP2, IOPS are tied to storage size: a 100GB volume provides a baseline of only 300 IOPS, so buying more IOPS means buying storage you may not need, which is quite expensive for early-stage startups like Kaffein. GP3, on the other hand, includes a baseline of 3000 IOPS even on the smallest volume (on RDS, volumes of 400GB and up get a higher baseline). That is 30 times the 100 IOPS floor that GP2 gives a small volume like our 20GB one. Best of all, GP3 doesn't have a burst balance concept. You pay extra only for IOPS provisioned beyond the included baseline.

Burst Balance during Nov-Dec 2024
Note: (i) Red circles indicate where the burst balance was recovered before hitting 0 via the manual intervention described earlier. (ii) The graph discontinues after switching to GP3.
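To see how lopsided the comparison is, the short sketch below inverts the GP2 formula and asks how much storage you would have to provision just to reach a given baseline. It assumes GP2's documented 3 IOPS per GB and only makes sense for targets above the 100 IOPS floor; the function name is illustrative.

```python
import math

GP2_IOPS_PER_GB = 3
GP3_BASELINE_IOPS = 3000


def gp2_size_for_baseline(target_iops: int) -> int:
    """Smallest GP2 volume size in GB whose baseline reaches target_iops.

    Meant for targets above GP2's 100 IOPS floor, which every
    volume gets regardless of size.
    """
    return math.ceil(target_iops / GP2_IOPS_PER_GB)


# Storage needed on GP2 to match the baseline GP3 includes by default.
size_to_match_gp3 = gp2_size_for_baseline(GP3_BASELINE_IOPS)
```

Here `size_to_match_gp3` comes out to 1000, meaning a GP2 volume would need about 1000GB of storage to match the baseline a small GP3 volume already includes.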
Essentially, switching to GP3 had no downsides for Kaffein. The transition was smooth, with no RDS downtime or performance degradation.
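The switch itself is a single storage modification, which you can make from the RDS console or script. Below is a minimal boto3 sketch, assuming boto3 is installed and credentials are configured; the instance identifier is a hypothetical placeholder. It is not a record of the exact steps we took, so try it on a non-production instance first.

```python
def gp3_migration_params(db_instance_id, apply_immediately=True):
    """Build keyword arguments for the RDS modify_db_instance call."""
    return {
        "DBInstanceIdentifier": db_instance_id,
        "StorageType": "gp3",
        # False would defer the change to the next maintenance window.
        "ApplyImmediately": apply_immediately,
    }


def switch_to_gp3(db_instance_id):
    """Request the storage change and return the instance status RDS reports."""
    import boto3  # imported lazily so the builder above works without AWS access

    rds = boto3.client("rds")
    response = rds.modify_db_instance(**gp3_migration_params(db_instance_id))
    return response["DBInstance"]["DBInstanceStatus"]
```

For example, `switch_to_gp3("example-db")` would request the change for an instance named `example-db`, a made-up identifier.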
In conclusion, if you're using GP2 storage and experiencing burst balance issues, consider switching to GP3. By doing so, you can eliminate burst balance concerns altogether. Having one less thing to worry about is a good thing, especially when you are building a startup.