Scaling Database Maintenance with Go: A Tale of Hundreds of Databases 🚀
Hey Gophers! 👋 Ever wrestled with database maintenance? It’s a necessary evil, but it can quickly become a massive headache, especially in a microservices architecture. That’s what Abhishek, Software Architect at Cred, explored at GopherCon, and we’re breaking down the journey to a more scalable and less painful process.
The Challenge: Database Maintenance at Scale 💾
Cred, a leading payments platform in India, processes a huge volume of transactions – over one-third of all credit card bill payments in India! This is powered by a sprawling network of over 800 microservices and 150+ databases. Imagine the complexity of maintaining all of that!
Traditionally, database maintenance (version upgrades, scaling, etc.) was a grueling, multi-hour process involving:
- Massive Downtime: The whole app was taken down while containers were stopped, databases were switched, and sanity checks were performed. This meant significant disruption for users.
- Team Coordination: 70+ team members had to coordinate, which slowed everything down.
- Time-Consuming: 2-3 hours per 100 databases – a major inefficiency.
The Solution: Introducing Maintenance Mode 🛠️
Abhishek and the Cred team tackled this head-on with a clever solution: Maintenance Mode. The core idea? Automate and streamline the entire process, minimizing downtime and freeing up valuable engineering resources.
Here’s how they did it, broken down into key components:
- Maintenance Libraries (Go & Java): Custom-built libraries that services can integrate to enter and exit Maintenance Mode. These libraries listen for an etcd key to toggle the mode.
- Handlers: The heart of the maintenance libraries. These handle traffic management, connection refreshing, and other critical tasks.
  - HTTP/gRPC Handlers: Return maintenance pages for regular traffic while leaving health check endpoints untouched.
  - Kafka/SQS Handlers: Stop consumers when maintenance starts and restart them afterwards, so no messages are lost in between.
  - Database Handlers: The key to connection refreshing. They cache existing connection configurations, force aggressive connection refreshes (reducing idle connections to zero), and restore the original configuration when maintenance ends.
- Maintenance Control Tower: An automated control panel that orchestrates the entire maintenance process. It handles:
  - App downtime initiation
  - Traffic termination via Kong (API Gateway)
  - Enabling Maintenance Mode for services
  - Automated database maintenance
  - Automated validation at each step
Diving Deeper: How it Works 🌐
Let’s break down the technical highlights:
- Traffic Termination: Instead of bringing down entire containers, the Maintenance Mode libraries intelligently stop new connections while allowing existing requests to drain.
- Connection Refresh: The database handlers force a refresh of database connections without restarting containers. Once the pool has dropped its old connections, new ones are opened against the new master database, which makes the switch fast.
- Automated Validation: The Control Tower automatically validates each step, ensuring a smooth transition and minimizing errors.
- Simplicity for Service Owners: Service owners just need to integrate the maintenance libraries and configure their services in the Control Tower.
The Impact: A Game Changer 🎯
The results speak for themselves:
- Reduced Downtime: What used to take hours now takes minutes.
- Increased Efficiency: A single operator can now manage maintenance for hundreds of databases.
- Significant Time Savings: Tens of hours of downtime and hundreds of developer hours saved.
Key Takeaways & Resources 💡
- Embrace Automation: Manual processes are ripe for automation, especially at scale.
- Strategic Connection Management: Cleverly managing database connections is key to minimizing disruption.
- Standardized Libraries: Building reusable libraries promotes consistency and reduces development effort.
Reference Code: [Check Abhishek’s presentation for the reference code location]
This journey showcases the power of thoughtful engineering and Go’s ability to tackle complex challenges. By embracing automation and a strategic approach to database management, Cred significantly improved their operational efficiency and user experience. What are your biggest database maintenance headaches? Let’s discuss in the comments! ⬇️