Scaling Database Maintenance with Go: A Tale of Hundreds of Databases 🚀

Hey Gophers! 👋 Ever wrestled with database maintenance? It’s a necessary evil, but can quickly become a massive headache, especially when dealing with a microservices architecture. That’s what Abhishek, Software Architect at Cred, explored at GopherCon, and we’re breaking down the journey to a more scalable and less painful process.

The Challenge: Database Maintenance at Scale 💾

Cred, a leading payments platform in India, processes a huge volume of transactions – over one-third of all credit card bill payments in India! This is powered by a sprawling network of over 800 microservices and 150+ databases. Imagine the complexity of maintaining all of that!

Traditionally, database maintenance (version upgrades, scaling, etc.) was a grueling, multi-hour process involving:

  • Massive Downtime: Full app downtime while containers were brought down, databases were switched, and sanity checks were run. This meant significant disruption for users.
  • Team Coordination: 70+ team members had to coordinate, which slowed everything down.
  • Time-Consuming: 2-3 hours per 100 databases – a major inefficiency.

The Solution: Introducing Maintenance Mode 🛠️

Abhishek and the Cred team tackled this head-on with a clever solution: Maintenance Mode. The core idea? Automate and streamline the entire process, minimizing downtime and freeing up valuable engineering resources.

Here’s how they did it, broken down into key components:

  • Maintenance Libraries (Go & Java): Custom-built libraries that services integrate to enter and exit Maintenance Mode. These libraries watch an etcd key that toggles the mode.
  • Handlers: The heart of the maintenance libraries. These handle traffic management, connection refreshing, and other critical tasks.
    • HTTP/gRPC Handlers: Return maintenance pages and exclude health check endpoints.
    • Kafka/SQS Handlers: Stop and restart consumers so that messages are not lost during maintenance.
    • Database Handlers: The key to connection refreshing. They cache existing connection configurations, force aggressive connection refreshes (reducing idle connections to zero), and restore the original configuration when maintenance ends.
  • Maintenance Control Tower: An automated control panel that orchestrates the entire maintenance process. It handles:
    • App downtime initiation
    • Traffic termination via Kong (API Gateway)
    • Enabling Maintenance Mode for services
    • Automated database maintenance
    • Automated validation at each step

Diving Deeper: How it Works 🌐

Let’s break down the technical highlights:

  • Traffic Termination: Instead of bringing down entire containers, the Maintenance Mode libraries intelligently stop new connections while allowing existing requests to drain.
  • Connection Refresh: The database handlers force a refresh of database connections without restarting containers. This lets services switch quickly to the new master database.
  • Automated Validation: The Control Tower automatically validates each step, ensuring a smooth transition and minimizing errors.
  • Simplicity for Service Owners: Service owners just need to integrate the maintenance libraries and configure their services in the Control Tower.

The Impact: A Game Changer 🎯

The results speak for themselves:

  • Reduced Downtime: What used to take hours now takes minutes.
  • Increased Efficiency: A single operator can now manage maintenance for hundreds of databases.
  • Significant Time Savings: Tens of hours of downtime and hundreds of developer hours saved.

Key Takeaways & Resources 💡

  • Embrace Automation: Manual processes are ripe for automation, especially at scale.
  • Strategic Connection Management: Cleverly managing database connections is key to minimizing disruption.
  • Standardized Libraries: Building reusable libraries promotes consistency and reduces development effort.

Reference Code: [Check Abhishek’s presentation for the reference code location]

This journey showcases the power of thoughtful engineering and Go’s ability to tackle complex challenges. By embracing automation and a strategic approach to database management, Cred significantly improved their operational efficiency and user experience. What are your biggest database maintenance headaches? Let’s discuss in the comments! ⬇️
