Presenters

Mirek Chocholous

Source

PromCon EU 2025

Taming the Alert Beast: Scaling Alert Management in the Modern World 🚀

Hey tech enthusiasts! 👋 Today, we’re diving into a surprisingly complex challenge faced by many monitoring teams: Alertmanager chaos. We’re going to explore how to streamline notifications, reduce alert fatigue, and ultimately keep your team (and your boss) happy. Let’s unpack this with Mirek from CDN77, who highlighted some critical issues and a potential solution.

The Problem: Alertmanager’s Quirks 🤯

Let’s be honest, Alertmanager is a foundational tool for many Prometheus and VictoriaMetrics setups. It’s designed to deduplicate, group, and route alerts to receivers. However, as Mirek pointed out, it’s not without its… quirks.

Single Node Limitations: The basic Alertmanager model works great with a single node. But when you scale to a cluster, things get tricky.
Cluster Communication Headaches: Cluster mode relies on nodes talking to each other to share which alerts have already been notified. This can lead to a cascade of “okay” and “fail” messages that overwhelms receivers. Imagine your phone buzzing with alerts from servers in Amsterdam while a critical server is down in Hong Kong. Not ideal! 🇭🇰➡️🇳🇱
Amnesia Strikes: The biggest pain point? When a node restarts unexpectedly, Alertmanager essentially “forgets” everything. This results in a flood of alerts needing re-acknowledgment. 💾
Rate Limit Deficiency: Alertmanager can group alerts and space out repeats (group_wait, group_interval, repeat_interval), but it has no built-in hard cap on how many notifications a receiver gets, which exacerbates the problem of excessive notifications.
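To make the rate limiting gap concrete, here is a minimal Python sketch of a token bucket placed in front of a receiver. It is not Alertmanager’s or CDN77’s code: the TokenBucket class and its parameters are hypothetical names chosen for illustration, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Hypothetical per-receiver limiter: bursts up to `capacity`,
    refilled at `refill_per_second` tokens per second."""

    def __init__(self, capacity: int, refill_per_second: float) -> None:
        self.capacity = float(capacity)
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start with a full bucket
        self.last = None               # timestamp of the previous call

    def allow(self, now: float) -> bool:
        """Return True if a notification may be sent at time `now` (seconds)."""
        if self.last is not None:
            # Refill for the elapsed time, never above capacity.
            elapsed = max(0.0, now - self.last)
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=2, refill_per_second=1.0)
    for t in (0, 0, 0, 1, 1):
        print(t, "send" if bucket.allow(t) else "drop")
```

A real deployment would probably queue or summarize the dropped notifications rather than discard them; the sketch only shows the limiting decision itself.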

Scaling the Solution: A Custom Approach 🛠️

CDN77 tackled this challenge with a clever, custom-built tool, AQUA (Alert Quorum Universal Aggregator). Here’s how they approached it:

Centralized Management: The tool manages a collection of Alertmanager nodes, acting as a central hub.
Script-Driven Logic: It processes alerts based on user-defined scripts, allowing for granular control over notifications.
Targeted Messaging: Crucially, it allows for customized messages – like specifying the number of affected servers and their IPs – to avoid overwhelming recipients. This is a game-changer for busy owners! 🎯
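The targeted messaging idea can be sketched in a few lines of Python. This is an illustration under assumptions, not CDN77’s implementation: it assumes each alert is a dict with a labels mapping that carries alertname and instance (the shape Alertmanager commonly uses), and the summarize function and its max_listed parameter are hypothetical.

```python
from collections import defaultdict


def summarize(alerts, max_listed=3):
    """Collapse many per-server alerts into one line per alert name,
    stating how many servers are affected and listing a few of them."""
    hosts_by_name = defaultdict(set)
    for alert in alerts:
        labels = alert.get("labels", {})
        name = labels.get("alertname", "unknown")
        hosts_by_name[name].add(labels.get("instance", "unknown"))

    lines = []
    for name in sorted(hosts_by_name):
        hosts = sorted(hosts_by_name[name])
        shown = ", ".join(hosts[:max_listed])
        extra = len(hosts) - max_listed
        suffix = f" (+{extra} more)" if extra > 0 else ""
        noun = "server" if len(hosts) == 1 else "servers"
        lines.append(f"{name}: {len(hosts)} {noun} affected: {shown}{suffix}")
    return lines


if __name__ == "__main__":
    sample = [
        {"labels": {"alertname": "NodeDown", "instance": "10.0.0.1"}},
        {"labels": {"alertname": "NodeDown", "instance": "10.0.0.2"}},
        {"labels": {"alertname": "DiskFull", "instance": "10.0.0.9"}},
    ]
    print("\n".join(summarize(sample)))
```

Capping the listed IPs keeps the message readable when hundreds of servers are affected, while the count still conveys the scale.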

The Open Source Quest 🌐

Mirek posed a critical question: “Is there an open-source tool that can handle this elegantly?” The short answer is… not really. He believes that the problem is widespread and that open-sourcing their solution would be hugely beneficial.

Dev Summit Discussion: The team plans to discuss these challenges and potential solutions at the upcoming dev summit.

Key Takeaways & Moving Forward 💡

Quorum is Key: When dealing with geographically distributed clusters, establishing quorum (ensuring a server is reachable from multiple locations) is paramount. Don’t notify about a server issue if you can’t even ping it from a nearby region.
Rate Limiting is Essential: Implement rate limiting to prevent alert floods.
Consider Custom Solutions: If off-the-shelf tools don’t meet your needs, building a custom solution – like CDN77’s – might be the best path forward. 🦾
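The quorum takeaway can be illustrated with a small Python sketch. The talk summary does not spell out the exact rule AQUA uses, so this assumes a simple majority vote: each vantage point (for example, a regional monitoring node) reports whether it sees the problem, and a notification goes out only when enough of them agree. The quorum_confirms function and the location names are hypothetical.

```python
from typing import Dict, Optional


def quorum_confirms(reports: Dict[str, bool],
                    min_votes: Optional[int] = None) -> bool:
    """Decide whether enough vantage points agree that a problem is real.

    `reports` maps a vantage point name to True if that location sees
    the alert firing. By default a strict majority is required.
    """
    if not reports:
        return False  # nobody reported, so nothing is confirmed
    needed = min_votes if min_votes is not None else len(reports) // 2 + 1
    return sum(1 for firing in reports.values() if firing) >= needed


if __name__ == "__main__":
    # Only one of three locations sees the failure: likely a network
    # issue near that location, so hold the notification.
    print(quorum_confirms({"ams": True, "hkg": False, "fra": False}))
    # Two of three agree: treat the failure as real.
    print(quorum_confirms({"ams": True, "hkg": True, "fra": False}))
```

Requiring a majority is one possible policy; a fixed min_votes can be passed instead when the number of vantage points varies.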

The Future of Alert Management 👾

This isn’t just about fixing a technical glitch; it’s about improving operational efficiency and reducing alert fatigue. By addressing these challenges, we can empower teams to focus on what truly matters: proactive monitoring and effective incident response.

Let’s continue the conversation! What are your biggest challenges with Alertmanager? Share your thoughts in the comments below. 👇

Lightning Talk: Alert Quorum Universal Aggregator - AQUA - Mirek Chocholous
