PromCon EU 2025

🚀 Level Up Your Troubleshooting: Introducing the Assertion Framework 💡

Hey everyone! I’m Jorge from Grafana Labs, and I’m thrilled to share a game-changing approach to observability – one designed to cut through the alert chaos and deliver real insights. We’ve been working on this alongside Manoj (who’s enjoying some family time!), and we believe it’s a significant step forward in how we approach troubleshooting. Let’s dive in!

😫 The Alert Fatigue Epidemic – Are You Drowning? 🌊

Let’s be honest, how many of you have felt completely overwhelmed by alerts? It’s a common struggle. At last year’s PromCon conference, we saw firsthand how the sheer volume of alerts – a problem affecting many teams – leads to alert fatigue. We’re talking about a situation where you’re bombarded with notifications, struggling to prioritize, and often reacting to symptoms instead of tackling the root cause. It’s a frustrating cycle, and it’s costing teams valuable time and resources. The goal here is simple: shift from detecting anomalies to understanding them and taking decisive action. 🎯

✨ Introducing the Assertion Framework: A New Way to See 🚦

Our solution? The Assertion Framework. Think of it as a system that transforms alerts into clear, prioritized indicators – like your car’s dashboard lights. Instead of just triggering an alert, assertions provide a concise, visually-driven view of system health, allowing teams to quickly zoom out and understand the bigger picture. It’s about moving beyond reactive firefighting to proactive problem-solving. 🦾

🧱 Key Components – Building the Foundation 🛠️

This framework is built on several key elements, all working together seamlessly:

The SAAFE Taxonomy: We’ve created a classification system – “SAAFE” – that breaks down system health into five categories, one per letter:
- Saturation: Monitoring resource utilization (CPU, memory, etc.).
- Amend: Tracking changes to the system, such as service rollouts.
- Anomaly: Identifying unexpected changes in metrics.
- Failure: Detecting violations of environment integrity.
- Error: Assessing whether services are meeting their expected performance.
Crucially, the severity levels for these “SAAFE” categories are independent of the alert severity. This allows for flexible prioritization – a saturation issue behind a low-severity alert might still warrant immediate attention.
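
To make the point about independent severities concrete, here’s a minimal Python sketch. The `Assertion` and `Severity` types, their fields, and the sample entities are assumptions made up for illustration, not the framework’s actual data model. It simply shows ranking by an assertion’s own severity while the alert severity is carried separately.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Assertion:
    entity: str                                # hypothetical: a service or node name
    category: str                              # a SAAFE category, e.g. "saturation"
    severity: Severity                         # severity of the assertion itself
    alert_severity: Optional[Severity] = None  # severity of a related alert, if any


def by_assertion_severity(assertions):
    """Order assertions by their own severity, ignoring alert severity."""
    return sorted(assertions, key=lambda a: a.severity, reverse=True)


# A critical saturation assertion whose alert is only informational,
# next to a warning-level error assertion whose alert is critical.
disk = Assertion("db-1", "saturation", Severity.CRITICAL, alert_severity=Severity.INFO)
errors = Assertion("api", "error", Severity.WARNING, alert_severity=Severity.CRITICAL)

print([a.entity for a in by_assertion_severity([disk, errors])])
```

Keeping the two severities as separate fields is what lets a dashboard surface `db-1` first even though its alert would never page anyone.
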
Recording Rules & PromQL: The core of the framework relies on Prometheus recording rules, leveraging the power of PromQL to capture critical metrics. We’ve already created example rules for things like:
- Service Rollouts: Detecting new versions of services.
- Resource Saturation: Monitoring resource utilization thresholds.
- Instance Failures: Identifying instances that are down or unhealthy.
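
As a rough idea of what such rules could look like, the Python sketch below renders a Prometheus rule group as YAML text. Everything specific in it is an assumption: the `assertion:*` record names, the thresholds, and the label values are hypothetical, and the expressions assume node_exporter and kube-state-metrics are being scraped. These are not the example rules from the talk.

```python
# Hypothetical assertion rules; names, thresholds, and labels are illustrative only.
RULES = [
    {
        # Rollout: the observed generation of a Deployment changed recently.
        "record": "assertion:rollout",
        "expr": "changes(kube_deployment_status_observed_generation[10m]) > 0",
        "labels": {"assertion": "rollout"},
    },
    {
        # Saturation: average CPU busy fraction per instance above 80%.
        "record": "assertion:cpu_saturation",
        "expr": '(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8',
        "labels": {"assertion": "saturation"},
    },
    {
        # Instance failure: the scrape target is down.
        "record": "assertion:instance_down",
        "expr": "up == 0",
        "labels": {"assertion": "failure"},
    },
]


def render_rule_group(name, rules):
    """Render rules in the Prometheus rule-file layout (groups -> rules)."""
    lines = ["groups:", f"  - name: {name}", "    rules:"]
    for rule in rules:
        lines.append(f"      - record: {rule['record']}")
        # Single-quoted YAML scalar: safe here because no expr contains a single quote.
        lines.append(f"        expr: '{rule['expr']}'")
        lines.append("        labels:")
        for key, value in rule["labels"].items():
            lines.append(f"          {key}: {value}")
    return "\n".join(lines)


rule_file = render_rule_group("assertions", RULES)
print(rule_file)
```

Because each expression ends in a comparison, the recorded series only exists while the condition holds, which is what makes it usable as an on/off assertion.
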
Anomaly Detection: We’ve significantly enhanced our existing anomaly detection framework, incorporating robust statistical methods. This goes beyond simple spikes to capture gradual trends, providing a more accurate picture of system health.
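
This recap doesn’t say which statistical methods are used, so the following is only one common robust approach, swapped in for illustration: a modified z-score built on the median and the median absolute deviation (MAD), which a single outlier cannot skew the way it skews a mean and standard deviation. The function names and the 3.5 threshold are assumptions, and the sketch covers point outliers only, not gradual trends.

```python
from statistics import median


def robust_z_scores(values):
    """Modified z-scores based on the median and MAD instead of mean and stddev."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: nothing can be scored meaningfully.
        return [0.0 for _ in values]
    # 0.6745 scales the MAD so scores are comparable to standard z-scores
    # for normally distributed data.
    return [0.6745 * (v - med) / mad for v in values]


def anomalies(values, threshold=3.5):
    """Indices of samples whose modified z-score exceeds the threshold."""
    return [i for i, z in enumerate(robust_z_scores(values)) if abs(z) > threshold]


latency_ms = [10, 11, 10, 12, 11, 10, 50]
print(anomalies(latency_ms))
```
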
Visualization & Scoring: Assertions are visualized in a timeline, allowing teams to track their impact over time. A scoring system, based on duration and severity, ranks assertions, enabling teams to focus on the most critical issues first. We’re using a heuristic approach, weighting assertions based on their importance – a service rollout assertion, for example, might receive a higher score than a minor saturation issue.
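
Here’s a minimal sketch of what a duration-and-severity heuristic could look like. The weights, field names, and the multiplicative formula are all assumptions for illustration; the actual scoring used by the framework isn’t spelled out in this recap.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only for illustration.
SEVERITY_WEIGHT = {"info": 1.0, "warning": 2.0, "critical": 4.0}
KIND_WEIGHT = {"rollout": 1.5}  # kinds not listed here default to 1.0


@dataclass(frozen=True)
class Assertion:
    name: str
    kind: str
    severity: str
    start_s: int  # when the assertion started firing (seconds)
    end_s: int    # when it stopped, or "now" if still active


def score(a):
    """Minutes active, scaled by severity and by how important the kind is."""
    minutes = (a.end_s - a.start_s) / 60
    return minutes * SEVERITY_WEIGHT[a.severity] * KIND_WEIGHT.get(a.kind, 1.0)


def rank(assertions):
    """Highest score first."""
    return sorted(assertions, key=score, reverse=True)


timeline = [
    Assertion("cpu-saturation", "saturation", "info", 0, 1200),
    Assertion("new-version", "rollout", "warning", 0, 600),
    Assertion("error-ratio", "error", "critical", 0, 600),
]
for a in rank(timeline):
    print(a.name, score(a))
```

With these made-up weights, the ten-minute rollout outranks the twenty-minute informational saturation assertion, which mirrors the weighting described above.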

🚗 Example: Service Rollout Assertions – Context is Key 💡

Let’s illustrate with a service rollout. Imagine a new version of a service is deployed. Our recording rule detects this, generating an assertion that immediately highlights the potential impact on other services and resource utilization. This assertion, visualized in a timeline, provides immediate context – you can see why the alert is happening and what needs to be investigated.
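
One way to picture the “context” a rollout assertion adds is to line it up against alert start times on the same timeline. The sketch below is an illustration under stated assumptions, not how the framework correlates events: the function name, the 15-minute window, and the sample alerts are all hypothetical.

```python
def alerts_following_rollout(rollout_ts, alerts, window_s=900):
    """Names of alerts that started within window_s seconds after the rollout.

    alerts is a list of (name, start_timestamp_in_seconds) pairs.
    """
    return [
        name
        for name, started in alerts
        if 0 <= started - rollout_ts <= window_s
    ]


rollout_ts = 1000
alerts = [
    ("HighLatency", 1120),  # 2 minutes after the rollout
    ("DiskFull", 400),      # before the rollout, so unrelated to it
    ("ErrorRate", 1600),    # 10 minutes after the rollout
    ("CertExpiry", 5000),   # well outside the window
]
print(alerts_following_rollout(rollout_ts, alerts))
```

Proximity in time is only a hint, not proof of cause, but it tells you where to start looking.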

🚀 Beyond Alerts: Knowledge Graph & Entity Mapping – The Next Level 🌐

We’re taking this a step further with our product, Knowledge Graph. This combines the Assertion Framework with Entity Mapping – extracting a graph of relationships between services, infrastructure, and data. By visualizing assertions alongside these relationships, teams gain a holistic understanding of the system and can quickly identify the root cause of problems. It’s like having a map of your entire infrastructure, with critical issues highlighted in real-time. 🗺️
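
To sketch the idea of reading assertions alongside relationships, the Python example below walks a small dependency graph from a troubled service and lists the dependencies that currently have active assertions, nearest first. The topology, the entity names, and the function are hypothetical; this is a plain breadth-first search, not the product’s actual algorithm.

```python
from collections import deque

# Hypothetical topology: each entity maps to the entities it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["postgres"],
    "cart": ["redis"],
    "postgres": [],
    "redis": [],
}


def asserting_dependencies(start, depends_on, asserting):
    """Breadth-first walk from start; return asserting dependencies, nearest first."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        entity = queue.popleft()
        for dep in depends_on.get(entity, []):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in asserting:
                found.append(dep)
            queue.append(dep)
    return found


# Entities with at least one active assertion right now (made-up data).
asserting = {"payments", "postgres"}
print(asserting_dependencies("checkout", DEPENDS_ON, asserting))
```

Here the walk from `checkout` points at `payments` and then `postgres`, so the database is the natural place to look for the root cause.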

🛠️ Tools & Technologies – The Tech Stack 💾📡

Here’s the tech stack powering the Assertion Framework:

Prometheus: The core monitoring system.
PromQL: The query language for Prometheus.
Alertmanager: For alert distribution and deduplication.
OpenTelemetry: Used for collecting telemetry data.
Kubernetes: A common deployment environment.

⚠️ Challenges & Tradeoffs – Realistic Expectations 🧐

Let’s be upfront – implementing this framework isn’t without its challenges:

Rule Complexity: Creating robust recording rules can be complex and requires a deep understanding of the system. It’s an investment in learning and expertise.
Subjectivity in Scoring: The scoring heuristic is a starting point and may require adjustment based on specific team needs. It’s a flexible system that can be tailored to your environment.
Initial Investment: Implementing the framework requires an initial investment in time and resources to define assertions and visualize them effectively.

✨ Key Takeaway: Augment, Don’t Replace 🌟

The Assertion Framework isn’t about replacing alerts; it’s about augmenting them. It’s about transforming alerts from a source of noise into a source of actionable intelligence, empowering teams to proactively troubleshoot and prevent incidents. We encourage you to explore the framework, experiment with the rules, and contribute to its evolution. Let’s build a more resilient and observable future, together! 🤝

SAAFE - A prioritized alerting model to troubleshoot your incidents - Jorge Creixell, Manoj Acharya
