🤖 The Trolley Problem of Metrics: Navigating Collection Resiliency 🚀
Hey everyone! 👋 Let’s talk about a surprisingly relevant thought experiment that’s impacting how we manage our data pipelines – the Trolley Problem. It might sound a bit heavy, but it perfectly illustrates a critical challenge in collection resiliency and how we’re tackling it.
🤯 The Trolley Problem in Data Collection
The classic Trolley Problem asks: if you can divert a runaway trolley to save five people, but doing so will kill one, what do you do? In the world of data collection, we’re facing a similar dilemma. Imagine your data collection system – think Prometheus, OpenTelemetry collectors, or any other metric pipeline – is nearing its memory limit. You’re staring down a massive influx of data, potentially high-cardinality data (meaning many unique label combinations, each of which becomes its own time series), and you know that processing it could lead to a crash.
You have a choice:
- Option 1: Stop the scrape. This might prevent a server-wide crash, but it could also mean missing valuable data.
- Option 2: Ignore the problem and keep ingesting. Let the system continue to struggle under the load, potentially leading to a catastrophic failure.
Neither option is ideal, and that’s where the complexity arises. It’s like deciding whether to sacrifice one metric to save the whole system – a tough call with potentially unforeseen consequences.
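To make the trade-off concrete, here is a minimal Go sketch of that decision point. It is not any real collector’s code; the soft memory budget and the `acceptScrape` hook are hypothetical names, used only to illustrate the choice between refusing a scrape (data lost, process protected) and ingesting it anyway (gambling with an out-of-memory crash).

```go
package main

import (
	"fmt"
	"runtime"
)

// softLimitBytes is a hypothetical memory budget for the collector process.
const softLimitBytes = 512 << 20 // 512 MiB

// acceptScrape is a hypothetical decision point: given the estimated cost of
// ingesting one more scrape, choose between "stop the scrape" (lose data,
// protect the process) and "keep going" (risk an out-of-memory crash).
func acceptScrape(estimatedScrapeBytes uint64) bool {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.HeapAlloc+estimatedScrapeBytes < softLimitBytes
}

func main() {
	if acceptScrape(64 << 20) { // pretend the incoming scrape costs ~64 MiB
		fmt.Println("ingesting scrape")
	} else {
		fmt.Println("dropping scrape to protect the process")
	}
}
```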
📉 The Push vs. Pull Dilemma
This challenge is particularly acute with push architectures, where data is actively pushed to collectors. While push receivers can handle retries, they can’t buffer indefinitely. Eventually, the receiver will hit its memory limit, forcing it to discard data – essentially “shifting” the problem elsewhere.
- Push: the client retries until the receiver’s buffers fill, then data gets dropped anyway – an eventual collapse.
- Pull: the collector itself has to make a potentially destructive decision about what not to collect.
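Here is a tiny Go sketch of why push can’t buffer forever, assuming a hypothetical receiver with a bounded queue (none of these names come from a real receiver implementation): once the queue is full, the only options are to reject the batch and hope the client retries, or to drop the data.

```go
package main

import "fmt"

// A hypothetical push receiver with a fixed-size buffer. Once the buffer is
// full, new batches are rejected; the client can retry, but if pressure
// persists the data is eventually discarded, so the problem is only shifted.
type receiver struct {
	queue chan []byte // bounded buffer of pending batches
}

func newReceiver(capacity int) *receiver {
	return &receiver{queue: make(chan []byte, capacity)}
}

// push returns false when the buffer is full, signalling the client to retry.
func (r *receiver) push(batch []byte) bool {
	select {
	case r.queue <- batch:
		return true
	default:
		return false // buffer full: back-pressure, and eventually data loss
	}
}

func main() {
	r := newReceiver(2)
	for i := 0; i < 4; i++ {
		fmt.Printf("batch %d accepted: %v\n", i, r.push([]byte("metrics")))
	}
}
```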
⚠️ The Risk of Random Killing
Currently, some collectors are employing a somewhat haphazard approach – essentially killing metrics randomly. This is a risky strategy because you don’t know if the metric you’re discarding is truly critical.
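Here is roughly what that looks like, as a hedged Go sketch rather than any collector’s actual logic: eviction picks an arbitrary series, and the victim might be the one your alerting or autoscaling depends on.

```go
package main

import "fmt"

// evictRandom sketches the current haphazard approach: when memory is tight,
// drop an arbitrary series. Go map iteration order is unspecified, so which
// series dies is effectively random.
func evictRandom(series map[string][]float64) string {
	for name := range series {
		delete(series, name)
		return name
	}
	return ""
}

func main() {
	series := map[string][]float64{
		"http_requests_total":    {1, 2, 3},
		"queue_depth":            {7},
		"autoscaler_target_load": {0.8}, // might be the one you really needed
	}
	fmt.Println("evicted:", evictRandom(series))
}
```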
💡 Solutions in Progress: Iteration and Criticality
So, what are we doing about it? The good news is, the team is actively exploring solutions, and it’s all about iteration and gathering user feedback. Here’s what’s on the table:
- Early Garbage Collection: Implementing techniques like early compaction to proactively reduce memory usage – a bit like accepting smaller block sizes in exchange for overall stability.
- Slower Scrapes: Temporarily scraping less often (e.g., moving from every second to every minute) to relieve pressure on the system. Monitoring keeps working, just at a coarser resolution, and sample volume drops. Both of these mitigations are sketched in code right after this list.
- Metric Criticality Flags: A key future direction is incorporating metadata into metrics themselves. Imagine each metric having a “criticality” flag – indicating whether it’s essential for autoscaling, alerting, or other vital functions. This would allow collectors to intelligently prioritize which metrics to preserve.
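The sketch below shows the shape of the first two ideas in plain Go, under assumed names (`softLimitBytes`, `relievePressure`); it is not how Prometheus implements early head compaction, just an illustration: when the heap nears a budget, trigger a garbage-collection pass early and back off the scrape interval until pressure eases.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

const softLimitBytes = 512 << 20 // hypothetical soft memory budget

// relievePressure applies the two mitigations: force an early garbage
// collection when the heap approaches the budget, and back off the scrape
// interval so fewer samples arrive while the system recovers.
func relievePressure(interval time.Duration) time.Duration {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	if ms.HeapAlloc > softLimitBytes*8/10 { // above 80% of the budget
		debug.FreeOSMemory() // force a GC cycle and return memory to the OS
		if interval < time.Minute {
			interval *= 2 // scrape less often, e.g. 1s -> 2s -> ... -> 1m
		}
	}
	return interval
}

func main() {
	interval := time.Second
	interval = relievePressure(interval)
	fmt.Println("next scrape interval:", interval)
}
```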
Currently, the team is considering an “optional flag” to allow users to selectively kill metrics, acknowledging that there’s no perfect solution.
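One hypothetical shape such an optional flag could take (purely illustrative, not a real configuration surface): the operator explicitly lists which metric name prefixes are safe to shed first, so any sacrifice is at least an informed, opt-in one.

```go
package main

import (
	"fmt"
	"strings"
)

// droppablePrefixes is a hypothetical user-supplied "optional flag": the
// operator explicitly marks metric name prefixes that may be shed first
// when the collector is under memory pressure.
var droppablePrefixes = []string{"debug_", "test_"}

// mayDrop reports whether the user has opted this metric in for sacrifice.
func mayDrop(metricName string) bool {
	for _, p := range droppablePrefixes {
		if strings.HasPrefix(metricName, p) {
			return true
		}
	}
	return false
}

func main() {
	for _, m := range []string{"debug_cache_entries", "http_requests_total"} {
		fmt.Printf("%s droppable: %v\n", m, mayDrop(m))
	}
}
```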
🎯 The Goal: Controlled Sacrifice
The overarching goal is to move beyond reactive crisis management to a more proactive approach. We need to be able to consciously decide which metrics to sacrifice, understanding the potential impact.
💾 Looking Ahead: Telemetry Health
The team is also focusing on “Telemetry Health,” which will involve adding schema information to metrics, including their criticality. This will empower collectors to make informed decisions about which metrics to prioritize.
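As a rough illustration of how criticality in the schema could turn eviction from a random choice into an informed one, here is a hedged Go sketch; the `Criticality` levels and the `metricSchema` struct are assumptions for this example, not an existing Telemetry Health API.

```go
package main

import (
	"fmt"
	"sort"
)

// Criticality is hypothetical schema metadata of the kind Telemetry Health
// could attach to each metric, telling collectors how painful it is to lose.
type Criticality int

const (
	Optional  Criticality = iota // safe to shed first
	Important                    // useful, but not load-bearing
	Critical                     // feeds alerting or autoscaling; shed last
)

type metricSchema struct {
	Name        string
	Criticality Criticality
}

// evictionOrder sorts metrics so the least critical are dropped first.
func evictionOrder(schemas []metricSchema) []metricSchema {
	sort.SliceStable(schemas, func(i, j int) bool {
		return schemas[i].Criticality < schemas[j].Criticality
	})
	return schemas
}

func main() {
	order := evictionOrder([]metricSchema{
		{"autoscaler_target_load", Critical},
		{"debug_cache_entries", Optional},
		{"http_requests_total", Important},
	})
	fmt.Println("eviction order (least critical first):")
	for _, m := range order {
		fmt.Println(" -", m.Name)
	}
}
```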
🛠️ Your Feedback Matters!
This isn’t a problem we can solve in isolation. We need your input! What would you like to see implemented? How would you approach this “Trolley Problem” in your own data pipelines? Let us know your thoughts – your feedback is crucial to shaping the future of collection resiliency.
Let’s work together to build data systems that are not only robust but also intelligently manage their resources. ✨