🚀 Unleashing the Power of Change: Delta Temporality in Prometheus 🤖
Time-series data is the lifeblood of modern monitoring and observability. But how we interpret that data – specifically, how we represent change – is surprisingly complex. Today, we’re diving deep into a fascinating evolution within Prometheus, spearheaded by Grafana Labs, exploring the concept of “delta temporality” and its potential to revolutionize how we track and analyze metrics. Let’s unpack this and see what it means for your monitoring stack. 💡
🕰️ Cumulative vs. Delta: A Fundamental Shift
Traditionally, Prometheus has operated on what’s called “cumulative temporality.” Think of it like a running total: each sample of a counter metric reports the total accumulated since the process started. This is convenient, but it has limitations. It’s like a roadside sign showing the total number of cars that have ever passed a point – every reading is a grand total, and to learn how many new cars arrived you have to subtract one reading from the next yourself.
Enter “delta temporality,” as used by OpenTelemetry (OTel). This approach reports the change in value – the amount accumulated since the previous sample. Each sample carries its own exclusive start time, marking the beginning of the interval it covers. This is like reporting how many cars passed during each interval – you get a direct picture of the flow, sample by sample. 🎯
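To make the distinction concrete, here’s a minimal sketch in Go. The Sample type and the numbers are illustrative, not Prometheus’s internal data model – it just shows the same car counter encoded both ways:

```go
package main

import "fmt"

// Illustrative sample type: StartMs is only meaningful for delta samples.
type Sample struct {
	StartMs int64   // start of the interval this sample covers (delta only)
	TimeMs  int64   // when the sample was taken
	Value   float64
}

func main() {
	// Cumulative: each sample carries the running total since process start.
	cumulative := []Sample{
		{TimeMs: 0, Value: 0},
		{TimeMs: 60_000, Value: 12},  // 12 cars total so far
		{TimeMs: 120_000, Value: 30}, // 30 cars total so far
	}

	// Delta: each sample carries only the change over its own interval,
	// with an explicit start time marking where that interval begins.
	deltas := []Sample{
		{StartMs: 0, TimeMs: 60_000, Value: 12},       // 12 new cars
		{StartMs: 60_000, TimeMs: 120_000, Value: 18}, // 18 new cars
	}

	// Same information, two encodings: a delta is the difference between
	// two consecutive cumulative samples.
	fmt.Println(cumulative[2].Value-cumulative[1].Value, deltas[1].Value) // 18 18
}
```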
🚧 Prometheus’s Current Struggle: Conversion and Loss
Currently, Prometheus handles delta metrics ingested via OTLP by converting them into cumulative metrics: each incoming delta sample is added to a running total. While this works, it’s not without problems. The biggest challenge? Out-of-order samples. If a delta sample arrives late, the running total has already moved past it – folding it in correctly would mean recomputing every cumulative sample written since, so in practice the data ends up skewed or lost. 👾
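Here’s a simplified sketch of that conversion, with illustrative names rather than Prometheus’s actual internals, showing how a single out-of-order delta corrupts the running total:

```go
package main

import "fmt"

type Sample struct {
	TimeMs int64
	Value  float64
}

// deltasToCumulative folds each delta into a running total, in arrival
// order, so the series can be stored as a cumulative counter. This is a
// sketch of the idea, not Prometheus's actual OTLP ingestion code.
func deltasToCumulative(deltas []Sample) []Sample {
	out := make([]Sample, 0, len(deltas))
	var total float64
	for _, d := range deltas {
		total += d.Value
		out = append(out, Sample{TimeMs: d.TimeMs, Value: total})
	}
	return out
}

func main() {
	inOrder := []Sample{{60_000, 12}, {120_000, 18}}
	fmt.Println(deltasToCumulative(inOrder)) // totals 12, 30 — correct

	// Out of order: the t=60s delta arrives after the t=120s one. The
	// running totals now land on the wrong timestamps, and the samples
	// already written cannot be cheaply corrected after the fact.
	outOfOrder := []Sample{{120_000, 18}, {60_000, 12}}
	fmt.Println(deltasToCumulative(outOfOrder)) // totals 18, 30 — skewed
}
```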
To compensate for samples not landing exactly on the query range boundaries, functions like “rate” and “increase” use “edge extrapolation”: they estimate the values at the edges of the query range. However, this extrapolation is prone to inaccuracy when sample intervals are irregular – a common occurrence with delta metrics – and can significantly skew increase calculations, making growth harder to track accurately. 📉
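As a rough illustration, here’s a deliberately simplified version of the extrapolation idea. Prometheus’s real implementation (extrapolatedRate in the promql package) also limits how far it extrapolates and handles counter resets; this sketch just shows why irregular intervals make the estimate shaky:

```go
package main

import "fmt"

// extrapolatedIncrease stretches the increase observed between the first
// and last samples to cover the full query range, assuming the rate was
// steady at the edges. Simplified for illustration.
func extrapolatedIncrease(firstT, lastT, rangeStart, rangeEnd int64, firstV, lastV float64) float64 {
	sampled := lastV - firstV                  // increase actually observed
	sampledDur := float64(lastT - firstT)      // window the samples cover
	rangeDur := float64(rangeEnd - rangeStart) // window the query asked for
	return sampled * rangeDur / sampledDur     // stretch to the full range
}

func main() {
	// Samples cover 100s of a 120s range with an observed increase of 50,
	// so the estimate scales it up to 60. Irregular delta intervals
	// routinely break the steady-rate assumption behind that scaling.
	fmt.Println(extrapolatedIncrease(10_000, 110_000, 0, 120_000, 0, 50)) // 60
}
```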
✨ The Promise of Native Delta Support: A New Path Forward
The goal is simple: ingest delta metrics directly, without the conversion step. This would offer greater flexibility, potentially improved accuracy, and a more faithful representation of the underlying system. The proposed solution centers on a new field: the “created timestamp.” 💾
This timestamp, attached to each sample, lets Prometheus accurately detect resets – moments when a counter restarts from zero – and apply standard Prometheus functions like “increase” and “rate” with confidence. 💪
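A minimal sketch of how reset detection with created timestamps could work – the field and function names are assumptions for illustration, not the actual Prometheus implementation:

```go
package main

import "fmt"

type Sample struct {
	CreatedMs int64 // when the counter was (re)started — the created timestamp
	TimeMs    int64
	Value     float64
}

// resetBetween reports whether the counter was re-created between two
// samples. A newer created timestamp means the counter restarted (e.g.,
// the process was rebooted), even if the delta values alone give no hint.
func resetBetween(prev, cur Sample) bool {
	return cur.CreatedMs > prev.CreatedMs
}

func main() {
	a := Sample{CreatedMs: 0, TimeMs: 60_000, Value: 12}
	b := Sample{CreatedMs: 55_000, TimeMs: 120_000, Value: 3}
	fmt.Println(resetBetween(a, b)) // true: treat b as counting from zero again
}
```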
🤯 Challenges and Trade-offs: Navigating the Complexity
Implementing native delta support isn’t a walk in the park. Here’s what’s involved:
- Out-of-Order Handling: This is the big challenge. Maintaining accurate increase calculations when samples arrive out of order requires sophisticated logic.
- Data Irregularity: Delta metrics often have irregular sample intervals, making extrapolation less reliable.
- Query Engine Complexity: Changing the query engine to support delta temporality will inevitably impact performance and potentially introduce inconsistencies. 🛠️
- User Choice: A key suggestion is to introduce user-configurable calculation modifiers, similar to VictoriaMetrics. This would let users choose between different calculation methods (e.g., sum over time vs. extrapolation), offering greater control but adding complexity – see the sketch after this list. ⚙️
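Here’s a hypothetical sketch of what such a modifier could look like. The mode names and the increase function are invented for illustration, loosely modeled on VictoriaMetrics’s configurable functions – not an actual Prometheus or VictoriaMetrics API:

```go
package main

import "fmt"

type Sample struct {
	TimeMs int64
	Value  float64
}

// increase computes the growth over a query range using one of two
// hypothetical user-selected modes.
func increase(samples []Sample, rangeStart, rangeEnd int64, mode string) float64 {
	switch mode {
	case "sum_over_time":
		// For delta series, the increase over a window is simply the sum
		// of the delta values inside it — no extrapolation needed.
		var total float64
		for _, s := range samples {
			if s.TimeMs > rangeStart && s.TimeMs <= rangeEnd {
				total += s.Value
			}
		}
		return total
	case "extrapolate":
		// Cumulative-style estimate: stretch the increase observed between
		// the first and last samples to the full range (simplified).
		first, last := samples[0], samples[len(samples)-1]
		var total float64
		for _, s := range samples[1:] {
			total += s.Value
		}
		return total * float64(rangeEnd-rangeStart) / float64(last.TimeMs-first.TimeMs)
	}
	return 0
}

func main() {
	deltas := []Sample{{30_000, 10}, {60_000, 12}, {90_000, 18}}
	fmt.Println(increase(deltas, 0, 120_000, "sum_over_time")) // 40
	fmt.Println(increase(deltas, 0, 120_000, "extrapolate"))   // 60
}
```

Note how the two modes disagree on the same data: that gap is exactly the trade-off between faithfulness to the samples and coverage of the full query range that a user-facing modifier would expose.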
📡 Tools and Inspiration: OTLP and Victoria Metrics
- OTLP (OpenTelemetry Protocol): The OpenTelemetry wire protocol through which delta metrics flow into Prometheus.
- VictoriaMetrics: A valuable reference point, showcasing a flexible approach to metric processing with configurable calculation methods. 🌐
🎯 Conclusion: A Balancing Act
Prometheus’s journey towards native delta temporality is a complex one, filled with significant challenges. The proposed solution, leveraging the “created timestamp,” is a promising step forward. However, careful consideration must be given to edge cases, query engine complexity, and the potential impact on performance. The shift towards user-configurable calculation modifiers represents a crucial balance between flexibility and maintainability. 💫
Ultimately, understanding the difference between cumulative and delta temporality is key to unlocking the full potential of your time-series data. It’s a fascinating evolution, and one that promises to make monitoring and observability even more powerful. 🚀