Level Up Your Observability: Crypto’s Journey to Automatic Aggregation 🚀
Observability is the lifeblood of any modern, rapidly scaling operation – especially in the dynamic world of cryptocurrency. At Crypto, we’ve been on a fascinating journey to build a robust and effortless observability system, and today, I want to share the key insights we’ve gained. Buckle up, because it’s a story of challenges, pivots, and ultimately, a smarter, more scalable approach. 💡
The Initial Struggle: A Metric Avalanche 🌊
Let’s be honest – starting out with observability can feel like trying to catch water in a sieve. Initially, we were focused on tracing, but we quickly realized we needed a system capable of handling the sheer volume of data generated by our 1,000+ users (SREs, data scientists, and researchers), all monitoring their services. We were processing around 15 million data points per second, representing a staggering one billion active series. 🤯
Graphite’s Limitations: The Cardinality Conundrum 📉
Our initial solution was Graphite, a popular time-series database. However, we hit a major roadblock: cardinality. As more instances of our services were deployed, the number of metrics exploded, while a single query could only cover 10,000 metrics at a time. This meant our users were struggling to effectively monitor their services – a critical issue for a platform built on speed and reliability.
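To see how quickly that query ceiling is reached, here is a minimal sketch. It assumes every instance emits every metric name once; the service size (200 metric names) and the fleet sizes are hypothetical, and only the 10,000 limit comes from our setup.

```python
QUERY_LIMIT = 10_000  # metrics a single query could cover


def series_count(metric_names: int, instances: int) -> int:
    """Distinct series when every instance emits every metric name once."""
    return metric_names * instances


# Hypothetical service with 200 metric names and a growing fleet.
for instances in (10, 50, 500):
    total = series_count(200, instances)
    print(f"{instances} instances -> {total} series, over limit: {total > QUERY_LIMIT}")
```

With these made-up numbers, the service crosses the limit as soon as it grows past 50 instances, without a single new metric name being added.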
Overseer: A Bold Experiment 🧪
Recognizing the need for change, our team developed “Overseer,” our first attempt at automatic aggregation. Overseer was a clever solution, leveraging Prometheus and a unique data format. We mapped metric names to metric types (counter, gauge, histogram), and the type dictated the appropriate aggregation method (rate, sum, percentiles). But here’s the really interesting part: we removed instance labels. This was a bold decision, because it sacrificed the ability to drill down to a specific instance for troubleshooting, but it dramatically reduced cardinality, simplifying queries and boosting performance. The result? We slashed 120 million metrics down to under 4 million, roughly a 30x reduction in the number of metrics! 😲
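The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Overseer’s actual code: the suffix convention in `TYPE_BY_SUFFIX`, the function names, and the sample data are all hypothetical, and counter values are assumed to have been converted to per-second rates already, so summing across instances is valid.

```python
from collections import defaultdict

# Assumed naming convention for inferring a metric's type from its name.
TYPE_BY_SUFFIX = {"_total": "counter", "_bucket": "histogram"}


def metric_type(name: str) -> str:
    """Infer the metric type from the name; anything unrecognized is a gauge."""
    for suffix, kind in TYPE_BY_SUFFIX.items():
        if name.endswith(suffix):
            return kind
    return "gauge"


def aggregate(samples, drop=("instance",)):
    """Sum samples that share a name and labels once the `drop` labels are removed."""
    out = defaultdict(float)
    for name, labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[(name, kept)] += value
    return dict(out)


samples = [
    ("http_requests_total", {"service": "api", "instance": "a"}, 2.0),
    ("http_requests_total", {"service": "api", "instance": "b"}, 3.0),
    ("http_requests_total", {"service": "web", "instance": "a"}, 1.0),
]
result = aggregate(samples)  # three input series collapse into two
```

Summing also works for histogram buckets, since percentiles can then be estimated from the merged bucket counts. The cost is the one described above: once the `instance` label is gone, the per-instance view cannot be recovered from the aggregated series.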
VictoriaMetrics and the Streaming Revolution 📡
While Overseer was a significant step forward, we knew we couldn’t stop there. We transitioned to VictoriaMetrics, a Prometheus-compatible backend, offering improved scalability and performance. More recently, we’ve been exploring streaming aggregations within Prometheus itself, utilizing a new feature flag. This approach, built on OpenTelemetry conventions, promises to further reduce resource consumption and simplify our architecture – a key focus for us. 🤖
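To make “streaming aggregation” concrete: samples are folded into running totals as they arrive, and only the aggregated series is emitted once per interval, so the raw per-instance series never need to be stored. The sketch below is a toy illustration of that idea, not how Prometheus or VictoriaMetrics implement it. The class name, the 60-second interval, and the sample data are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict


class StreamingSum:
    """Keeps one running total per output series and emits it once per interval."""

    def __init__(self, interval_s: int, drop=("instance",)):
        self.interval_s = interval_s
        self.drop = drop
        self.window_start = None
        self.totals = defaultdict(float)

    def push(self, ts: int, name: str, labels: dict, value: float):
        """Add one sample; return (window_start, totals) when `ts` opens a new window."""
        window = ts - ts % self.interval_s
        flushed = None
        if self.window_start is not None and window > self.window_start:
            # The previous window is complete: hand it over and start fresh.
            flushed = (self.window_start, dict(self.totals))
            self.totals.clear()
        if self.window_start is None or window > self.window_start:
            self.window_start = window
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in self.drop))
        self.totals[(name, kept)] += value
        return flushed


agg = StreamingSum(interval_s=60)
agg.push(0, "cpu", {"instance": "a", "svc": "x"}, 1.0)
agg.push(30, "cpu", {"instance": "b", "svc": "x"}, 2.0)
first_window = agg.push(61, "cpu", {"instance": "a", "svc": "x"}, 4.0)
```

Memory use here scales with the number of aggregated output series rather than with the number of instances, which is where the resource savings would come from.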
Key Tradeoffs: Balancing Act ⚖️
This journey hasn’t been without its challenges. Here’s a breakdown of the key tradeoffs we’ve had to consider:
- Instance Labels: Removing instance labels provided massive benefits in terms of cardinality reduction and query speed, but required careful consideration and a willingness to accept limitations for troubleshooting.
- Graphite Shadow: We still maintain a legacy Graphite infrastructure, presenting a potential hurdle for future modernization.
- OpenTelemetry Adoption: Our long-term strategy is firmly rooted in embracing OpenTelemetry standards, ensuring interoperability and reducing vendor lock-in. This is a crucial step towards a more flexible and future-proof observability stack.
- Cost vs. Retention: Continuously balancing retention periods with operational costs is a constant challenge, driving our exploration of streaming aggregations.
Quantifiable Results: The Numbers Speak 📊
Let’s look at the impact of our efforts:
- Data Ingestion: 15 million data points per second (1 billion active series).
- Aggregation: 100 million aggregated metrics stored for 3 months, downsampled for a year.
- Metric Reduction: 120 million metrics reduced to under 4 million through automatic aggregation.
- Deployment Stability: We’ve seen a significant reduction in deployment-related issues, directly attributable to the reliable metric availability provided by our improved system. 💪
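A quick back-of-the-envelope check puts these figures in perspective. The sketch only divides the numbers quoted above; the variable names are ours, and because the aggregated count is “under 4 million,” the reduction factor is a lower bound.

```python
points_per_second = 15_000_000
active_series = 1_000_000_000
metrics_before = 120_000_000
metrics_after = 4_000_000  # "under 4 million", so treat this as an upper bound

# Average time between two points of the same series, in seconds.
avg_interval_s = active_series / points_per_second

# Minimum reduction factor achieved by automatic aggregation.
reduction_factor = metrics_before / metrics_after

print(f"average interval per series: {avg_interval_s:.1f} s")
print(f"reduction factor: at least {reduction_factor:.0f}x")
```

In other words, each active series receives a new point about once every 67 seconds on average, and aggregation shrinks the metric count by a factor of at least 30.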
Looking Ahead: A Native, Automated Future 🎯
Our ultimate goal is to build a native automatic aggregation system, leveraging OpenTelemetry standards and Prometheus’s streaming aggregation capabilities. We envision a future where our observability stack is completely streamlined, reducing operational overhead and empowering our users with a truly effortless experience. We’re confident that this pragmatic approach – born from a technical necessity – will continue to drive reliability and innovation at Crypto. 💫
This journey has been a testament to the power of iterative development and a willingness to embrace new technologies. It’s a reminder that observability isn’t just about collecting data; it’s about making that data usable and actionable. Let’s continue to explore, experiment, and build the best possible observability solutions for the future! 🛠️