Level Up Your Prometheus Observability with Automated PR Checks 🚀💡
Hey everyone! 👋 Let’s dive into a practical way to make sure best practices are followed when users modify your Prometheus configurations. This isn’t a polished presentation; it’s a lightning talk born from a lunchtime brainstorm, but the core ideas are valuable. 🤖
The Problem: Manual Validation is a Bottleneck 🧱
Traditionally, when users (like those on your platform observability team) created new scrape targets or alerting rules in the Prometheus repositories, validation was entirely manual. 🧑‍💻 A human operator had to review every pull request and check it for potential issues. The challenge? This was:
- Cumbersome: A huge time sink.
- Expert-Dependent: The quality of validation relied heavily on the operator’s expertise.
- Prone to Errors: Manual checks are susceptible to human oversight.
The Solution: PR Checker – Automating the Review 🛠️
To tackle this, the team built a project called “PR Checker.” It’s a Go-based tool designed to automate the validation process. Here’s how it works:
- Go Implementation: Prometheus itself is written in Go, so a Go tool can parse Prometheus configuration and rules and take advantage of Prometheus-specific features.
- Limited PromQL: Not every PromQL feature is allowed. The restriction is intentional and keeps the set of checks focused and manageable.
- Integrated Checks: The PR Checker is integrated directly into the repository and runs on every new pull request and on every update to an existing one.
- Smart Feedback: Instead of simply failing the build, the PR Checker provides feedback in the pull request comments:
  - Warnings: Non-critical issues that should be addressed but do not block the pull request.
  - Errors: Critical issues that must be fixed and that fail the build.
- Comprehensive Checks: The PR Checker currently validates:
  - Target Connectivity: Ensuring targets are reachable.
  - Metric Count: Verifying the number of metrics being collected.
  - Alert Definition Checks: Analyzing alert expressions for potential problems.
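To make the warning/error split concrete, here is a minimal sketch of what a metric count check could look like. It is not the actual PR Checker code: the `Finding` type, the function names, and the two thresholds (`warnAt`, `failAt`) are assumptions made up for illustration. It counts sample lines in a Prometheus text exposition payload and decides whether the result is a warning or a build-failing error.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Severity distinguishes advisory findings from ones that fail the build.
type Severity int

const (
	Warning Severity = iota
	Error
)

// Finding is one piece of feedback destined for a pull request comment.
// (Hypothetical type, not taken from the real tool.)
type Finding struct {
	Severity Severity
	Message  string
}

// countSamples counts sample lines in a Prometheus text exposition payload,
// skipping blank lines and comment lines such as "# HELP" and "# TYPE".
func countSamples(exposition string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		n++
	}
	return n
}

// checkMetricCount compares a sample count against two assumed thresholds:
// above warnAt yields a Warning, above failAt yields an Error.
func checkMetricCount(target string, samples, warnAt, failAt int) []Finding {
	switch {
	case samples > failAt:
		return []Finding{{Error, fmt.Sprintf("%s exposes %d samples (limit %d)", target, samples, failAt)}}
	case samples > warnAt:
		return []Finding{{Warning, fmt.Sprintf("%s exposes %d samples (soft limit %d)", target, samples, warnAt)}}
	}
	return nil
}

// shouldFailBuild reports whether any finding is an Error.
func shouldFailBuild(findings []Finding) bool {
	for _, f := range findings {
		if f.Severity == Error {
			return true
		}
	}
	return false
}

func main() {
	exposition := "# HELP up Target health\n# TYPE up gauge\nup 1\nhttp_requests_total{code=\"200\"} 5\n"
	findings := checkMetricCount("job/example", countSamples(exposition), 1, 1000)
	for _, f := range findings {
		fmt.Printf("severity=%d %s\n", f.Severity, f.Message)
	}
	fmt.Println("fail build:", shouldFailBuild(findings))
}
```

A real check would fetch the payload from the target's metrics endpoint, which would also cover the connectivity check. The payload is passed in as a string here so the sketch stays self-contained.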
The Impact: Driving Best Practices 🎯
The PR Checker isn’t just about catching errors; it’s about actively promoting best practices. Here’s what they’ve observed:
- Team Efficiency: The automated system has significantly streamlined the validation process, especially as the team and the number of pull requests grow.
- User Engagement: Users are more likely to adopt best practices when they receive direct, actionable feedback on their pull requests.
- Documentation Leverage: The PR Checker cleverly uses pull request comments to link users to relevant documentation – a surprisingly effective way to drive adoption of recommended practices. (As Ivana pointed out, users often don’t read documentation before submitting a pull request.)
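The documentation-link idea can be sketched in a few lines. The check IDs, the `docLinks` table, and the `docs.example.com` URLs below are all hypothetical placeholders, not names from the real tool. The point is only that each finding can carry a pointer to the relevant documentation page when the comment is rendered.

```go
package main

import (
	"fmt"
	"strings"
)

// Finding is a hypothetical result of one check.
type Finding struct {
	CheckID string // identifies which check produced the finding
	Level   string // "warning" or "error"
	Message string
}

// docLinks maps assumed check IDs to placeholder documentation URLs.
var docLinks = map[string]string{
	"alert-missing-for": "https://docs.example.com/alerting#for-duration",
	"metric-count":      "https://docs.example.com/scraping#cardinality",
}

// renderComment builds the text of a pull request comment, appending a
// documentation link to every finding whose check has one registered.
func renderComment(findings []Finding) string {
	var b strings.Builder
	b.WriteString("PR Checker results\n")
	for _, f := range findings {
		fmt.Fprintf(&b, "- [%s] %s", f.Level, f.Message)
		if url, ok := docLinks[f.CheckID]; ok {
			fmt.Fprintf(&b, " (see %s)", url)
		}
		b.WriteString("\n")
	}
	return b.String()
}

func main() {
	fmt.Print(renderComment([]Finding{
		{CheckID: "metric-count", Level: "warning", Message: "target exposes 150 samples"},
		{CheckID: "alert-missing-for", Level: "error", Message: "alert HighErrorRate has no 'for' duration"},
	}))
}
```

Posting the rendered text to the pull request is left out because it depends on the code hosting platform in use.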
Key Takeaways & Future Directions 🌟
- Proactive Guidance: This approach transforms manual work into a proactive way to guide users towards better practices.
- Extensibility: The PR Checker is designed to be easily extended with new checks as needed.
- Sweet Spot for Best Practices: It’s a great way to “sneak in” desired practices when users aren’t actively seeking documentation.
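One way extensibility like this could be structured is a small interface that every check implements, so that adding a check means adding one type to a list. This is a sketch under assumptions: the talk does not describe the tool's internals, and the `Check` interface, the `AlertRule` struct, and both example checks are invented for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// AlertRule holds the alerting rule fields that the example checks inspect.
type AlertRule struct {
	Name   string
	Expr   string
	For    time.Duration
	Labels map[string]string
}

// Check is a hypothetical extension point: each check inspects one rule
// and returns zero or more human-readable messages.
type Check interface {
	Name() string
	Run(rule AlertRule) []string
}

// missingFor flags alerts that fire immediately because no "for" is set.
type missingFor struct{}

func (missingFor) Name() string { return "missing-for" }
func (missingFor) Run(r AlertRule) []string {
	if r.For == 0 {
		return []string{fmt.Sprintf("alert %q has no 'for' duration", r.Name)}
	}
	return nil
}

// missingSeverity flags alerts that carry no "severity" label.
type missingSeverity struct{}

func (missingSeverity) Name() string { return "missing-severity" }
func (missingSeverity) Run(r AlertRule) []string {
	if r.Labels["severity"] == "" {
		return []string{fmt.Sprintf("alert %q has no severity label", r.Name)}
	}
	return nil
}

// runAll applies every check to the rule and prefixes each message
// with the name of the check that produced it.
func runAll(checks []Check, r AlertRule) []string {
	var out []string
	for _, c := range checks {
		for _, m := range c.Run(r) {
			out = append(out, c.Name()+": "+m)
		}
	}
	return out
}

func main() {
	checks := []Check{missingFor{}, missingSeverity{}}
	rule := AlertRule{Name: "HighErrorRate", Expr: `rate(http_errors_total[5m]) > 0.1`}
	for _, m := range runAll(checks, rule) {
		fmt.Println(m)
	}
}
```

The example checks only look at rule metadata. A check that analyzes the alert expression itself would need a PromQL parser, for instance the one in the Prometheus Go module, which is omitted here to keep the sketch dependency-free.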
This simple, automated solution demonstrates how a little ingenuity can dramatically improve the quality and consistency of your Prometheus configurations. It’s a fantastic example of how to empower your users and build a stronger observability ecosystem. ✨