Level Up Your Prometheus Observability with Automated PR Checks 🚀💡

Hey everyone! 👋 Let’s dive into a practical way to make sure best practices are followed when users modify your Prometheus configurations. This isn’t a polished presentation. It’s a lightning talk born from a lunchtime brainstorm, but the core ideas are worth sharing. 🤖

The Problem: Manual Validation is a Bottleneck 🧱

Traditionally, when users (like those on your platform observability team) created new scrape targets or alerting rules in Prometheus repositories, the validation process was entirely manual. 🧑‍💻 An operator had to review every pull request by hand, checking for potential issues. This process was:

  • Cumbersome: A huge time sink.
  • Expert-Dependent: The quality of validation relied heavily on the operator’s expertise.
  • Prone to Errors: Manual checks are susceptible to human oversight.

The Solution: PR Checker – Automating the Review 🛠️

To tackle this, the team built a project called “PR Checker.” It’s a Go-based tool designed to automate the validation process. Here’s how it works:

  • Go Implementation: Chosen because Prometheus itself is written in Go, which lets the tool parse Prometheus configurations and build on Prometheus-specific features.
  • Limited PromQL: The project intentionally works with only a subset of PromQL features, which keeps the set of checks focused and manageable.
  • Integrated Checks: The PR Checker integrates directly into the repository and runs on every new pull request and on every update to an existing one.
  • Smart Feedback: Instead of simply failing the build, the PR Checker posts its findings as pull request comments:
    • Warnings: Non-critical issues that should be addressed but do not block the pull request.
    • Errors: Critical issues that must be fixed and that fail the build.
  • Comprehensive Checks: The PR Checker currently validates:
    • Target Connectivity: Ensuring targets are reachable.
    • Metric Count: Verifying the number of metrics being collected.
    • Alert Definition Checks: Analyzing alert expressions for potential problems.
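
The split between warnings and errors described above can be sketched in a few lines of Go. This is a minimal illustration, not the PR Checker's actual code: the type names, the metric limit of 5000, and the "severity label" policy are all assumptions made for the example, and the connectivity check is left out so the sketch runs without a network.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Severity distinguishes advisory findings from ones that fail the build.
type Severity int

const (
	Warning Severity = iota
	Error
)

// Finding is one piece of feedback destined for a pull request comment.
type Finding struct {
	Severity Severity
	Message  string
}

// AlertRule is a simplified, hypothetical stand-in for an alerting rule.
type AlertRule struct {
	Name   string
	Expr   string
	Labels map[string]string
}

// maxMetrics is an assumed per-target limit, not a value from the talk.
const maxMetrics = 5000

// CheckMetricCount reports an error when a target exposes too many metrics.
func CheckMetricCount(target string, count int) []Finding {
	if count > maxMetrics {
		msg := fmt.Sprintf("%s exposes %d metrics (limit %d)", target, count, maxMetrics)
		return []Finding{{Severity: Error, Message: msg}}
	}
	return nil
}

// CheckAlert applies two illustrative policies: an empty expression is an
// error, and a missing "severity" label is only a warning.
func CheckAlert(r AlertRule) []Finding {
	var out []Finding
	if strings.TrimSpace(r.Expr) == "" {
		out = append(out, Finding{Severity: Error, Message: r.Name + ": empty alert expression"})
	}
	if _, ok := r.Labels["severity"]; !ok {
		out = append(out, Finding{Severity: Warning, Message: r.Name + ": no severity label"})
	}
	return out
}

// ShouldFail reports whether any finding is severe enough to fail the build.
func ShouldFail(findings []Finding) bool {
	for _, f := range findings {
		if f.Severity == Error {
			return true
		}
	}
	return false
}

func main() {
	findings := CheckMetricCount("node-exporter:9100", 120)
	findings = append(findings, CheckAlert(AlertRule{Name: "HighLatency", Expr: "up == 0"})...)
	for _, f := range findings {
		fmt.Println(f.Message)
	}
	if ShouldFail(findings) {
		os.Exit(1) // only errors break the build; warnings are just reported
	}
}
```

Keeping the pass/fail decision in one place (`ShouldFail`) means a new check only has to pick a severity for each finding.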

The Impact: Driving Best Practices 🎯

The PR Checker isn’t just about catching errors; it’s about actively promoting best practices. Here’s what they’ve observed:

  • Team Efficiency: The automated system has significantly streamlined the validation process, especially as the team and the number of pull requests grow.
  • User Engagement: Users are more likely to adopt best practices when they receive direct, actionable feedback on their pull requests.
  • Documentation Leverage: The PR Checker cleverly uses pull request comments to link users to relevant documentation – a surprisingly effective way to drive adoption of recommended practices. (As Ivana pointed out, users often don’t read documentation before submitting a pull request.)
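
To make the documentation-link idea concrete, here is a hedged sketch of how findings might be rendered into a comment body with links attached. The `Finding` fields, the Markdown layout, and the example URL are assumptions for illustration. Posting the comment through a code-hosting API is deliberately left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Finding is a hypothetical record of one check result.
type Finding struct {
	Level   string // "warning" or "error"
	Message string
	DocURL  string // optional link to the relevant guideline
}

// RenderComment builds a Markdown body suitable for a pull request comment.
// Each finding becomes one list item, with a docs link when one is set.
func RenderComment(findings []Finding) string {
	if len(findings) == 0 {
		return "All checks passed."
	}
	var b strings.Builder
	b.WriteString("Automated review found the following:\n")
	for _, f := range findings {
		fmt.Fprintf(&b, "- **%s**: %s", f.Level, f.Message)
		if f.DocURL != "" {
			fmt.Fprintf(&b, " ([docs](%s))", f.DocURL)
		}
		b.WriteString("\n")
	}
	return b.String()
}

func main() {
	fmt.Print(RenderComment([]Finding{{
		Level:   "warning",
		Message: "alert HighLatency has no severity label",
		DocURL:  "https://docs.example.com/alerting#labels", // placeholder URL
	}}))
}
```

Attaching the link to the specific finding puts the guidance in front of the user at the moment it is relevant.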

Key Takeaways & Future Directions 🌟

  • Proactive Guidance: This approach transforms manual work into a proactive way to guide users towards better practices.
  • Extensibility: The PR Checker is designed to be easily extended with new checks as needed.
  • Sweet Spot for Best Practices: It’s a great way to “sneak in” desired practices when users aren’t actively seeking documentation.
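
One common way to get the extensibility mentioned above is a small interface plus a registry, so that adding a check means adding one type. The talk does not describe the PR Checker's internals, so the `Check` interface, the `Registry` type, and the toy `nonEmpty` check below are purely illustrative.

```go
package main

import "fmt"

// Check is a hypothetical interface that every validation implements.
type Check interface {
	Name() string
	Run(config string) []string // returns human-readable findings
}

// Registry holds checks in registration order.
type Registry struct {
	checks []Check
}

// Register adds a check; existing checks are unaffected.
func (r *Registry) Register(c Check) {
	r.checks = append(r.checks, c)
}

// RunAll executes every check and prefixes each finding with the check name.
func (r *Registry) RunAll(config string) []string {
	var out []string
	for _, c := range r.checks {
		for _, msg := range c.Run(config) {
			out = append(out, c.Name()+": "+msg)
		}
	}
	return out
}

// nonEmpty is a toy check that rejects an empty configuration.
type nonEmpty struct{}

func (nonEmpty) Name() string { return "non-empty" }

func (nonEmpty) Run(config string) []string {
	if config == "" {
		return []string{"configuration is empty"}
	}
	return nil
}

func main() {
	var r Registry
	r.Register(nonEmpty{})
	for _, finding := range r.RunAll("") {
		fmt.Println(finding)
	}
}
```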

This simple, automated solution demonstrates how a little ingenuity can dramatically improve the quality and consistency of your Prometheus configurations. It’s a fantastic example of how to empower your users and build a stronger observability ecosystem. ✨

