Taming the LLM: Lessons from Building an Autonomous Incident Response Agent 🚀

Building AI agents to automate tasks is the hot topic in tech right now. But what happens when the theory meets reality? This post dives into a fascinating case study: a company’s journey in building an LLM-powered agent for incident response and troubleshooting. It’s a real-world post-mortem, packed with hard-won lessons and practical advice for anyone considering a similar venture.

The Initial Dream: A Single Tool to Rule Them All 💡

The initial vision was ambitious: create a single, versatile AI agent capable of handling a wide range of incident response tasks. The plan involved using a single, flexible tool (think a Prometheus query tool) controlled by the LLM. The hope? The LLM would leverage a knowledge base to generate queries and resolve issues.
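
Concretely, the first iteration looked something like the sketch below. The tool name, schema, and endpoint are hypothetical placeholders, not code from the post; the point is that the LLM had exactly one free-form tool and had to compose every PromQL query itself:

```python
# Hypothetical sketch of the "single flexible tool" design. The tool name,
# schema, and endpoint are illustrative, not taken from the post.
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # placeholder endpoint

GENERIC_TOOL_SCHEMA = {
    "name": "run_promql",
    "description": "Run an arbitrary PromQL query and return the raw result.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "A PromQL expression"},
        },
        "required": ["query"],
    },
}

def run_promql(query: str) -> dict:
    """Execute whatever PromQL the LLM wrote and hand back the raw JSON."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10
    )
    resp.raise_for_status()
    return resp.json()  # unstructured from the agent's point of view
```

With only this interface available, every diagnostic question had to be turned into a chain of free-form queries, which is exactly where the trouble described below began.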

However, this approach quickly hit a wall. The team encountered several critical challenges:

  • Tool Call Overload: Complex issues demanded many tool calls, slowing down the process and creating endless decision points for the LLM.
  • Unpredictable Trajectories: The LLM’s path through these tool calls became difficult to predict, leading to inconsistent and unreliable results.
  • Limited Adaptability: The LLM struggled to fetch data it didn’t know how to query for, severely limiting its problem-solving capabilities.

The Pivot: Specialization and Structure 🛠️

Recognizing the limitations of the initial approach, the team shifted gears. The pivot centered on several key changes:

  • Specialized Tools: Replacing the generic tool with focused tools tailored to specific tasks – metrics, logs, releases, etc.
  • Structured Output: Ensuring tools provided structured data, including a description of what each query does and the expected output. This made the LLM’s reasoning far more transparent (see the sketch after this list).
  • Composite Tools: Combining related tools into single calls for common use cases, streamlining workflows.
  • Read-Only Deployment: Initially deploying the agent in read-only mode, a critical safety measure to prevent accidental modifications to the production system.
  • User Impersonation: Executing all tool calls under a specific user identity with limited permissions.
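
To make the pivot concrete, here is a minimal sketch of what specialized, structured, composite tools could look like. All names (ToolResult, get_error_rate, get_recent_releases, triage_service) and the stubbed payloads are illustrative assumptions, not code from the post:

```python
# Minimal sketch of the pivoted design: narrow tools, structured results that
# describe themselves, and a composite tool for a common "first look" workflow.
from dataclasses import dataclass
from typing import Any

@dataclass
class ToolResult:
    description: str        # what the query did, in plain language
    expected_output: str    # what the fields mean
    data: Any               # the structured payload itself
    read_only: bool = True  # the agent is deployed read-only for safety

def get_error_rate(service: str, window: str = "5m") -> ToolResult:
    """Focused metrics tool: error rate for one service, nothing else."""
    # ...a real implementation would query the metrics backend; value is stubbed...
    return ToolResult(
        description=f"HTTP 5xx error rate for '{service}' over the last {window}",
        expected_output="Fraction of requests that failed, between 0.0 and 1.0",
        data={"service": service, "error_rate": 0.042},
    )

def get_recent_releases(service: str) -> ToolResult:
    """Focused release tool: recent deployments for one service."""
    # ...stubbed example record...
    return ToolResult(
        description=f"Deployments of '{service}' in the last 24 hours",
        expected_output="List of {version, deployed_at} records, newest first",
        data=[{"version": "v1.42.0", "deployed_at": "2024-05-01T10:12:00Z"}],
    )

def triage_service(service: str) -> list[ToolResult]:
    """Composite tool: one call bundling the usual first queries of an incident."""
    return [get_error_rate(service), get_recent_releases(service)]
```

Because every result explains what was queried and what the fields mean, the LLM does not have to guess how to interpret the data it gets back.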

Key Takeaways: What We Learned 🎯

This journey yielded a wealth of valuable lessons. Here’s a breakdown of the most impactful:

  1. Avoid Overly Flexible Tools: Flexibility is great, but too much freedom leads to unpredictable behavior. Structure and focus are key.
  2. Structure is Crucial: Well-defined tool interfaces and structured data are essential for predictable and effective agent operation.
  3. Specialization Beats Generality: Focused tools tailored to specific tasks consistently outperform generic tools.
  4. One-Shot Summarization is Powerful: LLMs excel at summarizing raw data when given the right context (see the sketch after this list).
  5. Prioritize Safety: Read-only deployment and user impersonation are critical for preventing unintended consequences.
  6. Treat LLMs as Unreliable Clients: Rigorous parameter validation and graceful error handling are non-negotiable.
  7. Fail Fast on Untrained Tasks: Prevent the agent from attempting tasks it’s not designed for.
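
Takeaway 4 is easiest to see in code. Below is a minimal sketch, assuming the ToolResult records from the earlier sketch and a generic `complete(prompt) -> str` LLM call; the prompt wording is an assumption, not the team’s actual prompt:

```python
# One-shot summarization: hand the LLM all the structured evidence in a single
# prompt and ask for one summary, instead of many back-and-forth tool calls.
import json

def summarize_findings(results, question: str, complete) -> str:
    """`results` are ToolResult-style records; `complete` is any text-in/text-out LLM call."""
    context = "\n\n".join(
        f"Query: {r.description}\nMeaning: {r.expected_output}\nData: {json.dumps(r.data)}"
        for r in results
    )
    prompt = (
        "You are assisting with an incident investigation.\n"
        f"Question: {question}\n\n"
        f"Evidence gathered by tools:\n{context}\n\n"
        "Summarize the likely cause and the next diagnostic step in a few sentences."
    )
    return complete(prompt)  # a single LLM call produces the summary
```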

Diving Deeper: Technical Nuances 💾

Let’s look at some of the specific technical challenges and solutions:

  • Prometheus Query Tool Issues: The initial attempt with a single Prometheus query tool proved too complex, demanding too many calls and creating unpredictable decision points.
  • Importance of Tool Descriptions: Providing clear descriptions of tool queries helps the LLM understand the context and interpret results effectively.
  • User Context Injection: Injecting user identity into tool calls enables granular access control and auditability.
  • Structured Output from Tools: Standardizing output formats makes integration and interpretation much smoother.
  • Validation and Error Handling: Essential for reliability; think of it as building a safety net for the LLM (a sketch follows this list).
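
Here is a sketch of the last two points, reusing the hypothetical tools from above. The allow-list and wrapper name are assumptions; the idea is simply that bad arguments from the LLM come back as readable errors rather than exceptions, and that every call is tagged with the impersonated user:

```python
# Treat the LLM as an unreliable client: validate arguments up front, run the
# call under the requesting user's identity, and return structured errors the
# agent loop can show back to the model instead of crashing on an exception.
KNOWN_SERVICES = {"checkout", "payments", "search"}  # illustrative allow-list

def call_tool(tool, user_id: str, **kwargs) -> dict:
    service = kwargs.get("service")
    if service not in KNOWN_SERVICES:
        # Fail fast with a message the LLM can act on, not a stack trace.
        return {"error": f"Unknown service '{service}'. Valid options: {sorted(KNOWN_SERVICES)}"}
    try:
        result = tool(**kwargs)  # in the real system this runs with user_id's limited permissions
        return {"user": user_id, "result": result}  # identity kept for auditability
    except Exception as exc:
        return {"error": f"{tool.__name__} failed: {exc}"}
```

A call like `call_tool(get_error_rate, user_id="oncall-bot", service="checkout")` then either returns audited data or a message the LLM can use to correct itself.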

What’s Next? Future Directions 📡

The journey doesn’t end here! Here’s what the team is exploring:

  • Refine Tool Design: Continuously evaluating and improving individual tools based on usage patterns and feedback.
  • Dynamic Tool Selection: Exploring ways to allow the LLM to dynamically select the appropriate tool based on the context (a tricky balancing act!).
  • Advanced Prompt Engineering: Experimenting with more sophisticated prompt engineering techniques to guide the LLM’s behavior.
  • Feedback Loops: Implementing a feedback loop to allow human operators to correct the agent’s actions and improve its performance.
  • Expand Toolset: Considering expanding the toolset to include capabilities for more advanced troubleshooting tasks.
  • Monitoring and Observability: Implementing robust monitoring and observability to track the agent’s performance and identify areas for improvement.
  • Security Audits: Regularly conducting security audits to keep the agent’s security posture sound.

Beyond the Basics: Advanced Considerations 🌐

  • Data Source Prioritization: Carefully curating data sources – prioritizing validated Root Cause Analysis (RCA) tickets and cautiously processing Slack threads.
  • Continuous Grounding in Knowledge Bases (RAG): Grounding the LLM at each step so it works from the most relevant data and runbooks (a toy sketch follows this list).
  • System Prompt Simplicity: Investing effort in curating and updating knowledge bases (runbooks, Jira tickets, Slack threads) rather than in ever more complex system prompts.
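
As a rough illustration of per-step grounding, here is a toy sketch with a keyword retriever standing in for a real vector index; the runbook snippets and function names are made up for the example:

```python
# Toy per-step grounding: before each agent step, pull the most relevant
# runbook snippets and inject them into the prompt as extra context.
RUNBOOKS = {
    "high error rate": "Check the most recent release; roll back if errors started after a deploy.",
    "pod crashloop": "Inspect container logs and recent config changes before restarting.",
}

def retrieve_runbooks(step_summary: str, k: int = 2) -> list[str]:
    """Rank runbook entries by keyword overlap with the current step's summary."""
    words = set(step_summary.lower().split())
    scored = [(len(set(key.split()) & words), text) for key, text in RUNBOOKS.items()]
    return [text for score, text in sorted(scored, reverse=True)[:k] if score > 0]

def ground_step(step_summary: str) -> str:
    """Build the extra context injected into the prompt at this agent step."""
    snippets = retrieve_runbooks(step_summary)
    return "Relevant runbook guidance:\n" + "\n".join(f"- {s}" for s in snippets)
```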

This case study provides a valuable roadmap for anyone venturing into the world of LLM-powered automation. It’s a reminder that building truly autonomous systems requires a pragmatic approach, a willingness to adapt, and a relentless focus on safety and reliability.
