Beyond the Hype: Building Real Reliability with SLOs and Fit Practices 🚀
Ever feel like the dazzling world of SRE and reliability engineering is a bit like a Michelin-starred restaurant menu for most of us? Google, bless their innovative hearts, has given us incredible insights and frameworks. But let’s be honest, for many organizations, trying to replicate their exact approach feels like baking a soufflé without an oven. That’s where Alex’s journey and his book come in, offering a grounded, practical path to building true reliability, one “fit practice” at a time.
We recently caught up with Alex, a Senior Software Engineer at Volvo Cars, who has a fascinating story of how he stumbled into the world of Site Reliability Engineering (SRE). His path isn’t the typical “from sysadmin to code wizard” narrative. Instead, it’s a story of necessity, adaptation, and a deep dive into what actually makes systems reliable in the real world.
From Media to Reliability: A Serendipitous Shift 🔄
Alex’s career began in programming, but a significant shift occurred when he was working at a media company struggling to compete with digital giants. As he recounts, “Google and Facebook were eating our lunch basically because they could do targeted advertisement.” In a bold move, the company invested heavily in talent, bringing in top engineers.
Then came the pivot: a mandate from leadership to move towards generalist software developers, eliminating specialized roles like QA and DevOps. This meant engineers, including Alex, were suddenly tasked with infrastructure and reliability – tasks that previously seemed like “black magic.”
This forced evolution, however, proved to be a revelation. Alex shares a key insight from his book: “If you find a way to make quality and reliability a software engineer’s problem, they’re going to fix that.” Charity, the interviewer, shares the sentiment: the days of separate dev and ops teams are fading. We’re moving towards engineers owning their code from development all the way into production. This tight feedback loop is crucial for shipping good software, and anything that breaks it, like handoffs between teams, becomes a breeding ground for incidents.
The Car as a Computer: Embracing Software at Volvo 🚗💻
Alex’s current role at Volvo Cars is a testament to how deeply software is integrated into modern vehicles. He describes today’s car as a “computer on wheels,” with the software ecosystem becoming a primary differentiator. Volvo invested in a dedicated software office in Stockholm, and that is where Alex found himself, tasked with a monumental challenge: implementing SLIs (Service Level Indicators) and SLOs (Service Level Objectives) across a massive organization.
What started with 110 teams has now scaled to an astonishing 1,000 teams. This sheer scale meant that Alex couldn’t rely on face-to-face workshops. He needed a way to “scale the basic language of SLO and SLA across the company.” This need was the genesis of his book – to create a scalable framework for reliability.
The Google SRE “Delta”: Bridging the Gap 🌉
Charity highlights a common perception: that Google’s SRE books offer a blueprint, but there’s a significant “delta” between their model and what most companies can achieve. Alex agrees, using the metaphor of a Michelin-star recipe versus what’s possible in a home kitchen.
“Google, in their fantastic books… it’s like a fantastic chef kind of recipe for like Michelin kind of restaurants. And the most companies have started to cargo cult to Google and kind of mimic what Google does hoping to get same result. They don’t have Google’s business model. They don’t have Google’s resources. They don’t have Google’s stack.”
Instead of blindly copying, Alex advocates for building the foundational “kitchen” – the platform and tooling – before fully embracing advanced SRE practices. His book aims to bridge this gap, acknowledging that most companies aren’t at Google’s level of maturity.
Fit Practices, Not Best Practices: A Nuanced Approach 🎯
The conversation naturally steers towards the overused and often misleading concept of “best practices.” Alex and Charity both express skepticism, favoring the term “fit practice.”
“Best practice has this idea of absolutism in it, like this is best, you know, objectively this is the best, whereas in reality we need to find what works for a particular company based on the tech landscape, based on the budgeting, headcount, all that stuff.”
This philosophy is central to his approach to SLIs and SLOs. He doesn’t believe in a one-size-fits-all recipe. The more flexible and nuanced an approach is, the more he respects it.
SLOs: The APIs for Your Engineering Teams 🤝
Alex identifies SLOs as Google’s most significant contribution to reliability engineering – a sentiment Charity wholeheartedly supports. However, implementing them presents its own set of challenges.
Common SLO Pitfalls:
- Not measuring at all: Companies shy away, thinking it’s “Google jargon.”
- Measuring the wrong thing: Focusing on easy metrics like availability without considering business impact.
To combat this, Alex developed an open-source tool that visualizes dependencies between services and identifies failure points. By listing and prioritizing failures based on business impact, teams can tie meaningful SLIs to them. He’s applied this process over 70 times in his current company, deeming it a “foolproof way to find SLI that are meaningful.”
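The prioritization step described above can be sketched in a few lines. This is not Alex’s open-source tool, only a minimal illustration of the idea: the `FailureMode` type, the 1-to-5 impact score, and the sample failures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    service: str          # service where the failure occurs
    description: str      # what goes wrong from the consumer's point of view
    business_impact: int  # hypothetical score: 1 (minor) to 5 (severe)
    candidate_sli: str    # indicator that would detect this failure


def prioritize(failures: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes ordered from highest to lowest business impact."""
    return sorted(failures, key=lambda f: f.business_impact, reverse=True)


# Hypothetical failure list for an imaginary shop.
failures = [
    FailureMode("search", "results are stale", 2,
                "index freshness under 10 minutes"),
    FailureMode("checkout", "payment requests fail", 5,
                "share of successful payment requests"),
    FailureMode("checkout", "page loads slowly", 3,
                "share of page loads under 2 seconds"),
]

ranked = prioritize(failures)
for f in ranked:
    print(f"[{f.business_impact}] {f.service}: {f.description} -> {f.candidate_sli}")
```

The point of the ordering is that the SLI for the top entry is the first one worth instrumenting. Availability of a low-impact service can wait.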
Charity brilliantly frames SLOs as “the APIs for your engineering teams.” They provide a clear, data-driven way for teams to push back against micromanagement and demonstrate their value. As long as they hit agreed-upon SLOs, they have the autonomy to manage their roadmap. This shifts the conversation from subjective opinions to objective, data-backed agreements.
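One way to picture that kind of agreement is an error-budget check: as long as the budget is not spent, the team is within its contract. The sketch below assumes a simple request-based SLO. The function name, the 99.9% target, and the traffic numbers are illustrative, not taken from the conversation.

```python
import math


def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent).

    slo_target is a ratio strictly between 0 and 1, for example 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total  # bad events the SLO tolerates
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


# Hypothetical month: one million requests, 400 of them failed.
remaining = error_budget_remaining(good=999_600, total=1_000_000, slo_target=0.999)
print(f"error budget remaining: {remaining:.0%}")
within_contract = remaining >= 0.0
```

A positive value supports the “leave our roadmap alone” argument with data. A negative one is an equally objective signal that reliability work should take priority.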
The Cost of Reliability: Negotiating the Nines 💰
Reliability isn’t free. Alex emphasizes that each additional “nine” of availability comes with a significant cost – requiring refactoring, better tooling, potential hiring, and sometimes slower shipping.
“The reliability is not free right for every nine you’re adding you’re essentially shrinking the error budget by a factor of 10. So and it has a cost.”
He advocates for making these costs transparent in negotiations between product and platform teams. A junior engineering manager might simply accept requests to increase nines, leading to burnout. A seasoned leader, however, understands how to “shift the discussion from the territory of what are we being asked to do to what does it cost.”
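The factor-of-ten arithmetic behind each extra nine is easy to put on the table during such a negotiation. The sketch below assumes a 30-day window and time-based availability. Both are assumptions for illustration, and the function name is made up.

```python
import math


def allowed_downtime_minutes(nines: int, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target with `nines` nines."""
    error_budget = 10.0 ** -nines  # 2 nines -> 0.01, 3 nines -> 0.001, ...
    return error_budget * window_days * 24 * 60


def label(nines: int) -> str:
    """Render 2 -> '99%', 3 -> '99.9%', 4 -> '99.99%'."""
    digits = "9" * nines
    return digits[:2] + ("." + digits[2:] if nines > 2 else "") + "%"


for n in range(2, 6):
    print(f"{label(n):>8}: {allowed_downtime_minutes(n):8.2f} minutes per 30 days")
```

Each row allows one tenth of the downtime of the row above it, which is why every added nine tends to demand disproportionately more engineering effort.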
Beyond the Nines: Understanding User Experience 💡
A striking anecdote highlights this point: a media streaming app’s CTO demanding “99.9999 availability.” Alex points out that for many services, this level of reliability is overkill and astronomically expensive. Through UX research, they discovered that for this specific product, users could tolerate up to two hours of unavailability before switching vendors – equating to a much more achievable 99.7% availability.
This underscores a critical takeaway: “What they’re asking for, they don’t know what they’re actually saying. When they say that, what they mean is: we want our customers to have a good experience.”
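The conversion from tolerated downtime to an availability target is a one-liner. The recap does not state the measurement window behind the 99.7% figure. The sketch below assumes a 30-day month, which is the window under which two hours works out to roughly that number.

```python
import math


def availability_from_downtime(downtime_hours: float, window_hours: float) -> float:
    """Availability percentage implied by a tolerated amount of downtime."""
    return 100.0 * (1.0 - downtime_hours / window_hours)


WINDOW_HOURS = 30 * 24  # assumed window: a 30-day month

# Two hours of tolerated unavailability within the assumed window.
tolerated = availability_from_downtime(2, WINDOW_HOURS)
print(f"tolerated by users: {tolerated:.2f}%")

# For contrast, the downtime that a 99.9999% target would allow in the same window.
six_nines_seconds = (1.0 - 0.999999) * WINDOW_HOURS * 3600
print(f"allowed by 99.9999%: {six_nines_seconds:.1f} seconds")
```

Under that assumption the gap is two hours versus a couple of seconds per month, which makes the cost discussion concrete.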
The Power of Data Over Emotion 📊
Alex acknowledges the risk of service levels being “weaponized by management.” However, he also extends an invitation to his engineering peers: “look at it as a way to use hard data to put an end to emotional discussions.” This is particularly relevant when certain services have blanket SLIs applied, irrespective of their criticality. A more nuanced approach involves defining reliability classes based on the user’s critical path and business impact.
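Reliability classes of that kind could look something like the sketch below. The three class names, their targets, and the classification thresholds are all invented for illustration. A real organization would pick its own.

```python
from enum import Enum


class ReliabilityClass(Enum):
    # Hypothetical classes, each carrying an illustrative SLO target.
    CRITICAL = 0.999
    STANDARD = 0.995
    BEST_EFFORT = 0.99


def classify(on_critical_path: bool, business_impact: int) -> ReliabilityClass:
    """Assign a class from critical-path membership and a 1-to-5 impact score."""
    if on_critical_path and business_impact >= 4:
        return ReliabilityClass.CRITICAL
    if on_critical_path or business_impact >= 3:
        return ReliabilityClass.STANDARD
    return ReliabilityClass.BEST_EFFORT


# Hypothetical services: (on the user's critical path?, business impact score).
services = {
    "login": (True, 5),
    "recommendations": (False, 3),
    "internal-reports": (False, 1),
}

for name, (critical, impact) in services.items():
    cls = classify(critical, impact)
    print(f"{name}: {cls.name} (target {cls.value:.1%})")
```

Instead of one blanket target, each service inherits the target of its class, so the discussion moves to whether a service is classified correctly.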
The Dream: SLOs and Observability United ✨
The discussion circles back to observability and the ideal scenario where SLOs are a first-class citizen. The current model of siloed data storage across metrics, traces, and logs often hinders comprehensive analysis. Alex expresses his excitement about the potential of tools that unify these signals, allowing SLOs to be computed from the same data used for investigation.
“I got really really excited when I learned about it is that SLO is a first class citizen in Honeycomb and that’s how it should be. It shouldn’t be some you know checking a box or something and then you only enable it in some enterprise tier or something. No, this is such a core topic.”
This unified approach allows engineers to seamlessly transition from observing SLO violations to deep-diving into the exact requests and contributing factors, transforming troubleshooting from a chore into an efficient exploration.
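The idea of computing the SLO from the same events used for investigation can be illustrated with plain Python. This is not any vendor’s API. The `RequestEvent` fields, the 500 ms threshold, and the sample events are assumptions made for the sketch.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestEvent:
    endpoint: str
    region: str
    status: int
    duration_ms: float


def is_good(event: RequestEvent, threshold_ms: float = 500.0) -> bool:
    """A request counts as good if it did not fail server-side and was fast enough."""
    return event.status < 500 and event.duration_ms <= threshold_ms


def sli(events: list[RequestEvent]) -> float:
    """Share of good events; an empty window counts as fully compliant."""
    if not events:
        return 1.0
    return sum(is_good(e) for e in events) / len(events)


def bad_events_by(events: list[RequestEvent], attribute: str) -> Counter:
    """Group the SLO-violating events by one attribute to start the investigation."""
    return Counter(getattr(e, attribute) for e in events if not is_good(e))


# Hypothetical events for an imaginary streaming service.
events = [
    RequestEvent("/play", "eu", 200, 120.0),
    RequestEvent("/play", "eu", 200, 90.0),
    RequestEvent("/play", "us", 503, 30.0),
    RequestEvent("/play", "us", 200, 900.0),
    RequestEvent("/search", "eu", 200, 200.0),
]

print(f"SLI: {sli(events):.0%}")
print("bad events by region:", bad_events_by(events, "region").most_common())
```

Because the indicator and the breakdown read the same records, moving from “the SLO is burning” to “which requests are responsible” is a filter, not a switch to a different data store.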
Final Thoughts: Fluke, Chaos, and Meaning 🧠
As the conversation winds down, Alex shares a profound book recommendation: “Fluke: Chance, Chaos, and Why Everything We Do Matters.” He finds resonance in its exploration of how seemingly random events can have monumental impacts, offering a balanced perspective on agency and the inherent contingency of history. It’s a reminder that while not everything we do changes the world, some things do, and the impact can be unpredictable.
This deep dive into reliability, from the practical challenges of implementing SLOs at scale to the philosophical implications of our work, offers a valuable blueprint for any engineering team striving for robust and meaningful system performance. It’s a call to move beyond dogma and embrace “fit practices” that truly serve your organization’s unique context.
Key Takeaways:
- Embrace “Fit Practices”: Adapt reliability frameworks to your specific company context, resources, and tech stack, rather than blindly following “best practices.”
- Empower Engineers: Making reliability an engineer’s problem fosters ownership and drives effective solutions.
- SLOs as APIs: Use SLOs as data-driven tools to manage expectations, push back on unreasonable demands, and foster informed discussions.
- Transparency of Cost: Understand and communicate the real cost associated with achieving higher levels of reliability.
- Focus on User Experience: Align reliability goals with actual user needs, not just arbitrary “nines.”
- Unified Observability: The future lies in tools that integrate SLOs with underlying telemetry data for seamless investigation.
This conversation is a powerful reminder that building reliable systems is an ongoing journey, fueled by practical insights, a willingness to adapt, and a deep understanding of what truly matters.