Embracing Flux: BT’s Transformative GitOps Journey 🚀

Ever felt like your infrastructure is a bit… unpredictable? That’s exactly where BT found themselves, wrestling with on-premise challenges that made even daily deployments feel like a gamble. But instead of throwing in the towel, they embarked on a fascinating journey to adopt Flux for GitOps, transforming their deployment pipelines and learning some invaluable lessons along the way. Let’s dive into their experience, from a cautious first step to a global scaling ambition!

Project A: The First Foray into Flux 💡

BT’s initial encounter with complexity was with a medium-sized, on-premise infrastructure. While their existing Terraform and GitLab CI/CD setup enabled daily deployments, the lead time from a merge request to production hovered around a frustrating 30 minutes. The real pain points? Flakiness. Network hiccups, hardware quirks, storage woes, and even critical service outages like DNS were causing cross-cluster synchronization nightmares.

The driving force for adopting Flux wasn’t just this instability, but a pressing security imperative: keeping Kubernetes versions and all associated software gleamingly up-to-date. After adopting a managed Kubernetes solution, Flux was introduced, almost like a secret weapon – a “Trojan horse” to streamline operations. And to their surprise, Flux-based deployments showed a remarkable tolerance to those pesky environmental fluctuations.

This early success was so compelling that it spurred a bold decision: migrate their existing, three-year-old Terraform implementation to Flux. The team ingeniously transformed declarative Terraform code into Flux custom resources and refactored imperative bash/Python scripts into Kubernetes Jobs, all orchestrated via GitLab.
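
To make that concrete, here is a minimal sketch of the translation pattern, with purely illustrative names, paths, and images: a Flux Kustomization takes over what a Terraform root module and its state file used to manage, and a former bash/Python script becomes a Kubernetes Job applied as part of the same reconciliation.

```yaml
# Illustrative only: names, paths, and the image are hypothetical.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform-networking        # stands in for a former Terraform root module
  namespace: flux-system
spec:
  interval: 10m                    # reconcile continuously; failed runs are retried
  retryInterval: 2m
  prune: true
  wait: true                       # only report Ready once applied resources are healthy
  sourceRef:
    kind: GitRepository
    name: platform-config          # repo that previously held the Terraform code
  path: ./clusters/production/networking
---
# An imperative bash/Python script refactored into a Kubernetes Job applied by Flux.
apiVersion: batch/v1
kind: Job
metadata:
  name: bootstrap-dns
  namespace: flux-system
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: bootstrap
          image: registry.example.com/tools/bootstrap:1.0.0   # hypothetical image
          command: ["/scripts/bootstrap-dns.sh"]
```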

Key Wins from Project A’s Flux Migration: ✨

  • Rock-Solid Resilience: Flux’s magical reconciliation loops, with their indefinite retries, became the heroes. They tamed underlying platform health issues, dramatically boosting cluster synchronization resilience without constant manual intervention.
  • Simplified Codebase: Saying goodbye to managing 30 Terraform state files across separate repositories brought immense relief. Centralized management within the cluster was a game-changer.
  • Deeper Kubernetes Mastery: The shift to an asynchronous model was a steep learning curve, but it forced a deeper understanding of Kubernetes controllers and their reconciliation mechanisms. It was a powerful educational experience!

The Hurdles Faced in Project A: 🚧

  • Mindset Shift: Transitioning from a synchronous to an asynchronous model required a significant conceptual leap.
  • Tooling Divide: Platform engineers embraced the CLI, while app developers stuck with their familiar GitLab workflow.
  • Dependency Puzzles: Building dependency chains with Kustomizations led to unpredictable reconciliation times, especially with long chains and the default 30-second requeue delay for unmet dependencies (see the sketch after this list).
  • Helm Release Headaches: Initial struggles with HelmReleases, particularly those managing StatefulSets, sometimes required brutal delete-and-recreate strategies for quick fixes.
  • Overlay Overload: Managing configurations across 30 clusters using a directory structure with patched overlays quickly became unwieldy and unmanageable.
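
For context on the dependency puzzle, this is roughly how chained dependencies are expressed (names are illustrative). While a dependency is not yet Ready, the controller re-checks it on a short delay, which is where the unpredictable end-to-end reconciliation time for long chains comes from.

```yaml
# Two chained Kustomizations: "apps" is not applied until "infra" reports Ready.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra
  namespace: flux-system
spec:
  interval: 10m
  prune: true
  wait: true                 # only Ready once all applied resources are healthy
  sourceRef:
    kind: GitRepository
    name: platform-config    # hypothetical
  path: ./infra
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  prune: true
  dependsOn:
    - name: infra            # queues reconciliation behind the infra Kustomization
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./apps
```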

Project F: Scaling GitOps Globally 🌐

Buoyed by the reliability gains from Project A, the team tackled an even more ambitious undertaking: Project F. This project was designed for global deployment within BT, starting small but with the potential to scale into the thousands. The team ballooned from a lean two to a robust 32 platform engineers, supporting a much larger project team of approximately 150.

Project F presented a stark contrast to their previous experiences. It was a traditional waterfall project with grueling 3-month release cycles. The existing system was a tangled mess, with deployment, application, and testing code all intertwined in a monolithic Ansible repository. Developers lacked sandboxed environments, leading to widespread drift and chaos, especially in development. The deployment process was excruciatingly slow; even minor Ansible changes demanded a full image rebuild, taking about an hour. Each environment had its own repository, forcing them to manage a plethora of merge requests.

The CI/CD Revolution for Project F: 🛠️

  • Management Buy-In: The transition to CI/CD was greenlit by showcasing the crippling bottlenecks and the urgent need for a faster delivery cadence.
  • Decoupling Monitoring First: The team wisely started by decoupling the monitoring and observability stack, enabling deployments to integration environments at a decent cadence.
  • Enabling Ephemeral Wonders: The old system was a roadblock for ephemeral environment deployments. With a lightweight monitoring stack, they could now spin up ephemeral environments without impacting shared clusters.
  • A Bold Offer: Instead of just migrating the platform, the team made a game-changing offer: they would migrate the core application, Netbox, too!
  • Full Flux Commitment: Despite some teams eyeing Argo, the team opted for a full Flux route.
  • Developer Delight: Flux’s apparent simplicity, compared to their previous complex deployment methods, was a major selling point. Developers were presented with parameterized repositories where simply tweaking a configuration triggered application deployment.
  • Expanding Ephemeral Horizons: The team showcased the power of Flux Operator ResourceSets and input providers, allowing developers to manage environments via GitLab MRs. The vision? To enable the deployment of multiple instances within a namespace, which will require further containerization of dependencies.
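
The "resource sets and input providers" here are the Flux Operator's ResourceSet and ResourceSetInputProvider APIs. The sketch below only shows the general shape of the pattern with simplified, illustrative fields; the exact schema should be taken from the Flux Operator documentation rather than from this example.

```yaml
# Rough sketch only: field names are simplified and should be checked against
# the Flux Operator docs; project URL and names are hypothetical.
apiVersion: fluxcd.controlplane.io/v1
kind: ResourceSetInputProvider
metadata:
  name: gitlab-merge-requests
  namespace: flux-system
spec:
  type: GitLabMergeRequest                     # turns open MRs into template inputs
  url: https://gitlab.example.com/team/app
  secretRef:
    name: gitlab-token
---
apiVersion: fluxcd.controlplane.io/v1
kind: ResourceSet
metadata:
  name: ephemeral-envs
  namespace: flux-system
spec:
  inputsFrom:
    - kind: ResourceSetInputProvider
      name: gitlab-merge-requests
  resources:
    # One namespace (and, in practice, a Kustomization or HelmRelease) is
    # stamped out per merge-request input.
    - apiVersion: v1
      kind: Namespace
      metadata:
        name: app-mr-<< inputs.id >>
```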

Tackling UI and Testing Challenges: 👨‍💻

  • Bridging the UI Gap: To ease operational skepticism about CLI-only deployments, the team packaged familiar UIs like Headlamp and the Flux plugin. This provided crucial visibility and a deployment history.
  • Seamless Test Integration: Automating testing was paramount. They implemented post-check Kustomizations that triggered Kubernetes Jobs, monitored their progress, and stalled deployments on failure, raising immediate alarms.
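
A rough sketch of that post-check pattern, assuming a test Job defined under a ./tests path (all names are illustrative): the Kustomization waits for the Job to complete, so a failed test run leaves it not-Ready, which stalls anything that depends on it and surfaces through Flux's normal alerting.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: app-post-checks            # hypothetical
  namespace: flux-system
spec:
  interval: 10m
  timeout: 15m
  prune: true
  wait: true                       # wait for the applied test Job to complete
  dependsOn:
    - name: app                    # run the checks only after the app itself is Ready
  sourceRef:
    kind: GitRepository
    name: app-config               # hypothetical repo containing the test manifests
  path: ./tests                    # contains a batch/v1 Job running the test suite
```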

Project F: Current Status and Future Ambitions 🎯

  • Progress Report: Currently at 30% completion, with a target of 80% or full completion by the end of the quarter.
  • Tangible Benefits: More predictable and simpler environments, the end of multi-MR complexity for environment control, and significantly improved auditing.
  • Conquering Overlay Sprawl: Resources are now treated as templates, with post-build variable substitution and large ConfigMaps carrying the per-environment configuration (see the sketch after this list).
  • Immutable Applications: Applications are now treated as immutable; structural changes are versioned and require a copy plus a cluster switch during upgrades.
  • Enhanced Reconciliation: A template structure with pre/post-deploy stages, inspired by the Flux documentation, now incorporates the necessary testing steps and manages dependency chains more effectively.
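
A minimal sketch of the "resources as templates" approach, assuming per-cluster values live in a ConfigMap named cluster-vars (illustrative): Flux's post-build variable substitution injects them into a single shared base instead of one patched overlay per cluster.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: app                        # hypothetical
  namespace: flux-system
spec:
  interval: 10m
  prune: true
  sourceRef:
    kind: GitRepository
    name: app-config               # hypothetical
  path: ./base                     # one shared template instead of 30 overlays
  postBuild:
    substituteFrom:
      - kind: ConfigMap
        name: cluster-vars         # per-cluster values, e.g. environment, replica count
# Manifests under ./base reference the variables directly, for example:
#   replicas: ${REPLICA_COUNT}
#   env: ${CLUSTER_ENV}
```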

What’s Next for Project F? 🔮

  • Cluster Bootstrapping: Broader use of ResourceSets for bootstrapping clusters and automatic code generation.
  • Automated Builds: Leveraging Flux image update automation and registry scanning to roll out build changes automatically (see the sketch after this list).
  • Progressive Delivery: Implementing canary releases for smoother, progressive delivery, which will require database-layer enhancements with backward-compatible migrations.
  • Continuous Deployment Dreams: The ultimate goal is to transform Project F into a continuously deployed project, a far cry from its waterfall origins.
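
For the automated-builds item above, here is a sketch of Flux's image update automation (registry scanning plus Git write-back). Registry, repository, and policy values are illustrative.

```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: app                        # hypothetical
  namespace: flux-system
spec:
  image: registry.example.com/team/app
  interval: 5m                     # scan the registry for new tags
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: app
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: app
  policy:
    semver:
      range: ">=1.0.0"             # pick the newest tag matching the range
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
  name: app
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: app-config               # hypothetical
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: fluxcdbot
        email: flux@example.com
    push:
      branch: main
  update:
    path: ./apps
    strategy: Setters
# Deployment manifests opt in with a policy marker comment on the image line,
# e.g.  image: registry.example.com/team/app:1.0.0 # {"$imagepolicy": "flux-system:app"}
```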

Key Takeaways and Future Gazing 👁️

  • Multi-Cluster Management: Bootstrapping multi-cluster environments remains a focus, though Flux Operator resources are proving helpful.
  • Observability is King: Surfacing alerts and improving observability for Flux deployments is critical to alleviate developer anxieties.
  • Pre/Post Upgrade Orchestration: Flux handles pre/post steps using Kubernetes Jobs, chained for dependency management. Failures, especially in testing, can effectively stall deployments.
  • Smart Dependency Chains: Flux Kustomizations (and the Jobs they run) can be made dependent on the post-checks of other applications, ensuring proper sequencing.
  • Progressive Delivery Path: Flagger is on the radar for future progressive delivery strategies, as current deployments from main lack segregation.
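
Since Flagger came up as the likely progressive-delivery route, here is a minimal, hypothetical Canary shape for orientation; the target, traffic steps, and metric thresholds are illustrative and depend on the mesh or ingress provider in use.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: app                 # hypothetical workload
  namespace: app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  service:
    port: 80
  analysis:
    interval: 1m            # how often traffic is shifted and metrics are checked
    threshold: 5            # failed checks before automatic rollback
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99           # roll back if success rate drops below 99%
        interval: 1m
```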

BT’s journey with Flux is a testament to the power of embracing new technologies to overcome complex challenges. From taming flaky infrastructure to scaling GitOps globally, they’ve not only improved their operations but have also built trust and confidence within the organization, paving the way for even more ambitious GitOps implementations. It’s a story of learning, adapting, and ultimately, succeeding.

Appendix