Scaling AI Inference: A Deep Dive into the Boyi Inference Platform
The demand for real-time AI is exploding. But serving those models, especially complex embedding and re-ranking models, at scale while maintaining lightning-fast response times is a monumental challenge. Angel Lim and Andrew Gaut, engineers from the Boyi inference and research teams (formerly at Voyage, now part of a larger organization), recently shared their insights into how they built the Boyi inference platform to tackle this head-on. Let's explore the key strategies and technologies they're using to deliver high-performance AI in production.
The Challenge: Low Latency, High Throughput, and a Heterogeneous Landscape
The Boyi team's primary goal was simple: meet strict latency SLOs on interactive retrieval paths. Think about it: even a tiny delay can be frustrating for users. They quickly realized that latency wasn't just about raw GPU power; it was a complex interplay of factors.
Here’s a breakdown of the core challenges:
- Latency Breakdown: Latency is a combination of overhead (request routing, autoscaling) and GPU compute. Minimizing both is crucial.
- The Throughput Tradeoff: Optimizing for latency often comes at the expense of overall GPU throughput. Finding the right balance is key.
- Model Diversity: The platform needs to handle a heterogeneous mix of embedding and re-ranking models, each with its own unique workload patterns. Predictability is a luxury they don’t have.
Key Optimizations: A Toolkit for Speed and Efficiency
So, how did the Boyi team overcome these hurdles? They employed a clever combination of techniques, focusing on both request processing and infrastructure management.
1. Dynamic Query Batching: Squeezing More Out of Every GPU
One of the most impactful optimizations was dynamic query batching. The team discovered that GPUs are often “memory-bound” when processing very small requests (batch size 1, token count < 512). By intelligently grouping small queries together without increasing latency, they significantly improved GPU utilization. This reduces queue wait times and boosts overall throughput.
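To make the idea concrete, here is a minimal sketch of dynamic query batching in Python. It is not the Boyi implementation; the token budget, the wait window, and the request format (dicts carrying a num_tokens field) are all assumptions for illustration.

```python
import queue
import time

# Minimal sketch of dynamic query batching: small requests are grouped up to a
# token budget or until a short wait window expires, whichever comes first.
# The knobs below are illustrative, not Boyi's actual values.
MAX_BATCH_TOKENS = 512   # below this, the GPU is typically memory-bound
MAX_WAIT_MS = 2          # cap on the extra queueing delay batching may add

def collect_batch(request_queue: "queue.Queue") -> list:
    """Pull requests off the queue until the token budget or wait window is hit."""
    batch, batch_tokens = [], 0
    # Block for the first request so an idle server doesn't spin.
    first = request_queue.get()
    batch.append(first)
    batch_tokens += first["num_tokens"]
    deadline = time.monotonic() + MAX_WAIT_MS / 1000.0
    while batch_tokens < MAX_BATCH_TOKENS and time.monotonic() < deadline:
        try:
            nxt = request_queue.get(timeout=max(0.0, deadline - time.monotonic()))
        except queue.Empty:
            break
        batch.append(nxt)
        batch_tokens += nxt["num_tokens"]
    return batch  # handed to the model as a single forward pass
```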
2. Unbatching: Taming Large, Latency-Sensitive Requests
While batching is great for smaller requests, larger, latency-sensitive requests (common in re-ranking) require a different approach. The Boyi platform utilizes unbatching, splitting large batches into smaller execution units and distributing them across multiple GPUs in parallel. The result: a dramatic reduction in latency, roughly 5 seconds for a large request compared to 17 seconds without unbatching!
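Below is an illustrative Python sketch of that fan-out pattern: split a large request into chunks and score them on several GPU workers concurrently. The chunk size, the thread-pool fan-out, and the run_on_gpu callable (with its gpu_id keyword) are placeholders, not details from the talk.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of unbatching: a large, latency-sensitive request is split into smaller
# execution units and distributed across GPU workers in parallel.
def unbatch_and_run(documents: list, run_on_gpu, num_gpus: int, chunk_size: int = 32):
    # Split the request into fixed-size chunks, preserving document order.
    chunks = [documents[i:i + chunk_size] for i in range(0, len(documents), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        # Round-robin chunks over GPU ids; each worker scores its slice independently.
        futures = [pool.submit(run_on_gpu, chunk, gpu_id=i % num_gpus)
                   for i, chunk in enumerate(chunks)]
        results = [f.result() for f in futures]
    # Re-assemble the partial results in the original document order.
    return [score for chunk_scores in results for score in chunk_scores]
```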
3. Two-Tiered Autoscaling: Responding to Bursty Traffic
Handling unpredictable traffic spikes is essential for any production system. The Boyi platform uses a sophisticated two-tiered autoscaling system:
- Model Autoscaler: This component dynamically adjusts the number of replicas for each model based on real-time queue metrics (in rate, out rate, backlog, total tokens); a toy version of this calculation is sketched after the list.
- Cluster Autoscaler: This manages the overall GPU cluster size, scaling up or down based on demand while maintaining a “warm pool” of GPUs for rapid scaling. The goal is to meet latency SLOs without over-provisioning resources.
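Here is the toy replica-count calculation referenced above, assuming the queue exposes the metrics mentioned in the talk (in rate, backlog). The drain target, the per-replica throughput, and the example numbers are illustrative, not Boyi's actual policy.

```python
import math

def desired_replicas(in_rate_tps: float, per_replica_tps: float,
                     backlog_tokens: float, drain_target_s: float = 5.0,
                     min_replicas: int = 1) -> int:
    """Replicas needed to keep up with arrivals and drain the backlog in time."""
    # Steady-state load plus the extra capacity needed to clear the backlog
    # within the drain target (a stand-in for the latency SLO).
    required_tps = in_rate_tps + backlog_tokens / drain_target_s
    return max(min_replicas, math.ceil(required_tps / per_replica_tps))

# Example: 120k tokens/s arriving, 40k tokens/s per replica, 200k tokens
# queued -> 4 replicas.
print(desired_replicas(120_000, 40_000, 200_000))
```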
4. Fast Pod Startup: Eliminating the Waiting Game
GPU pod startup times can be a major bottleneck, often taking minutes. The Boyi team tackled this head-on with these optimizations:
- Pre-caching Container Images: Images are pre-loaded onto nodes, eliminating the download overhead.
- Multi-Tier Caching for Model Weights: Model weights are strategically cached closer to GPU memory over time, drastically reducing loading latency. This shaved pod startup time down from minutes to seconds.
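As a rough sketch, a multi-tier weight cache can be expressed as a lookup cascade from the node's local disk to a shared cache to remote storage. The paths and the download_from_remote helper below are hypothetical; they only illustrate the tiering idea, not Boyi's actual layout.

```python
import os
import shutil

LOCAL_CACHE = "/var/cache/models"         # fastest tier, on the node itself
SHARED_CACHE = "/mnt/shared-model-cache"  # warm tier shared across nodes

def resolve_weights(model_name: str, download_from_remote) -> str:
    """Return a local path to the model weights, promoting them toward the GPU."""
    local_path = os.path.join(LOCAL_CACHE, model_name)
    if os.path.exists(local_path):
        return local_path                            # hit: seconds, not minutes
    shared_path = os.path.join(SHARED_CACHE, model_name)
    if os.path.exists(shared_path):
        shutil.copytree(shared_path, local_path)     # promote into the local tier
        return local_path
    os.makedirs(local_path, exist_ok=True)
    download_from_remote(model_name, local_path)     # cold path: remote object store
    return local_path
```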
5. GPU Performance Tweaks: Maximizing Arithmetic Intensity
Finally, the team implemented several low-level GPU optimizations to squeeze every last bit of performance:
- Sequence Packing: Transforms a batch of variable-length sequences into a single 1D list to eliminate padding waste, improving arithmetic intensity (see the sketch after this list).
- Kernel Fusion: Combines multiple GPU kernels into a single kernel to reduce memory transfers and increase arithmetic intensity.
- Kernel Launch Overhead Reduction: Utilizing CUDA graphs to accelerate kernel submission to the GPU.
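The sequence-packing sketch referenced in the list above, in plain Python: variable-length sequences are flattened into one token stream plus cumulative-length offsets, so no compute is spent on padding. The kernel interface that would consume these offsets (e.g. a variable-length attention kernel) is not shown and is outside what the talk covered in detail.

```python
def pack_sequences(sequences: list[list[int]]):
    """Flatten a ragged batch into one 1D token list plus boundary offsets."""
    packed, cu_seqlens = [], [0]
    for seq in sequences:
        packed.extend(seq)                            # concatenate tokens end to end
        cu_seqlens.append(cu_seqlens[-1] + len(seq))  # cumulative sequence boundaries
    return packed, cu_seqlens

# Three queries of lengths 3, 1, and 2 pack into 6 tokens instead of a 3x3
# padded batch of 9 slots; cu_seqlens tells the kernel where each sequence ends.
tokens, offsets = pack_sequences([[101, 7, 102], [101], [101, 9]])
print(len(tokens), offsets)  # 6 [0, 3, 4, 6]
```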
Tools and Technologies Powering the Platform
The Boyi inference platform leverages a powerful stack of technologies:
- CUDA: The foundation for GPU programming and kernel development.
- Queuing Theory: Used to model queue behavior and inform autoscaling decisions, a data-driven approach to resource management (a quick worked example follows the list).
- Python & Rust: Considered for optimizing kernel launch overhead, highlighting the importance of choosing the right language for performance-critical tasks.
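The worked example referenced above: a back-of-the-envelope queueing estimate that turns a backlog and an aggregate service rate into an expected wait time the autoscaler can compare against the latency SLO. The numbers are illustrative, not measurements from the talk.

```python
def expected_wait_s(backlog_requests: float, service_rate_rps: float) -> float:
    """Approximate wait for a newly enqueued request behind the current backlog,
    assuming a stable, work-conserving queue."""
    return backlog_requests / service_rate_rps

# 300 requests queued, replicas collectively serving 150 requests/s -> ~2 s wait,
# a signal that can be checked directly against the latency SLO.
print(expected_wait_s(300, 150))
```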
Key Takeaways and Future Directions
The Boyi inference platform demonstrates that achieving low latency and high throughput for AI inference is possible with a combination of clever architectural choices, intelligent optimizations, and a deep understanding of GPU performance. The team’s focus on dynamic batching, unbatching, and efficient autoscaling provides a blueprint for building scalable and responsive AI systems. It’s a testament to the power of engineering ingenuity in the face of demanding performance requirements.