"Supercharging LLM serving with Dynamo" ( 2026 )

Saturday at 15:40, 20 minutes, UD2.120 (Chavanne), AI Plumbers track, Piotr Tarasiewicz

The explosive growth of Large Language Models (LLMs) demands highly efficient and scalable inference systems. This talk will share the key innovations NVIDIA Dynamo (https://github.com/ai-dynamo/dynamo) adds to enable system-level optimizations while leveraging the performance of inference engines such as vLLM, SGLang, and TRT-LLM:

  • Smart Scheduling that routes requests based on KV cache hit rate and load, autoscales intelligently, and disaggregates the prefill and decode phases (see the routing sketch after this list).
  • Hierarchical Memory Management that utilizes HBM, host memory, local disk, and remote storage.
  • Low-Latency Transfer of the KV cache across nodes and the memory hierarchy.
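To make the KV-cache-aware routing idea concrete, here is a minimal Python sketch. The names (Worker, kv_hit_rate, route) and the scoring rule are hypothetical illustrations, not Dynamo's actual API: the point is simply that a router can trade off expected cache reuse against current load when picking a worker.

```python
# A minimal sketch (hypothetical, not Dynamo's API) of KV-cache-aware routing:
# score each worker by expected KV cache hit rate minus a load penalty,
# and send the request to the best-scoring worker.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_prefixes: set[str] = field(default_factory=set)

def kv_hit_rate(worker: Worker, prompt_blocks: list[str]) -> float:
    """Fraction of the prompt's KV blocks already resident on this worker."""
    if not prompt_blocks:
        return 0.0
    hits = sum(1 for b in prompt_blocks if b in worker.cached_prefixes)
    return hits / len(prompt_blocks)

def route(workers: list[Worker], prompt_blocks: list[str],
          load_weight: float = 0.1) -> Worker:
    """Pick the worker with the best cache-hit vs. load trade-off."""
    def score(w: Worker) -> float:
        return kv_hit_rate(w, prompt_blocks) - load_weight * w.active_requests
    best = max(workers, key=score)
    best.active_requests += 1
    best.cached_prefixes.update(prompt_blocks)  # blocks are cached after serving
    return best

# Example: the worker already holding the prompt prefix wins the route
# despite carrying more load.
if __name__ == "__main__":
    a = Worker("decode-0", active_requests=2, cached_prefixes={"blk0", "blk1"})
    b = Worker("decode-1", active_requests=0)
    print(route([a, b], ["blk0", "blk1", "blk2"]).name)  # -> decode-0
```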

This talk will also introduce production-grade LLM serving features of Dynamo that enable users to:

  • Find the best configuration for disaggregated serving offline (a toy version of this search is sketched below).
  • Tune performance automatically based on real-time traffic.
  • Dynamically scale prefill and decode workers via topology-aware gang scheduling.
  • Leverage LLM-specific fault tolerance.
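As a rough illustration of the offline configuration search, the sketch below assumes a simplified cost model (per-GPU prefill and decode rates are made-up numbers, and best_split is not Dynamo's planner): with a fixed pool of identical GPUs, end-to-end throughput is bottlenecked by the slower of the two disaggregated phases, so the search maximizes the minimum of the two capacities.

```python
# A toy sketch of searching offline for the best prefill/decode split in
# disaggregated serving. With n_gpus identical GPUs, throughput is
# min(prefill capacity, decode capacity); sweep all splits and keep the best.
def best_split(n_gpus: int, prefill_rps: float, decode_rps: float):
    """Return (prefill_workers, decode_workers, throughput) maximizing
    min(p * prefill_rps, d * decode_rps) over all splits p + d = n_gpus."""
    best = (0, 0, 0.0)
    for p in range(1, n_gpus):
        d = n_gpus - p
        throughput = min(p * prefill_rps, d * decode_rps)
        if throughput > best[2]:
            best = (p, d, throughput)
    return best

# Example: decode is 3x slower per GPU here, so it gets 3/4 of the GPUs.
print(best_split(8, prefill_rps=12.0, decode_rps=4.0))  # -> (2, 6, 24.0)
```

Real planners account for much more (sequence lengths, KV transfer cost, SLOs), but the balancing intuition is the same: capacity spent on the faster phase beyond the slower phase's rate is wasted.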