"Supercharging LLM serving with Dynamo" ( 2026 )

Saturday at 15:40, 20 minutes, UD2.120 (Chavanne), AI Plumbers track, Piotr Tarasiewicz

The explosive growth of Large Language Models (LLMs) demands highly efficient and scalable inference systems. This talk will share the key innovations NVIDIA Dynamo (https://github.com/ai-dynamo/dynamo) adds to enable system-level optimizations while leveraging the performance of inference engines such as vLLM, SGLang, and TRT-LLM:

  • Smart Scheduling that routes requests based on KV cache hit rate and load, autoscales intelligently, and disaggregates the prefill and decode phases (see the routing sketch after this list).
  • Hierarchical Memory Management that utilizes HBM, host memory, local disk, and remote storage.
  • Low-Latency Transfer of the KV cache across nodes and the memory hierarchy.
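To make the KV-cache-aware routing idea concrete, here is a minimal Python sketch. The names (Worker, kv_hit_rate, route) and the scoring rule are hypothetical illustrations, not Dynamo's actual API: the point is simply that a router can trade off expected cache reuse against current load when picking a worker.

```python
# A minimal sketch (hypothetical, not Dynamo's API) of KV-cache-aware routing:
# score each worker by expected KV cache hit rate minus a load penalty,
# and send the request to the best-scoring worker.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_prefixes: set[str] = field(default_factory=set)

def kv_hit_rate(worker: Worker, prompt_blocks: list[str]) -> float:
    """Fraction of the prompt's KV blocks already resident on this worker."""
    if not prompt_blocks:
        return 0.0
    hits = sum(1 for b in prompt_blocks if b in worker.cached_prefixes)
    return hits / len(prompt_blocks)

def route(workers: list[Worker], prompt_blocks: list[str],
          load_weight: float = 0.1) -> Worker:
    """Pick the worker with the best cache-hit vs. load trade-off."""
    def score(w: Worker) -> float:
        return kv_hit_rate(w, prompt_blocks) - load_weight * w.active_requests
    best = max(workers, key=score)
    best.active_requests += 1
    best.cached_prefixes.update(prompt_blocks)  # blocks are cached after serving
    return best

# Example: the worker already holding the prompt prefix wins the route
# despite carrying more load.
if __name__ == "__main__":
    a = Worker("decode-0", active_requests=2, cached_prefixes={"blk0", "blk1"})
    b = Worker("decode-1", active_requests=0)
    print(route([a, b], ["blk0", "blk1", "blk2"]).name)  # -> decode-0
```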

This talk will also introduce production-grade LLM serving features of Dynamo that enable users to:

  • Find the best configuration for disaggregated serving offline (a toy version of this search is sketched below).
  • Tune performance automatically based on real-time traffic.
  • Dynamically scale prefill and decode workers via topology-aware gang scheduling.
  • Leverage LLM-specific fault tolerance.
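As a rough illustration of the offline configuration search, the sketch below assumes a simplified cost model (per-GPU prefill and decode rates are made-up numbers, and best_split is not Dynamo's planner): with a fixed pool of identical GPUs, end-to-end throughput is bottlenecked by the slower of the two disaggregated phases, so the search maximizes the minimum of the two capacities.

```python
# A toy sketch of searching offline for the best prefill/decode split in
# disaggregated serving. With n_gpus identical GPUs, throughput is
# min(prefill capacity, decode capacity); sweep all splits and keep the best.
def best_split(n_gpus: int, prefill_rps: float, decode_rps: float):
    """Return (prefill_workers, decode_workers, throughput) maximizing
    min(p * prefill_rps, d * decode_rps) over all splits p + d = n_gpus."""
    best = (0, 0, 0.0)
    for p in range(1, n_gpus):
        d = n_gpus - p
        throughput = min(p * prefill_rps, d * decode_rps)
        if throughput > best[2]:
            best = (p, d, throughput)
    return best

# Example: decode is 3x slower per GPU here, so it gets 3/4 of the GPUs.
print(best_split(8, prefill_rps=12.0, decode_rps=4.0))  # -> (2, 6, 24.0)
```

Real planners account for much more (sequence lengths, KV transfer cost, SLOs), but the balancing intuition is the same: capacity spent on the faster phase beyond the slower phase's rate is wasted.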