Chipmaker Nvidia has released Dynamo, a new open-source inference-serving software, at its GTC 2025 conference. The software is intended to help enterprises increase throughput and reduce costs when running large language models on Nvidia GPUs.
“Efficiently orchestrating and coordinating AI inference requests across a large fleet of GPUs is crucial to ensuring that AI factories (group of chips running AI workloads) run at the lowest possible cost to maximize token revenue generation,” Nvidia said in a statement.
The chipmaker expects the proliferation of generative AI to drive adoption of reasoning LLMs, which in turn will increase inference workloads. Any means of reducing the cost of those workloads would benefit enterprises, translating into better and faster generative AI experiences for end consumers.
Globally, the AI inference market is expected to grow from $106.15 billion in 2025 to $254.98 billion by 2030, according to a report from MarketsAndMarkets.
Successor to Triton Inference Server
Nvidia Dynamo is the successor to the company's Triton Inference Server, introduced in 2018 as an open-source project for optimizing and serving machine learning models in production environments while making efficient use of GPUs and CPUs.
Dynamo is designed to drive more efficiency by orchestrating and accelerating inference communication across thousands of GPUs, according to Nvidia.
It uses disaggregated serving to separate the context-processing (prefill) and token-generation (decode) phases of large language models (LLMs) onto different GPUs, which allows each phase to be optimized independently for its specific needs and ensures maximum GPU resource utilization, the chipmaker explained.
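In practice, disaggregated serving splits a single request's lifecycle across two pools of hardware: one tuned for the compute-heavy prompt pass, another for the memory-bound generation loop. The minimal Python sketch below illustrates the idea only; the worker functions and the kv_cache_ref handle are hypothetical names, not part of Dynamo's API.

# Conceptual sketch of disaggregated serving: prefill (prompt processing) and
# decode (token generation) run on separate worker pools so each can be scaled
# and optimized independently. Names here are illustrative, not Dynamo's API.
from dataclasses import dataclass

@dataclass
class PrefillResult:
    request_id: str
    kv_cache_ref: str   # handle to the KV cache produced during prefill

def prefill_worker(request_id: str, prompt: str) -> PrefillResult:
    # On a real system this runs the full prompt through the model once,
    # materializing the KV cache on a GPU suited to high-throughput prefill.
    kv_cache_ref = f"kv://{request_id}"
    return PrefillResult(request_id, kv_cache_ref)

def decode_worker(result: PrefillResult, max_new_tokens: int) -> list[str]:
    # A separate pool of GPUs reuses the transferred KV cache and generates
    # tokens step by step, a latency-sensitive, memory-bound workload.
    return [f"token_{i}" for i in range(max_new_tokens)]

if __name__ == "__main__":
    pre = prefill_worker("req-1", "Explain disaggregated serving.")
    tokens = decode_worker(pre, max_new_tokens=5)
    print(pre.kv_cache_ref, tokens)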
The efficiency gain is made possible as Dynamo has the ability to map the knowledge that inference systems hold in memory from serving prior requests — known as KV cache — across potentially thousands of GPUs.
It then routes new inference requests to the GPUs that have the best knowledge match, avoiding costly re-computations and freeing up GPUs to respond to new incoming requests, the chipmaker explained.
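The routing step can be pictured as a simple scoring problem: compare the incoming prompt against what each GPU already holds in its KV cache and send the request wherever the overlap is largest. The toy Python sketch below illustrates that idea under simplifying assumptions (token-level prefix matching, a static view of each GPU's cache); it is not Nvidia's routing algorithm.

# Illustrative sketch of KV-cache-aware routing: pick the GPU whose cached
# prompt prefixes overlap most with the incoming request, so the least work
# has to be recomputed. This is a toy scorer, not Dynamo's routing logic.

def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the common token prefix between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route_request(prompt_tokens: list[str],
                  gpu_caches: dict[str, list[list[str]]]) -> str:
    """Return the GPU id whose cached sequences best match the new prompt."""
    best_gpu, best_score = None, -1
    for gpu_id, cached_sequences in gpu_caches.items():
        score = max((shared_prefix_len(prompt_tokens, seq)
                     for seq in cached_sequences), default=0)
        if score > best_score:
            best_gpu, best_score = gpu_id, score
    return best_gpu

gpu_caches = {
    "gpu-0": [["system", "you", "are", "a", "helpful", "assistant"]],
    "gpu-1": [["system", "you", "are", "a", "coding", "tutor"]],
}
print(route_request(["system", "you", "are", "a", "helpful", "bot"], gpu_caches))  # gpu-0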
Dynamo upgrades make it better than vLLM and SGLang
Dynamo includes four upgrades over its predecessor that may help it reduce inference serving costs: a GPU Planner, a Smart Router, a low-latency Communication Library, and a Memory Manager.
The GPU Planner lets enterprises add, remove, and reallocate GPUs in response to fluctuating request volumes and types, avoiding over- and under-provisioning, while the low-latency Communication Library enables faster GPU-to-GPU communication and data transfer.
The Smart Router, meanwhile, allows Dynamo to pinpoint the specific GPUs in large clusters that can minimize response computations and to route queries to them, Nvidia said.
Additionally, enterprises can use Dynamo's Memory Manager to offload inference data to more affordable memory and storage devices and quickly retrieve it when needed, minimizing inference costs, the chipmaker added.
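Conceptually, that kind of offloading behaves like a tiered cache: hot KV-cache entries stay in scarce GPU memory, colder entries spill to cheaper host memory or storage, and are pulled back when a matching request arrives. The toy Python sketch below illustrates the pattern; the two tiers and the LRU eviction policy are illustrative assumptions, not Nvidia's design.

# Toy sketch of tiered KV-cache offloading: keep hot entries in (simulated)
# GPU memory, spill cold ones to a cheaper host tier, and reload on demand.
# The tiers and eviction policy are illustrative assumptions only.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_capacity: int):
        self.gpu_tier: OrderedDict[str, bytes] = OrderedDict()  # fast, scarce
        self.host_tier: dict[str, bytes] = {}                   # cheaper, larger
        self.gpu_capacity = gpu_capacity

    def put(self, key: str, blob: bytes) -> None:
        self.gpu_tier[key] = blob
        self.gpu_tier.move_to_end(key)
        while len(self.gpu_tier) > self.gpu_capacity:
            # Evict the least-recently-used entry to the cheaper tier.
            old_key, old_blob = self.gpu_tier.popitem(last=False)
            self.host_tier[old_key] = old_blob

    def get(self, key: str):
        if key in self.gpu_tier:
            self.gpu_tier.move_to_end(key)
            return self.gpu_tier[key]
        if key in self.host_tier:
            # Promote back to the fast tier when a request needs it again.
            blob = self.host_tier.pop(key)
            self.put(key, blob)
            return blob
        return None

cache = TieredKVCache(gpu_capacity=2)
for i in range(3):
    cache.put(f"req-{i}", b"kv-blocks")
print("req-0 now lives in host tier:", "req-0" in cache.host_tier)
print(cache.get("req-0") is not None)  # reloaded on demand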
The chipmaker claims to have used Dynamo to generate 30x more tokens per GPU when running the DeepSeek-R1 model on a large cluster of GB200 NVL72 racks, and to double the performance of Hopper GPUs serving Llama models.
However, Abhivyakti Sengar, practice director at Everest Group, is not convinced by these claims.
“The claim of 30x cost reduction and faster inference is compelling, but enterprises will need to test these optimizations in real-world workloads,” Sengar said, adding that if Dynamo delivers on its promise, it could redefine AI reasoning at scale, making AI applications more accessible and cost-effective.
At the same time, Sengar pointed out that the open-source nature of the software shifts the responsibility of integration, optimization, and security to enterprises, and these factors will determine its true impact in production environments.
Availability through NIM microservices and AI Enterprise software
Dynamo, according to the chipmaker, will be made available via its NIM microservices and supported in a future release by the NVIDIA AI Enterprise software platform.
Additionally, Nvidia has partnered with cloud service providers and other vendors, including Oracle, AWS, Microsoft, IBM, and Google Cloud, to make its NIM microservices available through their enterprise AI platforms, such as OCI, Vertex AI, and Azure AI Foundry, among others.
Dynamo supports PyTorch, SGLang, and vLLM, and enterprises will be able to serve AI models across disaggregated inference scenarios as well.