Chipmaker Nvidia has released Dynamo, a new open-source inference-serving software, at its GTC 2025 conference. The software is intended to help enterprises increase throughput and reduce costs when running large language models on Nvidia GPUs.
“Efficiently orchestrating and coordinating AI inference requests across a large fleet of GPUs is crucial to ensuring that AI factories (group of chips running AI workloads) run at the lowest possible cost to maximize token revenue generation,” Nvidia said in a statement.
The chipmaker expects the proliferation of generative AI to drive adoption of reasoning LLMs, which in turn will increase inference workloads. Any means of reducing the cost of those workloads would benefit enterprises, translating into better and faster generative AI experiences for end consumers.
Globally, the AI inference market is expected to grow from $106.15 billion in 2025 to $254.98 billion by 2030, according to a report from MarketsAndMarkets.
Successor to Triton Inference Server
Nvidia Dynamo is the successor to the company's Triton Inference Server, introduced in 2018 as an open-source project for optimizing and serving machine learning models in production environments while making efficient use of GPUs and CPUs.
Dynamo is designed to drive more efficiency by orchestrating and accelerating inference communication across thousands of GPUs, according to Nvidia.
It uses disaggregated serving to separate the context-processing (prefill) and token-generation (decode) phases of large language models (LLMs) onto different GPUs, which allows each phase to be optimized independently for its specific needs and ensures maximum GPU resource utilization, the chipmaker explained.
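In practice, disaggregated serving splits a single request's lifecycle across two pools of hardware: one tuned for the compute-heavy prompt pass, another for the memory-bound generation loop. The minimal Python sketch below illustrates the idea only; the worker functions and the kv_cache_ref handle are hypothetical names, not part of Dynamo's API.

# Conceptual sketch of disaggregated serving: prefill (prompt processing) and
# decode (token generation) run on separate worker pools so each can be scaled
# and optimized independently. Names here are illustrative, not Dynamo's API.
from dataclasses import dataclass

@dataclass
class PrefillResult:
    request_id: str
    kv_cache_ref: str   # handle to the KV cache produced during prefill

def prefill_worker(request_id: str, prompt: str) -> PrefillResult:
    # On a real system this runs the full prompt through the model once,
    # materializing the KV cache on a GPU suited to high-throughput prefill.
    kv_cache_ref = f"kv://{request_id}"
    return PrefillResult(request_id, kv_cache_ref)

def decode_worker(result: PrefillResult, max_new_tokens: int) -> list[str]:
    # A separate pool of GPUs reuses the transferred KV cache and generates
    # tokens step by step, a latency-sensitive, memory-bound workload.
    return [f"token_{i}" for i in range(max_new_tokens)]

if __name__ == "__main__":
    pre = prefill_worker("req-1", "Explain disaggregated serving.")
    tokens = decode_worker(pre, max_new_tokens=5)
    print(pre.kv_cache_ref, tokens)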
The efficiency gain is made possible as Dynamo has the ability to map the knowledge that inference systems hold in memory from serving prior requests — known as KV cache — across potentially thousands of GPUs.
It then routes new inference requests to the GPUs that have the best knowledge match, avoiding costly re-computations and freeing up GPUs to respond to new incoming requests, the chipmaker explained.
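The routing step can be pictured as a simple scoring problem: compare the incoming prompt against what each GPU already holds in its KV cache and send the request wherever the overlap is largest. The toy Python sketch below illustrates that idea under simplifying assumptions (token-level prefix matching, a static view of each GPU's cache); it is not Nvidia's routing algorithm.

# Illustrative sketch of KV-cache-aware routing: pick the GPU whose cached
# prompt prefixes overlap most with the incoming request, so the least work
# has to be recomputed. This is a toy scorer, not Dynamo's routing logic.

def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the common token prefix between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route_request(prompt_tokens: list[str],
                  gpu_caches: dict[str, list[list[str]]]) -> str:
    """Return the GPU id whose cached sequences best match the new prompt."""
    best_gpu, best_score = None, -1
    for gpu_id, cached_sequences in gpu_caches.items():
        score = max((shared_prefix_len(prompt_tokens, seq)
                     for seq in cached_sequences), default=0)
        if score > best_score:
            best_gpu, best_score = gpu_id, score
    return best_gpu

gpu_caches = {
    "gpu-0": [["system", "you", "are", "a", "helpful", "assistant"]],
    "gpu-1": [["system", "you", "are", "a", "coding", "tutor"]],
}
print(route_request(["system", "you", "are", "a", "helpful", "bot"], gpu_caches))  # gpu-0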
Dynamo upgrades make it better than vLLM and SGLang
Dynamo includes four upgrades over its predecessor that may help it reduce inference serving costs: a GPU Planner, a Smart Router, a low-latency Communication Library, and a Memory Manager.
The GPU Planner lets enterprises add, remove, and reallocate GPUs in response to fluctuating request volumes and types, avoiding over- and under-provisioning, while the low-latency Communication Library enables faster GPU-to-GPU communication and data transfer.
The Smart Router, meanwhile, allows Dynamo to pinpoint the specific GPUs in large clusters that can minimize response computations and to route queries to them, Nvidia said.
Additionally, enterprises can use Dynamo's Memory Manager to offload inference data to more affordable memory and storage devices and quickly retrieve it when needed, minimizing inference costs, the chipmaker added.
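Conceptually, that kind of offloading behaves like a tiered cache: hot KV-cache entries stay in scarce GPU memory, colder entries spill to cheaper host memory or storage, and are pulled back when a matching request arrives. The toy Python sketch below illustrates the pattern; the two tiers and the LRU eviction policy are illustrative assumptions, not Nvidia's design.

# Toy sketch of tiered KV-cache offloading: keep hot entries in (simulated)
# GPU memory, spill cold ones to a cheaper host tier, and reload on demand.
# The tiers and eviction policy are illustrative assumptions only.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_capacity: int):
        self.gpu_tier: OrderedDict[str, bytes] = OrderedDict()  # fast, scarce
        self.host_tier: dict[str, bytes] = {}                   # cheaper, larger
        self.gpu_capacity = gpu_capacity

    def put(self, key: str, blob: bytes) -> None:
        self.gpu_tier[key] = blob
        self.gpu_tier.move_to_end(key)
        while len(self.gpu_tier) > self.gpu_capacity:
            # Evict the least-recently-used entry to the cheaper tier.
            old_key, old_blob = self.gpu_tier.popitem(last=False)
            self.host_tier[old_key] = old_blob

    def get(self, key: str):
        if key in self.gpu_tier:
            self.gpu_tier.move_to_end(key)
            return self.gpu_tier[key]
        if key in self.host_tier:
            # Promote back to the fast tier when a request needs it again.
            blob = self.host_tier.pop(key)
            self.put(key, blob)
            return blob
        return None

cache = TieredKVCache(gpu_capacity=2)
for i in range(3):
    cache.put(f"req-{i}", b"kv-blocks")
print("req-0 now lives in host tier:", "req-0" in cache.host_tier)
print(cache.get("req-0") is not None)  # reloaded on demand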
The chipmaker claims to have used Dynamo to generate 30x more tokens per GPU when running the DeepSeek-R1 model on a large cluster of GB200 NVL72 racks, and to double the performance of Hopper GPUs serving Llama models.
However, Abhivyakti Sengar, practice director at Everest Group, is not convinced by these claims.
“The claim of 30x cost reduction and faster inference is compelling, but enterprises will need to test these optimizations in real-world workloads,” Sengar said, adding that if Dynamo delivers on its promise, it could redefine AI reasoning at scale, making AI applications more accessible and cost-effective.
At the same time, Sengar pointed out that the open-source nature of the software shifts the responsibility of integration, optimization, and security to enterprises, and these factors will determine its true impact in production environments.
Availability through NIM microservices and AI Enterprise software
Dynamo, according to the chipmaker, will be made available via its NIM microservices and supported in a future release by the NVIDIA AI Enterprise software platform.
Additionally, Nvidia has partnered with cloud service providers and other vendors, including Oracle, AWS, Microsoft, IBM, and Google Cloud, to make its NIM microservices available through their enterprise AI platforms, such as OCI, Vertex AI, and Azure AI Foundry, among others.
Dynamo supports PyTorch, SGLang, and vLLM, and enterprises will be able to serve AI models across disaggregated inference scenarios as well.