At its March GTC keynote, Nvidia’s chief executive recast the AI battleground. Jensen Huang argued that the era of one‑off model training is giving way to a permanently running economy of inference — the nonstop production of tokens — and placed Nvidia at the centre of that shift by unveiling a next‑generation platform codenamed Vera Rubin.
Huang framed data centres as factories that take in electricity and data and spit out tokens — the basic units of text that models consume and produce — and he urged the market to treat compute as a revenue engine rather than merely a capital cost. To support that thesis Nvidia announced Vera Rubin, a platform Huang says delivers roughly ten times the per‑watt inference performance of the previous generation and can reduce token production costs by as much as 90 percent when paired with specialised low‑latency processors acquired from Groq.
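The "factory" framing reduces to simple unit economics: tokens per second divided into watts consumed gives an energy cost per token. The sketch below illustrates the arithmetic with made-up figures — the power draw, throughput and electricity price are illustrative assumptions, not Nvidia's published numbers — and shows how a tenfold per‑watt improvement flows straight through to token cost.

```python
# Illustrative sketch of inference "factory" economics.
# All figures below are assumptions for illustration only.

def energy_cost_per_million_tokens(tokens_per_second: float,
                                   power_watts: float,
                                   price_per_kwh: float) -> float:
    """Electricity cost (USD) to generate one million tokens."""
    seconds = 1_000_000 / tokens_per_second
    kwh = power_watts * seconds / 3_600_000  # watt-seconds -> kWh
    return kwh * price_per_kwh

# Hypothetical baseline accelerator: 1,000 W, 10,000 tokens/s, $0.10/kWh.
baseline = energy_cost_per_million_tokens(10_000, 1_000, 0.10)

# A 10x tokens-per-watt improvement: same power budget, 10x the throughput.
next_gen = energy_cost_per_million_tokens(100_000, 1_000, 0.10)

print(f"baseline:     ${baseline:.4f} per 1M tokens")
print(f"10x per-watt: ${next_gen:.4f} per 1M tokens")
```

At data-centre scale, that linear relationship is why per‑watt performance, not peak FLOPS, is the headline metric in the token economy.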
Technically, Nvidia’s pitch is a hybrid one. Heavy, memory‑bound parts of inference run on Vera Rubin GPUs, while the final, latency‑sensitive step of token emission is handed to Groq’s deterministic data‑flow processors. The company also showcased a vertically integrated stack — processors, networking, liquid cooling and a digital twin design tool — intended to standardise and accelerate the construction of large‑scale inference centres.
Nvidia’s strategic aim is clear: to define both the metaphor and the economics of the new era so that customers equate Nvidia’s stack with the lowest possible token cost. If Huang’s market sizing is right, the token economy could swell to as much as a trillion‑dollar scale by 2027, making token price and token throughput fundamental commercial metrics for cloud providers and AI businesses.
But Huang’s bid to crown Nvidia the token economy’s kingmaker did not go unchallenged. Chinese large‑language model providers are already exhibiting a powerful price advantage in the inference market. Aggregator platform OpenRouter recorded weeks in February and March when Chinese models consumed more tokens than their US counterparts — 4.12 trillion tokens versus 2.94 trillion in one week — and they have stayed ahead in overall weekly volumes since.
That lead is driven by price. Published pricing shows Chinese models charging roughly one‑sixth to one‑tenth of what comparable foreign offerings charge per token. Benchmark cost comparisons amplify this gap: one Chinese model, Minimax M2.5, reportedly incurred only about $125 in inference cost for a standard test run, versus $4,970 for Claude Opus 4.6 and $3,244 for GPT‑5.2 Codex on the same test. Minimax also consumed fewer tokens per run, compounding the advantage.
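Expressed as multiples, the reported gap is stark. A quick calculation over the figures cited above (the dollar amounts and model names are as reported in the benchmark, not independently verified):

```python
# Reported inference cost for the same benchmark run, per the cited comparison.
costs = {
    "Minimax M2.5": 125,
    "Claude Opus 4.6": 4_970,
    "GPT-5.2 Codex": 3_244,
}

cheapest = min(costs.values())
for model, cost in costs.items():
    # Ratio of each model's run cost to the cheapest run.
    print(f"{model}: ${cost:,} ({cost / cheapest:.1f}x the cheapest)")
```

On these numbers the most expensive run costs roughly forty times the cheapest — a differential far larger than the headline per‑token price gap, because token consumption per task also differs.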
Two technical trends underpin China’s low cost. First, inference architectures have evolved to squeeze memory and compute: techniques such as Multi‑Head Latent Attention (MLA) to compress KV caches, Mixture‑of‑Experts layers, FP8 mixed precision, and multi‑token prediction reduce the compute per token and enable competitive performance on more modest GPUs. Second, operating costs are lower: domestic electricity prices and improved data‑centre scheduling mean each GPU can deliver tokens more cheaply. Analysts estimate that running inference on Chinese power grids can save hundreds of dollars per GPU annually, and large shipments of H200/B200‑class chips could translate into tens of millions in yearly energy savings at scale.
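The KV‑cache compression mentioned above is the most concrete of these levers: standard attention caches a full key and value vector per head per layer for every generated token, while MLA caches a single low‑dimensional latent per layer. A back‑of‑the‑envelope sizing sketch — the layer counts, head counts and latent dimension below are illustrative, loosely in the style of published MLA designs, not any specific model's spec — shows why this matters for memory‑bound inference:

```python
# Rough sketch of KV-cache memory per generated token.
# All dimensions are illustrative assumptions, not a real model's config.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    # Standard attention: one key and one value vector per head, per layer.
    return n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem

def mla_bytes_per_token(n_layers: int, latent_dim: int,
                        rope_dim: int, bytes_per_elem: int) -> int:
    # MLA: one compressed latent (plus a small RoPE key part) per layer.
    return n_layers * (latent_dim + rope_dim) * bytes_per_elem

# Hypothetical large model: 60 layers, 128 heads of dim 128, FP16 cache.
full = kv_bytes_per_token(n_layers=60, n_kv_heads=128, head_dim=128,
                          bytes_per_elem=2)
# Same model with an MLA-style latent of 512 dims plus a 64-dim RoPE part.
mla = mla_bytes_per_token(n_layers=60, latent_dim=512, rope_dim=64,
                          bytes_per_elem=2)

print(f"full KV cache: {full / 1024:.1f} KiB/token")
print(f"MLA cache:     {mla / 1024:.1f} KiB/token ({full / mla:.0f}x smaller)")
```

Under these assumptions the per‑token cache shrinks by more than an order of magnitude, which translates directly into longer contexts and more concurrent requests per GPU — the mechanism behind "competitive performance on more modest GPUs".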
The contest is therefore not just about raw silicon. It is a tussle over systems, software and the economics of running inference at scale. Nvidia hopes to answer with an integrated hardware‑software stack that promises the lowest token cost through co‑design and vertical integration. Chinese providers answer with architectures, engineering optimisations and a cost base that together create an attractive price‑performance proposition for developers and agent builders.
That dynamic has market consequences. If token prices remain the dominant axis of competition, buyers will gravitate toward cheaper inference, especially for agentic applications that consume tokens continuously. A persistent price differential could shift developer ecosystems, API routing and commercial partnerships toward low‑cost providers, enabling rapid growth of Chinese models in global markets despite ongoing differences in model capabilities, latency profiles and regulatory regimes.
The coming months will test these claims. Nvidia’s performance and cost figures for Vera Rubin are manufacturer projections and will face scrutiny in independent benchmarks and in real‑world deployments that confront cooling, power and network realities. Likewise, Chinese models will need to sustain their price advantage while proving reliability, safety and compliance in diverse markets. The token era has opened a new front in the AI race: speed and scale matter, but so do the unit economics of every generated token.
