Microsoft has publicly unveiled Maia 200, a second-generation in-house AI accelerator the company says is optimised for large-model inference and designed to compete directly with the custom chips rolled out by Amazon and Google. Built on TSMC’s 3nm process, the Maia 200 packs more than 140 billion transistors and native FP8/FP4 tensor cores; Microsoft says the chip delivers over 10 petaFLOPS of FP4 compute and more than 5 petaFLOPS at FP8 while keeping SoC thermal design power under 750W.
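Taken at face value, those headline figures pencil out to compute density in the low tens of teraFLOPS per watt. A minimal back-of-envelope sketch, using only the numbers Microsoft has cited rather than any measured data:

```python
# Back-of-envelope compute density from Microsoft's cited figures
# (vendor peak numbers taken at face value; sustained throughput will differ).
FP4_PFLOPS = 10    # claimed peak FP4 throughput, petaFLOPS
FP8_PFLOPS = 5     # claimed peak FP8 throughput, petaFLOPS
TDP_WATTS = 750    # stated SoC thermal design power ceiling

fp4_tflops_per_watt = FP4_PFLOPS * 1_000 / TDP_WATTS   # ~13.3 TFLOPS/W
fp8_tflops_per_watt = FP8_PFLOPS * 1_000 / TDP_WATTS   # ~6.7 TFLOPS/W

print(f"FP4: {fp4_tflops_per_watt:.1f} TFLOPS/W, FP8: {fp8_tflops_per_watt:.1f} TFLOPS/W")
```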
The company also highlighted a substantial memory and interconnect stack: each Maia 200 chip is paired with 216GB of HBM3e offering about 7TB/s of bandwidth, plus 272MB of on-chip SRAM. Microsoft reports a bidirectional expansion link of 2.8TB/s per chip and says the architecture supports predictable collective operations across clusters of up to 6,144 accelerators. In Microsoft’s internal comparisons, the Maia 200’s FP4 throughput is more than three times that of Amazon’s Trainium v3, and its FP8 performance surpasses Google’s TPU v7.
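Those memory figures matter because large-model inference is typically bandwidth-bound during token generation. A rough roofline-style sketch, again taking the vendor numbers at face value and assuming a model whose weights fill HBM and are streamed once per decode step with no reuse from the 272MB of SRAM, shows why the 7TB/s is as consequential as the peak FLOPS:

```python
# Rough roofline sketch from the cited figures (assumptions: vendor peak numbers,
# weights read once per generated token, no reuse from the 272MB on-chip SRAM).
PEAK_FP4_FLOPS = 10e15    # claimed peak FP4 throughput, FLOP/s
HBM_BANDWIDTH = 7e12      # claimed HBM3e bandwidth, bytes/s
HBM_CAPACITY = 216e9      # claimed HBM3e capacity, bytes

# Arithmetic intensity needed to saturate the FP4 units rather than the memory bus.
ridge_point = PEAK_FP4_FLOPS / HBM_BANDWIDTH          # ~1,430 FLOPs per byte

# If a model's weights fill HBM and must be streamed once per decode step,
# bandwidth sets a floor on per-token latency regardless of available FLOPS.
min_decode_latency_s = HBM_CAPACITY / HBM_BANDWIDTH   # ~31 ms per token
max_tokens_per_s = 1 / min_decode_latency_s           # ~32 tokens/s per chip

print(f"ridge point ~{ridge_point:.0f} FLOPs/byte, "
      f"decode floor ~{min_decode_latency_s * 1e3:.0f} ms/token "
      f"(~{max_tokens_per_s:.0f} tokens/s)")
```

On that reading, the large on-chip SRAM and the 2.8TB/s expansion link are about raising effective arithmetic intensity and spreading the bandwidth floor across chips, not just padding the spec sheet.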
Microsoft is framing Maia 200 not only as a performance play but as an economics one. The company claims the Maia 200-based inference system is the most efficient it has deployed to date and that “performance per dollar” has improved by roughly 30% compared with the latest hardware in its fleet. Each Maia 200 server houses four chips and uses Ethernet rather than InfiniBand for interconnect — a noteworthy choice given InfiniBand’s association with Nvidia after its Mellanox acquisition.
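The per-chip figures can be rolled up to a server-level picture, and the performance-per-dollar claim restated as a cost figure. The sketch below is an interpretation of Microsoft's stated numbers, not disclosed system specifications:

```python
# Server-level aggregates and what a 30% performance-per-dollar gain implies.
# Per-chip figures are Microsoft's; the aggregation and the cost reading are a
# back-of-envelope interpretation, not published system specs.
CHIPS_PER_SERVER = 4
HBM_PER_CHIP_GB = 216
FP4_PFLOPS_PER_CHIP = 10
TDP_PER_CHIP_W = 750

server_hbm_gb = CHIPS_PER_SERVER * HBM_PER_CHIP_GB            # 864 GB of HBM3e
server_fp4_pflops = CHIPS_PER_SERVER * FP4_PFLOPS_PER_CHIP    # 40 PFLOPS peak FP4
accelerator_power_w = CHIPS_PER_SERVER * TDP_PER_CHIP_W       # 3,000 W (chips only)

# A 30% improvement in performance per dollar is equivalent to roughly a 23%
# reduction in cost per unit of inference throughput (1 - 1/1.3).
cost_reduction = 1 - 1 / 1.30

print(f"{server_hbm_gb} GB HBM, {server_fp4_pflops} PFLOPS FP4, "
      f"{accelerator_power_w} W accelerator TDP, "
      f"~{cost_reduction:.0%} lower cost per unit of throughput")
```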
Early deployment is underway in Microsoft’s central US data centres, with broader Azure availability and customer access left unspecified. Microsoft said it is already working on follow-on designs under the Maia 300 name and disclosed an arrangement with OpenAI under which it will use the startup’s chip designs, signalling close coordination between Microsoft’s cloud-stack ambitions and the leading AI model developers.
This release sits squarely in a larger trend: hyperscale cloud providers are designing bespoke accelerators to lower inference costs, diversify away from reliance on Nvidia GPUs, and capture more of the stack’s value as AI workloads proliferate. Google and Amazon have both fielded chips for training and inference in recent quarters; Microsoft’s public benchmark claims position it as an equal contender on raw silicon metrics and cost efficiency, at least on paper.
But technical claims are only one side of the story. The real battleground will be software ecosystems, compiler maturity, model compatibility and customer migration costs. Nvidia’s dominance rests not just on silicon performance but on a vast, entrenched ecosystem—CUDA libraries, optimized frameworks, third-party tooling and a large installed base. For Microsoft, delivering the promised gains to customers will require making model porting and orchestration seamless on Azure, and proving throughput and latency improvements on representative, third-party benchmarks rather than internal tests.
The Maia 200 also illustrates how design choices map to commercial strategy. Microsoft’s use of Ethernet rather than InfiniBand suggests an approach optimised for broad compatibility with existing cloud networking and possibly lower-cost, scalable fabrics. TSMC 3nm sourcing, high-bandwidth HBM3e and sizeable on-chip SRAM demonstrate Microsoft’s willingness to absorb advanced packaging and fabrication complexity to win on inference economics.
If Microsoft can substantiate its performance and cost claims in public benchmarks and in customer deployments, Maia 200 will accelerate a multi-year shift in the cloud compute market: hyperscalers pushing vertically into silicon, squeezing GPU vendors’ margins on cloud contracts, and fragmenting the accelerator landscape. That will force customers, model builders and the chip ecosystem to manage trade-offs between raw performance, portability and supplier lock-in.
