In the volatile landscape of generative AI, the distinction between a viral sensation and a sustainable business has never been more pronounced. While Silicon Valley remains fixated on the raw scaling power of models like OpenAI’s Sora, Chinese entrepreneurs are charting a more pragmatic course. Mei Tao, the founder of ZhiXiang Future (VAST) and a former Microsoft and JD.com executive, argues that the future of video generation lies not in consumer toys, but in deep enterprise integration. This shift reflects a maturing industry that is moving beyond the 'showcase' phase toward a focus on verifiable commercial utility.
The recent suspension of Sora’s public-facing momentum serves as a cautionary tale for the industry. Mei suggests that without a natural traffic gateway like Google or ByteDance, standalone consumer video tools struggle with low retention and high compute costs. VAST’s strategy deliberately avoids the 'nomadic' retail user, who drifts from one tool to the next, and focuses instead on professional creators and small-to-medium enterprises. With over 500 million RMB in recent funding and more than 100 million RMB in annual revenue, the company is making the case that enterprise (B2B) services offer a more stable path to a self-sustaining business.
At the heart of this transition is the concept of the 'World Model.' Rather than simply simulating visual aesthetics, Mei defines a true world model as a unified architecture capable of reasoning about and modeling physical reality across all modalities, including vision, sensor data, and mechanical action. By moving away from the standard Diffusion Transformer (DiT) architecture toward more integrated frameworks, startups aim to lower training costs while increasing the precision of output, which is essential for high-stakes industrial applications.
Perhaps the most significant frontier for these video models is 'Embodied AI,' or robotics. Video generation is increasingly viewed as the 'foundation' for training autonomous machines. High-precision synthetic video data allows robots to learn complex maneuvers, such as millimeter-level object manipulation, without the prohibitive time and cost of real-world physical training. This intersection of video synthesis and robotics is where much of the next decade of AI value is expected to be unlocked.
Success in this sector requires a radical departure from the traditional 'selling tokens' business model. Mei advocates for a 'pay-for-results' approach, where AI providers share in the actual revenue generated by their clients, such as e-commerce conversions or marketing ROI. This alignment of interests forces AI companies to move beyond technical perfectionism and toward solving specific industry pain points. In the global race for AI supremacy, China’s ability to marry its massive supply chain with sophisticated multimodal models may become its most enduring competitive advantage.
