At the Galaxea WDC 2026 in Beijing’s Yizhuang economic zone, the conversation among China’s robotics elite has shifted from hardware specifications to a more fundamental bottleneck: the scarcity of high-quality physical interaction data. While Large Language Models (LLMs) have matured on a diet of trillions of internet tokens, the 'embodied AI' sector, which seeks to put those brains into robotic bodies, is finding that the physical world is much harder to scrape than the web. Industry leaders now estimate that China possesses only about one million hours of high-quality training data for robots, a rounding error compared with the vast datasets used by models like GPT-5.
Unlike digital-first generative AI, embodied intelligence requires data that links vision, language, and physical action, the three modalities behind vision-language-action (VLA) models. Gao Jiyang, CEO of Galaxea, argues that the current industry debate between VLA and 'World Models' is a false dichotomy: both require the transformation of multi-modal data into tokens. The real challenge lies in the four dimensions of robot learning: action, object, scene, and embodiment. To master these, developers are currently prioritizing real-world 'Human-Centric' and 'Robot-Centric' data over simulation, which has yet to prove its fidelity at scale.
This data quest comes with a steep price tag. Collecting human-motion data costs between 50 and 100 RMB per hour, while robot-specific data can reach 250 RMB per hour. For a startup to reach the critical milestone of one million hours, that means a capital outlay of 100 million to 200 million RMB, equivalent to an average of 100 to 200 RMB per hour. However, Gao suggests this is a bargain compared to the hundreds of millions of dollars spent on compute. The real waste, he warns, is not the cost of collection but the 'training waste' generated by feeding low-quality data into expensive GPU clusters.
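The arithmetic behind those figures can be checked with a short Python sketch. The per-hour rates and the one-million-hour target come from the numbers quoted above; the split between human-motion and robot-specific hours is not reported, so the `robot_share` parameter and both function names are hypothetical, introduced here only for illustration.

```python
# Rates quoted in the article, in RMB per hour of collected data.
HUMAN_RATE_RMB = (50, 100)   # low and high end for human-motion data
ROBOT_RATE_RMB = 250         # upper figure quoted for robot-specific data
HOURS_TARGET = 1_000_000     # the one-million-hour milestone


def collection_cost(hours, robot_share, human_rate):
    """Total cost in RMB when `robot_share` of the hours are robot-specific.

    `robot_share` is an assumed parameter; the article does not give a mix.
    """
    if not 0.0 <= robot_share <= 1.0:
        raise ValueError("robot_share must be between 0 and 1")
    robot_hours = hours * robot_share
    human_hours = hours - robot_hours
    return human_hours * human_rate + robot_hours * ROBOT_RATE_RMB


def robot_share_for_budget(budget, hours, human_rate):
    """Share of robot-specific hours that makes the total equal `budget`."""
    blended_rate = budget / hours
    return (blended_rate - human_rate) / (ROBOT_RATE_RMB - human_rate)


if __name__ == "__main__":
    # Bounds: all human-motion data at the cheapest rate vs. all robot data.
    print(collection_cost(HOURS_TARGET, 0.0, HUMAN_RATE_RMB[0]))
    print(collection_cost(HOURS_TARGET, 1.0, HUMAN_RATE_RMB[0]))
    # Mix implied by the quoted 100 million RMB budget at 50 RMB/hour.
    print(robot_share_for_budget(100_000_000, HOURS_TARGET, HUMAN_RATE_RMB[0]))
```

Under these assumptions, a million hours of purely human-motion data would cost 50 million to 100 million RMB and a million hours of purely robot-specific data 250 million RMB, so the quoted 100 million to 200 million RMB range is consistent with a mix of the two.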
Fragmentation remains a significant barrier to the 'emergence' of robotic intelligence. Currently, data exists in silos, with different manufacturers using proprietary robot architectures that make data sharing nearly impossible. Industry veterans like Li Ke, CEO of SpeechOcean, suggest that it will take another three to five years and at least ten million hours of refined data before robots reach the cognitive and physical fluidity of a seven-year-old child. Until then, the industry is looking toward consumer electronics, namely smartphones and AR glasses, as potential passive 'data inlets' to accelerate the collection of human-world interactions.
