The global artificial intelligence race is rapidly pivoting from digital assistants to embodied intelligence, where machines must navigate and interact with the physical world. Huang Tiejun, Chairman of the Beijing Academy of Artificial Intelligence (BAAI), argues that the key to this transition lies in the development of world models—the internal cognitive frameworks that allow machines to understand physics, causality, and human social norms. This conceptual leap aims to provide AI with an intuitive grasp of reality that goes far beyond the capabilities of current large language models.
While many robotics firms currently use the Vision-Language-Action (VLA) framework to solve specific tasks like sorting or lifting, Huang views such systems as specialized solutions rather than a general-purpose brain. He believes that while VLA is sufficient for immediate industrial applications, a true world model is necessary for robots to operate in highly complex or hazardous environments. For instance, a world model would allow a robot to judge whether its own material composition can withstand a fire, enabling autonomous decision-making in disaster recovery zones.
This shift in AI architecture necessitates a fundamental change in how models are trained, moving away from the static, text-heavy datasets of the previous decade toward real-time, interactive data. Huang suggests that data is becoming less of a library and more of an evolutionary experience. Future AI will likely learn through first-person sensory input provided by wearables and smart sensors, capturing human-environment interactions as they happen rather than relying on historical archives.
Furthermore, the strategic importance of code data is coming into sharper focus, with leading technology firms prioritizing logical datasets over natural language. Huang notes that because society’s critical infrastructure—from power grids to financial systems—is built on code, mastering this digital architecture is a prerequisite for any agent intended to manage a modern economy. The logical rigor of programming languages provides a more stable foundation for reasoning than the ambiguities of human speech.
Ultimately, Huang predicts that while a comprehensive model of all scientific and biological knowledge remains a distant goal, a world model possessing human-level common sense could emerge within the next two to three years. This timeline suggests that the gap between generative AI and fully autonomous physical robots is narrowing faster than many observers anticipate. The evolution of these models will depend heavily on the efficiency of data collection and the ability to maintain low power consumption in highly responsive systems.
