Over the past few years, Chinese automakers’ launch events have cycled through a succession of technical buzzwords. After “end-to-end” and “vision-language-action (VLA),” the latest phrase reshaping the smart driving lexicon is the “world model.”
Every company seems to have its own definition. Xpeng calls it a “world foundation model.” Nio refers to an “end-to-end world model.” Huawei promotes its “world action model.” Meanwhile, Horizon Robotics, Li Auto, DeepRoute.ai, and Momenta are all developing their own versions.
Yet watching their presentations back to back, it is hard to tell whether they are even describing the same concept. What problem does a world model actually solve, and where does it fit within the overall architecture?
At its broadest, a world model attempts to recreate the physical world inside a virtual one. It helps artificial intelligence understand physical laws, causal relationships, and environmental dynamics, much as humans do. Many scientists and companies see it as a cornerstone in developing “physical AI.” Li Fei-Fei, a professor at Stanford University, has said that spatial intelligence will define the next decade of AI and that world models are essential to achieving it.
While researchers are still refining the theory, China’s automotive sector has already begun branding its own interpretations.
In practice, however, today’s world models differ more in name than in technical substance. Fundamentally, they represent an evolution of simulation tools. By building higher-fidelity virtual environments with finer granularity, richer scenarios, and more degrees of freedom, companies hope to overcome the testing and validation challenges of end-to-end systems. The ultimate goal is to train driving models that behave more like humans and perform better across varied conditions.
In other words, automakers are not constructing digital twins of the physical world but rather more advanced simulators. For now, these world models run in the cloud and are not yet deployed inside vehicles.
Exposing the limits of traditional simulators
In recent years, smart driving systems have shifted from rule-based frameworks to AI-driven architectures. The industry has largely converged on an approach where perception, prediction, and planning are fused into a single neural network. Models are larger, computing power is greater, and executives often claim that these end-to-end systems drive more like humans.
Yet real-world deployment has revealed an unexpected issue: a new over-the-air update based on end-to-end design does not always perform better. In some cases, performance even regresses.
This does not mean the models are worsening. Instead, end-to-end systems make evaluation and regression testing much harder.
Initially, many engineers believed that if a model was trained well enough, it would naturally drive like a human. Early results were promising. But the “black box” nature of such systems brought new challenges. When a model makes an error, engineers struggle to explain why, or to prove that it will not repeat the same mistake.
Model quality now depends not only on data scale but also on how problems are discovered, defined, and validated. Companies soon realized they needed better simulators to assess model performance.
As a result, most leading players are turning to world models as new validation tools. For example, Li Auto proposed a “driving world model” in 2025 that simulates both the ego vehicle and surrounding traffic, acting as a scoring “teacher” for its VLA system. Although Xpeng publicly emphasizes its “world foundation model,” 36Kr has reported that it also uses world models for simulation testing and algorithm evaluation.
Traditional simulators struggle to keep up with these demands. As one R&D engineer explained, “When systems were modular, validation costs were lower. You could test each part separately. Once everything went end-to-end, that became impossible, and the limits of old simulators were exposed.”
In the rule-based era, simulators mainly replayed incidents from road testing or generated scripted edge cases, like pedestrians crossing or cars cutting in. They functioned as magnifying glasses for specific issues. But with end-to-end models, responsibilities are less divisible, edge cases harder to isolate, and closed-loop validation more complex. These gaps led to the rise of world models.
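For contrast with what follows, a rule-era scripted edge case can be pictured as a simple declarative scenario description. The sketch below is purely illustrative; the schema and every field name are hypothetical and do not come from any real simulation tool.

```python
# Hypothetical scripted edge case in the style of rule-era simulators.
# The schema and field names are illustrative assumptions, not taken
# from any automaker's or supplier's actual tooling.

cut_in_scenario = {
    "name": "adjacent_vehicle_cut_in",
    "ego": {"speed_kph": 100, "lane": 2},
    "actor": {
        "type": "sedan",
        "start_lane": 1,
        "trigger_distance_m": 30,   # begin the cut-in 30 m ahead of ego
        "lateral_speed_mps": 1.2,
    },
    # Rule-era validation: explicit pass/fail criteria per scenario.
    "pass_criteria": {"min_gap_m": 5.0, "no_hard_brake": True},
}
```

Each scenario tested one isolated behavior with explicit pass criteria, which is exactly what stops working when responsibilities inside the model are no longer divisible.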
The world model becomes the “teacher”
Chinese automakers’ world models still lag behind Tesla’s, but insiders estimate the gap at less than a year. Tesla does not use the term “world model” but instead refers to its “world simulator,” a concept introduced by Ashok Elluswamy, its vice president of autonomous driving, at last year’s ICCV. Trained on Tesla’s vast proprietary dataset, the simulator predicts the next state of a system given its current state and action, creating a closed loop that mirrors real-world driving.
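In engineering terms, that closed loop amounts to a learned state-transition function: given the current state and the driving model’s action, predict the next state, then repeat. The Python sketch below shows only the loop structure; every class and function name is a hypothetical placeholder, and nothing here reflects Tesla’s actual implementation.

```python
# Minimal sketch of a learned world-simulator rollout. All names are
# hypothetical illustrations of the concept, not any real system's API.

from dataclasses import dataclass

@dataclass
class WorldState:
    """Compact latent description of the scene at one timestep."""
    features: list[float]

class LearnedWorldSimulator:
    """Stand-in for a neural network trained to predict the next state."""
    def predict_next(self, state: WorldState, action: list[float]) -> WorldState:
        # A real system would run a learned model here; this placeholder
        # just copies the state to keep the sketch self-contained.
        return WorldState(features=list(state.features))

class DrivingPolicy:
    """Stand-in for the end-to-end driving model under test."""
    def act(self, state: WorldState) -> list[float]:
        return [0.0, 0.0]  # e.g., [steering, acceleration]

def closed_loop_rollout(sim, policy, initial_state, steps=100):
    """The closed loop the article describes: the policy acts, the
    simulator predicts the resulting next state, and the loop repeats."""
    state = initial_state
    trajectory = [state]
    for _ in range(steps):
        action = policy.act(state)
        state = sim.predict_next(state, action)
        trajectory.append(state)
    return trajectory

sim, policy = LearnedWorldSimulator(), DrivingPolicy()
trajectory = closed_loop_rollout(sim, policy, WorldState(features=[0.0] * 8))
```

The key property is that the simulator’s output feeds back in as the next input, so errors compound exactly as they would on the road rather than being reset by a recorded log.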
According to industry sources, Tesla’s method uses neural networks to approximate the world itself. It minimizes explicit physical modeling, relying instead on probabilistic combinations and neural rendering for stronger generalization.
Many Chinese firms have chosen a more controllable approach. One supplier told 36Kr that Li Auto uses 3D Gaussian reconstruction, a technique increasingly common among Chinese developers.
Regardless of implementation, these models serve the same engineering purpose: they validate and falsify end-to-end systems in the cloud by replaying, modifying, and augmenting real driving scenarios. They test whether vehicle-side models can produce stable and reproducible results. In this way, world models transform “where it failed” and “why it failed” into traceable evidence chains.
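A simplified picture of that reproducibility test: replay one recorded scenario many times with small, seeded perturbations and check whether the driving model’s scores stay within a tolerance. All names, fields, and thresholds in this sketch are illustrative assumptions, not any company’s actual harness.

```python
# Hypothetical sketch of a cloud-side reproducibility check: vary a
# recorded scenario slightly and verify the model's behavior is stable.

import random

def perturb(scenario: dict, seed: int) -> dict:
    """Apply a small, seeded variation to a recorded scenario,
    e.g., nudging the cut-in vehicle's starting offset."""
    rng = random.Random(seed)
    varied = dict(scenario)
    varied["cut_in_offset_m"] = scenario["cut_in_offset_m"] + rng.uniform(-0.5, 0.5)
    return varied

def evaluate(scenario: dict) -> float:
    """Placeholder for running the end-to-end model through the world
    model on this scenario and scoring the resulting plan."""
    return 1.0  # a real harness would return a per-scenario score

def is_stable(scenario: dict, trials: int = 20, tolerance: float = 0.05) -> bool:
    """Stable means near-identical scores across perturbed replays."""
    scores = [evaluate(perturb(scenario, seed)) for seed in range(trials)]
    return max(scores) - min(scores) <= tolerance

base = {"name": "highway_cut_in", "cut_in_offset_m": 12.0}
print("reproducible:", is_stable(base))
```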
In that sense, the world model functions as a “teacher.” In theory, a more capable teacher should produce a stronger student. “As the cloud-based world model becomes more powerful, the end-side model it trains should also become more capable,” one R&D engineer said.
A world model must do two things: digitally reconstruct the physical world and generate predictions about its future states. Its quality depends on how realistically and diversely it can synthesize data. “If an automaker only replays real-world data, that’s not a world model. It’s just a replay system,” a supplier manager said.
Because world models learn from real-world data, input quality directly shapes output fidelity. Mao Jiming, head of product lines at GigaAI, said, “If your dataset quality scores 60 out of 100, the generated data might only reach 55.”
With better models, automakers can simulate scenarios that are rare or difficult to capture in real life, producing synthetic training videos that speed up iteration. “The efficiency gains over traditional retraining are significant,” a supplier engineer said. “Model updates could accelerate by a full generation.”
Still, these benefits remain mostly theoretical. World models are a clear improvement over older simulators but remain far from perfect.
Algorithms not yet ready, and hallucinations persist
These are still early days. One automaker engineer told 36Kr that Chinese-made world models can currently generate 30–60 seconds of video at most. Even within that window, object consistency remains weak, and maintaining spatiotemporal coherence across multiple views is difficult.
At their core, world models are generative systems, which means they can hallucinate. “The hardest challenge is ensuring generated content stays realistic,” a supplier product manager said. “If you generate a person, how do you ensure their movement follows real-world physics?”
When a simulator produces physically impossible scenes, downstream models can learn false correlations, degrading vehicle performance. For instance, if generated vehicles move sideways, the model might assume a car can instantly switch lanes, prompting real-world overreactions.
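A common mitigation, sketched below under the assumption of simple kinematic limits, is to filter generated trajectories for physical plausibility before they reach training data. The thresholds here are made-up round numbers, not values from any automaker’s pipeline.

```python
# Illustrative plausibility filter for generated vehicle trajectories.
# The kinematic limit and timestep are assumed round numbers for the
# sketch, not values from any real pipeline.

MAX_LATERAL_SPEED_MPS = 4.0   # cars do not translate sideways faster than this
DT = 0.1                      # seconds between trajectory samples

def is_physically_plausible(trajectory: list[tuple[float, float]]) -> bool:
    """Reject trajectories whose frame-to-frame lateral motion implies
    the 'instant lane switch' failure mode described above."""
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        lateral_speed = abs(y1 - y0) / DT
        if lateral_speed > MAX_LATERAL_SPEED_MPS:
            return False
    return True

# A generated car that jumps 3 m sideways in one 0.1 s step is discarded.
teleporting_car = [(0.0, 0.0), (3.0, 3.0)]
print(is_physically_plausible(teleporting_car))  # False
```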
If a simulator cannot model causal relationships, such as how rain affects braking or how glare distorts sensor readings, its edge cases become fictional, and optimization efforts are wasted.
Some see data and computing power as the main constraints. But Xia Zhongpu, former head of Li Auto’s end-to-end driving development, sides with Yann LeCun: the real limitation is algorithmic. Self-supervised learning in vision has yet to mature the way it has in language.
Language models scale rapidly because text carries dense semantic meaning; each word encodes information. Visual data is far sparser. For driving, only a small fraction of pixels influence decisions. “Today’s algorithms can’t yet extract enough decision-relevant information from images,” Xia said. “A picture has millions of pixels, but maybe a few dozen matter for driving.”
Without breakthroughs in world model algorithms, debates about data sufficiency or compute capacity are secondary. Because the underlying technology remains experimental, automakers’ investments are still exploratory. Even some executives admit uncertainty about the next step.
If world models eventually become powerful enough, and hardware catches up, they could, in theory, run directly on vehicles. For now, though, Chinese automakers treat them mainly as simulation tools. Their real-time decision-making capabilities remain limited.
That explains the current disconnect: everyone talks about world models, but users notice little difference. Most systems still live in the cloud, serving as training supervisors rather than on-road decision-makers.
“Deploying a world model on the end side is the hardest part,” Xia said.
So far, no company has applied a world model directly to vehicles. Xia added that if large models can eventually capture the physical world, predict its evolution, and influence it beneficially, not only autonomous driving but robotics as a whole could be transformed.
For now, however, the world model remains more teacher than driver, waiting in the cloud for its turn behind the wheel.
KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Xiao Man for 36Kr.
