Why are robots still not part of our everyday lives? Why do sales remain sluggish despite years of hype?
These are the questions Han Fengtao posed in a 10,000-word essay published on June 30, 2023, while he was still CTO at Rokae Robotics. In the piece, Han pinpointed the robot industry’s fundamental issue: poor usability due to a lack of intelligence. His proposed solution? Large models.
Months later, Han left Rokae to launch his second startup—this time centered on embodied intelligence.
He joined forces with Gao Yang, a scientist trained at UC Berkeley with deep experience in computer vision and reinforcement learning. Gao now serves as an assistant professor and PhD advisor at the Institute for Interdisciplinary Information Sciences at Tsinghua University.
One brought years of hardware experience and a belief that AI could finally make machines smart. The other, a lifelong researcher in AI, wanted to see software embedded into the physical world. Together, they started Spirit AI.
Founded by two deeply technical minds, Spirit AI’s approach is focused, minimalist, and product-driven.
That ethos is evident the moment you enter the company’s office in Beijing’s Zhongguancun Software Park. There’s no flashy reception area. Instead, what stands out is a humanoid robot—built in-house—undergoing quiet tests near the front desk.
From a small founding team to a nearly 60-person company today, Spirit AI has already outgrown its current space. Meeting rooms are booked up, and even on the day of our interview, finding an empty one proved difficult.
“This is my second time founding a company,” Han told 36Kr. “We spend money more efficiently than most in the industry.”
Still, when it comes to hiring, Spirit AI doesn’t cut corners.
The company recently brought on Xie Junyuan, a former tech lead at ByteDance, to head its embodied intelligence division. For Han, developing embodied intelligence takes top-tier talent—and that doesn’t come cheap.
Han also spoke candidly about how some investors remain wary of the space. Though interest is rising, the market is still divided on whether embodied intelligence is a viable path to smarter machines and commercial-scale deployment.
“By the end of this year, I’m sure they will come around,” he said.
Han described his first startup as “a hammer looking for nails”—a business driven more by academic impulse than by actual market demand. This time, he’s tackling a specific challenge: making robots more generalizable, smarter, and genuinely useful.
He’s also learned the value of focus. His biggest operational lesson: most waste comes from redoing work. The antidote is deliberate management, careful planning, and aligning every team toward a single goal.
While other startups rush to push humanoid robots into factories or retail stores, Spirit AI is taking a different route.
It builds both software and hardware, but its emphasis is on the model, the core engine of embodied intelligence. Han has long held that "if a robotics company doesn't develop its own embodied intelligence model, it can't know what good hardware even looks like."
“Our energy allocation between model training and hardware development is probably 80 to 20,” he said. “If an embodied intelligence company wants to reach the level of GPT-3.5, the bulk of its focus should be on model capabilities.”
Embodied intelligence represents one of the final frontiers in AI. It’s humanity’s other moonshot for achieving artificial general intelligence (AGI)—teaching robots to sense, reason, and act in the real world, the way humans do.
But because of those lofty ambitions, general-purpose embodied intelligence struggled for years to attract believers. Like large language models (LLMs) in their early days, it was seen as a high-risk, long-horizon pursuit—until October 2024.
That’s when Physical Intelligence (PI), a US-based startup, shifted the conversation.
At a product launch that month, PI showed off a robot that could fold clothes with impressive precision—a task that had long confounded roboticists. For the first time, the idea of generalization in physical environments felt tangible. The demo was a turning point.
The impact was immediate. Competing approaches began coalescing around PI’s unified, end-to-end large model architecture. The funding environment followed. Embodied intelligence, once niche, was now being heralded as the next LLM-sized opportunity.
In China, a wave of capital poured into startups like Robotera, Galaxea AI, X Square Robot, and Tars—all of which closed eight-figure RMB rounds. Spirit AI joined the surge, raising RMB 528 million (USD 73.9 million) in a pre-Series A funding round. Backers included Prosperity7 Ventures (under Aramco), China Merchants Venture, GF Xinde Investment, Eminence Ventures, Oriental Fortune Capital, and T-Capital.
In March, Spirit AI released a demo of its Spirit v1 model folding clothes in a single continuous take—effectively replicating PI’s benchmark. It marked Spirit AI’s most significant technical breakthrough to date.
Han acknowledges the path to a general-purpose embodied system is a long one. But he’s not worried about bottlenecks—at least, not yet.
“Like LLMs, embodied intelligence also follows a scaling law,” Han told 36Kr. “Model capability is mainly a function of data quality and quantity. And even with limited high-quality data today, companies like PI have already achieved solid performance. As the quality and volume of data increase, model capabilities will continue to improve.”
That said, Han stressed that commercial traction doesn’t depend on full generalization. He believes companies can start monetizing within two to three years by applying embodied intelligence in narrowly defined use cases—offering a path to real revenue and continued investor backing.
Throughout the interview—and in his earlier longform post on Zhihu—Han referenced Liu Cixin’s short story The Truth of the Universe. In it, primitive humans stare at the stars long enough to trigger a warning from an advanced alien civilization, which fears that once a species can perceive cosmic truth, it will inevitably decode it.
Han sees a parallel in today’s robotics landscape. “After more than half a century of evolution, the robot industry might finally be entering its own moment of gazing at the stars,” he wrote toward the end of his essay.
The following transcript has been edited and condensed for brevity and clarity.
36Kr: Investors still seem divided on embodied intelligence. ZhenFund’s Dai Yusen says it’s too early for general-purpose humanoid robots. A Coatue report suggested there might never be a “ChatGPT moment” for embodied intelligence. What’s your take?
Han Fengtao (HF): I think by the end of this year, they will change their minds.
Look at how investors once dismissed China’s chances in LLMs. Many wouldn’t touch Chinese LLM startups. But once DeepSeek took off, they suddenly didn’t care about valuations—they just wanted in.
First, these investors haven't yet seen Chinese companies deliver real results. Second, technically speaking, embodied intelligence is already feasible; what's missing is better product definition, clearer user targeting, and more refined development work.
We’re still far from building a general-purpose humanoid robot, but in two to three years, embodied intelligence can definitely be commercialized in vertical niches—even without humanoid form.
36Kr: What are some of these niche use cases where you see near-term commercialization?
HF: Take clothing folding, for example. Factories and laundromats need to fold massive volumes of clothes. I dealt with this at my last company. Industrial robots couldn’t handle the task—but embodied intelligence models can. It doesn’t matter whether the execution hardware is humanoid. What matters is solving the problem.
36Kr: So embodied intelligence and humanoid robots are not the same thing?
HF: Right. Humanoid robots are defined by their form. Embodied intelligence is about model capability; it doesn't care what the robot looks like.
Under the embodied intelligence paradigm, robots can take many shapes.
36Kr: Are you worried about embodied intelligence hitting a plateau?
HF: Not really. Folding clothes is already a complex task. If an embodied AI can do that well, it can likely handle many other tasks, too. We’ll unlock more and more capabilities going forward.
And again, scaling law applies here, just like in LLMs. Today, there’s still very little high-quality data in the field, yet we’re already seeing promising results. As more and better data becomes available, model capabilities will keep rising.
36Kr: Compared to LLM companies, embodied intelligence firms have raised less money. Is it just that building embodied AI models doesn’t burn as much cash?
HF: Funding amounts are tied closely to how far along the sector is. LLMs have had a seven- to eight-year head start. Embodied intelligence is barely one year in—of course we’ve raised less. But to be honest, what’s been raised so far isn’t enough. We’ll definitely need more rounds down the line.
That said, building embodied AI models probably doesn’t cost as much as building LLMs.
For one, we can learn from all the pitfalls and best practices of the LLM era—things like engineering workflows and talent development. That alone saves a ton. On top of that, embodied models require less compute, since their scale is smaller. Ours, for example, is under ten billion parameters.
At Spirit AI, we’re running a lean operation. This is my second startup, and I’d say our spending efficiency is one of the best in the business. Our first model cost very little to build, and it still performed well.
36Kr: US embodied intelligence firms have higher valuations and bigger war chests. Does that mean Chinese companies need to win on cost performance—like DeepSeek did?
HF: Absolutely. From a global standpoint, Chinese embodied intelligence startups must deliver superior cost efficiency. That means leveraging our strengths—solid engineering teams, strong supply chains, and refined execution.
For Spirit AI, our current funding position is healthy. It’s enough to support fast-paced iterations.
And long-term, I believe embodied intelligence will thrive in China. Our hardware is cheaper, our supply chain is more robust, our labor costs for data collection are lower. We also have a wider variety of real-world application scenarios. Right now, we might still be evenly matched with US peers. But if we scale to 10,000 or even 100,000 data-collecting workers or build massive data factories using crowdsourcing, the US simply can’t compete.
36Kr: Why did you choose to focus on embodied intelligence for your second startup after leaving Rokae Robotics?
HF: My first company focused on industrial robots. I had a hammer and was looking for nails—I studied robot control, so we built mechanical arms. But this time, I wanted to respond to actual market needs. Real opportunities come from understanding what the market truly demands.
What has made this second venture possible is progress in AI. Language models, image models, video generation—they have all thrived in virtual environments. The next step is naturally the physical world. That’s what’s driving this. Spirit AI’s core focus is building large models for embodied intelligence.
36Kr: Will Spirit AI develop its own robot hardware?
HF: Absolutely. If you’re making an integrated product, software alone isn’t enough. Once the industry matures, a software-only approach becomes very difficult to commercialize. No revenue means no way to sustain the business.
Also, from a technical standpoint, our model training currently relies on video data and information collected from our own data factory. But in the future, the best training data will come from the robots we ship and how they perform in real-world settings. That kind of data helps the model keep evolving.
Without our own end-user hardware, we won’t have access to that data. The autonomous driving space has already shown us the pitfalls of being software-only.
So both commercially and technically, any top-tier embodied intelligence company must be full-stack—software and hardware.
On the flip side, if a robotics company doesn’t build embodied intelligence models, it won’t even know what good hardware should look like. It might make decent parts, but it won’t know how to design a great whole machine. It won’t know which hardware designs and iterations best support embodied intelligence.
36Kr: What do you make of Unitree Robotics’ recent surge in popularity? Many people still view them as a hardware-focused company.
HF: Unitree going viral has definitely brought some momentum to hardware-heavy companies. But at the end of the day, robots need to become general-purpose and capable—and that doesn’t come from hardware alone.
Once the buzz dies down, people will start asking: “What can this robot actually do?”
And they’ll realize that without breakthroughs in the “brain,” most robots are still clumsy and limited.
36Kr: When it comes to the physical body of the robot, what is Spirit AI best at—hands or feet?
HF: Our strength lies in the upper body. Both our model and hardware are focused on manipulation tasks—primarily robotic arms and dexterous hands.
Our long-term goal is to enable 10% of people to own their own robots within the next decade. We want robots to help with work—or do the work entirely. And that work is largely done with the upper body. Locomotion affects performance, sure, but it’s not the core of the task.
36Kr: What’s the biggest bottleneck right now in embodied intelligence development?
HF: Talent shortage.
Gao Yang once made a great point: within the next three years, the biggest constraint on progress in embodied intelligence will be talent.
Technology and know-how are evolving rapidly. You need top-tier people to keep pace and push the field forward. This is still an uncharted space. Even though the general technical direction is becoming clearer, there are still countless unknowns—small but crucial details that need solving. That requires smart, creative minds at the frontier.
If you look at deep learning’s evolution since 2012, all the major breakthroughs came from PhD students and researchers working on the cutting edge. Their work drove industrialization forward. That’s why we’re laser-focused on hiring outstanding doctoral graduates from China’s top computer science and AI programs.
36Kr: There are still competing technical approaches in embodied AI. Some support end-to-end modeling, while others prefer separating functions like perception, language, and control. What’s Spirit AI’s approach?
HF: We don't divide things up. We go end-to-end with a single vision-language-action (VLA) model.
Because the model needs to run on actual robots, which have limited on-device compute, we keep the language component relatively small—around three to seven billion parameters. We mostly use pretrained, open-source models for that.
Model size really depends on how deep you want the robot’s understanding of the environment to go. For simple tasks, a small model on the device plus action modules, totaling under ten billion parameters, is enough.
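To make the architecture concrete, here is a minimal sketch of the pattern Han describes: image tokens and instruction tokens fused by a compact language backbone, with a small action head decoding joint commands. Spirit AI has not published its implementation, so every module, dimension, and name below is an illustrative stand-in.

```python
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    """Toy vision-language-action policy: fuse image and instruction
    tokens, then decode a joint-space action from the fused sequence.
    All components are hypothetical stand-ins, not Spirit AI's design."""

    def __init__(self, vision_dim=768, lang_dim=512, action_dim=14):
        super().__init__()
        # Stand-in for a pretrained vision backbone (e.g., a ViT).
        self.vision_proj = nn.Linear(vision_dim, lang_dim)
        # Stand-in for the small (roughly 3B-7B, per Han) language model.
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=lang_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Action head: fused representation -> one action vector
        # (e.g., target joint positions for arms and dexterous hands).
        self.action_head = nn.Linear(lang_dim, action_dim)

    def forward(self, image_tokens, text_tokens):
        fused = self.language_model(
            torch.cat([self.vision_proj(image_tokens), text_tokens], dim=1)
        )
        return self.action_head(fused[:, -1])  # action for the current step

policy = VLAPolicy()
act = policy(torch.randn(1, 16, 768), torch.randn(1, 8, 512))
print(act.shape)  # torch.Size([1, 14])
```

The point of the end-to-end pattern is that perception, language, and control share one set of weights, so a single training signal shapes all three at once.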
36Kr: More companies are now talking about end-to-end systems. Is this becoming a consensus?
HF: It is. The industry has more or less converged in this direction.
The turning point was last October, when PI’s new model made folding clothes practically viable. That was a major technical leap—a real milestone.
Before PI, most demos in the space just involved basic grabbing. Folding clothes is a long-horizon, continuous task with complex object manipulation. Embodied intelligence hadn’t cracked that—until PI did, using end-to-end VLA training.
Fun fact: PI’s two co-founders were advised by the same professor as Gao. They are basically academic siblings.
36Kr: How important is object recognition accuracy for you?
HF: It depends on the use case. Generally speaking, embodied intelligence systems don’t need the same level of precision as autonomous driving. For example, if a robot misidentifies a tissue and grabs the wrong one, it can just try again. That’s a far cry from a car crash.
Of course, we’re still working to improve recognition accuracy.
36Kr: How do you train your embodied intelligence models?
HF: Our training pipeline mirrors the one used for LLMs.
First, we pretrain on massive amounts of video data from the internet—YouTube, iQiyi, and others. These aren’t high-precision, but they offer great diversity. Then we fine-tune on high-quality data from our own teleoperation factory, followed by reinforcement learning to improve execution success rates.
The stages map directly onto the LLM workflow: pretraining, supervised fine-tuning (SFT), and a reinforcement learning stage analogous to reinforcement learning from human feedback (RLHF).
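For orientation, those three stages can be laid out as a single training loop. This is a structural sketch only; the loss functions, data iterators, and environment interface are hypothetical placeholders, not Spirit AI's pipeline.

```python
def train_embodied_model(policy, web_videos, teleop_demos, env,
                         video_loss, imitation_loss, rl_loss, rl_episodes=1000):
    """Three-stage recipe as described in the interview. The policy's
    optimize() method and all loss callables are hypothetical."""

    # Stage 1 (pretraining): huge, low-precision internet video gives
    # the model a broad prior over how objects and scenes behave.
    for batch in web_videos:
        policy.optimize(video_loss(policy, batch))

    # Stage 2 (supervised fine-tuning): small, high-quality teleoperation
    # demos supervise the action head directly, i.e., behavior cloning.
    for batch in teleop_demos:
        policy.optimize(imitation_loss(policy, batch))

    # Stage 3 (reinforcement learning): rollouts on the robot or in
    # simulation sharpen task success rates, the robotics analogue of
    # the RLHF stage in LLM training.
    for _ in range(rl_episodes):
        trajectory = env.rollout(policy)
        policy.optimize(rl_loss(policy, trajectory))
```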
36Kr: What kinds of data do you use for pretraining versus fine-tuning?
HF: Pretraining uses broad, low-precision video data—anything that gives the model a general sense of how the world works. Fine-tuning uses high-quality teleoperation data that we collect in our own data factory.
Think of it like learning to swim. Watching videos teaches you the basics—that’s pretraining. Then a coach guides you hands-on—that’s fine-tuning.
36Kr: Some companies use simulation data. How useful is that?
HF: Every data type has pros and cons.
Video offers massive volume but low precision. Teleoperation is highly precise but limited in quantity. Simulations are easy to generate but not very accurate.
We use simulation for pretraining. It works well for rigid objects—good for training grasping behaviors. But it falls short with soft objects, like clothes. Simulating how fabric deforms and moves isn’t realistic enough yet.
36Kr: Some companies like Agibot have open-sourced their datasets. Are those usable for you?
HF: They are useful for pretraining—but not for fine-tuning.
At this stage, data quality is tightly coupled with hardware. If another company collects data using different equipment—different specs, frequencies, sensor types—that data won’t work directly on our hardware.
That’s another reason why I say the best embodied intelligence companies must do both software and hardware. The two must be co-optimized.
36Kr: Some say humanoid robots are being deployed in car factories. Is that a good application?
HF: Commercially? Not really.
Automotive original equipment manufacturers (OEMs) aren’t ideal clients. Their production lines are already heavily automated. The tasks that still rely on human workers tend to be complex—and not easily handled by robots for now.
Automotive parts factories—battery plants, for instance—make more sense. They have larger workforces and simpler workflows.
Embodied intelligence and humanoid robots are both nascent. Trying to combine two immature technologies and apply them to a very complex task is just too ambitious. Tesla can do it because it owns the factory—but others following suit might be missing the point.
36Kr: Some investors suggest embodied robots should first go to government use, then business, then consumer—based on technical difficulty. Do you agree?
HF: That’s a fair assessment.
Government projects often aim to support the ecosystem and can choose safer, more controlled environments. Business use cases—like factories—are also relatively structured. But consumer environments are unpredictable—every home is different. So it’s much harder to make robots work there.
That said, consumer demand is the largest, followed by business, then government. Commercialization strategies need to balance feasibility with market size.
36Kr: Once embodied AI models mature, will we finally get general-purpose humanoid robots? What are the remaining hardware bottlenecks?
HF: On the hardware side, challenges include dexterous hands, dynamic bipedal locomotion, resistance to external disturbances, battery life, and sensory tools like force sensors or electronic skin.
None of these can be solved by AI alone. These are fundamental science and engineering challenges—things like materials and motor power density.
Humanoid robots will hit roadblocks in those areas. Embodied intelligence will be bottlenecked by model capability.
KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Wang Fangyu for 36Kr.