“A PR stunt”: X Square Robot CEO says humanoid robots don’t belong in factories, calls for focus on generalization

Written by 36Kr English · 8 mins read

The CEO argues that the future of robotics depends on developing models that can adapt to dynamic, unstructured environments.

“If you’re just aiming to follow others, you’re already falling behind. That’s a weak-minded way to build tech.”

Wang Qian has the calm presence of a scholar: soft-spoken, measured, and composed. But when the conversation turns to embodied intelligence, a different side of him emerges: sharp, adamant, and unflinching.

“Starting a company takes real resolve,” Wang said. “If you already have a backup plan on day one, your mindset is flawed.”

Robotics has long been Wang’s professional focus. He earned his bachelor’s and master’s degrees at Tsinghua University, then completed a doctorate at the University of Southern California. He briefly ran a quantitative hedge fund in the US, but stepping away from robotics proved unsettling. “I couldn’t sleep for nights. I regretted not pursuing robotics full time,” he said.

In 2023, he shut down the fund, returned to China, and founded X Square Robot in Shenzhen.

Less than 18 months later, the startup has raised over RMB 1 billion (USD 140 million) across seven funding rounds. On May 12, Meituan reportedly led another investment in the company, described as a nine-figure RMB sum.

China’s embodied artificial intelligence sector began gaining definition around the same time. Nvidia CEO Jensen Huang predicted it would be the next major technological trend. Companies like Galbot and Agibot were also founded in 2023.

At first, X Square drew little attention. But with each funding round, it moved closer to the spotlight.

One unnamed investor told 36Kr that humanoid robotics in China now falls into clear tiers. Unitree Robotics, Agibot, and Galbot sit at the top, each having raised more than RMB 1.5 billion (USD 210 million). With its current funding total, X Square is on the cusp of that group.

As with foundational AI models, the embodied AI landscape in China is polarized. Some investors, like venture capitalist Zhu Xiaohu, question the field’s commercial prospects, even as robots perform flashy demonstrations. Others are making large bets, backing companies in a race toward mass production.

Wang is firmly among the believers.

From the outset, X Square has pursued a specific technical path: an end-to-end vision-language-action (VLA) model. New versions are released every two to three months.

By the time US-based Physical Intelligence (PI) launched its own model, X Square had already committed to the VLA approach, which has since become an industry standard.

While many companies remained focused on basic tasks like pick-and-place, X Square’s WALL-A robot was executing more complex functions such as clothing management, home organization, and cable routing.

Critics argue that general-purpose embodied AI is still premature. Wang disagrees. He believes progress is outpacing expectations and says models with capabilities comparable to GPT-3 could emerge within a year, with real commercial use following one or two years later.

Current deployments remain limited to research, education, and concierge-style roles. But Wang sees these as narrow in scope. “Putting humanoids in factories to do repetitive tasks? That’s just a PR stunt,” he said.

Wang argues that meaningful commercialization depends on improving generalization—the model’s ability to adapt across different tasks and environments.

For now, monetization is not X Square’s priority. About two-thirds of its budget is dedicated to model development and related areas.

“To be blunt, we’re leading the domestic field in embodied AI models,” Wang said. “Investors give preferential treatment to frontrunners. They trust we’ll deliver outsized upside, and want us to stay focused on building a general-purpose model.”

Photo of Wang Qian, founder and CEO of X Square Robot. Photo source: X Square Robot.

The following transcript has been edited and consolidated for brevity and clarity.

36Kr: What technical progress has X Square made over the past six months?

Wang Qian (WQ): We’ve made fast progress, releasing new model versions every two to three months. Earlier, our models produced only actions: multimodal input, unimodal output. But since last October and November, we shifted to “any-to-any” models: multimodal input and output. Now our models generate not just actions, but also language and vision outputs.
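
To make the architectural shift concrete, here is a minimal Python sketch of the difference between an action-only VLA and an any-to-any model. Every name in it is hypothetical; it illustrates the interface Wang describes, not X Square’s actual system.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MultimodalInput:
    images: List[list]            # camera frames (placeholder pixel arrays)
    instruction: str              # natural-language command
    joint_positions: List[float]  # proprioceptive state

@dataclass
class MultimodalOutput:
    actions: List[float]                     # low-level motor targets
    language: Optional[str] = None           # e.g., a spoken status report
    predicted_frames: Optional[list] = None  # imagined future observations

class AnyToAnyVLA:
    """Toy stand-in for an any-to-any model: one backbone, several heads.

    An action-only VLA would return just `actions`; the any-to-any design
    decodes actions, language, and vision from one shared representation.
    """

    def encode(self, x: MultimodalInput) -> List[float]:
        # Real systems tokenize every modality into one sequence for a
        # transformer backbone; here we fake a shared embedding.
        return x.joint_positions + [float(len(x.instruction))]

    def __call__(self, x: MultimodalInput) -> MultimodalOutput:
        h = self.encode(x)
        return MultimodalOutput(
            actions=[v * 0.1 for v in h],            # action head
            language=f"Executing: {x.instruction}",  # language head
            predicted_frames=[],                     # vision head (stubbed)
        )

robot = AnyToAnyVLA()
out = robot(MultimodalInput(images=[], instruction="fold the towel",
                            joint_positions=[0.0, 0.5, -0.2]))
print(out.actions, out.language)
```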

We’ve also developed long chain-of-thought (CoT) reasoning. Around the time of our last two funding rounds, we got CoT working in our full-modality framework.

In March, Google’s Gemini Robotics team published similar results: any-to-any and CoT. Then PI released Pi-0.5, also following this structure. We anticipated this direction early and have kept pace with top global players. Technically, we’re on par with PI and Google.

36Kr: Has the VLA model architecture become the industry standard?

WQ: Yes, especially after PI’s model launch last October. Everyone realized end-to-end (E2E) is the way forward.

Now everyone’s waving the flag, but execution varies wildly. Some companies redefine E2E to suit their needs.

There are two main approaches. One is a two-layer system: a high-level vision-language model for reasoning and planning, and a low-level VLA model for generating actions. The other uses a single unified model. We tried both and found the single-layer version offers a higher performance ceiling.
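
As a rough sketch of the two designs Wang contrasts (assumed interfaces only; none of this is X Square’s code), the two-layer system forces the planner’s reasoning through a language bottleneck on its way to the controller, while the unified model keeps a single set of weights end to end:

```python
class HighLevelVLM:
    """Two-layer design, layer 1: reasons over images and language,
    emits subgoals as text."""
    def plan(self, image, instruction: str) -> list:
        # e.g., "tidy the desk" -> ["pick up cup", "place cup on tray"]
        return [f"step toward: {instruction}"]

class LowLevelVLA:
    """Two-layer design, layer 2: maps one subgoal plus the current
    observation to motor actions."""
    def act(self, image, subgoal: str) -> list:
        return [0.0, 0.1, -0.2]  # placeholder joint deltas

def two_layer_policy(image, instruction: str) -> list:
    planner, controller = HighLevelVLM(), LowLevelVLA()
    return [controller.act(image, g) for g in planner.plan(image, instruction)]

class UnifiedVLA:
    """Single-model design: perception, reasoning, and action share one
    set of weights, so action data can improve reasoning and vice versa."""
    def act(self, image, instruction: str) -> list:
        return [0.0, 0.1, -0.2]  # placeholder joint deltas

print(two_layer_policy(None, "tidy the desk"))
print(UnifiedVLA().act(None, "tidy the desk"))
```

The usual argument for the unified model’s higher ceiling is that no intermediate step is squeezed through a lossy text interface; the trade-off is that a single model is harder to train and debug.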

36Kr: What’s the alternative to E2E solutions?

WQ: Some still use traditional setups: 3D vision for perception plus rule-based control. That’s fine for basic pick-and-place tasks, like those in legacy industrial automation. But it’s not what we’re aiming for. Even Figure AI and Boston Dynamics have moved on from that.

36Kr: If we compared embodied AI models to large language models (LLMs), where is the field now?

WQ: We’re around the GPT-2 stage. GPT-3 showed certain capabilities that only emerge at scale; we haven’t reached them yet, and neither have PI or Google. It’s all governed by scaling laws.
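
For readers unfamiliar with the term, “scaling laws” in the LLM literature are empirical fits of roughly the Chinchilla form below, in which test loss falls smoothly as parameter count N and training tokens D grow (a general result from language modeling, per Hoffmann et al., 2022, not a formula X Square has published for robotics):

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here E is the irreducible loss and A, B, α, β are constants fitted to training runs. Wang’s “GPT-2 stage” comparison is, roughly, a claim that embodied models are still far down this curve in both N and D.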

36Kr: When will commercialization really take off in China?

WQ: If things go well, within a year. Two years if slower. I mean customers actually paying for solutions. Household robots will take more time: probably three to five years.

People tend to overestimate what can be done in the short term and underestimate what can happen in the long term. I think embodied AI will arrive faster than most people expect.

36Kr: Everyone says data is the bottleneck. Do you have enough?

WQ: Data is more of a timeline issue. If you don’t understand embodied AI models well, collecting more data doesn’t help. It might even slow you down. A lot of it will be irrelevant or low quality.

It’s not just about having volume. It’s about knowing what kind of data matters. We’ve focused on quality and targeting. That’s more efficient.

Public datasets often fall short. We primarily rely on our own data collection.

36Kr: Startups seem cautious with spending lately. Are you preparing for a cooldown?

WQ: We’re frugal. We won’t spend where it’s not needed. But to build long-term value, you have to invest. Following open-source models and copying others is not just unambitious; it also won’t get you to general-purpose robots.

A lack of confidence usually reflects a lack of capability. If you truly believe in your abilities, you act accordingly. Why wait for the next boom when you can lead it?

36Kr: How do investors gauge your technical progress: videos or live demos?

WQ: Always live. We’ve insisted on real-time demos from day one. Videos can be faked. Only hands-on interaction shows true performance, especially when investors try to throw robots off-balance or introduce stress conditions.

36Kr: At this valuation, are investors pressuring you to commercialize?

WQ: It depends. Some care more about the model’s long-term potential. Others push for near-term commercialization. Because we’re leading on the technical front, we’ve earned more leeway. Investors want us to pursue meaningful commercialization, not just hit superficial milestones.

36Kr: But you haven’t released your robot hardware yet?

WQ: Actually, we have hardware. It just hasn’t been broadly released. Some units are already being used in service roles. We’ll roll out additional models soon.

36Kr: Is the tech ready for service sector use?

WQ: We’re still running proof-of-concept pilots with seed customers. We’re aiming for full deployment by the end of the year or early next year. But we’re not limiting ourselves to pick-and-place tasks.

Simple tasks don’t test the model’s capabilities. Legacy tech could already handle those. We’re targeting complex, varied, and open-ended scenarios.

36Kr: What kind of margins do you expect once real deployments begin?

WQ: Traditional service robots are task-specific. Ours are general-purpose, so the value depends on what they can do. Early profitability isn’t the goal. We’re refining the product based on real-world usage.

36Kr: Your peers are focused on education and retail concierge. Are those mature markets?

WQ: Those are marketable, but their value is debatable. They mostly serve to reassure investors. They are too small to be the endgame.

They are fine as byproducts along the way, but if you spend too long on them, you lose sight of the real mission.

36Kr: If general-purpose embodied AI is too hard, why not settle for niche commercialization?

WQ: Then why enter this field at all? If you don’t believe in the goal, there’s no point in starting.

36Kr: Some say Figure AI’s BMW factory deployment was overhyped. What’s your take on factory use cases?

WQ: Putting humanoids in factories to do repetitive tasks? That’s just a PR stunt. Current demands on speed and accuracy mean older tech often works better.

Factories are closed, structured environments that aren’t ideal for training generalist models. Embodied AI needs complexity, randomness, and open-ended interaction. That’s where models really grow.

In economic theory, people debate whether supply creates demand or vice versa. In embodied AI, supply clearly creates demand.

36Kr: US peers have higher valuations and deeper pockets. How big is the China-US gap?

WQ: It’s still significant. We closely watch PI, Google, and Tesla.

But we have a real shot at catching up this year or next. China’s follower mentality comes from past habits, but that’s not necessary in embodied AI. We’ve matched PI in many areas and even outperformed them in some metrics.

36Kr: PI open-sourced its Pi-0 model. Will that level the field?

WQ: Six months in, several Chinese firms have tried fine-tuning it. But results haven’t matched PI’s proprietary setup. Cross-platform adaptation remains a huge challenge.

36Kr: What can Pi-0 do commercially?

WQ: On new hardware, performance drops off sharply. That makes it hard to commercialize. PI likely open-sourced it because it couldn’t deploy it independently. They don’t build hardware and depend on partners for integration.

36Kr: Is waiting for open-source models and then following along a bad strategy?

WQ: It may sound practical, but it’s misguided. Embodied AI isn’t like LLMs. You can’t just fine-tune or replicate your way to success. You’ll still hit the same roadblocks.

Worse, your team loses morale. If leadership doesn’t believe in building from the ground up, how can the team?

Innovation demands conviction and creativity. Copying doesn’t cut it.

36Kr: Could embodied AI split into open and closed ecosystems like LLMs?

WQ: It’s not the same. For integrated systems, open source is a flawed model, especially when it comes to commercialization. We’ve seen this play out in drones, autonomous vehicles, and more.

People expect open-source success in embodied AI because of LLMs. But LLMs are software. Embodied AI involves physical hardware and world interaction. That reduces transparency and public engagement.

PI’s Pi-0 is world-class, but it didn’t make a splash. Without real-world interaction, these models remain academic.

And you can’t replicate open-source models exactly. No lab can reproduce another’s environment. Hardware-based data can’t be distilled. That’s a major difference from software.

Open source won’t unlock the market for embodied AI. Not in this domain.

KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Wang Fangyu for 36Kr.
