What will define AI? AReaL head Yi Wu points to reinforcement learning

Written by 36Kr English. 11 min read.

Wu Yi, head of the AReaL project. Photo source: Wu Yi.
His work on reinforcement learning and embodied agents is part research, part startup, and all about learning by doing.

Whether in academic research or collaborations with companies such as Ant Group, Yi Wu encourages his team to keep an entrepreneurial mindset: move quickly, iterate often, and avoid fearing failure.

An assistant professor at Tsinghua University’s Institute for Interdisciplinary Information Sciences and head of the AReaL project, Wu studies reinforcement learning algorithms and applications of artificial intelligence. In May, his team and Ant Research jointly open-sourced AReaL-lite, described by the researchers as the first asynchronous reinforcement learning training framework designed to improve training efficiency and reduce GPU waste. The claim has not been independently verified.

As a young tech leader, Wu emphasizes learning through trial and error. He resists the idea that a lack of resources is an acceptable reason for stalled progress, saying that building something new often requires creating the resources along the way.

That philosophy surfaced at the Inclusion Conference on the Bund in September, where Wu argued that teams should release products as soon as they work at a basic level so they can learn from market feedback. The goal, he said, is not to wait for a perfect launch but to identify problems early and refine the product.

His approach is rooted in earlier entrepreneurial experience. In 2023, his team founded Prosocial Intelligence, an agentic AI company that later evolved into AReaL.

Wu is informally grouped with Jianyu Chen, Yang Gao, and Huazhe Xu as part of the “Berkeley Four,” a nickname referencing their shared academic roots in AI research at UC Berkeley. All four studied in the US. Wu was the first to return to China, and he encouraged the others to follow.

At Tsinghua University, Wu often reminds students that innovation requires venturing into unfamiliar territory. He argues that AI breakthroughs benefit more from long-term focus than from trying to chase every potential direction.

He also holds a specific view of AI’s future, where intelligent agents interpret loosely defined human intentions, complete long-horizon tasks, and eventually move from digital spaces into the physical world. At this year’s World Artificial Intelligence Conference (WAIC), he described a scenario where a person could verbally ask a robot to tidy a room, and the robot would spend hours finishing the task.

Reinforcement learning, his area of research, is central to that goal. He notes that the technique enables AI systems to learn through interaction and exploration, in contrast to supervised learning, which depends on continuous human instruction and struggles with long, open-ended tasks.

Despite his rigorous academic work, Wu’s social media presence is lighter. On Xiaohongshu, he posts research updates, responds to questions about careers in AI, and occasionally ranks his favorite bubble tea flavors.

Wu spoke with 36Kr about his take on AI’s future, entrepreneurship, and building efficient teams.

The following transcript has been edited and consolidated for brevity and clarity.

36Kr: AI has yet to achieve mass adoption. Where do you see the next major opportunities, and how will AI show up in people’s daily lives?

Wu Yi (WY): I think enabling AI to complete long-horizon tasks is an irreversible trend. Meanwhile, the commands humans give AI will become increasingly simple and vague.

It’s hard to predict the exact product form, but one thing is certain: we’re moving from users actively driving AI to AI proactively anticipating what users want and completing it.

This pattern already appeared during the mobile internet era. With search engines, users had to look for information themselves. Then came platforms like Zhihu, and later ByteDance’s products, whose algorithms pushed desired content directly to users.

So I think, eventually, people will forget what a “search box” is as AI increasingly caters to human laziness. Ultimately, a whole new kind of product will emerge, and it will mark a generational opportunity.

36Kr: At WAIC, you mentioned that once AI agents gain physical embodiment, they become embodied agents capable of interacting with the physical world. What can they do?

WY: A smart embodied agent can infer user intentions from fuzzy instructions, complete tasks accurately, and even anticipate unspoken needs.

For example, if you tell your home robot that you can’t find your power bank, it may reason and act on its own, searching based on your habits and its memory of where you last used it.

36Kr: Can embodied agents collaborate in teams? How do multiple embodied agents work together?

WY: They can cooperate to handle more complex tasks.

Take robot soccer, for instance. Just like human players, when robots encounter a familiar situation, a quick scan of the field tells them what formation to take.

If you have several intelligent agents, the next step is defining how they communicate. In the digital world, one approach is a master agent that coordinates many smaller ones. You can use different models or even a single model structured like a planner directing many executors. That’s the idea behind a multi-agent system.

I often cite the pairing of Claude Code and Gemini as an example: Claude Code excels at programming but has a short context window and high cost, while Gemini handles large amounts of content but lacks reasoning power. Let Gemini first read an entire codebase and extract the key parts, then let Claude Code write the actual code.

It’s like pairing a smart but frail thinker with a strong but dull worker. The combination makes a highly efficient multi-agent system.
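
That pairing is essentially a planner-executor pattern. Below is a minimal Python sketch of the pattern, for illustration only: the two helper functions are hypothetical stand-ins for real model APIs, not calls to any actual library.

```python
# Minimal planner-executor sketch of the multi-agent pattern Wu describes.
# Both helpers are hypothetical stand-ins; wire in real model clients.

def call_long_context_model(prompt: str) -> str:
    """Hypothetical: a large-context model (the 'Gemini' role)."""
    raise NotImplementedError("plug in your own model client")

def call_coding_model(prompt: str) -> str:
    """Hypothetical: a strong coding model (the 'Claude Code' role)."""
    raise NotImplementedError("plug in your own model client")

def solve_coding_task(codebase: str, task: str) -> str:
    # Step 1: the long-context model reads the whole codebase and
    # extracts only the parts relevant to the task.
    digest = call_long_context_model(
        "Summarize the files and interfaces relevant to this task.\n"
        f"Task: {task}\n\nCodebase:\n{codebase}"
    )
    # Step 2: the costly short-context coding model works from the
    # compact digest instead of the full repository.
    return call_coding_model(
        "Using this codebase summary, write code for the task.\n"
        f"Summary:\n{digest}\n\nTask: {task}"
    )
```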

In embodied scenarios, such as cleaning a space, robots “communicate” to plan who sweeps and who mops, working together to finish the job.

36Kr: How do we move from digital agents to physical embodied agents?

WY: Transitioning from the digital to the physical world requires multimodal data, moving training environments from computers into reality.

In the digital world, tools are mostly bits, and they execute reliably. In the physical world, using tools, whether grabbing a bag or opening a door, still involves high error rates. Embodied intelligence therefore develops more slowly and with greater complexity.

That said, if we look far enough ahead, once the physical world has been sufficiently digitized, the core technical challenges for all types of agents will converge.

If one day a machine can reliably operate almost any physical tool, building an embodied agent that can function autonomously for an entire day will be technically no different from a digital agent.

36Kr: You’ve interned at ByteDance, founded Prosocial Intelligence, and later collaborated with large companies to advance reinforcement learning tech. Looking back, what have you learned?

WY: At Prosocial, we made plenty of mistakes in early hiring. Many employees treated it like a regular job, not a startup, and didn’t grasp what entrepreneurship really means. Objectively, the team wasn’t fully ready to adopt a mindset suited to running a startup in the AI era. Still, everyone was learning, and stumbling was inevitable.

Now I really dislike hearing that something can’t be done because we don’t have the resources. Startups rarely have abundant resources, and people create them while pursuing their goals.

Entrepreneurial teams need that spark of innovation and the right level of conviction.

Innovation isn’t about placing bets. Startups must believe deeply in what they are doing. We don’t have enough resources to hedge across multiple tracks hoping one will win. That only breeds mediocrity.

Entrepreneurial spirit means believing something is right even if you fail to achieve it yourself. Someday, someone will.

36Kr: Among the “Berkeley Four,” you were the first to return, and you encouraged the others to follow. Why?

WY: In August 2018, I finished my internship at ByteDance in Beijing. Though I earned my PhD at UC Berkeley, my experience at ByteDance had a big influence on me.

Since 2016, I’d interned intermittently in various ByteDance teams and was among the first members of its AI lab, witnessing the end of China’s mobile internet boom. By August 2018, I knew I wanted to return to China.

Partly, I saw enormous opportunity in China’s development. Partly, I felt a clear ceiling for Chinese professionals in the US. Unless you become fully American, you face that question: if you want to make a real impact, do you want to be Chinese or American? I realized I didn’t want to compromise by becoming American.

Many people say that they aren’t ready yet, and that they will wait until they are. Some Chinese scholars in the US say they will develop there for a few more years, then return. But my view is: if you’re sure you’ll do something someday, the best time was yesterday. The second best is today. So I decided to come back.

A month later, I turned down ByteDance’s return offer. In October 2018, I joined Tsinghua University as a faculty member. Then I shared my thoughts with my fellow Berkeley classmates, telling them to seize the opportunity, and indeed, some were persuaded.

Looking back, the timing really was ideal. We early returnees enjoyed the dividends of that wave.

Wu Yi (left) with Stuart Russell, one of his mentors at UC Berkeley. Photo source: Wu Yi.

36Kr: Your career seems full of pivots: changing your PhD field to reinforcement learning, starting a company before your peers, then collaborating with big firms while others began founding startups. In a sense, that sounds like reinforcement learning itself.

WY: Exactly. I’ve been “learning by reinforcement” all along, hitting every pitfall as quickly as possible. Honestly, learning through trial and error teaches you more deeply and generalizes better than supervised fine-tuning.

Building products works the same way. I often say: once you make something, release it immediately. In the AI era, even great products need exposure. Get them out there, gather feedback, and iterate fast. Even negative feedback shows you where the pitfalls are.

Of course, with high-quality supervised fine-tuning data, reinforcement learning becomes more efficient. Negative rewards are costly, so I share my experiences to help others learn faster.

36Kr: Pioneering ideas rarely come with ready examples. How do you convince yourself to make bold decisions?

WY: I have a method to help me make decisions quickly: I flip a coin. Before it even lands, I usually already know my answer. I’m always the one who flips first.

36Kr: What matters more to you: doing what you want, or the spotlight? If you could achieve something great but stay anonymous, would you accept that?

WY: Yes. I’ve thought about it: if I built a startup from zero to one and later, as it scaled from one to 100, I was no longer the most visible leader, would I be fine with that? The answer I arrived at is yes.

At that inflection point, I’d likely bring in a professional manager and move on to the next project. Managing hundreds of people isn’t what I enjoy most.

That said, I’m reflecting on whether such idealism might limit me. Maybe when that moment comes, I’ll choose differently. But if you ask me now, I’d still choose to keep creating zero-to-one projects.

36Kr: Why is reinforcement learning so effective for AI training?

WY: Because it lets AI learn from real interaction. Supervised learning or fine-tuning means humans constantly tell AI what to do, but possibilities are infinite, and humans can’t give instructions for ten hours straight.

Human instructions also differ from how AI “thinks.” When AIs simply memorize, they don’t truly understand, and thus generalize poorly.

Reinforcement learning encourages active interaction with environments, even teaching AI to ask questions when uncertain. It cultivates self-driven exploration, something only the technique can achieve.

36Kr: You’ve said good reinforcement learning depends on three hard-to-perfect elements: the reward model, search and exploration, and prompts. How do you tackle that challenge?

WY: I now think the most important factor is prompting, specifically creating large numbers of high-quality prompts.

Here’s an analogy: a teacher tutoring a student in math. Prompts are the teacher’s problems, search and exploration are the student’s problem-solving process, and the reward model is the teacher’s feedback.

Choosing the right difficulty is crucial. Make it too advanced and the student gives up, but if it’s too easy, they learn nothing. The same applies to reinforcement learning: data volume alone doesn’t help. Appropriateness does.
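
One common way to operationalize “appropriate difficulty” is to filter training prompts by the model’s current pass rate, keeping only problems the model sometimes solves. The sketch below illustrates that general idea and is not AReaL’s actual pipeline; rollout_and_grade is a hypothetical stand-in for sampling and scoring a model attempt.

```python
def rollout_and_grade(prompt: str) -> bool:
    """Hypothetical: sample one model answer, return True if correct."""
    raise NotImplementedError("plug in your model and grader")

def filter_by_difficulty(prompts, n_samples=8, low=0.2, high=0.8):
    """Keep prompts the model solves sometimes, but not always.

    Pass rates near 0 mean the problem is too hard (no reward signal);
    near 1, too easy (nothing left to learn). This mirrors the teacher
    choosing problems of the right difficulty.
    """
    kept = []
    for prompt in prompts:
        passes = sum(rollout_and_grade(prompt) for _ in range(n_samples))
        rate = passes / n_samples
        if low <= rate <= high:
            kept.append(prompt)
    return kept
```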

36Kr: How does reinforcement learning relate to embodied agents?

WY: The relationship goes two ways. One is locomotion, where reinforcement learning is already mature and doesn’t require pre-training.

The other involves long-horizon reasoning and planning, usually combined with large pre-trained models. That area only became popular after ChatGPT.

These two aspects form a spectrum from high-frequency control for short tasks to abstract reasoning for complex ones.

Traditional reinforcement learning for control doesn’t need pre-training; think quadruped robots that can run and jump. Tiny neural networks trained in simulation can directly control real robots.

In such tasks, reinforcement learning trains the network to output control signals for each joint, enabling motion over durations as short as seconds, not hours. By contrast, models like ChatGPT and DeepSeek’s R1 use reinforcement learning after pre-training to enhance reasoning.
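
To make “tiny” concrete, here is a PyTorch sketch of the kind of control policy he describes: a two-layer MLP mapping proprioceptive observations to per-joint targets, run as one forward pass per control step. The dimensions are illustrative assumptions, not taken from any specific robot.

```python
import torch
import torch.nn as nn

# Illustrative sizes: roughly 48 proprioceptive inputs (joint angles,
# velocities, IMU readings) and 12 joint targets for a quadruped.
OBS_DIM, ACT_DIM = 48, 12

policy = nn.Sequential(
    nn.Linear(OBS_DIM, 64),
    nn.Tanh(),
    nn.Linear(64, ACT_DIM),
    nn.Tanh(),  # squash to [-1, 1]; scale to joint limits downstream
)

# At deployment, each control step is a single forward pass, typically
# run at tens to hundreds of Hz. No pre-training involved.
obs = torch.zeros(OBS_DIM)  # placeholder observation
with torch.no_grad():
    action = policy(obs)
print(action.shape)  # torch.Size([12])
```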

Large models that employ reinforcement learning can think for minutes or hours, use common sense, break complex problems into subtasks, and call tools. But so far, this success remains in the digital realm, not the physical.

In between lies the vision-language-action (VLA) model, which is often discussed in embodied intelligence research.

36Kr: How do we move from VLA to fully embodied intelligence?

WY: VLA applies pre-training to the physical world. Researchers gather massive data to pre-train models that not only complete short tasks like running or jumping but also generalize to minute-long activities like folding towels and pouring water.

To reach longer-horizon tasks like cooking or cleaning, robots must perform for hours, combining fine control with abstract reasoning and human interaction, just like digital agents, except in the physical world.

So I see embodied agents as systems that merge locomotion or VLA as the “small brain” controlling motion, with language models enhanced using reinforcement learning as the “big brain.”
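
Read as an architecture, that split is a simple hierarchy: the “big brain” decomposes a vague instruction into short subtasks, and the “small brain” executes each one. The Python sketch below illustrates the division; plan_with_llm and execute_with_vla are hypothetical stand-ins, not real APIs.

```python
def plan_with_llm(instruction: str, scene: str) -> list[str]:
    """Hypothetical 'big brain': an RL-enhanced language model that
    breaks a vague instruction into concrete, minute-scale subtasks."""
    raise NotImplementedError

def execute_with_vla(subtask: str) -> bool:
    """Hypothetical 'small brain': a VLA or locomotion policy that
    performs one short-horizon skill and reports success."""
    raise NotImplementedError

def run_embodied_agent(instruction: str, scene: str) -> None:
    # Upper layer: long-horizon reasoning with prior knowledge.
    queue = plan_with_llm(instruction, scene)  # e.g., "tidy the room"
    while queue:
        subtask = queue.pop(0)
        # Lower layer: instinctive, high-frequency control.
        if not execute_with_vla(subtask):
            # On failure, replan the remaining work from the current
            # scene; hours-long tasks need this closed feedback loop.
            queue = plan_with_llm(instruction, scene)
```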

Unlike digital agents, physical agents still get less attention. Most people focus on hardware aspects such as gripping accuracy, object sorting, and so on. But altering the physical world is always harder.

Given my focus, I’m working on stabilizing long-duration reasoning before combining it with physical control.

36Kr: How will reinforcement learning and VLA eventually integrate?

WY: Our current plan is a layered structure. As I said at WAIC, the higher the layer, the more human knowledge it needs; the lower the layer, the less.

Lower layers handle instinctive reactions like grabbing a cup based on tactile feedback or simple physics. Upper layers need prior knowledge. So there’s a natural division between digital and physical agents.

I don’t think VLA will be the final paradigm, because it isn’t large enough in scale to become a fully capable agent. We’ll perfect the digital agent format first while others explore the embodied side, then merge them when the timing is right.

36Kr: At the September conference, you said your AReaL team seeks an entirely new, minimalist organizational model. Why?

WY: In the internet era, building a product required four or five people, typically including a frontend developer, a backend developer, and a product manager.

In the AI era, one person might be sufficient to handle all that.

Previously, small companies outsourced many tasks. Now, AI can streamline not just internal work but also outsourcing itself.

If a team can run heavily on AI, its capabilities will naturally scale outward, because if AI can serve us, it can serve others, too. That’s a new product opportunity.

Our AReaL team has only six members, plus some external support. Even counting everyone, we could make it leaner still. I want the team to stay minimalist, and that’s why it has always been small.

36Kr: But large companies have complex structures. How do you achieve that simplicity within one?

WY: First, a modern agent-focused team must itself use many agents every day.

Second, I combine algorithm and infrastructure teams into a single full-stack unit.

Traditional structures separate algorithms from systems and add data collection teams, creating a client-contractor dynamic: algorithm teams are the “clients,” while engineers become the contractors doing the “dirty work.” That division kills innovation. Once you’re used to being the client, you avoid the grunt work; once you’re the contractor, you lose creative space.

OpenAI didn’t magically invent new algorithms; it simply perfected the details.

So to excel in infrastructure and data, you need to dig deep. With that groundwork, algorithms can shine. That’s why algorithms and infrastructure must be co-designed and co-evolved. A small, highly capable team can collectively fulfill this.

Large organizations, say with 200 people, can’t avoid silos. Limited communication bandwidth leads to rigid roles and inefficiency. So a compact, full-stack setup and high innovation go hand in hand. Forget the 200-person org chart. In the AI era, it’s all about going from zero to one, so take bold, radical approaches and build anew.

KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Fu Chong for 36Kr.
