At the Apsara Conference 2024, which kicked off on September 19, the future felt a lot closer. Nearly 300 companies specializing in artificial intelligence—covering everything from computing power to models and applications—came together to showcase close to 1,000 new products. It was a glimpse into what AI has been up to lately, but this year, two trends took the spotlight: multimodal capabilities and embodied intelligence.
Walking into the AI pavilion, the buzz wasn’t about who had the biggest model or the most parameters. That’s last year’s game. Now, multimodal capabilities—where AI understands and processes multiple types of input, like images, video, and sound—are the baseline.
Around 60 companies filled the space with flashy displays, but it was clear: competing solely on model size wasn’t cutting it anymore. Visitors got hands-on with tools that blended audio, video, and text in new ways, making these AI systems feel much more like one-stop solutions.
The frontier tech pavilion, however, was where things got physical. More than 20 robotics companies were showing off their latest creations—human-like robots, dog-like machines, and all kinds of futuristic hardware. One display had visitors mesmerized: bipedal robots performing tricks, doing flips, and withstanding strong kicks without so much as a wobble. The question on everyone’s mind? “Sure, but why does a robot need to be kickproof?”
Despite all the high-tech marvels, the products generating the most buzz were the ones with tangible, everyday applications. For the first time, a delegation of business owners from Yiwu—one of China’s major commercial hubs—showed up to check out the tech. Real-time translation tools, digital human presenters, and AI-powered product image generation had them asking the most practical question of all: “How much money can this help me make?”
That no-nonsense approach struck a chord. For all the advanced robotics and multimodal wizardry, AI’s real value still seems tied to how it can impact daily business operations.
Tongyi Qianwen: Multimodal is the new standard
At the AI pavilion, Alibaba Cloud’s image generation booth wasted no time drawing a crowd. The demo was simple but captivating: participants mimicked a pose from a line drawing displayed on the screen, and moments later, Tongyi Qianwen—a large AI model developed by Alibaba—generated a portrait of them in a tai chi theme.
This image-to-image feature, while impressive, was just the tip of the iceberg. As the conference host, Alibaba Cloud went all in, showcasing a full suite of multimodal tools that covered text-to-image, image-to-video, and combinations of image and audio for video creation.
One standout was a short video generation feature within the Tongyi Qianwen app. Upload a photo of a person or even a pet, add an audio clip, and in just a minute or two, the app churns out a dance video, lip sync clip, or animated emoji. This free tool, powered by Emote Portrait Alive (EMO) technology—launched by Alibaba’s research institute for intelligent computing in early 2024—has quickly become a favorite. Since its debut on April 25, over 100,000 users have jumped on board, producing short videos by the thousands.
Zhipu: AI teachers accessible to everyone
Zhipu AI’s booth turned into a magnet for parents at the conference, and for good reason. In August 2024, the company rolled out a new feature to its consumer app Zhipu Qingyan, adding a GPT-4o-like video call function that allowed users to interact with AI in a more dynamic way. From asking for fashion advice to object recognition or even a casual chat with a friendly AI companion, the app seemed designed to engage on multiple fronts.
But the real game-changer was its AI teaching tool. Parents were quick to test its practicality: simply point the camera at a homework problem, and Zhipu Qingyan doesn’t just solve it—it walks students through the process step-by-step. The AI breaks down each part of the problem, offering a learning experience that could easily rival a human teacher’s attention.
Take the classic “chicken and rabbit in the same cage” math problem. Rather than just delivering the solution, Zhipu Qingyan guided curious children through setting up the equations, explaining how to think about the problem before arriving at the final answer. It was a hands-on teaching moment that had parents nodding in approval.
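For readers unfamiliar with the puzzle, the setup Zhipu Qingyan walks students through boils down to two equations: the heads give one, the feet give the other. A minimal sketch of that reasoning in Python (the `solve` function and the traditional numbers, 35 heads and 94 feet, are illustrative, not from the app):

```python
# Classic "chicken and rabbit in the same cage" puzzle:
#   chickens + rabbits = heads
#   2*chickens + 4*rabbits = feet
def solve(heads: int, feet: int) -> tuple[int, int]:
    # If every animal were a chicken, there would be 2*heads feet;
    # each rabbit accounts for 2 extra feet beyond that.
    rabbits = (feet - 2 * heads) // 2
    chickens = heads - rabbits
    return chickens, rabbits

print(solve(35, 94))  # chickens, rabbits
```

With the traditional numbers, the extra 94 − 70 = 24 feet must come from rabbits at 2 extra feet each, giving 12 rabbits and 23 chickens—exactly the step-by-step setup the AI tutor narrates.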
Shengshu’s Vidu AI: From stills to Shinkai-style videos
Shengshu Technology, a spinoff from Tsinghua University, made waves at the conference with its video generation model Vidu AI. Launched in April 2024, Vidu AI quickly went viral, earning itself the nickname “China’s version of Sora.” One of the trickiest challenges in video generation is keeping faces consistent across frames, a problem even OpenAI’s Sora struggles with. Vidu tackled it head-on with a new feature rolled out on September 11.
This update allows users to lock a subject’s appearance and style based on a reference image, ensuring a consistent look throughout the video. During the demo, one visitor uploaded a still from Makoto Shinkai’s movie Suzume, and in seconds, Vidu AI generated a dreamy autumn park scene, complete with the film’s heroine. The precision and smooth transitions left the crowd impressed.
Vast’s Tripo: A glimpse into the future of 3D generation
While Vast might not be a household name in China, it has deep roots in the AI world. Founded by former members of MiniMax and SenseTime, the company has focused on international markets from the get-go. In China, its core clientele is B2B, but overseas, Vast’s ventures have already found solid footing.
At the heart of its offerings is Tripo, a 3D generation model known for both its speed and user-friendly design. With just a text or image input, Tripo can whip up a 3D prototype in as little as eight seconds. It integrates seamlessly with all major 3D editing software and is primed for 3D printing, making it a versatile tool for designers and engineers alike.
In January 2024, Vast made waves by launching Tripo 1.0, boasting tens of billions of parameters—a significant achievement in the 3D modeling space where data is often sparse. By September 19, Tripo 2.0 was unveiled, upping the ante with the ability to generate not only 3D shapes and textures but also handle complex tasks like physically based rendering (PBR) in mere seconds.
Yinfeng: AI-generated music goes viral
In July 2024, a song with the unusual title Huan Wo Ma Sheng Bi—which translates loosely to “give me back my mom’s nose”—took Weibo by storm. The track, released by social media personality Qin Xinyu after a botched cosmetic surgery, wasn’t the product of a professional music team. In fact, Qin didn’t compose it at all. Instead, it came from Yinfeng, an AI music generation platform that had been live for less than two months.
What sets Yinfeng apart is its ability to generate longer, more cohesive tracks—something that’s no small feat in the world of AI-generated music. According to the team, one of the biggest hurdles is maintaining a consistent style throughout a piece, especially when it stretches beyond a minute. Most AI tools struggle to keep the music coherent from start to finish.
But Yinfeng managed to crack that code. It can generate tracks up to four minutes long while preserving a relatively uniform style throughout. Users just input lyrics, pick a genre from the platform’s extensive music and vocal libraries, and let the AI do the rest.
Despite its capabilities, Yinfeng is still most often used to create music for short videos—a popular format in the age of TikTok and Weibo, where quick, catchy clips dominate.
HiDream.ai: AI-generated images tailored for e-commerce
At HiDream.ai’s booth, the buzz was palpable as waves of Yiwu merchants stopped to see what the platform could do for their businesses. Founded by Mei Tao, former vice president of JD Explore Academy, HiDream.ai was built with e-commerce at its heart. While its expertise lies in AI image generation, the platform goes beyond simple automation. Its product, named Zhixiang, functions like a full-fledged product image studio, covering everything from staging to shooting to post-production.
For e-commerce merchants, HiDream.ai simplifies the process of creating polished, high-quality product images. It offers remarkable flexibility: merchants can set backgrounds, adjust lighting, and even select models. When generating images for clothing, for instance, users can fine-tune everything from the model’s pose and gender to skin tone and ethnicity—all with just a few clicks.
The result? A streamlined process that puts professional-level product photography within reach for businesses of any size.
Galbot: The laidback robot clerk
Galbot G1, developed by Galbot, gave conference-goers a glimpse of the future with its demo in an unmanned store scenario.
In this setup, customers placed orders using a tablet, and Galbot G1 dutifully headed to the shelves to fetch the requested items. However, its performance wasn’t quite ready for prime time: retrieving a simple iced tea took nearly a minute, raising questions about whether the robot could keep up in a fast-paced commercial setting.
But unmanned stores are just one potential application for Galbot G1. The robot also showed off its versatility, picking up randomly placed items like bottled water and umbrellas, opening cabinet doors, and even folding laundry. According to onsite staff, these robots could be commercialized by the end of 2024, but for now, they remain a work in progress.
Qingbao: Lifelike robots ready for factory floors
Working alongside humanoid robots might soon be part of the daily grind. At Qingbao’s booth, a lineup of eerily lifelike robots, striking new poses every few seconds, was on display. Their hyper-realistic eye movements and blank expressions left some attendees feeling unsettled—cue the “uncanny valley” effect. But these machines weren’t built for companionship or entertainment. They are headed straight for the factory floor.
In most factories today, robots are limited to robotic arms handling repetitive tasks, as fully integrated humanoid robots are still costly. But Qingbao’s humanoid models are already shaking things up, taking on quality checks and parts distribution. One customer shared that they opted for these robots because they “wanted the production line to feel warmer.”
Yet, for most clients, the real appeal isn’t warmth—it’s the bottom line. According to Qingbao’s staff, these robots could slash labor costs by as much as 20% annually, making them an attractive option for companies looking to streamline operations.
Coocaa: Cloud TV clings to AI lifeline
As TV sets continue to lose ground to smart mobile devices, cloud television manufacturers have had to scramble to stay relevant. But in 2024, Coocaa found a lifeline in AI.
With its AI-powered operating system, Coocaa lets users search for TV shows, movies, and even online content using voice commands. Much like a smart assistant, it also offers personalized recommendations based on viewing preferences. Ask which shows feature Chen Daoming as an emperor, and you’ll get instant results like Joy of Life, King’s War, and Kangxi Dynasty.
What’s more, all the actor photos and drama stills in the search results are generated by AI, tailored to match the user’s preferences.
Since launching the AI-powered OS, Coocaa has seen a notable uptick in voice interaction usage, all while managing to keep technical costs under control—rising by no more than 10%.
Alibaba Cloud: Auto-generated subtitles for videos
A well-produced TV series should be able to subtitle itself in multiple languages, right? That’s no longer just a wishful thought—it’s a reality, thanks to Alibaba Cloud’s video team.
Subtitling a show in foreign languages used to be a labor-intensive process: first transcribing the Chinese subtitles, then translating them, and finally editing them back into the video. But now, with algorithms developed by Alibaba’s Tongyi Lab, all you need to do is upload the video file, and the system takes care of the rest. For instance, in the case of Empresses in the Palace, the AI can automatically generate English subtitles—no manual editing required.
Liepin’s Doris: AI is sending you your next job offer
The first wave of digital human interviewers is already on the job. Doris, one of Liepin’s flagship products, is an AI-powered interviewer capable of conducting over 400 interviews in just 24 hours.
Right now, Liepin’s AI interviews feature preset questions alongside an intelligent Q&A system. The AI analyzes an applicant’s resume, picking up details like job changes or average tenure, and adjusts its follow-up questions accordingly.
But not everyone’s sold on the experience. Some candidates reported feeling more anxious during these AI interviews, pointing to the lack of real-time interaction and facial feedback from the emotionless digital interviewer.
Liepin’s team recommends using AI for initial screenings, but they acknowledge that when it comes to hiring top talent, the human touch still plays an irreplaceable role.
Motiff: Designing user interfaces with a single sentence
Motiff, China’s leading large model developer for user interface design, is making life a lot easier for designers. What used to take at least a week can now be done in seconds—literally.
All it takes is a sentence. Users just input the type of interface they need, list the components, and add a brief description. In about 20 seconds, Motiff generates two UI design drafts, ready for review.
The real magic is in the details. Motiff’s deep understanding of layout means repetitive tasks—like copying and pasting—have been reduced to a single drag-and-drop action, streamlining the entire process.
And plenty else…
Beyond robots and digital humans, AI-powered takes on the classic werewolf game were a frequent sight in the frontier tech pavilion.
Giant Network, for instance, has woven an AI-powered werewolf game into its seasonal events on platforms like Douyin and Bilibili. The non-playable characters (NPCs) in this game are relentless. If a player’s statement has even a hint of a logical flaw, the AI-driven NPCs jump on it without hesitation. The round-the-clock availability of these sharp-witted NPCs has led to a tenfold increase in user engagement during Giant’s seasonal events.
Meanwhile, Baibian Dazhentan, a murder mystery game app that launched back in 2018, has integrated AI-powered gameplay using the Tongyi Qianwen model. Players can now engage in voice conversations with AI-driven NPCs, but there’s a catch: the number of dialogue rounds is limited. If you want more interaction, you’ll need to pay—unlocking a key revenue stream for AI-driven murder mystery games.
However, adding AI isn’t a magic bullet for guaranteed success. According to 36Kr, incorporating large models has bumped up technical costs, and finding the right game scripts has become a delicate balance: too complicated, and the AI can’t keep up—too simple, and players lose interest.
In the end, as AI keeps evolving, the challenge for humans will be to keep pace.
KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Zhou Xinyu for 36Kr.