“If we’re still only focused on iterating video generation itself in 2026, that won’t be enough,” said Song Jiaming, chief scientist at Luma AI.
Founded in 2021, Luma AI has emerged as one of the standout startups in video generation. According to 36Kr, the company recently completed a USD 900 million Series C round at a USD 4 billion valuation. The round was led by Humain, a subsidiary of Saudi Arabia’s Public Investment Fund, with AMD Ventures, Andreessen Horowitz (a16z), Amplify Partners, and Matrix Partners all making sizable follow-on investments.
While most video generation startups remain focused on extending clip length or improving resolution, Song offered a contrarian view: the next breakthrough will come not from visual fidelity but from improving a model’s reasoning and understanding of the physical world.
He illustrated this with a filmmaking analogy. If a director realizes after a shoot that an overhead shot is missing, a traditional video generation model might create one based on a prompt, but the result would likely fail to align with surrounding footage. A reasoning-based model, however, can infer spatial layout, character positioning, and camera logic from existing clips, producing a seamless and physically coherent continuation.
This reasoning capability, Song said, is what makes video models viable for professional use in film, advertising, and media production, where commercial value depends on realism and continuity.
“The frenzy sparked by Sora 2 doesn’t mean video models are ready for the consumer market,” he said. “After the novelty wears off, most users stop paying.”
The key to strengthening reasoning in video models, Song said, lies in training them across language, image, and video data to form a unified multimodal model. Integrating multiple modalities provides a richer dataset and pushes these models to evolve from “generation” toward “understanding.”
He noted that this progression already occurred in image generation. In 2024, developers debated multimodal architecture. By 2025, most image models had unified text-to-image and image-editing capabilities within a single framework. Competition has since shifted from architecture to data quality. Song believes video generation will follow the same path in 2026.
From generating 3D models to videos
Predicting what comes next, both technically and commercially, has been central to Luma’s trajectory. The company began with 3D generation when it launched in 2021, then pivoted to video generation in late 2023 after identifying greater market potential.
In June 2024, Luma introduced Dream Machine, a video generation model designed for beginners in artificial intelligence and design. Without any marketing spend, it gained one million users in four days, praised for its cinematic camera work and visual quality. Many viewed it as one of the first consumer-focused AI models capable of rivaling Sora.
But Luma didn’t linger in the consumer spotlight. Throughout 2025, it shifted focus toward professional users with clearer monetization prospects in film, advertising, and content production. In September, it launched Ray3, a video reasoning model.
In an interview with 36Kr, Song revealed that Ray3 might be Luma’s last traditional video generation model, as the company now sees multimodal unification as its direction. That pivot demands substantially greater computing power and capital.
One of Luma’s new investors, Humain, is building a two-gigawatt supercomputing cluster in Saudi Arabia called Project Halo, one of the world’s largest such projects. Luma will be a core customer, using the infrastructure to train its next-generation multimodal world model and enhance video reasoning capabilities.
From its 3D roots to consumer success with Dream Machine and now its professional expansion, Luma’s strategy has consistently built outward from its foundation in visual generation.

The following transcript has been edited and consolidated for brevity and clarity.
36Kr: You’ve said that Ray3 might be Luma AI’s last traditional text-to-video model. What did you mean?
Song Jiaming (SJ): I believe future models will no longer treat images, videos, audio, and text as separate modalities. They will exist within one unified framework: a multimodal model. This integration will give video models stronger reasoning abilities, enabling them to make coherent decisions and automatically detect inconsistencies.
Language models are powerful because of their in-context learning and zero-shot reasoning. I think we’ll soon see those same traits in vision and video, not just longer or prettier clips.
36Kr: Can you give an example of how reasoning differs from traditional video generation?
SJ: Imagine a film set where multiple cameras shoot the same scene from different angles. After wrapping, the director realizes an overhead shot is missing and asks AI to create one.
A conventional model might generate a top-down view, but the characters’ positions or props would be off. A reasoning model first understands the footage, identifying object correspondences and inferring spatial layout, then generates a physically accurate continuation that connects seamlessly with prior shots.
36Kr: You’ve said that 2025 will be the last year of divergence for video models. Why?
SJ: Look at image generation. Last year, teams built separate pipelines for separate tasks. This year, most merged those into unified multimodal systems. Few teams are now attempting architectures radically different from those of models like GPT-4o or Nano Banana. Once architectures converge, competition shifts to data quality and quantity. I expect the same for video.
36Kr: What role does Ray3 play in Luma’s transition?
SJ: Ray3 is a milestone. It allowed us to strengthen the infrastructure for training, inference, and data handling, which I believe is more important than algorithms alone.
Algorithms haven’t changed drastically in years. We’re still largely building on autoregressive and diffusion frameworks from five years ago. The real progress has come from scaling: growing model and dataset sizes.
36Kr: How do reasoning and multimodality relate to artificial general intelligence (AGI)?
SJ: I’m fairly strict about what qualifies as AGI. People say some coding models now outperform most programmers, but if that’s enough, then calculators would count as AGI. To me, if humans can do something that AI can’t, we’re not there yet.
AI still struggles with real-world reasoning across areas like autonomous driving, robotics, embodied intelligence, and long-term physical planning. Multimodal video models bring us closer to AGI by grounding understanding in vision, motion, and time.
36Kr: What can we learn from viral models like Sora 2 and Nano Banana?
SJ: One key takeaway is the importance of designing for user-driven experiences: identifying use cases that highlight a model’s unique traits and making them shareable.
36Kr: When Dream Machine launched, it targeted creative beginners. Why the later shift to professional clients?
SJ: It’s more of a gradual shift than a sudden pivot. The logic is similar to what happened with language models: chatbots drew consumer buzz last year, but this year, the focus is on coding assistants and autonomous agents, which offer clearer business value.
For consumers, chatbots don’t vary enough to justify high subscription fees. For developers, tools that double productivity justify enterprise spending. The same applies to video.
36Kr: Does Sora 2’s popularity signal a takeoff in consumer video generation?
SJ: Not necessarily. OpenAI’s scale means it must pursue consumer growth, but that doesn’t make B2C the right path for everyone. Olivia Moore from a16z shared data showing Sora 2’s 30-day retention at just 1%, falling below 1% by 60 days. TikTok maintains around 30%. Sora’s cultural impact doesn’t equal a sustainable market.
36Kr: What are the challenges facing consumer use cases for video generation?
SJ: Technically, it’s feasible. The issue is business logic. Platforms like Douyin, YouTube, and Instagram revolve around social connection. If everyone consumes AI-generated videos made only for themselves, we lose the shared reference points that form the foundation of social engagement.
36Kr: How fierce has B2B competition been?
SJ: It looks fierce, but the US market is relatively mature and selective. Strict compliance requirements limit the number of qualified providers to a few players, including Google, us, and a handful of startups.
US clients are also used to subscription and API-based models, making monetization more predictable.
36Kr: Luma started with 3D generation. How did that compare commercially?
SJ: We experimented with 3D generation, but it wasn’t scalable. Use cases were limited, and quality lagged. Adoption mostly came from gaming and digital human projects, which have small client pools. Large companies like Tencent prefer in-house solutions. 3D data is also scarcer, and the extended reality ecosystem remains too immature to depend on generative AI for content. So, 3D was an early exploration, while video is where commercial traction emerged.
36Kr: Does video generation have a moat?
SJ: Not yet. Most developers are exploring similar frameworks. There haven’t been revolutionary new architectures in years; differentiation lies mostly in execution speed and engineering quality.
Video models deal with petabytes of data, hundreds or thousands of times more than text models handle. The real challenge is managing that scale.
36Kr: How should we judge which video model is leading?
SJ: There’s no universal metric. Unlike text models, video architectures and training methods are still evolving. I don’t claim to be the best. What matters is performance in real production workflows. Take HDR support, for example: on that front, we’re currently the only one.
36Kr: Which has more potential: diffusion or autoregressive models?
SJ: I wouldn’t choose one, as it depends on data and architecture, not the paradigm itself. The real question is whether we can solve core user problems, and that’s what determines commercial success.
36Kr: Will video and multimodal models consolidate as language models did?
SJ: Almost certainly. In the language space, only a few players have sustained momentum. Others pivoted, were acquired, or faded. Video and multimodal models are part of the same foundation model ecosystem, so they will likely become equally concentrated.
In China, I wouldn’t start a foundation model company from scratch. Large tech firms already dominate in capital, talent, and compute. The US, with its dollar funds, clear exit channels, and M&A culture, offers more room for startups.
36Kr: Ray3 came after Ray2 within just seven months. What has the team focused on since?
SJ: We’ve explored world models, but unification remains our priority. There are quick wins, like integrating third-party models, but our long-term strategy matters more.
36Kr: What drove investor confidence in this Series C round?
SJ: Several factors: our performance, iteration speed, and roadmap validation. Investors are now searching for the next foundation model player, and preferably one capable of scaling globally. Many USD-denominated funds are betting on long-term potential rather than short-term returns.
36Kr: How will you spend the new capital?
SJ: Compute remains our largest cost, both for training and inference. The rest will go toward expanding our engineering, systems, and research teams.
36Kr: What does your team look like today, and what do you value in new hires?
SJ: We have about 130 employees, with 30–40% in R&D and the rest in product, business, marketing, and operations. We don’t rely on traditional product managers. Product thinking is distributed: engineers understand user needs, and operators translate feedback into technical goals.
When hiring, we avoid competing for established stars. Think of it like a football club’s youth system. We prefer discovering talent early, and I look for three traits: strong coding skills, fast learning ability, and intrinsic curiosity for the long term.
KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Fu Chong and Zhou Xinyu for 36Kr.

