LLM prices hit rock bottom in China as Alibaba Cloud enters the fray

Written by 36Kr English · 3 mins read

Recent price cuts by Chinese AI players, including Alibaba Cloud, are pushing LLM prices to historic lows.

On May 21, Alibaba Cloud announced a significant price drop for its large models, offering varying levels of free access and price reductions across both open- and closed-source versions. The most notable change was for the Qwen-Long large language model (LLM), which saw a 97% price cut, from RMB 20 (USD 2.7) to just RMB 0.5 (USD 0.06) per million tokens. This move undercut ByteDance’s Doubao model, which had set its price at RMB 0.8 (USD 0.11) per million tokens on May 15.

Only four hours after Alibaba Cloud’s announcement, domestic rival Baidu responded by making its Ernie Speed and Ernie Lite models free to use.

In 2023, LLM price reductions followed the natural trend of training efficiency optimization and economies of scale. For instance, in November 2023, Baidu adjusted its token calculation to match the number of Chinese characters, effectively lowering prices by 20%. Similarly, the inference cost of Baidu’s Wenxin model was reduced to 1% of its original cost.

This year, the price war began abruptly and with significant impact. The catalyst was the release of the DeepSeek V2 model, backed by the quantitative fund High-Flyer, which owned over 10,000 Nvidia A100 GPUs. On May 6, DeepSeek released the V2 version of its large model, boasting 236 billion parameters. It ranked among the top domestic models on both performance and price, offering API pricing at just RMB 1 per million input tokens and RMB 2 per million output tokens.

For comparison, Baidu’s Wenxin 4.0-8K model was charging RMB 120 (USD 16.5) per million tokens for inference at the time.

Following this, Zhipu AI also reduced its price for the GLM-3-Turbo model on May 11, from RMB 5 (USD 0.69) per million tokens to just RMB 1 (USD 0.14) per million tokens.
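To put the gap in concrete terms, here is a quick back-of-the-envelope comparison at the per-million-token prices quoted above. This is an illustrative sketch only: the model labels and the workload size are hypothetical, and the figures are the RMB list prices cited in this article, which may not reflect current pricing.

```python
# Per-million-token list prices quoted in the article (RMB).
PRICES_RMB_PER_M_TOKENS = {
    "Qwen-Long (post-cut)": 0.5,
    "Doubao": 0.8,
    "DeepSeek V2 (input)": 1.0,
    "GLM-3-Turbo (post-cut)": 1.0,
    "Wenxin 4.0-8K (inference)": 120.0,
}

def cost_rmb(model: str, tokens: int) -> float:
    """Cost in RMB to process `tokens` tokens at the listed rate."""
    return PRICES_RMB_PER_M_TOKENS[model] * tokens / 1_000_000

# Cost of a hypothetical 100-million-token workload, cheapest first.
for model, _ in sorted(PRICES_RMB_PER_M_TOKENS.items(), key=lambda kv: kv[1]):
    print(f"{model}: RMB {cost_rmb(model, 100_000_000):,.0f}")
```

At these rates, the same 100-million-token job that costs RMB 50 on post-cut Qwen-Long would cost RMB 12,000 on Wenxin 4.0-8K, a 240-fold difference.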

Aside from the disruption in pricing, 36Kr noted a renewed focus on smaller but more cost-effective models. These smaller models, whose potential has not been fully explored, could be leveraged for strategies in areas like data management and efficiency optimization. For downstream consumers, smaller models may also offer better cost and budgeting benefits.

On April 22, Meta released the Llama-3 open-source model with 70 billion parameters. Although Llama-3’s performance is dwarfed by GPT-4, which has over 20 times the parameters, it marked a significant step in the trend towards smaller, cost-effective models. Following this, Microsoft released the 3.8 billion parameter model Phi-3 Mini, claiming performance comparable to GPT-3.5 and capable of running smoothly on Apple’s A16 chip.

As price becomes a major consideration for downstream customers, pricier large models have become less attractive, prompting market players to adjust their offerings accordingly.

However, despite rock-bottom prices, major companies can still turn a profit. Notably, they tend to position large models as the “storefront,” with the real intention being to sell complementary cloud services. For example, High-Flyer, with its own computing cluster, can generate revenue of up to USD 35.4 per hour per server when the utilization rate of its computing service is at its peak, achieving a gross profit margin of over 70%.

For smaller companies and startups, however, the price war presents a less optimistic outlook. Kai-Fu Lee, CEO of 01.AI, said on May 21 that the company would not partake in the ongoing price war, retaining its current API pricing of RMB 20 (USD 2.7) per million tokens for its latest Yi-Large model.

KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Zhou Xinyu for 36Kr.
