Yeah, I mean, I can't even see how they make money. It feels like they built a popular benchmark and it has now turned pay-to-win; I can't see any other way they'd be pulling in revenue that high, but I honestly have no idea what customers are paying for. This started as a way to vibe-test open models, but the last time we tried to get a model up there, the request was ignored and delayed for months while, at the same time, Meta was testing hundreds of variants specifically tuned to maximize the eval, after which we stopped submitting. I gave up on treating lmarena as a useful metric a long time ago, and I've privately heard from the bigwigs at Kansas that they hate the thing because it makes their models worse just to beat it. So, I don't know, that's about it.
Aakash Gupta · January 7, 2026
My read on LMArena is different than most. The headline here is $30M ARR in 4 months. But I'm more interested in the business model underneath.

LMArena built something that feels impossible. A crowdsourced evaluation platform that became the single biggest marketing lever in AI, then figured out how to charge the labs using it.

Let me break down the math. They went from $600M to $1.7B in 7 months. That's 183% valuation growth. At $30M ARR, they're trading at 57x revenue. But the run rate grew from $0 to $30M in 4 months. That's $7.5M per month of NEW revenue in a category that didn't exist 18 months ago.

The real story is the flywheel they built. 35M users show up to play a game. Two anonymous AI responses, pick your favorite. Those users generate 60M conversations per month. That data becomes the most trusted benchmark in the industry. OpenAI, Google, xAI all need their models on that leaderboard. So they PAY to get evaluated. It's genius because the customers are also the product being tested.

The harder question is whether this holds. Cohere, AI2, Stanford, and Waterloo dropped a 68-page paper in April accusing LMArena of letting Meta test 27 model variants before Llama 4 while hiding the worst scores. The "Leaderboard Illusion" paper basically said the playing field was rigged toward big labs. LMArena called it inaccurate. But the Llama 4 situation was messy. Meta tuned a model specifically for Arena performance, topped the leaderboard, then released a different model to the public that performed worse.

Here's where it gets interesting. Goodhart's Law says when a measure becomes a target, it ceases to be a good measure. LMArena is now SO important that labs optimize specifically for it. Longer responses win. Bullet points win. Confidence wins even when wrong. The platform acknowledged this. They added "style control" scoring to penalize markdown slop. Claude moved up. GPT-4o-mini moved down.

But the core tension remains. LMArena earns $30M+ per year from the same labs it judges. OpenAI, Google, xAI are customers. The referee is getting paid by the players. They say the public leaderboard is "a charity" and you can't pay for placement. I believe them. But the incentive structure is... complicated. The valuation says the market thinks they can thread the needle between commercial success and perceived neutrality.

Peter Deng joining the board is interesting. Former VP of Consumer Product at OpenAI. Now GP at Felicis leading this round. He knows exactly how valuable Arena placement is for model marketing. Ion Stoica as cofounder is the credibility anchor. Berkeley professor, created Spark and Ray, runs the Sky Computing Lab. This isn't a random startup. It's infrastructure built by researchers who understand distributed systems.

$250M raised in 7 months. Team of 40+. 5M monthly users across 150 countries. Evaluation just became a billion-dollar category.
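For readers curious how pairwise votes like these turn into a ranking, here is a minimal sketch of the kind of model the post is gesturing at: a Bradley-Terry-style logistic model where each vote is "A beat B," plus a style covariate (here, a response-length difference) so that verbosity gets its own coefficient instead of inflating a model's score. The model names, battle data, and single length feature are invented for illustration; this is not LMArena's actual code or pipeline, just a plausible shape for what "style control" could mean.

```python
# Sketch of a Bradley-Terry-style leaderboard with a "style control" covariate.
# All names and battle data are hypothetical; this illustrates the idea only.
import numpy as np

MODELS = ["model-alpha", "model-beta", "model-gamma"]  # hypothetical model names

# Each battle: (index of model A, index of model B, A's style features minus B's, 1 if A won else 0).
# The single style feature here is the response-length difference in thousands of tokens.
battles = [
    (0, 1, np.array([+0.4]), 1),   # alpha wrote ~400 more tokens than beta and won
    (0, 1, np.array([+0.5]), 1),
    (1, 2, np.array([-0.2]), 0),   # beta was shorter than gamma and lost
    (2, 0, np.array([-0.3]), 1),   # gamma was shorter than alpha and still won
    (1, 0, np.array([+0.1]), 1),
]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit(battles, n_models, n_style, lr=0.1, reg=0.05, steps=2000):
    """Fit per-model strengths theta and style weights beta by gradient ascent on the
    regularized log-likelihood of P(A beats B) = sigmoid(theta_A - theta_B + beta . style_diff)."""
    theta = np.zeros(n_models)
    beta = np.zeros(n_style)
    for _ in range(steps):
        g_theta = -reg * theta          # small L2 penalty keeps the toy fit from diverging
        g_beta = -reg * beta
        for a, b, style_diff, a_won in battles:
            p = sigmoid(theta[a] - theta[b] + beta @ style_diff)
            err = a_won - p             # gradient of the Bernoulli log-likelihood w.r.t. the logit
            g_theta[a] += err
            g_theta[b] -= err
            g_beta += err * style_diff
        theta += lr * g_theta
        beta += lr * g_beta
        theta -= theta.mean()           # strengths are only identified up to an additive constant
    return theta, beta

theta, beta = fit(battles, len(MODELS), n_style=1)

# theta is the style-controlled strength: the part of the win rate the length
# covariate could not explain. A positive beta[0] means longer answers tend to win,
# which is exactly the bias that a style-control adjustment tries to separate out.
for name, strength in sorted(zip(MODELS, theta), key=lambda x: -x[1]):
    print(f"{name}: {strength:+.2f}")
print(f"style (length) coefficient: {beta[0]:+.2f}")
```

The design point, under these assumptions, is that once the style coefficient absorbs the "longer and more formatted answers win" effect, the per-model strengths sit closer to answer quality alone, which is consistent with the post's note that some models moved up and others down when style control was added.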
From big labs, not big Kansas, lol. I guess someone needs to train these autocorrects on more tokens…