Trending topics
#
Bonk Eco continues to show strength amid $USELESS rally
#
Pump.fun to raise $1B token sale, traders speculating on airdrop
#
Boop.Fun leading the way with a new launchpad on Solana.
BullshitBench v2 is out! It is one of the few benchmarks where models are generally not getting better (except Claude) and where reasoning isn't helping.
What's new: 100 new questions, by domain (coding (40 Q's), medical (15), legal (15), finance (15), physics(15)), 70+ model variants tested. BullshitBench is already at 380 starts on GitHub - all questions, scripts, responses and judgements are there so check it out.
TL;DR:
- Results replicated - @AnthropicAI latest models are scoring exceptionally well
- @Alibaba_Qwen is another very strong performer
- OpenAI and Google models are not doing well and are not improving
- Domains do not show much difference - rates of BS detection are about the same across all domains
- Reasoning, if anything, has negative effect
- Newer models don't do that much better than older ones (except Anthropic)
Links:
- Data explorer:
- GitHub:
Highly recommend the data explorer where you can study the data and the questions & sample answers.
Top
Ranking
Favorites
