GLM 5.2 Performance Benchmarks

(artificialanalysis.ai)

64 points | by theanonymousone 6 hours ago

8 comments

wongarsu 2 hours ago
It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable. I really like that benchmark because it's one of the few that lets LLMs decline to answer when they are unsure and penalizes them for trying to bullshit their way through.
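To make the scoring idea concrete, here is a minimal sketch. It is not Artificial Analysis's code: the function names are hypothetical, and it assumes the hallucination rate is the share of not-correct responses that were wrong answers rather than abstentions, with the non-hallucination rate being one minus that.

```python
def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Assumed definition: wrong answers / (wrong answers + abstentions)."""
    not_correct = incorrect + abstained
    if not_correct == 0:
        # Nothing was missed, so there was nothing to hallucinate about.
        return 0.0
    return incorrect / not_correct


def non_hallucination_rate(incorrect: int, abstained: int) -> float:
    """Inverted view of the same number: higher is better."""
    return 1.0 - hallucination_rate(incorrect, abstained)


# Two hypothetical models that each fail to answer 40 questions correctly.
# The "guesser" always answers anyway; the "cautious" one mostly abstains.
guesser = non_hallucination_rate(incorrect=40, abstained=0)
cautious = non_hallucination_rate(incorrect=10, abstained=30)
print(guesser, cautious)
```

Under these assumptions, a model that guesses on everything it doesn't know scores worse than one that admits it doesn't know, even if both get the same number of questions right.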
- SilverServer 3 minutes ago
  It took me a while to figure out how to interpret the benchmark correctly, because the overview page says "AA-Omniscience Non-Hallucination Rate," but the benchmark page https://artificialanalysis.ai/evaluations/omniscience#aa-omn...
  says "the lower, the better." Eventually, I realized that the "non" reverses the scores. Read that way, the results are consistent.
- andai 28 minutes ago
  This implies that other benchmarks (for which every AI provider is optimizing?) are actively encouraging bullshitting?
  - whimblepop 4 minutes ago
    Bullshitting is how LLMs work. It doesn't require active encouragement. All it takes is a machine without consciousness or physical access to the world and an actually-lived life. A training set that contains lots of confident answers and few to no refusals doesn't help either.
theturtletalks 1 hour ago
I want to trust their benchmarks, but seeing Muse Spark ranked over GPT-5.5 gives me pause.
XCSme 1 hour ago
I also tested it[0]: quite similar to GLM 5, a few percent better, 30% faster and 50% more expensive.
[0]: https://aibenchy.com/?q=glm
- XCSme 1 hour ago
  PS: Just added a cool feature: you can filter the leaderboard for multiple models at once by separating them with commas, like: https://aibenchy.com/?q=glm,claude
- benxh 23 minutes ago
  A benchmark where Gemini Flash is better than Fable, btw.
- lousken 1 hour ago
  Still 1/4 of the price of Anthropic and OpenAI models, though.
lanycrost 2 hours ago
It's always nice to see how open source models are growing. I hope we will have good performance on lower-tier hardware some day.
hemkeshr 40 minutes ago
Local models are already useful today. The next milestone is getting this level of performance onto truly affordable hardware.
sourcecodeplz 2 hours ago
Still quite verbose at 140M output tokens, but this is on max thinking. High should do better.
ChrisArchitect 1 hour ago
Some more discussion: https://news.ycombinator.com/item?id=48567759
DeathArrow 2 hours ago
One or two more releases and they will reach Fable level.
- vitalyan123 1 hour ago
  By then there will be Fable 5.21, again 5% ahead of every other SotA while still only 500% the size.