It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable. I really like that benchmark because it's one of the few that lets LLMs decline to answer when they are unsure and punishes them for trying to bullshit their way through.
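To make the "abstain vs. bluff" idea concrete, here is a rough Python sketch of that kind of scoring. It is only an illustration under my own assumptions, not the benchmark's actual code: the labels, the `score` function, the +1/-1/0 weighting, and the definition of the non-hallucination rate as one minus the share of wrong answers among non-correct responses are all hypothetical, and the real formulas may differ.

```python
from collections import Counter

def score(outcomes):
    """Score a list of per-question outcomes.

    Each outcome is one of 'correct', 'incorrect', 'abstain'
    (hypothetical labels, not the benchmark's own).
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    counts = Counter(outcomes)
    total = len(outcomes)

    # Assumed weighting: +1 for correct, -1 for a wrong answer, 0 for abstaining.
    index = (counts["correct"] - counts["incorrect"]) / total

    # Assumed definition: of all questions the model did not get right,
    # how often did it answer wrongly instead of abstaining?
    not_correct = counts["incorrect"] + counts["abstain"]
    hallucination_rate = counts["incorrect"] / not_correct if not_correct else 0.0

    return {
        "index": index,
        "hallucination_rate": hallucination_rate,
        # The "non" flips the direction: higher is better.
        "non_hallucination_rate": 1.0 - hallucination_rate,
    }

# Two models with the same number of correct answers:
bluffer = ["correct"] * 5 + ["incorrect"] * 5
cautious = ["correct"] * 5 + ["abstain"] * 4 + ["incorrect"] * 1

print(score(bluffer))
print(score(cautious))
```

Both models get 5 of 10 right, but under these assumptions the one that abstains when unsure ends up with the higher index and the higher non-hallucination rate.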
Bullshitting is how LLMs work. It doesn't require active encouragement. All it takes is a machine without consciousness, physical access to the world, or an actually lived life. A training set that contains lots of confident answers and few to no refusals doesn't help either.
PS: Just added a cool feature: you can filter the leaderboard for multiple models at once by separating them with commas, like: https://aibenchy.com/?q=glm,claude
it said "the lower, the better." Eventually, I realized that the "non" reverses the scores. And indeed, the results are consistent.
[0]: https://aibenchy.com/?q=glm