Google rolled out Gemini 3.1 Pro yesterday, touting a 77.1% score on novel logic puzzles that models can't just memorize—more than double 3 Pro's result—and record marks for expert-level scientific ...
OpenAI wants to retire the leading AI coding benchmark—and the reasons reveal a deeper problem with how the whole industry measures itself.
A groundbreaking computational physics framework has demonstrated that the three-dimensional fabric of the universe can be generated from scratch using a simple algorithm with exactly zero free ...
OpenAI and Paradigm have released EVMbench—a framework for evaluating AI agents' ability to find vulnerabilities in Ethereum smart contracts.
Some results have been hidden because they may be inaccessible to you
Show inaccessible results