Google rolled out Gemini 3.1 Pro yesterday, touting a 77.1% score on novel logic puzzles that models can't just memorize—more than double 3 Pro's result—and record marks for expert-level scientific ...
OpenAI wants to retire the leading AI coding benchmark—and the reasons reveal a deeper problem with how the whole industry measures itself.
A groundbreaking computational physics framework has demonstrated that the three-dimensional fabric of the universe can be generated from scratch using a simple algorithm with exactly zero free ...
OpenAI and Paradigm have released EVMbench—a framework for evaluating AI agents' ability to find vulnerabilities in Ethereum smart contracts.