Debates over AI benchmarks, and how AI labs report them, are increasingly playing out in public. Recently, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. In response, xAI co-founder Igor Babushkin insisted that the company was in the right.
The reality is more nuanced. In a blog post, xAI published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. While some experts have questioned AIME’s validity as an AI benchmark, AIME 2025 and older versions of the exam are commonly used to probe a model’s math ability.
xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees were quick to point out that the graph omitted o3-mini-high’s AIME 2025 score at “cons@64.”
“Cons@64” is short for “consensus@64.” It gives a model 64 attempts to answer each problem in a benchmark and takes the answers it generates most frequently as final. Unsurprisingly, this tends to boost scores considerably, and omitting it from a graph can make one model appear to surpass another when in reality it doesn’t.
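To make the distinction concrete, here is a minimal sketch in Python of how a consensus@k score can diverge from a single-attempt “@1” score. The function names and toy answers are hypothetical illustrations, not xAI’s or OpenAI’s actual evaluation code.

```python
from collections import Counter

def consensus_answer(samples: list[str]) -> str:
    """Return the most frequently generated answer among k samples."""
    return Counter(samples).most_common(1)[0][0]

def score(model_answers: list[list[str]], correct: list[str]) -> tuple[float, float]:
    """Compute @1 (first-attempt) and cons@k (majority-vote) accuracy.

    model_answers[i] holds all sampled answers for problem i;
    correct[i] is that problem's reference answer.
    """
    at_1 = sum(ans[0] == ref for ans, ref in zip(model_answers, correct))
    cons = sum(consensus_answer(ans) == ref for ans, ref in zip(model_answers, correct))
    n = len(correct)
    return at_1 / n, cons / n

# Toy example: one problem, 5 samples (64 in the real cons@64 setting).
# The first attempt is wrong, but the majority vote is right, so
# cons@k exceeds @1 -- the gap the benchmark dispute hinges on.
answers = [["41", "42", "42", "40", "42"]]
reference = ["42"]
print(score(answers, reference))  # (0.0, 1.0)
```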
Grok 3 Reasoning Beta’s and Grok 3 mini Reasoning’s AIME 2025 scores at “@1,” meaning the models’ first attempt at each problem, fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”
Babushkin countered on social media that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing its own models. A more neutral party in the debate put together an “accurate” graph showing nearly every model’s performance at cons@64.
One AI researcher pointed out that the computational (and monetary) cost each model incurred to achieve its best score remains undisclosed, a crucial metric in its own right. It goes to show how little information most AI benchmarks convey about models’ limitations and strengths.