Did xAI Fabricate Grok 3’s Benchmark Results?


Debates over AI benchmarks, and over how AI labs report them, have spilled into public view. Recently, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its new AI model, Grok 3. In response, xAI co-founder Igor Babuschkin insisted that the company had acted correctly.

The reality is more nuanced. A post on xAI’s blog featured a graph showing Grok 3’s performance on AIME 2025, a set of difficult problems from a recent invitational mathematics exam. While some experts have questioned AIME’s validity as an AI benchmark, AIME 2025 and earlier versions of the test are commonly used to evaluate a model’s mathematical ability.

xAI’s graph showed two versions of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s top-performing model, o3-mini-high, on AIME 2025. However, OpenAI employees pointed out that xAI’s graph did not include o3-mini-high’s AIME 2025 score at “cons@64.”

The term “cons@64” is short for “consensus@64”: a model is given 64 attempts at each problem in a benchmark, and the answer it generates most frequently is taken as its final answer. This approach tends to raise benchmark scores considerably, so omitting it from a graph can make it appear that one model outperforms another when it actually does not.
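To make the difference concrete, here is a minimal sketch of how a first-attempt score and a consensus score could be computed from the same set of sampled answers. The function names, the toy answers, and the use of 5 attempts instead of 64 are assumptions made for illustration; none of this is xAI’s or OpenAI’s actual evaluation code, and none of the numbers are real benchmark data.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the answer that appears most often among the sampled attempts.

    Ties go to the answer that was generated first, which is how
    Counter.most_common orders equal counts.
    """
    return Counter(samples).most_common(1)[0][0]


def score_at_1(attempts_per_problem, answer_key):
    """Fraction of problems solved when only the first attempt counts."""
    correct = sum(
        attempts[0] == key
        for attempts, key in zip(attempts_per_problem, answer_key)
    )
    return correct / len(answer_key)


def score_cons_at_k(attempts_per_problem, answer_key, k=64):
    """Fraction of problems solved when the majority answer of k attempts counts."""
    correct = sum(
        consensus_answer(attempts[:k]) == key
        for attempts, key in zip(attempts_per_problem, answer_key)
    )
    return correct / len(answer_key)


# Made-up data: three problems, five sampled answers each (hypothetical values).
toy_attempts = [
    [42, 17, 17, 17, 8],    # first attempt wrong, majority answer right
    [100, 100, 3, 100, 5],  # first attempt and majority answer both right
    [7, 9, 9, 2, 9],        # first attempt and majority answer both wrong
]
toy_key = [17, 100, 204]

print("score @1:    ", score_at_1(toy_attempts, toy_key))
print("score cons@5:", score_cons_at_k(toy_attempts, toy_key, k=5))
```

On this toy data the consensus score comes out higher than the first-attempt score for the same model and the same samples, which is why a chart that mixes the two settings across models is hard to read fairly.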

The scores for Grok 3 Reasoning Beta and Grok 3 mini Reasoning on AIME 2025 at “@1,” meaning a single attempt rather than a consensus of 64, were lower than o3-mini-high’s score. Grok 3 Reasoning Beta also trailed slightly behind OpenAI’s o1 model set to “medium” compute. Despite this, xAI is promoting Grok 3 as the “world’s smartest AI.”

Babuschkin also contended on social media that OpenAI has published similarly misleading benchmark charts in the past, although those charts compared the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing the performance of nearly every model at cons@64.

An AI researcher pointed out that perhaps the most important metric remains undisclosed: the computational (and monetary) cost each model incurred to achieve its best score. This shows how little most AI benchmarks reveal about the limitations and strengths of these models.
