Uncovering the Truth: Did xAI Mislead Us About Grok 3’s Performance Benchmarks? | TechCrunch


Recently, there’s been a heated debate about AI benchmarks and how they’re shared by different labs. It all started when an OpenAI employee called out Elon Musk’s AI company, xAI, for allegedly sharing misleading results for their model, Grok 3. In response, xAI co-founder Igor Babushkin defended their claims.

The situation is quite complex. xAI published a graph that showed Grok 3 performing well on AIME 2025, a tough math test from a recent competition. While many experts have questioned whether AIME is a strong benchmark for AI, it is still used widely to assess math skills in AI models.

According to xAI’s graph, two versions of Grok 3 outperformed OpenAI’s model, o3-mini-high, on AIME 2025. However, OpenAI employees quickly pointed out an important detail missing from the graph: o3-mini-high’s score with a method known as “consensus@64” (or cons@64). Under this method, the model makes 64 attempts at each question, and the most frequent answer is taken as final. This often boosts a model’s score significantly, so leaving it out of the comparison creates a misleading impression.
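The core idea behind cons@64 can be illustrated with a short sketch. This is a simplified stand-in for majority voting over repeated samples, not xAI’s or OpenAI’s actual evaluation code; the function name and sample data are hypothetical:

```python
from collections import Counter

def consensus_answer(attempts):
    """Return the most frequent answer among repeated attempts.

    Simplified sketch of 'consensus@N' majority voting: the model
    answers the same question N times, and the answer that appears
    most often is counted as its final answer.
    """
    counts = Counter(attempts)
    answer, _ = counts.most_common(1)[0]
    return answer

# 64 simulated attempts at one math problem: individual attempts vary,
# but the most common answer wins the vote.
attempts = ["42"] * 40 + ["41"] * 15 + ["43"] * 9
print(consensus_answer(attempts))  # → "42"
```

Because a single lucky attempt among 64 can rescue an otherwise inconsistent model, cons@64 scores are typically much higher than single-attempt (“@1”) scores, which is why omitting the setting from a comparison chart matters.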

Without cons@64, Grok 3 Reasoning Beta and Grok 3 mini Reasoning scored lower than o3-mini-high on their first attempts. Grok 3 Reasoning Beta even lagged slightly behind another OpenAI model, o1, when set to medium computing power. Nonetheless, xAI has been promoting Grok 3 as the “world’s smartest AI.”

Babushkin countered that OpenAI has itself published benchmark charts for its own models that some found similarly misleading. To clarify the dispute, a more neutral party compiled a graph showing how the various models perform at cons@64.

Some read this graph as an attack on OpenAI, while others saw it as a defense of Grok, but its creator insisted it wasn’t about taking sides. They did, however, note that Grok appeared impressive, while arguing that OpenAI’s methods deserved more scrutiny.

Adding to the discussion, AI researcher Nathan Lambert raised a critical point: the actual computing and financial costs involved in yielding these benchmark scores for each model are often overlooked. This highlights how many AI benchmarks do not fully reveal a model’s strengths and weaknesses.

