OpenAI recently launched its o3 and o4-mini AI models, which are state-of-the-art in many respects. Even so, the new models hallucinate, or fabricate information, more often than several of OpenAI's older models.
The problem of hallucinations remains one of the most significant challenges in AI, affecting even the top-performing systems today. Typically, each successive model has shown slight improvements in reducing hallucinations, but this trend does not hold true for o3 and o4-mini.
Internal evaluations by OpenAI show that o3 and o4-mini, both reasoning models, hallucinate more often than the company's earlier reasoning models, such as o1, o1-mini, and o3-mini, as well as its traditional, non-reasoning models such as GPT-4o.
A concerning aspect of this development is that OpenAI does not yet understand why these hallucinations are increasing. According to its technical report for o3 and o4-mini, more research is needed to understand why hallucinations worsen as reasoning models scale up. Although the models perform better in areas like coding and mathematics, they make more claims overall, which leads to more accurate claims but also more inaccurate, hallucinated ones.
In evaluations on PersonQA, OpenAI's in-house benchmark for measuring the accuracy of a model's knowledge about people, o3 hallucinated in response to 33% of questions, roughly double the rate of the company's previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. The o4-mini model did even worse, with a hallucination rate of 48%.
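For context, a hallucination rate on a PersonQA-style benchmark is simply the share of answered questions whose responses contain fabricated claims. The sketch below is a hypothetical illustration of that arithmetic, not OpenAI's actual grading pipeline; the sample data and the `is_hallucinated` field are invented for the example.

```python
# Hypothetical illustration of how a hallucination rate is tallied on a
# PersonQA-style benchmark. This is NOT OpenAI's grader; the sample data
# and field names are assumptions made for the example.
graded_answers = [
    {"question": "Where did person A study?", "is_hallucinated": False},
    {"question": "What prize did person B win?", "is_hallucinated": True},
    {"question": "Which company did person C found?", "is_hallucinated": False},
]

# Hallucination rate = fabricated answers / total graded answers.
hallucinated = sum(a["is_hallucinated"] for a in graded_answers)
rate = hallucinated / len(graded_answers)
print(f"Hallucination rate: {rate:.1%}")  # 33.3% for 1 of 3 in this toy set
```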
Third-party testing by Transluce, an AI research lab, also found that o3 tends to fabricate actions it took while arriving at answers. In one case, o3 claimed to have run code on a 2021 MacBook Pro outside of ChatGPT and then copied the results into its answer, something it is not capable of doing.
Neil Chowdhury, a researcher at Transluce and former OpenAI employee, suggested that the type of reinforcement learning used for the o-series models might exacerbate issues typically mitigated by standard post-training procedures. Transluce co-founder Sarah Schwettmann noted that the high hallucination rate could decrease the usefulness of o3.
Kian Katanforoosh, an adjunct professor at Stanford and CEO of the upskilling startup Workera, reported that his team, while testing o3 within coding workflows, found it superior to competitors but observed its tendency to hallucinate broken website links.
While hallucinations may enhance creativity and foster interesting ideas, they pose a challenge for businesses that require high accuracy, such as law firms, where factual errors in legal documents would be unacceptable.
Enhancing models’ accuracy through web search capabilities is one potential solution. OpenAI’s GPT-4o with web search function achieves 90% accuracy on SimpleQA, an accuracy benchmark used by OpenAI. Web search could potentially reduce hallucination rates for reasoning models, provided users are willing to allow prompts to be accessed by third-party search providers.
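As a rough sketch of what that looks like in practice, the snippet below asks a model to answer with web search enabled via OpenAI's Responses API. The tool name and response fields reflect the API at the time of writing and may change, so treat this as an assumption-laden example rather than a definitive integration.

```python
# Sketch: grounding an answer with web search via OpenAI's Responses API.
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment;
# the "web_search_preview" tool name reflects the API as of this writing.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],  # allow the model to consult the web
    input="Who is the current CEO of Workera? Cite your source.",
)

# Aggregated text output, including any citations the model added.
print(response.output_text)
```

Note that routing prompts through a search tool is exactly the trade-off the article describes: accuracy improves, but the prompt is exposed to a third-party search provider.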
Should the expansion of reasoning models continue to increase hallucinations, finding a resolution will become critical. OpenAI spokesperson Niko Felix emphasized that addressing hallucinations across all models remains an active area of research, with ongoing efforts to enhance accuracy and reliability.
Over the past year, the AI industry has increasingly focused on reasoning models as traditional AI model improvement techniques have shown diminishing returns. Reasoning improves model performance across various tasks without necessitating extensive computational resources and data for training. However, it appears that reasoning might also contribute to increased hallucinations, presenting a significant challenge for developers.