OpenAI’s latest language model, o3, is under scrutiny after independent evaluations revealed it performed below the company’s original benchmark claims.
The model was said to solve over 25 percent of the problems in the highly challenging FrontierMath test set, but new testing shows a far lower success rate for the publicly released version.
Independent evaluation challenges OpenAI’s results
When OpenAI introduced o3 in December, executives claimed it far outperformed other models on FrontierMath, a benchmark of graduate-level mathematical problems. OpenAI’s Chief Research Officer Mark Chen stated that competing models scored under 2 percent, while o3 achieved over 25 percent using aggressive compute settings. However, the research institute Epoch AI, which created the FrontierMath benchmark, released its results showing the public o3 version solved only 10 percent of the updated 290-question test.
Epoch’s figure sits near the lower end of the performance range OpenAI originally published. Epoch noted several possible explanations for the gap, including differences in test-time compute settings and in the subset of problems used for evaluation. The December version of FrontierMath contained 180 problems, while the February update expanded it to 290, which may also have affected the comparison.
Public version uses less compute than internal demo
The publicly released version of o3 runs with less compute than the model OpenAI demonstrated internally in December. The ARC Prize Foundation, which previously tested a pre-release version of o3 at higher compute tiers, has confirmed that the released model is a different, smaller variant tuned for chat and product use.
OpenAI employee Wenda Zhou addressed the discrepancy during a recent livestream, explaining that the production model is optimized for real-world use cases and faster responses, which can come at the cost of benchmark scores. The optimizations Zhou described could explain why the model performs differently in public than it did in OpenAI’s internal testing.
Newer models already outperform o3 on FrontierMath
Despite the controversy, OpenAI has already introduced models that exceed o3’s performance on FrontierMath. The o3-mini-high and the recently launched o4-mini have both posted stronger results, and a more capable o3-pro model is expected in the coming weeks.
The episode highlights a broader problem in the competitive AI market: how benchmark results are presented. When companies report performance figures chosen to support marketing claims, users and analysts can be left with a misleading picture of what AI systems can actually do.