Digital Phablet

OpenAI IMO Gold Team Reveals: AI Refuses to Answer Question Six (via Phoenix Network)

by Seok Chen
August 3, 2025
in AI

Recently, OpenAI’s team shed light on how their groundbreaking model, which recently secured a gold medal at the International Mathematical Olympiad (IMO), was developed by an incredibly small core group—just three key members. In a media interview, the team revealed that behind this achievement are Alexander Wei, the project lead, research engineer Sheryl Hsu, and senior research scientist Noam Brown. Interestingly, Hsu only joined the team in March of this year.


The project was initiated with a rapid, intense effort, taking only about two to three months of focused work, yet it produced results that surpassed many expectations. Achieving an IMO gold medal with an AI model represents a significant milestone, signaling advancements not only in the model’s mathematical capabilities but also in its general problem-solving techniques—particularly in tackling complex challenges that are traditionally difficult to verify.

Key insights from the team included discussions about the project’s origins, team size, and the unique approaches taken. They explained that the desire to develop an AI capable of IMO-level reasoning had long been a goal within the AI community and OpenAI itself, with conversations dating back to 2021. Although core algorithms and foundational ideas had been in development for roughly six months beforehand, the critical, focused effort for this breakthrough only began within a few months of the competition.

The team comprises only Wei, Hsu, and Brown, with Wei leading the main technical development. Despite initial skepticism about the feasibility of their approach, the significant progress Wei demonstrated—particularly in reasoning about tasks that are hard for humans to verify—gradually won the confidence of the team and company leadership.


The AI’s proof-generation style is notably unconventional. The team openly admits that the model’s mathematical proofs are “atrocious” or “creative,” full of logical quirks that make them hard for humans to interpret. Rather than optimizing for readability, OpenAI published these raw, AI-generated proofs directly on GitHub, promoting transparency and allowing anyone to review how the model approaches complex reasoning.

A point of discussion centered on the model’s performance on the most challenging IMO questions—particularly Question 6, often regarded as the toughest. The AI chose not to attempt an answer here, which the team sees as a positive indicator. It demonstrates the model’s awareness of its limitations—unlike older AI systems that would confidently fabricate answers even when unable to solve the problem. Wei explained that such higher-level, abstract problems—think of advanced combinatorics—pose significant challenges to AI because they require leaps of insight or “belief jumps,” areas where current AI still struggles.

When asked about the broader horizon of AI achieving solutions to Millennium Prize Problems—some of the toughest mathematical puzzles—the team responded cautiously. Wei highlighted that, although there has been measurable progress in solving simpler problems within seconds or a couple of hours, tackling problems that consume mathematicians a lifetime—like the Millennium Problems—remains far out of reach. He emphasized a mix of excitement about current advances and humility about future challenges, recognizing the enormous gap between solving hour-long problems and the depth of complexity in these century-old riddles.

One technical challenge discussed was the difficulty in evaluating models that take days or even months to “think” through problems. Noam Brown pointed out that this creates a bottleneck in research, as observing the model’s reasoning process would take proportionally long, severely slowing progress. While current efforts manage a few hours of “thinking,” extending this to days or weeks will necessitate new solutions.

The project also involves multi-agent systems, a concept Brown explained as part of scaling up computational power through parallel processing involving multiple AI agents. Although specific technical details remain confidential, he confirmed that such methods are crucial for broadening the model’s reasoning capabilities and handling complex, long-duration tasks. The team prioritizes generality—developing techniques broadly applicable rather than task-specific solutions, contrasting with earlier specialized systems like their poker AI or the Cicero project for the game Diplomacy. They aim to build versatile, scalable tools that can be integrated into future systems such as ChatGPT.

Another discussion point touched on the potential use of formal proof tools like Lean. While recognizing Lean’s value for mathematicians, the team indicated that their primary focus remains on natural language reasoning since many real-world problems are better addressed through more flexible, less formal methods. Brown emphasized that specialized AI tools and general AI are not mutually exclusive; he sees value in combining their strengths to achieve greater overall power.
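For readers unfamiliar with Lean, the following is a toy illustration of the kind of machine-checkable proof the tool works with. This example is purely pedagogical—it is not drawn from OpenAI's system and uses only core Lean 4—but it shows why such proofs are rigid and verifiable, in contrast to the natural-language reasoning the team favors:

```lean
-- Toy Lean 4 theorem: the sum of two even numbers is even.
-- (Illustrative only; not from OpenAI's IMO model.)
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  cases ha with
  | intro m hm =>        -- a = 2 * m
    cases hb with
    | intro n hn =>      -- b = 2 * n
      -- Witness: m + n, since 2*m + 2*n = 2*(m + n).
      exact ⟨m + n, by rw [hm, hn, Nat.mul_add]⟩
```

Every step here is checked mechanically by the Lean kernel, which is exactly the property that makes formal tools attractive to mathematicians—and exactly the rigidity that, per the team, makes flexible natural-language reasoning a better fit for many real-world problems.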


The infrastructure underpinning this work uses OpenAI’s existing, versatile systems—no hardware or software customized for the IMO was necessary. The team envisions that the approaches developed here for long-duration reasoning and handling unverifiable tasks could be applied across other fields, enhancing models like ChatGPT.

Looking forward, a key challenge identified was enabling models to generate truly innovative questions—such as creating entirely new IMO-level problems. This skill, considered the next frontier after solving existing problems, requires models not only to analyze but to conceive and formulate novel challenges.

On the question of whether models might eventually solve the Millennium Prize Problems, Wei responded that such breakthroughs remain “very distant,” citing the enormous difference in complexity. While models have advanced enough to solve certain elementary problems quickly, the profound depth of human thought embedded in these grand challenges is still beyond reach.

Finally, the team shared an inspiring anecdote: a Stanford professor who has been testing OpenAI’s models with progressively harder math problems. Although the latest IMO-level model cannot yet solve these difficult questions, it has shown an important step forward—acknowledging its inability to solve certain problems, a sign of growing self-awareness and reasoning maturity in AI systems. The team hopes to eventually release these tools for wider use among mathematicians and researchers eager to test their limits.

As AI continues to evolve, the progress toward solving humanity’s most enduring scientific and mathematical mysteries remains cautious yet optimistic. The efforts of this small but dedicated team mark a significant chapter in that journey, illustrating both the incredible potential and the ongoing challenges ahead.

Seok Chen is a mass communication graduate from the City University of Hong Kong.

© 2026 Digital Phablet