Reasoning is at the core of scientific work. Beyond recalling facts, scientists generate hypotheses, test and refine them, and synthesize ideas across fields. As our models become more capable, the central question is how they can reason deeply to contribute to scientific research.
Over the last year, our models have reached major milestones, including achieving gold-medal performance at the International Math Olympiad and the International Olympiad in Informatics. In parallel, we’re starting to see our most capable models, such as GPT‑5, meaningfully accelerate real scientific workflows. Researchers are using these systems for tasks such as searching the literature across disciplines and languages and working through complex mathematical proofs. In many cases, the model compresses work that might have taken days or weeks into hours. This progress is documented in our paper Early science acceleration experiments with GPT‑5, released in November 2025, which presents early evidence that GPT‑5 can measurably accelerate scientific workflows.
Introducing FrontierScience
As accelerating scientific progress is one of the most promising opportunities for AI to benefit humanity, we’re improving our models on difficult math and science tasks and working on the tools that will help scientists get the most from them.
When GPQA, a “Google-Proof” science benchmark of questions written by PhD experts, was released in November 2023, GPT‑4 scored 39%, below the expert baseline of 70%. Two years later, GPT‑5.2 scored 92%. As models’ reasoning and knowledge capabilities continue to scale, more difficult benchmarks will be important for measuring and forecasting models’ ability to accelerate scientific research. Prior scientific benchmarks largely focus on multiple-choice questions, are saturated, or are not centrally focused on science.
To bridge this gap, we’re introducing FrontierScience: a new benchmark built to measure expert-level scientific capabilities. FrontierScience is written and verified by experts across physics, chemistry, and biology, and consists of hundreds of questions designed to be difficult, original, and meaningful. FrontierScience includes two tracks of questions: Olympiad, which measures Olympiad-style scientific reasoning capabilities, and Research, which measures real-world scientific research abilities. Providing more insight into models’ scientific capabilities helps us track progress and advance AI-accelerated science.
In our initial evaluations, GPT‑5.2 is our top performing model on FrontierScience-Olympiad (scoring 77%) and Research (scoring 25%), ahead of other frontier models. We’ve seen substantial progress on solving expert-level questions while leaving headroom for more progress, especially on open-ended research-style tasks. For scientists, this suggests that current models can already support parts of research that involve structured reasoning, while highlighting that significant work remains to improve their ability to carry out open-ended thinking. These results align with how scientists are already using today’s models: to accelerate research workflows while relying on human judgment for problem framing and validation, and increasingly to explore ideas and connections that would otherwise take much longer to uncover—including, in some cases, contributing new insights that experts then evaluate and test.
In the end, the most important benchmark for the scientific capabilities of AI is the novel discoveries it helps generate; those are what ultimately matter to science and society. FrontierScience sits upstream of that. It gives us a north star for expert-level scientific reasoning, letting us test models on a standardized set of questions, see where they succeed or fail, and identify where we need to improve them. FrontierScience is narrow and has limitations in key respects (for example, focusing on constrained, expert-written problems) and does not capture everything scientists do in their everyday work. But the field needs more difficult, original, and meaningful science benchmarks, and FrontierScience provides a step forward in this direction.
What FrontierScience measures and how we built it
The full FrontierScience evaluation spans over 700 textual questions (with 160 in the gold set) covering subfields across physics, chemistry, and biology. The benchmark is composed of an Olympiad track and a Research track. FrontierScience-Olympiad contains 100 questions designed by international olympiad medalists to assess scientific reasoning in a constrained, short-answer format. The Olympiad set was designed to contain theoretical questions at least as difficult as the problems at international olympiad competitions. FrontierScience-Research consists of 60 original research subtasks designed by PhD scientists (doctoral candidates, professors, or postdoctoral researchers) that are graded using a 10-point rubric. The Research set was created to contain self-contained, multi-step subtasks at the level of difficulty that a PhD scientist might encounter during their research.
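To make the two-track structure concrete, here is a minimal sketch of how a single task record could be represented. The field names and types are illustrative assumptions, not the benchmark’s actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch only: these field names and types are assumptions,
# not the actual FrontierScience data schema.

@dataclass
class RubricItem:
    description: str  # an independent, objectively assessable criterion
    points: int       # points awarded if the criterion is satisfied

@dataclass
class FrontierScienceTask:
    track: str                          # "olympiad" or "research"
    domain: str                         # e.g. "physics", "chemistry", "biology"
    prompt: str                         # the question or research subtask
    short_answer: Optional[str] = None  # Olympiad: number, expression, or string
    rubric: list[RubricItem] = field(default_factory=list)  # Research: items totaling 10 points

    def max_points(self) -> int:
        # Research tasks are scored out of 10 rubric points;
        # Olympiad tasks are effectively pass/fail on the short answer.
        return sum(item.points for item in self.rubric) if self.rubric else 1
```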
The Olympiad questions were created in collaboration with 42 former international medalists or national team coaches in the relevant domains, who collectively hold 109 olympiad medals. The Research questions were created in collaboration with 45 qualified scientists and domain experts, all of whom were doctoral candidates, postdoctoral researchers, or professors. Their areas of expertise spanned an array of specialized and important scientific disciplines, from quantum electrodynamics to synthetic organic chemistry to evolutionary biology.
The task creation process for both sets included some selection against OpenAI internal models (e.g., discarding tasks that these models answered correctly), so we expect the evaluation to be somewhat biased against our models relative to others. We open-source the Olympiad gold set of 100 questions and the Research gold set of 60 questions, holding out the remaining questions to track contamination.
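As a rough sketch of what that selection step can look like (the helper below is hypothetical, not the actual pipeline): tasks that an internal reference model already answers correctly are dropped, which is what introduces the bias noted above.

```python
# Hypothetical sketch of selection during task creation. Dropping tasks that
# internal reference models already answer correctly preserves difficulty,
# but biases the resulting benchmark against those models.

def filter_candidate_tasks(candidates, reference_model_solves):
    """Keep only candidate tasks that the reference models fail.

    candidates: iterable of task records
    reference_model_solves: callable(task) -> bool, True if any internal
        reference model answers the task correctly
    """
    return [task for task in candidates if not reference_model_solves(task)]
```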
Each task goes through four stages: Creation, Review, Resolution, and Revision. Independent experts review each other’s tasks to verify that they align with the criteria.
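A minimal sketch of that staged workflow is below; the stage names come from the text, while the ordering and transition logic are illustrative assumptions.

```python
from enum import Enum

# Stage names come from the benchmark description; the linear ordering and
# transition helper are illustrative assumptions.

class TaskStage(Enum):
    CREATION = 1    # an expert drafts the task and its reference solution
    REVIEW = 2      # an independent expert checks the task against the criteria
    RESOLUTION = 3  # author and reviewer resolve any flagged issues
    REVISION = 4    # the task is revised before final inclusion

def next_stage(stage: TaskStage) -> TaskStage:
    """Advance a task to the following stage (no-op at the final stage)."""
    return TaskStage(min(stage.value + 1, TaskStage.REVISION.value))
```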
How we grade model performance
Each Olympiad question is graded against a short answer (a number, an expression, or a fuzzy string match), which makes correctness straightforward to verify. However, this verifiability trades off against the expressivity and open-endedness of the problem. For the Research set, we introduce a rubric-based approach for grading more open-ended tasks. Each question includes a scoring rubric with multiple independent, objectively assessable items totaling 10 points. The rubric assesses not only the accuracy of the final answer, but also the correctness of intermediate reasoning steps, allowing for nuanced analysis of model performance and failures. A solution is considered “correct” if it’s awarded at least 7 of the 10 rubric points.
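To make the scoring rule concrete, here is a minimal sketch of rubric scoring; the 10-point total and the 7-point threshold come from the text, while the data representation and the example rubric are illustrative assumptions.

```python
# Minimal sketch of rubric-based scoring. The 10-point total and the 7/10
# correctness threshold come from the benchmark description; the rubric
# representation and the example below are illustrative assumptions.

CORRECTNESS_THRESHOLD = 7  # out of 10 rubric points

def score_research_response(rubric, met_criteria):
    """rubric: list of (criterion, points) pairs summing to 10 points.
    met_criteria: set of criteria the grader judged to be satisfied."""
    earned = sum(points for criterion, points in rubric if criterion in met_criteria)
    return earned, earned >= CORRECTNESS_THRESHOLD

# Hypothetical example: a response earning 8 of 10 points counts as "correct".
rubric = [
    ("states the governing equation", 2),
    ("correct intermediate derivation", 3),
    ("identifies the limiting regime", 2),
    ("final numerical answer within tolerance", 3),
]
points, correct = score_research_response(
    rubric,
    {"states the governing equation",
     "correct intermediate derivation",
     "final numerical answer within tolerance"},
)
print(points, correct)  # 8 True
```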
Responses are evaluated by a model-based grader (GPT‑5) against either the short answer or the rubric criteria. While we’d ideally have an expert human grade every response, that approach doesn’t scale, so we designed the rubrics to be checkable by a model grader. We also developed a verification pipeline to help ensure questions and rubrics were well calibrated for difficulty and correctness.
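As an illustration, a model-based grading call might look roughly like the sketch below, using the OpenAI Python SDK; the prompt wording and grader configuration are assumptions, and only the use of GPT‑5 as the grading model comes from the text.

```python
# Illustrative sketch of model-based rubric grading with the OpenAI Python SDK.
# The prompt wording and configuration are assumptions; only the choice of
# GPT-5 as the grading model is taken from the text.
from openai import OpenAI

client = OpenAI()

def grade_against_rubric(question: str, response: str, rubric: list[tuple[str, int]]) -> str:
    rubric_text = "\n".join(f"- ({points} pts) {criterion}" for criterion, points in rubric)
    grading_prompt = (
        "You are grading a scientific answer against a rubric.\n\n"
        f"Question:\n{question}\n\n"
        f"Candidate answer:\n{response}\n\n"
        f"Rubric (10 points total):\n{rubric_text}\n\n"
        "For each rubric item, state whether it is satisfied, then report the total points earned."
    )
    completion = client.chat.completions.create(
        model="gpt-5",  # grader model named in the text; exact identifier assumed
        messages=[{"role": "user", "content": grading_prompt}],
    )
    return completion.choices[0].message.content
```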
Model performance
We evaluated several frontier models on FrontierScience-Olympiad and FrontierScience-Research: GPT‑5.2, Claude Opus 4.5, Gemini 3 Pro, GPT‑4o, OpenAI o4-mini, and OpenAI o3. All reasoning models were evaluated at “high” reasoning effort, with the exception of GPT‑5.2, which was evaluated at “xhigh”. GPT‑5.2 is our top-performing model on both tracks, scoring 77% on FrontierScience-Olympiad and 25% on FrontierScience-Research, ahead of the other frontier models. Gemini 3 Pro is comparable to GPT‑5.2 on the Olympiad set, scoring 76%.
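For reference, the evaluation configuration described above might be expressed roughly as follows; the effort settings come from the text, while the model identifier strings and harness functions are hypothetical placeholders.

```python
# Sketch of the evaluation setup described above. Effort settings come from
# the text; model identifier strings and the harness functions (run_model,
# grade_response) are hypothetical placeholders.

EVAL_CONFIG = {
    "gpt-5.2":         {"reasoning_effort": "xhigh"},
    "claude-opus-4.5": {"reasoning_effort": "high"},
    "gemini-3-pro":    {"reasoning_effort": "high"},
    "o3":              {"reasoning_effort": "high"},
    "o4-mini":         {"reasoning_effort": "high"},
    "gpt-4o":          {"reasoning_effort": None},  # not a reasoning model
}

def pass_rate(model_name, tasks, run_model, grade_response):
    """Fraction of tasks graded correct for one model under its configured effort."""
    settings = EVAL_CONFIG[model_name]
    n_correct = sum(
        grade_response(task, run_model(model_name, task.prompt, **settings))
        for task in tasks
    )
    return n_correct / len(tasks)
```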
We’ve seen substantial progress on solving expert-level questions, but there is still plenty of room to grow, especially on open-ended research-style tasks. In analyzing transcripts of failures, we found that frontier models sometimes made reasoning, logic, and calculation errors, misunderstood niche scientific concepts, and introduced factual inaccuracies.
Limitations and what’s next
While FrontierScience represents a step forward in the difficulty of scientific benchmarks, it still has many limitations. FrontierScience is composed of questions with constrained problem statements, graded either on the final answer (Olympiad) or on the reasoning required to complete a research subtask (Research). In addition, using rubrics with multiple components on longer tasks is less objective than checking a final answer.
FrontierScience offers a higher-resolution snapshot of models’ reasoning on difficult, expert-written questions, but not a full picture of how science gets done in practice. In particular, it does not assess significant parts of scientific research: generating genuinely novel hypotheses, or interacting with multiple modalities, including video data and real experimental systems in the physical world.
Looking ahead, we expect progress in scientific reasoning to come from both better general-purpose reasoning systems and focused effort on improving scientific capabilities. FrontierScience is one tool among many, and as models improve, we plan to iterate on this benchmark, expand it to new domains, and pair it with more real-world evaluations that look at what these systems actually enable scientists to do. Benchmarks like FrontierScience help us understand the weaknesses of today’s AI systems so we can focus our work on making models reliable partners in scientific discovery.