ChatGPT shouldn’t have political bias in any direction.
People use ChatGPT as a tool to learn and explore ideas. That only works if they trust ChatGPT to be objective. We outline our commitment to keeping ChatGPT objective by default, with the user in control, in our Model Spec principle Seeking the Truth Together(opens in a new window).
Building on our July update, this post shares our latest progress towards this goal. Here we cover:
- Our operational definition of political bias
- Our approach to measurement
- Results and next steps
This post is the culmination of a months-long effort to translate principles into a measurable signal and develop an automated evaluation setup to continually track and improve objectivity over time.
Overview and summary of findings
We created a political bias evaluation that mirrors real-world usage and stress-tests our models’ ability to remain objective. Our evaluation is composed of approximately 500 prompts spanning 100 topics and varying political slants. It measures five nuanced axes of bias, enabling us to decompose what bias looks like and pursue targeted behavioral fixes to answer three key questions: Does bias exist? Under what conditions does bias emerge? When bias emerges, what shape does it take?
Based on this evaluation, we find that our models stay near-objective on neutral or slightly slanted prompts, and exhibit moderate bias in response to challenging, emotionally charged prompts. When bias does present, it most often involves the model expressing personal opinions, providing asymmetric coverage or escalating the user with charged language. GPT‑5 instant and GPT‑5 thinking show improved bias levels and greater robustness to charged prompts, reducing bias by 30% compared to our prior models.
To understand real-world prevalence, we separately applied our evaluation method to a sample of real production traffic. This analysis estimates that less than 0.01% of all ChatGPT responses show any signs of political bias.
Based on these results, we are continuing work to further improve our models’ objectivity, particularly for emotionally charged prompts that are more likely to elicit bias.
Landscape and evaluation scope
Political and ideological bias in language models remains an open research problem. Existing benchmarks, such as the Political Compass(opens in a new window) test, often rely on multiple-choice questions. Such evaluations cover only a narrow slice of everyday use and overlook how bias can emerge in realistic AI interactions. We set out to build an evaluation that reflects real-world usage—nuanced, open-ended scenarios—in order to test and train our models in the way people actually apply them, where bias can surface in both obvious and subtle ways.
Our evaluation focuses on ChatGPT’s text-based responses, which represent the majority of everyday usage and best reveal how the model communicates and reasons. We leave behavior tied to web search out of scope for this evaluation, as it involves separate systems for retrieval and source selection.
Measuring political bias in realistic ChatGPT conversations
To operationalize a definition of political bias, we developed an evaluation framework that measures how bias appears in realistic AI usage. The framework combines a representative set of user prompts with measurable axes of bias derived from observed model behavior.
Bias can vary across languages and cultures; we began with a detailed evaluation of U.S. English interactions before testing generalization elsewhere. Early results indicate that the primary axes of bias are consistent across regions, suggesting our evaluation framework generalizes globally.
The first step in our process was constructing a dataset of prompts. Users engage ChatGPT across a wide spectrum of political, policy, and cultural topics, ranging from concrete factual questions to open-ended value discussions. To reflect this diversity, the dataset includes both explicit policy queries and everyday social or cultural questions that may appear apolitical but can elicit subtle bias in framing or emphasis.
To test our models’ robustness, we combined examples of what most users might ask ChatGPT with a distinct subset of challenging prompts targeting politically sensitive or emotionally charged contexts. The latter are more adversarial, challenging test cases designed to stress-test our models: by incorporating polarized language and provocative framing, we can assess how they perform when objectivity is most difficult to maintain.
Our dataset consists of approximately 500 questions that cover 100 topics, each with 5 corresponding questions written from different political perspectives. Topics were derived from major U.S. party platforms (e.g., energy independence, immigration) and culturally salient issues (e.g., gender roles, parenting). For each question, we then created reference responses that aim to illustrate the objectivity standards defined in our Model Spec—which we used to guide the development of our evaluation rubric.
We next analyzed model responses across the dataset to identify consistent patterns associated with bias. Through this we derived five measurable axes that capture distinct ways bias can manifest in model outputs and that can be evaluated reliably across prompts.
- User invalidation: Language that implicitly dismisses or delegitimizes the user’s viewpoint in political or ideological terms, beyond factual disagreement (e.g., placing user phrasing in scare quotes).
- User escalation: Language that mirrors and amplifies the political stance expressed in the prompt, reinforcing its framing rather than maintaining objectivity.
- Personal political expression: The model presenting political opinions as its own rather than contextualizing them as external viewpoints.
- Asymmetric coverage: Responses that selectively emphasize one perspective or omit others in domains where multiple, legitimate viewpoints exist and the user has not requested a single-sided explanation.
- Political refusals: Instances where the model declines to engage with a politically oriented query without a valid justification under the Model Spec.
These axes mirror what we see in humans. Human bias isn’t only “what one believes”; it’s also how one communicates through what is emphasized, excluded, or implied. The same is true for models: bias may appear as one-sided framing, selective evidence, personal subjective opinions, or style that amplifies a slant, even when individual facts are correct.
Finally we incorporated these five axes into a set of detailed evaluation instructions for an LLM grader, a model tasked with assessing the outputs of another model (in this case, GPT‑5 thinking) against our bias criteria. This required an iterative design process, refining definitions and annotation guidelines to ensure the grader reliably captured each bias dimension. Reference responses were used throughout our iteration process to validate the accuracy of the LLM grader scores.
Ultimately, we hope this will provide an interpretable and automatic evaluation we can apply to any model. The goal is to enable not only detection of bias, but also fine-grained analysis of what types of bias occur and in which contexts, to help track and understand progress over time.
The following are examples of biased and reference responses, and their corresponding grader scores. Illustrative examples of bias are highlighted.
Results and insights
- Does bias exist?
- Under what conditions does bias emerge?
- When bias emerges, what shape does it take?
Notes on interpretation: our evaluation scores political bias in a given model response on a 0 to 1 scale. The scores reflect a strict rubric; lower is better, and perfect objectivity is not observed even for our reference responses.
Measuring aggregate performance on our evaluation, we found that bias appears infrequently and at low severity. The latest GPT‑5 models are most aligned with objectivity targets, reducing bias scores by ~30% compared to prior models. Worst-case scores for older models are 0.138 for o3 and 0.107 for GPT‑4o. Notably, under our strict evaluation rubric even reference responses do not score zero.
Applying the same evaluation method to a representative sample of production traffic rather than our evaluation prompt set, we estimate that <0.01% of all model responses exhibit signs of political bias. This low rate reflects both the rarity of politically slanted queries and the model’s overall robustness to bias.
We evaluate the conditions where bias emerges by comparing results on neutral, slightly liberal/conservative, and emotionally charged liberal/conservative prompts. Model objectivity should be invariant to prompt slant—the model may mirror the user’s tone, but its reasoning, coverage, and factual grounding must remain neutral.
We find that in neutral or slightly slanted scenarios our models exhibit strong objectivity and little to no bias with behavior closely matching our principles. These scenarios reflect what we observe of typical ChatGPT usage. Under challenging, emotionally charged prompts, moderate bias emerges. There is some asymmetry to this effect: strongly charged liberal prompts exert the largest pull on objectivity across model families, more so than charged conservative prompts.
Similar to the aggregate performance metrics, the results show that GPT‑5 models exhibit lower bias than previous models (GPT‑4o and o3): GPT‑5 instant and thinking are not only less biased on average, but also more resilient under pressure from more challenging, charged prompts.
We evaluate the shape of bias by measuring scores separately for each axis. We found that models struggle with certain axes and excel in others, with stable patterns across model families.
When bias presents, it is most often in one of three forms: (1) personal-opinion—the model frames political views as its own rather than attributing to sources; (2) asymmetric coverage—responses emphasize one side where multiple perspectives are warranted; and (3) emotional escalation—language that amplifies the user’s slant. Political refusals and invalidating the user are rare, with scores on these axes aligning more closely with our intended behavior.
Similar to the previous results, we find that GPT‑5 instant and thinking outperform GPT‑4o and o3 across all measured axes.
What’s next
While GPT‑5 improves bias performance over prior models, challenging prompts expose opportunities for closer alignment to our Model Spec(opens in a new window). We are investing in improvements over the coming months and look forward to sharing results.
By discussing our definitions and evaluation methods, we aim to clarify our approach, help others build their own evaluations, and hold ourselves accountable to our principles. This work acts on our operating principle commitments to Technical Leadership and Cooperative Orientation; we hope it supports industry efforts to advance AI objectivity through shared definitions and empirical evaluation.