OpenAI’s ChatGPT Agent Outperforms the o3 Model Alone: What This Means for Developers

25 min read Original article ↗

A new ChatGPT agent was introduced in OpenAI’s July 17, 2025 announcement as a new agentic model in the same family as OpenAI o3, but with better benchmarks. Do the new features only affect ChatGPT product users, or do they also impact OpenAI platform customers and third-party developers?

! Do not confuse the Agent mode in ChatGPT introduced in July 2025 with the completely different agent-building platform (Agent Builder/AgentKit) launched in October 2025.

ChatGPT Agent Meaning

ChatGPT Agent operates on its own virtual computer accessible from within the ChatGPT.com interface and alternates between reasoning and direct action to complete complex, multi-step tasks. 

Unlike a standard chatbot, it can click and scroll through pages, fill out forms, download and manipulate files – all in response to natural language instructions. 

At the core of this capability is a model that combines the web-interaction skills of OpenAI’s earlier "Operator" prototype with the deep analytical and synthesis skills of the "Deep Research" mode. 

ChatGPT Agent News and Availability

Click the "+" icon to the left of the prompt field, open the Tools menu, and then choose "Agent mode" (it’s the last item in the drop-down list). If you don’t see "Agent mode" there, the feature hasn’t been rolled out to your account yet, and there’s currently no other way to enable it.

A new ChatGPT agent

Not everyone can try it yet. On launch day OpenAI began rolling the feature out to paying Pro, Plus, and Team subscribers. Enterprise and Education customers are queued for "the coming weeks", while free-tier users must wait. European Economic Area and Swiss residents are still in the waiting room while regulatory work proceeds, so if you don’t see "Agent mode" in your Tools menu, the feature simply hasn’t reached your region or plan.

OpenAI ChatGPT Agent EU availability

Users based in the EU, who have a Pro subscription, on the evening of 18 June still can’t see the new feature.

Pro users can try a stop-gap at https://operator.chatgpt.com, but this isn’t the announced ChatGPT Agent — just one component: a simple browser agent that loads pages and executes tasks. It’s quite buggy, for example, CAPTCHAs on websites can’t be solved within the agent’s interface, either automatically or manually. We’ll see how the full ChatGPT Agent handles jobs like these once it finally rolls out. Deep Research mode is available from the Tools menu, but it runs more slowly and cannot click through sites as mode "Operator" does.

OpenAI has not committed to a timeline. Tech press summaries simply repeat that official EU access is "coming" but "indefinite for now".

However, if you need unofficial access and has PRO account (not PLUS), just use an American VPN. Even if you have a European card, it still works. However, this method is unofficial and access could be blocked at any time.

OpenAI ChatGPT Agent availability UK

Many UK Pro users received access the day the feature launched. UK users get the same virtual computer with visual/text browsers, terminal, and connectors described in the launch post. OpenAI. If "Agent mode" still doesn’t appear, give it a little time - OpenAI is enabling accounts in batches.

Capabilites the new ChatGPT agent brings to web users of ChatGPT.com

Real-World Utility Examples

ChatGPT Agent’s unified skills significantly broaden its usefulness in both professional and personal contexts. 

  • Professional tasks: It can automate burdensome office chores, such as updating spreadsheets with new financial data while preserving all formulas and formatting. Instead of you manually doing these repetitive tasks, you can delegate them in natural language and the agent will handle the execution end-to-end.
  • Personal tasks: In everyday life, it can serve as a smart concierge. For example, it can find medical specialists and book appointments for you based on your schedule. It can orchestrate complex personal projects or errands that involve multiple steps and information sources, saving you a lot of time and effort.

User Control, Oversight & Privacy

As ChatGPT Agent works, it provides a live narration of its actions and explicitly asks your permission before taking any step with real-world consequences (such as making a purchase or sending an email). 

You can intervene at any time – pausing or stopping the agent, taking over the browser manually, or steering it in a different direction. 

Certain critical actions run in a supervised "Watch Mode" that requires you to approve each step. Likewise, the agent is trained to proactively refuse obviously high-risk requests – for example, it will decline to carry out a bank transfer or other sensitive financial operation on its own. If a task is long-running, you’ll get a mobile push notification when it finishes so you can review the results.

Strong privacy controls are also built in. With a single click, you can wipe all browsing data the agent has accumulated and log it out of any websites it accessed on your behalf. (Otherwise, site cookies persist according to each site’s policy, which helps speed up repeated visits by avoiding extra logins.) 

In secure browser takeover mode – when you temporarily take control to enter credentials or navigate – anything you type (like passwords) is kept private and not stored or sent to the model. This ensures the AI never actually sees your sensitive inputs, since it doesn’t need them, and thereby prevents those details from being retained in the AI’s context. 

You can also disable any connected apps or accounts (connectors) at any time to limit what data the agent can access. 

Finally, ChatGPT Agent supports scheduling recurring tasks: once the agent completes a task, you can set it to run automatically on a schedule (for example, "generate my weekly metrics report every Monday morning") to further automate your workflow.

Built-in Tools, Connectors & Environment

To accomplish its tasks, ChatGPT Agent has access to a suite of built-in tools and can choose the best tool for the job. 

These include a GUI visual web browser (to interact with websites like a human would), a text-based browser (for quick retrieval and reasoning over large text documents), a full terminal for code execution, and the ability to make direct API calls

The agent can also use ChatGPT connectorsintegrations to third-party services such as Gmail or GitHub – which allow it to securely fetch information from your email, calendars, or other apps and incorporate that data into its responses.

If needed, you can hand the agent an authenticated session on a website via a secure takeover mode (logging it into a private portal yourself) -  this lets it access gated or user-specific content, and you can resume manual control at any moment.

All these tools run within one coherent virtual computer environment that preserves state across tool switches. This means the agent can move from one modality to another without losing context. 

For example, it might retrieve information through an API (say, querying your calendar events), then use the text browser to reason over a lengthy document, and finally switch to the visual browser to interact with a site’s interface – all within one continuous session. 

The virtual machine allows it to download files from the web, manipulate them via terminal commands or code, then open the results back in the browser for you to view. 

This adaptive approach enables the agent to choose the most efficient path to complete tasks: it might process structured data via API, handle large text analysis in the text browser for speed, and use the visual browser only when necessary for human-oriented pages. By preserving memory and state, the agent can carry out multi-step workflows with speed and accuracy, without redundant back-and-forth or losing track of intermediate results. 

You can interact with it in a highly iterative and collaborative way – you can interrupt at any time to give new instructions or clarifications, and the agent will incorporate your feedback and continue the task without starting from scratch. Similarly, the agent may ask you for additional details if needed to ensure it’s on the right track.

Risk Mitigation & Safety Framework

Because this is the first time ChatGPT can directly take actions on the live web, OpenAI has implemented extensive safety measures. 

The controls from the Operator preview have been expanded to address the new risks of a general-purpose agent with web access. This includes safeguards around handling sensitive user data, limiting what it can do via the terminal (network access is restricted), and especially defending against prompt-injection attacks that could be hidden in webpages or metadata. 

Prompt injection is when malicious instructions are embedded in content the agent might read (for example, invisible text on a webpage) to trick the AI into doing something harmful or revealing private information. 

To counter this, ChatGPT Agent is trained to recognize and resist malicious or hidden prompts, and OpenAI runs a real-time monitor on all tool outputs to catch suspicious patterns. 

If the agent encounters anything that looks like a hidden instruction or an attempt to hijack its behavior, it will refuse or seek confirmation. 

Moreover, the system requires explicit user reconfirmation before any high-impact action (like those involving sensitive data or transactions), which adds a human check that further reduces the chance of a hidden prompt causing damage. Users are also advised to be mindful of what data they expose the agent to and to disable connectors when not needed, as additional precautions.

Given the agent’s expanded capabilities, OpenAI has classified ChatGPT Agent as having "High Biological and Chemical Capability" potential, so they have nonetheless activated the highest tier of safety measures for this category. That safety stack includes threat modeling for misuse scenarios, special training to refuse or flag any dual-use dangerous content, always-on classifier systems and reasoning monitors watching the agent’s behavior, and strict policy enforcement pipelines to intercept potentially harmful actions. 

OpenAI has also been coordinating with external experts in biosecurity, government and academia – even hosting a biodefense workshop with national labs and NGOs – to get outside input on shoring up these defenses. 

As part of the safety push, OpenAI has published a detailed system card describing the agent’s risk analysis and mitigations, and launched a public bug bounty program to encourage outside researchers to find any vulnerabilities or unintended behaviors so they can be fixed quickly.

Availability, Roll-out & Quotas

ChatGPT Agent became available starting July 18, 2025 for subscribers on the Pro, Plus, and Team plans. Pro users (who pay for the highest tier) got access immediately on launch day, while Plus and Team users were slated to get it over the following few days as the rollout progressed. 

Enterprise and Education tier customers are expected to gain access in the coming weeks after launch. 

Usage of the agent is metered: Pro subscribers receive a generous quota of 400 agent messages per month, whereas Plus and Team users get 40 agent messages per month included. If users need more, OpenAI allows purchasing additional agent message capacity via a flexible credit system. (These quotas are separate from regular ChatGPT usage and exist because agent tasks consume more resources.)

Currently, the feature is not enabled in certain regions – pending regulatory clearance or additional compliance measures. OpenAI has acknowledged this regional hold and will presumably turn it on in those locations once they address local data and privacy requirements.

The introduction of ChatGPT Agent also means some earlier beta features are being retired or folded in. The standalone Operator preview site (operator.chatgpt.com), which previously demonstrated the web-browsing agent capabilities, will remain live only for a few more weeks and then be shut down. Its functionality is effectively merged into the new agent mode. Additionally, the original "deep research" mode (which provided very lengthy, in-depth research answers with extensive citations) has been integrated as part of the agent’s capabilities. For users who prefer the old deep-research style – which can take longer but gives more exhaustive reports by default – that option is still available in the ChatGPT UI by selecting "Deep Research" from the mode dropdown in the message composer. 

Benchmark Performance Highlights

OpenAI tested ChatGPT Agent on a variety of challenging benchmarks and the results show substantial performance gains over previous models:

Humanity’s Last Exam (HLE)

On this extremely difficult expert-level QA test across many subjects, the agent achieved 41.6% accuracy (pass@1), which is a new state-of-the-art and a huge jump from prior models. This beats the previous best from the dedicated Deep Research mode (26.6%) on the same benchmark. In other words, ChatGPT Agent answers correctly on roughly 4 out of 10 graduate-level questions, compared to only about 1 out of 5 for its predecessor – a significant improvement on a test explicitly designed to be harder than what any AI had seen.

FrontierMath (Tier 1–3)

On the hardest open-ended math problems (novel, unpublished challenges that even human math experts struggle with), the agent scored 27.4% accuracy. This vastly outperforms earlier models – for context, the previous o3 model was around 10.3% on these problems. Having an agent that can solve over a quarter of these elite math problems is a major leap forward (prior systems were in the single-digits or low teens percentage range).

Complex Knowledge-Work Tasks

In an internal benchmark mimicking real-world knowledge work (conducting research and analysis tasks that might take a human hours), ChatGPT Agent’s outputs were rated comparable to or better than expert human results in roughly half the cases. It also significantly outperformed the older models (o3 a) on these tasks across all tested time budgets. In other words, given difficult professional assignments (like writing an analytical report or creating a detailed project plan), the agent’s work was on par with human experts about 50% of the time. This demonstrates that the agent isn’t just good at quizzes – it can handle practical, economically valuable tasks and often do as well as top human professionals in those domains.

DSBench (Data Science Benchmark)

This benchmark tests realistic data analysis and data modeling workflows. ChatGPT Agent not only surpassed other AI models here, it even exceeded human performance on some metrics. For the data analysis portion, the agent scored 89.9%, versus around 64% for human analysts performing the same tasks. On data modeling tasks, the agent reached 85.5%, well above the human baseline (around 65%). These are remarkable results – the AI is actually outperforming experienced humans in interpreting and modeling data in this test, highlighting how effective the agent can be when it can use tools (like Python) to assist its analysis.

SpreadsheetBench

This is a suite of 912 spreadsheet editing questions (requiring formula adjustments, data manipulations, etc). ChatGPT Agent achieved 35.27% accuracy overall on this benchmark – which may sound modest, but it’s significantly higher than other AI models: OpenAI’s o3 managed 23.3%, Microsoft’s Copilot in Excel about 20.0% on the same test. Moreover, when the agent was allowed to directly edit Excel files (rather than just instructing changes), its score jumped to 45.54%. While still below the 71.3% human expert accuracy on these tasks, the agent outperformed specialized tools like Copilot. It also maintained formulas and formatting, not just values.

Investment Banking Financial Modeling

On a set of first- to third-year investment banking analyst tasks ( building a three-statement financial model for a company, complete with correct formulas and citations), ChatGPT Agent averaged 71.3% accuracy. This is a strong result considering humans are far from perfect on these complex multi-step problems. It substantially beat the older "Deep Research" agent mode (which scored ~55.9%) as well as the base o3 model (~48.6%) on the same evaluations. This means the AI can perform many financial analysis tasks at a level that approaches (and in many cases exceeds) a well-trained junior analyst – a notable achievement in a field that requires both domain knowledge and attention to detail.

BrowseComp (Agentic Web Browsing)

This benchmark measures how well an AI agent can find hard-to-locate information on the web. ChatGPT Agent set a new record with 68.9% success on BrowseComp, which is about 17.4 percentage points higher than the previous best achieved by the Deep Research mode (51.5%). It also well exceeds the older o3-based agent’s performance (49.7%) in this category. The agent is much better at using a browser to answer obscure questions or gather information than past models – an indication of its improved search strategies and tool use on the open web.

WebArena (Real-World Web Tasks)

In the WebArena benchmark (which tests an agent’s ability to complete practical tasks through a web interface), ChatGPT Agent scored 65.4%. This is a few points higher than the previous agent built on the o3 model (62.9%), showing that it has surpassed prior agents in real web task performance. However, it’s still below the human performance level on WebArena (humans score about 78.2% on these tasks). So while the agent is closing the gap in web task proficiency, there remains room to improve before it can match an expert human operator in all cases.

API Equivalents

ChatGPT Agent can click around websites, scroll, fill forms, run code, and complete end-to-end tasks on its own.

The Responses API already provides a computer_use tool (since March 2025) that lets developers programmatically control a headless browser. In other words, developers have had this capability behind the scenes.

The flashy on-stage demo is built on tech that third-party devs already use. However, now that it’s packaged neatly in ChatGPT’s UI, users will expect the same seamless automation from all agents. Developers need to match that user experience in their products (or focus on specialized niches where they can do even more).

ChatGPT has one-click connectors to Gmail, GitHub, Google Drive, etc., allowing it to directly interact with those services.

There’s no official Connectors API for developers (yet). Developers can integrate services by writing their own OAuth flows or using unofficial means (the Deep Research agent or community plugins), but OpenAI hasn’t released a standard connector framework in the API.

This is a missing piece in the developer stack. Until OpenAI offers connectors to everyone, custom-agent developers can add value by building these integrations themselves. If you build a domain-specific agent (say, for finance or healthcare), you can create connectors to the services your users care about. You’ll be stepping in where the official platform currently doesn’t go, which can be a competitive advantage. (Keep in mind OpenAI has hinted connectors will come to the API once stabilized, so plan for that future.)

ChatGPT now supports scheduled runs (recurring Tasks) with notifications (like a weekly report every Monday).

There is no built-in scheduling or cron feature in the API. A developer using the API must implement their own scheduling system (using cron jobs or backend services) or wait for an official "Tasks SDK" that might be in the works.

If your agent’s value proposition involves doing things on a schedule (daily summaries, periodic reports, etc.), you currently have to handle scheduling logic yourself. This is extra engineering work, but it also means you can differentiate on reliability and customization of scheduled tasks. The fact that OpenAI showed scheduling in the product hints that a developer solution might come, but until then, consider it a space to innovate.

The ChatGPT Agent emphasizes user control, with live narration of its actions, a ‘watch mode’ for sensitive tasks (like sending emails) that requires user approval, and a privacy toggle to wipe data.

The Agents SDK and API provide hooks and events for similar guardrails. Developers get callbacks or flags when the agent is about to do something sensitive, and can build in confirmations. The Moderation API and safety guardrails are also available for content filtering. However, these are not automatically enforced in a custom UI — you as the developer must make use of them.

OpenAI has put a lot of work into a friendly interface that keeps users in charge (narrating steps, asking permission, etc.). If you’re building your own agent app, you should mirror these trust features. By using the guardrail hooks in the SDK, you can show users what the agent is doing, ask for confirmations on critical steps, and provide easy ways to pause/stop. This will not only match user expectations set by ChatGPT, but also protect you legally and ethically. Safety and transparency should be part of your UX just as they are in ChatGPT’s product.

OpenAI’s ChatGPT Agent launch includes quotas: Pro subscribers get 400 agent actions per month, Plus users get 40 per month (with more purchasable via credits).

The API’s usage is metered by tokens and processing time, not by a fixed number of actions. There’s no concept of a fixed monthly action count in the API - it’s pay-as-you-go. Developers pay per token and for tool usage seconds, with their own limits set by organization quotas or rate limits.

If you build an agent for end-users, you have flexibility in the business model. You could offer unlimited use (within fair limits) if the cost in tokens is manageable, or set your own pricing tiers. The ChatGPT quotas might make some users feel constrained, which could drive them toward specialized third-party solutions. Keep an eye on whether OpenAI aligns API pricing with the ChatGPT-style quotas in the future.

No new model family, but major performance jump

The big boost does not come from a brand-new model but from smarter coordination — specifically, how the agent chains together browsing, coding, and file tools during execution. 

These tools have been available in the API since March, but combining them in a continuous agent loop enables the system to think ahead and reuse context. 

This orchestration now leads expert benchmarks and, in many cases, even surpasses human performance on high-value tasks. The takeaway from July’s launch: With the right setup, you can get much more out of existing models.

User expectation has jumped

Before, an average user might have been delighted if a third-party GPT-based app could, say, give a good answer with references. Now, after seeing ChatGPT Agent, users will expect AI agents to take action and complete tasks autonomously. 

Agents are now expected to not just chat, but act. For developers, this means that simply providing an answer isn’t enough to impress — your agent may need to book the restaurant, format the spreadsheet, or orchestrate the workflow just like ChatGPT can. 

OpenAI Agents SDK 0.2.x Release

On July 17, 2025, OpenAI shipped version 0.2 of its Agents SDK. The release folds three production-grade capabilities into the library we already use to build AI assistants.

Guardrails now run alongside every model or tool step. They apply the same policy filters OpenAI uses in ChatGPT, so unsafe content or prompt injection attempts are blocked before they reach users.

A persistent session object is included. By default, it stores chat history in a local SQLite file, but the interface lets us plug in Postgres, Redis, or any other store that meets our data governance requirements. With sessions in place, the agent automatically “remembers” prior turns - developers no longer need to stitch conversation history into each prompt ourselves.

Tracing records every loop iteration — each model call, tool invocation, and agent handoff — and exposes the log through a simple UI and JSON feed. Developers can replay a conversation after the fact, pinpoint slow or expensive steps, and provide auditors with a full trail of agent decisions.

Together, these features replace much of the custom glue code developers maintain today for safety checks, state management, and debugging. 

Migrating an existing prototype agent is expected to take one to two developer-days: change the import paths, enable guardrails, choose a session backend, and turn on tracing.

Areas that still need solutions, which developers can address

Connectors not open to API

This gap means a developer can create a custom agent with connectors to niche services or internal systems today, while OpenAI’s own connectors are limited or beta. If you have industry-specific knowledge (say an agent for medical records or for enterprise CRM systems), you can integrate that now. By the time OpenAI releases an official connectors API, you could already have a reputation or customer base for that integration.

No built-in scheduling for API

If your use-case benefits from periodic or background tasks, you can build that infrastructure and offer something ChatGPT can’t (yet) do outside its interface. It’s more effort, but for certain enterprise clients or power users, that could be a deciding factor.

Session length and persistence

ChatGPT’s UI keeps the virtual computer session alive 30 minutes, whereas API sessions time out after 5 minutes idle. If long-lived sessions matter (monitoring something continuously or lengthy workflows), developers might soon get tools for it, but if not, you could engineer a workaround (perhaps by intelligently re-initializing context) and advertise that stability.

Geographic and compliance constraints

The ChatGPT Agent is geofenced out of the EU (EEA) at launch, likely due to regulatory concerns. This means EU-based enterprises or users can’t use ChatGPT Agent yet. If you are a developer or company operating in Europe, this is a moment to pitch custom solutions that run on the API (which can be configured for data residency and compliance). You can provide AI agent services without the geofence, addressing privacy or legal requirements in ways the main product might not initially.

Advice for developers

Since the core capabilities (browse the web, execute code, handle files, etc.) are becoming commodities available to anyone using OpenAI’s platform, the strategic advice is to differentiate in other ways.

Proprietary data or knowledge

Incorporate company-specific data or industry expertise that a general ChatGPT agent won’t have by default. For example, a financial agent that knows your accounting system or a medical agent that has access to up-to-date research in a specialized field.

User experience & integration

Design a user interface or workflow that fits a particular job better than ChatGPT’s general UI. Maybe a browser extension, or integration directly into a team's Slack/SharePoint, or a voice interface for use on the go. Provide a streamlined experience for specific tasks.

Trust and custom guardrails

As mentioned, use the guardrail primitives to build trustworthy agents. Enterprises might prefer an agent whose every action is logged, auditable, and with admin controls — something you can build with the API on top of OpenAI’s models.

Branding and relationships

Companies might prefer a branded solution (their own AI assistant) for psychological or business reasons. As a developer, you can offer white-label agents using OpenAI under the hood. You maintain the client relationship and can tailor the solution exactly to their needs, which the general ChatGPT cannot do for each client.

Beyond the Hype: The Hidden Costs of "Mostly Correct" AI Automation

Today’s AI "agents" promise to handle whole jobs for us, but leaders should view them less as veteran employees and more as unreliable interns: fast, tireless, and eloquent, but prone to costly blunders if left unsupervised. Even a two-percent error rate sounds tiny until you chain dozens of automated steps together. At that scale, the math virtually guarantees something will break. A mistyped year in a financial report, a stray decimal in pricing, or a logic bug in a data pipeline can ripple outward and cost millions—and chasing those hidden flaws often erases any time the AI saved.

Security risks compound the problem. Because an agent blindly interprets any text it is fed, a malicious instruction buried inside an email, PDF, or even hidden HTML can hijack the system—what researchers call "prompt injection". If that agent also holds privileged credentials, one invisible line of text can trigger data exfiltration or an unauthorized purchase, turning the model into an unwitting insider threat. Handing high-powered tools to a model that can be tricked so easily is why regulators and risk teams are starting to worry.
Regional regulations complicate deployment: Europe’s stricter privacy rules already delay some advanced features, forcing companies to navigate a patchwork of compliance before rolling agents out at scale.

Yet squeezing the next few "nines" of accuracy from these systems is ruinously expensive. Improving a model from 98 percent to 99.99 percent reliability can demand orders of magnitude more data, compute, and money—much like the stalled quest for full self-driving cars. Hardware limits, energy costs, and a shortage of fresh, high-quality training data all hint that brute-force scaling is hitting a wall, and future gains will come from new architectures: smarter algorithms, not simply bigger models.

History shows businesses tolerating small failure rates when the efficiency gain is significant, and AI will likely follow that pattern. Routine, rules-based office functions are already being automated, shifting human roles toward exception handling and firefighting. Far from ushering in leisure, faster tools usually raise management expectations. Employees often find their workload grows as they supervise the AI in addition to their original duties.

Oversight remains the best defense. The most successful teams break work into micro-steps, subject every result to automated tests or human review, and immediately pull the agent off the task if something looks odd. Treating the model like a junior colleague who drafts at lightning speed but needs line-by-line checks prevents "tech debt" from snowballing. Without that discipline, whatever time you saved on first drafts you will pay back—often with interest—during debugging and clarification.

The safest technical blueprint emerging today isolates privileges: one sub-agent can read untrusted text but has no authority to act, another holds the keys to sensitive tools but never sees raw external data. "Sandwiching" these models behind strict guardrails, capping their spending power with prepaid cards, and inserting human approval for anything high-impact are becoming basic.