Realtime regression in non-English production voice agents: gpt-realtime-mini vs gpt-realtime-mini-2025-10-06

We operate a production AI voice platform built on the OpenAI Realtime API via WebSocket/SIP/Twilio.

We are currently deploying our solution across a few dozen locations nationwide for an enterprise client, and we have encountered a material regression when comparing the dated Realtime snapshot we validated in production against the listed replacement model.

Model validated in production:
gpt-realtime-mini-2025-10-06

Listed replacement:
gpt-realtime-mini

Issue:
In non-English voice-agent flows, with Romanian as our current production case, the replacement model shows noticeably worse language quality and worse faithfulness to supplied business data.

The most serious issue is not just spelling or phrasing. We have observed the newer model hallucinating non-existing departments, services, and operational details that were not present in the database/context. The older snapshot, gpt-realtime-mini-2025-10-06, has been significantly more faithful to the provided information and less prone to confabulating unavailable services or internal departments.

This is important because the older snapshot was not selected casually. It was selected after thousands of hours of testing, R&D, and practical validation in Romanian-language voice-agent scenarios. Its reliability in staying faithful to provided business information is one of the reasons we currently depend on it for production deployments.

Impact:
This affects an active enterprise rollout across a few dozen locations nationwide. The regression impacts:

live AI phone conversations;
appointment and call summaries;
CRM/customer records;
operational reporting;
client trust during rollout.

We are concerned that this may not be limited to Romanian, but may reflect broader non-English quality/faithfulness differences between the dated snapshot and the current gpt-realtime-mini alias.

Evidence:
We have transcription evidence and can provide side-by-side examples comparing the same or similar flows between gpt-realtime-mini-2025-10-06 and gpt-realtime-mini.

Questions:

Has anyone else observed worse non-English performance or worse faithfulness to supplied data on gpt-realtime-mini compared to dated Realtime snapshots?
Is OpenAI tracking language-specific regressions for Realtime models before snapshot deprecations?
Is there a path for production customers to request temporary extended access or a migration path when a listed replacement model is not behaviorally equivalent?

We are committed to building on OpenAI’s Realtime infrastructure, but we need a reliable migration path before moving production enterprise traffic away from the currently working snapshot.

VeitB May 10, 2026, 5:51pm 2

Hi!

Thank you for raising this.

You are probably already aware that you can still use gpt-realtime-mini-2025-10-06. You do not need to use the undated model slug, which currently points to the December snapshot.

For production systems, dated model versions are usually the better choice because they help keep behavior more consistent.
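As a minimal sketch of what pinning looks like in practice, the snippet below builds a Realtime WebSocket URL and refuses undated aliases. It assumes the model is passed as a `model` query parameter on the Realtime WebSocket endpoint; the helper names `is_dated_snapshot` and `build_realtime_url` are illustrative and not part of any SDK.

```python
import re
from urllib.parse import urlencode

# Assumption: the Realtime WebSocket endpoint accepts the model as a query parameter.
REALTIME_WS_BASE = "wss://api.openai.com/v1/realtime"

# Dated snapshots end in -YYYY-MM-DD; undated aliases such as gpt-realtime-mini do not.
_DATED_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def is_dated_snapshot(model: str) -> bool:
    """Return True if the model name carries a date suffix."""
    return bool(_DATED_SUFFIX.search(model))


def build_realtime_url(model: str) -> str:
    """Build the connection URL, rejecting aliases that can silently change."""
    if not is_dated_snapshot(model):
        raise ValueError(f"refusing undated alias {model!r}; pin a dated snapshot")
    return f"{REALTIME_WS_BASE}?{urlencode({'model': model})}"


url = build_realtime_url("gpt-realtime-mini-2025-10-06")
```

Failing fast on an undated name is a design choice for this sketch: it turns an accidental alias in configuration into a startup error instead of a silent behavior change.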

Since I do not see a deprecation notice for the older snapshot, using it seems reasonable here.

In this case, I would suggest staying on the older snapshot for now and waiting for a possible gpt-realtime-mini-2 release, or reviewing the Realtime model best practices here:

Thank you for the reply.

Just to clarify one important point: gpt-realtime-mini-2025-10-06 does appear to be listed on the official OpenAI deprecations page (Deprecations | OpenAI API).

On the current deprecations page, under the section for legacy GPT model snapshots, I see the following entry:

Shutdown date: 2026-07-23
Model snapshot: gpt-realtime-mini-2025-10-06
Substitute model: gpt-realtime-mini
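For planning purposes, the remaining runway can be computed from that entry. This is only a sketch: the `SHUTDOWN_DATES` table is a hand-maintained copy of the entry above, not something fetched from an API, and the "today" value is simply the date of this thread.

```python
from __future__ import annotations

from datetime import date

# Hand-copied from the deprecations entry quoted above (not fetched from any API).
SHUTDOWN_DATES = {"gpt-realtime-mini-2025-10-06": date(2026, 7, 23)}


def days_until_shutdown(model: str, today: date) -> int | None:
    """Days left before the listed shutdown, or None if no entry is recorded."""
    shutdown = SHUTDOWN_DATES.get(model)
    if shutdown is None:
        return None
    return (shutdown - today).days


# As of this thread (May 10, 2026):
print(days_until_shutdown("gpt-realtime-mini-2025-10-06", date(2026, 5, 10)))  # 74
```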

So while we can still use the dated snapshot for now, the problem is that it appears to have a scheduled shutdown date, and the listed substitute model does not currently behave equivalently in our non-English/Romanian production voice-agent flows.

That is the core issue we are trying to surface: the dated snapshot is currently production-stable for us, but the documented replacement introduces worse language quality and more hallucination/confabulation against supplied business data.

For that reason, we are looking for either:

confirmation of whether this deprecation entry is accurate;
a migration path for production users affected by language-specific regressions;
or guidance on how to escalate this as a production-impacting model quality regression before the shutdown date.

We are happy to provide side-by-side transcripts showing the behavioral difference.

VeitB May 10, 2026, 7:25pm 4

You’re right. I missed the deprecation notice, and my previous answer was incorrect.

The snapshot gpt-realtime-mini-2025-10-06 is scheduled to shut down on July 23, 2026, with gpt-realtime-mini listed as the replacement.

Right now, the best option may be to wait and see if a newer snapshot addresses these regressions, or revisit the Realtime prompting guide. Sorry for the confusion, and thanks for catching this.

Thank you for the clarification.

The difficulty for us is that revisiting the Realtime prompting guide is unlikely to solve the core issue. We have already spent thousands of hours internally prototyping, testing, and comparing different Realtime/audio-capable model configurations for Romanian production voice-agent use cases.

The stack we currently run, based on gpt-realtime-mini-2025-10-06, is not something we selected casually. It is the result of substantial internal R&D, repeated production-like testing, and comparison against other available models. For our specific use case (Romanian AI voice agents that need to remain faithful to database-fed business information), this snapshot has been by far the best option.

The issue we are seeing with the replacement model is not just something that can be corrected with minor prompt changes. The newer gpt-realtime-mini shows worse Romanian/non-English quality and, more importantly, worse faithfulness to supplied business data. In our tests, it has hallucinated non-existing departments, services, or operational details that were not present in the database/context, while the older snapshot behaved much more reliably.

That reliability is precisely what allowed us to enter the market quickly and attract a large number of voice and call-center clients, both small and large. Although we have not been in the market for long, this OpenAI Realtime-based stack enabled us to grow rapidly, and it is now central to several major enterprise deployments.

This is why the scheduled shutdown is such a serious concern for us. In our enterprise deployments, model behavior is not a minor implementation detail but the foundation of the product’s trustworthiness. A forced migration to a replacement model with materially worse Romanian/non-English performance could jeopardise major contracts we currently hold, along with future ones we are negotiating.

It also affects OpenAI commercially. Based on our current pipeline, we project our OpenAI API usage could reach the ~$50,000/month threshold by Q3 as these deployments scale. Our preference is to continue building and scaling on OpenAI’s Realtime infrastructure, but we need a reliable migration path before moving production traffic away from the validated snapshot.

That is why we were hoping to speak with someone from OpenAI who could look at this from a production/API customer perspective, not only as a general prompting issue. Ideally, we would like to understand whether OpenAI can consider one of the following:

delaying the shutdown of gpt-realtime-mini-2025-10-06 for production users affected by language-specific regressions;
providing temporary extended access while regressions are investigated;
recommending or releasing an alternative Realtime/audio-capable model with equivalent Romanian/non-English faithfulness;
or routing this as a production-impacting model quality regression for Realtime API users.

We can provide side-by-side transcripts and audio recordings comparing gpt-realtime-mini-2025-10-06 and gpt-realtime-mini under comparable flows and configuration.

f10w May 12, 2026, 2:42pm 6

Same issues for `gpt-realtime` upgrade: GPT-Realtime 1.5 — Major regression in voice expressiveness and accent quality - #4 by rich.gilmour

For voice agents, the first thing the clients care about is actually the voice! Intelligence only comes second (later in the tests/demos).

Yes, voice quality and accent quality definitely matter in demos, especially for client perception.

In our case, though, the bigger issue is not raw intelligence or even voice expressiveness. The older gpt-realtime-mini-2025-10-06 snapshot was already good enough for Romanian voice quality and much more reliable for our production constraints.

The serious regression for us is faithfulness to supplied business data.

For example, in one clinic test case, the newer gpt-realtime-mini told a patient that the clinic had a neurology department, even though no such department existed in the database/context it had access to. That kind of hallucination did not happen with the dated snapshot under the same type of flow. The older model stayed much more faithful to the information we provided.

Mispronunciations or less natural phrasing would be more forgivable. Annoying, but manageable. Inventing non-existing departments, services, or operational details in a healthcare/clinic setting is a completely different level of risk. That can directly damage client trust and create operational problems for the business using the agent.

So for us the issue is not simply:

“the newer model sounds worse”

but rather:

“the newer model is less reliable as a production voice agent because it confabulates facts that were not in the business data.”
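One rough way to make this kind of regression measurable is to scan agent transcripts for department names that are not in the supplied business data. The sketch below is an assumption-laden illustration, not the harness used in these tests: the function names, the department lists, and the Romanian sample sentences are all hypothetical, and it only matches single-word terms.

```python
from __future__ import annotations

import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip diacritics so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def unsupported_mentions(transcript: str, allowed: set[str], watchlist: set[str]) -> set[str]:
    """Return watchlist terms the agent said that are absent from the supplied data."""
    words = set(re.findall(r"\w+", normalize(transcript)))
    allowed_norm = {normalize(term) for term in allowed}
    return {
        term
        for term in watchlist
        if normalize(term) in words and normalize(term) not in allowed_norm
    }


# Hypothetical clinic data: departments that actually exist in the database.
ALLOWED = {"cardiologie", "dermatologie"}
# Hypothetical watchlist: department names worth checking for in any transcript.
WATCHLIST = {"cardiologie", "dermatologie", "neurologie", "ortopedie"}

flagged = unsupported_mentions(
    "Da, clinica noastră are un departament de neurologie.", ALLOWED, WATCHLIST
)
print(flagged)  # {'neurologie'}
```

Running the same scripted flows against both models and comparing the number of flagged transcripts would give a simple per-model hallucination rate. Keyword matching will miss paraphrases and multi-word names, so it is a lower bound rather than a full faithfulness metric.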

I also can’t really comment on whether the larger non-mini Realtime models solve this, because our unit economics are razor-thin by design. We are trying to offer some of the lowest possible AI voice pricing per minute in our local market, while competing with local companies that already offer very cheap per-minute voice solutions. Even if a larger gpt-realtime model performed better, switching from gpt-realtime-mini to the more expensive Realtime tier is not a viable migration path for our business model at scale.

That is why gpt-realtime-mini-2025-10-06 matters so much to us. It gave us the right balance of cost, Romanian performance, and faithfulness to supplied data. The current replacement does not appear behaviorally equivalent for our use case.

f10w May 13, 2026, 9:35am 8

You definitely should switch to `gpt-realtime-2` if the voice is not a major issue.

Unfortunately, switching to gpt-realtime-2 is out of the question for us at scale.

The issue is unit economics. gpt-realtime-2 is roughly 3x more expensive than gpt-realtime-mini on tokens, which would force us to price our AI voice services above what the local market can realistically absorb.

In Romania, we are competing with local providers offering very cheap per-minute AI voice pricing. More importantly, our product needs to remain meaningfully cheaper than the hourly cost of human labor at local wage levels. If our AI call center pricing cannot undercut local minimum-wage labor economics, it loses a large part of its commercial appeal.

This is especially important because our local market is currently under strong economic pressure. Many clients are not buying AI voice agents as a luxury upgrade; they are evaluating them as a way to reduce costs, handle more calls with fewer resources, or improve ROI through cheaper operational methods.

If moving from gpt-realtime-mini to gpt-realtime-2 forces our per-minute pricing above local competitor offers and too close to or above local labor-cost equivalents, it would kill our sales momentum.
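The constraint can be written as simple arithmetic. In the sketch below every figure is a made-up placeholder in arbitrary currency units per minute; only the roughly 3x multiplier comes from the discussion above. The function names and the margin model are likewise assumptions for illustration.

```python
def price_per_minute(model_cost: float, other_cost: float, margin: float) -> float:
    """Selling price per minute for given per-minute costs and a gross margin in [0, 1)."""
    return (model_cost + other_cost) / (1.0 - margin)


def is_viable(price: float, competitor_price: float, labor_cost: float) -> bool:
    """Viable only if the price matches or beats competitors and undercuts human labor."""
    return price <= competitor_price and price < labor_cost


# Placeholder inputs (hypothetical, not real pricing).
MINI_MODEL_COST = 0.02   # model cost per minute on the mini tier
OTHER_COST = 0.01        # telephony, hosting, and so on
MARGIN = 0.25            # target gross margin
COMPETITOR_PRICE = 0.06  # cheapest local competitor
LABOR_COST = 0.08        # per-minute equivalent of local labor

mini_price = price_per_minute(MINI_MODEL_COST, OTHER_COST, MARGIN)
larger_price = price_per_minute(3 * MINI_MODEL_COST, OTHER_COST, MARGIN)

print(is_viable(mini_price, COMPETITOR_PRICE, LABOR_COST))    # True
print(is_viable(larger_price, COMPETITOR_PRICE, LABOR_COST))  # False
```

With these placeholder numbers the mini tier clears both ceilings while a 3x model cost clears neither, which is the shape of the problem regardless of the exact figures.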

That is why the mini tier is critical for us. The problem is not that we need “the most intelligent model available.” We need a mini-class Realtime model that preserves the cost profile of gpt-realtime-mini while maintaining the Romanian/non-English faithfulness and production reliability we saw in gpt-realtime-mini-2025-10-06.

For our use case, gpt-realtime-mini-2025-10-06 hit the viable balance: low enough cost, acceptable Romanian voice quality, and strong faithfulness to supplied business data. A 3x cost increase is not a workable migration path in our market.

Interesting topic. I’m going to test gpt-realtime-mini-2025-10-06 in Portuguese as well.

We are currently using gpt-realtime-mini, but the quality has been quite poor for Portuguese voice agents. We are seeing issues with context loss, weaker understanding, and overall lower reliability in real customer scenarios.

This is directly affecting our ability to expand the project to more customers and use cases. The standard gpt-realtime model performs better, but it is too expensive for our use case and not economically viable at scale.

It would be very helpful to understand whether gpt-realtime-mini-2025-10-06 is expected to improve non-English voice quality, especially for Portuguese, or if there are recommended settings to reduce these issues.

I have tried contacting support and enterprise sales about these issues and have received no response. Meanwhile, gpt-realtime-mini seems to keep getting worse: it now gets appointment scheduling wrong, which never happened with gpt-realtime-mini-2025-10-06. Given the complete silence from OpenAI’s support and enterprise teams, we are considering switching to a voice model from another provider such as ElevenLabs or Retell. That may be more expensive on a per-minute basis, but OpenAI’s approach to communicating with us (as in, not at all) leaves us no other option. We are handling millions of minutes per month at this point, and to say OpenAI has been unprofessional with us is an understatement.