OpenAI: Streaming is now available in the Assistants API
(platform.openai.com)

Assistant API is too much of a beta still.
I was about to release an app based on the new Assistant API, but just a day before the release the response times increased to 8s flat. When I had function calls, that meant up to a minute to get a response.
I had to rip out everything Assistant API and reimplement it with the Chat API. Which turned out to be great, because in the Assistant API the context management was very bad, and after a few back-and-forth messages the cost ballooned to over 10K tokens per message.
When I looked closely at the Assistant API and Chat API, I noticed that the Assistant API is just a wrapper over the Chat API that acts as a web service storing the previous messages (so the slow response problem was probably due to the web server which keeps track of the context). So I went ahead and implemented my own "Assistant API" which gives me more control. For example, I set a max token cost per message, and if the context balloons over that, I make a request with the context and ask OpenAI to create a summary with all the facts so far, add that summary as a system prompt, and my context gets compressed back into reasonable territory.
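In code, that compression step is roughly this (a sketch with made-up helper names and thresholds, not my actual implementation):

```
# Sketch of the context-compression idea described above (hypothetical
# names; the threshold and prompts are illustrative only).
from openai import OpenAI

client = OpenAI()
MAX_CONTEXT_TOKENS = 4000  # arbitrary per-request budget

def compress_history(messages: list[dict]) -> list[dict]:
    """Ask the model to summarize the conversation, then restart the
    context from that summary as a system prompt."""
    summary = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages + [{
            "role": "user",
            "content": "Summarize the conversation so far, keeping every fact needed to continue it.",
        }],
    ).choices[0].message.content
    return [{"role": "system", "content": f"Summary of the conversation so far: {summary}"}]

def chat(messages: list[dict], estimate_tokens) -> str:
    # estimate_tokens is whatever counter you use (e.g. tiktoken over the message contents)
    if estimate_tokens(messages) > MAX_CONTEXT_TOKENS:
        messages[:] = compress_history(messages)
    reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    messages.append({"role": "assistant", "content": reply.choices[0].message.content})
    return messages[-1]["content"]
```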
It does considerably more than (poorly) managing the context window. It also (poorly) enables persistent document storage, knowledge retrieval, function calling and code execution.
I still don't even know what the Assistant API is supposed to afford me.
It's useful if you just need to hook up a chat assistant and don't want to bother with the busywork of doing it yourself. All you care about is loading the messages from the thread (which are conveniently kept for you) and adding new ones.
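In other words, something along these lines (a minimal sketch with the beta Python client; the message content is illustrative):

```
# Minimal sketch of "load the thread's messages and add new ones"
# with the (beta) Assistants API in the Python SDK.
from openai import OpenAI

client = OpenAI()

thread = client.beta.threads.create()  # OpenAI stores the messages for you

# add a new user message to the thread
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What's on my calendar tomorrow?",
)

# load the conversation back whenever you need to render it
for message in client.beta.threads.messages.list(thread_id=thread.id):
    print(message.role, message.content[0].text.value)
```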
Is the training method similar? For example, a company chatbot would need to know it’s a chatbot for Company Y.
So, the Assistant API in OpenAI is just a wrapper over the Chat API. They let you choose which model you would like to use, so if you fine-tune a model you should be able to use it.
However, I never tried fine-tuning; I rely on RAG, and the Assistant API does provide some tools to make this a bit easier. What tools? They provide an "editor interface" where you can set function calls, upload some files and access the code interpreter.
So if you are making a chatbot for Company Y, you can create an assistant which has information about Company Y in the system prompt and also can access up to date information about the company through function calls you define and the files you upload.
If you use only the Chat API, you will have to handle this stuff yourself. Actually, though I'm using the Chat API, I do use the Assistant editor UI to manage the functions and the system prompts. What I do is retrieve the assistant info from the OpenAI Assistant API and then use it with the Chat API. This way I don't have to bother with creating my own UI or fiddle with text files or the code.
As the Assistant API is just a wrapper, most of the data structures I receive from it work directly in the Chat API.
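Concretely, the pattern is roughly this (a sketch; mapping the assistant's fields onto Chat API parameters this way is my own convention, not an official one):

```
# Sketch of reusing an Assistant's configuration with the Chat API.
# The assistant itself is edited in OpenAI's web UI; only its ID is hard-coded here.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.retrieve("asst_...")  # your assistant's ID

kwargs = {}
# function tools carry over mostly as-is; code_interpreter/retrieval don't
function_tools = [t.model_dump(exclude_none=True) for t in assistant.tools if t.type == "function"]
if function_tools:
    kwargs["tools"] = function_tools

response = client.chat.completions.create(
    model=assistant.model,
    # the assistant's "instructions" field becomes the system prompt
    messages=[
        {"role": "system", "content": assistant.instructions},
        {"role": "user", "content": "Hello"},
    ],
    **kwargs,
)
print(response.choices[0].message.content)
```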
What training? Beyond supplying context, I don't think assistants involve any fine-tuning.
Yeah, that was kind of my take: it doesn't serve much purpose, if any, and only limits the capability.
Finally! I've been using the assistants api in building an ai mock interviewer (https://comp.lol) but the responses were painfully slow when using the latest iterations of the gpt-4 model. This will make things so much more responsive
I'd still want to see the entire response all at once. Having it stream in while I read it would be very distracting and make it difficult for me to read.
It's a request the front-end developer should be confronted with, not OpenAI.
The website could just as well buffer the incoming stream until the user clicks an area to request the display of the next block of the response, once they have finished reading the initial sentences.
Yes, it's like surfing porn in the early internet years over a dial-up modem: one line at a time, until you can finally see enough of the picture (reply) to realize it was not the reply you were looking for.
LLM streaming must be a cost-saving feature to prevent you from overloading the servers by asking too many questions within a short time frame. Annoying feature IMHO.
How is hiding it behind a loading spinner any better? You still can't spam it with questions since you need to wait for it to finish. With streaming you can at least hit the stop button if it looks incorrect, so you actually spam it more with it enabled.
For me, the constant visual changes of new parts being streamed in are annoying, and straining on the eyes. Ideally, web frontends would honor `prefers-reduced-motion` and buffer the response when set.
Personally, I've fallen in love with that visual effect of streaming text you're talking about. It's a bit pavlovian, but I think in my head it signifies that I'm reading something high signal (even though it isn't always).
It's more about UX, to reduce the perceived delay. LLMs inherently stream their responses, but if you wait until the LLM has finished inference, the user is sitting around twiddling their thumbs.
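With the plain Chat API, that UX difference is just the `stream=True` flag; for example (a minimal sketch):

```
# Sketch: print tokens as they arrive instead of after the full
# completion, which is where the "perceived delay" win comes from.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "A short fun fact about pigeons"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk has no content
        print(delta, end="", flush=True)
```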
Same, it was super slow and unusable when I tried it: 10 seconds for a reply or something. The GPT-4 API itself was way faster.
This was one of the limitations of the Assistants API that made me entirely ignore it up until now.
I am curious if the Assistants API lets you edit/remove/retry messages yet. I don't see anything implying this has changed. It's annoying that the Assistants API doesn't give you enough control to support basic things that the ChatGPT app does.
Like the other commenter said, edit/remove/retry messages can be implemented by the API client already. The API doesn't maintain state so every new message in a "conversation" includes previous messages as context. To edit a message you would re-submit the conversation history with the desired changes.
I get what you're asking for though. It would be nice if this was easier. But that would require OpenAI changing their API model to one where conversation history is stored on their server. It would be more of a "ChatGPT conversation API" than just a GPT-4/3.5 API.
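With the Chat API, "edit and retry" is literally just slicing the local history and re-sending it, e.g. (a sketch):

```
# Sketch: "editing" a message in a Chat API conversation is just
# truncating the locally stored history and re-submitting it.
from openai import OpenAI

client = OpenAI()

history = [
    {"role": "user", "content": "Write a haiku about rain."},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": "Now make it about snow."},
    {"role": "assistant", "content": "..."},
]

def edit_and_retry(history, index, new_content, model="gpt-3.5-turbo"):
    """Replace the user message at `index`, drop everything after it,
    and ask for a fresh completion."""
    new_history = history[:index] + [{"role": "user", "content": new_content}]
    reply = client.chat.completions.create(model=model, messages=new_history)
    new_history.append({"role": "assistant", "content": reply.choices[0].message.content})
    return new_history

history = edit_and_retry(history, 2, "Now make it about fog instead.")
```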
That is what "assistant api" is, you create a thread and add new user message to the thread. The messages are stored on the server.
There is an API to modify messages, though I am not sure of its constraints.
Edit/remove/retry is just including the whole conversation over again (IIUC this is even how the app works.) It's part of why the API is so expensive
The Assistants API doesn't let you recreate the conversation (with edits or not) because you can't (re)create messages with role=assistant.
This was indeed true in the beginning, and I don't know if this has changed. Inserting messages with the assistant role is crucial for many reasons, such as if you want to implement caching, or otherwise edit/compress a previous assistant response for cost or other reasons.
At the time I implemented a work-around in Langroid[1]: since you can only insert a "user" role message, prepend the content with ASSISTANT: whenever you want it to be treated as an assistant-role message (rough sketch below, after the links). This actually works as expected and I was able to do caching. I explained it in this forum:
https://community.openai.com/t/add-custom-roles-to-messages-...
[1] the Langroid code that adds a message with a given role, using this above “assistant spoofing trick”:
https://github.com/langroid/langroid/blob/main/langroid/agen...
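The trick itself is tiny; roughly this (a sketch, not the actual Langroid code):

```
# Sketch of the "assistant spoofing" work-around described above: the
# Assistants API only accepts role="user" messages here, so an assistant
# turn is inserted as a user message prefixed with "ASSISTANT:".
from openai import OpenAI

client = OpenAI()

def add_thread_message(thread_id: str, role: str, content: str):
    if role == "assistant":
        content = f"ASSISTANT: {content}"  # model treats it as its own prior turn
    return client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",  # the only role the API lets you insert
        content=content,
    )
```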
Not true
How do you create messages as role assistant?
For all the brilliance in the AI and infra departments of OpenAI, their official Python library (which is the flagship one as I understand) feels pretty unidiomatic, designed without much thought for common patterns in the language.
2012 JavaScript called, it wants its callbacks wrapped in objects back. Why do we have a context manager named "stream" for which you call `.until_done()`? This could've been an iterator, or better, an asynchronous iterator, since this is streaming over the network. We could be destructuring instances of named tuples with pattern matching, or even just doing `"".join(delta.text for delta in prompt(...))`. But no, "subclass this instead," says the wrapper around a web API.
Hey there, I helped design the Python library.
The `stream` context manager actually does expose an async iterator (in the async client), so you could instead do this for the simple case:
```
with client.beta.threads.runs.create_and_stream(…) as stream:
    async for text in stream.text_deltas:
        print(text, end="", flush=True)
```

which I think is roughly what you want. Perhaps the docs should be updated to highlight this simple case earlier.
We are also considering expanding this design, and perhaps replacing the callbacks, like so:
```
with client.beta.threads.runs.create_and_stream(…) as stream:
    async for event in stream.all_events:
        if event.type == 'text_delta':
            print(event.delta.value, end='')
        elif event.type == 'run_step_delta':
            event.snapshot.id
            event.delta.step_details
            ...
```

which I think is also more in line with what you expect. (You could also `match event: case TextDelta: …`.)

Note that the context manager is required because otherwise there's no way to tell if you `break` out of the loop (or otherwise stop listening to the stream), which means we can't close the request (and you both keep burning tokens and leak resources in your app).
Context managers are a great abstraction.
Everything feels unidiomatic. The API design is bad, the frontends they build are horrific, reliability and availability are shocking.
And yet the AI is so good I put up with them everyday
If they ever grow into a proper product org they'll be unstoppable.
Hi there, I help design the OpenAI APIs. Would you be able to share more?
You can reply here or email me at atty@openai.com.
(Please don't hold back; we would love to hear the pain points so we can fix them.)
does your team do usability tests on the apis before launching them?
if you got 3-5 developers to try and use one of the sdks to build something, i bet you'd see common trends.
e.g. we recently had to update an assistant with new data every day and get one response back, and this is what the engineer came up with. Probably it could be improved, but this is really ugly:
```
const file = await openai.files.create({
  file: fs.createReadStream(fileName),
  purpose: 'assistants',
})
await openai.beta.assistants.update(assistantId, {
  file_ids: [file.id],
})

const { id: threadId } = await openai.beta.threads.create({
  messages: [
    {
      role: 'user',
      content:
        'Create PostSuggestions from the file. Remember to keep the style fun and engaging, not just regurgitating the headlines. Read the WHOLE article.',
    },
  ],
})

const getSuggestions = async (runIdArg: string) => {
  return new Promise<PostSuggestions>(resolve => {
    const checkStatus = async () => {
      const { status, last_error, required_action } = await openai.beta.threads.runs.retrieve(threadId, runIdArg)
      console.log({ status })
      if (status === 'requires_action') {
        if (required_action?.type === 'submit_tool_outputs') {
          required_action?.submit_tool_outputs?.tool_calls?.forEach(async toolOutput => {
            const parsed = PostSuggestions.safeParse(JSON.parse(toolOutput.function.arguments))
            if (parsed.success) {
              await openai.beta.threads.runs.cancel(threadId, runIdArg)
              resolve(parsed.data)
            } else {
              console.error(`failed to parse args from openai to my type (errors=${parsed.error.errors}`)
            }
          })
        } else {
          console.error(`requires_action, but not submit_tool_outputs (type=${required_action?.type})`)
        }
      } else if (status === 'completed') {
        throw new Error(`status is completed, but no data. supposed to go to requires_action`)
      } else if (status === 'failed') {
        throw new Error(`message=${last_error?.message}, code=${last_error?.code}`)
      } else {
        setTimeout(checkStatus, 500)
      }
    }
    checkStatus()
  })
}

const { id: runId } = await openai.beta.threads.runs.create(threadId, {
  assistant_id: assistantId,
})

console.time('openai create thread')
const newsSuggestions = await getSuggestions(runId)
console.timeEnd('openai create thread')
```

Just to add to this, it's not helped by the docs. Either they don't exist, or the SEO isn't working right.
e.g. my search term was "openai assistant service function call node". The first 2 results are community forums, not what I'm looking for. The 3rd is seemingly the official one but doesn't actually answer the question (how to use the assistant service with Node and function calling) with an example. The 4th is in Python.
https://community.openai.com/t/how-does-function-calling-act...
https://community.openai.com/t/how-assistant-api-function-ca...
https://platform.openai.com/docs/guides/function-calling
https://learn.microsoft.com/en-us/azure/ai-services/openai/h...
I'm sorry for your experience, and thanks very much for sharing the code snippet - that's helpful!
We did indeed code up some sample apps and highlighted this exact concern. We have some helpers planned to make it smoother, which we hope to launch before Assistants GA. For streaming beta, we were focused just on the streaming part of these helpers.
Hey, random question.
Is there a technical reason why log probs aren't available when using function calling? It's not a problem, I've already found a workaround. I was just curious haha.
In general I feel like the function calling/tool use is a bit cumbersome and restrictive so I prefer to write the typescript in the functions namespace myself and just use json_mode.
Have you seen/tried the `.runTools()` helper?
Docs: https://github.com/openai/openai-node?tab=readme-ov-file#aut...
Example: https://github.com/openai/openai-node/blob/bb4bce30ff1bfb06d...
(if what you're fundamentally trying to do is really just get JSON out, then I can see how json_mode is still easier).
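For reference, the json_mode route described above is roughly this (a sketch; describing the desired shape in the prompt is the DIY part):

```
# Sketch of the "skip tool calling, just use JSON mode" approach:
# describe the shape you want in the prompt and set response_format.
import json
from openai import OpenAI

client = OpenAI()

schema_hint = """Reply with JSON matching:
{ "city": string, "temperature_c": number }"""

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    response_format={"type": "json_object"},  # guarantees syntactically valid JSON
    messages=[
        {"role": "system", "content": schema_hint},
        {"role": "user", "content": "What's the weather like in Lisbon? Invent plausible values."},
    ],
)
data = json.loads(response.choices[0].message.content)
```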
Who can I reach out to for feedback on the web UI? Specifically, the chat.openai.com interface.
Web developer/designer for 24 years so I have a lot of ideas
...except for all the others.
Use Claude in Safari and the browser completely locks up after a single response.
My experience is their official Python library was easy to use, no surprises, everything is typed and generated from the OpenAPI spec in a thoughtful way.
The tools are great because they don't invent their own DSL, they "just" use JSON schemas.
Maybe they ought to contribute changes to OpenAPI to support streaming APIs better.
In contrast so many startups make their own annotation-driven DSLs for Python with their branding slapped over everything. It gives desperate-for-lock-in vibes. The last people OpenAI should be taking advice from for their API design is this forum.
How is suggesting the use of iterators and named tuples related to creating domain specific languages? If anything I'd say they're a much more generic and universally recognizable approach than having users subclass `AssistantEventHandler` to be passed to `client.beta.threads.runs.create_and_stream`, the context manager. This is very much a long way past just using JSON schemas but that part is ok - there's a REST API, and there's a library. If you're keen on the simplicity of JSON schema then by all means use the API with `requests` or your preferred http client library. Since that's always an option, it stands to reason that the point of having a dedicated library is to provide thoughtful abstractions that make it easier to use the service.
What I'm arguing is precisely that the abstractions in the library (such as the `AssistantEventHandler` shown in the article) are ineffective in making things simpler. They force you to over-engineer solutions and distribute state unnecessarily and be aware of that specific class interface when it could've just been something you use in a `for x in y` loop like everyone would know to do without spending an afternoon looking over docs and figuring out how the underlying implicit FSM works.
Probably written by GPT4
It’s not the case. The SDK is a collaboration between OpenAI and Stainless.
As a Stainless contributor I can guarantee you a lot of thought has been put into the design, and it definitely isn't written by an ML model.
Thanks for posting. I got an example working with functions and tool_calls if anyone needs it. I could not find good examples in the docs. https://medium.com/@hawkflow.ai/openai-streaming-assistants-...
Has anyone put out a voice-to-text interface for OpenAI? Or anything in the Ollama-verse?
OpenAI has a voice to text interface for OpenAI…
The mobile app is pretty good
Horrendous in non-English languages though; the accents are extremely American.
Is there a way to use the mobile app on PCs?
I tried with Windows Subsystem for Android but the app refused to work.
There’s whisper.
I am interested in using the Assistant API for my commercial project, but it is not clear from the article how the token count works:
- is it counted for a single user message or the sum of all previous messages?
- if there's a file, will it be counted every time a user interacts or only the first time?
I think
- it is correlated to the sum, every new interaction adds the whole history again
- yes, but you probably pay for the retrieved fragments, not the whole file
On the second point, there was an issue at launch where it would not find a relevant fragment and would appear to load the whole file into the context. Unsure if this has changed, but it freaked quite a few folks out on the OpenAI discussion forums with escalating costs.
Throwing a feature request in here just in case someone from OpenAI sees it.
I'd really like it if the streaming versions of their APIs could return a token usage count at the end.
The non-streaming APIs do this right now:
curl https://api.openai.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" -d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "user",
"content": "A short fun fact about pigeons"
}
]
}'
Returns:

{
"id": "chatcmpl-92UiIWQaf442wq7Eyp7kF8ge0e3fE",
"object": "chat.completion",
"created": 1710381746,
"model": "gpt-3.5-turbo-0125",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Pigeons are one of the few bird species that can drink water by sucking it up through their beaks, rather than tilting their heads back to swallow."
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 33,
"total_tokens": 47
},
"system_fingerprint": "fp_4f0b692a78"
}
Note the "usage" block there telling me how many tokens were used (which tells me how much this cost).But if I add "stream": true I get back an SSE stream that looks like this:
...
data: {"id":"chatcmpl-92Uk81oNjrcUJQnPX8fSNqFINLfSI","object":"chat.completion.chunk","created":1710381860,"model":"gpt-3.5-turbo-0125","system_fingerprint":"fp_4f0b692a78","choices":[{"index":0,"delta":{"content":"."},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-92Uk81oNjrcUJQnPX8fSNqFINLfSI","object":"chat.completion.chunk","created":1710381860,"model":"gpt-3.5-turbo-0125","system_fingerprint":"fp_4f0b692a78","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}]}
data: [DONE]
There's no "usage" block, which means I have to try and account for the tokens myself. This is really inconvenient!I noticed the other day that the Claude streaming API returns a "usage" block with the last message. I'd love it if OpenAI's API did the same thing.
I need this right now because I'm starting to build features for end users of my own software, and I want to be able to give them X,000 tokens "free" before starting to charge them for extras. Counting those tokens myself (probably using tiktoken) is code I'd rather not have to write - especially since features like tools/functions or images make counting tokens a lot less obvious.
We do the token counting on our end by literally just running tiktoken on the content chunks (although I think it's usually one token per chunk). It's a bit annoying and I too expected they'd have the usage block, but it's one line of code if you already have tiktoken available. I've found the accounting on my side lines up well with what we see on our usage dashboard.
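Something like this (a sketch; as the reply below notes, it's only an approximation because the API injects tokens you never see):

```
# Sketch: approximate streaming token usage client-side with tiktoken.
# Undercounts slightly, since OpenAI injects hidden tokens (message
# framing, function definitions, etc.).
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

completion_tokens = 0
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "A short fun fact about pigeons"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        completion_tokens += len(enc.encode(delta))

print(f"~{completion_tokens} completion tokens")
```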
As an FYI, this is fine for rough usage, but it's not accurate. The OpenAI APIs inject various tokens you are unaware of into the input for things like function calling.
This and/or being able to fetch the responses with their token usage by id. What is that ID for without a way to retrieve the completions with it?
They should do streaming for voice inputs in the ChatGPT app; right now it's very slow. Voice interfaces need to be streaming.
Any way to have a consistent system prompt across queries without sending it (and using tokens) for each completion?
The assistant has its own "instructions" (replacement for system prompt)
and then on each run, you have the option to add more guidance to the run explicitly, without modifying the assistant instructions (system prompt)
It's a little bit different but kind of the same
No, adding run instructions will replace existing instructions for that run
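For reference, the two levels look roughly like this (a sketch with the beta Python client; per the comment above, run-level instructions replace the assistant's instructions for that run rather than adding to them):

```
# Sketch: assistant-level instructions vs. run-level instructions.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    model="gpt-4-turbo-preview",
    instructions="You are Company Y's support bot.",  # persistent "system prompt"
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(thread_id=thread.id, role="user", content="Hi!")

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    # per-run guidance; note this replaces the assistant's instructions for this run
    instructions="You are Company Y's support bot. Today is a holiday; mention reduced hours.",
)
```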
That's what they say ;)
The Assistant API handles that, it has the system prompt as part of the assistant that you interact with.
And can you share the assistant with other users?
Also, the system prompt in assistants doesn't consume tokens?
Adore. Congrats team. For us the API is epic. We'd just ask for focus on performance.
Has tool use accuracy improved?
Sigh another week lost to the void
Elaborate?
"YET ANOTHER shiny new toy to distract me. Can't help myself even though I think it's mostly a waste of time"
Am I just projecting? Relatable, in any case :)
I immediately implemented streaming into my rocketchat gpt bot, was definitely a distraction but my colleagues liked it. No more waiting until the complete response is sent.
Yep, you captured the moment ^_^
OpenAI banned my account for suspicious payment activities, and I was never able to talk to a real person. Just several layers of chat bots posing as people.
I literally want to give them my money and can't. Every few weeks, for shirts and giggles, I send them an email saying, "any update on this?"
I suspected as much when one of their support "personnel" used the phrase "I apologize for the earlier confusion..." (there was no confusion, I was simply contradicting what they were saying)
One of the reasons I tend to use any of their options through Azure where available. Azure support has a more straightforward (though still sometimes slow) process for account issues.
I guess it's time for Claude 3 (I imagine you were using it for the LLMs).
My Anthropic account was suspended for suspicious activity, even though I never used it. I had forgotten I had signed up, and tried to sign up using a new email with the same phone number. Locked out forever.
please contact support: https://support.anthropic.com/en/ . we'll get it fixed. sorry!
Welcome to the future. You might be able to get an enterprise sales contract with human support.
I thought this was about making the OpenAI app available as a digital assistant on Android, as a replacement to Google.
Oh well..
This website is now like 30% about this probability-based autocomplete nonsense. Feels like all those bitcoin hypes and the "running everything on blockchain" fad of a few years ago. Now it's running everything through a "large autocomplete" model.
I really hope this will fade and focus will turn back to highlighting some broader actual human ingenuity in IT, rather than constant stream of "we used autocomplete for this new thing" or "we build this new API for this glorified autocomplete".
Boring.
"old man yells at cloud"
Seriously though, it's not going away no matter how much anyone hates it. Emails and blogs will continue to be written with it, letters of recommendation will be/are written with it, Presidential speeches will be written with it, academic articles will be / are written with it (almost all ml and cs research is), news is written with it... It's not going to stop, but it will _probably_/_very likely_ get better.
There is no tool, no human, and no method that can determine whether text was generated by one of these models at a high F-score (only sometimes high-precision, low-recall detection in silly example domains).
We're stuck with it. Like the English teacher and their despised spell check.
It occurs to me that over time, reading comprehension will become significantly more important than the ability to write. Anyone will be able to write something smart-sounding with AI's help, but it'll take real skill to make sure the output is correct and appropriate.
I just added this "autocomplete" in my app, and customers emailed to say they actually love it: https://docs.uxwizz.com/guides/ask-ai-new
Yes, customers will love anything that helps them. You can get customers to love you by adding any kind of automation for stuff they had to do by hand up to that point. Does this mean there should be 10 articles per day shared about "I added XLSX import to my app, so my customers don't have to do data entry via dialogs"?
My point is about the repetitiveness of LLM topics, not about the usefulness of LLMs themselves. And LLMs are glorified autocomplete. Their internals are maybe interesting, but that's often not what's being discussed here or even written about in the shared articles.
I've gotten so used to having an LLM integrated into my editor that when I work on the occasional spreadsheet (or really anything with syntax that I only use occasionally and no integrated AI) it's pretty jarring to have to go to another tab to look up what function to use for a formula (even if that other tab is ChatGPT).
Nah it's got legs as a google replacement / competitor if they keep costs lower and take a smaller rent. WHEN they start advertising they'll explode. Which is why google is trying to snuff them out in the cradle (sorry about the visual).
If deep learning algorithms are "autocomplete" then so is the human mind when it strings words together. No, that's not how it works.
[citation needed]
Just because that makes for a nice narrative in the copyright infringement argument, doesn't make it so.
We know next to nothing about how the human brain works.
Citation: Decades of research in artificial neural networks
Here's a paper from 1990 by the Godfather himself https://www.cs.toronto.edu/~hinton/absps/AIJmapping.pdf
"This 1990 paper demonstrated how neural networks could learn to represent and reason about part-whole hierarchical relationships, using family trees as the example domain.
By training on examples of family relations like parent-child and grandparent-grandchild, the neural network was able to capture the underlying logical patterns and reason about new family tree instances not seen during training.
This seminal work highlighted that neural networks can go beyond just memorizing training examples, and instead learn abstract representations that enable reasoning and generalization"
> We know next to nothing about how the human brain works
We understand how parts of it work.