OpenAI: Streaming is now available in the Assistants API

platform.openai.com

135 points by jonbraun 2 years ago · 88 comments

mrtksn 2 years ago

The Assistant API is still too much of a beta.

I was about to release an app based on the new Assistant API, but just a day before the release the response times increased to 8s flat. With function calls involved, that meant up to a minute to get a response.

I had to dismantle everything Assistant API-related and reimplement it with the Chat API. Which turned out to be great, because the Assistant API's context management was very bad and after a few back-and-forth messages the cost ballooned to over 10K tokens per message.

When I looked closely at the Assistant API and the Chat API, I noticed that the Assistant API is just a wrapper over the Chat API that acts as a web service storing the previous messages (so the slow-response problem was probably in the web service that keeps track of the context). So I went ahead and implemented my own Assistant API, which gives me more control. For example, I set a max token cost per message, and if the context balloons over that, I make a request with the context asking OpenAI to create a summary with all the facts so far, add that summary as a system prompt, and my context gets compressed back into reasonable territory.
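
Roughly, the compression step looks like this (a simplified sketch rather than my actual code; the token budget, model and summary prompt are illustrative):

    import tiktoken
    from openai import OpenAI

    client = OpenAI()
    MAX_CONTEXT_TOKENS = 3000  # illustrative per-request budget

    def count_tokens(messages, model="gpt-3.5-turbo"):
        enc = tiktoken.encoding_for_model(model)
        return sum(len(enc.encode(m["content"])) for m in messages)

    def compress_context(messages, model="gpt-3.5-turbo"):
        """If the running context exceeds the budget, fold the history into a summary."""
        if count_tokens(messages, model) <= MAX_CONTEXT_TOKENS:
            return messages
        summary = client.chat.completions.create(
            model=model,
            messages=messages + [{
                "role": "user",
                "content": "Summarize the conversation so far, keeping every fact needed to continue it.",
            }],
        ).choices[0].message.content
        # Keep the original system prompt, then replace the history with the summary.
        return [messages[0],
                {"role": "system", "content": "Summary of the conversation so far: " + summary}]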

  • SoulAuctioneer 2 years ago

    It does considerably more than (poorly) managing the context window. It also (poorly) enables persistent document storage, knowledge retrieval, function calling and code execution.

  • infecto 2 years ago

    I still don't even know what the Assistant API is supposed to afford me.

    • mrtksn 2 years ago

      It's useful if you just need to hook up a chat assistant and don't want to bother with the busywork of doing it yourself. All you care about is loading the messages from the thread (which are conveniently kept for you) and adding new messages.

      • lobsterthief 2 years ago

        Is the training method similar? For example, a company chatbot would need to know it’s a chatbot for Company Y.

        • mrtksn 2 years ago

          So, the Assistant API in OpenAI is just a wrapper over the Chat API. They let you choose which model you would like to use, so if you fine-tune a model you should be able to use it.

          However, I never tried fine-tuning; I rely on RAG, and the Assistant API does provide some tools to make this a bit easier. What tools? They provide an "editor interface" where you can set up function calls, upload some files and access the code interpreter.

          So if you are making a chatbot for Company Y, you can create an assistant which has information about Company Y in the system prompt and also can access up to date information about the company through function calls you define and the files you upload.

          If you use only the Chat API, you will have to handle this stuff yourself. Actually, though I'm using the Chat API, I do use the Assistant Editor UI to manage the functions and the system prompts. What I do is retrieve the assistant info from the OpenAI Assistant API and then use it with the Chat API. This way I don't have to bother with creating my own UI or fiddling with text files or code.

          As the Assistant API is just a wrapper, most of the data structures I receive from it work directly with the Chat API.
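
          Roughly (a simplified sketch; error handling is omitted and the fields I copy over are just the ones I happen to need):

              from openai import OpenAI

              client = OpenAI()

              # Pull the config I maintain in the Assistant Editor UI...
              assistant = client.beta.assistants.retrieve("asst_...")  # placeholder ID
              function_tools = [t.model_dump() for t in assistant.tools if t.type == "function"]

              # ...and reuse it with the plain Chat Completions API.
              extra = {"tools": function_tools} if function_tools else {}
              reply = client.chat.completions.create(
                  model=assistant.model,
                  messages=[
                      {"role": "system", "content": assistant.instructions},
                      {"role": "user", "content": "Hello!"},
                  ],
                  **extra,
              )
              print(reply.choices[0].message.content)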

        • infecto 2 years ago

          What training? Beyond supplying context, I don't think assistants have any fine-tuning involved.

      • infecto 2 years ago

        Yeah, that was kind of my read: it does not serve much purpose, if any, and only limits the capability.

andher 2 years ago

Finally! I've been using the Assistants API to build an AI mock interviewer (https://comp.lol), but the responses were painfully slow when using the latest iterations of the GPT-4 model. This will make things so much more responsive.

  • cosmotic 2 years ago

    I'd still want to see the entire response all at once. Having it stream in while I read it would be very distracting and make it difficult for me to read.

    • qwertox 2 years ago

      That's a request the front-end developer should be confronted with, not OpenAI.

      The website could just as well buffer the incoming stream until the user clicks an area to request the next block of the response, once they have finished reading the initial sentences.

    • TowerTall 2 years ago

      Yes, it's like surfing porn in the early internet years using a dial-up modem. One line at a time until you can finally see enough of the picture (reply) to realize that it was not the reply you were looking for.

      LLM streaming must be a cost-saving feature to prevent you from overloading the servers by asking too many questions within a short time frame. Annoying feature IMHO.

      • Kiro 2 years ago

        How is hiding it behind a loading spinner any better? You still can't spam it with questions since you need to wait for it to finish. With streaming you can at least hit the stop button if it looks incorrect, so you actually spam it more with it enabled.

        • silversmith 2 years ago

          For me, the constant visual changes of new parts being streamed in are annoying and a strain on the eyes. Ideally, web frontends would honor `prefers-reduced-motion` and buffer the response when it's set.

          • Prosammer 2 years ago

            Personally, I've fallen in love with that visual effect of streaming text you're talking about. It's a bit pavlovian, but I think in my head it signifies that I'm reading something high signal (even though it isn't always).

      • SoulAuctioneer 2 years ago

        It's more about UX, to reduce the perceived delay. LLMs inherently produce their responses token by token, but if you wait until inference has finished, the user is sitting around twiddling their thumbs.

  • pieterhg 2 years ago

    Same, it was super slow and unusable when I tried it: 10 seconds for a reply or so. The GPT-4 API itself was way faster.

AgentME 2 years ago

This was one of the limitations of the Assistants API that made me entirely ignore it up until now.

I am curious if the Assistants API lets you edit/remove/retry messages yet. I don't see anything implying this has changed. It's annoying that the Assistants API doesn't give you enough control to support basic things that the ChatGPT app does.

  • varenc 2 years ago

    Like the other commenter said, edit/remove/retry can already be implemented by the API client. The API doesn't maintain state, so every new message in a "conversation" includes the previous messages as context. To edit a message you re-submit the conversation history with the desired changes.
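
    With the stateless Chat Completions API, that looks roughly like this (a sketch; the history and the edit are made up):

        from openai import OpenAI

        client = OpenAI()

        history = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain SSE in one sentence."},
            {"role": "assistant", "content": "Server-Sent Events let a server push updates over one HTTP connection."},
            {"role": "user", "content": "Now make it rhyme."},
        ]

        # "Edit" the last user message and "retry": change it, drop anything after it,
        # and re-submit the whole conversation.
        history[3] = {"role": "user", "content": "Now explain it as a haiku."}
        retry = client.chat.completions.create(model="gpt-3.5-turbo", messages=history[:4])
        print(retry.choices[0].message.content)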

    I get what you're asking for though. It would be nice if this was easier. But that would require OpenAI changing their API model to one where conversation history is stored on their server. It would be more of a "ChatGPT conversation API" than just a GPT-4/3.5 API.

    • blackoil 2 years ago

      That is what the "assistant api" is: you create a thread and add new user messages to it. The messages are stored on the server.

      There is an API to modify messages, though I am not sure of its constraints.

  • xvector 2 years ago

    Edit/remove/retry is just including the whole conversation over again (IIUC this is even how the app works.) It's part of why the API is so expensive

    • AgentME 2 years ago

      The Assistants API doesn't let you recreate the conversation (with edits or not) because you can't (re)create messages with role=assistant.

pedrovhb 2 years ago

For all the brilliance in the AI and infra departments of OpenAI, their official Python library (which is the flagship one, as I understand it) feels pretty unidiomatic, designed without much thought for common patterns in the language.

2012 JavaScript called, it wants its callbacks wrapped in objects back. Why do we have a context manager named "stream" on which you call `.until_done()`? This could've been an iterator, or better - an asynchronous iterator, since this is streaming over the network. We could be destructuring instances of named tuples with pattern matching, or even just doing `"".join(delta.text for delta in prompt(...))`. But no, "subclass this instead," says the wrapper around a web API.

  • rattray 2 years ago

    Hey there, I helped design the Python library.

    The `stream` context manager actually does expose an async iterator (in the async client), so you could instead do this for the simple case:

        async with client.beta.threads.runs.create_and_stream(…) as stream:
          async for text in stream.text_deltas:
            print(text, end="", flush=True)
    
    which I think is roughly what you want.

    Perhaps the docs should be updated to highlight this simple case earlier.

    We are also considering expanding this design, and perhaps replacing the callbacks, like so:

        async with client.beta.threads.runs.create_and_stream(…) as stream:
          async for event in stream.all_events:
            if event.type == 'text_delta':
              print(event.delta.value, end='')
            elif event.type == 'run_step_delta':
              event.snapshot.id
              event.delta.step_details...
    
    which I think is also more in line with what you expect. (you could also `match event: case TextDelta: …`).

    Note that the context manager is required because otherwise there's no way to tell if you `break` out of the loop (or otherwise stop listening to the stream) which means we can't close the request (and you both keep burning tokens and leak resources in your app).

  • willsmith72 2 years ago

    Everything feels unidiomatic. The API design is bad, the frontends they build are horrific, reliability and availability are shocking.

    And yet the AI is so good I put up with them every day.

    If they ever grow into a proper product org they'll be unstoppable.

    • athyuttamre 2 years ago

      Hi there, I help design the OpenAI APIs. Would you be able to share more?

      You can reply here or email me at atty@openai.com.

      (Please don't hold back; we would love to hear the pain points so we can fix them.)

      • willsmith72 2 years ago

        does your team do usability tests on the apis before launching them?

        if you got 3-5 developers to try and use one of the sdks to build something, i bet you'd see common trends.

        e.g. we recently had to update an assistant with new data every day and get 1 response, and this is what the engineer came up with. probably it could be improved, but this is really ugly:

        ```
          const file = await openai.files.create({
           file: fs.createReadStream(fileName),
           purpose: 'assistants',
          })
          await openai.beta.assistants.update(assistantId, {
           file_ids: [file.id],
          })

          const { id: threadId } = await openai.beta.threads.create({
           messages: [
            {
             role: 'user',
             content:
              'Create PostSuggestions from the file. Remember to keep the style fun and engaging, not just regurgitating the headlines. Read the WHOLE article.',
            },
           ],
          })
          const getSuggestions = async (runIdArg: string) => {
           return new Promise<PostSuggestions>(resolve => {
            const checkStatus = async () => {
             const { status, last_error, required_action } = await openai.beta.threads.runs.retrieve(threadId, runIdArg)
        
             console.log({ status })
             if (status === 'requires_action') {
              if (required_action?.type === 'submit_tool_outputs') {
               required_action?.submit_tool_outputs?.tool_calls?.forEach(async toolOutput => {
                const parsed = PostSuggestions.safeParse(JSON.parse(toolOutput.function.arguments))
                if (parsed.success) {
                 await openai.beta.threads.runs.cancel(threadId, runIdArg)
                 resolve(parsed.data)
                } else {
                 console.error(`failed to parse args from openai to my type (errors=${parsed.error.errors}`)
                }
               })
              } else {
               console.error(`requires_action, but not submit_tool_outputs (type=${required_action?.type})`)
              }
             } else if (status === 'completed') {
              throw new Error(`status is completed, but no data. supposed to go to requires_action`)
             } else if (status === 'failed') {
              throw new Error(`message=${last_error?.message}, code=${last_error?.code}`)
             } else {
              setTimeout(checkStatus, 500)
             }
            }
        
            checkStatus()
           })
          }
          const { id: runId } = await openai.beta.threads.runs.create(threadId, {
           assistant_id: assistantId,
          })
          console.time('openai create thread')
          const newsSuggestions = await getSuggestions(runId)
          console.timeEnd('openai create thread')
        ```
      • msp26 2 years ago

        Hey, random question.

        Is there a technical reason why log probs aren't available when using function calling? It's not a problem, I've already found a workaround. I was just curious haha.

        In general I feel like the function calling/tool use is a bit cumbersome and restrictive so I prefer to write the typescript in the functions namespace myself and just use json_mode.
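
        For reference, the json_mode approach I mean looks something like this (a rough sketch; the interface and prompt wording are just how I happen to phrase it):

            import json
            from openai import OpenAI

            client = OpenAI()

            schema = """
            // Reply with JSON matching exactly this shape.
            interface Extraction {
              company: string;
              sentiment: "positive" | "negative" | "neutral";
            }
            """

            resp = client.chat.completions.create(
                model="gpt-3.5-turbo-0125",
                response_format={"type": "json_object"},  # json_mode instead of tool calls
                logprobs=True,                            # works here, unlike with function calling
                messages=[
                    {"role": "system", "content": "You extract structured data." + schema},
                    {"role": "user", "content": "OpenAI shipped streaming for Assistants and people seem happy."},
                ],
            )
            data = json.loads(resp.choices[0].message.content)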

      • lobsterthief 2 years ago

        Who can I reach out to for feedback on the web UI? Specifically, the chat.openai.com interface.

        Web developer/designer for 24 years so I have a lot of ideas

    • mvkel 2 years ago

      ...except for all the others.

      Use Claude in Safari and the browser completely locks up after a single response.

  • doctorpangloss 2 years ago

    My experience is that their official Python library was easy to use: no surprises, everything is typed and generated from the OpenAPI spec in a thoughtful way.

    The tools are great because they don't invent their own DSL, they "just" use JSON schemas.

    Maybe they ought to contribute changes to OpenAPI to support streaming APIs better.

    In contrast so many startups make their own annotation-driven DSLs for Python with their branding slapped over everything. It gives desperate-for-lock-in vibes. The last people OpenAI should be taking advice from for their API design is this forum.

    • pedrovhb 2 years ago

      How is suggesting the use of iterators and named tuples related to creating domain specific languages? If anything I'd say they're a much more generic and universally recognizable approach than having users subclass `AssistantEventHandler` to be passed to `client.beta.threads.runs.create_and_stream`, the context manager. This is very much a long way past just using JSON schemas but that part is ok - there's a REST API, and there's a library. If you're keen on the simplicity of JSON schema then by all means use the API with `requests` or your preferred http client library. Since that's always an option, it stands to reason that the point of having a dedicated library is to provide thoughtful abstractions that make it easier to use the service.

      What I'm arguing is precisely that the abstractions in the library (such as the `AssistantEventHandler` shown in the article) are ineffective in making things simpler. They force you to over-engineer solutions, distribute state unnecessarily, and learn that specific class interface, when it could've just been something you use in a `for x in y` loop, like everyone would know to do without spending an afternoon looking over docs and figuring out how the underlying implicit FSM works.

  • jilles 2 years ago

    Probably written by GPT4

    • dgellow 2 years ago

      It’s not the case. The SDK is a collaboration between OpenAI and Stainless.

      https://www.stainlessapi.com/

      As a Stainless contributor I can guarantee you that a lot of thought has been put into the design, and it definitely wasn't written by an ML model.

bytemonitor 2 years ago

Thanks for posting. I got an example working with functions and tool_calls if anyone needs it. I could not find good examples in the docs. https://medium.com/@hawkflow.ai/openai-streaming-assistants-...

ProjectArcturis 2 years ago

Has anyone put out a voice-to-text interface for OpenAI? Or anything in the Ollama-verse?

jerrygoyal 2 years ago

I am interested in using the Assistants API for my commercial project, but it is not clear from the article how token counting works:

- is it counted for a single user message or the sum of all previous messages?

- if there's a file, will it be counted every time a user interacts or only the first time?

  • visarga 2 years ago

    I think

    - it is correlated to the sum, every new interaction adds the whole history again

    - yes, but you probably pay for the retrieved fragments, not the whole file

    • brandall10 2 years ago

      On the second point, there was an issue at launch where it would not find a relevant fragment and would appear to load the whole file into the context. Unsure if this has changed, but it freaked quite a few folks out on the OpenAI discussion forums with escalating costs.

simonw 2 years ago

Throwing a feature request in here just in case someone from OpenAI sees it.

I'd really like it if the streaming versions of their APIs could return a token usage count at the end.

The non-streaming APIs do this right now:

    curl https://api.openai.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" -d '{
        "model": "gpt-3.5-turbo",
        "messages": [
          {
            "role": "user",
            "content": "A short fun fact about pigeons"
          }
        ]
      }'
Returns:

    {
      "id": "chatcmpl-92UiIWQaf442wq7Eyp7kF8ge0e3fE",
      "object": "chat.completion",
      "created": 1710381746,
      "model": "gpt-3.5-turbo-0125",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "Pigeons are one of the few bird species that can drink water by sucking it up through their beaks, rather than tilting their heads back to swallow."
          },
          "logprobs": null,
          "finish_reason": "stop"
        }
      ],
      "usage": {
        "prompt_tokens": 14,
        "completion_tokens": 33,
        "total_tokens": 47
      },
      "system_fingerprint": "fp_4f0b692a78"
    }
Note the "usage" block there telling me how many tokens were used (which tells me how much this cost).

But if I add "stream": true I get back an SSE stream that looks like this:

    ...
    data: {"id":"chatcmpl-92Uk81oNjrcUJQnPX8fSNqFINLfSI","object":"chat.completion.chunk","created":1710381860,"model":"gpt-3.5-turbo-0125","system_fingerprint":"fp_4f0b692a78","choices":[{"index":0,"delta":{"content":"."},"logprobs":null,"finish_reason":null}]}
    
    data: {"id":"chatcmpl-92Uk81oNjrcUJQnPX8fSNqFINLfSI","object":"chat.completion.chunk","created":1710381860,"model":"gpt-3.5-turbo-0125","system_fingerprint":"fp_4f0b692a78","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}]}
    
    data: [DONE]
There's no "usage" block, which means I have to try and account for the tokens myself. This is really inconvenient!

I noticed the other day that the Claude streaming API returns a "usage" block with the last message. I'd love it if OpenAI's API did the same thing.

I need this right now because I'm starting to build features for end users of my own software, and I want to be able to give them X,000 tokens "free" before starting to charge them for extras. Counting those tokens myself (probably using tiktoken) is code I'd rather not have to write - especially since features like tools/functions or images make counting tokens a lot less obvious.

  • gtoubassi 2 years ago

    We do the token counting on our end, literally just running tiktoken on the content chunks (although I think it's usually one token per chunk). It's a bit annoying, and I too expected they'd have the usage block, but it's one line of code if you already have tiktoken available. I've found the accounting on my side lines up well with what we see on our usage dashboard.
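
    Roughly like this (a sketch; it only counts the completion side, so it's an approximation rather than exact billing):

        import tiktoken
        from openai import OpenAI

        client = OpenAI()
        enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

        completion_tokens = 0
        stream = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "A short fun fact about pigeons"}],
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                completion_tokens += len(enc.encode(delta))
                print(delta, end="", flush=True)

        print(f"\n~{completion_tokens} completion tokens")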

    • tristanz 2 years ago

      As an FYI, this is fine for rough usage, but it's not accurate. The OpenAI APIs inject various tokens you are unaware of into the input for things like function calling.

  • harrisonjackson 2 years ago

    This and/or being able to fetch the responses with their token usage by id. What is that ID for without a way to retrieve the completions with it?

zerop 2 years ago

They should do streaming for voice input in the ChatGPT app; right now it's very slow. Voice interfaces need to be streaming.

XCSme 2 years ago

Any way to have a consistent system prompt across queries without sending it (and using tokens) for each completion?

  • arthurcolle 2 years ago

    The assistant has its own "instructions" (a replacement for the system prompt),

    and then on each run you have the option to add more guidance explicitly, without modifying the assistant's instructions (system prompt).

    It's a little bit different but kind of the same
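
    In code, the difference looks roughly like this (a sketch; the model name and wording are placeholders, and `additional_instructions` is the per-run field I mean):

        from openai import OpenAI

        client = OpenAI()

        # The assistant carries its own "instructions" (the system-prompt replacement)...
        assistant = client.beta.assistants.create(
            model="gpt-4-turbo-preview",
            instructions="You are Company Y's support bot. Be concise.",
        )

        thread = client.beta.threads.create()
        client.beta.threads.messages.create(thread.id, role="user", content="Where is my order?")

        # ...and each run can add extra guidance without touching the assistant itself.
        run = client.beta.threads.runs.create(
            thread_id=thread.id,
            assistant_id=assistant.id,
            additional_instructions="Today is a public holiday; mention possible shipping delays.",
        )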

  • eightysixfour 2 years ago

    The Assistant API handles that: the system prompt is part of the assistant that you interact with.

    • XCSme 2 years ago

      And can you share the assistant with other users?

      Also, the system prompt in assistants doesn't consume tokens?

johnfurneaux 2 years ago

Adore. Congrats team. For us the API is epic. We'd just ask for focus on performance.

milar 2 years ago

Has tool use accuracy improved?

arthurcolle 2 years ago

Sigh another week lost to the void

  • nextworddev 2 years ago

    Elaborate?

    • castles 2 years ago

      "YET ANOTHER shiny new toy to distract me. Can't help myself even though I think it's mostly a waste of time"

      Am I just projecting? Relatable, in any case :)

      • __m 2 years ago

        I immediately implemented streaming in my Rocket.Chat GPT bot; it was definitely a distraction, but my colleagues liked it. No more waiting until the complete response is sent.

      • arthurcolle 2 years ago

        Yep, you captured the moment ^_^

potsandpans 2 years ago

OpenAI banned my account for suspicious payment activity, and I was never able to talk to a real person. Just several layers of chatbots posing as people.

I literally want to give them my money and can't. Every few weeks, for shirts and giggles, I send them an email saying, "any update on this?"

  • ukuina 2 years ago

    I suspected as much when one of their support "personnel" used the phrase "I apologize for the earlier confusion..." (there was no confusion, I was simply contradicting what they were saying)

  • dbish 2 years ago

    That's one of the reasons I tend to use their offerings through Azure where available. Azure support has a more straightforward (though still sometimes slow) process for account issues.

  • GaggiX 2 years ago

    I guess it's time for Claude 3 (I imagine you were using it for the LLMs).

  • slimsag 2 years ago

    Welcome to the future. You might be able to get an enterprise sales contract with human support.

m-p-3 2 years ago

I thought this was about making the OpenAI app available as a digital assistant on Android, as a replacement for Google's.

Oh well..

megous 2 years ago

This website is now like 30% about this probability-based autocomplete nonsense. It feels like all the bitcoin hype and the "running everything on blockchain" fad of a few years ago. Now it's running everything through a "large autocomplete" model.

I really hope this will fade and the focus will turn back to highlighting broader, actual human ingenuity in IT, rather than a constant stream of "we used autocomplete for this new thing" or "we built this new API for this glorified autocomplete".

Boring.

  • chaxor 2 years ago

    "old man yells at cloud"

    Seriously though, it's not going away no matter how much anyone hates it. Emails and blogs will continue to be written with it, letters of recommendation are and will be written with it, presidential speeches will be written with it, academic articles are and will be written with it (almost all ML and CS research is), news is written with it... It's not going to stop, but it will _probably_/_very likely_ get better.

    There is no tool, no human, no method that can determine whether text was generated by one of these models with a high F-score (only occasionally, in high-precision, low-recall domains, for silly examples).

    We're stuck with it. Like the English teacher and their despised spell check.

    • romanhn 2 years ago

      It occurs to me that over time, reading comprehension will become significantly more important than the ability to write. Anyone will be able to write something smart-sounding with AI's help, but it'll take real skill to make sure the output is correct and appropriate.

  • XCSme 2 years ago

    I just added this "autocomplete" in my app, and customers emailed to say they actually love it: https://docs.uxwizz.com/guides/ask-ai-new

    • megous 2 years ago

      Yes, customers will love anything that helps them. You can get customers to love you by adding any kind of automation for stuff they had to do by hand up to that point. Does this mean there should be 10 articles per day shared about "I added XLSX import to my app, so my customers don't have to do data entry via dialogs"?

      My point is about the repetitiveness of LLM topics, not about the usefulness of LLMs themselves. And LLMs are glorified autocomplete. Their internals are maybe interesting, but that's often not what's being discussed here or even written about in the shared articles.

    • kfajdsl 2 years ago

      I've gotten so used to having an LLM integrated into my editor that when I work on the occasional spreadsheet (or really anything with syntax that I only use occasionally and no integrated AI) it's pretty jarring to have to go to another tab to look up what function to use for a formula (even if that other tab is ChatGPT).

  • ametrau 2 years ago

    Nah, it's got legs as a Google replacement/competitor if they keep costs lower and take a smaller rent. WHEN they start advertising they'll explode. Which is why Google is trying to snuff them out in the cradle (sorry about the visual).

  • xcv123 2 years ago

    If deep learning algorithms are "autocomplete" then so is the human mind when it strings words together. No, that's not how it works.

    • dns_snek 2 years ago

      [citation needed]

      Just because that makes for a nice narrative in the copyright infringement argument doesn't make it so.

      We know next to nothing about how the human brain works.

      • xcv123 2 years ago

        Citation: Decades of research in artificial neural networks

        Here's a paper from 1990 by the Godfather himself https://www.cs.toronto.edu/~hinton/absps/AIJmapping.pdf

        "This 1990 paper demonstrated how neural networks could learn to represent and reason about part-whole hierarchical relationships, using family trees as the example domain.

        By training on examples of family relations like parent-child and grandparent-grandchild, the neural network was able to capture the underlying logical patterns and reason about new family tree instances not seen during training.

        This seminal work highlighted that neural networks can go beyond just memorizing training examples, and instead learn abstract representations that enable reasoning and generalization"

        > We know next to nothing about how the human brain works

        We understand how parts of it work.
