Why is Chat GPT so expensive to operate?

82 points by beavis000 3 years ago · 53 comments


Altman has said "it's a few cents per chat", which probably means it's closer to high single-digit cents per chat. Does that estimate include amortization of upfront development costs, or is it actually the marginal cost of a chat?

vineyardmike 3 years ago

All these answers are good, but I can share more concrete numbers…

Meta released their OPT model, which they claim is comparable to the GPT-3 model. Guidance for running that model [1] suggests a LOT of memory - at least 350GB of GPU memory, which is roughly 4 A100s, which are pricey.

Running this on AWS with the above suggestion would cost $25/hr - just for one model running. That’s almost $0.50 a minute. If you imagine it takes a few seconds to run the model for one request, you’ll easily hit $0.05 per request once you factor in the rest of the infra (storage, CDN, etc.), the engineering cost, the research cost, and the fact that they probably have to scale to hundreds of instances for heavy traffic, which may mean less efficiently utilized servers.
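
As a rough back-of-envelope (all numbers are my guesses above, nothing official):

    # Back-of-envelope cost per request, using my guessed numbers above.
    instance_cost_per_hour = 25.0                # $/hr for ~4 A100s on AWS
    seconds_per_request = 3.0                    # guessed GPU time per chat
    cost_per_second = instance_cost_per_hour / 3600
    raw_gpu_cost = cost_per_second * seconds_per_request  # ~$0.02/request
    overhead = 2.5                               # infra, CDN, idle capacity (guess)
    print(round(raw_gpu_cost * overhead, 3))     # -> 0.052, i.e. ~$0.05/request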

OpenAI has a sweetheart deal with Azure, but this is roughly the cost structure for serving requests. And this doesn’t include the upfront cost of training.

https://alpa.ai/tutorials/opt_serving.html

  • mr_00ff00 3 years ago

    Really makes you appreciate the brain, which presumably handles a similar computational demand.

    • unsupp0rted 3 years ago

      Hard to tell. Similar to how it takes a lot of resources for a human to hang from monkey bars but for a sloth it takes basically no resources at all, because the sloth comes out of the box designed for it.

    • smnrchrds 3 years ago

      Another mind-boggling thing about the brain is how little power it uses to do all the complex things it does.

      • wordpad25 3 years ago

        calories are a unit of energy, so it’s a straightforward comparison

        if we assume that a computer can be powered by 100 watts, over a day it will use 2.4 kWh, which is about 2000 Calories

        a GPU will consume a lot more, but we aren’t that far off in efficiency
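
        A quick sanity check of that conversion (1 kWh = 3.6 MJ, 1 Calorie = 4184 J):

          watts = 100
          kwh_per_day = watts * 24 / 1000             # 2.4 kWh
          kcal_per_day = kwh_per_day * 3.6e6 / 4184   # joules -> Calories
          print(kwh_per_day, round(kcal_per_day))     # -> 2.4 2065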

        • pattrn 3 years ago

          Doesn't that assume 100% of a human's daily calorie burn is due to brain activity?

          • unsupp0rted 3 years ago

            The brain uses about 20% of a human's calories. It's not 100%, but it's a substantial fraction.

          • sannee 3 years ago

            The other components of the human body are also required for brain function.

    • imtringued 3 years ago

      The brain doesn't use a synchronous digital architecture. It is asynchronous. Spiking neural networks implemented in neuromorphic hardware are equally efficient. They consume milliwatts for a million neurons.

      • awesomeMilou 3 years ago

        Do you have links on novel hardware architectures for neuromorphic hardware? In my country, the leading research group for neuromorphic computing does not cite any novel hardware approaches, only which existing HW architectures are most suitable.

    • iosystem 3 years ago

      How do you know that the universe isn't just rendering everything?

    • badrabbit 3 years ago

      To have ML produce meaningful content you need to give it some input or a sense of what the outcome should be, and this is after billions of trials and errors.

      Yet people these days believe something like the brain was brute-forced by nature into an accidental existence.

      • ericathegreat 3 years ago

        Some input: The organism's environment.

        Outcome should be: The organism successfully produces offspring.

        Natural selection is doing exactly what you describe.

        • badrabbit 3 years ago

          Except natural selection can't start over. It only works if there is always a high rate of survivors. And even if that were not an issue, consider 4 billion years and a generous generation length of one year (one natural selection cycle): 4 billion isn't a whole lot even for small features when you don't have an enormous population and birth rate. Let's say there were 100,000 humans at some point and only 1,000 fatal features (being generous): it's not just that the replacement rate of defective humans needs to exceed the elimination rate, a certain percentage of replacements must be free of all fatal defects and survive. Also, consider how there should be many failed species that attempted to evolve into a human-like species or a primate. You can't always luck out; at some point the entire branch has to fail, requiring subsequent attempts, meanwhile the fatal conditions that required the evolution will not go away.

          • KingMob 3 years ago

            > It only works if there is always a high rate of survivors

            There doesn't have to be a high rate of survival if the reproductive rate compensates for losses.

            E.g., if 80% of wild rabbits are eaten, but the remaining 20% can give birth to 5 bunnies per parent per lifetime, the population will be stable.
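
            A toy version of that arithmetic (the 80%/5-bunnies figures are just my example numbers above):

              population = 1000
              for generation in range(10):
                  survivors = population * 0.20  # 80% eaten
                  population = survivors * 5     # 5 offspring per surviving parent
              print(population)                  # -> 1000.0: stable across generations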

            I have no idea where you're getting your beliefs, but most of it is wrong in both the math and biology.

            • badrabbit 3 years ago

              What I am saying is that the rate needs to continue to be positive, and out of the 20% survivors many will not carry the survival gene. And on top of that, it isn't just one thing that kills a rabbit in your example: the climate, not finding mates, predators, disease and more all must be overcome at once. Survivors must overcome a wide array of adversity and successfully pass on that combination of abilities, and this needs to happen every generation.

              Look at it in bits and bytes. For each adversity-overcoming feature that a species has inherited, let that be a bit set to 1. With 2 adversities you have only two bits, and only one out of 4 individuals has both bits on. For a realistic adversity count of 32, only one in 2^32 (about 4 billion) combinations has all bits set to one. And this is without considering how a survival trait against one adversity can be a fatal trait against another. Now these bits need to be passed on; if one of them is missing, then the only chance that individual has to survive is if by pure chance they avoid that adversary.

              Think of the endless adversities we face and overcome: you are saying that for millions of generations there has been an unbroken chain of survivors that kept overcoming a geometrically expanding adversity. Just a one-degree increase in global temperature causes entire ecosystems to collapse.

              Survival is the exception, not the default.

              • wruza 3 years ago

                Yet creatures survive and reproduce, a fact that you can observe, analyze and even control.

                I think you’re ignoring a bunch of dynamics by trying to model it with binary.

                Prey population going down means a predator population also going down and a competitor population going up. It’s not an endgame, it’s just an “ear” of a very complex attractor, which with time only sharpens its ability to have as few escape points as possible.

                A 1°C fluctuation by itself does nothing, because life usually has a much wider tolerance due to long seasonal fluctuations. A global ±1° shift means there will be a tipping point somewhere which would bring a drastic local change. Locally life may suffer, but it counteracts with migration and preexisting diversity. It simply suffers everywhere, always. That’s its modus operandi. A little bit more is hardly fatal.

                So yes, survival is the default because life naturally specializes in it.

      • melagonster 3 years ago

        It is natural selection. This is the most famous mechanism of evolution.

  • ramblerman 3 years ago

    It's interesting that the requirements for a text model are so much greater than for images.

    Stable Diffusion can run on a home PC, while it seems you need a supercomputer for GPT-3. I'm not sure that would have been my intuition.

    • sadpasture 3 years ago

      I think it has to do with text being much more precise. Your stably-diffused cartoon avatar having 6 fingers is not nearly as noticeable as a language model's chat misspelling every second word. So you need fewer resources to get to a human-acceptable result.

  • mike_hearn 3 years ago

    Don't forget training costs, labor costs used for RLHF and (most likely) the money required for such large volumes of training data.

  • hansvm 3 years ago

    Doesn't ChatGPT fine-tune one of the smaller GPT-3s, not the 175B parameter model?

sdrg822 3 years ago

For things like BERT, where you just want to extract an embedding, the naive way you reach full utilization at inference time is that you (see the sketch after this list):

- run tokenization of inputs on CPU

- sort inputs by length

- batch inputs of similar length and apply padding to make them of uniform length

- pass the batches through so a single model can process many inputs in parallel.
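
A minimal sketch of that recipe, assuming a tokenize() function that maps a string to a list of token ids (a hypothetical stand-in for a real tokenizer):

    def make_batches(texts, batch_size, pad_id=0):
        # 1. tokenize on CPU (tokenize() is the hypothetical stand-in)
        tokenized = [tokenize(t) for t in texts]
        # 2. sort inputs by token length
        order = sorted(range(len(texts)), key=lambda i: len(tokenized[i]))
        batches = []
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            # 3. pad each batch to its longest member
            max_len = max(len(tokenized[i]) for i in idx)
            batch = [tokenized[i] + [pad_id] * (max_len - len(tokenized[i]))
                     for i in idx]
            # 4. one forward pass can now process the whole batch in parallel
            batches.append((idx, batch))  # keep idx to restore original order
        return batches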

For GPT-style decoder models, however, this becomes much more challenging because inference requires a forward pass for every token generated. (Stopping criteria may also differ, but that’s another tangent.)

Every generated token performs attention on every previous token, both the context (or “prompt”) and the previously generated tokens (important for self-consistency). This is a quadratic operation in the vanilla case: generating T tokens costs on the order of 1 + 2 + … + T = T(T+1)/2 attention comparisons.

Model sizes are large, often spanning multiple machines, and the information for later layers depends on previous ones, meaning inference has to be pipelined.

The naive approach would be to have a single transaction processed exclusively by a single instance of the model. This is expensive! Even if each model can be crammed into a single A100, if you want to run something like Codex or ChatGPT for millions of users with low-latency inference, you’d have to have thousands of GPUs preloaded with models, and each transaction would take a highly variable amount of time.

If a model spans multiple machines, you’d achieve a max of 1/n utilization, because each shard has to remain loaded while the others process. And then if you want to do pipeline parallelism like in PipeDream, you’d have to deal with attention caches, since you don’t want to recompute every previous state each time.
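
To make the cache point concrete, here is a toy single-head attention decoding loop (illustrative only - no projections, layers, or vocabulary, and definitely not OpenAI's implementation):

    import numpy as np

    def attend(q, keys, values):
        # dot-product attention of the current query over all cached steps
        scores = keys @ q / np.sqrt(q.shape[0])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ values

    rng = np.random.default_rng(0)
    x = rng.normal(size=8)      # stand-in embedding of the last prompt token
    cache_k, cache_v = [], []   # the attention (KV) cache
    for step in range(5):       # one forward pass per generated token
        cache_k.append(x)       # cache this step's state instead of
        cache_v.append(x)       # recomputing every previous state
        x = attend(x, np.array(cache_k), np.array(cache_v))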

JoeyBananas 3 years ago

> Does that estimate include amortization of upfront development costs?

The answer is almost certainly "no." A service like ChatGPT is expensive because it requires heavy-duty GPU computations.

Jensson 3 years ago

The GPU setup required to run it (A100s) is said to cost about $150k. If each query is said to cost about 3 cents, then that means the hardware could execute the model about 5 million times before it pays for itself. Maybe a bit more if we include the electricity bill, and even more if Microsoft charges extra for the service since they want to make a profit.

I don't think these numbers sound very out of line. It would be easier to understand the feasibility of this if we knew how fast those cards could execute the model. If it takes a second to run it, then a few cents seems about right; if it takes a few milliseconds, then it is a lot less than a few cents, unless Microsoft charges a huge premium for the servers.
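
The break-even arithmetic, using those (unverified) numbers:

    hardware_cost = 150_000                   # claimed price of the A100 setup, $
    revenue_per_query = 0.03                  # ~3 cents per query
    print(hardware_cost / revenue_per_query)  # -> 5,000,000 queries to pay it off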

toomuchtodo 3 years ago

https://twitter.com/tomgoldsteincs/status/160019698195510069...

https://threadreaderapp.com/thread/1600196981955100694.html

jhoelzel 3 years ago

Because it's a language model: it does not really query information but assumes relationships between words. Meaning that "the words" have to be encoded and brought into relation by text.

Now there are different ways to achieve this, but in essence it has to know everything all at once, plus instructions on how to handle it.

You can actually ask it to explain how you could create a natural language processing algorithm yourself, and it will even give you a starter framework in the language of your choice. But fair warning: for me it was like a 6-hour-deep rabbit hole :D

sinenomine 3 years ago

The model is large, and every instance likely requires several GPUs (or high-grade accelerators) to run at a moderate speed (I'm not sure to what degree they have optimized the model).

Read the papers.

ilaksh 3 years ago

Apparently each query requires hundreds of GBs of GPU RAM on several expensive accelerator cards.

Is the H100 deployed at Azure? I wonder how much more efficient that would be over A100s.

lee101 3 years ago

Basically, GPU/compute costs are so expensive. Probably just the chat cost itself; also, a whole boatload of development costs will eventually be passed on to consumers. For a cheaper alternative, try https://text-generator.io - it also analyses images, which OpenAI doesn't do.

scarface74 3 years ago

My question is: how long will it be before the average high-end computer can run it? How long before your average smartphone?

Memory shipped with computers has been stagnant for a decade.

  • faebi 3 years ago

    Maybe that will be the next use case to make larger amounts of memory mainstream. At the same time, somehow Tesla still manages to cram more and more neural nets into that small memory. So it could also be that many neural nets are just not really efficient yet.

  • est31 3 years ago

    People are already trying to put cutting edge models onto consumer hardware: https://news.ycombinator.com/item?id=32678664

    We live in a really exciting age :). Local AI models will also finally give Microsoft a reason again to require new hardware for coming Windows versions. Now they have to require obscure security chips and stuff, but in the future they might have some local Cortana thingy or something that requires a certain amount of computational power.

thealch3m1st 3 years ago

Could models like ChatGPT run on hardware like the Tesla Dojo? If so, maybe Elon should donate some...

  • sidibe 3 years ago

    Does Dojo even exist? He kept talking about how it was almost ready a couple of years ago; no word since, which is strange from such a braggart.

    • cypress66 3 years ago

      They unveiled it on AI Day 2021, talked more about it on AI Day 2022, and in theory it should start operating in Q1 2023.

  • thealch3m1st 3 years ago

    It would be a good idea, no?

DoesntMatter22 3 years ago

I think it's actually quite cheap for what it is

  • MuffinFlavored 3 years ago

    what useful purpose have you found for ChatGPT given the “it can return inaccurate results posed as accurate” problem?

    • DoesntMatter22 3 years ago

      It does coding things extremely well. Are there some errors here or there at times? Yes but in general it does it excellently. I think this is a good example of not letting the perfect be the enemy of the good.

      It will write 200 lines of code for me which would maybe take me a few hours. I have to spend 15 minutes cleaning it up, but still it saved me 80% of the time. It's a massive win.

      Also great for writing articles or emails. I write what I want to say into ChatGPT and tell it to rewrite it to be pleasant and less harsh, and it does a great job of that.

    • elbear 3 years ago

      Have it do stuff you know how to do, just a lot faster. Or, even if you don't know exactly how to do it, check what it gave you to see if it produces expected results.

      For example, it gives you code. You run that code to see if the outputs are as expected.

z3r0k00l 3 years ago

python

  • sinenomine 3 years ago

    If you really measure what is being run, it is more likely well-optimized CUDA GPU assembly kernels - or, at this point, might already be some exotic TPU-like accelerator assembly.

    This hubris over the top-level language in the system is so passé, so 2000s.

  • mulligan 3 years ago

    At this scale, the ML models are usually compiled into a format that runs independently of Python, so the answer isn't "python".
