Why is ChatGPT so expensive to operate?
Altman has said "it's a few cents per chat", which probably means it's closer to high single-digit cents per chat.

Does that estimate include amortization of upfront development costs, or is it actually the marginal cost of a chat?

All these answers are good, but I can share more concrete numbers… Meta released their OPT model, which they claim is comparable to the GPT-3 model. Guidance for running that model [1] suggests a LOT of memory - at least 350GB of GPU memory, which is roughly 4-5 A100s, which are pricey. Running this on AWS with the above suggestion would cost $25/hr - just for one model running. That's almost $0.50 a minute. If you imagine it takes a few seconds to run the model for one request… you'll easily hit $0.05 per request once you factor in the rest of the infra (storage, CDN, etc.), the engineering cost, the research cost, and the fact that they probably have to scale to hundreds of instances for heavy traffic, which may mean less efficient use of the servers they've purchased. (Back-of-envelope arithmetic is sketched a few comments below.) OpenAI has a sweetheart deal with Azure, but this is roughly the cost structure for serving requests. And this doesn't include the upfront cost of training.

Really makes you appreciate the brain, which presumably operates with some sort of similar demand.

Hard to tell. Similar to how it takes a lot of resources for a human to hang from monkey bars, but for a sloth it takes basically no resources at all, because the sloth comes out of the box designed for it.

Human babies come out of the box designed for hanging from monkey bars as well.

Another mind-boggling thing about the brain is how little power it uses to do all the complex things it does.

Calories are a unit of energy, so it's a straightforward comparison: if we assume a computer can be powered by 100 watts, over a day it will use 2.4 kWh, which is about 2000 Calories. A GPU will consume a lot more, but we aren't that far off in efficiency.

Doesn't that assume 100% of a human's daily calorie burn is due to brain activity?

The brain uses about 20% of a human's calories. It's not 100%, but it's a substantial fraction. The other components of the human body are also required for brain function.
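Putting rough numbers on the cost estimate earlier in the thread and the watts-to-Calories comparison above. This is a back-of-envelope sketch only; the $25/hr, few-seconds-per-request, 100 W, and 20% figures are all taken from the comments here, not measured.

    # Back-of-envelope numbers only; all inputs are rough figures quoted in this
    # thread (not measured), so treat the results as order-of-magnitude estimates.

    # --- serving cost ---
    aws_cost_per_hour = 25.0       # $/hr for ~350GB of GPU memory (figure from above)
    seconds_per_request = 3.0      # "a few seconds" per request (assumed)
    requests_per_hour = 3600 / seconds_per_request
    gpu_cost_per_request = aws_cost_per_hour / requests_per_hour
    print(f"GPU-only cost per request: ${gpu_cost_per_request:.3f}")   # ~$0.021
    # Storage/CDN/engineering/research overhead and low utilization at scale
    # push this toward the ~$0.05/request figure quoted above.

    # --- power comparison ---
    computer_watts = 100                                  # assumed steady draw
    kwh_per_day = computer_watts * 24 / 1000              # 2.4 kWh
    kcal_per_day = kwh_per_day * 860                      # 1 kWh is roughly 860 kcal
    brain_kcal_per_day = 2000 * 0.20                      # ~20% of a ~2000 kcal/day diet
    print(f"100 W computer: ~{kcal_per_day:.0f} kcal/day, brain: ~{brain_kcal_per_day:.0f} kcal/day")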
The brain doesn't use a synchronous digital architecture; it is asynchronous. Spiking neural networks implemented in neuromorphic hardware are equally efficient. They consume milliwatts for a million neurons.

Do you have links on novel hardware architectures for neuromorphic hardware? In my country, the leading research group for neuromorphic computing does not cite any novel hardware approaches, only which existing hardware architectures are most suitable.

How do you know that the universe isn't just rendering everything? To have ML produce meaningful content you need to give it some input or a sense of what the outcome should be, and this is after billions of trials and errors. Yet people these days believe something like the brain was brute-forced by nature into an accidental existence.

Some input: the organism's environment. Outcome should be: the organism successfully produces offspring. Natural selection is doing exactly what you describe.

Except natural selection can't start over. It only works if there is always a high rate of survivors, and even if that weren't an issue, consider 4 billion years and a generous generation length of one year (one natural-selection cycle): 4 billion isn't a whole lot even for small features when you don't have an enormous population and birth rate. Let's say there were 100,000 humans at some point and only 1,000 fatal features (being generous). It's not just that the replacement rate of defective humans needs to exceed the elimination rate; a certain percentage of replacements must be free of all fatal defects and survive. Also, consider how there should be many failed species that attempted to evolve into a human-like species or a primate. You can't always luck out; at some point the entire branch has to fail, requiring subsequent attempts, while the fatal conditions that required the evolution will not go away.

> It only works if there is always a high rate of survivors

There doesn't have to be a high rate of survival if the reproductive rate compensates for losses. E.g., if 80% of wild rabbits are eaten, but the remaining 20% can give birth to 5 bunnies per parent per lifetime, the population will be stable. I have no idea where you're getting your beliefs, but most of it is wrong in both the math and the biology.

What I am saying is that the rate needs to continue to be positive, and out of the 20% survivors many will not carry the survival gene. And on top of that, it isn't just one thing that kills a rabbit in your example: the climate, not finding mates, predators, disease and more all must be overcome at once. Survivors must overcome a wide array of adversity and successfully pass on that combination of abilities, and this needs to happen every generation. Look at it in bits and bytes. For each adversity-overcoming feature that a species has inherited, let that be a bit set to 1. With 2 adversities you have only two bits, and only one of the 4 possible combinations has both bits on. For a realistic 32 adversities, only one combination out of roughly 4 billion (2^32) has every bit set to one. And this is without considering how a survival trait against one adversity can be a fatal trait against another. Now these bits need to be passed on; if one of them is missing, then the only chance that individual has to survive is by pure chance avoiding that adversary. Think of the endless adversities we face and overcome: you are saying that for millions of generations there has been an unbroken chain of survivors that kept overcoming a geometrically expanding adversity. Just a one-degree increase in global temperature causes entire ecosystems to collapse. Survival is the exception, not the default.

Yet creatures survive and reproduce, a fact that you can observe, analyze and even control.

I think you're ignoring a bunch of dynamics by trying to model it with binary. A prey population going down means a predator population also goes down and a competitor population goes up. It's not an endgame, it's just an "ear" of a very complex attractor, which with time only sharpens its ability to have as few escape points as possible. A 1°C fluctuation by itself does nothing, because life usually has a much wider tolerance due to long seasonal fluctuations. A global ±1 degree means there will be a tipping point somewhere which would bring a drastic local change. Locally life may suffer, but it counteracts with migration and preexisting diversity. It simply suffers everywhere, always. It's a modus operandi. A little bit more is barely fatal. So yes, survival is the default, because life naturally specializes in it.

It is natural selection. This is the most famous mechanism of evolution.
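For what it's worth, here is the arithmetic behind the rabbit example a few comments up, as a toy sketch; all the numbers are the invented ones from that comment. What matters for stability is surviving offspring per individual, not the raw survival rate.

    # Toy numbers from the rabbit example above (purely illustrative).
    survival_rate = 0.20           # 80% of rabbits are eaten before reproducing
    offspring_per_survivor = 5     # each surviving parent leaves 5 offspring

    # Expected surviving offspring per individual of the previous generation:
    replacement_rate = survival_rate * offspring_per_survivor
    print(replacement_rate)        # 1.0 -> each rabbit is, on average, replaced by one

    population = 1000
    for generation in range(5):
        population *= replacement_rate
        print(generation, int(population))   # stays at 1000 despite 80% losses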
It's interesting that the requirements for a text model are so much greater than for images. Stable Diffusion can run on a home PC, while it seems you need a supercomputer for GPT-3. I'm not sure that would have been my intuition.

I think it has to do with text being much more precise. Your stably diffused cartoon avatar having 6 fingers is not nearly as noticeable as a language model's chat misspelling every second word. So you need fewer resources to get to a human-acceptable result.

No, diffusion models are just more efficient.

Don't forget training costs, labor costs for RLHF and (most likely) the money required for such large volumes of training data.

Doesn't ChatGPT fine-tune one of the smaller GPT-3s, not the 175B-parameter model?

For things like BERT, where you just want to extract an embedding, the naive way you reach full utilization at inference time is:

- run tokenization of inputs on CPU
- sort inputs by length
- batch inputs of similar length and apply padding to make them uniform length
- pass the batches through, so a single model can process many inputs in parallel.

For GPT-style decoder models, however, this becomes much more challenging because inference requires a forward pass for every token generated. (Stopping criteria may also differ, but that's another tangent.) Every generated token performs attention over every previous token, both the context (or "prompt") and the previously generated tokens (important for self-consistency); this is a quadratic operation in the vanilla case. Model sizes are large, often spanning multiple machines, and the information for later layers depends on earlier ones, meaning inference has to be pipelined. The naive approach would be to have a single transaction processed exclusively by a single instance of the model. This is expensive! Even if each model can be crammed into a single A100, if you want to run something like Codex or ChatGPT for millions of users with low-latency inference, you'd have to have thousands of GPUs preloaded with models, and each transaction would take a highly variable amount of time. If a model spans multiple machines, you'd achieve at most 1/n utilization, because each shard has to remain loaded while the others process; and if you want to do pipeline parallelism like in PipeDream, you'd have to deal with attention caches, since you don't want to recompute every previous state each time.
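A minimal toy of the attention-cache point above, with made-up sizes and random weights. It is not a real transformer layer (no MLP, no multiple heads, no vocabulary); it only shows the bookkeeping: the naive loop re-projects every previous hidden state on every generated token, while the cached loop appends one row of K and V per step.

    # Toy single-head attention decoder loop; illustrates K/V caching only.
    import numpy as np

    d = 64                                         # hidden size (illustrative)
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    def attend(q, K, V):
        # Attention of a single query vector over all cached keys/values.
        scores = K @ q / np.sqrt(d)                # (t,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                         # (d,)

    def generate_naive(prompt, steps):
        # Recompute K and V for the whole sequence at every step:
        # quadratic work over the whole generation.
        seq = list(prompt)
        for _ in range(steps):
            H = np.stack(seq)                      # (t, d)
            K, V = H @ Wk, H @ Wv                  # rebuilt from scratch each step
            q = seq[-1] @ Wq
            seq.append(attend(q, K, V))
        return seq

    def generate_cached(prompt, steps):
        # Keep a running K/V cache; each step only projects the newest token.
        seq = list(prompt)
        K = np.stack([h @ Wk for h in seq])
        V = np.stack([h @ Wv for h in seq])
        for _ in range(steps):
            q = seq[-1] @ Wq
            new = attend(q, K, V)
            seq.append(new)
            K = np.vstack([K, new @ Wk])           # append one row instead of rebuilding
            V = np.vstack([V, new @ Wv])
        return seq

    prompt = [rng.standard_normal(d) for _ in range(8)]
    assert np.allclose(generate_naive(prompt, 4), generate_cached(prompt, 4))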
> Does that estimate include amortization of upfront development costs?

The answer is almost certainly "no."

A service like ChatGPT is expensive because it requires heavy-duty GPU computation. The GPU required to run it (the A100) is said to cost about $150k. If each query costs about 3 cents, that means the card would have to execute the model about 5 million times before it turns a profit. Maybe a bit more if we include the electricity bill, and even more if Microsoft charges extra for the service, since they want to make a profit. I don't think these numbers sound very out of line. It would be easier to judge the feasibility if we knew how fast those cards can execute the model. If it takes a second to run, then a few cents seems about right; if it takes a few milliseconds, then it's a lot less than a few cents, unless Microsoft charges a huge premium for the servers.

An 80GB A100 is not $150k, more like $10-15k. But the model needs about 350GB, so I'm not sure one 80GB A100 will be enough?

Because it's a language model: it doesn't really query information, but assumes relationships between words. Meaning that "the words" have to be encoded and brought into relation by the text. Now there are different ways to achieve this, but in essence it has to know everything all at once, plus instructions on how to handle that. You can actually ask it to explain how you could create a natural language processing algorithm yourself, and it will even give you a starter framework in the language of your choice. But fair warning: for me it was like a 6-hour-deep rabbit hole :D

The model is large, and every instance likely (not sure to what absolute degree they optimized the model) requires several GPUs (or high-grade accelerators) to run at a moderate speed.

Read the papers. Apparently each query requires hundreds of GBs of GPU RAM on several expensive accelerator cards.

Is the H100 deployed at Azure? I wonder how much more efficient that would be over A100s.
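The ~350GB figure that keeps coming up is consistent with simply storing 175B parameters at 16-bit precision. A quick sanity check; the sizes are assumed, and this ignores activations, attention caches, and anything else that only adds to the total:

    # Rough weight-memory estimate for a GPT-3-class model (assumed sizes only).
    params = 175e9            # 175B parameters
    bytes_per_param = 2       # fp16 / bf16 weights

    weight_gb = params * bytes_per_param / 1e9
    print(f"weights alone: ~{weight_gb:.0f} GB")                              # ~350 GB

    a100_gb = 80
    print(f"80GB A100s just to hold the weights: {weight_gb / a100_gb:.1f}")  # ~4.4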
Which seems insane considering Stable Diffusion can run on an M1 MacBook.

Sure, but they are totally different algorithms doing different things.

Well, yeah.

Basically GPU/compute costs being so expensive.
Probably just the chat cost itself.

Also, a whole boatload of development costs will eventually be passed on to consumers. For a cheaper alternative, try https://text-generator.io
It also analyses images, which OpenAI doesn't do.

My question is how long it will be before the average high-end computer can run it. How long before your average smartphone? Memory shipped with computers has been stagnant for a decade. Maybe this will be the next use case that makes larger amounts of memory mainstream.

At the same time, somehow Tesla still manages to cram more and more neural nets into that small memory. So it could also be that many neural nets are just not really efficient yet.

People are already trying to put cutting-edge models onto consumer hardware: https://news.ycombinator.com/item?id=32678664 We live in a really exciting age :).

Local AI models will also finally give Microsoft reasons to require more hardware for coming Windows versions. Now they have to require obscure security chips and stuff, but in the future they might have some local Cortana thingy or something that requires a certain amount of computational power.

Could models like ChatGPT run on hardware like the Tesla Dojo? If so, maybe Elon should donate some...

Does Dojo even exist? He kept talking about how it was almost ready a couple of years ago; no word since, which is strange from such a braggart.

They unveiled it on AI Day 2021, talked more about it on AI Day 2022, and in theory it should start operating in Q1 2023.

It would be a good idea, no?

I think it's actually quite cheap for what it is.

What useful purpose have you found for ChatGPT, given the "it can return inaccurate results posed as accurate" problem?

It does coding things extremely well. Are there some errors here or there at times? Yes, but in general it does it excellently. I think this is a good example of not letting the perfect be the enemy of the good. It will write 200 lines of code for me which would maybe take me a few hours. I have to spend 15 minutes cleaning it up, but it still saved me 80% of the time. It's a massive win. It's also great for writing articles or emails. I write what I want to say into ChatGPT and tell it to rewrite it to be pleasant and less harsh, and it does a great job of that.

Have it do stuff you know how to do, just a lot faster. Or, even if you don't know exactly how to do it, check what it gave you to see if it produces the expected results. For example, it gives you code; you run that code to see if the outputs are as expected.

Yes, it works well for that! I used ChatGPT recently to write a quick code snippet that turned out better than what I found on SO or could have written myself 50X slower. https://www.robnugen.com/journal/2023/01/14/chatgpt-helped-m...

Python.

If you really measure what is being run, it is more likely well-optimized CUDA GPU assembly kernels - or, at this point, it might already be some exotic TPU-like accelerator assembly. This hubris over the top-level language in the system is so passé, so 2000s.

At this scale, the ML models are usually compiled into a format that runs independently of Python, so the answer isn't "Python".
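On the "compiled into a format that runs independently of Python" point, here's a minimal sketch of what that typically looks like, using a toy PyTorch module exported to ONNX. Everything here (the tiny model, shapes, and file name) is illustrative, not how OpenAI actually serves their models; real deployments go further with things like TensorRT, Triton, or custom kernels.

    # Export a toy PyTorch model to ONNX so a runtime other than the Python
    # interpreter can execute it in the serving hot path.
    import torch
    import torch.nn as nn

    class TinyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

        def forward(self, x):
            return self.net(x)

    model = TinyModel().eval()
    example_input = torch.randn(1, 128)

    # The exported graph is what actually runs in production; Python is only
    # the authoring tool here.
    torch.onnx.export(model, example_input, "tiny_model.onnx",
                      input_names=["x"], output_names=["logits"])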