Fork of Facebook’s LLaMa model to run on CPU

github.com

246 points by __anon-2023__ 3 years ago · 177 comments

dsign 3 years ago

The thing I like the most about the current AI wave is the pressure it is putting on computing hardware. Yes, mobile phones with long battery lives are cool and all of that, but most of the cool things I like are locked behind huge computational requirements.

  • TaylorAlexander 3 years ago

    Agree. I work in robotics and we never have enough compute. I want to see us get to the point where the most advanced robot ever has all the compute it needs onboard, and that means huge growth in compute density and efficiency are needed.

    • ben_w 3 years ago

      That's genuinely surprising.

      What sort of on-board compute do you typically have today?

      • RugnirViking 3 years ago

        A common example from my robotics experience (mainly mobile robots) has been getting something powerful enough to run our image recognition / interpret our sensor data. We often have several microcontrollers (think: Arduino equivalents running C++ or C) which handle all the motor control etc., and a high-level system (it used to often be a Raspberry Pi, now more often an NVIDIA Jetson Nano) listening to all of those and using most of its computing power on some kind of sensor data, usually image recognition or processing TOF camera/lidar/radar data etc. We often have to optimise hard to get a couple of cycles or "frames" per second with these, which really puts limitations on how robots respond (a 250ms delay is veeeery noticeable, especially if it's in obstacle avoidance - relatively common)

        • zamnos 3 years ago

          Limiting ourselves to onboard compute available on mobile robots is one thing, but even for fixed-installation robots, i.e. an arm in a factory where space and power aren't limited, we're very much still limited by compute capacity. Trying to use robots to do something as simple as folding clothes still cannot be done at a reasonable speed. Yeah, on a personal level, just buck up and spend the 20 minutes folding your clothes, or hire a maid to do it for you, but the complexity of automating the task of folding clothes by a robot is a stand-in for other tasks in industry that we still can't automate, and have to hire a human for, because the complexity is still too high for our current computing power.

          Researchers at UC Berkeley came out with an algorithm they named SpeedFolding in October of last year. Watch https://youtu.be/UTMT2WAUlRw?t=511 and then realize that the linked excerpt is sped up 9x.

          If we had 9x faster compute we could have laundry-folding robots, which is one thing, but that amount of compute would enable robots to do tons more tasks in industry.

          • mlsu 3 years ago

            Robotics is a double whammy, you have compute problems but you also have actuation.

            Getting robots to move quickly is easy; getting them to move quickly to exactly where you want them, or with exactly as much force... that is much, much more difficult. Double for mobile robots where you don't have a good energy source. If cost is an issue that is another dimension -- powerful and accurate actuators are extremely expensive.

      • bick_nyers 3 years ago

        I don't work in the field but just to kind of put it into perspective, a 12v 100A LiFePO4 battery has 1200 Watts capacity and weighs 30 pounds. A typical gaming PC (which to be fair, is more willing to trade power for performance) consumes about 600 Watts per hour. Problem for a Tesla? Not so much. Problem for a lightweight drone? Definitely.

        • idiotsecant 3 years ago

          ahhh the units in this post are making my eye twitch.

          • imtringued 3 years ago

            I would be mad too, if my gaming PC was demanding 300 Watts of power but it took half an hour to ramp up ;)

          • bick_nyers 3 years ago

            I know Watts per hour is not the right way to phrase it but I feel it helps for those that don't know. Also, I just don't like saying Amp. Hours :)

            • idiotsecant 3 years ago

              Watts per hour implies watts/hour. Watt-hour implies a number of watts multiplied by a length of time. Also known as energy. Watts are power. Watt hours are energy. Two different things. Watts/hour is nothing.
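
              To make it concrete, here's a minimal worked example in Python using the 1200 Wh and 600 W figures from the battery comment above (the numbers are illustrative):

                # Watt-hours are energy, watts are power; energy divided by power gives time.
                battery_voltage_v = 12.0      # volts
                battery_capacity_ah = 100.0   # amp-hours
                load_power_w = 600.0          # watts drawn by the hypothetical gaming PC

                energy_wh = battery_voltage_v * battery_capacity_ah  # 1200 watt-hours of energy
                runtime_h = energy_wh / load_power_w                 # 2.0 hours of runtime

                print(f"{energy_wh:.0f} Wh / {load_power_w:.0f} W = {runtime_h:.1f} h")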

              • slaw 3 years ago

                Watt/hour is speed of power. It doesn't make any sense.

                • Tostino 3 years ago

                  No, you misunderstand what it means in that case. Watt-hours are comparable to joules. 1 watt-hour = 3600 joules.

                  • slaw 3 years ago

                    watt*hour=joule

                    watt-hour: you cannot subtract time from power, it doesn't make any sense.

                    watt/hour: delta of power per time, something very weird.

      • TaylorAlexander 3 years ago

        The NVIDIA Jetson boards are popular, but even with a full desktop processor and a state-of-the-art GPU, you can easily drown them in data from a LIDAR sensor or a few cameras. Especially since robots may also need fast response times.

        There is another reply to your comment that shares a lot of what I have experienced. When you have so many pieces of code that need to run, and a good handful of them are working on something like LIDAR point clouds with a million 3D points in them, plus some cameras running several different image recognition and segmentation algorithms, and you want fast cycle times, it all just adds up. Every serious robot I have ever worked on is maxing out its system, even ones at Google X with a full desktop CPU, a high-end NVIDIA graphics card, and a couple of secondary ARM CPUs.

  • aqme28 3 years ago

    Crazy to me that as soon as one GPU wave is dying (crypto), another one is picking up the slack.

    • atleastoptimal 3 years ago

      Which is a good thing. So glad all that GPU compute is being used on cool stuff rather than running SHA-256 18 quintillion times

      • kkielhofner 3 years ago

        Definitely a good thing but FYI it hasn't been profitable/feasible to mine bitcoin (SHA-256) on GPU for many many years as ASIC based miners have completely taken over. I've talked about it plenty on HN but any way you slice it crypto is an unbelievable waste of resources in every possible way regardless.

        What really (finally) more or less killed GPU mining was the Ethereum move to PoS (Proof of Stake).

      • DJBunnies 3 years ago

        They can’t both be cool?

        • moffkalast 3 years ago

          Nothing cool in throwing away lots of resources for no reason. In fact there's substantial heating involved.

        • nightfly 3 years ago

          One is a contest to waste the most resources, one has potential to actually have useful results

        • ben_w 3 years ago

          Only if you really like big numbers for the sake of them. Otherwise, one is just straight up snake oil[0], and the other… is kinda hard to tell yet, because while I'm really impressed, I don't know if it's {a toy, a tool, the first sign of a major transformation}.

          [0] did you know the original snake oil contains more omega-3 and therefore improves cognitive function when compared to lard? I did not. But you can get omega-3 elsewhere, and the people who made the term synonymous with fraud didn't use those snakes, so…

      • varispeed 3 years ago

        > running SHA-256 18 quintillion times

        or games. People could have been studying or doing something more important than wasting time and energy. I get that it is entertainment, but so are board games, and those don't require mining rare-earth minerals or putting pressure on the grid, as you can always play board games by candlelight.

        • counttheforks 3 years ago

          Or TV shows or movies. People could have been studying or doing something more important than wasting time and energy.

          Or going outside. People could have been studying or doing something more important than wasting time and energy.

          Or not being locked in the education facility. People could have been studying or doing something more important than wasting time and energy.

        • kkielhofner 3 years ago

          I'm of two minds on this. I'm not a gamer so part of me thinks gaming is a complete waste of time and resources. Then again, the same could be said about almost any hobby/pastime.

          That said, gaming is what gave us GPUs (which have developed for gaming over the course of decades) so that we can now utilize them for more interesting and "productive" applications.

          So, for me, in the end I'm happy the PC gaming industry and user base has been pushing GPU capability.

        • KMnO4 3 years ago

          Be careful. The gaming industry has successfully conditioned people into believing they need a $1500 GPU with the TDP of a microwave so they can play the next unfinished-at-release AAA title.

        • mrtranscendence 3 years ago

          Monopoly by candlelight, just the future I had always envisioned.

          • BizarroLand 3 years ago

            Why are you wasting candlestick on playing games? It's such a waste! Don't you know bees died to make that candle?

            The most ecologically friendly thing you can do is go to sleep. If you want to play games, do it while the sun is out!

            (/s, just in case)

          • varispeed 3 years ago

            If more people played Monopoly, they would have realised the Western economy is at the stage where a few players have bought all the properties and utilities.

    • saurik 3 years ago

      One day we'll find out that all of the VR, crypto, and maybe now AI bubbles were nothing but conspiracies being driven by big-GPU to keep their share price up.

      • schaefer 3 years ago

        Speaking for myself, I have already gotten more use out of 2 weeks of chatgpt than I have out of 16 years of Bitcoin

      • taneq 3 years ago

        VR has been a godsend for forcing hardware, OS and driver developers to actually pay attention to jitter and max latency. If crypto means we get nice fast pretty games and fancy AI then I’m for it. :)

      • toxicFork 3 years ago

        The universe was a hoax invented by a GPU company

      • psychphysic 3 years ago

        DLSS and DLAA were at least, FSR proved that.

    • messe 3 years ago

      Charlie Stross (cstross on here) had a fun blog post[1] about this phenomenon just a week and a half ago.

      > As for what you should look to invest in?

      > I'm sure it's just a coincidence that training neural networks and mining cryptocurrencies are both applications that benefit from very large arrays of GPUs. [...]

      > If I was a VC I'd be hiring complexity theory nerds to figure out what areas of research are promising once you have Yottaflops of numerical processing power available, then I'd be placing bets on the GPU manufacturers going there

      [1]: https://www.antipope.org/charlie/blog-static/2023/02/place-y...

    • favaq 3 years ago

      Gee, it's almost as if GPUs are useful.

    • nirav72 3 years ago

      For most of the 20th century, the bulk of the energy humanity was able to extract was used for industrialization. Now it seems that a vast bulk of the energy being extracted will go towards computation.

      • xmcqdpt2 3 years ago

        I doubt it, frankly. Computation consumes a lot of energy, true, but it is dwarfed by how much energy we use in transportation and food production. Energy use per capita in most of the Global North is about 75,000 kWh per year:

        https://ourworldindata.org/grapher/per-capita-energy-use

        That's like the average person running 27 NVIDIA A100s at max capacity at all times!
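
        Back-of-envelope, assuming roughly 300 W of draw per A100 (the exact figure depends on which A100 variant's TDP you assume), the arithmetic looks like this:

          # Per-capita energy use expressed as "A100s running flat out".
          energy_per_year_kwh = 75_000        # approx. per-capita energy use, Global North
          hours_per_year = 365 * 24           # 8760

          avg_power_kw = energy_per_year_kwh / hours_per_year  # ~8.6 kW of continuous power
          a100_power_kw = 0.3                 # assumed ~300 W draw per A100

          print(f"~{avg_power_kw:.1f} kW continuous, or about "
                f"{avg_power_kw / a100_power_kw:.0f} A100s at full power")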

        • vasco 3 years ago

          Yeah but every time we discover a new interesting thing to do with computers the requirements go up by several orders of magnitude. How many more orders of magnitude of energy can we spend on food production from here with current projections of world population to peak in 2100?

      • alchemist1e9 3 years ago

        The singularity and fusion power are probably interlinked because of this. Once one happens the other will and in either order.

        • ben_w 3 years ago

          Doubt it, unless you include solar as fusion, but then we've already got it.

    • college_physics 3 years ago

      GPU conspiracy or just the side-effect of the decline of Intel?

      • swyx 3 years ago

        There's an economics theory of "supply creates its own demand". We wanted to do ML, GPUs were around for games, we repurposed them for ML, and ML architectures that benefit from GPUs won the "hardware lottery" (an influential paper from Sara Hooker, in case you are unaware)

    • layer8 3 years ago

      Just imagine if Bitcoin, GPT and Half-Life had come out at the same time.

  • pjmlp 3 years ago

    What we will get is specialized hardware, with not-so-open APIs anyway.

    With a bunch of people trailing behind with "it kind of works" open alternatives.

    • dsign 3 years ago

      It sounds like you are complaining about capitalism :-)

      It's not so bad. Nvidia could come and say, "hey, I'm going to lock down your GPU so that you can only use it to render polygons in my whitelisted list of video-games, and then you pay us $$$$$$ to buy our 'datacenter' thingy for anything else." But if they do it, people will go and buy the competitor's product.

      And yes, probably some of their 4090s are being bought by rich kids with their parents' money, but I reckon most are sales to professionals, people who would justify their purchase decision with more than playing first-person shooters. I, for example, play videogames with my gf, and we have equivalent GPUs. Hers is AMD and costs less than mine, even though it does the same, but I went for Nvidia so that PhysX would be available and I could use PyTorch and Numba+GPU and even C++ CUDA. The moment Nvidia locks that down, I'll have to switch to AMD.

      • zirgs 3 years ago

        > hey, I'm going to lock down your GPU so that you can only use it to render polygons in my whitelisted list of video-games

        You just described gaming consoles.

      • pjmlp 3 years ago

        As you will find in my comment history, I am perfectly fine with commercial products and APIs.

        Good luck with AMD.

  • HeartStrings 3 years ago

    Johns Hopkins is working on organoids that will replace silicon GPUs for AI.

mrtksn 3 years ago

Unlike Stable Diffusion, I don't stumble upon people who actually use it. Are there examples of the output this can generate? What happens once you manage to run the model?

  • thot_experiment 3 years ago

    I've been playing around with LLMs recently and it's definitely interesting stuff. I've mostly focused on roleplay/MUD applications and it's not quuitteee there, but it's pretty good, and its idiosyncrasies are often hilarious.

    (when fed the leaked Bing prompt, my AI decided it was Australian and started tossing in random shit like "but here in Australia, we'd call it limey green" when asked about chartreuse; I assume because the codename for Bing Chat is 'Sydney')

  • sdrinf 3 years ago

    This is very new, give it a few days. Here's one from Shawn: https://twitter.com/theshawwn/status/1632595934839177216

  • knaik94 3 years ago

    I have been using similar LLMs for helping draft fictional stories. The community fine-tuned models are geared towards SFW and/or NSFW story completion.

    See https://github.com/KoboldAI/KoboldAI-Client to read more about current popular models.

    https://koboldai.net/ is a way to run some of these models in the "cloud". There's no account required and the prompts are run on other people's hardware, with priority weighting based on how much compute you have used or donated. There's an anonymous api key and there's no expectation that the output can't be logged.

    The models that run on local hardware are very basic in the quality of their output. Here's an example of a 6B output used to try to emulate ChatGPT: https://mobile.twitter.com/Knaikk/status/1629711223863345154 The model was fine-tuned on story completion so it's not meaningfully comparable.

    It's less popular because the hardware required for great output is still above top-of-the-line consumer specs. 24 GB of VRAM is closer to a bare minimum to get meaningful output, and fine-tuning is still out of reach. There's some development with using services like RunPod.

  • redox99 3 years ago

    On /g/ there's always a very active AI chatbot general that focuses on these models.

  • yieldcrv 3 years ago

    We just need some better GUIs

    Stable Diffusion was in the same place as this in the same time frame of the model getting released. Its only been a few days.

  • KyeRussell 3 years ago

    Pretty sure you wouldn’t see anyone using it commercially as IIRC it’s only public due to a leak.

  • RugnirViking 3 years ago

    I've used LLMs a lot for filling out details in my D&D worlds. Both OpenAI products but also the open-source GPT-J from EleutherAI. Things like writing the text of some books for players to read, which I have to curate just like people do with Stable Diffusion. I've also used it to write songs; it's surprisingly good at taking things like chord progressions written in notation and rolling with variations on them

visarga 3 years ago

It's useless before the model gets instruction and preference tuning. It won't even follow a simple ask; it will just assume it is a list of questions and generate more, or continue with slightly related comments.

FB trained a LLaMA-I (instruction tuned) variant for sports, just to show they can, but I don't think it got released.

popol12 3 years ago

0.35 words/s on my 11th-gen i5 with the 7B model (Framework laptop)

Not so bad!

  • fguerraz 3 years ago

    How long did you have to wait for it to load? On my machine it's been running for 15mins, I'm still waiting for a prompt...

    • popol12 3 years ago

      You get the full answer after completion, so it's normal if you don't get output immediately.

      I computed the speed as: speed = number of words / total run time
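
      As a concrete instance of that calculation (the numbers here are made up for illustration, not my actual measurements):

        # words/s = total words generated / total wall-clock run time
        words_generated = 105
        total_runtime_s = 300.0  # e.g. a five-minute run

        print(f"{words_generated / total_runtime_s:.2f} words/s")  # 0.35 words/s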

    • kristianp 3 years ago

      How much RAM do you both have?

haolez 3 years ago

Would it be possible to run the 65B one like this as well? Is the bottleneck just the RAM, or would I need an absurd number of CPUs as well?

It's not that hard to create a consumer-grade desktop with 256GB in 2023.

  • gpm 3 years ago

    I don't know about this fork specifically, but in general yes absolutely.

    Even without enough ram, you can stream model weights from disk and run at [size of model/disk read speed] seconds per token.

    I'm doing that on a small GPU with this code, but it should be easy to get this working with the CPU as compute instead (and at least with my disk/CPU, I'm not sure it would even run slower; I think disk read would probably still be the bottleneck)

    A lack of an absurd number of CPUs just means it's slow, not impossible.

    https://github.com/gmorenz/llama/tree/ssd
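
    To put rough numbers on the [size of model/disk read speed] estimate, a minimal sketch (the model size and disk speed below are assumptions for illustration, not measurements from that branch):

      # Streaming weights: every generated token re-reads the whole model from disk,
      # so the floor on latency is model size divided by sustained read speed.
      model_size_gb = 13        # e.g. roughly a 7B model in 16-bit weights
      disk_read_gb_per_s = 3.0  # assumed sustained NVMe read speed

      print(f"~{model_size_gb / disk_read_gb_per_s:.1f} s per token")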

    • haolez 3 years ago

      Yeah, I find this area fascinating. Like, it's very cool to run a 7B params model locally, but it must feel like a toy when compared to ChatGPT, for example.

      However, the 65B-parameter model, according to the benchmarks, is such a beast that you might be able to do some things on it that are not possible with ChatGPT (despite all of ChatGPT's quality-of-life features). Amazing times.

  • downvotetruth 3 years ago

    You don't need 256 GB. A pair of the new 48GB DDR5 sticks along with a pair of 32GB sticks should work in a consumer DDR5 motherboard to fit the weights. It does burst when initially loading, so a fast disk with about the same swap size as RAM seems necessary. It took about 25 minutes to generate a single 500-character response using a 5800X and 32 GB of DDR4, but I was not able to get it to run on more than 1 thread with the 7B model.
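
    For reference, the rough weight-size arithmetic behind that stick combination (a sketch assuming 16-bit parameters, which is how the released checkpoints are stored):

      # 65B parameters at 2 bytes each vs. 48+48+32+32 GB of DDR5.
      params = 65e9
      bytes_per_param = 2                          # fp16/bf16 weights

      weights_gb = params * bytes_per_param / 1e9  # ~130 GB of raw weights
      ram_gb = 48 + 48 + 32 + 32                   # 160 GB across four sticks

      print(f"~{weights_gb:.0f} GB of weights vs {ram_gb} GB of RAM")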

  • basch 3 years ago

    I wonder if we will start to see complex prune functions and tools pop up.

    So before you start a task, you sort of describe the domain, and the model is separated into the third most useful and relevant to that topic/query, and the two-thirds most distant from that realm. Then either just that third is used in a detached fashion, or it works as two layers of cache, one in RAM and one on disk.

benenglish 3 years ago

Wondering how difficult this would be to get running on an M1 Max?

Havoc 3 years ago

Would it not be possible to run on both the GPU and CPU at the same time, in whatever proportion the hardware is available?

Most gaming desktops have a solid GPU but not enough VRAM. Pity having the GPU idle here.

ur-whale 3 years ago

> 1. Create a conda environment

Uh-oh, bad start.

  • _flux 3 years ago

    Why is it a bad start?

    It could be venv as well, I suppose, I haven't used conda.

    • disgruntledphd2 3 years ago

      Conda is gonna work much, much, much better for these kinds of applications, as that's what it's mostly used for, i.e. scientific/numerical computing with C/C++ dependencies.

      • ur-whale 3 years ago

        Conda is an abomination that will download 4Gig of unnecessary shite and carelessly dump it into your system, thereby ruining your existing configuration in the process.

        Use it in a container or a VM unless you enjoy re-installing your system from scratch.

        Or better still, don't use it at all and let it wither away: these kinds of braindead projects need to be put down with extreme prejudice.

        • bshipp 3 years ago

          The download size is large but conda doesn't ruin any existing configuration unless you explicitly tell it to be your native python environment. Conda is set up as a self-contained set of independent environments. Why would your system care what's inside the Anaconda directory unless you explicitly add it to your PATH/bash?

          • ur-whale 3 years ago

            I haven't touched that steaming pile of shite in a looong while, so - who knows - they might have managed to minimize the amount of havoc they wreak on their users' systems.

            But ... I seem to recall ... Conda tries to install GPU drivers, does it not? ... Is that not the case anymore?

            Because if it still does, your theory about "Why would your system care" and all that doesn't really hold water.

            • bshipp 3 years ago

              I use miniconda on Linux but it's never attempted to install graphics drivers on Windows.

        • orangepurple 3 years ago

          Chill. Use Miniconda3 as a light alternative. Conda is unnecessary. I agree nobody should ever use Conda unless they are extreme noobs. We all have to start somewhere.

Smith42 3 years ago

Since this is pytorch it should run on cpu anyway. What am I missing?

  • progman32 3 years ago

    Reading the patch: https://github.com/facebookresearch/llama/compare/main...mar...

    Looks like this is just tweaking some defaults and commenting out some code that enables cuda. It also switches to something called gloo, which I'm not familiar with. Seems like an alternate backend.

    • markasoftware 3 years ago

      you don't actually need to switch to gloo, I just have no idea what I'm doing.

      • refulgentis 3 years ago

        Lol, all my best work has been when I don’t know what I’m doing and it’s refreshing to see someone moving the ball forward and feeling the same way. Kudos

    • rajman187 3 years ago

      Gloo is a communication protocol for distributed computation (think along the lines of MPI)
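
      For anyone curious, Gloo is one of the backends PyTorch's torch.distributed supports (alongside NCCL, which requires CUDA, and MPI). A minimal single-process CPU initialization with it might look like this (purely illustrative, not taken from the fork):

        import os
        import torch.distributed as dist

        # Fake a one-node "cluster" so the distributed init succeeds on CPU.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")

        # gloo works on CPU; nccl would fail without a CUDA device.
        dist.init_process_group(backend="gloo", rank=0, world_size=1)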

  • Zetobal 3 years ago

    I guess the simple fact that it didn't before his patch?

    • cinntaile 3 years ago

      Usually you can trivially have the model run on CPU or GPU by simply writing .cpu() in specific places, so he's wondering why this isn't the case here.

      • markasoftware 3 years ago

        that's literally all I did (plus switching the tensor type). I'd imagine people are posting and upvoting this not because it's actually interesting code, but rather because it runs unexpectedly fast on consumer CPUs and that's not something they considered feasible before.

        • roenxi 3 years ago

          That is vastly underestimating how tricky it is to make novel pieces of software run. There is a huge fringe of people who know how to click things but not use the terminal and a large fringe of people who know how to run "./execute.bat" but not how to write syntactically correct Python.

          But a lot of those people want to play with LLMs.

        • ComplexSystems 3 years ago

          How are you getting this to run fast? I'm on a top of the line M1 MBP and getting 1 token every 8 minutes.

      • sva_ 3 years ago

        Or better yet, define a device = 'cpu', and use tensor.to(device).
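
        Something like this, as a minimal sketch (the Linear layer is just a stand-in, not the LLaMA code itself):

          import torch

          device = "cpu"  # or "cuda" if a GPU is available

          # Use 32-bit floats on CPU; fp16 matmuls are poorly supported there.
          torch.set_default_dtype(torch.float32)

          model = torch.nn.Linear(4096, 4096).to(device)  # stand-in for a real model
          x = torch.randn(1, 4096, device=device)

          with torch.no_grad():
              y = model(x)
          print(y.shape)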

    • tmalsburg2 3 years ago

      If someone else wrote this comment, would you find it useful?

LangIsAllWeNeed 3 years ago

Can it beat Mark Zuckerberg at congressional deposition wordsmithing though?

  • toxicFork 3 years ago

    Someone must have trained an LLM for that for sure.

    • LangIsAllWeNeed 3 years ago

      Oh yes:

      Hackernews senator: "Someone on the internet said Meta, aka Facebook, is not considered a real data-native, clean-coder and high-IQ company unless your new language model exceeds the elegance and slipperiness of Mark Zuckerberg's (your) language output in Senate hearings. He is smoother than a lake in the metaverse."

      Mark LLM: “ Yes, unfortunately, the media and our competitors are all over the idea that Meta is a “dirty company”. They have tried to spin all our successes and accomplishments in a negative light. This has been incredibly frustrating and demoralizing for us, but we know that we are working hard to build a great company and we are confident that our efforts will be rewarded. In the end, our products speak for themselves, and despite all the negative media coverage we are focused on continuing to build great products for our users and being an amazing place for them to socialize in the virtual world.”

      • RugnirViking 3 years ago

        I have to say "he is smoother than a lake in the metaverse" is presumably accidental, based on the quality of the rest of that text, but it has to be one of the wittiest phrases I've seen LLMs come out with to date

        • jcuenod 3 years ago

          I opened the twitch AI seinfeld stream once and stumbled into a conversation that went something to the effect of:

          George: I really like that orange sweater

          Jerry: Yeah, I just found black so depressing

          George: Orange is such a great color! Orange is the new black.

          ...

        • LangIsAllWeNeed 3 years ago

          That was my prompt, I am hackernews senator. People do sometimes ask how many A100s it takes to run me.

fsiefken 3 years ago

Would running on a CPU be more or less power-efficient than running on a GPU at the same words-per-second rate?

bilsbie 3 years ago

What's the rough idea of how this is possible? I thought you needed the parallelism of a GPU.

crazysim 3 years ago

Could this fit into GitHub Codespaces's top VM?

  • DefineOutside 3 years ago

    The 65 billion model is 160 GB so no - unless you request larger storage spaces from github. 7 billion and 13 billion should fit though.

meghan_rain 3 years ago

How long does one token take to infer on an average CPU?

  • markasoftware 3 years ago

    I tested on a decidedly above average CPU, and got several words per second on the 7B model. I'd guess maybe one word per second on a more average one?

  • kristianp 3 years ago

    From the readme: On a Ryzen 7900X, the 7B model is able to infer several words per second, quite a lot better than you'd expect!

2Gkashmiri 3 years ago

I have a friend who owns a MacBook Pro M1 Max. What kind of performance can I get?
