Fork of Facebook’s LLaMa model to run on CPU

246 points by __anon-2023__ 3 years ago · 177 comments

Reader

dsign 3 years ago

The thing I like the most about the current AI wave is the pressure is putting on computing hardware. Yes, mobile phones with long battery lives are cool and all of that, but most cool things I like are locked behind huge computational requirements.

TaylorAlexander 3 years ago

Agree. I work in robotics and we never have enough compute. I want to see us get to the point where the most advanced robot ever has all the compute it needs onboard, and that means huge growth in compute density and efficiency are needed.
- ben_w 3 years ago
  
  That's genuinely surprising.
  What sort of on-board compute do you typically have today?
  - RugnirViking 3 years ago
    
    a common example from my robotics experience (mainly mobile robots) has been getting something powerful enough to run our image recognition/interpreting sensor data. We often have something like several microprocessors (think:arduino equivalent running c++ or c) which run all the motor control etc and a high level system (used to often be raspberry pi, now more often nvidia jetson nano) listening to all of those and using most of it's computing power on some kind of sensor data, usually image recognition or processing TOF camera/lidar/radar data etc. We often have to optimise hard to get a couple of cycles or "frames" per second with these, which really puts limitations on how robots respond (250ms delay is veeeery noticable, especially if it's in obstacle avoidance - relatively common)
    
    zamnos 3 years ago
    
    Limiting ourselves to onboard compute available on mobile robots is one thing, but even for fixed installation robots, aka an arm in a factory where space and power aren't limited, we're very much still limited by compute capacity. Trying to use robots to do something as simple as folding clothes still cannot be done at a reasonable speed. Yeah, on a personal level, just buck up and spend the 20 minutes folding your clothes, or hire a maid to do it for you, but the complexity of automating the task of folding clothes by a robot is a stand in for other tasks in industry that we still can't automate because the complexity is still too high for our current computing power, and have to hire a human for.
    Researchers at US Berkeley came out with the algorithm they named SpeedFolding in October of last year. Watch https://youtu.be/UTMT2WAUlRw?t=511 and then realize that linked excerpt is sped up 9x.
    If we had 9x faster compute we could have laundry folding robots which is one thing, but that amount of compute would enable robots to do tons more tasks in industry.
    
    mlsu 3 years ago
    
    Robotics is a double whammy, you have compute problems but you also have actuation.
    Getting robots to move quickly is easy; getting them to move quickly to exactly where you want them, or with exactly as much force... that is much, much more difficult. Double for mobile robots where you don't have a good energy source. If cost is an issue that is another dimension -- powerful and accurate actuators are extremely expensive.
  - bick_nyers 3 years ago
    
    I don't work in the field but just to kind of put it into perspective, a 12v 100A LiFePO4 battery has 1200 Watts capacity and weighs 30 pounds. A typical gaming PC (which to be fair, is more willing to trade power for performance) consumes about 600 Watts per hour. Problem for a Tesla? Not so much. Problem for a lightweight drone? Definitely.
    
    idiotsecant 3 years ago
    
    ahhh the units in this post are making my eye twitch.
    
    imtringued 3 years ago
    
    I would be mad too, if my gaming PC was demanding 300 Watts of power but it took half an hour to ramp up ;)
    
    bick_nyers 3 years ago
    
    I know Watts per hour is not the right way to phrase it but I feel it helps for those that don't know. Also, I just don't like saying Amp. Hours :)
    
    idiotsecant 3 years ago
    
    Watts per hour implies watts/hour. Watt-hour implies a number of watts multiplied by a length of time. Also known as energy. Watts are power. Watt hours are energy. Two different things. Watts/hour is nothing.
    
    slaw 3 years ago
    
    Watt/hour is speed of power. It doesn't make any sense.
    
    Tostino 3 years ago
    
    No, you misunderstand what it means in that case. Watt-hours are comparable to joules. 1 watt-hour = 3600 joules.
    
    slaw 3 years ago
    
    watt*hour=joule
    watt-hour: you cannot subtract time from power, it doesn't make any sense.
    watt/hour: delta of power per time, something very weird.
  - TaylorAlexander 3 years ago
    
    The NVIDIA Jetson boards are popular, but even with a full desktop processor and state of the art GPU, you can easily down them in data from a LIDAR sensor or a few cameras. Especially since robots may also need fast response times.
    There is another reply to your comment that shares a lot of what I have experienced. You have so many pieces of code that need to run and a good handful of them are working on something like LIDAR point clouds with a million 3D points in them, plus some cameras running several different image recognition and segmentation algorithms, and you want to have fast cycle times, it just all adds up. Every serious robot I have ever worked on is maxing out its system, even ones at Google X with a full desktop CPU, a high end NVIDIA graphics card, and a couple secondary ARM CPUs.
    
    ben_w 3 years ago
    
    Thanks! :)
    That definitely helps me understand why the footage of the robots in this video had to be sped up: https://youtu.be/Ybk8hxKeMYQ
aqme28 3 years ago

Crazy to me that as soon as one GPU wave is dying (crypto), another one is picking up slack.
- atleastoptimal 3 years ago
  
  Which is a good thing. So glad all that GPU compute is being used on cool stuff rather than running SHA-256 18 quintillion times
  - kkielhofner 3 years ago
    
    Definitely a good thing but FYI it hasn't been profitable/feasible to mine bitcoin (SHA-256) on GPU for many many years as ASIC based miners have completely taken over. I've talked about it plenty on HN but any way you slice it crypto is an unbelievable waste of resources in every possible way regardless.
    What really (finally) more or less killed GPU mining was the Ethereum move to PoS (Proof of Stake).
  - DJBunnies 3 years ago
    
    They can’t both be cool?
    
    moffkalast 3 years ago
    
    Nothing cool in throwing away lots of resources for no reason. In fact there's substantial heating involved.
    
    nightfly 3 years ago
    
    One is a contest to waste the most resources, one has potential to actually have useful results
    
    super256 3 years ago
    
    Why do you think Bitcoin does not have useful results?
    
    idiotsecant 3 years ago
    
    Bitcoin has a use, but there are other options for consensus algorithms that don't waste as much energy as the citizens of a medium sized country and fill the same user case (and other expanded use cases). Why not just do that?
    
    ben_w 3 years ago
    
    Strictly speaking, the comment you're replying to doesn't say which of the two is a contest to waste resources and which has potential to have useful results.
    
    super256 3 years ago
    
    I assumed that this was doubling down on atleastoptimal’s view.
    
    speedgoose 3 years ago
    
    Can you cite an useful result? I can't but I don't think that some people getting richer is useful.
    
    Xerox9213 3 years ago
    
    https://www.elliptic.co/blog/live-updates-millions-in-crypto...
    https://www.cnbc.com/2022/03/23/ukrainian-flees-to-poland-wi...
    In general, the purpose of Bitcoin is not to get rich, but to have a currency that is universally accepted and not tied to a political party’s fiscal decisions.
    
    UncleEntity 3 years ago
    
    In general, so is the Federal Reserve.
    
    super256 3 years ago
    
    Sure: I can buy servers and domains anonymously and buy drugs online, which are illegal in my country.
    
    DJBunnies 3 years ago
    
    Well, I agree with you there.
    
    ben_w 3 years ago
    
    Only if you really like big numbers for the sake of them. Otherwise, one is just straight up snake oil[0], and the other… is kinda hard to tell yet, because while I'm really impressed, I don't know if it's {a toy, a tool, the first sign of a major transformation}.
    [0] did you know the original snake oils contains more omega-3 and therefore improves cognitive function when compared to lard? I did not. But you can get omega-3 elsewhere, and the people who made the term synonymous with fraud didn't use those snakes, so…
  - varispeed 3 years ago
    
    > running SHA-256 18 quintillion times
    or games. People could have been studying or doing something more important than wasting time and energy. I get that it is entertainment, but so are board games and that don't require mining rare earth minerals or putting pressure on the grid as you can always play board games with candles on.
    
    counttheforks 3 years ago
    
    Or TV shows or movies. People could have been studying or doing something more important than wasting time and energy.
    Or going outside. People could have been studying or doing something more important than wasting time and energy.
    Or not being locked in the education facility. People could have been studying or doing something more important than wasting time and energy.
    
    kkielhofner 3 years ago
    
    I'm of two minds on this. I'm not a gamer so part of me thinks gaming is a complete waste of time and resources. Then again, the same could be said about almost any hobby/pastime.
    That said, gaming is what gave us GPUs (which have developed for gaming over the course of decades) so that we can now utilize them for more interesting and "productive" applications.
    So, for me, in the end I'm happy the PC gaming industry and user base has been pushing GPU capability.
    
    KMnO4 3 years ago
    
    Be careful. The gaming industry has successfully conditioned people into believing they need a $1500 GPU with the TDP of a microwave so they can play the next unfinished-at-release AAA title.
    
    mrtranscendence 3 years ago
    
    Monopoly by candlelight, just the future I had always envisioned.
    
    BizarroLand 3 years ago
    
    Why are you wasting candlestick on playing games? It's such a waste! Don't you know bees died to make that candle?
    The most ecologically friendly thing you can do is go to sleep. If you want to play games, do it while the sun is out!
    (/s, just in case)
    
    varispeed 3 years ago
    
    If more people played Monopoly, they would have realised the Western economy is at the stage were a few players bought all properties and utilities.
- saurik 3 years ago
  
  One day we'll find out that all of the VR, crypto, and maybe now AI bubbles were nothing but conspiracies being driven by big-GPU to keep their share price up.
  - schaefer 3 years ago
    
    Speaking for myself, I have already gotten more use out of 2 weeks of chatgpt than I have out of 16 years of Bitcoin
    
    dividedbyzero 3 years ago
    
    Which is just what I'd expect to read on an influential tech forum if grandparent was in fact right.
    
    EVa5I7bHFq9mnYK 3 years ago
    
    14. First Bitcoin was mined 14 years ago. And Bitcoins have not been mined with GPUs since 2013.
    
    arbitrary_name 3 years ago
    
    You pedantry reinforces the point, rather than diminishing it.
  - taneq 3 years ago
    
    VR has been a godsend for forcing hardware, OS and driver developers to actually pay attention to jitter and max latency. If crypto means we get nice fast pretty games and fancy AI then I’m for it. :)
  - toxicFork 3 years ago
    
    The universe was a hoax invented by a GPU company
  - psychphysic 3 years ago
    
    DLSS and DLAA were at least, FSR proved that.
- messe 3 years ago
  
  Charlie Stross (cstross on here) had a fun blog post[1] about this phenomenon just a week and a half ago.
  > As for what you should look to invest in?
  > I'm sure it's just a coincidence that training neural networks and mining cryptocurrencies are both applications that benefit from very large arrays of GPUs. [...]
  > If I was a VC I'd be hiring complexity theory nerds to figure out what areas of research are promising once you have Yottaflops of numerical processing power available, then I'd be placing bets on the GPU manufacturers going there
  [1]: https://www.antipope.org/charlie/blog-static/2023/02/place-y...
- favaq 3 years ago
  
  Gee, it's almost as if GPUs are useful.
  - tnzk 3 years ago
    
    Indeed, except for processing graphics.
- nirav72 3 years ago
  
  For most part of the 20th, a bulk of the energy humanity was able to extract was used for industrialization. Now it seems that a vast bulk of the energy being extracted will go towards computation.
  - xmcqdpt2 3 years ago
    
    I doubt it frankly. Computation consumes a lot of energy, true, but it is dwarfed by how much energy we use in transportation and food production. Energy use per capita in most of the Global North is about 75,000kWh per year,
    https://ourworldindata.org/grapher/per-capita-energy-use
    That's like the average person running 27 NVIDIA A100s at max capacity at all times!
    
    vasco 3 years ago
    
    Yeah but every time we discover a new interesting thing to do with computers the requirements go up by several orders of magnitude. How many more orders of magnitude of energy can we spend on food production from here with current projections of world population to peak in 2100?
  - alchemist1e9 3 years ago
    
    The singularity and fusion power are probably interlinked because of this. Once one happens the other will and in either order.
    
    ben_w 3 years ago
    
    Doubt it, unless you include solar as fusion, but then we've already got it.
- college_physics 3 years ago
  
  GPU conspiracy or just the side-effect of the decline of Intel?
  - swyx 3 years ago
    
    theres an economics theory of “supply creates its own demand”. we wanted to do ml, gpus were around for games, we repurposed them for ml, and ml architectures that benefit from gpus won the “hardware lottery” (influential paper from sara hooker in case you are unaware)
- layer8 3 years ago
  
  Just imagine if Bitcoin, GPT and Half-Life had come out at the same time.
  - senttoschool 3 years ago
    
    Bitcoin miners don't use GPUs.
    
    layer8 3 years ago
    
    They did initially.
pjmlp 3 years ago

What we will get are specialized hardware, with not so open APIs anyway.
With a bunch of people trailling behind with "it kind of works" open alternatives.
- dsign 3 years ago
  
  It sounds like you are complaining about capitalism :-)
  It's not so bad. Nvidia could come and say, "hey, I'm going to lock down your GPU so that you can only use it to render polygons in my whitelisted list of video-games, and then you pay us $$$$$$ to buy our 'datacenter' thingy for anything else." But if they do it, people will go and buy the competitor's product.
  And yes, probably their 4090 are being bought by some rich kids with their parents' money, but I reckon most of it are sales to professionals, people who would justify their purchase decision with more than playing First-person-shooters. I for example play videogames with my gf, and we have equivalent GPUs. Hers is AMD and costs less than mine, even if it does the same, but I went for Nvidia so that PhysX were available and I could use Pytorch and Numba+GPU and even C++ CUDA. The moment Nvidia locks that down, I'll have to switch to AMD.
  - zirgs 3 years ago
    
    > hey, I'm going to lock down your GPU so that you can only use it to render polygons in my whitelisted list of video-games
    You just described gaming consoles.
  - pjmlp 3 years ago
    
    As you will find on my comment history, I am perfectly fine with commercial products and APIs.
    Good luck with AMD.
HeartStrings 3 years ago

John Hopkins are working on organoids that will replace silicon GPUs for AI.
- TecoAndJix 3 years ago
  
  Here is an article from JHU on that topic - https://hub.jhu.edu/2023/02/28/organoid-intelligence-biocomp...
- colordrops 3 years ago
  
  Reminds me of this Choose Your Own Adventure book from 1984. It was about how PCs had organic AI components and each was unique, and you happened to get your hands on a super intelligent one.
  https://www.goodreads.com/en/book/show/755062
- ben_w 3 years ago
  
  As is one of the YouTubers I follow.
  Meatcubator: https://youtu.be/Z_ZGq8Tah0k
  Growing human brain cells: https://youtu.be/V2YDApNRK3g
- pelagicAustral 3 years ago
  
  If they can get pass these new ethical committees...
  - BizarroLand 3 years ago
    
    I'm sure there are rat and mouse brain cells free for the taking from almost any pharmaceutical testing lab.
    If organics is the only factor, I don't know why those wouldn't perform as well as human or ape brain cells.

mrtksn 3 years ago

Unlike Stable Diffusion, I don't stumble upon people who actually use it. Are there examples of the output this can generate? What happens once you manage to run the model?

thot_experiment 3 years ago

I've been playing around with LLMs recently and it's definitely interesting stuff. I've mostly focused on roleplay/MUD applications and it's not quuitteee there but it's pretty good, and it's idiosyncrasies are often hilarious.
(when fed the leaked bing prompt, my AI decided it was Australian and started tossing in random shit like "but here in Australia, we'd call it limey green" when asked about chartreuse, i assume because the codename for bing chat is 'sydney')
sdrinf 3 years ago

This is very new, give it a few days. Here's one from Shawn: https://twitter.com/theshawwn/status/1632595934839177216
knaik94 3 years ago

I have been using similar models like LLM for helping draft fictional stories. The community fine tuned models are geared towards SFW and/or NSFW story competition.
https://github.com/KoboldAI/KoboldAI-Client To read more about current popular models.
https://koboldai.net/ is a way to run some of these models in the "cloud". There's no account required and the prompts are run on other people's hardware, with priority weighting based on how much compute you have used or donated. There's an anonymous api key and there's no expectation that the output can't be logged.
The models that run on hardware locally are very basic in the quality of output. Here's an example of a 6B output used to try to emulate chatgpt. https://mobile.twitter.com/Knaikk/status/1629711223863345154 The model was finetuned on story completion so it's not meaningfully comparable.
It's less popular because the hardware required for the great output is still above the top of line consumer specs. 24 gb vram is closer to a bare minimum to get meaningful output, and fine-tuning is still out of reach. There's some development with using services like runpod.
redox99 3 years ago

On /g/ there's always a very active AI chatbot general that focuses on these models.
- samvher 3 years ago
  
  What's /g/?
  - n1c 3 years ago
    
    https://4chan.org/g/
    (enter at your own risk? I think it's kinda safe but still 4chan)
    
    brobinson 3 years ago
    
    Better to direct people to the catalog: https://boards.4channel.org/g/catalog
    Browsing page by page is not a good idea.
    Search for "aicg" or visit https://boards.4channel.org/g/catalog#s=aicg to see the AI Chatbot General thread (a new one is created every time the previous one hits the reply limit).
    
    koheripbal 3 years ago
    
    Message boards like this are so unreadable unless you have hours upon hours of spare time.
yieldcrv 3 years ago

We just need some better GUIs
Stable Diffusion was in the same place as this in the same time frame of the model getting released. Its only been a few days.
KyeRussell 3 years ago

Pretty sure you wouldn’t see anyone using it commercially as IIRC it’s only public due to a leak.
- input_sh 3 years ago
  
  It's not a leak, it's a shortcut.
  You can download it from Facebook, but it's behind "apply for access" form. Magnet links floating around are just a workaround around that form.
  That said, commercial use is forbidden by the license specified in the form: https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z...
- mrtksn 3 years ago
  
  I wasn't looking for a commercial use but its an Interesting point. Would it be possible to prove that someone is using it commercially?
  1) Spin it up on a cluster in Belarus
  2) ???
  3) Profit?
  - notpushkin 3 years ago
    
    4) Lukashenka grabs your profits and kicks you out ;-)
    
    mrtksn 3 years ago
    
    Why would he do that? More likely scenario would be "gets his share as a an apartment in London".
    
    mylidlpony 3 years ago
    
    That's the oldest authoritarian trick in the book, pretty much any successful business in russia got the same fate for example. They even tried it with nginx.
  - qingdao99 3 years ago
    
    It could be watermarked
    
    mrtksn 3 years ago
    
    how?
    
    qingdao99 3 years ago
    
    https://arxiv.org/pdf/2301.10226.pdf
    https://www.youtube.com/watch?v=XZJc1p6RE78
    
    ad404b8a372f2b9 3 years ago
    
    That emoji attack is hilarious, who would have thought that twitter-speak is watermark resistant.
    
    mrtksn 3 years ago
    
    cool, thanks
RugnirViking 3 years ago

ive used LLMS a lot for filling out details in my dnd worlds. Both openai products but also the open source GPT-J from euluther Things like writing the text of some books for players to read, of I have to curate just like people do with stable diffusion. Also used it to write songs, its surprisingly good at taking things like chord progressions written in notation and rolling with variations on them

visarga 3 years ago

It's useless before the model gets instruction and preference tunings. Won't even follow a simple ask, it will just assume it is a list of questions and generate more, or continue with slightly related comments.

FB trained a LLaMA-I (instruction tuned) variant for sports, just to show they can, but I don't think it got released.

cfcf14 3 years ago

You have to prompt it correctly, non-instruction-aligned models don't behave like agent simulators by default.
qingdao99 3 years ago

Surely it would work with a format like:
User: <question or task>
Assistant:
ShamelessC 3 years ago

Useless!? C'mon.
RugnirViking 3 years ago

it's not that useless, you just have to prompt it the right way (usually by offering an example of the kind of output you want)
notpushkin 3 years ago

So, you need to know how to tune it.
ma2rten 3 years ago

It's still useful, but you need to know how to use it.
skhm 3 years ago

Try giving it a simple request instead ;)

popol12 3 years ago

0.35 words/s on my 11th gen i5 with 7B model (framework laptop)

not so bad !

fguerraz 3 years ago

How long did you have to wait for it to load? On my machine it's been running for 15mins, I'm still waiting for a prompt...
- popol12 3 years ago
  
  You get the full answer after completion, so it’s normal if you don’t get an output immediately
  I computed the speed by doing speed=number of words/ total run time
- kristianp 3 years ago
  
  How much RAM do you both have?
  - popol12 3 years ago
    
    32gb of ram and 64gb on NVME swap

haolez 3 years ago

Would it be possible to run the 65B one like this as well? Is the bottleneck just the RAM, or would I need an absurd number of CPUs as well?

It's not that hard to create a consumer-grade desktop with 256GB in 2023.

gpm 3 years ago

I don't know about this fork specifically, but in general yes absolutely.
Even without enough ram, you can stream model weights from disk and run at [size of model/disk read speed] seconds per token.
I'm doing that on a small GPU with this code, but it should be easy to get this working with the CPU as compute instead (and at least with my disk/CPU, I'm not even sure that it would run even slower, I think disk read would probably still be the bottleneck)
A lack of an absurd number of CPUs just means it's slow, not impossible.
https://github.com/gmorenz/llama/tree/ssd
- haolez 3 years ago
  
  Yeah, I find this area fascinating. Like, it's very cool to run a 7B params model locally, but it must feel like a toy when compared to ChatGPT, for example.
  However, the 65B parameter, according to the benchmarks, is such a beast that you might be able to do some things on it that are not possible on ChatGPT (despite all of ChatGPT's quality of life features). Amazing times.
downvotetruth 3 years ago

You don't need 256 GB. A pair of the new 48GB DDR5 will work along with a pair of 32GB sticks should work in a consumer DDR5 MB to fit the weights. It does burst when initially loading. So, a fast disk with about the same swap size as RAM seems necessary. It took about 25 mins to generate a single 500 character response using a 5800X & 32 GB DDR4, but I was not able to get to it to run on more than 1 thread with the 7B model.
- Tepix 3 years ago
  
  All current Ryzen CPUs do not work with 48GB DDR5, right? That means if you want to go beyond 128GB you can get an old X399 board (there are some reports of people getting 256GB to work) or more recent Threadripper boards.
  - downvotetruth 3 years ago
    
    Current Ryzen CPUs do not work with either 24GB or 48GB DDR5.
- downvotetruth 3 years ago
  
  Follow up: https://github.com/facebookresearch/llama/issues/79#issuecom... claims 65B was able to fit in 128 GB by unsharding & merging weights into a single file instead of the multiple pth with 172Gb max swap file usage & appears to stream to GPU.
- haolez 3 years ago
  
  Why? Is it a limitation of the model or just something with the configuration that you couldn't figure out for this test?
  - downvotetruth 3 years ago
    
    I tried mark's OMP_NUM_THREADS suggestion (https://news.ycombinator.com/item?id=35018559), did not see any an obvious change to make it parallel, and given the huggingface patch (https://github.com/huggingface/transformers/pull/21955) once it gets in is suppose to allow streaming from RAM to the GPU. So, for me it was not worth the effort to keep working on the CPU version as even the best case ~30X speedup will still take around a minute to run the 7B.
basch 3 years ago

I wonder if we will start to see complex prune functions and tools start to pop up.
So before you start a task, you sort of describe the domain, and the model is separated into the third most useful and relevant to that topic/query, and 2/3rd most distant from that realm. Then either just the 1/3rd is used in a detached fashion, or it works as 2 layers of cache, one in ram one on disk.

benenglish 3 years ago

Wondering how difficult this would be to get running on a m1 max?

ComplexSystems 3 years ago

I got one token every 8 minutes or so.
- popol12 3 years ago
  
  Using which model ? On a pretty mid range i5 11th gen I'm getting 0.35 token/s, using the 7B model. Haven't tried the bigger models.
- 2Gkashmiri 3 years ago
  
  Is that good? Not good?
  - gorbypark 3 years ago
    
    A token is approximately 4 characters. So, four characters per 8 minutes is pretty slow. This comment would take 1224 minutes to generate, if I was an AI.
  - Tepix 3 years ago
    
    Usually you want tokens per second, not seconds per token. So it's a bad sign.
swyx 3 years ago

another commenter posted a fork that does it https://news.ycombinator.com/item?id=35067469
per the readme it looks like there a few bugs to figure out in case anyone here is a pytorch expert

Havoc 3 years ago

Would it not be possible to run on both gpu and cpu at same time in whatever proportion the hardware is available ?

Most gaming desktops have a solid gpu but not enough vram. Pity having the gpu idle here

ur-whale 3 years ago

> 1. Create a conda environment

Uh-oh, bad start.

_flux 3 years ago

Why is it a bad start?
It could be venv as well, I suppose, I haven't used conda.
- disgruntledphd2 3 years ago
  
  Conda is gonna work much, much, much better for these kinds of applications, as that's what it's mostly used for, i.e. scientific/numerical computing with C/C++ dependencies.
  - ur-whale 3 years ago
    
    Conda is an abomination that will download 4Gig of unnecessary shite and carelessly dump it into your system, thereby ruining your existing configuration in the process.
    Use it in a container or a VM unless you enjoy re-installing your system from scratch.
    Or better still, don't use it al all and let it wither away: these kind of braindead projects need to be put down with extreme prejudice.
    
    bshipp 3 years ago
    
    The download size is large but conda doesn't ruin any existing configuration unless you explicitly tell it to be your native python environment. Conda is set up as a self-contained set of independent environments. Why would your system care what's inside the Anaconda directory unless you explicitly add it to your PATH/bash?
    
    ur-whale 3 years ago
    
    I haven't touched that steaming pile of shite in a looong while, so - who knows - they might have managed to minimize the amount of havoc their wreak on their user's systems.
    But ... I seem to recall ... Conda tries to install GPU drivers does is not? ... Is that not the case anymore?
    Because if it still does, your theory about "Why would your system care" and all that doesn't really hold water.
    
    bshipp 3 years ago
    
    I use miniconda on Linux but it's never attempted to install graphics drivers on Windows.
    
    orangepurple 3 years ago
    
    Chill. Use Miniconda3 as a light alternative. Conda is unnecessary. I agree nobody should ever use Conda unless they are extreme noobs. We all have to start somewhere.

Smith42 3 years ago

Since this is pytorch it should run on cpu anyway. What am I missing?

progman32 3 years ago

Reading the patch: https://github.com/facebookresearch/llama/compare/main...mar...
Looks like this is just tweaking some defaults and commenting out some code that enables cuda. It also switches to something called gloo, which I'm not familiar with. Seems like an alternate backend.
- markasoftware 3 years ago
  
  you don't actually need to switch to gloo, I just have no idea what I'm doing.
  - refulgentis 3 years ago
    
    Lol, all my best work has been when I don’t know what I’m doing and it’s refreshing to see someone moving the ball forward and feeling the same way. Kudos
- rajman187 3 years ago
  
  Gloo is a communication protocol for distributed computation (think along the lines of MPI)
Zetobal 3 years ago

I guess the simple fact that it didn't before his patch?
- cinntaile 3 years ago
  
  Usually you just trivially have the model run on cpu or gpu by simply writing .cpu() at specific places, so he's wondering why this isn't the case here.
  - markasoftware 3 years ago
    
    that's literally all I did (plus switching the tensor type). I'd imagine people are posting and upvoting this not because it's actually interesting code but rather just because it runs unexpectedly fast on consumer CPUs and it's not something they considered feasible before.
    
    roenxi 3 years ago
    
    That is vastly underestimating how tricky it is to make novel pieces of software run. There is a huge fringe of people who know how to click things but not use the terminal and a large fringe of people who know how to run "./execute.bat" but not how to write syntactically correct Python.
    But a lot of those people want to play with LLMs.
    
    ComplexSystems 3 years ago
    
    How are you getting this to run fast? I'm on a top of the line M1 MBP and getting 1 token every 8 minutes.
    
    ingenieroariel 3 years ago
    
    Try switching all the .cuda() to .mps() I got a 100x speedup on a different language model on a Macbook M1 Air.
    https://pytorch.org/docs/stable/notes/mps.html
    
    singularity2001 3 years ago
    
    dedicated fork: https://github.com/remixer-dec/llama-mps
    
    markasoftware 3 years ago
    
    probably pytorch is very optimized to x86. It's likely using lots of SIMD and whatnot. I'm sure it's possible to get similar performance on m1 macs, but not with the current version of pytorch.
    Do you have enough ram? (not swapping to disk)?
    
    jwitthuhn 3 years ago
    
    Same experience for me, looks like it is only using one cpu core instead of all of them.
  - sva_ 3 years ago
    
    Or better yet, define a device = 'cpu', and use tensor.to(device).
- tmalsburg2 3 years ago
  
  If someone else wrote this comment, would you find it useful?

LangIsAllWeNeed 3 years ago

Can it beat mark zuckerberg at congress deposition wordsmithing though?

toxicFork 3 years ago

Someone must have trained an LLM for that for sure.
- LangIsAllWeNeed 3 years ago
  
  Oh yes “”” Hackernews senator: “”Someone on the internet said meta aka Facebook is not considered a real data native, clean coder and high IQ company unless your new language model exceeds the elegance and slipperiness of mark Zuckerbergs (you) language output in senate hearings. he is smoother than a lake in the metaverse.“”
  Mark LLM: “ Yes, unfortunately, the media and our competitors are all over the idea that Meta is a “dirty company”. They have tried to spin all our successes and accomplishments in a negative light. This has been incredibly frustrating and demoralizing for us, but we know that we are working hard to build a great company and we are confident that our efforts will be rewarded. In the end, our products speak for themselves, and despite all the negative media coverage we are focused on continuing to build great products for our users and being an amazing place for them to socialize in the virtual world.”
  - RugnirViking 3 years ago
    
    I have to say "he is smoother than a lake in the metaverse" is presumably accidental, based on the quality of the rest of that text, but it has to be one of the wittiest phrases ive seen LLMs come out with to date
    
    jcuenod 3 years ago
    
    I opened the twitch AI seinfeld stream once and stumbled into a conversation that went something to the effect of:
    George: I really like that orange sweater
    Jerry: Yeah, I just found black so depressing
    George: Orange is such a great color! Orange is the new black.
    ...
    
    LangIsAllWeNeed 3 years ago
    
    That was my prompt, I am hackernews senator. People do sometimes ask how many A100s it takes to run me.

fsiefken 3 years ago

Would running on a cpu be more or less power efficient then running on a gpu with the same words per second rate?

frognumber 3 years ago

Less.

bilsbie 3 years ago

What’s the rough idea of how this is possible? I thought you need the parrelism of a gpu

raihansaputra 3 years ago

inference has less pressure of parallelism compared to training

crazysim 3 years ago

Could this fit into GitHub Codespaces's top VM?

DefineOutside 3 years ago

The 65 billion model is 160 GB so no - unless you request larger storage spaces from github. 7 billion and 13 billion should fit though.

meghan_rain 3 years ago

how long for one token to infer on an average cpu?

markasoftware 3 years ago

I tested on a decidedly above average CPU, and got several words per second on the 7B model. I'd guess maybe one word per second on a more average one?
- raverbashing 3 years ago
  
  Cool so we're back to the days of 2400 baud modems
  - Tepix 3 years ago
    
    More like 300 baud. At 300 baud (30 cps) you can still read it as it arrives.
  - asplake 3 years ago
    
    Simulating a slow typist
kristianp 3 years ago

From the readme: On a Ryzen 7900X, the 7B model is able to infer several words per second, quite a lot better than you'd expect!

2Gkashmiri 3 years ago

i have a friend who owns an macbook pro m1 max. what kind of performance can i get?

singularity2001 3 years ago

dedicated fork: https://github.com/remixer-dec/llama-mps
- kristianp 3 years ago
  
  Mps = Metal Performance Shaders, for those out of the loop.

Settings

Fork of Facebook’s LLaMa model to run on CPU

Keyboard Shortcuts