Ggml.ai joins Hugging Face to ensure the long-term progress of Local AI

839 points by lairv 5 months ago · 240 comments

Reader

mythz 5 months ago

I consider HuggingFace more "Open AI" than OpenAI - one of the few quiet heroes (along with Chinese OSS) helping bring on-premise AI to the masses.

I'm old enough to remember when traffic was expensive, so I've no idea how they've managed to offer free hosting for so many models. Hopefully it's backed by a sustainable business model, as the ecosystem would be meaningfully worse without them.

We still need good value hardware to run Kimi/GLM in-house, but at least we've got the weights and distribution sorted.

data-ottawa 5 months ago

Can we toss in the work unsloth does too as an unsung hero?
They provide excellent documentation and they’re often very quick to get high quality quants up in major formats. They’re a very trustworthy brand.
- disiplus 5 months ago
  
  Yeah, they're the good guys. I suspect the open source work is mostly advertisements for them to sell consulting and services to enterprises. Otherwise, the work they do doesn't make sense to offer for free.
  - danielhanchen 5 months ago
    
    Haha for now our primary goal is to expand the market for local AI and educate people on how to do RL, fine-tuning and running quants :)
    
    WanderPanda 5 months ago
    
    Amazing work and people should really appreciate that the opportunity costs of your work are immense (given the hype).
    On another note: I'm a bit paranoid about quantization. I know people are not good at discerning model quality at these levels of "intelligence" anymore, I don't think a vibe check really catches the nuances. How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?
    I was recently trying Qwen 3 Coder Next and there are benchmark numbers in your article but they seem to be for the official checkpoint, not the quantized ones. But it is not even really clear (and chatbots confuse them for benchmarks of the quantized versions btw.)
    I think systematic/automated benchmarks would really bring the whole effort to the next level. Basically something like the bar chart from the Dynamic Quantization 2.0 article but always updated with all kinds of recent models.
    
    danielhanchen 5 months ago
    
    Thanks! Yes we actually did think about that - it can get quite expensive sadly - perplexity benchmarks over short context lengths with small datasets are doable, but it's not an accurate measure sadly. We're actually investigating currently what would be the best efficient course of action on evaluating quants - will keep you posted!
    
    jychang 5 months ago
    
    > How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?
    Very hard. $$$
    The benchmarks are not cheap to run. It'll cost a lot to run them for each quant of each model.
    
    danielhanchen 5 months ago
    
    Yes sadly very expensive :( Maybe a select few quants could happen - we're still figuring out what is the most economical and most efficient way to benchmark!
    
    illusive4080 5 months ago
    
    Roughly how much does it cost to run one of the popular benchmarks? Are we talking $1,000, $10,000, or $100k?
    
    danielhanchen 5 months ago
    
    Oh it's more time that's the issue - each benchmark takes 1-3 hours ish to run on 8 GPUs, so running on all quants per model release can be quite painful.
    Assume AWS spot say $20/hr B200 for 8 GPUs, then $20 ish per quant, so assuming benchmark is on BF16, 8bit, 6, 5, 4, 3, 2 bits then 7 ish tests so $140 per model ish to $420 ish/hr. Time wise 7 hours to 1 day ish.
    We could run them after a model release which might work as well.
    This is also on 1 benchmark.
    
    Zetaphor 5 months ago
    
    This would be amazing
    
    danielhanchen 5 months ago
    
    Working on it! :)
  - arcanemachiner 5 months ago
    
    I hope that is exactly what is happening. It benefits them, and it benefits us.
- swyx 5 months ago
  
  not that unsung! we've given them our biggest workshop spot every single year we've been able to and will do until they are tired of us https://www.youtube.com/@aiDotEngineer/search?query=unsloth
  - danielhanchen 5 months ago
    
    Appreciate it immensely haha :) Never tired - always excited and pumped for this year!
- danielhanchen 5 months ago
  
  Oh thank you - appreciate it :)
- cubie 5 months ago
  
  I'm a big fan of their work as well, good shout.
  - danielhanchen 5 months ago
    
    Thank you!
Tepix 5 months ago

It's insane how much traffic HF must be pushing out of the door. I routinely download models that are hundreds of gigabytes in size from them. A fantastic service to the sovererign AI community.
- razster 5 months ago
  
  My fear is that these large "AI" companies will lobby to have these open source options removed or banned, growing concern. I'm not sure how else to explain how much I enjoy using what HF provides, I religiously browse their site for new and exciting models to try.
  - culi 5 months ago
    
    ModelScope is the Chinese equivalent of Hugging Face and a good back up. All the open models are Chinese anyways
    
    thot_experiment 5 months ago
    
    Not true! Mistral is really really good, but I agree that there isn't a single decent open model from the USA.
    
    culi 5 months ago
    
    Mistral is cool and I wish them success but it consistently ranks extremely low on benchmarks while still being expensive. Chinese models like DeepSeek might rank almost as low as Mistral but they are significantly cheaper. And Kimi is the best of both worlds with incredible benchmark results while still being incredibly cheap
    I know things change rapidly so I'm not counting them out quite yet but I don't see them as a serious contender currently
    
    thot_experiment 5 months ago
    
    Sure, benchmarks are fake and I use Mistral over equivalently sized models most of the time because it's better in real life. It runs plenty fast for me, I don't pay for inference.
    
    BoredomIsFun 5 months ago
    
    > it consistently ranks extremely low on benchmarks
    As general purpose chatbots small Mistral models are better than comparably sized Chiniese models, as they have better SimpleQA scores and general knowledge of Western culture.
    
    seanmcdirmid 5 months ago
    
    It’s really hard to beat qwen coder, especially for role play where the instruction following is really useful. I don’t think their corpus is lacking in western knowledge, although I wonder if Chinese users get even better results from it?
    
    BoredomIsFun 5 months ago
    
    > It’s really hard to beat qwen coder, for role play
    I am not sure if you actually tried that. Mistrals are widely asccepted go-to models for roleplay and creative writing. No Qwens are good at prose, except for their latest big Qwen 3.5.
    > I don’t think their corpus is lacking in western knowledge,
    It absolutely does, especially pop culture knowledge.
    
    seanmcdirmid 5 months ago
    
    Instruct and coder just follow instructions so well though. I guess I’ve just never been able to make mistral work well, I guess.
    
    BoredomIsFun 5 months ago
    
    Qwen3 30B A3B and that big 400+ B Coder were absolutely terrible at editing fiction. I would tell them what to change in the prose and they'd just regurgitate text with no changes.
    
    seanmcdirmid 5 months ago
    
    Did you try asking Gemini what model to use and how to configure/set it up? It has worked wonders for me, ironically (since I’m using a big model to setup smaller local models).
    
    BoredomIsFun 5 months ago
    
    > Did you try asking Gemini what model to use and how to configure/set it up?
    That would besuboptimal, as Gemini has too old knowledge cutoff. I am long past the need for such an advice anyway, as I've been using local models since mid 2024.
    
    seanmcdirmid 5 months ago
    
    Gemini will search the web for most things (at least if you are using it via the web search interface), it isn’t limited to the knowledge it was trained on. Actually, I’m a bit mortified that not everyone knows this. If you ask Gemini (from the search interface) about a current event that happened yesterday, they will use search to pull in context and work with that. Also about model that was released yesterday, it can do that.
    It’s only a very low level model access where search isn’t used. Local models also need to be configured to use search, and I haven't had a use case to do that yet.
    Gemini seems to call this “grounding with google search”. If you have Gemini installed in your enterprise, it will also search internal data sources for context.
    
    BoredomIsFun 5 months ago
    
    > Gemini will search the web for most things (at least if you are using it via the web search interface), it isn’t limited to the knowledge it was trained on.
    If decides to do so, and even then baked in knowledge would influence the result.
    In any case I do not need Gemini or any other LLMs to figure out setting for my llama.cpp, thank you very much.
    
    seanmcdirmid 5 months ago
    
    It has always searched the web for me, and it can give me pretty good guidance about a model released in the last week. All models ATM are trying to reduce dependence on internal knowledge mostly through RAG. Anyways, this part of LLMs has gotten much better in the last 6 months.
    If you are able to figure out the right settings for a model Thats was released last week, then great for you! But it sounds like you just don’t trust LLMs to use current knowledge, and have some misconception about how they satisfy recent knowledge requests.
    
    Eupolemos 5 months ago
    
    Why are you talking price when we are talking local AI?
    That doesn't make any sense to me. Am I missing something?
    
    dirasieb 5 months ago
    
    15 missed calls from your local power company
    
    culi 5 months ago
    
    Your electricity is free?
    
    seanmcdirmid 5 months ago
    
    Apple silicon is crazy efficient as well as being comparable to GPUs in performance for max and ultra chips.
    
    cpburns2009 5 months ago
    
    If you have the hardware to run expensive models, is the cost of electricity much of a factor? According to Google, the average price in the Silicon Valley Area is $0.448 per kWh. An RTX 5090 costs about $4,000 and has a peak power consumption of 1000 W. Maxing out that GPU for a whole year would cost $3,925 at that rate. It's not particularly more expensive than that hardware itself.
    
    culi 5 months ago
    
    At that point it'd be cheaper to get an expensive subscription to a cloud platform AI product. I understand the case for local LLMs but it seems silly to worry about pricing for cloud-based offerings but not worry about pricing for locally run models. Especially since running it locally can often be more expensive
    
    thot_experiment 5 months ago
    
    for almost the entire year, yes.
    
    ac29 5 months ago
    
    Arcee is working on that, see a blog post about their newest in progress model here: https://www.arcee.ai/blog/trinity-large
    Its still not fully post trained and its a non-reasoning model, but its worth keeping an eye on if you dont want to use the Chinese models that currently are the best open-weight options.
    
    CamperBob2 5 months ago
    
    To be fair there are lots of worse models than OpenAI's GPT-OSS-120b. It's not a standout when positioned next to the latest releases from China, but prior to the current wave it was considered one of the stronger local models you can reasonably run.
  - throwaway27448 5 months ago
    
    They can try. I don't think they'll be able to get the toothpaste back in the tube. The data will just move our of the country.
    
    seanmcdirmid 5 months ago
    
    Many of the models on hugging face are already Chinese. It’s kind of obvious that local AI is going to flourish more in China than the USA due to hardware constraints.
  - dotancohen 5 months ago
    
    How do you choose which models to try for which workflows? Do you have objective tests that you run, or do you just get a feel for them while using them in your daily workflow?
  - toofy 5 months ago
    
    it’s only a matter of time. we have all seen first hand how … wrong … these companies behave, almost on a regular basis.
    there’s a small tinfoil hat part of me that suspects part of their obscene investments and cornering the hardware market is driven by an conscious attempt to stop open source local from taking off. they want it all, the money, the control, and to be the only source of information to us.
- Onavo 5 months ago
  
  Bandwidth is not that expensive. The Big 3 clouds just want to milk customers via egress. Look at Hetzner or CloudFlare R2 if you want to get get an idea of commodity bandwidth costs.
- vardalab 5 months ago
  
  Yup, I have downloaded probably a terabyte in the last week, especially with the Step 3.5 model being released and Minimax quants. I wonder what my ISP thinks. I hope they don't cut me off. They gave me a fast lane, they better let me use it, lol
  - fc417fc802 5 months ago
    
    Even fairly restrictive data caps are in the range of 6 Tb per month. P2P at a mere 100 Mb works out to 1 TiB per 24 hours.
    Hypothetically my ISP will sell me unmetered 10 Gb service but I wonder if they would actually make good on their word ...
    
    3eb7988a1663 5 months ago
    
    I have a 1.2TB cap before you start getting charged extra, so you might need to recalibrate your restrictive level.
    
    fc417fc802 5 months ago
    
    Is that with a WISP by chance? Or in a developing country? Or are there really wired providers with such low caps in the western world in this day and age?
    
    Zetaphor 5 months ago
    
    ATT once told me if I don't pay for their TV service then my home gigabit fiber would have a 1TB cap. They had an agreement with the apartment building so I had no other choice of provider.
    
    fc417fc802 5 months ago
    
    Buy our off brand netflix or else we'll make it so you can't watch netflix. How is that legal?
    
    Zetaphor 5 months ago
    
    The law is written by the highest bidder, and the telecom lobbyists are very generous
    
    nagaiaida 5 months ago
    
    well it's my wired cap a stone's throw from buildings with google cloud logos on the side in a major us city, so...
    
    zargon 5 months ago
    
    Comcast.
zozbot234 5 months ago

> We still need good value hardware to run Kimi/GLM in-house
If you stream weights in from SSD storage and freely use swap to extend your KV cache it will be really slow (multiple seconds per token!) but run on basically anything. And that's still really good for stuff that can be computed overnight, perhaps even by batching many requests simultaneously. It gets progressively better as you add more compute, of course.
- Aurornis 5 months ago
  
  > it will be really slow (multiple seconds per token!)
  This is fun for proving that it can be done, but that's 100X slower than hosted models and 1000X slower than GPT-Codex-Spark.
  That's like going from real time conversation to e-mailing someone who only checks their inbox twice a day if you're lucky.
  - zozbot234 5 months ago
    
    You'd need real rack-scale/datacenter infrastructure to properly match the hosted models that are keeping everything in fast VRAM at all times, and then you only get reasonable utilization on that by serving requests from many users. The ~100X slower tier is totally okay for experimentation and non-conversational use cases (including some that are more agentic-like!), and you'd reach ~10X (quite usable for conversation) by running something like a good homelab.
- HPsquared 5 months ago
  
  At a certain point the energy starts to cost more than renting some GPUs.
  - vardalab 5 months ago
    
    Yeah, that is hard to argue with because I just go to OpenRouter and play around with a lot of models before I decide which ones I like. But there's something special about running it locally in your basement
    
    dotancohen 5 months ago
    
    I'd love to hear more about this. How do you decide that you like a model? For which use cases?
  - fc417fc802 5 months ago
    
    Aren't decent GPU boxes in excess of $5 per hour? At $0.20 per kWhr (which is on the high side in the US) running a 1 kW workstation 24/7 would work out to the same price as 1 hour of GPU time.
    The issue you'll actually run into is that most residential housing isn't wired for more than ~2kW per room.
sowbug 5 months ago

Why doesn't HF support BitTorrent? I know about hf-torrent and hf_transfer, but those aren't nearly as accessible as a link in the web UI.
- embedding-shape 5 months ago
  
  > Why doesn't HF support BitTorrent?
  Harder to track downloads then. Only when clients hit the tracker would they be able to get download states, and forget about private repositories or the "gated" ones that Meta/Facebook does for their "open" models.
  Still, if vanity metrics wasn't so important, it'd be a great option. I've even thought of creating my own torrent mirror of HF to provide as a public service, as eventually access to models will be restricted, and it would be nice to be prepared for that moment a bit better.
  - sowbug 5 months ago
    
    I thought of the tracking and gate questions, too, when I vibed up an HF torrent service a few nights ago. (Super annoying BTW to have to download the files just to hash the parts, especially when webseeds exist.) Model owners could disable or gate torrents the same way they gate the models, and HF could still measure traffic by .torrent downloads and magnet clicks.
    It's a bit like any legalization question -- the black market exists anyway, so a regulatory framework could bring at least some of it into the sunlight.
    
    embedding-shape 5 months ago
    
    > Model owners could disable or gate torrents the same way they gate the models, and HF could still measure traffic by .torrent downloads and magnet clicks.
    But that'll only stop a small part, anyone could share the infohash and if you're using the dht/magnet without .torrent files or clicks on a website, no one can count those downloads unless they too scrape the dht for peers who are reporting they've completed the download.
    
    fc417fc802 5 months ago
    
    > unless they too scrape the dht for peers who are reporting they've completed the download.
    Which can be falsified. Head over to your favorite tracker and sort by completed downloads to see what I mean.
    
    sowbug 5 months ago
    
    Right, but that's already happening today. That's the black-market point.
  - Barbing 5 months ago
    
    That would be a very nice service. I think folks might rely on it for a number of reasons, including that we'll want to see how biases changed over time. What got sloppier, shillier...
  - jimbob45 5 months ago
    
    Wouldn’t it still provide massive benefits if they could convince/coerce their most popular downloaded models to move to torrenting?
    
    intrasight 5 months ago
    
    Benefit to you, but great downside to the three letter agencies that inject their goods into these models.
  - homarp 5 months ago
    
    how are all the private trackers tracking ratios?
  - taminka 5 months ago
    
    most of the traffic is probably from open weights, just seed those, host private ones as is
Fin_Code 5 months ago

I still don't know why they are not running on torrent. Its the perfect use case.
- heliumtera 5 months ago
  
  How can you be the man in the middle in a truly P2P environment?
- freedomben 5 months ago
  
  That would shut out most people working for big corp, which is probably a huge percentage of the user base. It's dumb, but that's just the way corp IT is (no torrenting allowed).
  - zozbot234 5 months ago
    
    It's a sensible option, even when not everyone can really use it. Linux distros are routinely transfered via torrent, so why not other massive, open-licensed data?
    
    freedomben 5 months ago
    
    Oh as an option, yeah I agree it makes a ton of sense. I just would expect a very, very small percentage of people to use the torrent over the direct download. With Linux distros, the vast majority of downloads still come from standard web servers. When I download distro images I opt for torrents, but very few people do the same
    
    Const-me 5 months ago
    
    > very small percentage of people to use the torrent over the direct download
    BitTorrent protocol is IMO better for downloading large files. When I want to download something which exceeds couple GB, and I see two links direct download and BitTorrent, I always click on the torrent.
    On paper, HTTP supports range requests to resume partial downloads. IME, it seems modern web browsers neglected to implement it properly. They won’t resume after browser is reopened, or the computer is restarted. Command-line HTTP clients like wget are more reliable, however many web servers these days require some session cookies or one-time query string tokens, and it’s hard to pass that stuff from browser to command-line.
    I live in Montenegro, CDN connectivity is not great here. Only a few of them like steam and GOG saturate my 300 megabit/sec download link. Others are much slower, e.g. windows updates download at about 100 megabit/sec. BitTorrent protocol almost always delivers the 300 megabit/sec bandwidth.
    
    zrm 5 months ago
    
    With Linux distros they typically put the web link right on the main page and have a torrent available if you go look for it, because they want you to try their distro more than they want to save some bandwidth.
    Suppose HF did the opposite because the bandwidth saved is more and they're not as concerned you might download a different model from someone else.
    
    thot_experiment 5 months ago
    
    I have terabytes of linux isos I got via torrents, many such cases!

simonw 5 months ago

It's hard to overstate the impact Georgi Gerganov and llama.cpp have had on the local model space. He pretty much kicked off the revolution in March 2023, making LLaMA work on consumer laptops.

Here's that README from March 10th 2023 https://github.com/ggml-org/llama.cpp/blob/775328064e69db1eb...

> The main goal is to run the model using 4-bit quantization on a MacBook. [...] This was hacked in an evening - I have no idea if it works correctly.

Hugging Face have been a great open source steward of Transformers, I'm optimistic the same will be true for GGML.

I wrote a bit about this here: https://simonwillison.net/2026/Feb/20/ggmlai-joins-hugging-f...

ushakov 5 months ago

i am curious, why are your comments always pinned to the top?
- carbocation 5 months ago
  
  Because many of us think simonw has discerning taste on this topic and like to read what he has to say about it, so we upvote his comments.
  - ushakov 5 months ago
    
    i don't doubt this. i just find it questionable that one particular poster always gets in the spotlight when AI is the topic - while other conversations in my opinion offer more interesting angles.
    
    jonas21 5 months ago
    
    Upvote the conversations that you find to be more interesting. If enough people do the same, they too will make it to the top.
    
    coldtea 5 months ago
    
    Parent implies there might be some "boosting" involved, in which case, "upvote the conversations that you find to be more interesting" wont change anything...
    Not saying this is the case, but it's what the comment implies, so "just upvote your faves" doesn't really address it.
    
    colesantiago 5 months ago
    
    Agreed,
    I would like to see others, being promoted to the top rather than Simon’s constant shilling for backlinks to his blog every time an AI topic is on the front page.
- simonw 5 months ago
  
  At a guess that's because my comment attracted more up-votes than the other top-level comments in the thread.
  I generally try to include something in a comment that's not information already under discussion - in this case that was the link and quote from the original README.
  - ushakov 5 months ago
    
    of course your comment attracts more upvotes - it's at the top.
    
    seanhunter 5 months ago
    
    It’s at the top because of upvotes. They don’t have an “if simonw: boost” branch in the code.
    
    ushakov 5 months ago
    
    the code is not public, so we can't know. i think it's much more nuanced and certain users' comments might get a preferential treatment, based on factors other than the upvote count - which itself is hidden from us.
    
    ComplexSystems 5 months ago
    
    > the code is not public, so we can't know.
    I feel like you're making this statement in bad faith, rather than honestly believing the developers of the forum software here have built in a clause to pin simonw's comments to the top.
    
    satvikpendem 5 months ago
    
    > certain users' comments might get a preferential treatment
    This does not happen. It hasn't even happened when pg made the forum in the first place.
    
    dcrazy 5 months ago
    
    I thought dang explicitly said it does happen? It certainly happens for stories.
    
    ontouchstart 5 months ago
    
    Attention feeds attention.
    Attention is ALL You Need.
- llm_nerd 5 months ago
  
  HN goes through phases. I remember when patio11 was the star of the hour on here. At another time it was that security guy (can't remember his name).
  And for those who think it's just organic with all of the upvotes, HN absolutely does have a +/- comment bias for users, and it does automatically feature certain people and suppress others.
  - imiric 5 months ago
    
    > And for those who think it's just organic with all of the upvotes, HN absolutely does have a bias for authors, and it does automatically feature certain people and suppress others.
    Exactly.
    There are configurable settings for each account, which might be automatically or manually set—I'm not sure–, that control the initial position of a comment in threads, and how long it stays there. There might be a reward system, where comments from high-karma accounts are prioritized over others, and accounts with "strikes", e.g. direct warnings from moderators, are penalized.
    The difference in upvotes that account ultimately receives, and thus the impact on the discussion, is quite stark. The more visible a comment is, i.e. the more at the top it is, the more upvotes it can collect, which in turn makes it stay at the top, and so on.
    It's safe to assume that certain accounts, such as those of YC staff, mods, or alumni, or tech celebrities like simonw, are given the highest priority.
    I've noticed this on my own account. Before being warned for an IMO bullshit reason, my comments started to appear near the middle, and quickly float down to the bottom, whereas before they would usually be at the top for a few minutes. The quality of what I say hasn't changed, though the account's standing, and certainly the community itself, has.
    I don't mind, nor particularly care about an arbitrary number. This is a proprietary platform run by a VC firm. It would be silly to expect that they've cracked the code of online discourse, or that their goal is to keep it balanced. The discussions here are better on average than elsewhere because of the community, although that also has been declining over the years.
    I still find it jarring that most people would vote on a comment depending on if they agree with it or not, instead of engaging with it intellectually, which often pushes interesting comments to the bottom. This is an unsolved problem here, as much as it is on other platforms.
    
    Eisenstein 5 months ago
    
    There is a saying that if everyone you encounter seems to be unreasonable, maybe it isn't the other people that are being unreasonable.
    This isn't to say that social media is fair, or that people vote properly or that any ranking system based on agreement by readers is a good one. However, generally when you are getting negativity communicated to you and you are seeing consistently poor results around actions you take, it is going to be useful to examine the possibility that there is a difference in how you perceive what you are doing vs how others do. In that case spending time trying to figure out ways in which you are being wronged so that you can continue in the same manner is going to be time wasted.
    
    llm_nerd 5 months ago
    
    You seem to be assuming that everything is organic and above board on here. That it's all just user/community stimuli, and if someone flies high well clearly it's great content, from which we can infer the reverse as well.
    We don't have the source for HN, nor do we have the obvious bias metadata that the moderators have put in place, but simply paying attention betrays that manipulation mechanisms exist and are heavily utilized.
    For instance I clearly have a "bad guy" flag on my account, and frequently see my highly rated comments sorted below literally greyed out comments. Comments older than mine, so it isn't just the normal "well newer comments get a boost", it's just that there is a comment "DEI" in place where some people get a freebie boost and some people get a freebie detriment. It's why often mediocre content and comments by the core group is always floating high.
    And let me make it very clear that I do not care. I don't harbour any delusions about some tight community or the like, and HN is not important in my life or my ego. I also know that it's basically a propaganda network for YC (I mean...it's right in the URL), and good for them. It's their site and they can do anything they want with it.
    I only commented because some people really think this place is a meritocracy+democracy. That isn't how it works, even if they really want people to think that.
    
    Eisenstein 5 months ago
    
    No one is under the assumption that any social media space is going to be meritocratic or democratic. The assumption is that some percentage of users are manipulating it and the backend and admins are doing the same. It is an attention economy. I don't think anyone is naive about this. My comment was merely a take on the 'the video game controller is broken' excuse that everyone has when they need to cover for their ego. Sometimes the controller is broken, but it almost never is.
    
    imiric 5 months ago
    
    How are you getting persecution complex from what I said? If anything, your comment might be feeding that delusion. :)
    My point is that HN definitely has certain weights associated with accounts, which control the karma, visibility, and ultimately discussion of certain topics.
    This problem doesn't affect only negativity or downvotes, but upvotes as well. The most upvoted comments are not necessarily of the highest quality, or contribute the most to the discussion. They just happen to be the most visible, and to generally align with the feeling of the hive mind.
    I know this because some of my own comments have been at the top, without being anything special, while others I think are, barely get any attention. I certainly examine my thinking whenever it strongly aligns with the hive mind, as this community does not particularly align with my values.
    I also tend to seek out comments near the bottom of threads, and have dead comments enabled, precisely to counteract this flawed system. I often find quality opinions there, so I suggest everyone do the same as well.
    An essential feature of a healthy and interesting discussion forum is to accomodate different viewpoints. That starts by not burying those that disagree with the majority, or boosting those that agree. AFAIK no online system has gotten this right yet.
  - rymc 5 months ago
    
    the security you mean is probably tptacek (https://news.ycombinator.com/user?id=tptacek)
- throwaway2027 5 months ago
  
  Time flies and simonw his AI feedback isn't always received favorably, sometimes he pushes it too much.
- satvikpendem 5 months ago
  
  They aren't pinned, people just vote on them, and more so because simonw is a recognizable name with lots of posts and comments.
- magicalhippo 5 months ago
  
  New comments get a boost, and as such are frequently near the top just due to that. Frequent upvotes also boosts. There might be other factors.
  However these things are dynamic and change over time. As I read the discussion just now, the GP comment was the ~5th top-level comment.
- francispauli 5 months ago
  
  thanks for reminding me i need to follow his blog weekly again

HanClinto 5 months ago

I'm regularly amazed that HuggingFace is able to make money. It does so much good for the world.

How solid is its business model? Is it long-term viable? Will they ever "sell out"?

microsoftedging 5 months ago

FT had a solid piece a few weeks back: "Why AI start-up Hugging Face turned down a $500mn Nvidia deal"
https://giftarticle.ft.com/giftarticle/actions/redeem/9b4eca...
- jackbravo 5 months ago
  
  sounds very interesting, but even though it says giftarticle.ft, I got blocked by a paywall.
  - nerevarthelame 5 months ago
    
    https://archive.is/zSyUc
    To summarize, they rejected Nvidia's offer because they didn't want one outsized investor who could sway decisions. And "the company was also able to turn down Nvidia due to its stable finances. Hugging Face operates a 'freemium' business model. Three per cent of customers, usually large corporations, pay for additional features such as more storage space and the ability to set up private repositories."
    
    bee_rider 5 months ago
    
    Freemium seems to be working pretty well for them—what’s the alternative website, after all. They seem to command their niche.
  - culi 5 months ago
    
    find the Bypass Paywalls Clean extension. Never worry about a paywall again
dmezzetti 5 months ago

They have paid hosting - https://huggingface.co/enterprise and paid accounts. Also consulting services. Seems like a pretty good foundation to me.
- julien_c 5 months ago
  
  and a lot of traction on paid (private in particular) storage these days; sneak peek at new landing page: https://huggingface.co/storage
bityard 5 months ago

Their business model is essentially the same as GitHub. Host lots of stuff for free and build a community around it, sell the upscaled/private version to businesses. They are already profitable.
- HanClinto 5 months ago
  
  This is what Sourceforge did too, and they still had the DevShare adware thing didn't they?
  GitHub is great -- huge fan. To some degree they "sold out" to Microsoft and things could have gone more south, but thankfully Microsoft has ruled them with a very kind hand, and overall I'm extremely happy with the way they've handled it.
  I guess I always retain a bit of skepticism with such things, and the long-term viability and goodness of such things never feels totally sure.
heliumtera 5 months ago

>Will they ever "sell out"?
Oh no, never. Don't worry, the usual investors are very well known for fighting for user autonomy (AMD, Nvidia, Intel,IBM, Qualcomm)
They are all very pro consumers and all backers are certainly here for your enjoyment only
- zozbot234 5 months ago
  
  These are all big hardware firms, which makes a lot of sense as a classic 'commoditize the complement' play. Not exactly pro-consumer, but not quite anti-consumer either!
  - smallerize 5 months ago
    
    heliumtera is being sarcastic.
I_am_tiberius 5 months ago

I once tried hugging face because I wanted I worked through some tutorial. They wanted my credit card details during the registration as far as I remember. After a month they invoiced me some amount of money and I had no idea what it was. To be honest, I don't understand what exactly they do and what services I was paying for, but I cancelled my account and never touched it again. For me that was a totally intransparent process.
- shafyy 5 months ago
  
  Their pricing seems pretty transparent: https://huggingface.co/pricing
- in-silico 5 months ago
  
  Sounds like a personal skill issue

car 5 months ago

So great to see my two favorite Open Source AI projects/companies joining forces.

Since I don't see it mentioned here, LlamaBarn is an awesome little—but mighty—MacOS menubar program, making access to llama.cpp's great web UI and downloading of tastefully curated models easy as pie. It automatically determines the available model- and context-sizes based on available RAM.

https://github.com/ggml-org/LlamaBarn

Downloaded models live in:

  ~/.llamabarn

Apart from running on localhost, the server address and port can be set via CLI:

  # bind to all interfaces (0.0.0.0)
  defaults write app.llamabarn.LlamaBarn exposeToNetwork -bool YES

  # or bind to a specific IP (e.g., for Tailscale)
  defaults write app.llamabarn.LlamaBarn exposeToNetwork -string "100.x.x.x"

  # disable (default)
  defaults delete app.llamabarn.LlamaBarn exposeToNetwork

noisy_boy 5 months ago

Github is showing me unicorn - is there an Linux equivalent? I have a old Thinkpad with a puny Nvidia GPU, can I hope to find anything useful to run on that?
- car 5 months ago
  
  Building Llama.cpp from source with CUDA enabled should get you pretty far. llama-server has a really good web UI, the latest version supports model switching.
  As for models, plenty of GGUF quantized (down to 2-bit) available on HF and modelscope.

0xbadcafebee 5 months ago

> The community will continue to operate fully autonomously and make technical and architectural decisions as usual. Hugging Face is providing the project with long-term sustainable resources, improving the chances of the project to grow and thrive. The project will continue to be 100% open-source and community driven as it is now.

I want this to be true, but business interests win out in the end. Llama.cpp is now the de-facto standard for local inference; more and more projects depend on it. If a company controls it, that means that company controls the local LLM ecosystem. And yeah, Hugging Face seems nice now... so did Google originally. If we all don't want to be locked in, we either need a llama.cpp competitor (with a universal abstration), or it should be controlled by an independent nonprofit.

zozbot234 5 months ago

Llama.cpp is an open source project that anyone can fork as needed, so any "control" over it really only extends to facilitating development of certain features.
- 0xbadcafebee 5 months ago
  
  In practice, nobody does this, because you then have to keep the fork up to date with upstream plus your changes, and this is an endless amount of work.

mnewme 5 months ago

Huggingface is the silent GOAT of the AI space, such a great community and platform

lairvOP 5 months ago

Truly amazing that they've managed to build an open and profitable platform without shady practices
- al_borland 5 months ago
  
  It’s such a sad state of affairs when shady practices are so normal that finding a company without them is noteworthy.

jgrahamc 5 months ago

This is great news. I've been sponsoring ggml/llama.cpp/Georgi since 2023 via Github. Glad to see this outcome. I hope you don't mind Georgi but I'm going to cancel my sponsorship now you and the code have found a home!

superkuh 5 months ago

I'm glad the llama.cpp and the ggml backing are getting consistent reliable economic support. I'm glad that ggerganov is getting rewarded for making such excellent tools.

I am somewhat anxious about "integration with the Hugging Face transformers library" and possible python ecosystem entanglements that might cause. I know llama.cpp and ggml already have plenty of python tooling but it's not strictly required unless you're quantizing models yourself or other such things.

beoberha 5 months ago

Seems like a great fit - kinda surprised it didn’t happen sooner. I think we are deep in the valley of local AI, but I’d be willing to bet it breaks out in the next 2-3 years. Here’s hoping!

breisa 5 months ago

I mean they already supported the project quite a bit. @ngxson and maybe others? from Huggingface are big contributors to llama.cpp.

tkp-415 5 months ago

Can anyone point me in the direction of getting a model to run locally and efficiently inside something like a Docker container on a system with not so strong computing power (aka a Macbook M1 with 8gb of memory)?

Is my only option to invest in a system with more computing power? These local models look great, especially something like https://huggingface.co/AlicanKiraz0/Cybersecurity-BaronLLM_O... for assisting in penetration testing.

I've experimented with a variety of configurations on my local system, but in the end it turns into a make shift heater.

0xbadcafebee 5 months ago

8GB is not enough to do complex reasoning, but you could do very small simple things. Models like Whisper, SmolVLM, Quen2.5-0.5B, Phi-3-mini, Granite-4.0-micro, Mistral-7B, Gemma3, Llama-3.2 all work on very little memory. Tiny models can do a lot if you tune/train them. They also need to be used differently: system prompt preloaded with information, few-shot examples, reasoning guidance, single-task purpose, strict output guidelines. See https://github.com/acon96/home-llm for an example. For each small model, check if Unsloth has a tuned version of it; it reduces your memory footprint and makes inference faster.
For your Mac, you can use Ollama, or MLX (Mac ARM specific, requires different engine and different model disk format, but is faster). Ramalama may help fix bugs or ease the process w/MLX. Use either Docker Desktop or Colima for the VM + Docker.
For today's coding & reasoning models, you need a minimum of 32GB VRAM combined (graphics + system), the more in GPU the better. Copying memory between CPU and GPU is too slow so the model needs to "live" in GPU space. If it can't fit all in GPU space, your CPU has to work hard, and you get a space heater. That Mac M1 will do 5-10 tokens/s with 8GB (and CPU on full blast), or 50 token/s with 32GB RAM (CPU idling). And now you know why there's a RAM shortage.
- BoredomIsFun 5 months ago
  
  > Mistral-7B
  Is hopelessly dated. There are much better newer models around.
mft_ 5 months ago

There’s no way around needing a powerful-enough system to run the model. So you either choose a model that can fit on what you have —i.e. via a small model, or a quantised slightly larger model— or you access more powerful hardware, either by buying it or renting it. (IME you don’t need Docker. For an easy start just install LM Studio and have a play.)
I picked up a second-hand 64GB M1 Max MacBook Pro a while back for not too much money for such experimentation. It’s sufficiently fast at running any LLM models that it can fit in memory, but the gap between those models and Claude is considerable. However, this might be a path for you? It can also run all manner of diffusion models, but there the performance suffers (vs. an older discrete GPU) and you’re waiting sometimes many minutes for an edit or an image.
- ryandrake 5 months ago
  
  I wasn't able to have very satisfying success until I bit the bullet and threw a GPU at the problem. Found an actually reasonably priced A4000 Ada generation 20GB GPU on eBay and never looked back. I still can't run the insanely large models, but 20GB should hold me over for a while, and I didn't have to upgrade my 10 year old Ivy Bridge vintage homelab.
- sigbottle 5 months ago
  
  Are mac kernels optimized compared to CUDA kernels? I know that the unified GPU approach is inherently slower, but I thought a ton of optimizations were at the kernel level too (CUDA itself is a moat)
  - liuliu 5 months ago
    
    Depending on what you do. If you are doing token generations, compute-dense kernel optimization is less interesting (as, it is memory-bounded) than latency optimizations else where (data transfers, kernel invocations etc). And for these, Mac devices actually have a leg than CUDA kernels (as pretty much Metal shaders pipelines are optimized for latencies (a.k.a. games) while CUDA shaders are not (until cudagraph introduction, and of course there are other issues).
  - ttoinou 5 months ago
    
    There’s this developer called nightmedia who converts a lot of models to apple MLX. I can run Qwen3 coder next at 60 tps on my m4 max. It works
  - bigyabai 5 months ago
    
    Mac kernels are almost always compute shaders written in Metal. That's the bare-minimum of acceleration, being done in a non-portable proprietary graphics API. It's optimized in the loosest sense of the word, but extremely far from "optimal" relative to CUDA (or hell, even Vulkan Compute).
    Most people will not choose Metal if they're picking between the two moats. CUDA is far-and-away the better hardware architecture, not to mention better-supported by the community.
zozbot234 5 months ago

The general rule of thumb is that you should feel free to quantize even as low as 2 bits average if this helps you run a model with more active parameters. Quantized models are not perfect at all, but they're preferable to the models with fewer, bigger parameters. With 8GB usable, you could run models with up to 32B active at heavy quantization.
- zargon 5 months ago
  
  A large model (100B+, the more the better) may be acceptable at 2-bit quantization, depending on the task. But not a small model. Especially not for technical tasks. On top of that, one still needs room for OS, software and KV cache. 8GB is just not very useful for local LLMs. That said, it can still be entertaining to try out a 4-bit 8B model for the fun of it.
  - zozbot234 5 months ago
    
    100B+ is the amount of total parameters, whereas what matters here is active - very different for sparse MoE models. You're right that there's some overhead for the OS/software stack but it's not that much. KV-cache is a good candidate for being swapped out, since it only gets a limited amount of writes per emitted token.
    
    zargon 5 months ago
    
    Total parameters, not active parameters, is the property that matters for model robustness under extreme quantization.
    Once you're swapping from disk, the performance will be quite unusable for most people. And for local inference, KV cache is the worst possible choice to put on disk.
ontouchstart 5 months ago

This is the easiest set up on a Mac. You need at least 16gb on a MacBook:
https://github.com/ggml-org/llama.cpp/discussions/15396
xrd 5 months ago

I think a better bet is to ask on reddit.
https://www.reddit.com/r/LocalLLM/
Everytime I ask the same thing here, people point me there.
yjftsjthsd-h 5 months ago

With only 8 GB of memory, you're going to be running a really small quant, and it's going to be slow and lower quality. But yes, it should be doable. In the worst case, find a tiny gguf and run it on CPU with llamafile.
HanClinto 5 months ago

Maybe check out Docker Model Runner -- it's built on llama.cpp (in a good way -- not like Ollama) and handles I think most of what you're looking for?
https://www.docker.com/blog/run-llms-locally/
As far as how to find good models to run locally, I found this site recently, and I liked the data it provides:
https://localclaw.io/
Hamuko 5 months ago

I tried to run some models on my M1 Max (32 GB) Mac Studio and it was a pretty miserable experience. Slow performance and awful results.

dmezzetti 5 months ago

This is really great news. I've been one of the strongest supporters of local AI dedicating thousands of hours towards building a framework to enable it. I'm looking forward to seeing what comes of it!

logicallee 5 months ago

>I've been one of the strongest supporters of local AI, dedicating thousands of hours towards building a framework to enable it.
Sounds like you're very serious about supporting local AI. I have a query for you (and anyone else who feels like donating) about whether you'd be willing to donate some memory/bandwidth resources p2p to hosting an offline model:
We have a local model we would like to distribute but don't have a good CDN.
As a user/supporter question, would you be willing to donate some spare memory/bandwidth in a simple dedicated browser tab you keep open on your desktop that plays silent audio (to not be put in the background and deloaded) and then allocates 100mb -1 gb of RAM and acts as a webrtc peer, serving checksumed models?[1] (Then our server only has to check that you still have it from time to time, by sending you some salt and a part of the file to hash and your tab proves it still has it by doing so). This doesn't require any trust, and the receiving user will also hash it and report if there's a mismatch.
Our server federates the p2p connections, so when someone downloads they do so from a trusted peer (one who has contributed and passed the audits) like you. We considered building a binary for people to run but we consider that people couldn't trust our binaries, or would target our build process somehow, we are paranoid about trust, whereas a web model is inherently untrusted and safer. Why do all this?
The purpose of this would be to host an offline model: we successfully ported a 1 GB model from C++ and Python to WASM and WebGPU (you can see Claude doing so here, we livestreamed some of it[2]), but the model weights at 1 GB are too much for us to host.
Please let us know whether this is something you would contribute a background tab to hosting on your desktop. It wouldn't impact you much and you could set how much memory to dedicate to it, but you would have the good feeling of knowing that you're helping people run a trusted offline model if they want - from their very own browser, no download required. The model we ported is fast enough for anyone to run on their own machines. Let me know if this is something you'd be willing to keep a tab open for.
[1] filesharing over webrtc works like this: https://taonexus.com/p2pfilesharing/ you can try it in 2 browser tabs.
[2] https://www.youtube.com/watch?v=tbAkySCXyp0and and some other videos
- HanClinto 5 months ago
  
  Hosting model weights for projects like this I think is something that you could upload to a space in Hugging Face?
  What services would you need that Hugging Face doesn't provide?
- echoangle 5 months ago
  
  Maybe stupid question but why not just put it in a torrent?
  - logicallee 5 months ago
    
    Torrents require users to download and install a torrent client! In addition, we would like to retain the possibility of giving live updates to the latest version of a sovereign fine-tuned file, torrents don't autoupdate. We want to keep improving what people get.
    Finally, we would like the possibility of setting up market dynamics in the future: if you aren't currently using all your ram, why not rent it out? This matches the p2p edge architecture we envision.
    In addition, our work on WebGPU would allow you to rent out your gpu to a background tab whenever you're not using it. Why have all that silicon sit idle when you could rent it out?
    You could also donate it to help fine tune our own sovereign model.
    All of this will let us bootstrap to the point where we could be trusted with a download.
    We have a rather paranoid approach to security.
  - liuliu 5 months ago
    
    It is very simple. Storage / bandwidth is not expensive. Residential bandwidth is. If you can convince people to install a bandwidth-related software on their residential homes, you can then charge other people $5 to $10 per 1GiB bandwidth (useful for botnet mostly, get around DDOS protections and other reCAPTCHA tasks).
    
    logicallee 5 months ago
    
    Thank you for your suggestion. Below is only our plans/intentions, we welcome feedback about it:
    We are not going to do what you suggest. Instead, our approach is to use the RAM people aren't using at the moment for a fast edge cache close to their area.
    We've tried this architecture and get very low latency and high bandwidth. People would not be contributing their resources to anything they don't know about.
- liuliu 5 months ago
  
  > We have a local model we would like to distribute but don't have a good CDN.
  That is not true. I am serving models off Cloudflare R2. It is 1 petabyte per month in egress use and I basically pay peanuts (~$200 everything included).
  - logicallee 5 months ago
    
    1 petabyte per month is 1 million downloads of a 1 GB file. We intend to scale to more than 1 million downloads per month. We have a specific scaling architecture in mind. We're qualified to say this because we've ported a billion parameter model to run in your browser - fast - on either webgpu or wasm. (You can see us doing it live at the youtube link in my comment above.) There is a lot of demand for that.
    
    liuliu 5 months ago
    
    The bandwidth is free on Cloudflare R2. I paid money for storage (~10TiB storage of different models). If you only host 1GiB file there, you are only paying $0.01 per month I believe.

ukblewis 5 months ago

Honestly I’m shocked to be the only one I see of this opinion: HuggingFace’s `accelerate`, `transformers` and `datasets` have been some of the worst open source Python libraries I have ever used that I had to use. They break backwards compatibility constantly, even on APIs which are not underscore/dunder named even on minor version releases without even documenting this, they refuse PRs fixing their lack of `overloads` type annotations which breaks type checking on their libraries and they just generally seem to have spaghetti code. I am not excited that another team is joining them and consolidating more engineering might in the hands of these people

ukblewis 5 months ago

And I said all of that despite us continuing to use their platform and libraries extensively… We just don’t have a choice due to their dominance of open source ML
ukblewis 5 months ago

And clearly I say all of this in my name and not my employers name

mhher 5 months ago

It's great to see the ggml team getting proper backing. Keeping inference in bare-metal C/C++ without the Python bloat is the only way local AI is going to scale efficiently. Well deserved for Georgi, Johannes, Piotr, and the rest of the team.

ontouchstart 5 months ago

I have played with both mlx-lm and llama.cpp after I bought a 24GB M5 MacBook Pro last year.

Then I fell down the rabbit holes of uv, rust and C++ and forgot about LLMs. Today after I saw this announcement and answered someone’s question about how to set it up, when I got home, I decided play with llama.cpp again.

I was surprised and impressed:

https://ontouchstart.github.io/rabbit-holes/llama.cpp/

I am not going to use mlx-lm or lmstudio anymore. llama.cpp is so much fun.

the__alchemist 5 months ago

Does anyone have a good comparison of HuggingFace/Candle to Burn? I am testing them concurrently, and Burn seems to have an easier-to-use API. (And can use Candle as a backend, which is confusing) When I ask on Reddit or Discord channels, people overwhelmingly recommend Burn, but provide no concrete reasons beyond "Candle is more for inference while Burn is training and inference". This doesn't track, as I've done training on Candle. So, if you've used both: Thoughts?

csunoser 5 months ago

I have used both (albeit 2 years ago, and things change really fast). At the time, Candle didn't have 2d conv backprop with strides properly implemented. And getting Burn running libtch backend was just a lot simpler.
I did use candle for wasm based inference for teaching purposes - that was reasonably painless and pretty nice.

mattfrommars 5 months ago

I don’t know if this warrants a separate thread here but I have to ask…

How can I realistically get involved the AI development space? I feel left out with what’s going on and living in a bubble where AI is forced into by my employer to make use of it (GitHub Copilot), what is a realistic road map to kinda slowly get into AI development, whatever that means

My background is full stack development in Java and React, albeit development is slow.

I’ve only messed with AI on very application side, created a local chat bot for demo purposes to understand what RAG is about to running models locally. But all of this is very superficial and I feel I’m not in the deep with what AI is about. I get I’m too ‘late’ to be on the side of building the next frontier model and makes no sense, what else can I do?

I know Python, next step is maybe do ‘LLM from scratch”? Or I pick up Google machine learning crash course certificate? Or do recently released Nvidia Certification?

I’m open for suggestions

fc417fc802 5 months ago

I'm not entirely clear what your goals are but roughly, just figure out an application that holds your interest and build a model for it from scratch. Probably don't start with an LLM though. Same as for anything else really. If you're interest in computer graphics then decide on a small scale project and go build it from scratch. Etc.
w10-1 5 months ago

The competition for root and branch AI models and infrastructure is intense and skilled.
But if you're adjacent to some leaf use-case for AI, you're likely already as good as anyone else at productizing it.
And that's who is getting hired: people who show they can deliver product-market fit.
breisa 5 months ago

Maybe look into model finetuning/distilation. Unsloth [1] has great guides and provides everything you need to get started on Google Colab for free. [1] https://unsloth.ai/
swyx 5 months ago

go thru workshops here https://www.youtube.com/@aiDotEngineer/

jimmydoe 5 months ago

Amazing. I like the openness of both project and really excited for them.

Hopefully this does not mean consolidation due to resource dry up but true fusion of the bests.

option 5 months ago

Isn't HF banned in China? Also, how are many Chinese labs on Twitter all the time?

In either case - huge thanks to them for keeping AI open!

dragonwriter 5 months ago

> Isn't HF banned in China?
I think, for some definition of “banned”, that’s the case. It doesn’t stop the Chinese labs from having organization accounts on HF and distributing models there. ModelScope is apparently the HF-equivalent for reaching Chinese users.
disiplus 5 months ago

I think in the West we think everything is blocked. But for example, if you book an eSIM, when you visit you already get direct access to Western services because they route it to some other server. Hong Kong is totally different: they basically use WhatsApp and Google Maps, and everything worked when I was there.
- embedding-shape 5 months ago
  
  But also yes, parent is right, HF is more or less inaccessible, and Modelscope frequently cited as the mirror to use (although many Chinese labs seems to treat HF as the mirror, and Modelscope as the "real" origin).
woadwarrior01 5 months ago

HF is indeed banned in China. The Chinese equivalent of HF is ModelScope[1].
[1]: https://modelscope.cn/

androiddrew 5 months ago

One of the few acquisitions I do support

am17an 5 months ago

One often overlooked after that is ggml, the tensor library that runs llama.cpp is not based on pytorch, rather just plain cpp. In a world where pytorch dominates, it shows that alternatives are possible and are worthy to be pursued.

kristianp 5 months ago

> Towards seamless “single-click” integration with the transformers library

That's interesting. I thought they would be somewhat redundant. They do similar things after all, except training.

stephantul 5 months ago

Georgi is such a legend. Glad to see this happening

segmondy 5 months ago

Great news! I have always worried about ggml and long term prospect for them and wished for them to be rewarded for their effort.

fancy_pantser 5 months ago

Was Georgi ever approached by Meta? I wonder what they offered (I'm glad they didn't succeed, just morbid curiosity).

sbinnee 5 months ago

I am happy for ggml team. They did so much work for quantization and actually made it available to everyone. Thank you.

sheepscreek 5 months ago

Curious about the financials behind this deal. Did they close above what they raised? What’s in it for HuggingFace?

dhruv3006 5 months ago

Huggingface is actually something thats driving good in the world. Good to see this collab/

jpcompartir 5 months ago

This is great, brings clear benefits to both sides and the rest of us.

Always rooting for Hugging Face

karmasimida 5 months ago

Does local AI have a future? The models are getting ridiculously big and any storage hardware is hoarded by few companies for next 2 years and nvidia has stopped making consumer GPU for this year.

It seems to me there is no chance local ML is going to be anywhere out of the toy status comparing to closed source ones in short term

rhdunn 5 months ago

Mistral have small variants (3B, 8B, 14B, etc.), as do others like IBM Granite and Qwen. Then there are finetunes based on these models, depending on your workflow/requirements.
- karmasimida 5 months ago
  
  True, but anything remotely useful is 300B and above
  - Eupolemos 5 months ago
    
    That is a very broad and silly position to take, especially in this thread.
    I use Devstral 2 and Gemini 3 daily.
    
    ac29 5 months ago
    
    Devstral 2 is 123B parameters. Thats less than 300B, but its still much larger than the 3-14B models GP was talking about.
dust42 5 months ago
I am actually doing now a good part of dev with Qwen3-Coder-Next on an M1 64GB with Qwen Code CLI (a fork of Gemini CLI). I very much like
```
  a) to have an idea how much tokens I use and 
  b) be independent of VC financed token machines and 
  c) I can use it on a plane/train
```
Also I never have to wait in a queue, nor will I be told to wait for a few hours. And I get many answers in a second.
I don't do full vibe coding with a dozen agents though. I read all the code it produces and guide it where necessary.
Last not least, at some point the VC funded party will be over and when this happens one better knows how to be highly efficient in AI token use.
- ttoinou 5 months ago
  
  How much tokens per seconds are you getting ?
  Whats the advantage of qwen code cli over opencode ?
  - dust42 5 months ago
    
    320 tok/s PP and 42 tok/s TG with 4bit quant and MLX. Llama.cpp was half for this model but afaik has improved a few days ago, I haven't yet tested though.
    I have tried many tools locally and was never really happy with any. I tried finally Qwen Code CLI assuming that it would run well with a Qwen model and it does. YMMV, I mostly do javascript and Python. Most important setting was to set the max context size, it then auto compacts before reaching it. I run with 65536 but may raise this a bit.
    Last not least OpenCode is VC funded, at some point they will have to make money while Gemini CLI / Qwen CLI are not the primary products of the companies but definitely dog-fooded.
    
    ttoinou 5 months ago
    
    Works for me, but sometimes there's an issue with the tool template from Qwen, past chats are changed, thus KV cache gets invalidated and it needs to reprocess input tokens from scratch. Doesn't happen all the time though
    Btw I also get 42-60 tps on M4 Max with the MLX 4 bit quants hosted by LM Studio, which software do you use to run it ?
    
    dust42 5 months ago
    
    I use MLX server directly from the MLX community project (by Apple). 42 tps is with 0-5000 token context. Starts to drop from there, I have never seen 60.
    Yesterday I tested the latest llama.cpp and the result is that PP has made a huge jump to 420 tps which is 30% faster than MLX on my M1. TG is now 25 tps which is below MLX but does not degrade much, at 50k context it is still 22-23 tps.
    Together with Qwen code CLI llama.cpp does a lot less often re-process the full KV cache. So for now I am switching back to llama.cpp.
    It is worth to spend some time with the settings. I am really annoyed by the silly jokes (was it Claude that started this?). You can disable them with customWittyPhrases. Also setting contextWindowSize will make the CLI auto compress, which works really well for me.
    And depending on what you do, maybe set privacy.usageStatisticsEnabled to false.
    Like Gemini, Qwen CLI supports OpenTelemetry. When I have time I'll have a look why the KV cache gets invalidated.
    
    ttoinou 5 months ago
    
    Great thanks ! I am so annoyed by a specific phrase which is "launching wit.exe", not funny when it could actually be talking for real about software running on your machine

moralestapia 5 months ago

I hope Georgi gets a big fat check out of this, he deserves it 100%.

geooff_ 5 months ago

As someone who's been in the "AI" space for a while its strange how Hugging Face went from one of the biggest name to not a part of the discussion at all.

r_lee 5 months ago

I think that's because there's less local AI usage now since there's all kinds of image models by the big labs, so there's really no rush of people self hosting stable diffusion etc anymore
the space moved from Consumer to Enterprise pretty fast due to models getting bigger
- zozbot234 5 months ago
  
  Today's free models are not really bigger when you account for the use of MoE (with ever increasing sparsity, meaning a smaller fraction of active parameters), and better ways of managing KV caching. You can do useful things with very little RAM/VRAM, it just gets slower and slower the more you try to squeeze it where it doesn't quite belong. But that's not a problem if you're willing to wait for every answer.
  - r_lee 5 months ago
    
    yeah, but I mean more like the old setups where you'd just load a model on a 4090 or something, even with MoE it's a lot more complex and takes more VRAM, right? like it just seems not justifiable for most hobbyists
    but maybe I'm just slightly out of the loop
    
    zozbot234 5 months ago
    
    With sparse MoE it's worth running the experts in system RAM since that allows you to transparently use mmap and inactive experts can stay on disk. Of course that's also a slowdown unless you have enough RAM for the full set, but it lets you run much larger models on smaller systems.
segmondy 5 months ago

part of what discussion? anyone in the AI space knows and uses HF, but the public doesn't give a care and why should they? It's just an advanced site were nerds download AI stuff. HF is super valuable with their transformers library, their code, tutorials, smol-models, etc, but how does it translate to investor dollars?
LatencyKills 5 months ago

It isn't necessary to be part of the discussion if you are truly adding value (which HF continues to do). It's nice to see a company doing what it does best without constantly driving the hype train.

cyanydeez 5 months ago

Is there a local webui that integrates with Hugging face?

Ollama and webui seem to rapidly lose their charm. Ollama now includes cloud apis which makes no sense as a local.

lukebechtel 5 months ago

Thank you Georgi <3

periodjet 5 months ago

Prediction: Amazon will end up buying HuggingFace. Screenshot this.

forty 5 months ago

Looks like someone tried to type "Gmail" while drunk...

rkomorn 5 months ago

Looks like Gargamel of Smurfs fame to me.

rvz 5 months ago

This acquisition is almost the same as the acquisition of Bun by Anthropic.

Both $0 revenue "companies", but have created software that is essential to the wider ecosystem and has mindshare value; Bun for Javascript and Ggml for AI models.

But of course the VCs needed an exit sooner or later. That was inevitable.

andsoitis 5 months ago

I believe ggml.ai was funded by angel investors, not VC.

Settings

Ggml.ai joins Hugging Face to ensure the long-term progress of Local AI

Keyboard Shortcuts