Small offline large language model – TinyChatEngine from MIT
(graphthinking.blogspot.com)
Use llama.cpp for quantized model inference. It is simpler (no Docker or Python required), faster (works well on CPUs), and supports many models.
Also, there are better models than the one suggested: Mistral at 7B parameters, or Yi if you want to go larger and happen to have 32 GB of memory. Mixtral MoE is the best, but it requires too much memory right now for most users.
I'm curious: what do you use these small LLMs for? Can you give some examples of (not too) personal use cases from the past month?
My understanding (I haven't used a fine-tuned one) is that you can use a model you fine-tune yourself for narrow automation tasks, kind of like a superpowered script. From my Llama 2 7B experiments, I have not gotten great results out of the non-fine-tuned versions of the model for coding tasks. I haven't tried Code Llama yet.
Thanks for the suggestion. I'm new to running LLMs, so I'll take a look [0]. My ~10-year-old MacBook Air has 4 GB of RAM, so I'm primarily interested in smaller LLMs.
You don't necessarily need to fit the whole model in memory – llama.cpp supports mmapping the model directly from disk in some cases. Naturally, inference speed will be affected.
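To illustrate what memory-mapping buys you (a generic sketch, not llama.cpp's internals): the file is mapped into the process's address space and the OS pages in only the regions that actually get touched, so a model larger than RAM can still be read lazily. The file name and sizes below are made up to keep the sketch self-contained.

    import numpy as np

    # Write a dummy "weights" file so the sketch runs as-is
    # (a stand-in for a multi-gigabyte model file).
    np.zeros(1_000_000, dtype=np.float16).tofile("weights.bin")

    # Map the file instead of reading it: nothing is loaded into RAM up front;
    # the OS pages in only the parts we touch.
    weights = np.memmap("weights.bin", dtype=np.float16, mode="r")

    block = np.array(weights[:4096])   # touching a slice faults in just those pages
    print(block.shape, weights.shape)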
Btw, shouldn't it in theory be possible to run Mixtral MoE by loading each expert sequentially, storing its outputs, and then doing the rest of the algorithm, to make it easier to run on machines that cannot fit the whole model in memory?
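Roughly what I mean, as a toy sketch (not how Mixtral or llama.cpp actually work): for each MoE layer only the router-selected experts are needed, so in principle their weights could be pulled from disk one at a time and discarded after use. Sizes, file layout, and the random weights below are made up so the sketch runs on its own.

    import numpy as np

    HIDDEN = 64        # made-up hidden size, kept tiny for the sketch
    TOP_K = 2          # experts used per token (Mixtral-style top-2 routing)
    N_EXPERTS = 8

    def load_expert(idx):
        # In a real setup this would read one expert's weights from disk
        # (e.g. np.load(f"expert_{idx}.npy")); random weights keep it runnable.
        rng = np.random.default_rng(idx)
        return rng.standard_normal((HIDDEN, HIDDEN)).astype(np.float32) * 0.01

    def moe_layer(x, router_logits):
        top = np.argsort(router_logits)[-TOP_K:]     # pick the top-k experts
        probs = np.exp(router_logits[top])
        probs /= probs.sum()

        out = np.zeros_like(x)
        for weight, idx in zip(probs, top):
            w = load_expert(idx)        # load one expert's weights...
            out += weight * (x @ w)     # ...use them...
            del w                       # ...and free them before the next expert
        return out

    x = np.ones(HIDDEN, dtype=np.float32)
    router_logits = np.random.default_rng(0).standard_normal(N_EXPERTS)
    print(moe_layer(x, router_logits).shape)

The catch is exactly what the replies point out: re-reading gigabytes of expert weights from disk for every layer trades memory for a lot of latency.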
Yes, but loading weights into memory takes time.
Yeah, I imagine sequential inference would be slower. How long do you have to wait to load these weights on a personal PC? I haven't tried these systems so far.
Python is only used in the toolchain; the inference engine is entirely C/C++.
I’m a tad confused
> TinyChatEngine provides an off-line open-source large language model (LLM) that has been reduced in size.
But then they download the models from Hugging Face. I don't understand how these are smaller. Or do they modify them locally?
https://github.com/mit-han-lab/TinyChatEngine
Turns out the original source is actually somewhat informative, including telling you how much hardware you need. The blog post looks like the typical note you leave for yourself to annotate a bit of your shell history.
I wish all these repos were clearer about the hardware requirements. Seeing that it runs on an 8 GB Raspberry Pi, probably with abysmal performance, I'd say it will run on my 32 GB Intel laptop on the CPU. Will it run on its Nvidia card? I remember the rule of thumb was one GB of GPU RAM per billion parameters, so I'd say it won't. However, this uses 4-bit quantization, so it could have lower requirements.
Of course, the main problem is that I don't know enough about the subject to reason about it on my own.
Roughly speaking, I believe it's the number of parameters times the size of each parameter. So in the 4-bit case it's half a gigabyte per billion parameters.
From a performance point of view, (quantized) integer parameters are going to run better on CPUs than floating-point parameters.
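To make the arithmetic concrete (my own fudge factor for overhead, not a number from the repo):

    def model_memory_gb(n_params_billion, bits_per_param, overhead=1.2):
        """Rough RAM/VRAM estimate: parameters * bytes per parameter, plus ~20%
        overhead for KV cache and activations (the 1.2 factor is a guess)."""
        bytes_total = n_params_billion * 1e9 * bits_per_param / 8
        return bytes_total * overhead / 1e9

    for bits in (16, 8, 4):
        print(f"7B model at {bits}-bit: ~{model_memory_gb(7, bits):.1f} GB")
    # 4-bit is ~0.5 GB per billion parameters before overhead, which is why a
    # 7B model fits on the 8 GB Raspberry Pi mentioned above.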
Your assessment is exactly correct -- the blog post is my note-to-self about getting the repo to work. My "added value" in the post is a Dockerfile for ease of installation.
They have postprocessed the models specifically for size and latency. They published several papers on this.
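For anyone wondering what "reduced in size" means in practice: the main trick is weight quantization, i.e. storing each weight in 4 bits instead of 16. A toy illustration of symmetric 4-bit quantization follows; this is not their actual method (their papers – AWQ and SmoothQuant, if I remember right – are activation-aware and more sophisticated).

    import numpy as np

    def quantize_int4(w):
        """Toy symmetric 4-bit quantization of a weight tensor.
        Real schemes quantize per group/channel and keep the scales."""
        scale = np.abs(w).max() / 7.0                      # int4 range is -8..7
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize_int4(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(8, 8).astype(np.float32)
    q, scale = quantize_int4(w)
    w_hat = dequantize_int4(q, scale)
    print("max error:", np.abs(w - w_hat).max())
    # Packed as 4-bit codes (two per byte), the weights take ~1/4 the space of fp16.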
Their optimized models are not downloaded from HF but from Dropbox. I have no idea why.
"Small large" ..... so, medium? :)
No - LLMs can't talk to the dead, they're just fancy autocompletes
I have used them and I can say they're pretty decent overall. I personally plan to use TinyEngine, which targets even smaller IoT microcontroller devices.
May I ask what your use case is? I've found LLMs are pretty good at parsing unstructured data into JSON, with minimal hallucinations.
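Roughly the pattern I mean, as a sketch – run_local_llm is a stand-in for however you call your local model (llama.cpp server, llama-cpp-python, etc.), not a real API:

    import json

    def run_local_llm(prompt: str) -> str:
        # Hypothetical: replace with your own call to a local model.
        raise NotImplementedError

    def extract_contact(text: str) -> dict:
        prompt = (
            "Extract the person's name, email and company from the text below.\n"
            "Reply with a single JSON object with keys name, email, company "
            "and nothing else.\n\n"
            f"Text: {text}\n\nJSON:"
        )
        raw = run_local_llm(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Small models often wrap the JSON in extra prose; trim to the braces.
            start, end = raw.find("{"), raw.rfind("}") + 1
            return json.loads(raw[start:end])

Telling the model to reply with JSON only, plus a forgiving parse step, is most of the trick; llama.cpp's grammar-constrained sampling makes the output format much more reliable.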
Is there a tutorial on how to do something like that? It sounds damn useful.
I’m also curious.
Where is a good place to understand the high-level topics in AI, like how an offline language model compares to a presumably online model?
I tried this and installation was easy on macOS 10.14.6 (once I updated Clang correctly).
Performance on my relatively old i5-8600 (6 cores at 3.10 GHz) with 32 GB of memory is about 150-250 ms per token (roughly 4-7 tokens per second) on the default model, which is perfectly usable.