Small offline large language model – TinyChatEngine from MIT
(graphthinking.blogspot.com)
Use llama.cpp for quantized model inference. It is simpler (no Docker or Python required), faster (works well on CPUs), and supports many models.
Also, there are better models than the one suggested: Mistral at 7B parameters, or Yi if you want to go larger and happen to have 32 GB of memory. Mixtral MoE is the best, but it requires too much memory right now for most users.
I'm curious: what do you use these small LLMs for? Can you give some examples of (not too) personal use cases from the past month?
My understanding (I haven't used a fine-tuned one) is that you can use a model you fine-tune yourself for narrow automation tasks, kind of like a superpowered script. From my Llama 2 7B experiments, I have not gotten great results out of the non-fine-tuned versions of the model for coding tasks. I haven't tried Code Llama yet.
Thanks for the suggestion. I'm new to running LLMs, so I'll take a look [0]. My ~10-year-old MacBook Air has 4 GB of RAM, so I'm primarily interested in smaller LLMs.
You don't necessarily need to fit the whole model in memory – llama.cpp supports mmapping the model directly from disk in some cases. Naturally, inference speed will be affected.
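To illustrate what memory-mapping buys you (a generic sketch, not llama.cpp's internals): the file is mapped into the process's address space and the OS pages in only the regions that actually get touched, so a model larger than RAM can still be read lazily. The file name and sizes below are made up to keep the sketch self-contained.

    import numpy as np

    # Write a dummy "weights" file so the sketch runs as-is
    # (a stand-in for a multi-gigabyte model file).
    np.zeros(1_000_000, dtype=np.float16).tofile("weights.bin")

    # Map the file instead of reading it: nothing is loaded into RAM up front;
    # the OS pages in only the parts we touch.
    weights = np.memmap("weights.bin", dtype=np.float16, mode="r")

    block = np.array(weights[:4096])   # touching a slice faults in just those pages
    print(block.shape, weights.shape)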
Btw, shouldn't it in theory be possible to run Mixtral MoE by loading each expert sequentially, storing its outputs, and then doing the rest of the algorithm, to make it easier to run on machines that cannot fit the whole model in memory?
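Roughly what I mean, as a toy sketch (not how Mixtral or llama.cpp actually work): for each MoE layer only the router-selected experts are needed, so in principle their weights could be pulled from disk one at a time and discarded after use. Sizes, file layout, and the random weights below are made up so the sketch runs on its own.

    import numpy as np

    HIDDEN = 64        # made-up hidden size, kept tiny for the sketch
    TOP_K = 2          # experts used per token (Mixtral-style top-2 routing)
    N_EXPERTS = 8

    def load_expert(idx):
        # In a real setup this would read one expert's weights from disk
        # (e.g. np.load(f"expert_{idx}.npy")); random weights keep it runnable.
        rng = np.random.default_rng(idx)
        return rng.standard_normal((HIDDEN, HIDDEN)).astype(np.float32) * 0.01

    def moe_layer(x, router_logits):
        top = np.argsort(router_logits)[-TOP_K:]     # pick the top-k experts
        probs = np.exp(router_logits[top])
        probs /= probs.sum()

        out = np.zeros_like(x)
        for weight, idx in zip(probs, top):
            w = load_expert(idx)        # load one expert's weights...
            out += weight * (x @ w)     # ...use them...
            del w                       # ...and free them before the next expert
        return out

    x = np.ones(HIDDEN, dtype=np.float32)
    router_logits = np.random.default_rng(0).standard_normal(N_EXPERTS)
    print(moe_layer(x, router_logits).shape)

The catch is exactly what the replies point out: re-reading gigabytes of expert weights from disk for every layer trades memory for a lot of latency.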
Yes, but loading weights into memory takes time.
Yeah, I imagine sequential inference would be slower. How long do you have to wait to load these weights on a personal PC? I haven't tried these systems so far.
Python is only used in the toolchain; the inference engine is entirely C/C++.
I’m a tad confused
> TinyChatEngine provides an off-line open-source large language model (LLM) that has been reduced in size.
But then they download the models from Hugging Face. I don't understand how these are smaller. Or do they modify them locally?
https://github.com/mit-han-lab/TinyChatEngine
Turns out the original source is actually somewhat informative, including telling you how much hardware you need. The blog post looks like the typical note you leave for yourself to annotate a bit of your shell history.
I wish all these repos were clearer about the hardware requirements. Seeing that it runs on an 8 GB Raspberry Pi, probably with abysmal performance, I'd say it will run on my 32 GB Intel laptop on the CPU. Will it run on its Nvidia card? I remember the rule of thumb was one GB of GPU RAM per billion parameters, so I'd say it won't. However, this uses 4-bit quantization, so it could have lower requirements.
Of course, the main problem is that I don't know enough about the subject to reason about it on my own.
Roughly speaking, I believe it's the number of parameters times the size of each parameter. So in the 4-bit case it's half a gigabyte per billion parameters.
From a performance point of view, (quantized) integer parameters are going to run better on CPUs than floating-point parameters.
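To make the arithmetic concrete (my own fudge factor for overhead, not a number from the repo):

    def model_memory_gb(n_params_billion, bits_per_param, overhead=1.2):
        """Rough RAM/VRAM estimate: parameters * bytes per parameter, plus ~20%
        overhead for KV cache and activations (the 1.2 factor is a guess)."""
        bytes_total = n_params_billion * 1e9 * bits_per_param / 8
        return bytes_total * overhead / 1e9

    for bits in (16, 8, 4):
        print(f"7B model at {bits}-bit: ~{model_memory_gb(7, bits):.1f} GB")
    # 4-bit is ~0.5 GB per billion parameters before overhead, which is why a
    # 7B model fits on the 8 GB Raspberry Pi mentioned above.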
Your assessment is exactly correct -- the blog post is my note-to-self about getting the repo to work. My "added value" in the post is a Dockerfile for ease of installation.
They have postprocessed the models specifically for size and latency. They published several papers on this.
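For anyone wondering what "reduced in size" means in practice: the main trick is weight quantization, i.e. storing each weight in 4 bits instead of 16. A toy illustration of symmetric 4-bit quantization follows; this is not their actual method (their papers – AWQ and SmoothQuant, if I remember right – are activation-aware and more sophisticated).

    import numpy as np

    def quantize_int4(w):
        """Toy symmetric 4-bit quantization of a weight tensor.
        Real schemes quantize per group/channel and keep the scales."""
        scale = np.abs(w).max() / 7.0                      # int4 range is -8..7
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize_int4(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(8, 8).astype(np.float32)
    q, scale = quantize_int4(w)
    w_hat = dequantize_int4(q, scale)
    print("max error:", np.abs(w - w_hat).max())
    # Packed as 4-bit codes (two per byte), the weights take ~1/4 the space of fp16.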
Their optimized models are not downloaded from HF but from Dropbox. I have no idea why.
"Small large" ..... so, medium? :)
No - LLMs can't talk to the dead, they're just fancy autocompletes
I have used them and I can say they're pretty decent overall. I personally plan to use TinyEngine, which targets even smaller IoT microcontroller devices.
May I ask what your use case is? I've found LLMs are pretty good at parsing unstructured data into JSON, with minimal hallucinations.
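Roughly the pattern I mean, as a sketch – run_local_llm is a stand-in for however you call your local model (llama.cpp server, llama-cpp-python, etc.), not a real API:

    import json

    def run_local_llm(prompt: str) -> str:
        # Hypothetical: replace with your own call to a local model.
        raise NotImplementedError

    def extract_contact(text: str) -> dict:
        prompt = (
            "Extract the person's name, email and company from the text below.\n"
            "Reply with a single JSON object with keys name, email, company "
            "and nothing else.\n\n"
            f"Text: {text}\n\nJSON:"
        )
        raw = run_local_llm(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Small models often wrap the JSON in extra prose; trim to the braces.
            start, end = raw.find("{"), raw.rfind("}") + 1
            return json.loads(raw[start:end])

Telling the model to reply with JSON only, plus a forgiving parse step, is most of the trick; llama.cpp's grammar-constrained sampling makes the output format much more reliable.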
Is there a tutorial on how to do something like that? It sounds damn useful.
I’m also curious.
Where is a good place to understand the high-level topics in AI, like how an offline language model compares to a presumably online model?
I tried this and installation was easy on macOS 10.14.6 (once I updated Clang correctly).
Performance on my relatively old i5-8600 (6 cores at 3.10 GHz) with 32 GB of memory is about 150-250 ms per token (roughly 4-7 tokens per second) on the default model, which is perfectly usable.