Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

152 points by yu3zhou4 16 hours ago · 16 comments

Author here. In my opinion the README is the most interesting part: I wrote it to help others build a useful mental model, so that you can recreate the project yourself without even needing to read my code.

janalsncm 10 hours ago

Really practical teaching approach. I clicked in to see how safetensors are loaded and just kept reading. Thanks for sharing.

tom-wal 3 hours ago

I feel like I learned twice as much in 10 minutes reading this as I did reading LLM for Dummies. Thank you.

xuanlin314 9 hours ago

The lesson-style README is a great approach. Breaking down LLM inference into digestible steps makes the codebase approachable even for people who haven't touched CUDA before.

GoldenJade 9 hours ago

Thanks for sharing this. As someone currently researching LLMs, I'm sure I'll be referencing this quite a bit going forward.

dwa3592 13 hours ago

Very nice job on the README.

>>Physically, LLM is a file which contains a lot of float numbers.

aka atoms of the LLM.

cyanydeez 13 hours ago

the universe is just atomic if statements
- nullpoint420 4 hours ago
  
  it from bit
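
To make the quoted README line concrete ("an LLM is a file which contains a lot of float numbers"), here is a minimal C++ sketch of splitting a safetensors-style buffer into its JSON header and raw float data. It relies on the published safetensors layout: an 8-byte little-endian header length, then the JSON header, then the tensor bytes. The names `SplitFile`, `split_safetensors` and `read_f32` are hypothetical and are not taken from Tiny-vLLM; the sketch also assumes a little-endian host and F32 tensors, and it does not parse the JSON itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper type: the two parts of a safetensors-style buffer.
struct SplitFile {
    std::string header_json;  // JSON text describing each tensor
    std::size_t data_start;   // byte offset where the raw tensor data begins
};

// Decode the 8-byte little-endian header length, then slice out the JSON header.
SplitFile split_safetensors(const std::vector<std::uint8_t>& bytes) {
    if (bytes.size() < 8) {
        throw std::runtime_error("buffer too short for header length");
    }
    std::uint64_t n = 0;
    for (int i = 0; i < 8; ++i) {
        n |= static_cast<std::uint64_t>(bytes[i]) << (8 * i);
    }
    if (n > bytes.size() - 8) {
        throw std::runtime_error("header length exceeds buffer size");
    }
    SplitFile out;
    out.header_json.assign(reinterpret_cast<const char*>(bytes.data()) + 8,
                           static_cast<std::size_t>(n));
    out.data_start = 8 + static_cast<std::size_t>(n);
    return out;
}

// Copy `count` float32 values that start `begin` bytes into the data section.
// Assumption: the host is little-endian and the tensor dtype is F32.
std::vector<float> read_f32(const std::vector<std::uint8_t>& bytes,
                            std::size_t data_start,
                            std::size_t begin,
                            std::size_t count) {
    if (data_start > bytes.size() ||
        begin > bytes.size() - data_start ||
        count > (bytes.size() - data_start - begin) / sizeof(float)) {
        throw std::runtime_error("tensor range out of bounds");
    }
    std::vector<float> out(count);
    if (count > 0) {
        std::memcpy(out.data(), bytes.data() + data_start + begin,
                    count * sizeof(float));
    }
    return out;
}
```

In a real loader, `begin` and `count` would come from each tensor's `data_offsets` and `shape` entries in the JSON header rather than being passed in by hand.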

juancn 14 hours ago

Looks interesting, it reminds me of the first llama.cpp, but better documented.

nazgulsenpai 15 hours ago

I love the documentation formatted in lessons. I can't wait to read through it.

sylware 2 hours ago

I am looking for a plain and simple LLM inference implementation in C, and/or x86_64 assembly, and/or AMD GPU RDNA assembly.

Anybody?

cookiengineer 13 hours ago

Wanted to add that the author has an amazing blog with lots of interesting papers: https://jedrzej.maczan.pl/

einpoklum 13 hours ago

It seems the author believes checking the return values of CUDA API calls is not "tiny" enough :-(
