Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

github.com

152 points by yu3zhou4 16 hours ago · 16 comments

yu3zhou4OP 15 hours ago

The README is, in my opinion (author here), the most interesting part. I wrote it to help others build a useful mental model so they can recreate the project themselves, without even needing to read my code.

  • janalsncm 10 hours ago

    Really practical teaching approach. I clicked in to see how safetensors are loaded and just kept reading. Thanks for sharing.
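    For readers wondering what loading safetensors involves, here is a minimal sketch of the file layout, not the project's actual loader. It assumes the published safetensors format: an 8-byte little-endian length N, then N bytes of JSON describing each tensor, then the raw tensor bytes. The names `SafetensorsHeader` and `read_safetensors_header` are hypothetical and chosen only for illustration.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Hypothetical helper types, not taken from Tiny-vLLM.
    struct SafetensorsHeader {
        std::uint64_t json_size;  // length of the JSON header in bytes
        std::string json;         // raw JSON text (names, dtypes, shapes, offsets)
        std::size_t data_start;   // offset of the first tensor byte in the file
    };

    // Splits an in-memory safetensors file into its JSON header and data offset.
    SafetensorsHeader read_safetensors_header(const std::vector<std::uint8_t>& bytes) {
        if (bytes.size() < 8) {
            throw std::runtime_error("file too small for the 8-byte length prefix");
        }
        // Decode the little-endian u64 byte by byte so host endianness does not matter.
        std::uint64_t n = 0;
        for (int i = 0; i < 8; ++i) {
            n |= static_cast<std::uint64_t>(bytes[i]) << (8 * i);
        }
        if (n > bytes.size() - 8) {
            throw std::runtime_error("header length exceeds file size");
        }
        SafetensorsHeader h;
        h.json_size = n;
        h.json.assign(bytes.begin() + 8,
                      bytes.begin() + 8 + static_cast<std::ptrdiff_t>(n));
        h.data_start = 8 + static_cast<std::size_t>(n);
        return h;
    }
    ```

    Everything from `data_start` onward is the "lot of float numbers": each tensor's `data_offsets` entry in the JSON is relative to that position. A real loader would go on to parse the JSON and would probably memory-map the file rather than read it into a vector.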

tom-wal 3 hours ago

I feel like I learned twice as much in 10 minutes of reading this as I did reading LLM for Dummies. Thank you

xuanlin314 9 hours ago

The lesson-style README is a great approach. Breaking down LLM inference into digestible steps makes the codebase approachable even for people who haven't touched CUDA before.

GoldenJade 9 hours ago

Thanks for sharing this. As someone currently researching LLMs, I'm sure I'll be referencing this quite a bit going forward.

dwa3592 13 hours ago

Very nice job on the README.

>>Physically, LLM is a file which contains a lot of float numbers.

aka atoms of the LLM.

juancn 14 hours ago

Looks interesting, it reminds me of the first llama.cpp, but better documented.

nazgulsenpai 15 hours ago

I love the documentation formatted in lessons. I can't wait to read through it.

sylware 2 hours ago

I am looking for a plain and simple LLM inference implementation in C, and/or in x86_64 assembly, and/or in AMD GPU RDNA assembly.

Anybody?

cookiengineer 13 hours ago

Wanted to add that the author has an amazing blog with lots of interesting papers: https://jedrzej.maczan.pl/

einpoklum 13 hours ago

It seems the author believes checking the return values of CUDA API calls is not "tiny" enough :-(
