Karpathy's llama2.c ported to pure Python (github.com)
I made a Jupyter notebook "llama2.ipynb" from the Karpathy project: https://github.com/rbitr/llama2.ipynb
I didn't do a pure Python version; mine uses NumPy, and although I haven't benchmarked it, it runs the stories15M model much faster than 1.3 tok/sec on my 2018 MacBook. You could try swapping in NumPy matrix multiplication, or the @ operator (native Python syntax since 3.5; NumPy implements it for arrays), in place of the hand-rolled matmul and see what changes.
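For illustration, here's a minimal sketch of what that swap might look like, assuming the port mirrors llama2.c's matmul(xout, x, w, n, d) convention with a flat row-major weight buffer (the names and signature here are illustrative, not taken from the repo):

    import numpy as np

    def matmul_pure(xout, x, w, n, d):
        # hand-rolled version: d * n Python-level loop iterations
        for i in range(d):
            val = 0.0
            for j in range(n):
                val += w[i * n + j] * x[j]
            xout[i] = val

    def matmul_np(xout, x, w, n, d):
        # NumPy version: reshape the flat weights and let @ do the work
        # (@ is native Python, PEP 465; NumPy implements it for ndarrays)
        w2d = np.asarray(w, dtype=np.float32).reshape(d, n)
        xout[:] = w2d @ np.asarray(x, dtype=np.float32)

The speedup comes from the @ call handing the whole dot product to NumPy's compiled backend instead of looping in the interpreter.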
1.3 tok/sec is similar to the performance of my Python port, though I tested on an M1 Max.
The llama2.py code defines its own accum, rmsnorm, and matmul. Why not use NumPy? A "pure Python" implementation that is much slower than one using NumPy is less interesting to me.
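For a sense of scale, rmsnorm is the kind of routine that collapses to a couple of NumPy lines. A sketch following the RMSNorm computation in llama2.c, with an assumed eps of 1e-5 (not the repo's actual code):

    import numpy as np

    def rmsnorm_np(x, weight, eps=1e-5):
        # normalize by the root-mean-square of x, then apply the learned gain
        x = np.asarray(x, dtype=np.float32)
        return np.asarray(weight, dtype=np.float32) * x / np.sqrt(np.mean(x * x) + eps)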
If your goal is to make it as fast as possible, then a Python implementation is certainly not the solution. I think that's exactly why llama.cpp got so much attention.
I find these efforts impressive, but what is the value proposition here? (I'm not just talking about this fork, but about Karpathy's llama2.c as well.)
For me personally, the value was implementing the complex logic from a scientific paper in pure Python. It helps in understanding the essence of a cutting-edge AI technology. And it's quite fascinating that it takes only about 500 lines of core code to implement inference for such a complex system.
Regarding the original llama2.c, I believe the value proposition is a simple implementation that can run inference locally on a wide variety of platforms. What if we could run a fine-tuned Llama 7B on our phones?
> What if we could run a fine-tuned Llama 7B on our phones?
7B and 13B are already quite performant with mlc-llm (which uses an Apache TVM Vulkan/Metal backend). Llama.cpp has the potential to perform well too.
These "single file" implementations are not meant to be optimized or feature rich, I dont think.
It's educational. It shows how Llama works in a clear, concise, testable way.
Writing one's own implementation and/or porting every line of code has great value.