Llama2.java: Karpathy's llama2.c ported to Java (github.com)

33 points by mukel 2 years ago · 18 comments

gavinray 2 years ago

The Java code is impressively written, using newer features like MemorySegment (rough sketch of that at the end of this comment).

Looked at the author and realized it's Alfonso from the Graal team -- makes sense.

I wonder whether the "matmul" code could be further optimized with the Vector API and SIMD.
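
For anyone who hasn't touched the FFM API yet, the MemorySegment part boils down to memory-mapping the checkpoint and reading the weights off-heap, roughly like the sketch below (illustrative only, not the repo's actual code; still a preview API on JDK 21, so it needs --enable-preview):

    import java.io.IOException;
    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MmapWeights {
        public static void main(String[] args) throws IOException {
            Path checkpoint = Path.of("stories15M.bin");   // hypothetical checkpoint file
            try (FileChannel ch = FileChannel.open(checkpoint, StandardOpenOption.READ);
                 Arena arena = Arena.ofConfined()) {
                // Map the whole file: the weights stay off-heap, nothing is copied onto the Java heap.
                MemorySegment weights = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
                // Read a single float at a byte offset (native byte order, as written by llama2.c).
                float w0 = weights.get(ValueLayout.JAVA_FLOAT_UNALIGNED, 0);
                System.out.println("first float = " + w0);
            }
        }
    }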

  • mukelOP 2 years ago

    Author here: I implemented several versions of matmul with different unrolling schemes using the Vector API and I got a ~4X speedup with a single thread, but the speedup fades the more threads you add. I think that performance is constrained by memory bandwidth which is saturated with a small number of threads, regardless of vectorization.
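
    The simplest variant looks roughly like this (a simplified sketch from memory: one accumulator, no unrolling; it needs --add-modules jdk.incubator.vector):

        import jdk.incubator.vector.FloatVector;
        import jdk.incubator.vector.VectorOperators;
        import jdk.incubator.vector.VectorSpecies;

        final class MatMul {
            static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

            // out[i] = dot(w[i*n .. i*n+n), x) for i in [0, d)
            static void matmul(float[] out, float[] x, float[] w, int n, int d) {
                for (int i = 0; i < d; i++) {
                    int base = i * n;
                    FloatVector acc = FloatVector.zero(S);
                    int upper = S.loopBound(n);
                    int j = 0;
                    for (; j < upper; j += S.length()) {
                        FloatVector wv = FloatVector.fromArray(S, w, base + j);
                        FloatVector xv = FloatVector.fromArray(S, x, j);
                        acc = wv.fma(xv, acc);            // acc += w * x, lane-wise FMA
                    }
                    float sum = acc.reduceLanes(VectorOperators.ADD);
                    for (; j < n; j++) {                  // scalar tail
                        sum += w[base + j] * x[j];
                    }
                    out[i] = sum;
                }
            }
        }

    The unrolled versions basically just keep several independent accumulators to expose more instruction-level parallelism; past a handful of threads the memory bus is the limit either way.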

  • kurhan 2 years ago

    Also, the new virtual threads might be beneficial. I experimented with the Vector API for matrix multiplication once and the effect was pretty good.

    • mike_hearn 2 years ago

      Virtual threads shouldn't help as the program isn't I/O or wait bottlenecked. It's a pure computation, so it's all about vectorization here.

atairov 2 years ago

Thanks for sharing this! It's great to have a reference implementation written in Java. Given the original's simplicity, it's really easy to follow the llama architecture logic.

Just in case anyone is interested in a Python version, I spent some time over the weekend and ported it to pure Python -- https://github.com/tairov/llama2.py

I never knew it would take only about 500 lines of core code to implement inference for such cutting-edge AI technology.

mukelOP 2 years ago

A Java port of llama2.c that performs very close to C on large models. Llama 2 7B runs at a whopping 1.6 tokens/s.

  • mike_hearn 2 years ago

    Hey man, awesome stuff. Surely any JIT compiler will struggle to vectorize something using IntStream.range, though? Looking at matmul, I wouldn't expect that to be auto-vectorized. The Panama Vector API could be used to vectorize matmul; too bad it seems like it will never launch.
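
    For context, the loop in question has roughly the shape below (my sketch, assuming rows are split across a parallel IntStream the way llama2.c uses OpenMP threads, not necessarily the repo's exact code). HotSpot is usually reluctant to auto-vectorize a floating-point dot product like this, since reordering the adds can change the result, which is why an explicit Vector API kernel pays off:

        import java.util.stream.IntStream;

        final class MatMulScalar {
            // W is (d, n) row-major, x is (n,), out is (d,). Rows are independent,
            // so they are split across the common fork/join pool; every worker is
            // pure computation, nothing ever blocks.
            static void matmul(float[] out, float[] x, float[] w, int n, int d) {
                IntStream.range(0, d).parallel().forEach(i -> {
                    float sum = 0.0f;
                    int base = i * n;
                    for (int j = 0; j < n; j++) {
                        sum += w[base + j] * x[j];
                    }
                    out[i] = sum;
                });
            }
        }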

shortrounddev2 2 years ago

Have you all used these things for anything useful? I can't get them to give useful results on my 3060 8GB. If I wanted decent results, I think I'd need to rent a GPU node somewhere, but ChatGPT is still free.

  • SushiHippie 2 years ago

    The 4-bit quantized 13B models give really decent answers (not as good as GPT-4, but often as good as GPT-3).

  • nmfisher 2 years ago

    I know it might be asking a lot, but it would be great if someone could put up a HF space so I could try all the various flavours/sizes.

    • lazylion2 2 years ago

      /r/LocalLLaMA/

      • nmfisher 2 years ago

        I'm already subscribed (and I already ran the small version locally), but I'd still like to be able to quickly evaluate the models online in a couple of minutes, rather than going through the rigmarole of downloading & running every new model/variant locally.

jiehong 2 years ago

This makes me wonder: what’s the status of GPU programming on the JVM?

Any abstractions for GPGPU or shader programming?
