# Llama Deck
Llama Deck is a command-line tool for quickly managing and experimenting with multiple versions of llama inference implementations. It helps you quickly filter and download different llama implementations and llama2-like transformer-based LLM models. We also provide Docker images based on some implementations, which can be easily deployed and run through our tool.
Inspired by the llama2.c project and forked from llama-shepherd-cli.
## Shortcuts

- Install the tool: `pip install llama-deck`
- Manage repositories: `list_repo`, `install_repo`, `-l <language>`
- Manage models: `list_model`, `install_model`, `-m <model_name>`
- Manage and run Docker images: `install_img`, `run_img`
## Install
To install the tool, simply run:
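```shell
pip install llama-deck
```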
## Explore & Download Llama Repositories
### List Repositories
To list all llama implementations, run:
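```shell
llama-deck list_repo
```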
You can also set -l to specify the language of the repository, like:
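```shell
# e.g. list only C implementations ("c" is an example language value)
llama-deck list_repo -l c
```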

### Download Repositories
You can also download those implementation repositories through our tool:
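```shell
llama-deck install_repo
```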
You can also set -l to specify a language.
Once it runs, you can download multiple repositories at once by entering their row numbers from the listed table. If you don't like the default download path, you can also specify your own.
Repositories are saved and organized by language and author name; you can find them in <specified download path>/llamaRepos.
### Available Repositories
Originating from the llama2.c project by Andrej Karpathy.
## Explore & Download Models
Currently the tool only includes the Tinyllamas provided in the llama2.c project, and Meta-Llama. More model options will be added over time.
The operations for listing and downloading models are similar to those for repositories. To list available models, run:
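```shell
llama-deck list_model
```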
And to download a model:
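```shell
llama-deck install_model
```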
Similarly, the -m flag is optional and can be set to specify the model name you want to show and download.
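For example, to fetch a specific model by name (using stories15M from the Available Models table below):

```shell
llama-deck install_model -m stories15M
```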
The tool can also help you download the default tokenizer provided in llama2.c.
### Available Models
More model options will be added over time.
| # | Model | URL |
|---|---|---|
| 1 | stories15M | https://huggingface.co/karpathy/tinyllamas |
| 2 | stories42M | https://huggingface.co/karpathy/tinyllamas |
| 3 | stories110M | https://huggingface.co/karpathy/tinyllamas |
| 4 | Meta-Llama | https://llama.meta.com/llama-downloads/ |
IMPORTANT! Meta-Llama models are license protected, which means you still need to apply to Meta for download permission. But once you receive the download URL from Meta's confirmation email, this tool will automatically grab and run the download.sh script provided by Meta to help you download the Meta-Llama models.
## Install & Run Images
To make it quick to deploy and experiment with multiple versions of llama inference implementations, we built an image repository consisting of several dockerized popular implementations. See our image repository.
llama-deck can access, pull, and run these dockerized implementations. When you need to run multiple implementations, or compare performance differences between them, this greatly saves the effort of deploying each implementation, configuring its runtime environment, and learning how to run inference with it.
Before trying these functions, make sure Docker is already installed and running on your device.
### Install Docker Images
To list images from our image repository, use:
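```shell
llama-deck list_img
```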
And to install an image:
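```shell
llama-deck install_img
```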
For both the list_img and run_img actions, an optional flag -i <image tag> can be set to check whether a specific tag is included. All image tags are named with the format <repository name>_<author> (e.g. for Karpathy's llama2.c, the image tag is llama2.c_karpathy).
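For example, to check whether the image for Karpathy's llama2.c is available (tag taken from the Available Images table below):

```shell
llama-deck list_img -i llama2.c_karpathy
```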
The process of installing images is mostly the same as installing repositories and models.

Back to Shortcuts
### Run the Docker Images
There are two ways to run images.
#### 1. Follow the Instructions
Run:

```shell
llama-deck run_img
```

Simply call the run_img action and let the tool find resources and help you set all configs for model inference. After running this, it will automatically check and list the installed images that can be run by this tool.
You will be asked to:
Step 1. Select one or more Docker images you want to run.
Step 2. Select one model, or specify the model path (abs path needed).
Step 3. Set inference arguments: -i,-t,-p... (Optional)
e.g. in Step 3, input -n 256 -i "Once upon a time", and all selected images will run inference with steps=256 and prompt="Once upon a time".
Then the tool will run all your selected images with the args you set, and you will see stdout from all the running containers (images), with arg status and inference results printed, like:

#### 2. Run Image in a Single Command
A faster way to run a specific image is to call run_img action with specified image_tag and model_path, followed by inference args if needed.
```shell
llama-deck run_img <image_tag> <model_path> <other args (optional)>
```
For example, if I want to:

- run inference on the model /home/bufan/LlamaDeckResources/llamaModels/stories15M.bin
- run llama2.java inside the image with tag 'llama2.java_mukel'
- use 128 steps and the prompt "Once upon a time"
Then the command is:
```shell
llama-deck run_img llama2.java_mukel \
    /home/bufan/LlamaDeckResources/llamaModels/stories15M.bin \
    -n 128 -i "Once opon a time"
```

Result:

```
$ llama-deck run_img llama2.java_mukel /home/bufan/LlamaDeckResources/llamaModels/stories15M.bin -n 128 -i "Once opon a time"
==> Selected run arguments:
    image_tag: llama2.java_mukel
    model_path: /home/bufan/LlamaDeckResources/llamaModels/stories15M.bin
    steps: 128
    input_prompt: Once opon a time
Running llama2.java_mukel...
################## stdout from llama2.java_mukel ####################
==> Supported args: tokenizer prompt chat_init_prompt mode temperature step top-p seed
==> Set args: prompt = 'Once opon a time' step = 128
==> RUN COMMAND: java --enable-preview --add-modules=jdk.incubator.vector Llama2 /models/model.bin -i 'Once opon a time' -n 128
WARNING: Using incubator modules: jdk.incubator.vector
Config{dim=288, hidden_dim=768, n_layers=6, n_heads=6, n_kv_heads=6, vocab_size=32000, seq_len=256, shared_weights=true, head_size=48}
Once opon a time, there was a boy. He liked to throw a ball. Every day, he would go outside and throw the ball with his friends. One day, the boy saw something funny. He saw a penny, made of copper and was very happy. He liked the penny so much that he wanted to throw it again. He threw the penny and tried to make it go even higher. But, the penny was too lazy to go higher. So, the boy went back to the penny and tried again. He threw it as far as he could. But this time
achieved tok/s: 405.750799
#####################################################################
All images finished.
```
IMPORTANT! Please always give an absolute path when inputting <model_path>: since llama model files are usually large, instead of copying the model into each container (which would increase I/O and memory cost), llama-deck mounts the model into each running container (image), and an absolute path is needed for the mount when starting an image.
### More about passing inference arguments
The inference args supported by llama-deck are the same as in llama2.c:

```
-t <float>  temperature in [0,inf], default 1.0
-p <float>  p value in top-p (nucleus) sampling in [0,1], default 0.9
-s <int>    random seed, default time(NULL)
-n <int>    number of steps to run for, default 256. 0 = max_seq_len
-i <string> input prompt
-z <string> optional path to custom tokenizer (not implemented yet)
-m <string> mode: generate|chat, default: generate
-y <string> (optional) system prompt in chat mode
```
Note that not all implementations support all of these args from llama2.c, and due to the nature of the different implementations, different ways/formats are used to pass them.
So for each selected image to run, llama-deck automatically detects its supported args and drops the unsupported ones. It then converts the args you set into the correct format, puts them in the correct position (in a command to run the implementation), and finally passes them to the implementation inside the image. This operation is done inside each running container.
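For example, a single-command run combining several of these args might look like the following (the image tag is taken from the Available Images table below; the model path is illustrative and must be absolute):

```shell
# temperature 0.8, top-p 0.9, 256 steps, with a prompt;
# unsupported args are dropped automatically per implementation
llama-deck run_img llama2.c_karpathy /home/user/llamaModels/stories15M.bin \
    -t 0.8 -p 0.9 -n 256 -i "Once upon a time"
```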
### Available Images
| # | Tag | Size | Author | Repository |
|---|---|---|---|---|
| 1 | llama2.zig_cgbur | 259.0 MB | @cgbur | https://github.com/cgbur/llama2.zig |
| 2 | llama2.cs_trrahul | 374.0 MB | @trrahul | https://github.com/trrahul/llama2.cs |
| 3 | llama2.py_tairov | 57.0 MB | @tairov | https://github.com/tairov/llama2.py |
| 4 | llama2.rs_gaxler | 331.0 MB | @gaxler | https://github.com/gaxler/llama2.rs |
| 5 | llama2.c_karpathy | 139.0 MB | @karpathy | https://github.com/karpathy/llama2.c |
| 6 | llama2.java_mukel | 178.0 MB | @mukel | https://github.com/mukel/llama2.java |
| 7 | go-llama2_tmc | 133.0 MB | @tmc | https://github.com/tmc/go-llama2 |
| 8 | llama2.cpp_leloykun | 169.0 MB | @leloykun | https://github.com/leloykun/llama2.cpp |
More dockerized implementations will be added.
## TODO List
- Implement a customized tokenizer when running images.
- Extend more models and build more images.
- Try multi-threading when running images?
## License
See the LICENSE file for details.
