A toolchain for deploying and managing LLMs at the edge.
Alki takes a Hugging Face model, converts it to GGUF format, applies quantization, and produces production-ready deployment bundles that run efficiently on edge devices. It supports both single-device deployments and fleet-scale orchestration.
## ✨ Goals
- Simple: One command from HuggingFace to optimized GGUF bundle.
- Portable: CPU/GPU support via llama.cpp runtime with broad hardware compatibility.
- Production-ready: Containers, systemd units, and deployment manifests included.
- Fleet Intelligence: Orchestrate deployments across hundreds of edge devices with A/B testing.
- Edge-Native: Real-time monitoring and optimization without cloud dependency.
## 🚀 Quickstart

Get started immediately (works with base install):

```bash
# Validate pre-converted GGUF models with benchmarking
alki validate "Qwen/Qwen3-0.6B-GGUF" --filename "*Q8_0.gguf" --benchmark
```
For HuggingFace → GGUF conversion (requires conversion dependencies):
```bash
# Install conversion dependencies (~2GB download for PyTorch)
pip install -e .[convert]

# Convert HF model to GGUF and create deployment bundle
alki pack "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --quantize Q8_0 --name tinyllama-chat

# Build and test container image
alki image build ./dist/tinyllama-chat --tag tinyllama:latest
alki image test tinyllama:latest

# Publish for fleet deployment
alki publish ./dist/tinyllama-chat --registry myregistry.com/ai --tag v1.0
```
Note: First-time usage downloads llama.cpp conversion tools (~150MB). Conversion dependencies include PyTorch and are optional for pre-converted GGUF models.
## 📍 Status & Roadmap

**✅ Available Now:**
- Direct HF → GGUF conversion (Q8_0 quantization)
- Pre-converted GGUF model support (all quantization profiles)
- NVIDIA GPU acceleration (`--ngl` parameter for layer offloading)
- Performance benchmarking (tokens/sec, memory usage)
- Production bundles with manifests, SBOMs, and deployment configs
- Container images with llama-server runtime (CPU and CUDA variants)
- Multi-platform deployment (Docker, K8s, systemd)
- CLI: `validate`, `pack`, `image`, `publish` commands (each supports `--help`; see the example below)
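
The CLI is built with Typer (see Tech Stack), which generates inline help for every command, so the full option list is always available from the terminal:

```bash
alki --help          # list available commands
alki pack --help     # flags for a specific command
```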
**🚧 Phase 1 (In Progress):**
- Advanced quantization (Q4_K_M, Q5_K_M)
- Hardware optimization profiles
- End-to-end validation pipeline
**🚀 Phase 2 (Planned):**
- Multi-runtime backends (Ollama, MLC-LLM, ONNX)
- Multi-modal model support
- Hardware-specific optimization
See ROADMAP.md for the complete development plan.
## 🚀 Deployment

- **Embedded Images (simple):** `alki image build` creates containers with the model baked in (see the run example below)
- **Bundle Registry (fleet):** `alki publish` enables efficient updates and A/B testing across devices
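
For the embedded-image path, the resulting container runs anywhere Docker runs; a minimal sketch, assuming the image built in the Quickstart and the llama-server port used throughout these docs:

```bash
# Start the embedded-model container (llama-server listens on 8080 in these examples)
docker run --rm -d -p 8080:8080 --name tinyllama tinyllama:latest

# Smoke-test the deployment
curl http://localhost:8080/health
```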
## 🛠️ Installation

```bash
python -m venv .venv
source .venv/bin/activate

# Minimal install (validation, pre-converted GGUF support)
make install

# OR full install (includes HuggingFace → GGUF conversion)
make install-all
```
## 🚀 GPU Acceleration

Alki supports NVIDIA GPU acceleration to speed up inference significantly.

### Prerequisites

- **Verify host drivers:** ensure your NVIDIA drivers are installed and functioning on the host by running `nvidia-smi`
- **Verify Docker version:** confirm you are running Docker Engine 19.03 or later, which introduced the `--gpus` flag
- **Install the NVIDIA Container Toolkit** so containers can access the GPU (a quick end-to-end check is sketched below)
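
A common end-to-end check (a general Docker/NVIDIA pattern, not an Alki command) is to run `nvidia-smi` inside a CUDA base container; the image tag below is illustrative:

```bash
# Should print the same GPU table as on the host if the toolkit is set up correctly
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```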
### Creating GPU-Enabled Bundles

```bash
# Create bundle with GPU layers (recommended: 16-24 for RTX 4060)
alki pack "Qwen/Qwen3-0.6B-GGUF" --filename "*Q8_0.gguf" --ngl 20

# Build CUDA-enabled container image
alki image build ./dist/qwen3-0-6b-gguf --tag qwen3:gpu --base cuda

# Run with GPU support
docker run --gpus all -p 8080:8080 qwen3:gpu
```
### GPU Layer Recommendations
| GPU Model | 1B Models | 3B Models | 7B Models |
|---|---|---|---|
| RTX 4060 (8GB) | 20-28 layers | 12-20 layers | 8-16 layers |
| RTX 4070 (12GB) | 32+ layers | 24-32 layers | 16-24 layers |
| RTX 4080 (16GB) | 32+ layers | 32+ layers | 24-32 layers |
Monitor GPU usage with `nvidia-smi` while tuning `--ngl` to confirm that layers are actually offloaded and that there is VRAM headroom.
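
For example, a live view of VRAM and utilization using standard `nvidia-smi` query flags:

```bash
# Refreshes every second; memory.used should jump when GPU layers are offloaded
watch -n 1 nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu --format=csv
```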
## 🎮 Runtime Inference
Once deployed, your models serve an OpenAI-compatible API via llama-server:
```bash
# Test with curl
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyllama-chat",
    "messages": [{"role": "user", "content": "Tell me about Alki beach in Seattle, WA"}],
    "max_tokens": 100,
    "temperature": 0.8
  }'
```

```python
# Or use any OpenAI client library
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="tinyllama-chat",
    messages=[{"role": "user", "content": "Tell me about Alki beach in Seattle, WA"}],
    max_tokens=100,
)
```
### Health Checks

```bash
# Check model status
curl http://localhost:8080/v1/models

# Health endpoint
curl http://localhost:8080/health
```
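
In deployment scripts, the health endpoint can double as a readiness gate, since llama-server typically reports an error status while the model is still loading; a minimal polling sketch:

```bash
# Wait until the server reports healthy before sending traffic
until curl -sf http://localhost:8080/health > /dev/null; do
  echo "waiting for model to load..."
  sleep 2
done
echo "server is ready"
```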
## 🤖 Supported Models

Alki supports models through two paths:

- **Pre-converted GGUF (all quantizations):** Qwen, Llama, Mistral, Phi families
- **Direct HF → GGUF (Q8_0):** Qwen, Llama, Mistral, Phi, Gemma, TinyLlama, StableLM
### Quantization Profiles

| Profile | Size (vs. FP16) | Quality | Use Case |
|---|---|---|---|
| Q4_K_M | ~75% smaller | Good | Edge deployment |
| Q5_K_M | ~65% smaller | Better | Balanced |
| Q8_0 | ~50% smaller | Excellent | Development/testing |
Architecture support depends on the llama.cpp version in use (b4481). Install conversion support with `pip install alki[convert]`.
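
As a back-of-envelope check for the table above (assuming typical llama.cpp figures of roughly 4.8 bits/weight for Q4_K_M, 5.7 for Q5_K_M, and 8.5 for Q8_0), expected file size scales with parameter count:

```bash
# Approximate GGUF size for a 1.1B-parameter model at Q4_K_M (~4.8 bits/weight)
awk 'BEGIN { params = 1.1e9; bits = 4.8; printf "~%.2f GB\n", params * bits / 8 / 1e9 }'
# => ~0.66 GB (FP16 weights would be ~2.2 GB)
```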
## 📦 Bundle Structure

Bundles include models, metadata, and deployment configs:

```
dist/my-model/
├── models/*.gguf          # Quantized GGUF models
├── metadata/              # Manifest, SBOM, checksums
└── deploy/
    ├── systemd/*.service  # systemd units
    ├── k3s/*.yaml         # Kubernetes manifests
    └── docker/Dockerfile  # Container config
```
Deploy with `kubectl apply -f ./dist/my-model/deploy/k3s/`, or copy the systemd units with `cp deploy/systemd/*.service /etc/systemd/system/` (a fuller systemd sequence is sketched below).
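
For the systemd path, a typical sequence looks like the following; the unit name is an assumption, chosen to match the bundle name above:

```bash
sudo cp ./dist/my-model/deploy/systemd/*.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now my-model.service   # unit name assumed to match the bundle
journalctl -u my-model.service -f              # follow the server logs
```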
## 🛠️ Tech Stack
- llama.cpp - Core runtime with broad CPU/GPU compatibility
- Python 3.10+ - CLI, conversion pipeline, and bundle generation
- Typer - Command-line interface
- GGUF - Model format with efficient quantization
- Docker - Container packaging and distribution
- Pytest - Testing and validation harness
## 🔧 Development

```bash
# Setup
python -m venv .venv && source .venv/bin/activate
make install

# Run checks
make all      # Format, lint, and test
make check    # CI checks only
```
## 📜 License
Apache-2.0
Free to use, modify, and contribute.
## 🤝 Contributing
Alki is actively developed and contributions are welcome.
See CONTRIBUTING.md for development setup and guidelines.