alki: An open-source toolchain for deploying LLMs at the edge


A toolchain for deploying and managing LLMs at the edge.

Alki takes a Hugging Face model, converts it to GGUF format, applies quantization, and produces production-ready deployment bundles that run efficiently on edge devices. Supports single-device deployments and fleet-scale orchestration.

✨ Goals

  • Simple: One command from HuggingFace to optimized GGUF bundle.
  • Portable: CPU/GPU support via llama.cpp runtime with broad hardware compatibility.
  • Production-ready: Containers, systemd units, and deployment manifests included.
  • Fleet Intelligence: Orchestrate deployments across hundreds of edge devices with A/B testing.
  • Edge-Native: Real-time monitoring and optimization without cloud dependency.

🚀 Quickstart

Get started immediately (works with base install):

# Validate pre-converted GGUF models with benchmarking
alki validate "Qwen/Qwen3-0.6B-GGUF" --filename "*Q8_0.gguf" --benchmark

For HuggingFace → GGUF conversion (requires conversion dependencies):

# Install conversion dependencies (~2GB download for PyTorch)
pip install -e .[convert]

# Convert HF model to GGUF and create deployment bundle
alki pack "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --quantize Q8_0 --name tinyllama-chat

# Build and test container image
alki image build ./dist/tinyllama-chat --tag tinyllama:latest
alki image test tinyllama:latest

# Publish for fleet deployment
alki publish ./dist/tinyllama-chat --registry myregistry.com/ai --tag v1.0

Note: First-time usage downloads llama.cpp conversion tools (~150MB). Conversion dependencies include PyTorch and are optional for pre-converted GGUF models.

📍 Status & Roadmap

✅ Available Now:

  • Direct HF → GGUF conversion (Q8_0 quantization)
  • Pre-converted GGUF model support (all quantization profiles)
  • NVIDIA GPU acceleration (--ngl parameter for layer offloading)
  • Performance benchmarking (tokens/sec, memory usage)
  • Production bundles with manifests, SBOMs, and deployment configs
  • Container images with llama-server runtime (CPU and CUDA variants)
  • Multi-platform deployment (Docker, K8s, systemd)
  • CLI: validate, pack, image, publish commands

🚧 Phase 1 (In Progress):

  • Advanced quantization (Q4_K_M, Q5_K_M)
  • Hardware optimization profiles
  • End-to-end validation pipeline

🚀 Phase 2 (Planned):

  • Multi-runtime backends (Ollama, MLC-LLM, ONNX)
  • Multi-modal model support
  • Hardware-specific optimization

See ROADMAP.md for complete development plan.

🚀 Deployment

  • Embedded Images (simple): alki image build creates containers with models baked in
  • Bundle Registry (fleet): alki publish enables efficient updates and A/B testing across devices
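
A minimal sketch of each path, reusing commands from the Quickstart (bundle paths, tags, and the registry URL are placeholders):

# Embedded image: bake the model into a container and run it directly
alki image build ./dist/tinyllama-chat --tag tinyllama:latest
docker run -p 8080:8080 tinyllama:latest

# Bundle registry: publish once, then roll out updates across the fleet
alki publish ./dist/tinyllama-chat --registry myregistry.com/ai --tag v1.0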

🛠️ Installation

python -m venv .venv
source .venv/bin/activate

# Minimal install (validation, pre-converted GGUF support)
make install

# OR full install (includes HuggingFace → GGUF conversion)
make install-all

🚀 GPU Acceleration

Alki supports NVIDIA GPU acceleration to speed up inference significantly.

Prerequisites

  1. Verify host drivers: ensure your NVIDIA drivers are installed and working on the host by running nvidia-smi.
  2. Verify Docker version: the --gpus flag requires Docker Engine 19.03 or later.
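
Both checks can be run from a shell; the CUDA base image tag below is only one example of a GPU-enabled test image:

# 1. Host drivers: should print the driver version and a GPU table
nvidia-smi

# 2. Docker GPU access: runs nvidia-smi inside a CUDA container
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi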

Creating GPU-Enabled Bundles

# Create bundle with GPU layers (recommended: 16-24 for RTX 4060)
alki pack "Qwen/Qwen3-0.6B-GGUF" --filename "*Q8_0.gguf" --ngl 20

# Build CUDA-enabled container image
alki image build ./dist/qwen3-0-6b-gguf --tag qwen3:gpu --base cuda

# Run with GPU support
docker run --gpus all -p 8080:8080 qwen3:gpu

GPU Layer Recommendations

GPU Model         1B Models      3B Models      7B Models
RTX 4060 (8GB)    20-28 layers   12-20 layers   8-16 layers
RTX 4070 (12GB)   32+ layers     24-32 layers   16-24 layers
RTX 4080 (16GB)   32+ layers     32+ layers     24-32 layers

Monitor GPU usage while serving with nvidia-smi to tune the --ngl value for your hardware.

🎮 Runtime Inference

Once deployed, your models serve an OpenAI-compatible API via llama-server:

# Test with curl
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyllama-chat",
    "messages": [{"role": "user", "content": "Tell me about Alki beach in Seattle, WA"}],
    "max_tokens": 100,
    "temperature": 0.8
  }'
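
To extract just the generated text, pipe the response through jq (assuming jq is installed):

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tinyllama-chat", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}' \
  | jq -r '.choices[0].message.content'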

# Or use any OpenAI client library
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="tinyllama-chat",
    messages=[{"role": "user", "content": "Tell me about Alki beach in Seattle, WA"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
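
The same endpoint supports OpenAI-style streaming; setting "stream": true makes llama-server return the response as server-sent events (a sketch; curl -N disables buffering so tokens appear as they arrive):

curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyllama-chat",
    "messages": [{"role": "user", "content": "Tell me about Alki beach in Seattle, WA"}],
    "stream": true
  }'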

Health Checks

# Check model status
curl http://localhost:8080/v1/models

# Health endpoint
curl http://localhost:8080/health
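
In deployment scripts it can be useful to block until the model has finished loading; a minimal readiness loop against the health endpoint above:

# Wait until llama-server reports healthy
until curl -sf http://localhost:8080/health > /dev/null; do
  sleep 1
done
echo "model ready"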

🤖 Supported Models

Alki supports models through two paths:

  • Pre-converted GGUF (all quantizations): Qwen, Llama, Mistral, Phi families
  • Direct HF → GGUF (Q8_0): Qwen, Llama, Mistral, Phi, Gemma, TinyLlama, StableLM

Quantization Profiles

Profile   Size          Quality    Use Case
Q4_K_M    ~75% smaller  Good       Edge deployment
Q5_K_M    ~65% smaller  Better     Balanced
Q8_0      ~50% smaller  Excellent  Development/testing

Architecture support depends on the llama.cpp version in use (b4481). To enable direct conversion, install the extra: pip install alki[convert]
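
Until direct Q4_K_M/Q5_K_M conversion lands (Phase 1), the smaller profiles remain usable through the pre-converted path, assuming the upstream repo publishes a matching file:

# Validate a pre-converted Q4_K_M model for edge deployment
alki validate "Qwen/Qwen3-0.6B-GGUF" --filename "*Q4_K_M.gguf" --benchmark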

📦 Bundle Structure

Bundles include models, metadata, and deployment configs:

dist/my-model/
  models/*.gguf              # Quantized GGUF models
  metadata/                  # Manifest, SBOM, checksums
  deploy/
    systemd/*.service        # systemd units
    k3s/*.yaml               # Kubernetes manifests
    docker/Dockerfile        # Container config

Deploy with Kubernetes: kubectl apply -f ./dist/my-model/deploy/k3s/
Deploy with systemd: cp deploy/systemd/*.service /etc/systemd/system/
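
For the systemd path, the copied unit still needs to be loaded and started; the unit name below is hypothetical and should match the file shipped in the bundle:

sudo systemctl daemon-reload
sudo systemctl enable --now my-model.service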

🛠️ Tech Stack

  • llama.cpp - Core runtime with broad CPU/GPU compatibility
  • Python 3.10+ - CLI, conversion pipeline, and bundle generation
  • Typer - Command-line interface
  • GGUF - Model format with efficient quantization
  • Docker - Container packaging and distribution
  • Pytest - Testing and validation harness

🔧 Development

# Setup
python -m venv .venv && source .venv/bin/activate
make install

# Run checks
make all     # Format, lint, and test
make check   # CI checks only

📜 License

Apache-2.0

Free to use, modify, and contribute.

🤝 Contributing

Alki is actively developed and contributions are welcome.

See CONTRIBUTING.md for development setup and guidelines.