GitHub - dakshjain-1616/gguf-serve: Zero-friction local server that loads any GGUF model behind an OpenAI-compatible REST API. No venv, no cloud. Wraps llama.cpp with a thin FastAPI surface supporting streaming and function calling.

Autonomously built using NEO — your Autonomous AI Agent | Get the VS Code Extension

Serve any GGUF model as a local OpenAI-compatible REST API. No Python environment hassle, no venv, no cloud — just run.

What is gguf-serve?

gguf-serve is a zero-friction local LLM server that loads any GGUF-format quantized model and exposes it as a fully OpenAI-compatible REST API on localhost:8000. Drop it in as a local replacement for the OpenAI API — your existing code just works.

Built on top of llama-cpp-python and Flask, it supports GPU acceleration, rate limiting, request logging, embeddings, tokenization, and mock mode for testing — all configurable via environment variables.

Architecture Infographic

┌─────────────────────────────────────────────────────────────────────┐
│                          gguf-serve                                 │
│                  Local OpenAI-Compatible LLM Server                 │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
            ┌──────────────────▼──────────────────┐
            │           Flask REST API             │
            │         localhost:8000               │
            └──────────────────┬──────────────────┘
                               │
     ┌─────────────────────────┼─────────────────────────┐
     │                         │                         │
     ▼                         ▼                         ▼
┌─────────┐             ┌─────────────┐           ┌──────────┐
│  /v1/   │             │    /v1/     │           │  /v1/    │
│  chat/  │             │ completions │           │embeddings│
│completns│             │             │           │/tokenize │
└────┬────┘             └──────┬──────┘           └────┬─────┘
     │                         │                       │
     └─────────────────────────┼───────────────────────┘
                               │
            ┌──────────────────▼──────────────────┐
            │         llama-cpp-python             │
            │      Model Inference Engine          │
            └──────────────────┬──────────────────┘
                               │
            ┌──────────────────▼──────────────────┐
            │            GGUF Model File           │
            │   (quantized LLM — any size/arch)   │
            └─────────────────────────────────────┘

  Configuration Layer (env vars / .env)
  ┌──────────┬────────────┬───────────┬────────────┐
  │ n_ctx    │ n_threads  │ gpu_layers│ rate_limit │
  │ 2048 tok │ 4 (default)│ 0=CPU only│ per-IP     │
  └──────────┴────────────┴───────────┴────────────┘

  Observability
  ┌──────────────────────────────────────────────┐
  │  /health  │  /v1/models  │  /v1/stats        │
  │  uptime   │  model list  │  latency/metrics  │
  └──────────────────────────────────────────────┘

API Endpoints

Endpoint	Method	Description
`/health`	GET	Server health, uptime, status
`/v1/models`	GET	List loaded models (OpenAI format)
`/v1/stats`	GET	Request count, latency, metrics
`/v1/chat/completions`	POST	Multi-turn chat (drop-in OpenAI replacement)
`/v1/completions`	POST	Raw text completion
`/v1/embeddings`	POST	Generate vector embeddings
`/v1/tokenize`	POST	Tokenize text, return token count

Quickstart

1. Install dependencies

pip install -r requirements.txt

2. Run with a GGUF model

GGUF_MODEL_PATH=/path/to/model.gguf python gguf_serve.py

3. Try mock mode (no model file needed)

GGUF_MOCK_MODE=1 python gguf_serve.py

4. Send a chat request

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

Configuration

All settings via environment variables (or .env file — copy .env.example):

Variable	Default	Description
`GGUF_MODEL_PATH`	—	Path to your `.gguf` model file (required)
`GGUF_HOST`	`127.0.0.1`	Bind address
`GGUF_PORT`	`8000`	Port
`GGUF_N_CTX`	`2048`	Context window (tokens)
`GGUF_N_THREADS`	`4`	CPU threads for inference
`GGUF_N_GPU_LAYERS`	`0`	GPU layers to offload (`0` = CPU only)
`GGUF_MAX_TOKENS`	`256`	Default max tokens per response
`GGUF_TEMPERATURE`	`0.7`	Sampling temperature
`GGUF_TOP_P`	`0.95`	Nucleus sampling
`GGUF_TOP_K`	`40`	Top-k sampling
`GGUF_SEED`	`-1`	Seed (`-1` = random)
`GGUF_MOCK_MODE`	`0`	`1` = mock mode, no model needed
`GGUF_RATE_LIMIT`	`0`	Max requests/min per IP (`0` = unlimited)
`GGUF_LOG_REQUESTS`	`0`	`1` = log every request
`GGUF_LOG_LEVEL`	`INFO`	`DEBUG` / `INFO` / `WARNING` / `ERROR`

GPU Acceleration

GGUF_MODEL_PATH=/path/to/model.gguf \
GGUF_N_GPU_LAYERS=32 \
python gguf_serve.py

Increase GGUF_N_GPU_LAYERS to offload more layers to GPU (requires a compatible llama-cpp-python build).

Examples

Progressive examples are in the examples/ directory:

File	What it covers
`01_quick_start.py`	Minimal setup, health check, first chat
`02_advanced_usage.py`	Streaming, sampling params, embeddings
`03_custom_config.py`	Custom env config patterns
`04_full_pipeline.py`	End-to-end workflow with metrics

Testing

# Start server in mock mode
GGUF_MOCK_MODE=1 python gguf_serve.py &

# Run smoke tests
python smoke_test.py

Smoke tests cover: health, completions, chat, embeddings, tokenization, parameter validation, and stats accuracy.

Stack

Flask — REST API framework
llama-cpp-python — GGUF inference engine
Rich — Terminal output formatting
python-dotenv — Env config loading

License

MIT

Autonomously built using NEO — your Autonomous AI Agent | Get the VS Code Extension