Mesh LLM lets you pool spare GPU capacity across machines and expose the result as one OpenAI-compatible API.
If a model fits on one machine, it runs there. If it does not, Mesh LLM automatically spreads the work across the mesh:
- Dense models use pipeline parallelism.
- MoE models use expert sharding with zero cross-node inference traffic.
- Models collaborate during inference — a text-only model consults a vision peer, an uncertain model gets a second opinion from a different architecture.
- Every node gets the same local API at http://localhost:9337/v1.
Why people use it
- Run models larger than a single machine can hold.
- Turn a few uneven boxes into one shared inference pool.
- Give agents a local OpenAI-compatible endpoint instead of wiring each tool by hand.
- Keep the setup simple: start one node, add more later.
Quick start
Install the latest release:
curl -fsSL https://raw.githubusercontent.com/Mesh-LLM/mesh-llm/main/install.sh | bash

Inspect local GPU identity:

mesh-llm gpus

Then start a node:

mesh-llm serve --auto
That command:
- picks a suitable bundled backend for your machine
- downloads a model if needed
- joins the best public mesh
- exposes an OpenAI-compatible API at http://localhost:9337/v1
- starts the web console at http://localhost:3131
Use --headless to disable the embedded web console while keeping the management API (/api/*) available on the --console port. This is useful for headless server deployments where the UI is not needed.
Check what is available:
curl -s http://localhost:9337/v1/models | jq '.data[].id'
Send a request:
curl http://localhost:9337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-4.7-Flash-Q4_K_M","messages":[{"role":"user","content":"hello"}]}'
Common workflows
1. Try the public mesh
This is the easiest way to see the system working end to end.
2. Start a private mesh
mesh-llm serve --model Qwen2.5-32B
This starts serving a model, opens the local API and console, and prints an invite token for other machines.
3. Build from source
git clone https://github.com/Mesh-LLM/mesh-llm
cd mesh-llm
just build

Requires: just, cmake, Rust toolchain, Node.js 24 + npm. NVIDIA GPU builds need nvcc (CUDA toolkit). AMD GPU builds need ROCm/HIP. Vulkan GPU builds need the Vulkan development files plus glslc. CPU-only and Jetson/Tegra also work. For source builds, just build auto-detects CUDA vs ROCm vs Vulkan on Linux, or you can force backend=rocm or backend=vulkan. See CONTRIBUTING.md for details.
Windows source builds are also supported for cuda, rocm/hip, vulkan, and cpu via just build. Metal remains macOS-only.

Tagged stable GitHub releases publish macOS bundles plus Linux CPU, Linux ARM64 CPU, Linux CUDA, Linux ROCm, and Linux Vulkan bundles. Prereleases use the same workflow and can optionally skip the Linux CUDA, Linux ROCm, and Linux Vulkan bundles. The Linux ARM64 CPU artifact is mesh-llm-aarch64-unknown-linux-gnu.tar.gz. In install and release contexts, arm64 and aarch64 mean the same 64-bit ARM target, and generic 32-bit ARM is not a published release target.

Windows publish jobs are currently commented out in .github/workflows/release.yml, but you can still generate the matching local Windows artifacts with just release-build-windows, just release-build-cuda-windows, just release-build-rocm-windows, just release-build-vulkan-windows, and the matching release-bundle-*-windows recipes.
Run
Once installed, you can run:
mesh-llm serve --auto   # join the best public mesh, start serving

That's it. Downloads a model for your hardware, connects to other nodes, and gives you an OpenAI-compatible API at http://localhost:9337.
Or start your own:
mesh-llm serve --model Qwen2.5-32B   # downloads model (~20GB), starts API + web console
mesh-llm serve --model Qwen2.5-3B    # or a small model first (~2GB)
Add another machine:
mesh-llm serve --join <token> # token printed by the first machine
Or discover and join public meshes:
mesh-llm serve --auto    # find and join the best mesh
mesh-llm client --auto   # join as API-only client (no GPU)
Output
mesh-llm has two terminal output modes:
- --log-format pretty renders human-readable output. In serve on an interactive TTY, this becomes the full dashboard; otherwise it falls back to line-oriented pretty output.
- --log-format json writes newline-delimited JSON records to stdout, which keeps it safe for jq, log shippers, and shell pipelines.
JSON mode example:
{"timestamp":"...","level":"info","event":"llama_ready","model":"Qwen3-32B","port":8001,"ctx_size":8192,"message":"Qwen3-32B ready on internal port 8001"}
{"timestamp":"...","level":"info","event":"model_ready","model":"Qwen3-32B","port":38373,"internal_port":38373,"role":"host","message":"model Qwen3-32B ready on port 38373"}
{"timestamp":"...","level":"info","event":"ready","api_url":"http://localhost:9337","console_url":"http://localhost:3131","api_port":9337,"console_port":3131,"models_count":2,"message":"mesh-llm runtime ready"}

Line-oriented pretty sessions accept these commands after startup is ready:
- h shows help
- i prints the current mesh status snapshot
- q quits cleanly
For the full event taxonomy and field reference, see crates/mesh-llm/src/cli/output/EVENTS.md.
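As a sketch of consuming the JSON output mode, the records above can be parsed line by line; the field names come from the sample records, while the helper function here is purely illustrative:

```python
import json

# Two sample records in the shape shown above.
ndjson = "\n".join([
    '{"timestamp":"...","level":"info","event":"llama_ready","model":"Qwen3-32B","port":8001,"ctx_size":8192,"message":"Qwen3-32B ready on internal port 8001"}',
    '{"timestamp":"...","level":"info","event":"ready","api_url":"http://localhost:9337","console_url":"http://localhost:3131","api_port":9337,"console_port":3131,"models_count":2,"message":"mesh-llm runtime ready"}',
])

def events(stream: str):
    """Yield (event, record) pairs from newline-delimited JSON."""
    for line in stream.splitlines():
        if line.strip():
            rec = json.loads(line)
            yield rec.get("event"), rec

by_event = {e: r for e, r in events(ndjson)}
print(by_event["ready"]["api_url"])  # the runtime-ready record carries the API URL
```

The same one-record-per-line shape is what makes the output safe for jq and log shippers.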
How it works
Every node gets an OpenAI-compatible API at http://localhost:9337/v1. Distribution is automatic — you just say mesh-llm serve --model X and the mesh figures out the best strategy:
- Model fits on one machine? → runs solo, full speed, no network overhead
- Dense model too big? → pipeline parallelism — layers split across nodes
- MoE model too big? → expert parallelism — experts split across nodes, zero cross-node traffic
If a node has enough VRAM, it always runs the full model. Splitting only happens when it has to. Mesh LLM currently uses upstream llama.cpp with a pinned Mesh-LLM patch queue; see docs/design/LLAMA_CPP_FORK.md.
Pipeline parallelism — for dense models that don't fit on one machine, layers are distributed across nodes proportional to VRAM. llama-server runs on the highest-VRAM node and coordinates via RPC. Each rpc-server loads only its assigned layers from local disk. Latency-aware: peers are selected by lowest RTT first, with an 80ms hard cap — high-latency nodes stay in the mesh as API clients but don't participate in splits.
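The layer-assignment policy above can be sketched in a few lines. The 80ms cap, lowest-RTT-first ordering, and VRAM-proportional split come from the text; the function, node tuples, and rounding rule are illustrative assumptions:

```python
RTT_CAP_MS = 80  # peers above this stay in the mesh as API clients only

def plan_pipeline(total_layers, nodes):
    """nodes: list of (name, vram_gb, rtt_ms). Returns {name: layer_count}."""
    eligible = [n for n in nodes if n[2] <= RTT_CAP_MS]
    eligible.sort(key=lambda n: n[2])  # lowest RTT first
    vram_total = sum(n[1] for n in eligible)
    plan, assigned = {}, 0
    for name, vram, _ in eligible:
        share = round(total_layers * vram / vram_total)  # proportional to VRAM
        plan[name] = share
        assigned += share
    # hand any rounding remainder to the highest-VRAM node,
    # which per the text also hosts llama-server
    host = max(eligible, key=lambda n: n[1])[0]
    plan[host] += total_layers - assigned
    return plan

print(plan_pipeline(60, [("a", 24, 5), ("b", 12, 20), ("c", 8, 120)]))
# node "c" is excluded by the RTT cap; "a" and "b" split layers 2:1
```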
MoE expert parallelism — Mixture-of-Experts models (Qwen3-MoE, GLM, OLMoE, Mixtral, DeepSeek — increasingly the best-performing architectures) are auto-detected from the GGUF header. The mesh reads expert routing statistics to identify which experts matter most, then assigns each node an overlapping shard: a shared core of critical experts replicated everywhere, plus unique experts distributed across nodes. Each node gets a standalone GGUF with the full trunk + its expert subset and runs its own independent llama-server — zero cross-node traffic during inference. Sessions are hash-routed to nodes for KV cache locality.
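A minimal sketch of the overlapping-shard idea: a shared core of critical experts replicated on every node, with the remaining experts spread across nodes. The expert IDs, core size, and round-robin distribution here are illustrative assumptions; the real planner uses routing statistics from the GGUF header:

```python
def plan_expert_shards(experts_by_importance, nodes, core_size):
    """experts_by_importance: expert IDs, most-used first.
    Every node gets the shared core; the rest are spread round-robin."""
    core = experts_by_importance[:core_size]  # replicated everywhere
    rest = experts_by_importance[core_size:]  # unique per node
    shards = {node: list(core) for node in nodes}
    for i, expert in enumerate(rest):
        shards[nodes[i % len(nodes)]].append(expert)
    return shards

shards = plan_expert_shards(list(range(8)), ["a", "b"], core_size=2)
print(shards)  # both nodes hold experts 0-1; experts 2..7 alternate
```

Because each node then runs a standalone GGUF with the full trunk plus its expert subset, no tensor ever crosses the network during inference.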
Multi-model — different nodes serve different models simultaneously. The API proxy peeks at the model field in each request and routes to the right node via QUIC tunnel. /v1/models lists everything available.
Demand-aware rebalancing — a unified demand map tracks which models the mesh wants (from --model flags, API requests, and gossip). Demand signals propagate infectiously across all nodes and decay naturally via TTL. Standby nodes auto-promote to serve unserved models with active demand, or rebalance when one model is significantly hotter than others. When a model loses its last server, standby nodes detect it within ~60s.
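The TTL-decay part of the demand map can be sketched as follows. The 60-second TTL mirrors the detection window mentioned above, but the class, field names, and clock handling are illustrative assumptions, not the real implementation:

```python
import time

class DemandMap:
    """Toy demand map: each signal refreshes a timestamp and expires via TTL."""
    def __init__(self, ttl_s=60.0):
        self.ttl_s = ttl_s
        self.last_seen = {}  # model -> timestamp of the latest demand signal

    def signal(self, model, now=None):
        """Record demand (from a --model flag, API request, or gossip)."""
        self.last_seen[model] = time.monotonic() if now is None else now

    def wanted(self, now=None):
        """Models with demand that has not yet decayed."""
        now = time.monotonic() if now is None else now
        return {m for m, t in self.last_seen.items() if now - t < self.ttl_s}

d = DemandMap(ttl_s=60)
d.signal("Qwen2.5-32B", now=0.0)
d.signal("GLM-4.7-Flash", now=30.0)
print(d.wanted(now=70.0))  # the older signal has decayed by now
```

A standby node would periodically compare wanted() against what the mesh currently serves and promote itself to cover the gap.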
Inter-model collaboration — models on the mesh help each other during inference. When a text-only model receives an image, it silently consults a vision model on the mesh for a caption and generates from that. When a small model is uncertain, it races two peers for a second opinion and injects the winner's answer as context. When a model gets stuck in a repetition loop, another model nudges it out. The caller sees one seamless response — they don't know multiple models collaborated. Inspired by Mixture of Models (NSED) — the mesh is the ensemble. See VIRTUAL_LLM.md.
Latency design — the key insight is that HTTP streaming is latency-tolerant while RPC is latency-multiplied. llama-server always runs on the same box as the GPU. The mesh tunnels HTTP, so cross-network latency only affects time-to-first-token, not per-token throughput. RPC only crosses the network for pipeline splits where the model physically doesn't fit on one machine.
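The arithmetic behind this insight is simple. With a hypothetical 50ms cross-network RTT and a 200-token response, tunnelled HTTP pays the RTT roughly once (at time-to-first-token), while a naive per-token RPC path would pay it on every round-trip:

```python
rtt_ms, tokens = 50, 200          # hypothetical link latency and response length

http_overhead = rtt_ms            # tunnelled HTTP: one RTT added to time-to-first-token
rpc_overhead = rtt_ms * tokens    # naive per-token RPC: RTT multiplied by token count

print(http_overhead, rpc_overhead)  # 50 vs 10000 ms of added latency
```

This is why llama-server always sits on the same box as the GPU, and RPC only crosses the network when a pipeline split is unavoidable.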
Network optimizations
- Zero-transfer GGUF loading — SET_TENSOR_GGUF tells rpc-server to read weights from local disk. Dropped model load from 111s → 5s.
- RPC round-trip reduction — cached get_alloc_size, skip GGUF lookups for intermediates. Per-token round-trips: 558 → 8.
- Direct server-to-server transfers — intermediate tensors pushed directly between rpc-servers via TCP, not relayed through the client.
- Speculative decoding — draft model runs locally on the host, proposes tokens verified in one batched forward pass. +38% throughput on code (75% acceptance).
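The speculative-decoding gain can be reasoned about with a standard back-of-the-envelope estimate: with a k-token draft and independent per-token acceptance probability p, each verification pass emits the accepted prefix plus one corrected token. The draft length below is a hypothetical choice, not a number from the measurements above:

```python
def expected_tokens_per_pass(k, p):
    """Expected tokens emitted per batched verification pass, assuming a
    k-token draft and independent per-token acceptance probability p."""
    return sum(p**i for i in range(k + 1))  # geometric series: (1 - p**(k+1)) / (1 - p)

# e.g. a 4-token draft at the 75% acceptance rate reported above:
print(round(expected_tokens_per_pass(4, 0.75), 2))  # ~3x tokens per target pass
```

The realized speedup is smaller than this ratio because the draft model and the batched verification pass are not free.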
Usage
Start a mesh
mesh-llm serve --model Qwen2.5-32B
Starts serving a model and prints an invite token. This mesh is private — only people you share the token with can join.
To make it public (discoverable by others via --auto):
mesh-llm serve --model Qwen2.5-32B --publish
Join a mesh
mesh-llm serve --join <token>    # join with invite token (GPU node)
mesh-llm client --join <token>   # join as API-only client (no GPU)
Named mesh (buddy mode)
mesh-llm serve --auto --model GLM-4.7-Flash-Q4_K_M --mesh-name "poker-night" --publish

Everyone runs the same command. The first person creates it, everyone else discovers "poker-night" and joins automatically. Use --publish to make your named mesh discoverable on Nostr; without it the mesh is private but still joinable via invite token.
Auto-discover
mesh-llm serve --auto    # discover, join, and serve a model
mesh-llm client --auto   # join as API-only client (no GPU)
mesh-llm discover        # browse available meshes
mesh-llm gpus            # inspect local GPUs and stable IDs
Inspect and clean the shared model cache
mesh-llm models installed
mesh-llm models cleanup --unused-since 30d
mesh-llm models cleanup --unused-since 30d --yes
models installed now shows whether a cached model is mesh-managed or external, plus the last time mesh-llm used it. models cleanup only removes model files that mesh-llm explicitly marked as mesh-managed; by default it prints a dry-run preview and requires --yes to delete anything.
Multi-model
mesh-llm serve --model Qwen2.5-32B --model GLM-4.7-Flash

# Route by model name
curl localhost:9337/v1/chat/completions -d '{"model":"GLM-4.7-Flash-Q4_K_M", ...}'
Different nodes serve different models. The API proxy routes by the model field.
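The proxy's routing decision can be sketched as: filter the serving nodes by the request's model field, then hash the session so repeat requests land on the same node (the KV cache locality mentioned earlier). The registry contents, hash choice, and function shape are illustrative assumptions:

```python
import hashlib

registry = {  # model -> nodes currently serving it (hypothetical topology)
    "Qwen2.5-32B": ["node-a"],
    "GLM-4.7-Flash-Q4_K_M": ["node-b", "node-c"],
}

def route(model, session_id):
    """Pick a serving node: filter by model, then hash the session ID
    so the same conversation keeps hitting the same node's KV cache."""
    nodes = registry[model]
    digest = hashlib.sha256(session_id.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

print(route("GLM-4.7-Flash-Q4_K_M", "session-42") ==
      route("GLM-4.7-Flash-Q4_K_M", "session-42"))  # deterministic per session
```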
Inspect local GPUs
mesh-llm gpus
mesh-llm gpus --json
mesh-llm gpu benchmark --json
mesh-llm gpus prints local GPU entries, backend device names, stable IDs, VRAM, unified-memory state, and cached bandwidth when a benchmark fingerprint is already available. Add --json for machine-readable inventory output, or run mesh-llm gpu benchmark --json to refresh the local fingerprint and print the benchmark result as JSON.
Use only pinnable Stable ID / stable_id values from mesh-llm gpus or mesh-llm gpus --json for pinned startup config. Stable-ID fallback values such as index:* or backend-device names like CUDA0 / HIP0 / MTL0 can still be printed for inventory purposes, but they are not valid pin targets.
Startup config
mesh-llm serve can now load startup models from ~/.mesh-llm/config.toml:
version = 1

[gpu]
assignment = "pinned"

[[models]]
model = "Qwen3-8B-Q4_K_M"
gpu_id = "pci:0000:65:00.0"

[[models]]
model = "bartowski/Qwen2.5-VL-7B-Instruct-GGUF/qwen2.5-vl-7b-instruct-q4_k_m.gguf"
mmproj = "bartowski/Qwen2.5-VL-7B-Instruct-GGUF/mmproj-f16.gguf"
ctx_size = 8192
gpu_id = "uuid:GPU-12345678"

[[plugin]]
name = "blackboard"
enabled = true
Start with the default config path:

mesh-llm serve
If no startup models are configured, mesh-llm serve prints a ⚠️ warning, shows help, and exits.
Or point at a different file:
mesh-llm serve --config /path/to/config.toml
Precedence rules:
- Explicit --model or --gguf ignores configured [[models]].
- Explicit --ctx-size overrides configured ctx_size for the selected startup models.
- Plugin entries still live in the same file.
Pinned startup notes:
- assignment = "pinned" requires every configured [[models]] entry to include a gpu_id.
- Valid gpu_id values come from the pinnable stable IDs reported by mesh-llm gpus / mesh-llm gpus --json, not fallback inventory IDs.
- Pinned configs fail closed when a configured ID is missing, ambiguous, unsupported on the local backend, or no longer resolves on the current machine.
- Explicit --model/--gguf still bypass configured [[models]], so they also bypass config-owned pinned gpu_id values.
No-arg behavior
mesh-llm   # no args — prints --help and exits

Does not start the console or bind any ports. Use the CLI flags shown in --help to start or join a mesh.
Background service
To install it as a per-user background service:
curl -fsSL https://raw.githubusercontent.com/Mesh-LLM/mesh-llm/main/install.sh | bash -s -- --service

Service installs are user-scoped:
- macOS installs a launchd agent at ~/Library/LaunchAgents/com.mesh-llm.mesh-llm.plist
- Linux installs a systemd --user unit at ~/.config/systemd/user/mesh-llm.service
- Shared environment config lives in ~/.config/mesh-llm/service.env
- Startup models live in ~/.mesh-llm/config.toml
The two platforms handle launch startup the same way:
- macOS: launchd runs ~/.config/mesh-llm/run-service.sh, which loads service.env and executes mesh-llm serve.
- Linux: the installer writes mesh-llm serve directly into ExecStart= in ~/.config/systemd/user/mesh-llm.service.
The background service no longer stores custom startup args. Configure startup models in ~/.mesh-llm/config.toml instead.
service.env is optional and shared by both platforms. Use plain KEY=value lines, for example:
MESH_LLM_NO_SELF_UPDATE=1
If you edit the Linux unit manually, reload and restart it:
systemctl --user daemon-reload
systemctl --user restart mesh-llm.service
On Linux this is a user service, so if you want it to keep running after a reboot and before you log in, enable lingering once:

sudo loginctl enable-linger "$USER"

Web console
mesh-llm serve --model Qwen2.5-32B   # dashboard at http://localhost:3131

Live topology, per-node GPU capacity, model picker, and built-in chat. Live members show only the Client, Standby, Loading, and Serving badges. Wakeable provider-backed capacity is shown separately from topology and stays out of routing until it rejoins. Everything comes from /api/status (JSON) and /api/events (SSE).
Multimodal Support
mesh-llm supports multimodal requests on:
- POST /v1/chat/completions
- POST /v1/responses
The console supports image, audio, and file attachments. Large attachments use request-scoped blob upload rather than permanent storage.
Current support matrix
| Family / model type | Vision | Audio | Notes |
|---|---|---|---|
| Qwen3-VL, Qwen3VL | yes | no | Example: Qwen3VL-2B-Instruct-Q4_K_M |
| Qwen2-VL, Qwen2.5-VL | yes | no | Vision-capable Qwen VL families |
| LLaVA, mllama, PaliGemma, Idefics, Molmo, InternVL, GLM-4V, Ovis, Florence | yes | no | Detected as vision-capable families |
| Qwen2-Audio | no | yes | Audio-capable family |
| SeaLLM-Audio | no | yes | Audio-capable family |
| Ultravox | no | yes | Audio-capable family |
| Omni | no or metadata-dependent | yes | Example: Qwen2.5-Omni-3B-Q4_K_M |
| Whisper | no | yes | Audio-capable family |
| Any GGUF with mmproj sidecar | yes | depends | Strong local signal for vision support |
| Any model with vision_config / vision token IDs | yes | depends | Promoted by metadata |
| Any model with audio_config / audio token IDs | depends | yes | Promoted by metadata |
| Generic multimodal, -vl, image, video, voice naming only | likely | likely | Hint only, not a strong routing guarantee |
Notes:
- yes means mesh-llm treats the model as runtime-capable for routing and UI.
- likely means mesh-llm shows a weaker hint but does not rely on it as a hard capability.
- Mixed image+audio requests work only when the selected model/runtime actually supports both modalities.
- Non-goals: POST /v1/audio/transcriptions, POST /v1/audio/speech, and v1/realtime.
For the full capability and transport details, see docs/design/MULTI_MODAL.md.
Development
Build-from-source and UI development instructions are in CONTRIBUTING.md.
Using with agents
mesh-llm exposes an OpenAI-compatible API on localhost:9337. Any tool that supports custom OpenAI endpoints works. /v1/models lists available models; the model field in requests routes to the right node.
For built-in launcher integrations (goose, claude, opencode, pi):
- Goose and Claude reuse a local mesh on --port and auto-start a local client if needed.
- OpenCode and pi target --host (default 127.0.0.1:9337) and only auto-start a local client for loopback/localhost targets.
- If --model is omitted, the launcher picks the strongest tool-capable model available on the mesh.
- When the harness exits (e.g. claude quits), the auto-started node is cleaned up automatically.
goose
Goose is available as both CLI (goose session) and desktop app (Goose.app).
Use a specific model (example: MiniMax):
mesh-llm goose --model MiniMax-M2.5-Q4_K_M
This command writes/updates ~/.config/goose/custom_providers/mesh.json and launches Goose.
opencode
OpenCode uses a temporary provider config injected by Mesh, so you don't need to edit local config files by hand. For the full advanced or manual setup, see docs/AGENTS.md.
Point OpenCode at a different mesh host or URL:
mesh-llm opencode --host https://mesh.example.com
Use a specific model (example: MiniMax):
mesh-llm opencode --host 127.0.0.1:9337 --model MiniMax-M2.5-Q4_K_M
Write or update a merged persistent OpenCode config:
mesh-llm opencode --write --host 127.0.0.1:9337
pi
Use a specific model:
mesh-llm pi --model MiniMax-M2.5-Q4_K_M
This writes every model from the mesh into ~/.pi/agent/models.json with model context sizes when available, then launches pi.
Write or update the Pi provider config without launching pi:

mesh-llm pi --write
Target a remote mesh host or URL, including a custom port:
mesh-llm pi --write --host carrack.patio51.com:9337
External OpenAI-compatible backends (vLLM, TGI, Ollama, Lemonade, etc.)
The openai-endpoint plugin routes inference to any server that speaks the OpenAI /v1/chat/completions API. The server does all the inference work — mesh-llm just discovers its models and routes requests to it.
Enable the plugin in ~/.mesh-llm/config.toml with the URL:
# vLLM
[[plugin]]
name = "openai-endpoint"
url = "http://gpu-box:8000/v1"

# Ollama
[[plugin]]
name = "openai-endpoint"
url = "http://localhost:11434/v1"

# Lemonade
[[plugin]]
name = "openai-endpoint"
url = "http://localhost:8000/api/v1"
The URL can also be set via MESH_LLM_OPENAI_ENDPOINT_URL env var (config takes precedence). Default: http://localhost:8000/v1. The plugin health-checks the backend by probing GET /v1/models — models appear and disappear automatically as the backend starts and stops.
To use an external backend without loading any llama.cpp models:
If you want the mesh to be discoverable via --auto, publish it:
mesh-llm serve --model Qwen2.5-32B --publish
3. Add another machine
mesh-llm serve --join <token>
Use mesh-llm client if the machine should join without serving a model:
mesh-llm client --join <token>
4. Create a named mesh for a group
mesh-llm serve --auto --model GLM-4.7-Flash-Q4_K_M --mesh-name "poker-night" --publish

Everyone runs the same command. The first node creates the mesh, the rest discover and join it automatically. Use --publish so your named mesh appears in Nostr discovery; without it you must share the invite token manually.
5. Serve more than one model
mesh-llm serve --model Qwen2.5-32B --model GLM-4.7-Flash
Requests are routed by the model field:
curl localhost:9337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-4.7-Flash-Q4_K_M","messages":[{"role":"user","content":"hello"}]}'
How it works
Mesh LLM keeps the user-facing surface simple: talk to localhost:9337, pick a model, and let the mesh decide how to serve it.
- If a model fits on one machine, it runs there with no network overhead.
- If a dense model does not fit, layers are split across low-latency peers.
- If an MoE model does not fit, experts are split across nodes and requests are hash-routed for cache locality.
- Different nodes can serve different models at the same time.
Each node also exposes a management API and web console on port 3131.
Install notes
The installer currently targets macOS and Linux release bundles. Windows support is coming soon.
To force a specific bundled flavor during install:
curl -fsSL https://raw.githubusercontent.com/Mesh-LLM/mesh-llm/main/install.sh | MESH_LLM_INSTALL_FLAVOR=vulkan bash

Installed release bundles use flavor-specific llama.cpp binaries:
- macOS: metal
- Linux: cpu, cuda, rocm, vulkan
- Linux ARM64 CPU: cpu (asset triple: aarch64-unknown-linux-gnu)
For release and install naming, arm64 and aarch64 both refer to the same 64-bit ARM target. Generic 32-bit ARM is not a published release target.
To update a bundle install to the latest release:

mesh-llm update
To install a specific bundled release tag:
mesh-llm update --version v0.X.Y
If you build from source, always use just:
git clone https://github.com/Mesh-LLM/mesh-llm
cd mesh-llm
just build

Requirements and backend-specific build notes are in CONTRIBUTING.md.
Web console
When a node is running, open:

http://localhost:3131
The console shows live topology with only Client, Standby, Loading, and Serving badges for live members, plus separate wakeable capacity, VRAM usage, loaded models, and built-in chat. Wakeable inventory is not part of topology peers or routing until it rejoins. It is backed by /api/status and /api/events.
To run without the embedded UI (for example, in a headless server environment), pass --headless:
mesh-llm serve --model Qwen2.5-3B --headless
In headless mode, the web console routes (/, /dashboard, /chat) return 404. The management API (/api/*) stays fully available on the --console port.
You can also try the hosted demo:
More docs
- docs/README.md for the docs map and topic directories
- docs/USAGE.md for service installs, model commands, storage, and runtime control
- docs/AGENTS.md for Goose, Claude Code, pi, OpenCode, curl, and blackboard usage
- docs/BENCHMARKS.md for benchmark numbers and context
- CONTRIBUTING.md for local development and build workflows
- docs/plugins/README.md for the plugin system and blackboard internals
- docs/moe/README.md for MoE ranking and placement planning
- docs/design/VIRTUAL_LLM.md for inter-model collaboration design
- docs/design/LLAMA_CPP_FORK.md for llama.cpp patch queue maintenance
- docs/design/LLAMA_STAGE_INTEGRATION_PLAN.md for the planned llama-stage-runtime integration
- crates/mesh-llm/README.md for Rust crate structure
- ROADMAP.md for future work
Community
Join the #mesh-llm channel on the Goose Discord for discussion and support.
