oMLX — LLM inference, optimized for your Mac


Local AI,
no more waiting on your Mac.

macOS-native MLX server with smart caching. Claude Code, OpenClaw, and Cursor respond in 5 seconds, not 90.

Apache 2.0  ·  Apple Silicon  ·  macOS 15+

[oMLX Dashboard screenshots — dark and light mode]

Headline metrics: tok/s prompt processing · throughput with batching · SSD KV cache (no eviction)

Qwen3.5-122B-A10B-4bit · M3 Ultra 512GB

Why oMLX

Built for the way
agents actually work.

Coding agents invalidate the KV cache dozens of times per session. oMLX persists every cache block to SSD — so when the agent circles back to a previous prefix, it's restored from disk in milliseconds, not recomputed from scratch.

01 — CORE

Paged SSD KV caching

Cache blocks are persisted to disk in safetensors format. Two-tier architecture: hot blocks stay in RAM, cold blocks go to SSD with LRU policy. Previously seen prefixes are restored across requests and server restarts — never recomputed.
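The two-tier design can be sketched as follows. This is an illustrative toy, not oMLX's actual classes: a hypothetical `PagedKVCache` keeps hot blocks in an in-memory LRU map and spills evictions to disk (pickle stands in here for the safetensors format oMLX actually uses):

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class PagedKVCache:
    """Toy two-tier KV cache: hot blocks in RAM (LRU), cold blocks on disk.

    Illustrative only -- oMLX persists real attention KV tensors in
    safetensors format; here a block is any picklable object.
    """

    def __init__(self, cache_dir, max_hot_blocks=2):
        self.cache_dir = cache_dir
        self.max_hot = max_hot_blocks
        self.hot = OrderedDict()  # block_hash -> block, in LRU order

    def _path(self, block_hash):
        return os.path.join(self.cache_dir, f"{block_hash}.bin")

    def put(self, block_hash, block):
        self.hot[block_hash] = block
        self.hot.move_to_end(block_hash)
        while len(self.hot) > self.max_hot:
            # Spill the least-recently-used block to SSD instead of
            # dropping it -- this is what makes prefixes recoverable.
            cold_hash, cold_block = self.hot.popitem(last=False)
            with open(self._path(cold_hash), "wb") as f:
                pickle.dump(cold_block, f)

    def get(self, block_hash):
        if block_hash in self.hot:
            self.hot.move_to_end(block_hash)
            return self.hot[block_hash]
        path = self._path(block_hash)
        if os.path.exists(path):  # restore from disk and re-warm
            with open(path, "rb") as f:
                block = pickle.load(f)
            self.put(block_hash, block)
            return block
        return None  # true miss: this prefix must be recomputed

cache = PagedKVCache(tempfile.mkdtemp(), max_hot_blocks=2)
cache.put("a", [1, 2])
cache.put("b", [3, 4])
cache.put("c", [5, 6])       # evicts block "a" to disk
print(cache.get("a"))        # → [1, 2], restored from disk, not recomputed
```

Because cold blocks live in ordinary files keyed by block hash, the same lookup works after a server restart — which is the property the real cache exploits.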

02 — THROUGHPUT

Continuous batching

Handles concurrent requests through mlx-lm's BatchGenerator. Up to 4.14× generation speedup at 8× concurrency. No more queuing behind a single request.
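The scheduling idea behind continuous batching can be shown with a toy decode loop — illustrative only; oMLX delegates the real work to mlx-lm's `BatchGenerator`:

```python
from collections import deque

def continuous_batch(requests, max_batch=8):
    """Toy continuous-batching decode loop.

    requests: list of (request_id, tokens_to_generate).
    Returns the order in which requests emit tokens.  The key
    difference from static batching: new requests are admitted the
    moment a slot frees up, instead of waiting for the whole batch.
    """
    waiting = deque(requests)
    active = {}   # request_id -> tokens remaining
    emitted = []
    while waiting or active:
        # Admit waiting requests into free batch slots.
        while waiting and len(active) < max_batch:
            rid, n = waiting.popleft()
            active[rid] = n
        # One decode step: every active request emits one token.
        for rid in list(active):
            emitted.append(rid)
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # finished -> slot frees immediately
    return emitted

order = continuous_batch([("A", 3), ("B", 1), ("C", 2)], max_batch=2)
print(order)  # → ['A', 'B', 'A', 'C', 'A', 'C']
```

Note how "C" starts decoding as soon as "B" finishes, while "A" is still in flight — no request waits for the batch to drain.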

03 — APP

Native macOS menu bar app

Start, stop, and monitor the server from your menu bar. Web dashboard for model management, chat, and real-time metrics. Signed, notarized, with in-app auto-update. Not Electron.

04 — MODELS

Multi-model serving

LLM, VLM, embedding, and reranker models loaded simultaneously. LRU eviction when memory runs low. Browse and download models directly from the admin dashboard.

05 — API

OpenAI + Anthropic drop-in

Compatible with Claude Code, OpenClaw, Cursor, and any OpenAI-compatible client. Native /v1/messages Anthropic endpoint. Web dashboard generates the exact config command for each tool.
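Pointing a client at oMLX is just a base-URL swap. A minimal request sketch using only the stdlib (the model name is a placeholder, and actually sending it assumes the server is running on localhost:8000):

```python
import json
import urllib.request

# Placeholder model name -- use any model loaded in your oMLX instance.
payload = {
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running:
# body = json.loads(urllib.request.urlopen(req).read())
# print(body["choices"][0]["message"]["content"])
```

Anthropic-style clients work the same way against `/v1/messages`; the dashboard's config generator emits the equivalent setup command for each tool.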

06 — TOOLS

Tool calling + MCP

Supports all major tool calling formats: JSON, Qwen, Gemma, GLM, MiniMax. MCP tool integration and tool result trimming for oversized outputs. Configurable per model.
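What a tool-call parser has to do can be sketched for the plain-JSON format. This is a minimal stand-in, not oMLX's parser, and it ignores the Qwen/Gemma/GLM/MiniMax-specific variants:

```python
import json

def parse_json_tool_call(text):
    """Extract a JSON-format tool call from raw model output.

    Minimal sketch: assumes the reply contains a single JSON object
    with a "name" key and optional "arguments".  Returns
    (name, arguments) or None if no valid call is found.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        call = json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return None
    if "name" not in call:
        return None
    return call["name"], call.get("arguments", {})

reply = 'Calling a tool: {"name": "get_weather", "arguments": {"city": "Tokyo"}}'
print(parse_json_tool_call(reply))  # → ('get_weather', {'city': 'Tokyo'})
```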


Performance

Real numbers,
real hardware.

All benchmarks were run on an M3 Ultra 512GB, covering single-request and continuous-batching performance across four popular models.

Context     Prompt TPS     Token TPS     Peak Mem
1k          768 tok/s      56.6 tok/s    65.5 GB
8k          941 tok/s      54.0 tok/s    69 GB
16k         886 tok/s      48.3 tok/s    71 GB
32k         765 tok/s      42.4 tok/s    73 GB

Continuous batching

pp1024 / tg128  ·  no cache reuse

Batch     Token TPS      Speedup
1         56.6 tok/s     1.00×
2         92.1 tok/s     1.63×
4         135.1 tok/s    2.39×
8         190.2 tok/s    3.36×
Context     Prompt TPS       Token TPS     Peak Mem
1k          1,462 tok/s      58.7 tok/s    80 GB
8k          2,009 tok/s      54.9 tok/s    83 GB
16k         1,896 tok/s      52.3 tok/s    83 GB
32k         1,624 tok/s      45.1 tok/s    85 GB

Continuous batching

pp1024 / tg128  ·  no cache reuse

Batch     Token TPS      Speedup
1         58.7 tok/s     1.00×
2         100.5 tok/s    1.71×
4         164.0 tok/s    2.79×
8         243.3 tok/s    4.14×
Context     Prompt TPS     Token TPS     Peak Mem
1k          588 tok/s      34.0 tok/s    227 GB
4k          704 tok/s      30.3 tok/s    228 GB
8k          663 tok/s      26.3 tok/s    229 GB
32k         426 tok/s      14.9 tok/s    235 GB

Continuous batching

pp1024 / tg128  ·  no cache reuse

Batch     Token TPS      Speedup
1         34.0 tok/s     1.00×
2         49.7 tok/s     1.46×
4         109.8 tok/s    3.23×
8         126.3 tok/s    3.71×
Context     Prompt TPS     Token TPS     Peak Mem
1k          187 tok/s      16.7 tok/s    392 GB
4k          180 tok/s      13.7 tok/s    394 GB
16k         117 tok/s      12.0 tok/s    403 GB
32k         78 tok/s       10.7 tok/s    415 GB

Continuous batching

pp1024 / tg128  ·  no cache reuse

Batch     Token TPS     Speedup
1         16.7 tok/s    1.00×
2         23.7 tok/s    1.42×
4         47.0 tok/s    2.81×
8         60.3 tok/s    3.61×

"The Qwen3.5 models running on oMLX is so fast that it makes running local AI on Mac worthwhile. It is so much faster than LMStudio and the tool calling is so much more reliable."

— GitHub comment, issue #62


FAQ

Common questions.

How is oMLX different from Ollama and LM Studio?

Ollama and LM Studio cache the KV state in memory, but when the context shifts mid-session — which happens constantly with coding agents — the entire cache gets invalidated and recomputed from scratch. oMLX persists every KV cache block to SSD, so previously cached portions are always recoverable. TTFT drops from 30–90 seconds to under 5 seconds on long contexts.
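The TTFT gap follows directly from the prompt-processing numbers. A back-of-envelope calculation, using the 32k-context figure (765 tok/s) from the benchmark tables above:

```python
# Recomputing a 32k-token prefix from scratch at ~765 tok/s prefill:
prompt_tokens = 32_000
prompt_tps = 765
recompute_seconds = prompt_tokens / prompt_tps
print(f"{recompute_seconds:.0f} s")  # prints "42 s" of prefill before the first token

# Restoring the same prefix from SSD is bounded by disk bandwidth,
# not compute -- milliseconds per block rather than a full prefill pass.
```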

What hardware do I need?

Apple Silicon (M1 or later) running macOS 15+. 16GB RAM is the minimum; 64GB+ is recommended for comfortable use with larger models. The sweet spot for daily coding work is an M-series Pro or Max with 64GB or more.

Does it work with Claude Code, OpenClaw, and Cursor?

Yes. oMLX provides both OpenAI-compatible (/v1/chat/completions) and Anthropic-compatible (/v1/messages) API endpoints, and works as a drop-in backend for all three. The web dashboard has a one-click config generator — select your model, copy the command, paste it into your terminal.

Do I need to re-download my models?

No. oMLX reuses your existing LM Studio model directory — just point it at your models folder. You can also browse and download models directly from the built-in HuggingFace downloader in the admin dashboard.

Which models are supported?

Any MLX-format model from HuggingFace. This includes Qwen, LLaMA, Mistral, Gemma, DeepSeek, MiniMax, GLM, and more. Reasoning models (DeepSeek, MiniMax, Qwen) get automatic <think> tag handling. Vision-language models are supported since v0.2.0, with the same paged SSD caching.


Get started

Up and running
in two minutes.

Download the DMG or install from source. Reuses your existing LM Studio model directory — no re-download needed.

macOS App Recommended

Drag to Applications. The welcome screen walks you through model directory, server start, and first model download. Signed and notarized.

Download DMG

From source

Requires Python 3.10+ and Apple Silicon. Serves any OpenAI-compatible client at localhost:8000.


git clone https://github.com/jundot/omlx
cd omlx && pip install -e .


omlx serve --model-dir ~/models