MTPLX 1.0.4
MTPLX 1.0.4
Same-day hotfix. 1.0.3 broke coding agents on their first tool turn, and pasting a GGUF repo into the app got a wrong answer. Both fixed.
Agents work again
Any tool-using client on 1.0.3, including Pi, Hermes, OpenCode, and anything speaking the OpenAI tools protocol, crashed with "unexpected keyword argument 'vision_splice'" on its first tool turn after a cache miss. Non-streaming callers got a 500. Streaming agents lost the stream mid-conversation, which the app reported as "Stream offline."
The cause was one stray argument left behind by the vision work, sitting in a diagnostics call that only agent tool turns reach, which is exactly the path the release tests never drove. It is removed, that path is now unit-tested directly, and a new audit test parses the server source and fails if any call ever passes an argument its target does not accept, so this class of mistake cannot ship quietly again. Reported within hours by two users (#99, #100). Thank you both.
GGUF repos get a straight answer
Qwopus3.6-27B-Coder-MTP-GGUF is trending, it has MTP in the name, and people paste it into "Add a model from Hugging Face." The app used to claim the repository did not exist. The truth is that GGUF is llama.cpp's format and MTPLX runs MLX models. The app now says exactly that, names the source repo the GGUF was made from, and points you at Forge to convert it. Genuine typos get "check the name" instead of a lie.
The repo check also follows your Hugging Face mirror setting now, so adding models works on networks where huggingface.co is blocked.
Downloads
- Mac app: mtplx.com/download
- All releases and checksums: mtplx.com/releases
- CLI:
brew install youssofal/mtplx/mtplx
Existing installs update themselves; the app refreshes its engine automatically after updating.
Homebrew 6 note
Homebrew 6.0 added tap trust, and right now it can fail third-party tap installs with a silent sandbox error even for trusted taps (Homebrew/brew#22631). If brew install youssofal/mtplx/mtplx dies that way, run it once as HOMEBREW_NO_REQUIRE_TAP_TRUST=1 brew install youssofal/mtplx/mtplx. That is a Homebrew regression, not an MTPLX one; upstream is on it.
MTPLX 1.0.3
The app can see. Attach an image and ask about it, with MTP speculative decoding still running on top. And the compatibility gate stops refusing models that run fine.
Downloads
- Mac app: mtplx.com/download (notarized DMG, Apple Silicon, macOS 14+)
- All releases and checksums: mtplx.com/releases
- CLI via Homebrew:
brew install youssofal/mtplx/mtplx - CLI via pip:
pip install mtplx
Existing installs update themselves; the app refreshes its engine automatically after updating.
Vision
Chat takes images now. Attach PNG, JPEG, or WebP from the paperclip, ask your question, and the model describes what it is looking at. The same works over the API with standard OpenAI image_url content parts, so any client that can send images to GPT-style endpoints can send them to MTPLX. Speculative decoding is untouched by all of this: on an M5 Max the 27B answers vision questions at the same MTP speeds as text.
This works on Qwen 3.6 27B in both Speed and Quality builds, Qwen 3.6 35B, and Qwen 3.5 9B. The 9B had shipped without its vision weights; the repo on Hugging Face now carries them again, and mtplx pull learned to sync repo updates like this into existing local copies instead of declaring them complete forever. Gemma vision needs its own tower implementation and is coming separately.
Two honest details: images skip the prompt cache by design, because two different pictures with identical text must never restore each other's state, so image turns always pay a fresh prefill. And /health now reports vision support per model, which is also how the app decides whether to offer image attachments.
The gate stops blocking runnable models
If you picked the Optimized Quality model from the CLI menu, MTPLX refused to load it and then crashed while printing the explanation (#98). Both halves of that are fixed. The crash was a one-line bug that had eaten every gate explanation since 1.0.0. The refusal was a policy mistake: the gate treated unverified runtime contracts, and even benchmark evidence saying a model was slow, as reasons not to load at all. Verification is now a label. Unverified models load with a one-line note saying so, and refusal is reserved for models that genuinely cannot execute. Trying weird model builds is the point of this thing; the runtime should tell you what it knows, then run.
Checksums (SHA-256)
cd753fdacea94e294eb834f955247389ae6ee49b33c00bc7fe7b39e11bd79a19 MTPLX-1.0.3.dmg
025f099d51c9b501c3f7f19e868091380176b3929887e58c28e82264cc3259c2 mtplx-1.0.3-py3-none-any.whl
9b0c7db41fc5eb7d88629f2e1b78da10bdd368de26283b840c0f58f7c5271494 mtplx-1.0.3.tar.gz
MTPLX 1.0.2
A bug-fix release. Two ways the engine could refuse to start are gone, parallel agents stop tripping over each other, and model downloads work behind networks that block Hugging Face.
Downloads
- Mac app: mtplx.com/download (notarized DMG, Apple Silicon, macOS 14+)
- All releases and checksums: mtplx.com/releases
- CLI via Homebrew:
brew install youssofal/mtplx/mtplx - CLI via pip:
pip install mtplx
Existing installs update themselves; the app refreshes its engine automatically after updating.
No more Degraded on every start
The Settings profile picker offered two options the engine never accepted, Auto and Sustained Max. Picking either saved a value that killed the engine at launch, so the app showed Degraded every time you pressed Play, with nothing pointing at the real cause. That class of bug is now closed from both sides: the engine accepts the old values (Auto means the default profile, Sustained Max means Sustained, and pinned fans stay a fan setting), existing configurations heal themselves when the app loads them, and the picker only offers profiles the engine actually runs. mtplx serve --profile auto works from the command line too.
Launches that lose their port also got smarter. If something took the port between the check and the start, or a listener the local probe cannot see is holding it, MTPLX now resolves the conflict and retries once on its own instead of failing with a raw port error.
Parallel agents stop colliding
Agent tools that fire many requests at once without session ids (LiteLLM batches, swival audits, anything that fans out) could fail with "session is already in flight" because requests sharing a prompt prefix resolved to the same session. A busy session now forks the request to a fresh session, so every parallel request completes; explicitly named sessions keep the strict behavior. Anonymous session ids are also random now instead of clock derived, which removes a rare collision under burst load.
Reported and fixed by Frank Denis (@jedisct1) in #95. First community patch in MTPLX, and a clean one.
Downloads work behind blocked networks
huggingface.co is blocked in mainland China, which made the guided download impossible there (#96). You can now set a download mirror, either right in the onboarding download step or later in Settings under Advanced, and MTPLX uses it for model downloads. Your Hugging Face token is never sent to a mirror, so gated repos stay on the official endpoint.
Checksums (SHA-256)
71c88729a7f6101365f2a7cebac01fe5f17a50b004930409b9e55512c96f788b MTPLX-1.0.2.dmg
16aedb9b01cc1a8a4219381e34f9a5209ff2afb8515546f245bb8dcd34ad5a7c mtplx-1.0.2-py3-none-any.whl
64b7909f5305d0c674b37c6b613e6f1057b2225534c6034978ad6064bb2cdfe3 mtplx-1.0.2.tar.gz
MTPLX 1.0.1
A bug-fix release, hours after 1.0.0, because two things deserved fixing immediately.
Tuning finishes on every Mac
Some Macs could not complete the first-run tuning step: when the fan helper could not verify a max-fan ramp (a fresh machine without the helper's sudo grant, for example), tuning refused to run and setup stopped. Pinned fans make timing cleaner, but they were never worth a dead setup wizard.
Tuning now runs with fans on automatic when pinning is unavailable, says so in the results, and still saves only a depth that actually beats the baseline. If you want strict pinned-fan measurements for benchmarking, mtplx tune --require-max-fans keeps the old behavior.
Gemma 4 works from the CLI
The official Gemma 4 repos are assistant-pair bundles, and the CLI's Hugging Face preflight did not recognize the layout, so mtplx start refused models the app ran fine. The CLI now reaches the same verdict as the app.
Downloads
- Mac app: mtplx.com/download
- All releases and checksums: mtplx.com/releases
- CLI:
brew install youssofal/mtplx/mtplx
Existing installs update themselves; the app refreshes its engine automatically after updating.
Checksums (SHA-256)
d6cffd30950ea9462bc328b1ab03e4bdea35d1135d07f8467ff6a9557bf9a153 MTPLX-1.0.1.dmg
f01c6a605bc221fd3489289f1266521f3a5df689539b340ff8515fd5241f2a1c mtplx-1.0.1-py3-none-any.whl
21fcc43ba641a168e595848c9239cd2620f4c3c7e755f6d1c480c86ae2f9c7f6 mtplx-1.0.1.tar.gz
MTPLX 1.0.0
Download
Mac app (recommended): mtplx.com/download. Drag to Applications; the app sets up everything else, updates itself, and puts mtplx in your terminal. All releases: mtplx.com/releases.
CLI only:
brew install youssofal/mtplx/mtplx
or python3 -m pip install mtplx. The DMG attached below is byte-identical to the one served at mtplx.com.
Requires Apple Silicon and macOS 14 or newer.
MTPLX 1.0.0 is the first full release: a native macOS app and the mtplx command line working as one product, built for Apple Silicon.
If you are new here: MTPLX runs local language models using their own built-in multi-token prediction heads as a speculative drafter, with exact rejection sampling. Same output distribution as normal decoding, measured 1.6x faster on a 16 GB M4 Mac mini and up to 2.24x on an M5 Max.
The Mac app
New in 1.0.0, and the reason this release exists. Download the DMG, drag it to Applications, and the app does the rest:
- Guided onboarding: checks your hardware, recommends a model that fits your memory, downloads it, installs its own Python engine and fan control, puts
mtplxon your PATH, and tunes decoding depth on your machine. - A live dashboard: decode speed gauge, acceptance rate by draft depth, the verify waterfall, context usage, cache state, and an activity feed that tells the truth about what the server is doing.
- Native chat with streaming, thinking cards, file attachments, web search, and LaTeX rendering.
- One-click launches for OpenCode, Pi, Hermes, and Open WebUI against your local server.
- A built-in AIME benchmark runner, so you can score a model yourself instead of trusting a chart.
- Automatic updates through Sparkle, with the engine refreshed after every update. No Homebrew required at any point: release builds bundle a pinned Python interpreter.
New models
The 0.3.7 engine ran one verified model. The 1.0.0 catalog covers a range of machines, in speed, balance, and quality builds:
- Gemma 4. Runs as an assistant pair, where the tuned control is the draft block size rather than depth. Long-context behavior was verified for no performance cliff.
- Qwen 3.6 MoE (35B-A3B). Mixture-of-experts support including prequantized expert sidecars, normalized expert layouts, and hard blocks on layouts that cannot run correctly.
- Qwen 3.5 (4B, 9B). Smaller machines get first-class models instead of a cut-down experience.
- Qwen 3.6 27B remains the flagship, now in speed, quality, and FP16 builds.
The catalog is shared by the app and the CLI, and the default is chosen for your machine: chip generation picks the precision, and machines under 32 GB route to the 9B model because the 27B default cannot load safely there.
KV cache reuse, in memory and on disk
Two layers, one goal: never pay for the same tokens twice.
- In RAM: warm-prefix reuse across turns and requests. Multi-turn chats and agent workloads like OpenCode hit the cache instead of re-processing the conversation, which is the difference between an agent that flows and one that stalls before every reply.
- On SSD: session state persists to disk with enforced size caps. Quit the server, restart your Mac, come back tomorrow: the session restores near-instantly instead of re-processing thousands of tokens.
Concurrency
1.0.0 adds continuous batching: the server can interleave multiple requests instead of serializing them. Batching presets, a scheduler mode, and explicit caps (--max-active-requests, --decode-batch-max, --batch-wait-ms) control the behavior. Agent workloads, which fire many short requests, benefit the most.
Smart fan mode
Fan control is no longer all-or-nothing. Smart mode ramps the fans when the model is working and restores them when it goes idle, works across the app, the CLI, and the server API, and survives handing a session from the app to a terminal client. The crash-safe watchdog from earlier releases still stands behind all of it: if MTPLX dies for any reason, your fans return to automatic.
A server built for agents
Most of the serving work this cycle came from running real coding agents against MTPLX and fixing everything that broke:
- OpenCode, Pi, and Hermes each have a hardened lane: correct tool contracts, trimmed read-only toolsets, and long-context depth policy that keeps speculation effective deep into a session.
- OpenAI
stopsequences are honored across chat, completions, and the Anthropic endpoint. /v1/completionsstreams tokens as they are generated, with real finish reasons and usage.- Cancellation is honest: cancelling a request, streaming or not, actually stops decode on the server.
- A live metrics stream (server-sent events) powers the app's dashboard and is available to your own tools, alongside snapshot, thermal, and prefill-history endpoints.
Forge
Forge turns the engine into a model factory. Point it at a Hugging Face repo and it converts the model to MLX (AWQ, compressed-tensors, NVFP4, and BF16 sources), calibrates and trains the MTP adapter, verifies the result on your hardware with quality gates that reject speed wins that degrade output, and publishes back to the Hub with full provenance if you choose. Vision towers are preserved through conversion. Available in the app as a full visual workflow and as mtplx forge.
The AIME benchmark
The app and mtplx bench aime run a live 30-problem AIME benchmark with fully disclosed, coaching-free prompts: the prompt carries only the answer-format contract, and every run records its exact prompts and rescue policy so results are reproducible.
One product, two surfaces
mtplx start now detects the app's running server and attaches to it instead of loading a second copy of the model. The app and CLI share the same catalog, recommendations, and settings, and mtplx stop knows the app's port.
New commands
mtplx stopstops the running server cleanly.mtplx settings get/setreads or changes live server settings.mtplx bench aime [--quick]runs the benchmark from the terminal.mtplx forgebuilds, verifies, and publishes MTP models.
Reliability and distribution
- Release builds bundle a pinned CPython, so a pristine Mac needs no Homebrew and no Python.
- The engine installs into an app-owned environment that ignores whatever pip configuration is on the machine, and packages load on macOS 14 and 15, not just the newest macOS.
- An old
mtplxon your PATH gets updated automatically instead of shadowing the new one. A newer one is left alone. - Busy ports resolve themselves: the app moves to the next free port with a banner, and the CLI tells you who owns a port and how to stop it.
Checksums (SHA-256)
377ca58438dce3458aacf8c74b7263a62c8636215da78c74696c7972df4cb649 MTPLX-1.0.0.dmg
74a6e74416dbcd78f9e70b2dac225a93cc2b98d4c12c18991461fcdc4fddf4b9 mtplx-1.0.0-py3-none-any.whl
3e65d8e54ea0ba0ab87010ec7a6d2cbdcc031c7fee104b4b8c9da874da375fe0 mtplx-1.0.0.tar.gz
MTPLX v0.3.7
MTPLX v0.3.7
This is a bugfix release for the recent public reports around the verified default model, Tune, speed reporting, and Claude Code.
Fixed
- Fixed the verified default model path for
start,quickstart,serve,tune, andbench tune. These now use the public Optimized Speed model instead of the oldmodels/...relative path. - Fixed
mtplx start --dry-run --jsonso the OpenWebUI handoff no longer says--model None. - Fixed Tune failure reporting. Bad model paths now show the real child error instead of a vague missing-artifact result.
- Made Tune speed output clearer: the headline
tok/sis decode speed, while prefill and end-to-end speed are tracked separately. - Added Claude Code support through the Anthropic Messages API, including tool use, tool results, streaming, and token counting.
Validation
- GitHub
ci,build, andhygienepassed on the release commit. - The tag release workflow built and checked the artifacts.
- PyPI trusted publishing passed, and PyPI now shows
0.3.7as latest. - Real local checks included source CLI dry-runs, fresh wheel install, real server startup, real Tune, bad-model Tune failure, and Claude Code headless plus interactive tool execution.
Install
python3 -m pip install -U mtplx
MTPLX v0.3.6
MTPLX v0.3.6
Production patch over v0.3.5 focused on the public release pillars: bounded memory, no silent decode/prefill tradeoff, and clean CLI UX.
Highlights
- Fixes AIME-shaped
max_tokens=65536memory behavior by bounding initial new-token KV reservation while preserving real prompt-context allocation. - Avoids retaining full-capacity live cache refs for anonymous one-off sessions.
- Improves OpenCode tool-result turns so stable cached prefixes are reused instead of cold-prefilling the full history.
- Ships Tune in the packaged CLI:
mtplx tune,mtplx-tune, andmtplx bench tune. - Fixes verified-default onboarding/model labeling for the installed Optimized Speed/Q4 artifact.
- Adds
bench tunechip diagnostics with power, frequency, temperature, utilization, fan, and thermal-pressure telemetry; generation-window scope is labeled when available. - Tightens README claims so paired same-machine speedup is not described as hardware-independent.
Validation
- Local: compileall, ruff, full pytest, twine check, fresh venv smoke, git diff check.
- CLI UX:
mtplx --version, OpenCode dry-run, Pi dry-run,mtplx-tunedry-run, andbench tunedry-run. - GitHub PR #66:
repository-hygiene,wheel, andno-mlx-smokeall passed before merge. - Release workflow: trusted PyPI publishing workflow passed for
v0.3.6.
Known Non-Claims
MTPLX v0.3.5
What's Changed
- Fixed OpenCode tool-result turns cold-prefilling the full conversation history. Follow-up OpenCode turns now reuse the stable SessionBank prefix instead of sitting at
Thinking...for minutes. - Fixed unsafe stream postcommit prefix anchoring so streamed assistant/tool histories do not poison the next cache boundary.
- Locked down the real-world consecutive Qwen XML tool-call regression so back-to-back tool calls stay structured and do not leak raw XML.
Validation
- Targeted server/tool/OpenCode pytest suite passed.
- Built and checked PyPI artifacts with
twine check. - Ran real local MTPLX server, streaming API, Android Studio doctor, OpenCode CLI, and Pi CLI smoke tests against the local Optimized Speed model.
- Verified PyPI
mtplx==0.3.5fresh venv install and Homebrewyoussofal/mtplx/mtplx0.3.5 install/test.
MTPLX v0.3.4
MTPLX v0.3.4 is a patch release for coding-agent UX and serving compatibility.
Highlights:
- Idle SessionBank postcommit is now cooperative and preemptible, so foreground Pi/OpenCode/agent turns do not sit silently behind long background cache work.
- Consecutive Qwen XML tool calls stay structured while streaming as OpenAI delta.tool_calls.
- Swival is now available through the start/integrate handoff flow.
- The locked indirect urllib3 dependency is updated to 2.7.0.
Validation:
- Local focused pytest and CI-mirror subsets passed.
- python -m build and twine check passed for the wheel and sdist.
- Fresh venv no-MLX CLI smoke passed.
- Real max-fan CLI generation with the local Optimized Speed model loaded and answered successfully.
- GitHub Actions build, hygiene, and ci are green on the release commit.
MTPLX v0.3.3
MTPLX v0.3.3
Patch release for OpenAI-compatible serving clients and coding-agent tool UX.
Added
mtplx doctor android-studiofor model discovery, nonstream chat, streaming chat, and tool-bearing request smoke.- Android Studio/OpenAI-compatible request-shape tolerance for
max_completion_tokens,stream_options,response_format,metadata, andparallel_tool_calls.
Changed
- Qwen XML tool calls now stream OpenAI
delta.tool_callsincrementally, so compatible clients can mount file-write/edit cards before the full argument body finishes. - Pi handoff no longer writes a hidden model-level
maxTokenscap.
Fixed
- Fixed Android Studio issue #58 where
/v1/chat/completionscould fail as500: null. - Hardened malformed, unknown, unclosed, or schema-invalid tool-call output so it falls back safely instead of hanging or storing raw XML as successful assistant tool history.
QA
- Local source tests: public CLI, OpenCode/onboarding, OpenAI bridge, server, and streaming tool-call translator passed.
- Built wheel/sdist and passed
twine checkplus fresh no-MLX venv smoke. - Installed the built wheel in a clean temp environment and verified
mtplx 0.3.3plus OpenCode dry-run with no hiddenmaxTokens. - Real local CLI smoke loaded the Optimized Speed model and generated through the MTP path.
- Real server/API smoke verified
/health,max_completion_tokens, OpenAI-shaped invalid-request errors, Android doctor, and incremental streamingwrite_filetool-call deltas.