Releases · youssofal/MTPLX

16 min read Original article ↗

MTPLX 1.0.4

MTPLX 1.0.4

Same-day hotfix. 1.0.3 broke coding agents on their first tool turn, and pasting a GGUF repo into the app got a wrong answer. Both fixed.

Agents work again

Any tool-using client on 1.0.3, including Pi, Hermes, OpenCode, and anything speaking the OpenAI tools protocol, crashed with "unexpected keyword argument 'vision_splice'" on its first tool turn after a cache miss. Non-streaming callers got a 500. Streaming agents lost the stream mid-conversation, which the app reported as "Stream offline."

The cause was one stray argument left behind by the vision work, sitting in a diagnostics call that only agent tool turns reach, which is exactly the path the release tests never drove. It is removed, that path is now unit-tested directly, and a new audit test parses the server source and fails if any call ever passes an argument its target does not accept, so this class of mistake cannot ship quietly again. Reported within hours by two users (#99, #100). Thank you both.

GGUF repos get a straight answer

Qwopus3.6-27B-Coder-MTP-GGUF is trending, it has MTP in the name, and people paste it into "Add a model from Hugging Face." The app used to claim the repository did not exist. The truth is that GGUF is llama.cpp's format and MTPLX runs MLX models. The app now says exactly that, names the source repo the GGUF was made from, and points you at Forge to convert it. Genuine typos get "check the name" instead of a lie.

The repo check also follows your Hugging Face mirror setting now, so adding models works on networks where huggingface.co is blocked.

Downloads

Existing installs update themselves; the app refreshes its engine automatically after updating.

Homebrew 6 note

Homebrew 6.0 added tap trust, and right now it can fail third-party tap installs with a silent sandbox error even for trusted taps (Homebrew/brew#22631). If brew install youssofal/mtplx/mtplx dies that way, run it once as HOMEBREW_NO_REQUIRE_TAP_TRUST=1 brew install youssofal/mtplx/mtplx. That is a Homebrew regression, not an MTPLX one; upstream is on it.

MTPLX 1.0.3

The app can see. Attach an image and ask about it, with MTP speculative decoding still running on top. And the compatibility gate stops refusing models that run fine.

Downloads

  • Mac app: mtplx.com/download (notarized DMG, Apple Silicon, macOS 14+)
  • All releases and checksums: mtplx.com/releases
  • CLI via Homebrew: brew install youssofal/mtplx/mtplx
  • CLI via pip: pip install mtplx

Existing installs update themselves; the app refreshes its engine automatically after updating.

Vision

Chat takes images now. Attach PNG, JPEG, or WebP from the paperclip, ask your question, and the model describes what it is looking at. The same works over the API with standard OpenAI image_url content parts, so any client that can send images to GPT-style endpoints can send them to MTPLX. Speculative decoding is untouched by all of this: on an M5 Max the 27B answers vision questions at the same MTP speeds as text.

This works on Qwen 3.6 27B in both Speed and Quality builds, Qwen 3.6 35B, and Qwen 3.5 9B. The 9B had shipped without its vision weights; the repo on Hugging Face now carries them again, and mtplx pull learned to sync repo updates like this into existing local copies instead of declaring them complete forever. Gemma vision needs its own tower implementation and is coming separately.

Two honest details: images skip the prompt cache by design, because two different pictures with identical text must never restore each other's state, so image turns always pay a fresh prefill. And /health now reports vision support per model, which is also how the app decides whether to offer image attachments.

The gate stops blocking runnable models

If you picked the Optimized Quality model from the CLI menu, MTPLX refused to load it and then crashed while printing the explanation (#98). Both halves of that are fixed. The crash was a one-line bug that had eaten every gate explanation since 1.0.0. The refusal was a policy mistake: the gate treated unverified runtime contracts, and even benchmark evidence saying a model was slow, as reasons not to load at all. Verification is now a label. Unverified models load with a one-line note saying so, and refusal is reserved for models that genuinely cannot execute. Trying weird model builds is the point of this thing; the runtime should tell you what it knows, then run.

Checksums (SHA-256)

cd753fdacea94e294eb834f955247389ae6ee49b33c00bc7fe7b39e11bd79a19  MTPLX-1.0.3.dmg
025f099d51c9b501c3f7f19e868091380176b3929887e58c28e82264cc3259c2  mtplx-1.0.3-py3-none-any.whl
9b0c7db41fc5eb7d88629f2e1b78da10bdd368de26283b840c0f58f7c5271494  mtplx-1.0.3.tar.gz

MTPLX 1.0.2

A bug-fix release. Two ways the engine could refuse to start are gone, parallel agents stop tripping over each other, and model downloads work behind networks that block Hugging Face.

Downloads

  • Mac app: mtplx.com/download (notarized DMG, Apple Silicon, macOS 14+)
  • All releases and checksums: mtplx.com/releases
  • CLI via Homebrew: brew install youssofal/mtplx/mtplx
  • CLI via pip: pip install mtplx

Existing installs update themselves; the app refreshes its engine automatically after updating.

No more Degraded on every start

The Settings profile picker offered two options the engine never accepted, Auto and Sustained Max. Picking either saved a value that killed the engine at launch, so the app showed Degraded every time you pressed Play, with nothing pointing at the real cause. That class of bug is now closed from both sides: the engine accepts the old values (Auto means the default profile, Sustained Max means Sustained, and pinned fans stay a fan setting), existing configurations heal themselves when the app loads them, and the picker only offers profiles the engine actually runs. mtplx serve --profile auto works from the command line too.

Launches that lose their port also got smarter. If something took the port between the check and the start, or a listener the local probe cannot see is holding it, MTPLX now resolves the conflict and retries once on its own instead of failing with a raw port error.

Parallel agents stop colliding

Agent tools that fire many requests at once without session ids (LiteLLM batches, swival audits, anything that fans out) could fail with "session is already in flight" because requests sharing a prompt prefix resolved to the same session. A busy session now forks the request to a fresh session, so every parallel request completes; explicitly named sessions keep the strict behavior. Anonymous session ids are also random now instead of clock derived, which removes a rare collision under burst load.

Reported and fixed by Frank Denis (@jedisct1) in #95. First community patch in MTPLX, and a clean one.

Downloads work behind blocked networks

huggingface.co is blocked in mainland China, which made the guided download impossible there (#96). You can now set a download mirror, either right in the onboarding download step or later in Settings under Advanced, and MTPLX uses it for model downloads. Your Hugging Face token is never sent to a mirror, so gated repos stay on the official endpoint.

Checksums (SHA-256)

71c88729a7f6101365f2a7cebac01fe5f17a50b004930409b9e55512c96f788b  MTPLX-1.0.2.dmg
16aedb9b01cc1a8a4219381e34f9a5209ff2afb8515546f245bb8dcd34ad5a7c  mtplx-1.0.2-py3-none-any.whl
64b7909f5305d0c674b37c6b613e6f1057b2225534c6034978ad6064bb2cdfe3  mtplx-1.0.2.tar.gz

MTPLX 1.0.1

A bug-fix release, hours after 1.0.0, because two things deserved fixing immediately.

Tuning finishes on every Mac

Some Macs could not complete the first-run tuning step: when the fan helper could not verify a max-fan ramp (a fresh machine without the helper's sudo grant, for example), tuning refused to run and setup stopped. Pinned fans make timing cleaner, but they were never worth a dead setup wizard.

Tuning now runs with fans on automatic when pinning is unavailable, says so in the results, and still saves only a depth that actually beats the baseline. If you want strict pinned-fan measurements for benchmarking, mtplx tune --require-max-fans keeps the old behavior.

Gemma 4 works from the CLI

The official Gemma 4 repos are assistant-pair bundles, and the CLI's Hugging Face preflight did not recognize the layout, so mtplx start refused models the app ran fine. The CLI now reaches the same verdict as the app.

Downloads

Existing installs update themselves; the app refreshes its engine automatically after updating.

Checksums (SHA-256)

d6cffd30950ea9462bc328b1ab03e4bdea35d1135d07f8467ff6a9557bf9a153  MTPLX-1.0.1.dmg
f01c6a605bc221fd3489289f1266521f3a5df689539b340ff8515fd5241f2a1c  mtplx-1.0.1-py3-none-any.whl
21fcc43ba641a168e595848c9239cd2620f4c3c7e755f6d1c480c86ae2f9c7f6  mtplx-1.0.1.tar.gz

MTPLX 1.0.0

Download

Mac app (recommended): mtplx.com/download. Drag to Applications; the app sets up everything else, updates itself, and puts mtplx in your terminal. All releases: mtplx.com/releases.

CLI only:

brew install youssofal/mtplx/mtplx

or python3 -m pip install mtplx. The DMG attached below is byte-identical to the one served at mtplx.com.

Requires Apple Silicon and macOS 14 or newer.


MTPLX 1.0.0 is the first full release: a native macOS app and the mtplx command line working as one product, built for Apple Silicon.

If you are new here: MTPLX runs local language models using their own built-in multi-token prediction heads as a speculative drafter, with exact rejection sampling. Same output distribution as normal decoding, measured 1.6x faster on a 16 GB M4 Mac mini and up to 2.24x on an M5 Max.

The Mac app

New in 1.0.0, and the reason this release exists. Download the DMG, drag it to Applications, and the app does the rest:

  • Guided onboarding: checks your hardware, recommends a model that fits your memory, downloads it, installs its own Python engine and fan control, puts mtplx on your PATH, and tunes decoding depth on your machine.
  • A live dashboard: decode speed gauge, acceptance rate by draft depth, the verify waterfall, context usage, cache state, and an activity feed that tells the truth about what the server is doing.
  • Native chat with streaming, thinking cards, file attachments, web search, and LaTeX rendering.
  • One-click launches for OpenCode, Pi, Hermes, and Open WebUI against your local server.
  • A built-in AIME benchmark runner, so you can score a model yourself instead of trusting a chart.
  • Automatic updates through Sparkle, with the engine refreshed after every update. No Homebrew required at any point: release builds bundle a pinned Python interpreter.

New models

The 0.3.7 engine ran one verified model. The 1.0.0 catalog covers a range of machines, in speed, balance, and quality builds:

  • Gemma 4. Runs as an assistant pair, where the tuned control is the draft block size rather than depth. Long-context behavior was verified for no performance cliff.
  • Qwen 3.6 MoE (35B-A3B). Mixture-of-experts support including prequantized expert sidecars, normalized expert layouts, and hard blocks on layouts that cannot run correctly.
  • Qwen 3.5 (4B, 9B). Smaller machines get first-class models instead of a cut-down experience.
  • Qwen 3.6 27B remains the flagship, now in speed, quality, and FP16 builds.

The catalog is shared by the app and the CLI, and the default is chosen for your machine: chip generation picks the precision, and machines under 32 GB route to the 9B model because the 27B default cannot load safely there.

KV cache reuse, in memory and on disk

Two layers, one goal: never pay for the same tokens twice.

  • In RAM: warm-prefix reuse across turns and requests. Multi-turn chats and agent workloads like OpenCode hit the cache instead of re-processing the conversation, which is the difference between an agent that flows and one that stalls before every reply.
  • On SSD: session state persists to disk with enforced size caps. Quit the server, restart your Mac, come back tomorrow: the session restores near-instantly instead of re-processing thousands of tokens.

Concurrency

1.0.0 adds continuous batching: the server can interleave multiple requests instead of serializing them. Batching presets, a scheduler mode, and explicit caps (--max-active-requests, --decode-batch-max, --batch-wait-ms) control the behavior. Agent workloads, which fire many short requests, benefit the most.

Smart fan mode

Fan control is no longer all-or-nothing. Smart mode ramps the fans when the model is working and restores them when it goes idle, works across the app, the CLI, and the server API, and survives handing a session from the app to a terminal client. The crash-safe watchdog from earlier releases still stands behind all of it: if MTPLX dies for any reason, your fans return to automatic.

A server built for agents

Most of the serving work this cycle came from running real coding agents against MTPLX and fixing everything that broke:

  • OpenCode, Pi, and Hermes each have a hardened lane: correct tool contracts, trimmed read-only toolsets, and long-context depth policy that keeps speculation effective deep into a session.
  • OpenAI stop sequences are honored across chat, completions, and the Anthropic endpoint.
  • /v1/completions streams tokens as they are generated, with real finish reasons and usage.
  • Cancellation is honest: cancelling a request, streaming or not, actually stops decode on the server.
  • A live metrics stream (server-sent events) powers the app's dashboard and is available to your own tools, alongside snapshot, thermal, and prefill-history endpoints.

Forge

Forge turns the engine into a model factory. Point it at a Hugging Face repo and it converts the model to MLX (AWQ, compressed-tensors, NVFP4, and BF16 sources), calibrates and trains the MTP adapter, verifies the result on your hardware with quality gates that reject speed wins that degrade output, and publishes back to the Hub with full provenance if you choose. Vision towers are preserved through conversion. Available in the app as a full visual workflow and as mtplx forge.

The AIME benchmark

The app and mtplx bench aime run a live 30-problem AIME benchmark with fully disclosed, coaching-free prompts: the prompt carries only the answer-format contract, and every run records its exact prompts and rescue policy so results are reproducible.

One product, two surfaces

mtplx start now detects the app's running server and attaches to it instead of loading a second copy of the model. The app and CLI share the same catalog, recommendations, and settings, and mtplx stop knows the app's port.

New commands

  • mtplx stop stops the running server cleanly.
  • mtplx settings get/set reads or changes live server settings.
  • mtplx bench aime [--quick] runs the benchmark from the terminal.
  • mtplx forge builds, verifies, and publishes MTP models.

Reliability and distribution

  • Release builds bundle a pinned CPython, so a pristine Mac needs no Homebrew and no Python.
  • The engine installs into an app-owned environment that ignores whatever pip configuration is on the machine, and packages load on macOS 14 and 15, not just the newest macOS.
  • An old mtplx on your PATH gets updated automatically instead of shadowing the new one. A newer one is left alone.
  • Busy ports resolve themselves: the app moves to the next free port with a banner, and the CLI tells you who owns a port and how to stop it.

Checksums (SHA-256)

377ca58438dce3458aacf8c74b7263a62c8636215da78c74696c7972df4cb649  MTPLX-1.0.0.dmg
74a6e74416dbcd78f9e70b2dac225a93cc2b98d4c12c18991461fcdc4fddf4b9  mtplx-1.0.0-py3-none-any.whl
3e65d8e54ea0ba0ab87010ec7a6d2cbdcc031c7fee104b4b8c9da874da375fe0  mtplx-1.0.0.tar.gz

MTPLX v0.3.7

MTPLX v0.3.7

This is a bugfix release for the recent public reports around the verified default model, Tune, speed reporting, and Claude Code.

Fixed

  • Fixed the verified default model path for start, quickstart, serve, tune, and bench tune. These now use the public Optimized Speed model instead of the old models/... relative path.
  • Fixed mtplx start --dry-run --json so the OpenWebUI handoff no longer says --model None.
  • Fixed Tune failure reporting. Bad model paths now show the real child error instead of a vague missing-artifact result.
  • Made Tune speed output clearer: the headline tok/s is decode speed, while prefill and end-to-end speed are tracked separately.
  • Added Claude Code support through the Anthropic Messages API, including tool use, tool results, streaming, and token counting.

Validation

  • GitHub ci, build, and hygiene passed on the release commit.
  • The tag release workflow built and checked the artifacts.
  • PyPI trusted publishing passed, and PyPI now shows 0.3.7 as latest.
  • Real local checks included source CLI dry-runs, fresh wheel install, real server startup, real Tune, bad-model Tune failure, and Claude Code headless plus interactive tool execution.

Install

python3 -m pip install -U mtplx

MTPLX v0.3.6

MTPLX v0.3.6

Production patch over v0.3.5 focused on the public release pillars: bounded memory, no silent decode/prefill tradeoff, and clean CLI UX.

Highlights

  • Fixes AIME-shaped max_tokens=65536 memory behavior by bounding initial new-token KV reservation while preserving real prompt-context allocation.
  • Avoids retaining full-capacity live cache refs for anonymous one-off sessions.
  • Improves OpenCode tool-result turns so stable cached prefixes are reused instead of cold-prefilling the full history.
  • Ships Tune in the packaged CLI: mtplx tune, mtplx-tune, and mtplx bench tune.
  • Fixes verified-default onboarding/model labeling for the installed Optimized Speed/Q4 artifact.
  • Adds bench tune chip diagnostics with power, frequency, temperature, utilization, fan, and thermal-pressure telemetry; generation-window scope is labeled when available.
  • Tightens README claims so paired same-machine speedup is not described as hardware-independent.

Validation

  • Local: compileall, ruff, full pytest, twine check, fresh venv smoke, git diff check.
  • CLI UX: mtplx --version, OpenCode dry-run, Pi dry-run, mtplx-tune dry-run, and bench tune dry-run.
  • GitHub PR #66: repository-hygiene, wheel, and no-mlx-smoke all passed before merge.
  • Release workflow: trusted PyPI publishing workflow passed for v0.3.6.

Known Non-Claims

  • This release does not claim to fix unrelated open issues #65, #56, or #16.
  • README release claims avoid unavailable fork/patch proof.

MTPLX v0.3.5

What's Changed

  • Fixed OpenCode tool-result turns cold-prefilling the full conversation history. Follow-up OpenCode turns now reuse the stable SessionBank prefix instead of sitting at Thinking... for minutes.
  • Fixed unsafe stream postcommit prefix anchoring so streamed assistant/tool histories do not poison the next cache boundary.
  • Locked down the real-world consecutive Qwen XML tool-call regression so back-to-back tool calls stay structured and do not leak raw XML.

Validation

  • Targeted server/tool/OpenCode pytest suite passed.
  • Built and checked PyPI artifacts with twine check.
  • Ran real local MTPLX server, streaming API, Android Studio doctor, OpenCode CLI, and Pi CLI smoke tests against the local Optimized Speed model.
  • Verified PyPI mtplx==0.3.5 fresh venv install and Homebrew youssofal/mtplx/mtplx 0.3.5 install/test.

MTPLX v0.3.4

MTPLX v0.3.4 is a patch release for coding-agent UX and serving compatibility.

Highlights:

  • Idle SessionBank postcommit is now cooperative and preemptible, so foreground Pi/OpenCode/agent turns do not sit silently behind long background cache work.
  • Consecutive Qwen XML tool calls stay structured while streaming as OpenAI delta.tool_calls.
  • Swival is now available through the start/integrate handoff flow.
  • The locked indirect urllib3 dependency is updated to 2.7.0.

Validation:

  • Local focused pytest and CI-mirror subsets passed.
  • python -m build and twine check passed for the wheel and sdist.
  • Fresh venv no-MLX CLI smoke passed.
  • Real max-fan CLI generation with the local Optimized Speed model loaded and answered successfully.
  • GitHub Actions build, hygiene, and ci are green on the release commit.

MTPLX v0.3.3

MTPLX v0.3.3

Patch release for OpenAI-compatible serving clients and coding-agent tool UX.

Added

  • mtplx doctor android-studio for model discovery, nonstream chat, streaming chat, and tool-bearing request smoke.
  • Android Studio/OpenAI-compatible request-shape tolerance for max_completion_tokens, stream_options, response_format, metadata, and parallel_tool_calls.

Changed

  • Qwen XML tool calls now stream OpenAI delta.tool_calls incrementally, so compatible clients can mount file-write/edit cards before the full argument body finishes.
  • Pi handoff no longer writes a hidden model-level maxTokens cap.

Fixed

  • Fixed Android Studio issue #58 where /v1/chat/completions could fail as 500: null.
  • Hardened malformed, unknown, unclosed, or schema-invalid tool-call output so it falls back safely instead of hanging or storing raw XML as successful assistant tool history.

QA

  • Local source tests: public CLI, OpenCode/onboarding, OpenAI bridge, server, and streaming tool-call translator passed.
  • Built wheel/sdist and passed twine check plus fresh no-MLX venv smoke.
  • Installed the built wheel in a clean temp environment and verified mtplx 0.3.3 plus OpenCode dry-run with no hidden maxTokens.
  • Real local CLI smoke loaded the Optimized Speed model and generated through the MTP path.
  • Real server/API smoke verified /health, max_completion_tokens, OpenAI-shaped invalid-request errors, Android doctor, and incremental streaming write_file tool-call deltas.