Surpassing Frontier Performance with Open-Source Synth at 1/3 the Price

Synthesize a panel of research agents with the right open-weights engine and you pass a closed frontier model on deep research; keep the panel open as well and you do it for a fraction of that model's price. Our best synthesizer pair, a Kimi K2.6 judge feeding a GLM-5.2 synthesizer (both downloadable), scores 73.4 on DRACO over a frontier-mixed research panel, about eight points clear of Anthropic's Fable 5 at 65.3. Keep the panel itself fully open too (MiniMax M3, Kimi K2.6, DeepSeek V4 Pro, Gemma-4, GLM-5.2, no closed API anywhere) and the committee still beats Fable 5, at 69.2, for about a third of the price. Every score comes from the same gemini-3.1-pro grader we use everywhere else; the full judge × synthesizer grid is how we found the 73.4 pair.

The reason this works is that synth rewards diverse error, and five independently trained open models disagree in useful ways. Each panelist searches the web, reads its own sources, and writes its own report; where one hallucinates a date or misses a primary source, the others usually don't, and the synthesizer keeps what survives cross-examination. A single model, even a good closed one like Fable 5, has one set of blind spots and repeats them through the whole answer. The committee has five, mostly uncorrelated, and the judge catches the difference. We have argued before that the strongest open models never show up on the leaderboards that rank them solo, and this is the mechanism: their value shows up in the ensemble, and a single-shot score never measures it.
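The uncorrelated-error argument can be made concrete with a toy calculation. The sketch below assumes fully independent errors and a simple majority rule, which is an idealization: the text only claims the errors are mostly uncorrelated, and the judge and synthesizer are not a majority vote. The function name and the 20% error rate are illustrative assumptions, not measurements from the benchmark.

```python
from math import comb


def majority_error_rate(p: float, n: int = 5) -> float:
    """Probability that more than half of n independent panelists
    are wrong about the same claim, given per-panelist error rate p.

    Idealized model: errors are assumed independent and the panel is
    assumed to decide by majority. Neither is claimed by the benchmark.
    """
    need = n // 2 + 1  # smallest number of wrong panelists that forms a majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )


# Hypothetical 20% per-model error rate on some factual claim.
single = majority_error_rate(0.2, n=1)  # one model on its own
panel = majority_error_rate(0.2, n=5)   # five independent models
print(f"single model: {single:.3f}, five-model majority: {panel:.3f}")
```

Under these assumptions the panel-level error rate falls well below the single-model rate. Any correlation between panelists narrows that gap, which is why the diversity of the five models matters more than their count.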

Panel	Judge → synthesizer	DRACO score	Cost / 100 tasks
Frontier-mixed (incl. GPT-5.5, Opus) · best synthesizer	Kimi K2.6 judge → GLM-5.2	73.4	~$500 modeled
All-open (M3 / K2.6 / V4 Pro / Gemma-4 / GLM-5.2)	Kimi K2.6 judge → GLM-5.2	69.2	~$80 (≈⅓ of Fable 5 solo)
Fable 5 + GPT-5.5	GPT-5.5	69.0	~$450 modeled
Fable 5 solo	—	65.3	~$250 modeled
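The margins and price ratios quoted around this table follow from its four rows. A small sketch that recomputes them against the Fable 5 solo baseline; the dictionary keys and the helper name are ours, and the cost figures are the approximate per-100-task numbers from the table.

```python
# (DRACO score, approximate USD per 100 tasks), copied from the table.
ROWS = {
    "frontier-mixed panel": (73.4, 500),
    "all-open panel": (69.2, 80),
    "Fable 5 + GPT-5.5 panel": (69.0, 450),
    "Fable 5 solo": (65.3, 250),
}

BASE_SCORE, BASE_COST = ROWS["Fable 5 solo"]


def versus_baseline(name: str) -> tuple:
    """Return (score margin in points, cost as a multiple of Fable 5 solo)."""
    score, cost = ROWS[name]
    return round(score - BASE_SCORE, 1), round(cost / BASE_COST, 2)


for name in ROWS:
    margin, cost_ratio = versus_baseline(name)
    print(f"{name:26s} {margin:+.1f} pts  {cost_ratio:.2f}x cost")
```

Read this way, the all-open row is the only one that is both above the baseline on score and below it on cost.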

The obvious objection: does the all-open committee beat the best closed-frontier setup? No. Our frontier-mixed panel, with GPT-5.5 and Opus sitting on the committee, scores 73.4 with the same Kimi K2.6 → GLM-5.2 synthesizer. Letting real frontier models into the panel buys about four points over the open-only version, and that gap is consistent run to run. But a panel of closed frontier models costs more than any single one of them: about $500 per hundred tasks, roughly six times the open committee's bill. The open committee also edges past the best closed-ish synth we have published, a Fable 5 plus GPT-5.5 panel at 69.0. So we make the narrower, fully defensible claim: a committee that touches no proprietary API anywhere clears a single closed model, Fable 5, by a real margin (69.2 against 65.3).

And the committee that wins is the cheaper one to rent. Fable 5 lists at $10 per million input tokens and $50 per million output; every open model on the panel comes in under $1.55 in and $4.85 out, most of them well under a dollar in. So even with five of them each running a full research pass and a sixth pass to synthesize the reports, the open committee's bill comes to about $80 over the hundred-task benchmark. That is roughly a third of the ~$250 we model for a single Fable 5 run at the same token budget (Fable 5's actual rate is route-blocked and unpublished, so that figure is modeled from list prices, not taken from a bill). And that is before you count that the weights are yours to run.
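The cost comparison is token counts times list prices. A minimal sketch of that model follows. The per-pass token budget is a hypothetical placeholder, chosen only so that the Fable 5 line reproduces the ~$250 figure above; it is not the harness's measured budget. The sketch also assumes the synthesis pass has the same token profile as a research pass, which is a simplification.

```python
def run_cost(input_tokens: int, output_tokens: int,
             usd_in_per_m: float, usd_out_per_m: float) -> float:
    """Cost in USD of one pass, given per-million-token list prices."""
    return (input_tokens * usd_in_per_m + output_tokens * usd_out_per_m) / 1_000_000


TASKS = 100
# HYPOTHETICAL per-pass budget, picked to match the ~$250 Fable 5 figure.
IN_TOK, OUT_TOK = 150_000, 20_000

# One Fable 5 pass per task at its list price ($10 in / $50 out).
fable_solo = TASKS * run_cost(IN_TOK, OUT_TOK, 10.00, 50.00)

# Six passes per task (five panelists plus one synthesis pass), all billed at
# the panel's price ceiling ($1.55 in / $4.85 out). This is an upper bound,
# since most panel models are priced lower than the ceiling.
open_ceiling = TASKS * 6 * run_cost(IN_TOK, OUT_TOK, 1.55, 4.85)

print(f"Fable 5 solo:        ${fable_solo:,.0f}")
print(f"Open panel, ceiling: ${open_ceiling:,.0f}")
```

Under this placeholder budget, even pricing all six passes at the ceiling stays below the single Fable 5 run. The ~$80 reported above is lower still, which is consistent with most panel models costing well under the ceiling.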

The synthesizer is the seat that carries the result, and the best one we tested is GLM-5.2: open weights, and it heads the committee at 69.2. Its one liability is that it censors. On politically loaded tasks it refuses or goes blank, Taiwan being the cleanest example, and a synthesizer that won't write the answer is worthless on exactly the questions where synth matters most. TrustedRouter Synth covers that case directly: when the synthesizer returns nothing, it falls through to a backup model (here Gemma-4, also open weights) and the run keeps going. Run GLM-5.2 for the top score and the fallback catches the rare blank; run the runner-up MiniMax M3 as the censorship-safe default and you never trip it at all. Either way, no task reaches for a closed model to get unblocked.
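The fall-through behavior described above can be sketched in a few lines. This is an illustration of the idea, not TrustedRouter Synth's actual code: the function name, the callable signature, and the emptiness check are all assumptions. In particular, the check only catches blank output; a refusal written out as text would need its own detection.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical signature: a synthesizer takes the task and the panel's
# reports and returns the merged answer, or nothing if it declined.
Synthesizer = Callable[[str, Sequence[str]], Optional[str]]


def synthesize_with_fallback(
    task: str,
    reports: Sequence[str],
    primary: Synthesizer,
    backup: Synthesizer,
) -> Tuple[str, str]:
    """Try the primary synthesizer; if it returns nothing, use the backup.

    Returns (which seat answered, answer text).
    """
    for seat, model in (("primary", primary), ("backup", backup)):
        answer = model(task, reports)
        if answer and answer.strip():  # treat None, "" and whitespace as blank
            return seat, answer
    raise RuntimeError("both synthesizers returned an empty answer")


# Usage with stand-in callables in place of real model calls.
def goes_blank(task: str, reports: Sequence[str]) -> Optional[str]:
    return ""  # simulates a synthesizer that declines the task


def merges(task: str, reports: Sequence[str]) -> Optional[str]:
    return "merged: " + " | ".join(reports)


seat, answer = synthesize_with_fallback("q", ["report A", "report B"], goes_blank, merges)
print(seat, "->", answer)
```

Returning which seat answered keeps the fallback visible in logs, so a run that leaned on the backup can be told apart from one that never needed it.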

What makes the result matter is what it removes. Run on your own hardware, this configuration needs no API key and carries no per-token bill, and no provider can deprecate the model under you or refuse a prompt class on its own policy. The panelists are open weights, the synthesizer is open weights, the fallback is open weights, and the whole loop can run on hardware you own. For deep research, with its long horizon, many sources, and real cost per query, that is the difference between renting a capability and holding one. The pieces, the exact panel, and the DRACO harness are all public: the models we route are listed on our models page, and the synth code is in TrustedRouter Synth Draco. Five models you can download, judged by the same grader as everything else, land above a closed model people pay for.