This is Part 2 of our series on building Rig, where we share what we've learned about making coding models work on consumer hardware. Check out Part 1: Teaching a Model to Code, which covers data generation.
In Part 1, we talked about the data problem: how we built a pipeline to generate tens of thousands of realistic coding episodes because the training data we needed didn't exist.
Now comes the question that kept us up at night for months: how do you take all of that data and produce a model that's genuinely capable, but small enough to run on a laptop?
The short answer is: you cheat.
Not in a bad way. In a "we found six different ways to make a model smaller or more capable and stacked them on top of each other" way.
Supervised fine-tuning, a reinforcement learning approach called Self-Distillation Policy Optimization that we adapted for multi-turn agent training, progressive expert pruning, multi-objective knowledge distillation, speculative decoding, and custom quantization.
Each one shaving off size or adding capability, in sequence, like a Rube Goldberg machine.
Let me walk you through the whole thing.
Why Mixture-of-Experts is the key to all of this
Before anything else, you need to understand the architecture we're starting with, because it's the reason any of this is possible.
Our base model uses something called Mixture-of-Experts, or MoE. The idea is simple and kind of beautiful: instead of running every parameter for every token, the model has a "router" that picks a small subset of feed-forward networks, called "experts," for each input. The model might have 80 billion total parameters, but only 3 billion of them activate for any given token.
A quick note on the name: "experts" is a bit misleading. These aren't specialist modules that know about specific languages or topics. They're general-purpose sub-networks, and the router learns to activate different combinations based on patterns in the hidden state. The same expert might fire for a token in Python, a token in a legal doc, and a token in French prose. The value of MoE isn't that you get tidy specializations, it's that you get the total parameter capacity of a much larger model while only paying the compute cost of a small active slice.
This is perfect for running locally. You need enough total capacity to store broad coding knowledge, but you only pay the cost of the parameters that activate per token. It's like having a huge library but only pulling three books off the shelf at a time.
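If you've never seen the routing written out, here's a minimal sketch of the top-k idea in PyTorch. The hidden size, expert count, and top-k value here are made up for illustration; they're nothing like our base model's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, hidden=512, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The "router": a linear layer that scores every expert for each token.
        self.router = nn.Linear(hidden, num_experts)
        # The "experts": independent feed-forward networks.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden, 4 * hidden),
                nn.GELU(),
                nn.Linear(4 * hidden, hidden),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, hidden)
        scores = self.router(x)                            # (num_tokens, num_experts)
        weights, picked = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)               # renormalize over the chosen few
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = picked[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

The thing to notice: every expert exists in memory, but each token only ever touches a couple of them.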
The catch is that MoE models are notoriously difficult to work with. Fine-tuning them is tricky. Compressing them is trickier. The router adds complexity, the balance between experts is fragile, and if you compress naively by just treating the whole thing as one big blob of weights, you destroy the learned routing patterns that make it all work.
Everything we built accounts for this. Every stage of our pipeline respects the expert structure.
That constraint shaped every decision we made.
Stage 1: Teaching the model to be a developer
The first stage is supervised fine-tuning on the episodes we generated in Part 1. This is where the model learns the basics: how to read files, search codebases, make edits, run commands, and reason about code.
Three things matter here, and they're all more subtle than they sound.
You have to train on everything at once. Our episodes span five task types (replay, debugging, code review, fill-in-the-middle, and reasoning), and we train on all of them simultaneously. Each task type gets a carefully tuned loss weight so no single type dominates the gradient updates. The goal is a generalist. Not a code completer. Not a debugger. A full coding agent that can do all of it.
Not every token matters equally. This is where most people would get it wrong. A training episode contains the agent's actions, its reasoning, retrieved context, tool results, and system messages. If you train on all of them with equal weight, the model gets confused. It starts trying to memorize retrieved snippets, or reproduce tool outputs, instead of learning the actual skills. So we mask selectively; there's a code sketch of this just after the next point. Full loss on tool call invocations and assistant reasoning (that's the behavior we want). Zero loss on retrieved context, tool results, and system messages (that's scaffolding, not skill). The model learns from the actions in each episode without being distracted by the environment around them.
Long context isn't optional. Real coding sessions span thousands of lines across dozens of files. We train on sequences up to 500k tokens, using sequence parallelism to split the attention computation across multiple GPUs. This isn't a nice-to-have. If your coding assistant can't hold a long conversation with a complex codebase, it's a toy.
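To make the selective masking from the second point concrete, here's a minimal sketch of per-token loss weighting, assuming each token carries a role label. The role names and weights are hypothetical, not our exact schema.

```python
import torch
import torch.nn.functional as F

# Per-role loss weights (hypothetical role names and values).
LOSS_WEIGHT = {
    "assistant_reasoning": 1.0,  # full loss: this is the behavior we want
    "tool_call": 1.0,            # full loss: actions are the skill
    "tool_result": 0.0,          # zero loss: environment output, not skill
    "retrieved_context": 0.0,    # zero loss: scaffolding
    "system": 0.0,               # zero loss: scaffolding
}

def masked_lm_loss(logits, targets, roles):
    """logits: (T, vocab), targets: (T,) token ids, roles: list of T role strings."""
    weights = torch.tensor([LOSS_WEIGHT[r] for r in roles], dtype=logits.dtype)
    per_token = F.cross_entropy(logits, targets, reduction="none")
    # Average only over the tokens that actually contribute to the gradient.
    return (per_token * weights).sum() / weights.sum().clamp(min=1.0)
```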
We apply LoRA across all layers, including the per-expert feed-forward networks. This is a deliberate choice that I want to emphasize, because a lot of people skip the experts when applying LoRA.
In an MoE model, most of the parameters live in the experts. If you skip them, you're leaving most of the model untouched. You end up with a model that's been fine-tuned on paper but barely changed in practice.
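Here's roughly what that looks like in code: a small LoRA wrapper applied to every linear layer inside an expert's feed-forward stack. The wrapper, the rank and alpha values, and the assumption that experts are plain stacks of linear layers are all illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the pretrained weights
        # Low-rank update B @ A, initialized so the update starts at zero.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

def add_lora_to_expert(expert: nn.Sequential, rank=16):
    # Wrap every linear layer inside the expert's feed-forward stack,
    # not just the attention projections.
    for name, module in list(expert.named_children()):
        if isinstance(module, nn.Linear):
            setattr(expert, name, LoRALinear(module, rank=rank))
```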
Stage 2: The model teaches itself
Supervised learning gets you surprisingly far. But it has a fundamental limitation: the model learns to imitate good trajectories without necessarily understanding why those trajectories were good. It copies the actions. It doesn't internalize the reasoning behind them.
This is where reinforcement learning comes in. And this is where we had to develop something new.
The standard RL approaches for language models (PPO, GRPO, DPO) all use trajectory-level rewards. The entire response gets a single score: good or bad. This is like grading a student's exam by looking only at the final answer and ignoring all the work. For a coding agent that makes dozens of tool calls across a long trajectory, that signal is way too sparse. "Your trajectory was bad" doesn't tell the model where it went wrong. Was it the initial file read? The search query? The edit on line 47? The model has no way to know.
So we took something called Self-Distillation Policy Optimization, or SDPO, and adapted the objective to work for multi-turn conversations. The core idea is that the model acts as its own teacher. Here's how it works:
The student, the model we're training, generates a trajectory for a given task. It does its best, the same way it would at inference time.
Then the same model, with the same parameters, gets a second shot at the same task. But this time, it gets to cheat. It sees per-turn observations about what worked and what didn't, the final test results, and optionally a reference solution. With all that hindsight, it generates a "what I would have done differently" trajectory.
The difference between the student's attempt and the teacher's hindsight attempt gives us dense, per-token feedback. Not "this trajectory was bad" but "at this specific token, here's what you should have done instead." The model learns that it should have searched for the test file instead of reading the README. That the edit was close but missed the edge case on line 47. That the right tool call was a code graph search, not a grep.
This is dramatically more informative than a single trajectory-level score. And because the teacher is just the model itself with extra context, we don't need a separate reward model — which would require labeled preference data we don't have.
The teacher stays close to the student through exponential moving average updates, which prevents the catastrophic drift that plagues most RL training setups. We run SDPO in two phases: first on pure code generation (where the reward signal is cleanest), then on full agent trajectories (where it learns to improve tool use and exploration).
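To give a feel for the shape of this, here's a heavily simplified sketch. The same model scores the student's trajectory twice, once as-is and once with the hindsight context prepended, and the per-token gap becomes the training signal. The model interface and the KL-style loss are assumptions made for the sketch; this shows the general idea, not our actual SDPO objective. (The teacher starts as a copy of the student and then trails it via the EMA update.)

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # The teacher trails the student via an exponential moving average,
    # which keeps it close and prevents drift.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)

def sdpo_step(student, teacher, trajectory_tokens, hindsight_tokens):
    """Both models map a 1-D token sequence to per-token logits of shape (T, vocab).

    trajectory_tokens: the student's own attempt at the task
    hindsight_tokens:  per-turn observations, test results, optionally a
                       reference solution, encoded as a prefix
    """
    # The student scores its own trajectory, exactly as it generated it.
    student_logits = student(trajectory_tokens)
    # The teacher (same architecture, EMA weights) scores the same trajectory,
    # but conditioned on the hindsight prefix.
    with torch.no_grad():
        teacher_logits = teacher(torch.cat([hindsight_tokens, trajectory_tokens]))
        teacher_logits = teacher_logits[-trajectory_tokens.size(0):]
    # Dense, per-token signal: pull the student toward what it "would have
    # done differently" with hindsight, token by token.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
```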
Stage 3: Making it actually fit
An 80-billion parameter model, even with MoE, is too large for consumer hardware. So now we compress it. And this is where things get really interesting, because naive compression is catastrophically bad.
You can't just drop experts
The base model has hundreds of routed experts across all its layers. We need to cut that down to 64 total across all layers. The obvious approach is to drop the least-used ones. This doesn't work. Experts that seem "unimportant" on average might activate for critical token patterns scattered across many different tasks. Drop one and the degradation isn't clean or predictable: you don't lose "Rust support"; you get subtly worse completions across dozens of unrelated contexts in ways that are hard to diagnose.
Our approach is progressive pruning: we compress in two stages rather than one big cut. Each stage reduces the expert count significantly, with a recovery phase in between. This is gentler than a single aggressive compression and preserves more capability. After both rounds, the model retains roughly 30 billion total parameters — still far more capacity than a dense model of equivalent active size. Think of it like gradually adjusting to altitude rather than being dropped on top of Everest.
The pruning itself uses three signals together. Weight similarity tells us which experts have learned similar things and can be merged with minimal information loss. Co-activation patterns tell us which experts the router frequently uses together; those serve complementary roles and shouldn't be merged, even if their weights look similar. And importance scoring, based on Fisher information, ensures we never merge away an expert that has an outsized impact on the model's output.
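Here's a toy sketch of how those three signals might combine into a single merge-candidate score. The penalty weights are illustrative; the real scoring is more involved.

```python
import torch

def merge_score(w_i, w_j, coact_ij, fisher_i, fisher_j,
                coact_penalty=2.0, fisher_penalty=1.0):
    """Higher score = safer pair to merge.

    w_i, w_j:   flattened weight tensors of two experts
    coact_ij:   fraction of tokens where the router picks both experts (0..1)
    fisher_i/j: Fisher-information importance of each expert
    """
    # Weight similarity: similar experts lose less when merged.
    similarity = torch.cosine_similarity(w_i.flatten(), w_j.flatten(), dim=0)
    # Co-activation: experts used together are complementary; keep them apart.
    # Importance: never merge away an expert the output leans on heavily.
    return similarity - coact_penalty * coact_ij - fisher_penalty * max(fisher_i, fisher_j)
```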
But here's the part I find most elegant. When we merge experts, we don't just average their weights. Neural networks have permutation symmetries: two experts might represent the exact same function, but with their internal neurons in a different order. If you average them directly, you get mush. So we first align the internal neurons using optimal transport, finding the permutation that minimizes the distance between corresponding neurons. Then we filter noise through subspace projection, then combine with importance-weighted averaging.
The result is a merged expert that preserves the collective knowledge of its constituents. Not an average. A synthesis.
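A rough sketch of that merge step, using the Hungarian algorithm as a stand-in for the optimal-transport alignment and skipping the subspace-projection step for brevity:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_and_merge(up_a, up_b, importance_a=1.0, importance_b=1.0):
    """up_a, up_b: (ffn_dim, hidden) first-layer weight matrices of two experts."""
    # Cost of matching neuron i in expert A to neuron j in expert B
    # (negative cosine similarity, so the assignment maximizes similarity).
    norms_a = np.linalg.norm(up_a, axis=1, keepdims=True)
    norms_b = np.linalg.norm(up_b, axis=1, keepdims=True)
    cost = -(up_a @ up_b.T) / (norms_a * norms_b.T + 1e-8)
    _, cols = linear_sum_assignment(cost)      # permutation minimizing total cost
    up_b_aligned = up_b[cols]                  # reorder B's neurons to line up with A's
    # Importance-weighted average of the aligned weights. (A full implementation
    # would also permute the matching columns of the down-projection.)
    total = importance_a + importance_b
    return (importance_a * up_a + importance_b * up_b_aligned) / total
```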
Distilling what was lost
Pruning gets us the right structure, but there's inevitably some quality loss. Knowledge distillation recovers it.
The idea of distillation is simple: train the compressed student model to match the behavior of the original teacher model. But we don't do simple distillation. We use four complementary objectives, because we found that no single one is enough.
Output matching teaches the student to produce the same token probability distributions as the teacher. Router alignment ensures the student's routing decisions match the teacher's, accounting for the expert merging via a learned projection. Internal alignment preserves the geometric structure of the teacher's hidden representations, not just the final output, but the relationships between concepts inside the model. And self-correction lets the compressed student generate its own outputs so the uncompressed teacher can evaluate them, providing feedback on the student's actual failure modes rather than just training on the teacher's outputs. This is conceptually similar to SDPO in Stage 2, but applied across the compression gap: the teacher and student are now different models, and the goal is recovering lost capability rather than developing new skill.
That last one matters more than you'd think. Standard distillation trains the student on the teacher's behavior. Self-correction trains the student on the student's own mistakes. It's the difference between "here's how I would do it" and "here's specifically where you went wrong."
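Rolled together, the four objectives become a single training loss. The sketch below is illustrative: the loss weights, the tensor layout, and the learned router projection are assumptions, not our production setup.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_out, teacher_out, expert_projection,
                 weights=(1.0, 0.5, 0.5, 1.0)):
    """student_out / teacher_out: dicts with "logits", "router_logits", "hidden".
    expert_projection: learned matrix mapping teacher experts to merged experts."""
    w_out, w_route, w_hidden, w_selfcorr = weights

    # 1. Output matching: mimic the teacher's token distribution.
    out = F.kl_div(F.log_softmax(student_out["logits"], -1),
                   F.log_softmax(teacher_out["logits"], -1),
                   log_target=True, reduction="batchmean")

    # 2. Router alignment: project the teacher's routing onto the merged experts.
    projected = teacher_out["router_logits"] @ expert_projection
    route = F.kl_div(F.log_softmax(student_out["router_logits"], -1),
                     F.log_softmax(projected, -1),
                     log_target=True, reduction="batchmean")

    # 3. Internal alignment: preserve the pairwise structure of hidden states.
    def gram(h):  # (T, d) -> (T, T) similarity structure
        h = F.normalize(h, dim=-1)
        return h @ h.T
    hidden = F.mse_loss(gram(student_out["hidden"]), gram(teacher_out["hidden"]))

    # 4. Self-correction: teacher-scored loss on the student's own generations,
    #    computed in a separate rollout pass and passed in as a scalar.
    selfcorr = student_out["self_correction_loss"]

    return w_out * out + w_route * route + w_hidden * hidden + w_selfcorr * selfcorr
```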
Stage 4: Faster, faster, faster
A model that fits in memory isn't useful if it generates tokens at 2 per second. And autoregressive generation, where each token depends on all previous tokens, is inherently sequential. You can't parallelize it.
Unless you cheat. (We cheat a lot.)
Speculative decoding uses a small, fast "drafter" model to predict several tokens ahead. For us, this is a small RNN designed specifically for the nuances of Apple Silicon and unified memory. The main model then verifies all the draft tokens in a single forward pass, which can be parallelized. If the drafts are correct (and for predictable sequences like boilerplate code, common syntax, or the middle of a well-established pattern, they usually are), you get multiple tokens for the cost of one main-model evaluation.
We train our drafter as a lightweight recurrent module embedded directly inside the main model. It shares the model's embeddings and early representations, so it has a strong prediction signal without the overhead of a separate model. And we train it specifically on coding tokens, so it learns the patterns that make up the bulk of generated output: syntax, common APIs, boilerplate, closing brackets.
The speedup is significant. For the kind of structured, pattern-heavy output that code generation produces, speculative decoding often verifies 3-5 tokens at once. That's the difference between "unusably slow" and "feels responsive."
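For the curious, here's a skeleton of a single greedy draft-and-verify step. The model interfaces are assumptions made for the sketch, not our actual inference code.

```python
import torch

@torch.no_grad()
def speculative_step(main_model, drafter, tokens, k=4):
    """Both models map a 1-D token sequence to per-position logits (T, vocab)."""
    # 1. The drafter proposes k tokens autoregressively (cheap).
    draft = tokens
    for _ in range(k):
        next_tok = drafter(draft)[-1].argmax().unsqueeze(0)
        draft = torch.cat([draft, next_tok])

    # 2. The main model scores the whole drafted sequence in one pass (parallel).
    logits = main_model(draft)
    predicted = logits[len(tokens) - 1 :].argmax(dim=-1)   # k+1 predictions
    proposed = draft[len(tokens):]                         # the k drafted tokens

    # 3. Accept drafted tokens until the first disagreement, then take the main
    #    model's own token there, so every step yields at least one new token.
    accepted = 0
    while accepted < k and proposed[accepted] == predicted[accepted]:
        accepted += 1
    new_tokens = torch.cat([proposed[:accepted], predicted[accepted : accepted + 1]])
    return torch.cat([tokens, new_tokens])
```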
Stage 5: The final squeeze
The last stage is quantization, or representing weights in lower precision to reduce the memory footprint.
We use mixed-precision quantization, because not all weights deserve the same treatment. Core model weights get 4-bit floating point with block-wise scaling, preserving the dynamic range that matters for quality. Routing and gating layers get 8-bit integer precision, because routing decisions need more headroom to maintain expert selection quality. If you compress these too aggressively, the router starts making bad choices, which cascades through everything. Activations stay at full precision during computation and only get quantized for storage.
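To make "block-wise scaling" concrete, here's a simplified sketch of quantizing a weight tensor block by block onto a small 4-bit float grid, using the standard E2M1 value set as an example. The block size and the grid are illustrative, not our exact format.

```python
import math
import torch

# Signed E2M1 ("FP4") value grid, used here purely as an example.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = torch.cat([-FP4_GRID.flip(0), FP4_GRID])

def quantize_blockwise(weight, block_size=64):
    flat = weight.flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.cat([flat, flat.new_zeros(pad)]).view(-1, block_size)
    # One scale per block preserves dynamic range locally.
    scales = (flat.abs().amax(dim=1, keepdim=True) / FP4_GRID.max()).clamp(min=1e-8)
    # Snap each scaled value to the nearest representable 4-bit value.
    codes = (flat / scales).unsqueeze(-1).sub(FP4_GRID).abs().argmin(dim=-1)
    return codes.to(torch.uint8), scales   # 4-bit codes (stored in uint8) + per-block scales

def dequantize_blockwise(codes, scales, shape):
    values = FP4_GRID[codes.long()] * scales
    return values.flatten()[: math.prod(shape)].view(shape)
```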
What we ended up with
The output of this pipeline is a single model that has the coding knowledge of an 80-billion parameter model, activates only a fraction of those parameters per token, fits in under 16GB of memory, generates tokens fast enough for interactive use, handles both agentic workflows and inline completions, and runs entirely on your hardware with no network dependency.
The compression ratio is dramatic. But what matters isn't the ratio.
It's the capability.
Through careful multi-stage training, progressive compression, and multi-objective distillation, we preserve the vast majority of the original model's coding ability at a fraction of the size.
Getting the model small and capable was the hard part.
But a capable model sitting in memory doesn't help anyone if you can't actually run it efficiently on consumer hardware.
This is Part 2 of a 3-part series on building Rig. Part 1: Teaching a Model to Code covers data generation. Part 3: Building an Inference Engine for On-Device AI Coding covers running a model on device.