Rules, Not Weights

Mainstream ML fixes the scoring rule and trains the weights. We're exploring the opposite — searching the rule.

In mainstream ML, the scoring rule is part of the backbone — softmax over logits, cross-entropy loss, the attention pattern — and iteration works by adapting weights to that fixed rule. From GPT-2 through Llama 3 and Mistral, the pretraining rule is the same: next-token cross-entropy on softmax over logits. What moved between them was parameter count, architecture details like normalization and positional encoding, tokenizer, and training data. The scoring rule itself is a given.

We're exploring the opposite. Weights still get trained for every candidate — gradient descent does not go away — but the outer search variable is the scoring rule itself. A small symbolic engine makes the rule a term — data a search can compare, mutate, and dedup — so rule search becomes tractable.

Three background forces made leaving the rule fixed rational. Autograd made any well-behaved forward cheap to differentiate. Scaling laws rewarded growing parameters and data against the same loss. Softmax cross-entropy matured into a default — MLE for a categorical, p − y gradients, clean composition with attention. A fourth force kept the rule fixed in practice: the cost of rule search itself.

That fourth force is what our engine changes. Even granting cheap autograd — even granting an LLM that can write candidate forwards on demand — a search still needs a representation it can compare, mutate, and cache, and a population of Python modules is not that. Without tooling that makes a rule into searchable data, rule search at grammar scale is not something a team would run.

Most production training stacks build an autograd graph at runtime. PyTorch's backward and TensorFlow's GradientTape record operations as the forward pass runs, then walk the recorded graph to compute gradients. That makes any fixed forward cheap to differentiate — but the forward is code, and searching code is expensive even when an LLM writes the candidates. Two Python modules that compute near-equivalent things can look arbitrarily different, a population of modules does not support structural crossover or dedup, and the gradient lives inside a tape rather than in a form you can inspect or hash.

Our engine makes the rule a term. defsymbolic reads an s-expression, simplifies it, differentiates it symbolically once at macro-expansion time, and stores the expression beside its per-parameter gradient map. Search becomes data manipulation — canonicalize, hash, dedup, mutate subtrees, emit thousands — and the training loop consumes the same-shaped artifact regardless of which expression produced it.
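To make "search becomes data manipulation" concrete, here is a minimal sketch of dedup over a candidate population, assuming the parse-expr and simplify passes described below; the helper name dedup-rules is ours, not the engine's.

(defn dedup-rules
  [forms]
  ;; Canonicalize each source form into a simplified tagged tree, then rely
  ;; on Clojure's structural value equality to drop duplicate rules.
  (->> forms
       (map #(-> % parse-expr simplify))
       distinct))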

Think of the engine as a compiler for scoring rules. A developer writes a rule as an expression; parse-expr at src/wave_grad/eml.clj:11-37 reads the expression into a tagged tree of :add, :mul, :exp, :log, :eml nodes; simplify at src/wave_grad/eml.clj:48-102 folds constants, eliminates zeros and ones, and flattens associative operators so derivatives do not blow up; diff at src/wave_grad/eml.clj:104-133 walks the tree once per learnable weight to produce another tagged expression; and defsymbolic at src/wave_grad/eml.clj:154-162 stores the expression and the gradient map side by side on the resulting def. Parse, simplify, differentiate, store. Four passes and a macro:

(defmacro defsymbolic
  [name params form]
  (let [;; Passes 1 and 2: read the source form into a tagged tree, then fold
        ;; constants and flatten associative operators.
        expr  (-> form parse-expr simplify)
        ;; Pass 3: differentiate once per learnable weight, at macro-expansion time.
        grads (zipmap params (map #(diff expr %) params))]
    ;; Pass 4: store the expression and its gradient map side by side on the def.
    `(def ~name
       {:name ~(keyword name)
        :parameters '~(vec params)
        :expression '~expr
        :gradients '~grads})))

What the developer hands in looks like a piece of math. What comes back is the trainable artifact the training loop already consumes — no hand-written backward pass, no autodiff tape, no retraining-stack rebuild between experiments.
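As a concrete round trip — with the rule, the weight name, and the printed tree shapes all assumed for illustration, since the post does not show the engine's internal representation — a one-weight rule might come back as:

(defsymbolic linear-score [w]
  (add (mul w relevance) 1))

;; linear-score is now a plain map, roughly:
;; {:name       :linear-score
;;  :parameters [w]
;;  :expression [:add [:mul w relevance] 1]
;;  :gradients  {w relevance}}   ; d/dw (w·relevance + 1) = relevance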

A rule is a short expression that scores one candidate — one token in a sequence, one line in a story, one memory chunk in a retrieval window. The expression has learnable weights inside it, but the weight vector is not the unit of iteration. The expression is. Weights are trained per candidate rule by the inner loop; the outer loop swaps the rule. We swap one expression for another, the engine re-differentiates the new one at compile time, the gradient map moves with the rule, and the training loop runs without code changes.
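A minimal sketch of that two-level loop, with train-weights and evaluate standing in for the inner gradient-descent loop and the benchmark harness — both names are assumptions, not engine functions:

(defn search-rules
  [candidate-rules]
  (apply max-key :score
         (for [rule candidate-rules]
           ;; Inner loop: fit this rule's weights by gradient descent.
           (let [weights (train-weights rule)]
             ;; Outer loop: score the trained rule so rules can be compared.
             {:rule rule :weights weights :score (evaluate rule weights)}))))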

The primitive the engine is shaped around is EML. Softmax normalizes across a sequence — every token's weight depends on every other token in the same step. EML scores each token on its own evidence, so scores become composable across sequences, comparable across positions, and directly thresholdable. Softmax does not give any of those properties for free.

The base operator is one line (Koji Odrzywołek, All elementary functions from a single operator, https://arxiv.org/html/2603.21852v2):

eml(x, y) = e^x − ln(y)

This formula means: take the exponential of the first argument and subtract the natural log of the second. Pairing EML with the constant 1 is enough to generate every elementary function — the NAND-gate analogy for continuous math.
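Two quick recoveries as a sketch — eml here is a plain numeric helper of ours, and ordinary arithmetic is used as glue rather than derived from the operator:

(defn eml [x y] (- (Math/exp x) (Math/log y)))

(defn exp' [x] (eml x 1.0))           ; e^x − ln 1        => e^x
(defn ln'  [y] (- 1.0 (eml 0.0 y)))   ; 1 − (e^0 − ln y)  => ln y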

We call EML with a structured denominator: score = eml(signal_sum, 1 + exp(damping_sum)), which reduces — because ln(1 + e^x) is softplus — to exp(signal) − softplus(damping). This formula means: two learned weighted sums, where the signal sum says why a token might matter and the damping sum penalizes why it might be noisy or stale.
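A numeric sanity check of that reduction, reusing the eml helper from the sketch above (softplus is the standard definition, not an engine function):

(defn softplus [x] (Math/log (+ 1.0 (Math/exp x))))

(defn score [signal damping]
  (eml signal (+ 1.0 (Math/exp damping))))

;; (score 0.5 -1.0)                    => 1.33546...
;; (- (Math/exp 0.5) (softplus -1.0))  => 1.33546...  (identical)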

The derivative is also one line:

D[eml(u, v)] = e^u · u' − v' / v

This formula means: the gradient of an EML node is the exponential of the first argument u times u's own gradient, minus the second argument v's gradient divided by v itself. Compose EML into larger weighted expressions and the gradient comes back as another expression tree. This is the reason :eml is a named node type in the engine rather than desugared into exp and log — the tidy derivative is worth encoding once at the primitive.
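A sketch of how that case might look inside the tree walker — the callback style and the :sub and :div tags are assumptions; the post names only :add, :mul, :exp, :log, and :eml nodes:

(defn diff-eml
  [[_ u v] diff wrt]
  ;; D[eml(u, v)] = e^u · u' − v' / v, returned as another tagged tree.
  [:sub [:mul [:exp u] (diff u wrt)]
        [:div (diff v wrt) v]])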

Because a new rule is a one-line source edit, a generator can write hundreds of rules from a grammar and the training loop consumes the lot without any code change. Rule design becomes rule search. Three stages run the same loop against progressively more realistic settings.
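Before walking the stages, here is a sketch of what that generation step could look like — the feature names and the two-slot grammar are illustrative, not the project's actual grammar:

(def signal-terms  '[relevance recency (mul relevance recency)])
(def damping-terms '[staleness length (add staleness length)])

(defn candidate-rules
  []
  ;; Cross the two slots: 3 × 3 = 9 candidate source expressions, each in the
  ;; shape eml(w1 · signal, 1 + exp(w2 · damping)) from the section above.
  (for [s signal-terms, d damping-terms]
    (list 'eml (list 'mul 'w1 s)
          (list 'add 1 (list 'exp (list 'mul 'w2 d))))))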

Stage one is a synthetic hard suite. The headline task and-match rewards tokens that match both a color clue and a shape clue; the eml-gated-symbolic rule lands at 0.613 final answer score against an oracle upper bound of 0.663, and the best linear baseline reaches 0.500. Across the harder task variants, different EML rules win — eml-symbolic takes the hardest variant at 0.600 against 0.525 for the linear baselines — but no single rule dominates the suite. The interesting finding is not which rule is best; it is that the engine reaches the top group on every task by surfacing a task-shaped rule fast, and a different task rewards a different rule shape.

Stage two is bAbI supporting-fact retrieval on tasks 2 and 3. A broad sweep grammar generated one-term candidates; an expanded sweep added a second damping term to the survivors. The rule that won — eml-auto-03-03-d-drop, a two-term damping variant — landed at support recall 0.204 against the linear baseline's 0.051, four-to-one. Exact retrieval is still 0.000 across the board: the system finds supporting facts above chance, but it does not yet return them in the top slot. The interesting finding is the honest split between the metrics — the proxy moves decisively while the strictest stays at floor — and that the winning rule in the final rerun did not come from a human picking a shape.

Stage three is TinyStories memory selection inside the nanochat inference path — a real-language next-chunk benchmark on a sparse memory budget. The searched rule eml-ts-length_norm-quote_recent-none lands at average continuation loss 3.8344 against 3.8911 for the fixed sparse-memory baseline, 4.0311 for recent-only truncation, and 3.6815 for full context as upper bound. Target overlap moves the same way: 0.2077 for EML against 0.1513 and 0.0815 for the sparse baselines. The interesting finding is that the same iteration loop — engine, rule grammar, sweep, rerun gate — transfers into a real GPT-style inference path on real-language data, and the rule it surfaces beats both sparse baselines without rewiring the backbone.

The memory rule search is not the only track. A parallel experiment applied EML to gradient update rules on the same TinyStories model. Three designs — gradient gating, optimizer replacement, and lr-scaled correction — ran against a baseline Muon optimizer across four seeds (320 steps, 6-layer, batch 4). Averaged across seeds: correction at 3.1726 validation loss, baseline 3.1931, gating 3.1943, replacement 3.2248. Correction beats baseline on every individual seed. The result that initially favored replacement on a short run reversed when the training horizon grew — a finding only visible through iteration at the rule level. The candidate set was hand-designed, not searched. The direction is consistent; the search has not run.

In stage one a human picks the shape and the engine differentiates it. In stage two a sweep grammar picks the shape and the rerun gate keeps it or discards it. Rule search stops being manual. The same loop carried from synthetic tasks to offline retrieval to a live generation path without the backbone changing shape. The thing being searched in all three stages was the expression, not the weights. The loop is not hyperparameter sweeping, and it is not architecture search — it is a search over the expression we score with.

None of this says the engine has solved anything. Exact retrieval on bAbI is 0.000. On TinyStories the prefix-match rate is 0.000 across every strategy — EML selects the right material, but selection is not yet generation. The continuation-loss gap between EML and the fixed sparse baseline is real but small in absolute terms (3.8344 against 3.8911). The synthetic wins are task-local: different rules win on different variants, not one rule across the suite. The 0.204 support recall on bAbI aggregates over a split: Task 2 recall is 0.350, Task 3 is 0.058. The claim this post supports is narrower — rule-level iteration is cheap enough to run an algorithmic search loop, the loop surfaces rules a designer would not have written, and those rules carry across three settings without the backbone changing shape. How far those rules can be pushed is the next question.