Let’s gooooooo!
From the ’60s to the ’80s, we saw fundamental contributions to the study of context-free grammars:
1961 the CYK parsing algorithm was published (Younger 1967)
1968 the Earley parser was published (Earley 1968)
1974 the GLR parser was formulated (Lang 1974), and then in 1984 it was implemented (Tomita 1986)
(… skipping a bunch of other things …)
2023 llama.cpp popularized grammar-constrained decoding for local LLMs (Maurer 2023).
2023 JSON mode finally returns valid JSON (OpenAI 2023)
2025 state-machine grammars become mainstream (Mastra Team 2025). Context-free grammars (CFGs) too: I built a CFG parser to constrain agentic function calls at my current employer.
2025 OpenAI introduced native support for constraining LLM outputs with context-free grammar production rules (OpenAI 2025).
2026 world models have been generating some buzz (Maloo 2026)
Structure is fashionable again. It feels like we’re one good abstraction away from grammars having their moment, and I think that abstraction is finally obvious.
The purpose of this post is to outline a framework to talk about grammar generation, inference, and scoring all at once in a clean interface, with the hope that this explanation will simplify future research efforts on this topic.
# A clean interface
Think of a grammar as a strongly-typed object.
For example, this is G:
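A minimal TypeScript sketch of G, assuming a test-session grammar inferred from the observations rendered below (the names `Rule`, `Grammar`, `Session`, and `Tests` are illustrative, not the interactive demo's exact definition):

```ts
// A plausible shape for G: a session is a setup, one or more tests, then a teardown.
type Rule = { lhs: string; rhs: string[] };       // one production: lhs -> rhs
type Grammar = { start: string; rules: Rule[] };  // a grammar is just typed data

const G: Grammar = {
  start: "Session",
  rules: [
    { lhs: "Session", rhs: ["setup", "Tests", "teardown"] },
    { lhs: "Tests",   rhs: ["test"] },            // a single test...
    { lhs: "Tests",   rhs: ["test", "Tests"] },   // ...or a test followed by more tests
  ],
};
```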
# Top-down generation
You can sample from the grammar (assuming uniform distribution over production choices):
p = G.sample() // returns a Parse
You may get a parse that looks a bit like this:
[parse tree diagram]
Then, you can render data from the parse:
x = G.render(p) // returns an Observation
Rendering drops the latent structure.
`setup test test test teardown`
Sampling is stochastic. So, you may instead get a parse that looks like this:
[parse tree diagram]
Producing parses and rendering observations from those parses is the easy part.
# Bottom-up inference
The harder problem is parsing raw observations.
To demonstrate this, let’s change the grammar to a more interesting one that introduces the concept of running tests in parallel.
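A minimal sketch of that richer grammar, assuming `then` composes blocks sequentially and `and` composes them in parallel (reusing the `Grammar` type from the sketch above; the demo's actual rules may differ):

```ts
// Because "then" and "and" live on the same nonterminal, this grammar is
// ambiguous, which is exactly what the next example exploits.
const G2: Grammar = {
  start: "Session",
  rules: [
    { lhs: "Session", rhs: ["setup", "Block", "teardown"] },
    { lhs: "Block",   rhs: ["test"] },
    { lhs: "Block",   rhs: ["Block", "then", "Block"] },  // sequential composition
    { lhs: "Block",   rhs: ["Block", "and", "Block"] },   // parallel composition
  ],
};
```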
Try parsing this observation:
`setup test then test and test teardown`
We can call infer to discover which parses can explain
this data:
P = G.infer(obs => obs == x) // returns Dist<Parse> | null
There are two potential parses for the same sequence.
Candidate 1:
[parse tree diagram]
Candidate 2:
[parse tree diagram]
That’s why infer(...) returns a probability
distribution.
You can imagine the probability distribution looks somewhat like this:
[probability distribution chart]
Abstracted away is the fact that infer is performing a
complicated parsing algorithm. It’s giving you access to the
posterior:
\[ P(p | x) \propto P(x | p) P(p) \]
And lastly, we can compute a score using score:
s = G.score(p, x) // log P(x | p) + log P(p)
This interface is sufficient for some really powerful algorithms, as we’ll see below.
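Putting it together, the whole interface can be sketched as a few TypeScript declarations (the generic names and the `Dist` shape are illustrative assumptions):

```ts
interface Dist<T> {
  sample(opts?: { reject?: unknown[] }): T;  // draw one value, optionally rejecting some
}

interface GrammarModel<Parse, Obs> {
  sample(): Parse;                                         // top-down generation
  render(p: Parse): Obs;                                   // drop the latent structure
  infer(pred: (obs: Obs) => boolean): Dist<Parse> | null;  // posterior over parses
  score(p: Parse, x: Obs): number;                         // log P(x | p) + log P(p)
}
```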
# Searching in observation space
G.infer(...) can parse partial data, like a prefix of a
sequence, or the list of moves made so far in a game. In other words, it
describes the next possible moves.
Monte Carlo Tree Search (MCTS), on the other hand, is an algorithm that finds you (approximately) the best move.
Don’t get stuck on the name, though. Maybe a better name would have been “Statistical Tree Search.” The “Monte Carlo” part is a nod to the city of Monte Carlo in Monaco, which is famous for its casinos. In the literature, “Monte Carlo” is a loaded term that also implies sampling, repeated simulations, balancing exploration vs. exploitation, and a few other relevant ideas, so unfortunately the name’s here to stay.
\[ \underbrace{ \text{Monte Carlo} \qquad \underbrace{ \text{Tree} \qquad \underbrace{\text{Search}}_{\substack{\\[1.0ex]\text{algorithm}}} }_{\substack{\\[1.3ex]\text{uses a tree data structure}}} }_{\substack{\\[1.6ex]\text{named after Monte Carlo sampling}}} \]
In the algorithm, you construct a tree from scratch, where each path from the root represents the sequence of actions you may take.
Example
The algorithm iterates many times. In each iteration, we build up the tree:
- Pick a promising starting path
- Roll out fully to see where it could end
- Tally up the results
So, to get started, we’ll initialize a bare-bones tree.
// start with just the root node
tree = Tree([
Node({
// initial trivial partial parse
parse: p0,
// observation
obs: [],
// frequency of usage (will be updated by `backprop`)
visits: 0,
// mean utility estimate (will be updated by `backprop`)
value: 0.0,
})
])
The tree has some typical methods, like:
- `add(...)`: adds a new child node onto an existing one
And some atypical methods that are only relevant for MCTS, like:
- `select(...)`: walks from the root to a leaf by repeatedly choosing a child to build a path. It does so using the PUCT/PUCB score (using `scoreFn` as a prior) (Silver et al. 2017); a sketch of that score follows this list.
- `backprop(...)`: updates `node.visits` and `node.value` for every node up the ancestor chain.
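A minimal sketch of that PUCT score, assuming the usual exploit-plus-explore form (the constant `c` and the field names are illustrative):

```ts
type NodeStats = { visits: number; value: number };

// Higher is better; c trades off exploration against exploitation.
function puct(parent: NodeStats, child: NodeStats, prior: number, c = 1.4): number {
  const exploit = child.value;  // mean utility observed through this child
  const explore = c * prior * Math.sqrt(parent.visits) / (1 + child.visits);
  return exploit + explore;
}
```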
Let’s take everything we’ve learned, using infer,
sample, render, and score to
implement MCTS. The following code will be repeated in a loop many
times:
// 1. pick a promising starting path
path = tree.select(node => G.score(node.parse, node.obs))
node = path.at(-1)
// 2. roll out fully to see where it could end
P = G.infer(obs => startsWith(obs, node.obs))
p = P.sample({ reject: node.children.map(c => c.obs) })
x = G.render(p)
nextMove = x.at(node.obs.length)
childObs = [...node.obs, nextMove]
child = tree.add(node, Node(
G.infer(obs => startsWith(obs, childObs)).sample(),
childObs
))
// 3. Tally up the results
u = utility(x)
tree.backprop(path, node, u)
I know, I threw in a bunch of undefined things in there, such as
tree.select and tree.backprop, but you can
imagine an API that lets you vibe-mcts 😎.
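For concreteness, hedged and simplified sketches of those two helpers, written as free functions (field names and signatures are assumptions; `puct` is the score sketched earlier):

```ts
type MctsNode = {
  parse: unknown;
  obs: unknown[];
  visits: number;
  value: number;
  children: MctsNode[];
};

// Walk from the root to a leaf, always taking the child with the best
// PUCT score, using scoreFn as the prior.
function select(root: MctsNode, scoreFn: (n: MctsNode) => number): MctsNode[] {
  const path = [root];
  let node = root;
  while (node.children.length > 0) {
    node = node.children.reduce((best, c) =>
      puct(node, c, scoreFn(c)) > puct(node, best, scoreFn(best)) ? c : best);
    path.push(node);
  }
  return path;
}

// Update visit counts and the running mean utility for every node on the path.
function backprop(path: MctsNode[], utility: number): void {
  for (const node of path) {
    node.visits += 1;
    node.value += (utility - node.value) / node.visits;
  }
}
```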
But c’mon, isn’t that so cool!? You’re basically 80% there with just this interface!
# Searching in grammar space
As mentioned earlier, given a grammar model:
- `G.infer(...)` reveals candidate next moves
- The MCTS algorithm above finds the best move
But, how do you learn a grammar in the first place?
Similar to the algorithm above, let’s start with a root node. This
time, node.obs will be grammar objects:
// start with just the root node
tree = Tree([
Node({
// initial trivial partial parse
parse: p0,
// observation, trivial grammar object
obs: G0,
// frequency of usage (will be updated by `backprop`)
visits: 0,
// mean utility estimate (will be updated by `backprop`)
value: 0.0,
})
])
Define a set of edit actions one can perform on grammars.
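A minimal sketch of what such edit actions could look like, reusing the `Grammar` and `Rule` types from earlier (this particular set is an illustrative assumption, not the post's exact one):

```ts
type GrammarEdit =
  | { kind: "addRule"; lhs: string; rhs: string[] }          // introduce a production
  | { kind: "removeRule"; index: number }                    // delete a production
  | { kind: "appendSymbol"; rule: number; symbol: string };  // grow a rule's right-hand side

// Each edit maps a grammar to a new candidate grammar (copies, never mutates).
function applyEdit(g: Grammar, edit: GrammarEdit): Grammar {
  if (edit.kind === "addRule") {
    return { ...g, rules: [...g.rules, { lhs: edit.lhs, rhs: edit.rhs }] };
  }
  if (edit.kind === "removeRule") {
    return { ...g, rules: g.rules.filter((_, i) => i !== edit.index) };
  }
  // appendSymbol: extend the right-hand side of one existing rule
  return {
    ...g,
    rules: g.rules.map((r, i) =>
      i === edit.rule ? { ...r, rhs: [...r.rhs, edit.symbol] } : r),
  };
}
```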
Do you see where I’m going with this?
Run the same MCTS algorithm above, but instead of searching through the space of observations constrained by grammar actions, search through the space of grammar objects constrained by grammar-edit actions.
For example, let’s say you’d like to learn how to make a peanut butter and jelly (PB&J) sandwich. Let’s establish some atomic concepts:
- nouns:
  - `bread`
  - Jelly (`strawberry`, `grape`)
  - Peanut butter (`crunchy`, `smooth`)
- verbs:
  - `spread` (for spreading things on bread)
  - `stack` (for stacking slices of bread)
The following grammar generates instructions on making PB&J sandwiches:
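A minimal sketch of what such a grammar could look like, reusing the `Grammar` type from earlier (illustrative only; the demo's actual rules may differ). Read each rendered sequence as Forth-style postfix: operands first, then the word that consumes them.

```ts
const pbj: Grammar = {
  start: "Sandwich",
  rules: [
    { lhs: "Sandwich",     rhs: ["PbSlice", "JellySlice", "stack"] }, // stack the two slices
    { lhs: "PbSlice",      rhs: ["bread", "PeanutButter", "spread"] },
    { lhs: "JellySlice",   rhs: ["bread", "Jelly", "spread"] },
    { lhs: "PeanutButter", rhs: ["crunchy"] },
    { lhs: "PeanutButter", rhs: ["smooth"] },
    { lhs: "Jelly",        rhs: ["strawberry"] },
    { lhs: "Jelly",        rhs: ["grape"] },
  ],
};
```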
Sampling from that grammar will produce a sequence of words that may at first glance sound like gibberish, but I’ve been careful to ensure it compiles to valid Forth (Moore 1970) code. Hit the “Sample & Run” button to see for yourself!
Searching in grammar space means starting from a trivial grammar, and performing a sequence of edits to discover an optimal grammar. And, it’s harder than it sounds.
Can you edit this grammar below into the one above?
Click the edit operations above to transform the grammar. I gave up after 20 seconds.
The goal of the MCTS algorithm is to find the sequence of actions that will get us there, by mutating the grammar to maximize a pre-defined utility.
# Grammar synthesis
In the demo below, click “Start Search” to watch MCTS discover a grammar that matches a set of target sequences:
The search uses MCTS over grammar-edit actions and scores candidates.
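A minimal sketch of one plausible scoring function, assuming a fit-minus-complexity trade-off (the helper name, the penalty weight, and the equality check are illustrative assumptions, not necessarily what the demo uses):

```ts
function candidateUtility(
  g: GrammarModel<unknown, string[]>,  // candidate grammar, via the interface above
  grammarSize: number,                 // e.g. total number of symbols in its rules
  targets: string[][],                 // training sequences the grammar should explain
): number {
  let fit = 0;
  for (const x of targets) {
    const posterior = g.infer(obs => JSON.stringify(obs) === JSON.stringify(x));
    if (posterior === null) {
      fit += -1000;                            // heavy penalty for unparseable targets
    } else {
      fit += g.score(posterior.sample(), x);   // log P(x | p) + log P(p) for one parse
    }
  }
  return fit - 0.5 * grammarSize;              // 0.5 is an arbitrary complexity weight
}
```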
After you run the search, pay special attention to the “novel idea” section. It’ll show you an observation that doesn’t exist in the training data. You can hit run to see the idea play out.
The beauty of grammar synthesis is that the learned model’s divergence from the training data is a feature, not a bug!
For a deeper dive into this approach, see my dissertation (Shukla 2019).
# References
Earley, Jay. 1968. “An Efficient Context-Free Parsing Algorithm.” PhD thesis, Carnegie Mellon University.
Lang, Bernard. 1974. “Deterministic Techniques for Efficient Non-Deterministic Parsers.” Automata, Languages and Programming, Lecture notes in computer science, 14: 255–69.
Maloo, Ankit. 2026. “World Models.” https://ankitmaloo.com/world-models/.
Mastra Team. 2025. “State-Machine Grammar for Agent Behavior.” May 2025. https://mastra.ai/blog/changelog-2025-05-07.
Maurer, Iwan. 2023. “Grammar-Constrained Decoding for Local LLMs.” 2023. https://www.imaurer.com.
Moore, Charles H. 1970. “FORTH: A New Way to Program a Mini-Computer.” In Astronomical Society of the Pacific Conference Series, 5:8–10.
OpenAI. 2023. “New Models and Developer Products Announced at DevDay.” November 6, 2023. https://openai.com/index/new-models-and-developer-products-announced-at-devday/.
———. 2025. “GPT-5 New Parameters and Tools: Context-Free Grammar Support.” August 2025. https://developers.openai.com/cookbook/examples/gpt-5/gpt-5_new_params_and_tools.
Shukla, Nishant. 2019. “Utility Learning, Non-Markovian Planning, and Task-Oriented Programming Language.” PhD thesis, University of California, Los Angeles. https://escholarship.org/uc/item/2nj5n4f1#page=83.
Silver, David, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, et al. 2017. “Mastering the Game of Go Without Human Knowledge.” Nature 550: 354–59. https://doi.org/10.1038/nature24270.
Tomita, Masaru. 1986. Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems. Kluwer Academic Publishers.
Younger, Daniel H. 1967. “Recognition and Parsing of Context-Free Languages in Time n³.” Information and Control 10 (2): 189–208.