Let’s gooooooo!
From the ’60s to the ’80s, we saw fundamental contributions to the study of context-free grammars:
1961 the CYK parsing algorithm was published (Younger 1967)
1968 the Earley parser was published (Earley 1968)
1974 the GLR parser was formulated (Lang 1974), and then in 1984 it was implemented (Tomita 1986)
(… skipping a bunch of other things …)
2023 llama.cpp popularized grammar-constrained decoding for local LLMs (Maurer 2023).
2023 JSON mode finally returns valid JSON (OpenAI 2023)
2025 state-machine grammars become mainstream (Mastra Team 2025). Context-free grammars (CFGs) too: I built a CFG parser to constrain agentic function calls at my current employer.
2025 OpenAI introduced native support for constraining LLM outputs with context-free grammar production rules (OpenAI 2025).
2026 world models have been generating some buzz (Maloo 2026)
Structure is fashionable again. It feels like we’re one good abstraction away from grammars having their moment, and I think that abstraction is finally obvious.
The purpose of this post is to outline a framework to talk about grammar generation, inference, and scoring all at once in a clean interface, with the hope that this explanation will simplify future research efforts on this topic.
# A clean interface
Think of a grammar as a strongly-typed object.
For example, this is G:
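A minimal TypeScript sketch of G, assuming a test-session grammar inferred from the observations rendered below (the names `Rule`, `Grammar`, `Session`, and `Tests` are illustrative, not the interactive demo's exact definition):

```ts
// A plausible shape for G: a session is a setup, one or more tests, then a teardown.
type Rule = { lhs: string; rhs: string[] };       // one production: lhs -> rhs
type Grammar = { start: string; rules: Rule[] };  // a grammar is just typed data

const G: Grammar = {
  start: "Session",
  rules: [
    { lhs: "Session", rhs: ["setup", "Tests", "teardown"] },
    { lhs: "Tests",   rhs: ["test"] },            // a single test...
    { lhs: "Tests",   rhs: ["test", "Tests"] },   // ...or a test followed by more tests
  ],
};
```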
# Top-down generation
You can sample from the grammar (assuming uniform distribution over production choices):
p = G.sample() // returns a Parse
You may get a parse that looks a bit like this:
[parse tree diagram]
Then, you can render data from the parse:
x = G.render(p) // returns an Observation
Rendering drops the latent structure.
`setup test test test teardown`
Sampling is stochastic. So, you may instead get a parse that looks like this:
[parse tree diagram]
Producing parses and rendering observations from those parses is the easy part.
# Bottom-up inference
The harder problem is parsing raw observations.
To demonstrate this, let’s change the grammar to a more interesting one that introduces the concept of running tests in parallel.
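A minimal sketch of that richer grammar, assuming `then` composes blocks sequentially and `and` composes them in parallel (reusing the `Grammar` type from the sketch above; the demo's actual rules may differ):

```ts
// Because "then" and "and" live on the same nonterminal, this grammar is
// ambiguous, which is exactly what the next example exploits.
const G2: Grammar = {
  start: "Session",
  rules: [
    { lhs: "Session", rhs: ["setup", "Block", "teardown"] },
    { lhs: "Block",   rhs: ["test"] },
    { lhs: "Block",   rhs: ["Block", "then", "Block"] },  // sequential composition
    { lhs: "Block",   rhs: ["Block", "and", "Block"] },   // parallel composition
  ],
};
```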
Try parsing this observation:
`setup test then test and test teardown`
We can call infer to discover which parses can explain
this data:
P = G.infer(obs => obs == x) // returns Dist<Parse> | null
There are two potential parses for the same sequence.
Candidate 1:
[parse tree diagram]
Candidate 2:
[parse tree diagram]
That’s why infer(...) returns a probability
distribution.
You can imagine the probability distribution looks somewhat like this:
[probability distribution chart]
Abstracted away is the fact that infer is performing a
complicated parsing algorithm. It’s giving you access to the
posterior:
\[ P(p | x) \propto P(x | p) P(p) \]
And lastly, we can compute a score using score:
s = G.score(p, x) // log P(x | p) + log P(p)
This interface is sufficient for some really powerful algorithms, as we’ll see below.
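Putting it together, the whole interface can be sketched as a few TypeScript declarations (the generic names and the `Dist` shape are illustrative assumptions):

```ts
interface Dist<T> {
  sample(opts?: { reject?: unknown[] }): T;  // draw one value, optionally rejecting some
}

interface GrammarModel<Parse, Obs> {
  sample(): Parse;                                         // top-down generation
  render(p: Parse): Obs;                                   // drop the latent structure
  infer(pred: (obs: Obs) => boolean): Dist<Parse> | null;  // posterior over parses
  score(p: Parse, x: Obs): number;                         // log P(x | p) + log P(p)
}
```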
# Searching in observation space
G.infer(...) can parse partial data, like a prefix of a
sequence, or the list of moves made so far in a game. In other words, it
describes the next possible moves.
Monte Carlo Tree Search (MCTS), on the other hand, is an algorithm that finds you (approximately) the best move.
Don’t get stuck on the name, though. Maybe a better name would have been “Statistical Tree Search.” The “Monte Carlo” part is a nod to the city of Monte Carlo in Monaco, which is famous for its casinos. In the literature, “Monte Carlo” is a loaded term that also implies sampling, repeated simulations, balancing exploration vs. exploitation, and a few other relevant ideas, so unfortunately the name’s here to stay.
\[ \underbrace{ \text{Monte Carlo} \qquad \underbrace{ \text{Tree} \qquad \underbrace{\text{Search}}_{\substack{\\[1.0ex]\text{algorithm}}} }_{\substack{\\[1.3ex]\text{uses a tree data structure}}} }_{\substack{\\[1.6ex]\text{named after Monte Carlo sampling}}} \]
In the algorithm, you construct a tree from scratch, where each path from the root represents the sequence of actions you may take.
Example
The algorithm iterates many times. In each iteration, we build up the tree:
- Pick a promising starting path
- Roll out fully to see where it could end
- Tally up the results
So, to get started, we’ll initialize a bare-bones tree.
// start with just the root node
tree = Tree([
Node({
// initial trivial partial parse
parse: p0,
// observation
obs: [],
// frequency of usage (will be updated by `backprop`)
visits: 0,
// mean utility estimate (will be updated by `backprop`)
value: 0.0,
})
])
The tree has some typical methods, like:
- `add(...)`: adds a new child node onto an existing one
And some atypical methods that are only relevant for MCTS, like:
- `select(...)`: walks from the root to a leaf by repeatedly choosing a child to build a path. It does so using the PUCT/PUCB score (using `scoreFn` as a prior) (Silver et al. 2017); a sketch of that score follows this list.
- `backprop(...)`: updates `node.visits` and `node.value` for every node up the ancestor chain.
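A minimal sketch of that PUCT score, assuming the usual exploit-plus-explore form (the constant `c` and the field names are illustrative):

```ts
type NodeStats = { visits: number; value: number };

// Higher is better; c trades off exploration against exploitation.
function puct(parent: NodeStats, child: NodeStats, prior: number, c = 1.4): number {
  const exploit = child.value;  // mean utility observed through this child
  const explore = c * prior * Math.sqrt(parent.visits) / (1 + child.visits);
  return exploit + explore;
}
```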
Let’s take everything we’ve learned, using infer,
sample, render, and score to
implement MCTS. The following code will be repeated in a loop many
times:
// 1. pick a promising starting path
path = tree.select(node => G.score(node.parse, node.obs))
node = path.at(-1)
// 2. roll out fully to see where it could end
P = G.infer(obs => startsWith(obs, node.obs))
p = P.sample({ reject: node.children.map(c => c.obs) })
x = G.render(p)
nextMove = x.at(node.obs.length)
childObs = [...node.obs, nextMove]
child = tree.add(node, Node(
G.infer(obs => startsWith(obs, childObs)).sample(),
childObs
))
// 3. Tally up the results
u = utility(x)
tree.backprop(path, node, u)
I know, I threw in a bunch of undefined things in there, such as
tree.select and tree.backprop, but you can
imagine an API that lets you vibe-mcts 😎.
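For concreteness, hedged and simplified sketches of those two helpers, written as free functions (field names and signatures are assumptions; `puct` is the score sketched earlier):

```ts
type MctsNode = {
  parse: unknown;
  obs: unknown[];
  visits: number;
  value: number;
  children: MctsNode[];
};

// Walk from the root to a leaf, always taking the child with the best
// PUCT score, using scoreFn as the prior.
function select(root: MctsNode, scoreFn: (n: MctsNode) => number): MctsNode[] {
  const path = [root];
  let node = root;
  while (node.children.length > 0) {
    node = node.children.reduce((best, c) =>
      puct(node, c, scoreFn(c)) > puct(node, best, scoreFn(best)) ? c : best);
    path.push(node);
  }
  return path;
}

// Update visit counts and the running mean utility for every node on the path.
function backprop(path: MctsNode[], utility: number): void {
  for (const node of path) {
    node.visits += 1;
    node.value += (utility - node.value) / node.visits;
  }
}
```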
But c’mon, isn’t that so cool!? You’re basically 80% there with just this interface!
# Searching in grammar space
As mentioned earlier, given a grammar model:
- `G.infer(...)` reveals candidate next moves
- The MCTS algorithm above finds the best move
But, how do you learn a grammar in the first place?
Similar to the algorithm above, let’s start with a root node. This
time, node.obs will be grammar objects:
// start with just the root node
tree = Tree([
Node({
// initial trivial partial parse
parse: p0,
// observation, trivial grammar object
obs: G0,
// frequency of usage (will be updated by `backprop`)
visits: 0,
// mean utility estimate (will be updated by `backprop`)
value: 0.0,
})
])
Define a set of edit actions one can perform on grammars.
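A minimal sketch of what such edit actions could look like, reusing the `Grammar` and `Rule` types from earlier (this particular set is an illustrative assumption, not the post's exact one):

```ts
type GrammarEdit =
  | { kind: "addRule"; lhs: string; rhs: string[] }          // introduce a production
  | { kind: "removeRule"; index: number }                    // delete a production
  | { kind: "appendSymbol"; rule: number; symbol: string };  // grow a rule's right-hand side

// Each edit maps a grammar to a new candidate grammar (copies, never mutates).
function applyEdit(g: Grammar, edit: GrammarEdit): Grammar {
  if (edit.kind === "addRule") {
    return { ...g, rules: [...g.rules, { lhs: edit.lhs, rhs: edit.rhs }] };
  }
  if (edit.kind === "removeRule") {
    return { ...g, rules: g.rules.filter((_, i) => i !== edit.index) };
  }
  // appendSymbol: extend the right-hand side of one existing rule
  return {
    ...g,
    rules: g.rules.map((r, i) =>
      i === edit.rule ? { ...r, rhs: [...r.rhs, edit.symbol] } : r),
  };
}
```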
Do you see where I’m going with this?
Run the same MCTS algorithm above, but instead of searching through the space of observations constrained by grammar actions, search through the space of grammar objects constrained by grammar-edit actions.
For example, let’s say you’d like to learn how to make a peanut butter and jelly (PB&J) sandwich. Let’s establish some atomic concepts:
- nouns:
  - `bread`
  - Jelly (`strawberry`, `grape`)
  - Peanut butter (`crunchy`, `smooth`)
- verbs:
  - `spread` (for spreading things on bread)
  - `stack` (for stacking slices of bread)
The following grammar generates instructions on making PB&J sandwiches:
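A minimal sketch of what such a grammar could look like, reusing the `Grammar` type from earlier (illustrative only; the demo's actual rules may differ). Read each rendered sequence as Forth-style postfix: operands first, then the word that consumes them.

```ts
const pbj: Grammar = {
  start: "Sandwich",
  rules: [
    { lhs: "Sandwich",     rhs: ["PbSlice", "JellySlice", "stack"] }, // stack the two slices
    { lhs: "PbSlice",      rhs: ["bread", "PeanutButter", "spread"] },
    { lhs: "JellySlice",   rhs: ["bread", "Jelly", "spread"] },
    { lhs: "PeanutButter", rhs: ["crunchy"] },
    { lhs: "PeanutButter", rhs: ["smooth"] },
    { lhs: "Jelly",        rhs: ["strawberry"] },
    { lhs: "Jelly",        rhs: ["grape"] },
  ],
};
```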
Sampling from that grammar will produce a sequence of words that may at first glance sound like gibberish, but I’ve been careful to ensure it compiles to valid Forth (Moore 1970) code. Hit the “Sample & Run” button to see for yourself!
Searching in grammar space means starting from a trivial grammar, and performing a sequence of edits to discover an optimal grammar. And, it’s harder than it sounds.
Can you edit this grammar below into the one above?
Click the edit operations above to transform the grammar. I gave up after 20 seconds.
The goal of the MCTS algorithm is to find the sequence of actions that will get us there, by mutating the grammar to maximize a pre-defined utility.
# Grammar synthesis
In the demo below, click “Start Search” to watch MCTS discover a grammar that matches a set of target sequences:
The search uses MCTS over grammar-edit actions and scores candidates.
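A minimal sketch of one plausible scoring function, assuming a fit-minus-complexity trade-off (the helper name, the penalty weight, and the equality check are illustrative assumptions, not necessarily what the demo uses):

```ts
function candidateUtility(
  g: GrammarModel<unknown, string[]>,  // candidate grammar, via the interface above
  grammarSize: number,                 // e.g. total number of symbols in its rules
  targets: string[][],                 // training sequences the grammar should explain
): number {
  let fit = 0;
  for (const x of targets) {
    const posterior = g.infer(obs => JSON.stringify(obs) === JSON.stringify(x));
    if (posterior === null) {
      fit += -1000;                            // heavy penalty for unparseable targets
    } else {
      fit += g.score(posterior.sample(), x);   // log P(x | p) + log P(p) for one parse
    }
  }
  return fit - 0.5 * grammarSize;              // 0.5 is an arbitrary complexity weight
}
```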
After you run the search, pay special attention to the “novel idea” section. It’ll show you an observation that doesn’t exist in the training data. You can hit run to see the idea play out.
The beauty of grammar synthesis is that the learned model’s divergence from the training data is a feature, not a bug!
For a deeper dive into this approach, see my dissertation (Shukla 2019).
# References
Earley, Jay. 1968. “An Efficient Context-Free Parsing Algorithm.” PhD thesis, Carnegie Mellon University.
Lang, Bernard. 1974. “Deterministic Techniques for Efficient Non-Deterministic Parsers.” Automata, Languages and Programming, Lecture notes in computer science, 14: 255–69.
Maloo, Ankit. 2026. “World Models.” https://ankitmaloo.com/world-models/.
Mastra Team. 2025. “State-Machine Grammar for Agent Behavior.” May 2025. https://mastra.ai/blog/changelog-2025-05-07.
Maurer, Iwan. 2023. “Grammar-Constrained Decoding for Local LLMs.” 2023. https://www.imaurer.com.
Moore, Charles H. 1970. “FORTH: A New Way to Program a Mini-Computer.” In Astronomical Society of the Pacific Conference Series, 5:8–10.
OpenAI. 2023. “New Models and Developer Products Announced at DevDay.” November 6, 2023. https://openai.com/index/new-models-and-developer-products-announced-at-devday/.
———. 2025. “GPT-5 New Parameters and Tools: Context-Free Grammar Support.” August 2025. https://developers.openai.com/cookbook/examples/gpt-5/gpt-5_new_params_and_tools.
Shukla, Nishant. 2019. “Utility Learning, Non-Markovian Planning, and Task-Oriented Programming Language.” PhD thesis, University of California, Los Angeles. https://escholarship.org/uc/item/2nj5n4f1#page=83.
Silver, David, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, et al. 2017. “Mastering the Game of Go Without Human Knowledge.” Nature 550: 354–59. https://doi.org/10.1038/nature24270.
Tomita, Masaru. 1986. Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems. Kluwer Academic Publishers.
Younger, Daniel H. 1967. “Recognition and Parsing of Context-Free Languages in Time n³.” Information and Control 10 (2): 189–208.