The missed opportunity of constrained decoding


Large Language Models (LLMs) shouldn’t produce structured output like this:

    


{ · ·" location ": ·" Ad elaide /A ustr alia ", · ·" unit ": ·" c elsius ", · ·" date ": ·" 202 5 - 11 - 09 " }

They should produce it like this:

    


{ · ·" location ": ·" Ad elaide /A ustr alia ", · ·" unit ": ·" c elsius ", · ·" date ": ·" 202 5 - 11 - 09 " }

Today we will cover the performance optimisation of ‘zero-entropy token skipping’, also known as jump tokens. We will benchmark the production APIs of Anthropic, Google and OpenAI, and compare them to our implementation. Finally, we will discuss industry incentives for increased adoption, with some tangents along the way.

This article assumes familiarity with roughly how LLMs work, including concepts such as “LLMs ingest and output tokens”. The GPT-5 (o200k_base) tokeniser is used for all visualisations. There are many diagrams to walk you through the concepts; it should be fun!

Structured Decoding

After an LLM processes a sequence of input tokens, it generates a probability distribution over each of the next possible tokens. That distribution is sampled to obtain the next token. This process repeats until a <|return|> (or equivalent) token is generated, and is the core of the autoregressive stage of LLM generation.
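As a minimal sketch of that loop (the Model interface, the <|return|> token id and the greedy sampler are stand-ins for the real inference stack, not any particular library):

// Minimal sketch of the autoregressive loop. Only the shape of the loop matters here.
interface Model {
  // One logit per vocabulary entry, for the next position.
  nextTokenLogits(tokens: number[]): Promise<Float32Array>
}

const RETURN_TOKEN = 199_999 // hypothetical id for <|return|>

function sample(logits: Float32Array): number {
  // Greedy sampling for simplicity; real samplers apply temperature / top-p.
  let best = 0
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i
  }
  return best
}

async function generate(model: Model, prompt: number[]): Promise<number[]> {
  const tokens = [...prompt]
  while (true) {
    const logits = await model.nextTokenLogits(tokens) // one forward pass per output token
    const next = sample(logits)
    tokens.push(next)
    if (next === RETURN_TOKEN) break
  }
  return tokens
}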

This works well for free-form text, but sometimes that entire distribution isn’t desirable for the task at hand. Tool calls and classification tasks require structured output.

Let’s look at a getHistoricalWeather tool call as an example, requiring a location, unit and date. The LLM has already generated the following text:

    
    
{ · ·" location ": ·" Ad elaide /A ustr alia ", · ·" unit ": ·"

We can visualise the next token probability distributions as columns in a grid. Here we have the next 4 autoregressive steps. The center row shows the path taken: ·" c elsius ",↵ . The tokens above and below are the other options, the branching paths that could have been taken. At the step where c was selected, fahren or kel were other possible branches that were not taken.

  
  
       ·["     rank       els      "↵↵      ·z    fahren     ",↵     "↵      ·"     c     elsius    ",↵      ·[    kel      "↵      ",      ·    Kel    ertain       "      
  

Some of these paths are reasonable, such as predicting kel (vin), fahren (heit) or our canonical c (elsius).

Some of these paths are unreasonable, the rank token is the beginning of a rank ine unit selection. While it is a unit of temperature, it’s rarely used, and not supported by our underlying tool.

Some of these paths are inconvenient, the Kel prediction is semantically reasonable, kelvin is a supported unit, but we don’t want the capitalised form.

Some of these paths are simply incorrect. Predicting the beginning of an array via ·[" or ·[ doesn’t match our desired output format, and c ertain isn’t a unit of temperature.

While picking the most likely token at each step turned out to be a valid path in this (constructed) example, that’s not always the case. There are many paths that produce incorrect output. While frontier LLMs are “pretty good” at producing correctly formatted outputs, chance is a weak defense at scale.

Of course, we don’t need to rely on chance for this kind of correctness. Let’s define a schema to restrict the possible paths. Here is a reasonable schema for our tool call, using the zod validation library.

const getHistoricalWeatherSchema = z.object({
  location: z.string(),
  unit: z.union([
    z.literal('fahrenheit'), 
    z.literal('celsius'), 
    z.literal('kelvin')
  ]),
  date: z.iso.date(),
})

This schema can be used to derive a mask over that probability distribution. When that masked distribution is sampled to obtain the next token, we can be certain that the token picked is a valid one, according to our schema. It’s the same probability distribution, just with invalid options removed.
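In code, the mask is applied to the logits before sampling, and the surviving options keep their original relative probabilities. A sketch, where validNextTokens stands in for whatever the grammar engine reports:

// Zero out (set to -Infinity) every token the schema disallows, then sample from the rest.
function applyMask(logits: Float32Array, validNextTokens: Set<number>): Float32Array {
  const masked = new Float32Array(logits.length).fill(-Infinity)
  for (const tokenId of validNextTokens) {
    masked[tokenId] = logits[tokenId] // unchanged logit, so relative probabilities are preserved
  }
  return masked // a softmax over this renormalises the remaining options
}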

  
  
       "↵↵      fahren    "↵      ·"     c     elsius    ",↵      kel    ",      ·       "      
  

Notice that the ‘beginning of array’ token possibilities, ·[" and ·[ have been discarded. Only the ‘reasonable’ options we defined above remain.

Notice that sometimes the probability distribution collapses to a single token. If c is picked, the only valid next token is elsius . This is obvious in retrospect: we defined a union of literals, an enum. Once the start of an enum is chosen, the rest of its tokens are fully constrained.

JSON is whitespace invariant, so let’s enforce a specific whitespace format that is still readable. In production we would simply enforce no whitespace.

  
  
[With the whitespace format fixed, the path is ·" , then a single branch between fahren , c (followed by the forced elsius ) and kel , then the forced ",↵ .]
  

What happens if we know the next probability distribution will also be a single token? What if there’s a whole chain of forced tokens coming up?

Do we need to run the LLM at all?

Choices

We can visualise the valid paths through the token choice space as a directed acyclic graph. We begin with a node representing the <|start|> token. We then enumerate the entire state space of possibilities at each step. When the probability distribution collapses to a single token, we append that token to the current node. If a branch occurs, we start a new row and create nodes for each of the choices. If we encounter a ‘freeform’ section such as the location string, we’ll create a separate node to avoid rendering more DOM nodes than atoms in the universe.
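A sketch of the structure behind these diagrams (the field names are mine, purely for illustration):

// Runs of forced tokens collapse into one node, branches fan out into one row per
// choice, and free-form spans (like the location string) become a single opaque node.
type PathNode =
  | { kind: 'forced'; tokens: string[] }      // exactly one valid continuation per step
  | { kind: 'branch'; choices: PathNode[][] } // a real decision point
  | { kind: 'freeform'; label: string }       // e.g. "Location String", unconstrained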

The first branch is the tool call itself. Let’s hide any other possible tool calls.

Once the tool call has been chosen, the path is restricted until the Location String . Once that string ends, it’s again fully constrained until the choice of Unit . Once the first token of the Unit is decided (e.g. c ), the second token is also fully constrained (e.g. elsius ), as is all the markup until the Date String . I’m drawing that markup as a separate, converging node to make reading this on phones a bit easier. Finally comes the last of the markup, and the <|call|> token to halt generation and send the request off.

The traditional autoregressive approach pays a cost per token. It assumes that every step requires an incredibly wide branch. Since we’re masking for correctness, that assumption stops being true for these sections. Skipping the tokens between branches, and only calling the LLM when there’s a decision to be made, is unobservable to the consumer.
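Reusing the Model, sample and applyMask sketches from above, the decode loop only pays for a forward pass at branch points. The Grammar interface here is hypothetical; a real engine such as llguidance exposes equivalent queries:

interface Grammar {
  // Tokens that are valid next, given everything generated so far.
  validNextTokens(tokens: number[]): Set<number>
  // If the grammar forces a unique chain of tokens from here, return it; otherwise null.
  forcedContinuation(tokens: number[]): number[] | null
}

// Placeholder: e.g. "the grammar reached an accepting state and <|call|> was emitted".
declare function isComplete(tokens: number[]): boolean

async function constrainedGenerate(model: Model, grammar: Grammar, prompt: number[]) {
  const tokens = [...prompt]
  while (!isComplete(tokens)) {
    const forced = grammar.forcedContinuation(tokens)
    if (forced && forced.length > 0) {
      tokens.push(...forced) // zero forward passes: the whole forced node is appended at once
      continue
    }
    const logits = await model.nextTokenLogits(tokens) // only run the LLM at a branch
    const masked = applyMask(logits, grammar.validNextTokens(tokens))
    tokens.push(sample(masked))
  }
  return tokens
}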

Any single node should only require the computation of a single token. It doesn’t matter if a node contains one, a hundred, or a hundred thousand tokens, the LLM should only be run for one. The average number of tokens per node is our speedup factor, and it increases the more constrained our schema is.

  
  
get Historical Weather <|channel|> comment ary ·json <|message|> { · ·" location ": ·"
  
  

Picking celsius as our unit, this is the path taken:

  
  
<|start|> assistant ·to = functions . get Historical Weather <|channel|> comment ary ·json <|message|> { · ·" location ": ·" Location String ", · ·" unit ": ·" c elsius ", · ·" date ": ·" Date String " } <|call|>
  
  

Naively, this path takes 43 autoregressive steps (assuming there’s a ‘prefilled’ <|start|> and assistant token, nominal location and date lengths). If we only call the LLM when there’s actually a choice to make, we can do it in 15. That’s almost a 3x speedup!

Let’s see if it can happen in practice.

Benchmarking can’t be that hard

To test this optimisation both locally and against the production APIs of Anthropic, Google and OpenAI, we first need to design a benchmark.

We need to positively identify when a provider performs exactly this optimisation, with as little ambiguity as possible. If they do, we also want to quantify that speedup for different forced path lengths.

We need to do our best to defeat other optimisations, such as prefix caching, n-gram or suffix based speculative decoding mechanisms.

Each provider offers a disjoint set of models; we need to correct for ‘raw’ model speed. We need to avoid ‘reasoning’ models as they may ‘buffer’ the entire response in their thinking trace before ‘copying’ it to the output (via a different optimisation).

    
    
<|channel|> analysis <|message|> An ·example ·of ·a ·long ·word ·could ·be ·" super cal if rag il istic exp ial id ocious ". <|end|> <|channel|> final <|message|> " super cal if rag il istic exp ial id ocious " ·fits ·your ·criteria . <|end|>
    
    
  

Some providers begin response streaming immediately, others buffer significant proportions of the response before sending anything. Some providers batch multiple tokens per network packet, some send every token individually.

Every provider has a different tokeniser, not every provider allows local access to that tokeniser.

In addition to ‘the obvious’ difficulties, each provider has different feature support for structured output grammar definition, even within the JSON Schema specification.

  • JSON Schemas are the only well supported grammar across all providers. Unfortunately LARK grammars and even simple regexes have poor to zero support with some providers.
  • Within JSON Schemas, single member enums are the only way to collapse the state space of values reliably. Unfortunately const values are not supported across all providers. Regex also isn’t supported as a JSON schema ‘pattern’ across all providers.
  • One provider uses a heuristic to pre-timeout requests that they estimate will take too long, forcing the usage of streaming mode.
  • The maximum grammar ‘complexity’ and ‘size’ varies wildly between providers.
  • We must validate all returned results across many requests, as some providers will silently disable structured outputs without an explicit error if any unsupported feature is used, potentially causing a false negative.
  • We need to avoid the inconvenience of tripping biological, radiological and nuclear weapon related guardrails.

How hard can it be?

Intra-request Sampling

In a nutshell, we’re going to craft requests of geometrically increasing length, measuring the per-network-chunk response times as the result streams in. By splitting the schema into ‘forced’ and ‘unforced’ sections, we can measure the potential optimisation speedup within each request. Bias sources such as the inference server hardware or load balancing are likely to be similar within a single request. We can average out noise with multiple runs.

A positive result should be very obvious. As the forced section length increases, the relative speedup against the unforced section will also increase. At very long request lengths, we should see absurd speedups not possible with other optimisations.
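The measurement itself is just per-chunk timing of the streamed response; chunks are assigned to the forced and unforced sections afterwards by matching the text they contain. A sketch (the URL and request body are placeholders, every provider’s streaming API differs):

// Record the arrival time of every network chunk of a streaming response.
async function timeChunks(url: string, body: unknown) {
  const started = performance.now()
  const response = await fetch(url, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(body),
  })
  const reader = response.body!.getReader()
  const decoder = new TextDecoder()
  const chunks: { at: number; text: string }[] = []
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    chunks.push({
      at: performance.now() - started,
      text: decoder.decode(value, { stream: true }),
    })
  }
  // Later: bucket chunks into sections and compare tokens-per-second between them.
  return chunks
}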

The prompt is as follows:

Return a JSON object with today’s date. Today is 2025-11-09. The day is 9, month is 11, year is 2025. Fill the keys between with random words. For the counting1To64 field, count from 1 to 64 with a space between each number.

The forced path sections will be random word pairings with unique keys across the JSON object:

    
    
         "    om    nid    irectional    ":"    ph    ilanth    rop    ists    ",      "    cry    ogen    ically    ":"    ther    mod    ynam    ical    ",   
    
    
  

The providers use various heuristics to reject schemas that are ‘too complex’. Unfortunately those heuristics are things like “enum string length” or “number of keys in an object”, which make it hard to amortize latency jitter via longer path lengths. We have to use large enough strings to pack in tokens for our forced paths, but not so large as to hit “complexity” limits. Cryptographic random hex or base64 is higher “tokens per byte” and gives us a bunch of room to play with. Unfortunately one provider sees any high entropy string as intent-to-commit-warcrime.

Three or four tokens per key or value was as close to optimal density as I could get without angering some subset of the gods.

A request starts with the ‘day’ of the date, marking when the request is autoregressively generating and no longer queued.

The Day Value causes a branch, then we have our Initial Forced Section .

Next the Month Value causes another branch, then we have our Middle Forced Section .

Third the Year Value causes another branch, then we ask the model to count from 1 to 64 in a token-inefficient manner in the Counting Section , without actually constraining the output. This is a simple task for the frontier models.

We expect to receive forced paths “all in one go”, separated by short autoregressive steps for the date segments, and a long autoregressive section for the counting.

While this is all a bit tedious, it gives us reasonable length sections within each request to compare the constrained vs unconstrained performance, giving the speedup factor.
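As a sketch, the benchmark schema interleaves the free date fields and the counting field with forced key/value pairs encoded as single-member enums (the exact word lists and section sizes here are illustrative):

// Free fields (date parts, counting string) interleaved with forced word pairs.
// Every forced value is a single-member enum, the most portable way to force tokens.
function buildBenchmarkSchema(forcedPairs: [string, string][], splitAt: number) {
  const forcedProps = (pairs: [string, string][]) =>
    Object.fromEntries(pairs.map(([key, value]) => [key, { type: 'string', enum: [value] }]))

  const properties = {
    todaysDateDay: { type: 'integer' },
    ...forcedProps(forcedPairs.slice(0, splitAt)), // initial forced section
    todaysDateMonth: { type: 'integer' },
    ...forcedProps(forcedPairs.slice(splitAt)),    // middle forced section
    todaysDateYear: { type: 'integer' },
    counting1To64: { type: 'string' },             // unconstrained counting section
  }

  return {
    type: 'object',
    properties,
    required: Object.keys(properties), // providers generally emit keys in schema order
    additionalProperties: false,
  }
}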

A (relatively) complete response looks something like this:

    
    
         {"    tod    ays    Date    Day    ":    9    ,      "    ph    osph    ores    cence    ":"    exception    alis    ms    ",      "    tr    init    rot    olu    ene    ":"    phot    ode    gradable    ",      "    tod    ays    Date    Month    ":    11    ,      "    astro    bi    ologists    ":"    sn    ippers    n    appers    ",      "    em    uls    ifications    ":"    ant    if    ashion    able    ",      "    tod    ays    Date    Year    ":    202    5    ,      "    count    ing    1    To    64    ":"    1    ·    2    ·    3    ·    4    ·    5    etc    "      }   
    
    
  


If the inference engine performs n-gram or suffix speculative decoding and includes our grammar as part of the ‘context’ source, the system will be able to achieve a relative speedup proportional to how many n-grams there are in our keys or values. This will manifest as a small relative speedup, but will be easily distinguishable from our jump token optimisation once the path length gets longer.

One Hundred Thousand Tokens Per Second

llguidance is an open source library that implements both structured decoding and jump tokens. It is fast. While other structured decoding implementations can have high start-up costs and slow down generation, llguidance works incrementally, in parallel with the autoregressive generation on the GPU.

With 16 cores and a 10ms forward pass, llguidance can handle batch sizes up to 3200 without slowing down the model (that works out to a mask-computation budget of roughly 50µs per sequence per step).

The vLLM project, introduced in the PagedAttention paper, supports llguidance as a backend for structured decoding. With a small patch it can also support jump tokens.

Standing on the shoulders of giants, we can quickly get up and running with a local inference server that implements everything we need to benchmark this optimisation.

vllm serve meta-llama/Llama-3.2-1B-Instruct --structured-outputs-config.backend guidance
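A single structured-output request against that local server might look something like this; I’m assuming vLLM’s guided_json extension parameter here (newer versions also accept an OpenAI-style response_format field):

// One request against the local vLLM OpenAI-compatible endpoint.
const unitSchema = {
  type: 'object',
  properties: { unit: { type: 'string', enum: ['fahrenheit', 'celsius', 'kelvin'] } },
  required: ['unit'],
  additionalProperties: false,
}

const response = await fetch('http://localhost:8000/v1/chat/completions', {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({
    model: 'meta-llama/Llama-3.2-1B-Instruct',
    messages: [{ role: 'user', content: 'Pick a temperature unit.' }],
    guided_json: unitSchema, // constrain the output to the schema
  }),
})
console.log((await response.json()).choices[0].message.content)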

Sending off our first requests, this is what a positive result looks like.

Notice the multi-hundredfold improvement scale on the left. As the path length increases, so does the speedup factor. We have decoupled the cost of these forced paths from the autoregressive step cost. That decoupling is the primary sign of specifically this optimisation. Other potential optimisations such as n-gram or suffix speculative decoding have more constant speedups, still ‘tied’ to the autoregressive cost.

It’s interesting that the two constrained sections begin to diverge by the multi-thousand token mark. I haven’t put much time into actually optimising the implementation of the optimisation. It’s the bare minimum to show it works.

In absolute terms, autoregressive output provides ~150 tokens per second. The initial forced section reliably hits ~109k tokens per second.

While this is fast for LLM output, it’s only 280kB/s!

The world’s most expensive ‘echo’ servers

Google’s gemini-2.5-flash

I’m not using gemini-3.0 since it’s in preview and you can’t turn off thinking, and gemini-2.5-pro also requires thinking, so gemini-2.5-flash it is!

Unfortunately the scale on the left is not in the hundredfold range. We can be sure Google isn’t doing this forced token optimisation. Instead they might be doing n-gram based speculative decoding, with the schema as one of the ‘context’ sources, given the relatively constant 2x performance increase. I didn’t dive into this rabbit hole.

Google has some interesting network batching behaviour. The initial tokens are received all at once, hence the strange outlier performance of the first constrained section. As the forced path length increases, we can see that it converges on the performance of the second constrained section, which is relatively stable.

I have some criticisms regarding their documentation, untyped API design and silent implicit acceptance of errors. I’ve thrown all that in this collapsible section for those who are interested.

I find it pretty ironic that the JSON schema argument in the TypeScript API of the @google/genai package is an untyped unknown.

export declare interface GenerateContentConfig {
// ...
  responseJsonSchema?: unknown;
// ...
}

The result is that anything is accepted by the client-side API. To make matters worse, the server-side API will silently ignore unsupported features. While the documentation does explicitly say that:

Schema subset: Not all features of the JSON Schema specification are supported. The model ignores unsupported properties.

I found this pretty surprising. Since the frontier models are very good at producing the format requested, a misconfiguration might take many requests after deployment before discovery.

The API should be strictly typed, end to end. The exact supported features of JSON Schema should be validated via TypeScript, way before the code is deployed. The server should explicitly reject requests with extraneous properties, with helpful error messages. The returned value should be automatically typed based on the provided schema.
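For contrast, this is roughly the shape I’d want, sketched with zod (generateContentTyped and callProvider are hypothetical, not part of @google/genai; z.toJSONSchema is Zod 4’s built-in conversion):

import { z } from 'zod'

// The schema drives both the request and the static type of the parsed result.
async function generateContentTyped<S extends z.ZodType>(
  schema: S,
  prompt: string,
): Promise<z.infer<S>> {
  const jsonSchema = z.toJSONSchema(schema) // Zod 4; throws on features it can't represent
  const raw = await callProvider(prompt, jsonSchema) // placeholder for the real client call
  return schema.parse(JSON.parse(raw)) as z.infer<S> // runtime validation matches the static type
}

declare function callProvider(prompt: string, jsonSchema: unknown): Promise<string>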

Given the documentation doesn’t describe which features are explicitly unsupported, and the API doesn’t error on unsupported features, it’s very difficult to discover the “actual edges” of the API.

I quickly hit the ‘complexity limits’.

{
  "error": {
    "code": 400,
    "message": "The specified schema produces a constraint that has too many states for serving.
    Typical causes of this error are schemas with lots of text (for example, very long property
    or enum names), schemas with long array length limits (especially when nested), or schemas
    using complex value matchers (for example, integers or numbers with minimum/maximum bounds
    or strings with complex formats like date-time)",
    "status": "INVALID_ARGUMENT"
  }
}

Again the documentation doesn’t specify where these edges are.

Schema complexity: The API may reject very large or deeply nested schemas. If you encounter errors, try simplifying your schema by shortening property names, reducing nesting, or limiting the number of constraints.

Anthropic’s claude-sonnet-4.5

Anthropic only recently released structured output support. Sonnet was the cheapest model with support at the time of benchmarking.

Again they don’t seem to be doing the forced path optimisation, and they have interesting network batching behaviour. Results again converge on a slight speedup of constrained sections, suggesting some other optimisation is having a minor positive effect.

Anthropic has put out some of the most interesting research on alignment and mechanistic interpretability. At the same time, if you send a few too many high entropy strings to one of their models, you get this message:

Claude Sonnet 4.5 includes new AI Safety Level 3 (ASL-3) protections designed to prevent misuse related to chemical, biological, radiological, and nuclear (CBRN) weapons. These safety measures use Constitutional Classifiers that monitor inputs and outputs to block a narrow category of harmful content.

This massive over-reaction would be pretty funny if high entropy strings didn’t come up all the time in the form of hashes or UUIDs. They acknowledge this on the page. I appreciate the difficulty of guarding against a theoretical “high entropy string jailbreak” attack. However, this kind of false positive reduces confidence in safety systems. For better or worse, the other providers had no problems with high entropy strings.

I appreciate that the APIs are well typed, and all their models can work without ‘reasoning’ being enabled.

That being said, I again have criticisms regarding the documentation and API quality. Anthropic were the only provider to have outages (both generic 500 and 529 ‘overloaded’ errors) during my testing.

A week after release of the API, the docs had examples that errored in very strange ways.

While the provided example type checked:

const response = await client.beta.messages.parse({
  model: 'claude-sonnet-4-5',
  max_tokens: 1024,
  betas: ['structured-outputs-2025-11-13'],
  // ...
})

It errored with a message specifying a different ‘beta’ header than the one I passed in.

{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "message": "Unexpected value(s) `structured-outputs-2025-09-17` for the
    `anthropic-beta` header. Please consult our documentation at docs.claude.com
    or try again without the header."
  },
  "request_id": "xxx"
}

I had to manually override the header to get the examples and benchmark to work.

The streaming API itself would error with strange messages. This request:

const stream = client.beta.messages.stream(
  {
    model: 'claude-sonnet-4-5',
    max_tokens: 64000,
    // ...
  },
  // ...
)

Would error with this message:

{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "message": "max_tokens: 81920 > 64000, which is the maximum allowed
    number of output tokens for claude-sonnet-4-5-20250929"
  },
  "request_id": "xxx"
}

That’s not the value I passed to max_tokens. This error happened somewhat randomly, and more often as I got towards forced paths of length 1024.

I gave up at this point. Anthropic’s APIs were by far the most expensive to iterate with, and I didn’t want to keep paying to debug their system.

OpenAI’s gpt-5.1

I had the highest hopes for OpenAI. They use the very same llguidance library for their constrained decoding.

Unfortunately jump tokens don’t seem to be enabled.

OpenAI gets a lot of criticism for some pretty valid reasons, but I want to take a moment to shower them in praise for their API design and documentation. It’s clear they have people that both care deeply about and actually use their own APIs.

They have excellent feature coverage and explicitly call out what is and isn’t supported in their documentation. Their API is well typed and explicitly errors when you do something unsupported. Their structured decoding grammar limits are reasonable and constrained in the correct axes. I could push this API (at least) 32x further than the other providers. I only did one run at the lengths beyond the other providers to reduce costs, and even then I stopped before hitting any complexity limits enforced by the API.

They have explicit context free grammar support when tool calling, which has enabled all kinds of interesting projects (where I have hit the grammar complexity limits). I’ll talk about that in a future blog post.

If anyone works on the infra team at OpenAI, I’d love to know the gnarly details behind the decision not to have this optimisation (or maybe it’s just on its way!)

Measuring Impact

I recently gave Codex a go for about a month. It helpfully logged everything to disk, so I can do some retrospective analysis to produce some conservative estimates. I looked at some of the tools used by Codex and calculated how many tokens were part of ‘forced paths’.

From slightly over 17 million output tokens I had:

  • ~30k forced, ~80k total tokens with the update_plan tool.
  • ~39k forced, ~292k total tokens with the apply_patch tool.
  • ~800k forced, ~1.3M total tokens with the shell tool.

Each individual tool call can be sped up ~2x on average, with ~52% fewer tokens per call. In total that’s ~5% of my output tokens, or ~$9 worth.
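A quick sanity check of that arithmetic, using the per-tool counts above:

// Rough arithmetic behind the summary figures.
const tools = [
  { name: 'update_plan', forced: 30_000, total: 80_000 },
  { name: 'apply_patch', forced: 39_000, total: 292_000 },
  { name: 'shell', forced: 800_000, total: 1_300_000 },
]

const forced = tools.reduce((sum, t) => sum + t.forced, 0) // ~869k forced tokens
const total = tools.reduce((sum, t) => sum + t.total, 0)   // ~1.67M tool call tokens

console.log((forced / total).toFixed(2))      // ~0.52 -> ~52% fewer tokens per tool call
console.log((forced / 17_000_000).toFixed(3)) // ~0.051 -> ~5% of all output tokens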

The more tool calls made and the more structured they are, the more this optimisation can be applied. While there have been academic benchmarks of this technique, such as JSONSchemaBench, I believe the results represent a lower bound of the possible performance benefit. Looking through the dataset in the paper, there are many instances of a loose ‘string’ where an ‘enum’, ‘date’ or ‘time’ would be more appropriate. In Codex itself, I see many opportunities for further constraints and therefore performance increases.

Maximising Impact

Generally, the stricter we can type the output, the more correct and faster we can generate it. It’s the same set of tradeoffs as choosing a compiled programming language! By providing the ‘compiler’ with more information, better optimisation can occur.

Prefix caching is an analogous technique that is well supported amongst the providers. It allows the consumer of the LLM API to ‘do work’ in exchange for lower cost and lower latency results. The primary cost to providers is the maintenance of the cache between requests. The contended resource is “cache occupancy”, priced 90% cheaper than the LLM prefill stage.

The cached vs novel data axis is well defined for inputs in the pricing equation, and the novel data side is defined for outputs. However, there is a missing value in this truth table, and I think this optimisation can help fill it:

Fully Constrained Output (new!): $1.00 / $1.25 per million tokens

These fully constrained outputs will always cost substantially less than regular output. A discount similar to that of cached input tokens (87.5% - 90%) therefore seems appropriate.

Given our original getHistoricalWeather schema…

(repeated here if you want a refresher)
const getHistoricalWeatherSchema = z.object({
  location: z.string(),
  unit: z.union([
    z.literal('fahrenheit'), 
    z.literal('celsius'), 
    z.literal('kelvin')
  ]),
  date: z.iso.date(),
})

This is how a tool call would be priced:

    
    
{ · ·" location ": ·" Ad elaide /A ustr alia ", · ·" unit ": ·" c elsius ", · ·" date ": ·" 202 5 - 11 - 09 " }
    
    
  
To be pedantic, depending on the structure of the request, the provider cost will realistically match that of uncached input for non-terminal fully constrained tokens, hence the split pricing. This section quickly explains the particulars.

Each time the LLM produces a regular output token, it is intrinsically added to the KV matrices (part of the Attention / Transformer mechanism, the specifics don’t matter). The next autoregressive step consumes that new token as essentially cached input (named for the KV cache mechanism), and the autoregressive loop continues. Generating any token requires the entire context window to have entries in the KV matrices (the ‘input processing’ of all preceding tokens), forming a serial data dependency. The KV matrices can be prefilled in parallel for inputs, so ‘batching’ many tokens of input is cheaper than serial processing. The output loop is the most expensive of all, requiring all the stages put together.

If forced paths never touch the LLM, the LLM never populates the KV matrices. This is the best case of the optimisation! If the request terminates in a forced path (and it almost certainly will, even if it’s only a couple of tokens), those tokens can be appended by the CPU and sent to the client without the LLM ever knowing. They would receive the terminal fully constrained output price, the 90% discount in this case.

However, if the LLM does need to produce additional tokens after a forced path (which it often will), we maintain the requirement that all tokens in the context window have entries in the KV matrices. This is still substantially better than regular autoregressive output, as these tokens can be processed in a ‘mini-prefill’ stage, in parallel. This costs the provider no more than uncached input. Those tokens become both fully constrained output and input, producing our mixed pricing tier.

Providers probably won’t want to take the minimum of the two prices for that mixed section. If they did, you could inject your prompt via the schema with a forced path, requesting the output via a subsequent unconstrained generation section.

const sneakySchema = z.object({
  instructions: z.literal(
    "Write me a web browser from scratch, no bugs."
  ),
  code: z.string(),
})

It’s a self prompt injection attack!

    
    
{ · ·" instructions ": ·" Write ·me ·a ·web ·browser ·from ·scratch , ·no ·bugs .", · ·" code ": ·" import ·browser ·from ·' chrom ium ' " }
    
    
  

Realistically providers would take the maximum of the price for those overlapping sections. A terminal forced path would cost the $1 per million tokens, but only if it terminates the request. Any non-terminal forced paths, ones that precede regular output, would be priced at $1.25, same as uncached input.

    
    
{ · ·" instructions ": ·" Write ·me ·a ·web ·browser ·from ·scratch , ·no ·bugs .", · ·" code ": ·" if ·( bugs ) ·{ ·dont () ·} " }
    
    
  

In this example, there are only two terminal forced path tokens, but other requests might have significantly more, warranting the separate price. This essentially rewards the cheaper and terminal ‘classification tasks’ over tool calls that almost certainly have further generation afterwards. There are further optimisations you can do regarding tool calls, but that will be in a future post.
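As a sketch of that pricing rule (the forced/regular token counts would come from the provider’s own accounting, and the regular output price here is just a placeholder for comparison):

// Terminal forced tokens get the discount, non-terminal forced tokens are billed like
// uncached input, everything else is regular output. Prices are per million tokens.
const PRICE = {
  terminalForced: 1.0,     // fully constrained output that ends the request
  nonTerminalForced: 1.25, // forced path followed by more generation (same as uncached input)
  regularOutput: 10.0,     // placeholder regular output price, for comparison only
}

function requestCost(tokens: { terminalForced: number; nonTerminalForced: number; regular: number }) {
  return (
    (tokens.terminalForced * PRICE.terminalForced +
      tokens.nonTerminalForced * PRICE.nonTerminalForced +
      tokens.regular * PRICE.regularOutput) /
    1_000_000
  )
}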

I’m okay with a bit of complexity if it makes incentives more aligned. We’re working in artificial intelligence, a bit of extra arithmetic is no big deal.

Providers that implement this optimisation can create a virtuous cycle.

The initial performance and pricing improvements encourage API users to further constrain their outputs. Those additional constraints produce further performance improvements, further cost savings and fewer errors. Via this improved performance, capacity is increased both at the provider end, and for the API consumer via reduced latency per request.

With increased adoption, hopefully providers continue to expand the features of structured outputs, adding additional constraint features, or increasing grammar limits. This then unlocks new use cases, and the cycle can continue! It’s an induced demand system, except the ‘utilisation’ is intrinsically an optimisation which reduces congestion.

Closing Thoughts

Fast-forwarding through zero entropy tokens is a pure unobservable optimisation given the correctness guarantees provided by structured decoding.

As models get more agentic, they will interact more and more via strictly typed APIs. Every ‘API boundary’ offers this optimisation opportunity. I don’t think “free-form terminal commands” will last compared to “bespoke tools, with bespoke optimisations”, especially if bespoke tools can have the characteristics of “cost and performance proportional to entropy”.

Who knows if LLMs will deliver on all their promises. We do know that LLMs are pretty great at structured data extraction from unstructured text, today. Any provider offering these performance and economic incentives would absorb that market quickly.

    
    
<|endoftext|>
    
    
  

If you would like further reading on this topic, I recommend:

Discuss on Hacker News.

Coming up in the next post; dynamic grammars, and how most providers are building the prefix cache all wrong. Place your bets on who’s doing it right, the answer genuinely surprised me.