Show HN: B-IR – An LLM-optimized programming language

github.com

62 points by ImJasonH 21 days ago · 42 comments

cpeterso 21 days ago

If a new programming language doesn’t need to be written by humans (though should ideally still be readable for auditing), I hope people research languages that support formal methods and model checking tools. Formal methods have a reputation for being too hard or not scaling, but now we have LLMs that can write that code.

https://martin.kleppmann.com/2025/12/08/ai-formal-verificati...

  • jaggederest 21 days ago

    Absolutely agreed. My theory is that the more tools you give the agent to lock down the possible output, the better it will be at producing correct output. My analogy is something like starting a simulated annealing run with bounds and heuristics to eliminate categorical false positives, or perhaps like starting the sieve of eratosthenes using a prime wheel to lessen the busywork.

    I also think opinionated tooling is important - for example, in the toy language I'm working on there are no warnings and no ignore pragmas, so the LLM has to confront error messages before it can continue.

    • WithinReason 21 days ago

      It should be impossible for an LLM to generate invalid code, as long as you force it to only generate tokens that the language allows.
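
      In practice this is grammar-constrained decoding: at each step you mask out every token the grammar forbids before sampling, so the output is syntactically valid by construction. A rough Python sketch, assuming a HuggingFace-style model/tokenizer and a hypothetical incremental grammar object whose methods are made up for illustration:

      ```
      import torch

      def constrained_sample(model, tokenizer, grammar, prompt, max_tokens=256):
          # Sketch only: the grammar object below is hypothetical; real
          # implementations exist in libraries like Outlines or llama.cpp's GBNF.
          ids = tokenizer(prompt, return_tensors="pt").input_ids
          state = grammar.initial_state()                   # made-up parser-state API
          for _ in range(max_tokens):
              logits = model(input_ids=ids).logits[0, -1]   # next-token scores
              allowed = grammar.allowed_token_ids(state)    # made-up: ids legal right now
              mask = torch.full_like(logits, float("-inf"))
              mask[list(allowed)] = 0.0                     # forbidden tokens stay at -inf
              next_id = int(torch.argmax(logits + mask))    # greedy for simplicity
              ids = torch.cat([ids, torch.tensor([[next_id]])], dim=-1)
              state = grammar.advance(state, next_id)       # made-up
              if grammar.is_complete(state):
                  break
          return tokenizer.decode(ids[0], skip_special_tokens=True)
      ```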

      • measurablefunc 16 days ago

        Tokens do not encode semantics.

        • WithinReason 16 days ago

          You can choose which token to sample based on language semantics. You simply don't sample invalid ones. So the language should be restrictive enough about which tokens it allows that invalid code is impossible.

          • SkiFire13 16 days ago

            > You can choose which token to sample based on language semantics

            Can you though?

            > the language should be restrictive on what tokens it allows

            This is a restriction on the language syntax, not its semantics.

kburman 21 days ago

An LLM is optimized for its training data, not for newly built formats or abstractions. I don’t understand why we keep building so-called "LLM-optimized" X or Y. It’s the same story we’ve seen before with TOON.

  • ImJasonHOP 21 days ago

    Yeah fwiw I agree. I was impressed at how well the agents were able to understand and write their invented language, but fundamentally they're only able to do that because they've been trained on "similar" code in many other languages.

Mathnerd314 21 days ago

I get that this is essentially vibe coding a language, but it still seems lazy to me. He just asked the language model, zero-shot, to design a language with no further guidance. You could at least use the Rosetta Code examples and ask it to identify design patterns for a new language.

  • Snacklive 21 days ago

    I was thinking the same. Maybe it would have turned out better if he had thought about the problem instead of just asking the model. The premise is interesting: "We optimize languages for humans, maybe we can do something similar for LLMs." But then he just asks the model to do the thing instead of working through the problem; rather than prompting "Hey, make this," a more granular, guided approach could've been better.

    For me this is just lost potential on the topic, and an interesting read that turns boring pretty fast.

    • ImJasonHOP 21 days ago

      I don't disagree at all. :)

      This was mainly an exercise in exploration with some LLMs, and I think I achieved my goal of exploring.

      Like I said, if this topic is interesting to you and you'd like to explore another way to push on the problem, I highly recommend it. You may come up with better results than I did by having a better idea what you're looking for as output.

      • Mathnerd314 15 days ago

        I tried a thread; what I got was that both LLMs and humans optimize for the same goal, working programs, and that the key is verifiability. So it recommended Rust or Haskell combined with formal verification and contracts. So I think the conclusion of the post holds up - "the things that make an LLM-optimized language useful also happen to make them easier for humans!"

  • forgotpwd16 21 days ago

    There's also the issue, also noted by the author, that LLM-optimization quite often just becomes token-minimization, when it shouldn't be only that.

sublinear 21 days ago

I think I've come full circle back to the idea that a human should write the high-level code unassisted, and the LLM should autocomplete the glue code and draft the function implementations. The important part is that the human maintains these narrow boundaries and success criteria within them. The better the scaffolding, the better the result.

Nothing else really seems to make sense or work all that well.

On the one extreme you have people wanting the AI to write all the code on vibes. On the other extreme you have people who want agents that hide all low-level details behind plain English except the tool calls. To me these are basically the same crappy result where we hide the code the wrong way.

I feel like what we really need is templating instead of vibes or agent frameworks. Put another way, I just want the code folding in my editor to magically write the code for me when I unfold. I just want to distribute that template and let the user run it in a sandbox. If we're going to hide code from the user at least it's not a crazy mess behind the scenes and the user can judge what it actually does when the template is written in a "literate code" style.

discrisknbisque 21 days ago

The Validation Locality piece is very interesting and really got my brain going. Would be cool to denote test conditions in line with definitions. Would get gross for a human, but could work for an LLM with consistent delimiters. Something like (pseudo code):

```
fn foo(name::"Bob"|genName(2)):
  if len(name) < 3
    Err("Name too short!")
  print("Hello ", name)
  return::"Hello Bob"|Err
```

Right off the bat I don't like that it relies on accurately remembering list indexes to keep track of tests (something you brought up), but it was fun to think about this and I'll continue to do so. To avoid the counting issue you could provide tools like "runTest(number)", "getTotalTests", etc.
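
For comparison, a weaker version of this already exists in mainstream languages: Python doctests keep the test conditions inline with the definition, and they also sidestep the index-counting problem because each example documents itself. A minimal sketch loosely mirroring the pseudocode above (the function name is made up):

```
def greet(name):
    """Greet a person, rejecting names that are too short.

    >>> greet("Bob")
    'Hello Bob'
    >>> greet("Al")
    Traceback (most recent call last):
        ...
    ValueError: Name too short!
    """
    if len(name) < 3:
        raise ValueError("Name too short!")
    return "Hello " + name

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # runs every inline example; silent when all pass
```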

One issue: The Loom spec link is broken.

petesergeant 21 days ago

A language is LLM-optimized if there's a huge amount of high-quality prior art, and if the language tooling itself can help the LLM iterate and catch errors.

giancarlostoro 21 days ago

The real question is: what existing language is perfect for LLMs? Is it Lisp? ASM? We know some LLMs are better at some languages, but which existing language are they best at? Would be interesting to see. I know one spot they all fail at is niche programming libraries. They have to pull down docs or review the raw code pulled down for the dependency; the issue is that in some languages, like C#, those dependencies are precompiled to bytecode, and the same goes for Java.

  • nzach 21 days ago

    > The real question is what existing language is perfect for LLMs?

    I think verbosity in the language is even more important for LLMs than it is for humans, because we can see a line like 'if x > y * 1.1 then ...' and relate it to the 10% overbooking that our company uses as a business metric. But for the LLM it would be way easier if it were 'if x > base * overbook_limit then ...'.

    For me, it doesn't make too much sense to focus on the token limit as a hard constraint. I know that current SOTA LLMs still have pretty small context windows, and for that reason it seems reasonable to try to find a solution that optimizes the amount of information we can put into our contexts.

    Besides that, we have the problem of 'context priming'. We rarely create abstract software; what we generally create is a piece of software that interacts with the real world. Sometimes directly through a set of APIs, and sometimes through a human that reads data from one system and uses it as input in another one. So, by using real-world terminology we improve the odds of the LLM doing the right thing when we ask for a new feature.

    And lastly there is the advantage of having a source code that can be audited when we need.

  • quatonion 16 days ago

    I have been having a crack at it in my spare time. A kind of intentional LISP where functions get turned into WASM in the cloud.

    The functions are optionally tested using formal verification. I plan to enable this by default soon, as time allows.

    These functions that get written can then be composed, and enzymes that run in the cloud actively look for functions to fuse.

    Also, the more people use it, the faster the compiler gets, via network scaling laws.

    It's very much research at the moment, but kinda works.

    Jupyter Notebook style interface with the beginnings of some image and media support.

    https://prometheus.entrained.ai

    You can try looking at some of the examples or try something yourself.

    Would love some feedback.

  • ImJasonHOP 21 days ago

    Early in one of the conversations Gemini actually proposed a Lisp-like language with S-expressions. I don't remember why it didn't follow that path, but I suspect it would have been happy there.

  • wfn 15 days ago

    I've been thinking about this; take a look at this:

    > From Tool Calling to Symbolic Thinking: LLMs in a Persistent Lisp Metaprogramming Loop

    https://arxiv.org/abs/2506.10021

    Edit: but also see the cons [3]. Maybe viable for very constrained domains, with strict namespace management and handling drops into the debugger. Also, after thinking more, it likely only sounds nice (Python vs. Lisp training corpus and library ecosystems; and there's mcp-py3repl (no reflection but otherwise more viable), PAL, etc.). Still, curious.

    In theory (I've seen people discuss similar things before though), homoiconicity and persistent REPL could provide benefits - code introspection (and code is a traversable AST), wider persistent context but in a tree structure where it can choose breadth vs depth of context loading, progressive tool building, DSL building for given domain, and (I know this is a bit hype vibe) overall building up toolkit for augmented self-expanding symbolic reasoning tools for given domain / problem / etc. (starting with "build up toolkit for answering basic math questions including long sequences of small digits where you would normally trip up due to your token prediction based LLM mechanism"[2]). Worth running some quick experiments maybe, hm :)

    P.S. and thinking of agentic loops (a very uh contemporary topic these days), exposing ways to manage and construct agent trees and loops itself is (while very possibly recipe for disaster; either way would need namespaces not to clash) certainly captivating to me (again given effective code/data traversal and modification options; ideally with memoization / caching / etc.)

    [1] https://arxiv.org/abs/2506.10021

    [2] https://www.youtube.com/watch?v=AWqvBdqCAAE on need for hybrid systems

    [3] cons (heh): hallucination in the metaprogramming layer and LLMs being fundamentally statistical models and not well trained for Lisp-like langs, and inevitable state pollution (unless some kind of clever additional harness applied) likely removes much of the hype...

evacchi 21 days ago

Weeks ago I was also noodling around with the idea of programming languages for LLMs, but as a means to co-design DSLs: https://blog.evacchi.dev/posts/2025/11/09/the-return-of-lang...

AlexCoventry 21 days ago

I'm looking for a language optimized for use with coding agents. Something which helps me to make a precise specification, and helps the agent meet all the specified requirements.

  • barnacs 21 days ago

    I'm looking for a language optimized for human use given the fundamental architectural changes in computing in the last 50 years. That way we could skip both the boilerplate and the LLMs generating boilerplate.

  • jaggederest 21 days ago

    I'm working on something similar. Dependently typed, theorem proving, regular syntax, long-form English words instead of symbols or abbreviations. It's not very well baked yet, but Claude/Codex are already doing really well generating it. I expect that once the repo has been around long enough to be included in training data it'll improve. Probably next year or the year after.

forgotpwd16 21 days ago

There was one other just yesterday: https://news.ycombinator.com/item?id=46571166

Surac 21 days ago

So where are the millions of lines of code you need to train the LLM on your new language? Remember, AI is just a statistical prediction thing. No input -> no output.

scotty79 20 days ago

Are there coding AI benchmarks that show which languages AI models are best at solving tasks in?

Sort of like Rosetta Code, but for AI and more complex tasks?

atlintots 21 days ago

I'm assuming OP is not aware of APL, J, or similar array programming languages.

middayc 21 days ago

He has good points about languages.

But it reminds me of the SEO guys optimizing for search engines. At the end of the day, the real long term strategy is to just "make good content", or in this case, "make a good language".

In the futuristic :) long term, in a "post programming-language world", I predict each big LLM provider will have its own proprietary compiler/VM/runtime. Why bother with transpiling if you can own the experience and result 100% and compete on that with other LLM providers?

internet_points 21 days ago

LLM-optimized in reality would mean you asked and answered millions of Stack Overflow questions about it and then waited a year or so for all the major models to retrain.

  • jaggederest 21 days ago

    This is part of my strategy with my toy language actually. By putting a repo on github and hopefully building up useful examples there, I expect within a year or two the language will be understood at a minimal level by the next group of LLMs, and thus make it more useful.

    Honestly they're already pretty great at fitting to the syntax if you provide the right context, so that may not be as much of an advantage as I initially thought, but it's fun to think about "just let them train on it next year once it's complete"

mooktakim 21 days ago

Just get LLM to write assembly

mike_hearn 21 days ago

I've thought about this too.

The primary constraint is the size of the language specification. Any new programming language starts out not being in the training data, so in-context learning is all you've got. That makes it similar to a compression competition - the size of the codec is considered to be a part of the output size in such contests, so you have to balance codec code against how effective it is. You can't win by making a gigantic compressor that produces a tiny output.

To me that suggests starting from a base of an existing language and using an iterative tree-based agent exploration. It's a super expensive technique and I'm not sure the ROI is worth it, but that's how you'd do it. You don't want to create a new language from scratch.

I don't think focusing on tokenization makes sense. The more you drift from the tokenization of the training text the harder it will be for the model to work, just like with a human (and that's what the author finds). At best you might get small savings by asking it to write in something like Chinese, but the GPT-4/5 token vocabularies already have a lot of programming related tokens like ".self", ".Iter", "-server" and so on. So trying to make something look shorter to a human can easily be counter productive.

A better approach is to look at where models struggle and try to optimize a pre-existing language for those issues. It might all be rendered obsolete by a better model released tomorrow of course, but what I see is problems like this:

1. Models often want to emit imports or fully qualified names into the middle of code, because they can't go backwards and edit what they already emitted to add an import line at the top. So a better language for an LLM would be one that doesn't require you to move the cursor upwards as you type, e.g. Python/JS benefits here because you can run an import statement anywhere, languages like Java or Kotlin are just about workable because you can write out names in full and importing something is just a convenience, but languages that force you to import types only at the very top of the file are going to be hell for an LLM.
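
A small, made-up Python illustration of that property: the import can be emitted at the exact moment the model discovers it needs it, with no backwards edit to an import block at the top of the file.

```
def handler(payload):
    # The model can emit this import right here, where the need appears,
    # instead of moving the cursor back up to the top of the file.
    import json
    return json.loads(payload)["status"]  # made-up payload shape
```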

Taking this principle further it may be useful to have a PL that lets you emit "delete last block" type tokens (smarter ^H). If the model emits code that it then realizes was wrong, it no longer has to commit to it and build on it anyway, it can wipe it and redo it. I've often noticed GPT-5 use "no op" patterns when it emits patches, where it deletes a line and then immediately re-adds the exact same line, and I think it's because it changed what it wanted to do half way through emitting a patch but had no way to stop except by doing a no-op.

The nice thing about this idea is that it's robust to model changes. For as long as we use auto-regression this will be a problem. Maybe diffusion LLMs find it easier but we don't use those today.

2. As the article notes, models can struggle with counting indentation especially when emitting patches. That suggests NOT using a whitespace sensitive language like Python. I keep hearing that Python is the "language of AI" but objectively models do sometimes still make mistakes with indentation. In a brace based language this isn't a problem, you can just mechanically reformat any file that the LLM edits after it's done. In a whitespace sensitive language that's not an option.
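
That post-edit cleanup is a one-liner; a sketch (assuming a brace-based language with a standard formatter such as clang-format on the PATH; the hook name is made up):

```
import subprocess

def reformat_after_llm_edit(path: str) -> None:
    # In a brace-based language, indentation mistakes are cosmetic and can be
    # repaired mechanically after every edit; in Python they change meaning.
    subprocess.run(["clang-format", "-i", path], check=True)
```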

3. Heavy use of optional type inference. Types communicate lots of context in a small number of tokens, but demanding the model actually write out types is also inefficient (it knows in its activations what the types are meant to be). So what you want is to encourage the model to rely heavily on type inference even if the surrounding code is explicit, then use a CLI tool that automatically adds in missing type annotations, i.e. you enrich the input and shrink the output. TypeScript, Kotlin etc - all good for this. Languages like Clojure, I think not so good, despite it being apparently token efficient on the surface.
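
A rough before/after sketch of that enrich-the-input step, shown in Python rather than TypeScript/Kotlin (function and types are made up; tools in the MonkeyType/pytype family can infer annotations like these after the fact):

```
# What the model emits: no annotations, fewer output tokens.
def total_price(items, tax_rate):
    return sum(item["price"] for item in items) * (1 + tax_rate)

# What an annotation-adding tool could rewrite it to afterwards:
# same behavior, but the types are now explicit context for later reads.
def total_price(items: list[dict[str, float]], tax_rate: float) -> float:
    return sum(item["price"] for item in items) * (1 + tax_rate)
```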

4. In the same way you want to let the model import code half way through a file, it'd be good to also be able to add dependencies half way through a file, without needing to manually edit a separate file somewhere else. Even if it's redundant, you should be able to write something like "import('@foo/bar:1.2.3').SomeType.someMethod". Languages like JS/TS are the closest to this. You can't do it in most languages, where the definition of a package+version is very far both textually and semantically from the place where it's used.

5. Agree with the author that letting test and production code be interleaved sounds helpful. Models often forget to write tests but are good at following the style of what they see. If they see test code intermixed with the code they're reading and writing they're more likely to remember to add tests.

There are probably dozens of ideas like these. The nice thing is, if you implement it as a pre-processor on top of some other language, you exploit the existing training data as much as possible, and in fact the codebase it's working on becomes 'training data' as well, just via ICL.

rvz 21 days ago

> Humans don't have to read or write or understand it. The goal is to let an LLM express its intent as token-efficiently as possible.

Maybe in the future, humans won't have to verify the spelling, logic, or ground truth in programs either, because we'll all have given up and assumed that the LLM knows everything. /s

Sometimes, when I read these blogs from vibe-coders who have become completely complacent with LLM slop, I have to keep reminding others why regulations exist.

Imagine if LLMs became fully autonomous pilots on commercial planes, or planes were optimized for AI control, and the humans just boarded the plane and flew for the vibes; maybe call it "Vibe Airlines".

Why didn't anyone think of that great idea? Also completely remove the human from the loop as well?

Good idea isn't it?

  • ImJasonHOP 21 days ago

    I hope I didn't give the impression that I thought this language was ready to be put into commercial planes. :)

    This was the result of an afternoon/evening exploring a problem space, and I thought it was interesting enough to share.

    • rvz 20 days ago

      The comment was sarcastic, hence the "/s" at the end of the first sentence.

      Everything else was a thought experiment to show how the idea of LLMs on everything including commercial planes is a very bad idea and would give regulators a hard time.

      The point is: just because you can (build and run anything) does not mean you should (put it on commercial planes).

  • eadwu 21 days ago

    There are multiple layers and implicit perspectives that I think most are purposefully omitting as a play for engagement or something else.

    The reason why LLMs are still restricted to higher-level programming languages is that there are no guarantees of correctness - any guarantee needs to be provided by a human - and it is already difficult for humans to review other humans' code.

    If there comes a time when LLMs can generate code - whether someone terms it slop or not - that comes with a guarantee of correctness, it is probably a correct move to have a more token-efficient language, or at least a different abstraction than the programming abstractions humans use.

    Personally, I think in the coming years there will be a subset of programming that LLMs can probably perform while providing a guarantee of correctness - likely using other tools, such as Lean.

    I believe this capability can be stated as: LLMs should be able to obfuscate any program code, which is a pretty decent guarantee.
