1
People are constantly debating which LLM is better for writing Elixir code, so I decided to compare the three state-of-the-art models from Google, OpenAI, and Anthropic to see which one would be better at designing a medium-size feature for a medium-size project.
The project is ReqLLM, a wonderful new LLM library by @mikehostetler, and the feature is adding image generation support, the first part of Add Image Generation and Audio Transcription Support · Issue #14 · agentjido/req_llm · GitHub.
I used Gemini 3 Pro, GPT 5.2, and Claude Opus 4.5 in the gemini, codex, and claude code CLIs, respectively, with the same prompt. After each model wrote a plan, I asked each (in a separate session) to compare the three plans. The results are here: ReqLLM image support plans · GitHub
Bottom line:
- My ranking of the plans is GPT 5.2 > Opus 4.5 > Gemini 3 Pro, and each of the three models agreed with this assessment.
- Arguably, GPT’s is the only correct plan: while Opus’s plan works, it essentially introduces a parallel response-parsing infrastructure and would make it hard or impossible to extend the image support going forward (add streaming, etc.); see the sketch after this list.
- For some reason, Claude likes to write big implementation chunks as part of its plan.
- Gemini’s is the least concrete and least accurate plan (and also uses the wrong image generation endpoints, for some reason).
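To make the “parallel infrastructure” point concrete, here is a deliberately simplified Elixir sketch. The module names, function names, and JSON shapes are hypothetical, not ReqLLM’s actual API; it only illustrates why extending one clause-based decode path (roughly the GPT-style plan) is easier to build on than adding a separate image-only path (roughly the Opus-style plan).

```elixir
# Hypothetical modules for illustration only; not ReqLLM's real API.

defmodule MyProvider.Decode do
  # One shared decode path: text and image parts flow through the same
  # clause-based parser, so streaming and usage handling stay shared.
  def decode_part(%{"type" => "text", "text" => text}), do: {:text, text}

  # Extending the existing path: one extra clause for image parts.
  def decode_part(%{"type" => "image", "data" => b64, "media_type" => mime}),
    do: {:image, Base.decode64!(b64), mime}

  def decode_part(other), do: {:unknown, other}
end

defmodule MyProvider.ImageDecode do
  # The "parallel infrastructure" alternative: an image-only parser that
  # duplicates response handling and cannot reuse the shared plumbing.
  def decode_image_response(%{"images" => images}) do
    Enum.map(images, fn %{"data" => b64} -> Base.decode64!(b64) end)
  end
end
```

In the first version, supporting streaming or a new content type means adding clauses to one function; in the second, every such change has to be duplicated in the image-only path.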
This matches my experience working with Claude Code and Codex daily: while Claude Code has nicer output, more features (like parallel/background execution), and runs faster, Codex is much, much more thorough and most often generates higher-quality code.
Also, the “/review” function in Codex is underrated. My current workflow is to always run a “/review” on code written by me, Claude, or another Codex session. It excels at finding very subtle edge cases and bugs introduced by the latest patch.
egeersoz 2
GPT is really bad with Elixir in my experience. I regularly run experiments where I ask multiple models the same question (about design or troubleshooting a bug) and GPT is consistently bottom tier. It’s also slow as hell. Not sure why people like it as a coding assistant.
I used to use it for product management to build domain expertise but Gemini 3 is better at that now.
> each of the three models agreed with this assessment.
This should really say “each of these models generated text which said that they agreed.”
If you worded your question slightly differently, the models would write something else. They are just text generators; they can’t “agree”.
Technically, your brain is just a biological text generator too - it doesn’t “agree” in any literal sense. It processes inputs and produces outputs based on patterns, associations, and prior experiences, much like an LLM does (albeit through vastly different mechanisms). The feeling of “agreeing” is just part of the narrative your mind constructs around its outputs.
I’m going to have to agree to disagree on this.
LLMs and brains don’t work the same way at all.
Fair enough - you’re right that LLMs and brains work very differently under the hood. But the point isn’t about the mechanism; it’s about the illusion of agency. When you say you “agree”, that feeling emerges from neural processes you’re not consciously controlling - just like an LLM’s response emerges from its training and inputs. Neither system “chooses” in a philosophical sense; both produce outputs shaped by prior data. So while the architectures differ wildly, the claim that only humans can truly “agree” might say more about how we interpret our own cognition than about what’s actually happening.
vkryukov 7
Interesting. I do concur that it is slow as hell; that’s why I often reach for Claude Code when I need to code something that I believe will be relatively straightforward and doesn’t need any “depth”.
However, in my experience of using Codex daily, it produces, albeit slowly, much higher-quality code. You can hopefully see this for yourself in the design documents each model produced.
With my workflow, it doesn’t matter that much that Codex is slow: I’m usually running 2-3 sessions in parallel, and/or reviewing the code that one of the agents produces to either approve and commit it, ask clarifying questions, or steer the model in a different direction. And 2-3 is about the number of different topics I can effectively work on at the same time. So the longer work time doesn’t result in a real slowdown in practice, most of the time.
vkryukov 8
You are technically right, but I personally don’t find such descriptions of how LLMs work useful. My own mental model is of an “AI person”, with some reasoning abilities, encyclopedic knowledge, and an occasional tendency to hallucinate.
You might say that I anthropomorphize a tensor, but so what? The mental model above is more useful in predicting what LLMs will produce, and how best to use them to generate useful code, than the alternative you propose, at least to me.