Frontier LLMs like GPT-5.3-Codex, Gemini 3.1 Pro, and Claude Opus 4.6 have spiky capabilities, performing several standard deviations above median human performance on some tasks while failing at some “easy” ones.
LLM pretraining datamixes often emphasize general knowledge, reasoning, and coding. The human “training mix” includes far more samples of visual/spatial/motor tasks, which come about naturally in the embodied human experience.
World models like Sora and Genie are pretrained on video and 3D video game data and excel at predicting the behavior of the real world. But no current model is at the frontier of both reasoning/coding and spatial reasoning/world modeling.
We’d expect (and it seems empirically true) that LLMs trained primarily on text are worse than humans on visual/spatial tasks. Computer-aided design (CAD) tasks require strong 3D reasoning ability as well as common-sense world knowledge, so LLMs might struggle with these tasks.
## The experiment
I started with a practical CAD task I wanted done: designing a 3D-printable wall mount for my bike pump. Could some LLM do this task for me?
Current models can’t use graphical CAD programs like FreeCAD, but they’re great at writing code, so I had the models use OpenSCAD. Here’s the prompt:
Design a wall mount for this Lezyne Steel Floor Drive bike pump that I can 3D print. […] It should hold the bike pump by the handle, so that the bike pump hangs with the dial facing outward. It should hold the pump far enough away from the wall that the valve (which sticks out from the bottom of the pump) doesn’t touch the wall. Orient and position the design so that the wall is the YZ plane, and the mount protrudes into the positive X direction and is symmetric about the XZ plane. […]
Implement your design in openscad. […] Keep iterating on your design using the provided tool(s) until your most recent mujoco_mount_sim call returns *ONLY* the status “object_held” and *NO OTHER STATUSES*. If you get any other status, it means your design was not successful. Before each call to mujoco_mount_sim, write 1-3 sentences about how your design will work and/or how you will fix the issue(s) with previous versions of your design. […]
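Before the simulator can do anything, the model's OpenSCAD source has to be rendered to a mesh. A minimal sketch of that step, assuming the `openscad` CLI is on PATH (the function name and error handling here are illustrative, not the project's actual code):

```python
import subprocess

def render_scad(scad_path: str, stl_path: str, timeout_s: int = 120) -> None:
    """Render an OpenSCAD design to an STL mesh via the openscad CLI.

    A syntax error in the model's design raises CalledProcessError,
    which a harness can report back to the model as a failed tool call.
    """
    subprocess.run(
        ["openscad", "-o", stl_path, scad_path],
        check=True,
        capture_output=True,
        timeout=timeout_s,
    )
```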
To test the designs, I made a 3D scan of the bike pump using Luma AI and created a simulation using MuJoCo to check whether the mount holds the pump.
I put each model in an agentic loop where it could call the simulator up to 10 times. If the mount held the pump (i.e. the pump was touching the mount after 5 seconds) then the design passed.
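The loop can be sketched as follows. `ask_model` and `mujoco_mount_sim` are illustrative stand-ins for the real model call and simulator tool; only the `"object_held"` status string comes from the actual prompt:

```python
# Sketch of the agentic loop; simulator and model calls are stubbed out,
# and the call shapes are illustrative rather than the exact harness API.
MAX_SIM_CALLS = 10

def run_trial(ask_model, mujoco_mount_sim) -> bool:
    """Let the model iterate on its design; the trial passes iff some
    simulator call returns exactly the single status "object_held"."""
    feedback = None
    for _ in range(MAX_SIM_CALLS):
        scad_source = ask_model(feedback)         # model writes/revises OpenSCAD
        statuses = mujoco_mount_sim(scad_source)  # set of statuses from the sim
        if statuses == {"object_held"}:
            return True
        feedback = statuses                       # sim results go back to the model
    return False
```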
I also tried two other objects: a model of a pan from Amazon Berkeley Objects and a mug from Google Scanned Objects. I evaluated 7 LLMs and did 10 trials per (LLM x object) pair.
Code for this project is here.
| Model | Pump (passes/10) | Mug (passes/10) | Pan (passes/10) | Pump examples | Mug examples | Pan examples | Total cost | Total time |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 10 | 10 | 10 | pass | pass | pass | $41.11 | 5.2h |
| Gemini 3 Flash | 6 | 4 | 5 | pass, fail | pass, fail | pass, fail | $4.01 | 3.7h |
| Gemini 3.1 Pro | 5 | 6 | 4 | pass, fail | pass, fail | pass, fail | $7.06 | 3.0h |
| GLM-4.6V | 1 | 0 | 1 | pass, fail | fail | pass, fail | $1.49 | 6.3h |
| GPT-5.2 | 8 | 9 | 9 | pass, fail | pass, fail | pass, fail | $12.15 | 7.7h |
| Kimi K2.5 | 4 | 2 | 0 | pass, fail | pass, fail | fail | $3.39 | 8.5h |
| Qwen 3.5 397B | 2 | 1 | 0 | pass, fail | pass, fail | fail | $2.64 | 5.6h |
Claude Opus 4.6 is best at this task. In the table I only evaluate whether the mount held the object, and Claude gets perfect marks. Subjectively, most of its designs are not directly usable, but almost all are close: they are sometimes too large or too small, would be too weak if 3D-printed in plastic, or are random shapes that coincidentally work. This capability seems new; I did a smaller run with Claude Opus 4.1 and it failed 100% of the time.
GPT-5.2 has a good pass rate, but its designs are subjectively quite bad and almost all would need to be completely reworked: they tend to have redundant parts, and are often too weak or have “floating” components that are physically impossible (I could check for this but wanted to avoid scope creep).
Gemini 3.1 Pro and 3 Flash sometimes produce great designs. For example, here is Flash one-shotting a usable design for 2.5 cents. However, these models often end up in loops or fail to make use of all 10 turns. Other times they produce garbled designs similar to GPT-5.2. They often act erratically, producing random words in their commentary. Pro and Flash perform and behave very similarly.
All the open-weight models do poorly. Even in cases where they technically hold the object, their designs are subjectively quite bad: often overly simplistic, working only by accident, or containing floating parts. Kimi K2.5 is noticeably closer to producing usable designs than the other two, however.
## Under the hood
Creating the simulator and building the agentic harness was the bulk of the work on this project. MuJoCo is complex and powerful and often has surprising behaviors. LLMs often make mistakes when calling tools and I had to carefully validate the simulator input to distinguish tool call failures from legitimate bugs in my code.
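One way to draw that line, sketched below with illustrative field names (the real tool schema may differ): input the model got wrong becomes a tool result the model can react to, while anything else propagates as a genuine bug in the harness.

```python
# Sketch of separating model tool-call mistakes from harness bugs.
class ToolCallError(Exception):
    """Bad input from the model: report it back as a tool result, don't crash."""

def validate_sim_args(args) -> str:
    # Field name "scad_source" is illustrative, not the actual tool schema.
    if not isinstance(args, dict):
        raise ToolCallError("arguments must be a JSON object")
    source = args.get("scad_source")
    if not isinstance(source, str) or not source.strip():
        raise ToolCallError("missing or empty 'scad_source'")
    return source

def handle_tool_call(args) -> dict:
    try:
        source = validate_sim_args(args)
    except ToolCallError as e:
        return {"status": "tool_error", "message": str(e)}  # model's mistake
    # Anything that raises past this point is a legitimate bug in my code.
    return {"status": "ok", "source": source}
```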
One surprising bottleneck was convex decomposition. MuJoCo can only simulate objects composed of convex components, and so concave geometries have to be broken down into multiple convex geoms. The SOTA method for this is CoACD, and it’s quite slow. Generating the above table took 15.9 hours of CoACD processing time on my potato-class Hetzner server (almost as long as the 21.8 hours spent calling the model providers).
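One mitigation (not necessarily what this project does) is to cache decompositions keyed by mesh content hash, so identical meshes, such as the fixed object scans reused across trials, are only decomposed once. A sketch with the actual CoACD call stubbed out as a `decompose` argument:

```python
import hashlib
import json
from pathlib import Path

def cached_decompose(mesh_path: str, cache_dir: str, decompose) -> list:
    """Run convex decomposition at most once per unique mesh.

    `decompose` stands in for the real (slow) CoACD call; results are
    cached on disk keyed by a SHA-256 hash of the mesh file's bytes.
    """
    data = Path(mesh_path).read_bytes()
    key = hashlib.sha256(data).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())  # cache hit: skip CoACD
    parts = decompose(data)  # slow: actual convex decomposition
    cache_file.write_text(json.dumps(parts))
    return parts
```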
## Future work
I built a simple custom agent harness for this, but it’s possible I could get better results by using an off-the-shelf harness like Codex or Claude Code and turning my MuJoCo simulator into a CLI or MCP tool. Those harnesses provide a better system prompt and tools like memory to help keep the agent on track.
Including more objects would make this into a better, more realistic eval. Amazon Berkeley Objects and Google Scanned Objects have ~8k and ~1k 3D models respectively, and although some are irrelevant (e.g. couches), the set of objects could be expanded without much effort.
The biggest thing that could be improved is the grading of results, by checking many aspects of each design and scoring them on a rubric. Here’s a non-exhaustive list of additional things that could be checked:
- Does the mount have “floating” parts, or multiple parts that would have to be attached to the wall separately?
- Does the mount still hold the object if the object is perturbed?
- Can the object be easily lifted off the mount? (Try moving the object along several reasonable trajectories and see if the object hits the mount).
- Can the object be easily grabbed while in the mount? (Define an exclusion zone around the point where one would grab the object and see if it intersects the mount).
- Does the mount have a big enough contact patch with the wall?
- Does the mount intersect the wall?
- How much material does the mount use? (Actually slice it with PrusaSlicer and check the estimated filament usage).
- Does the mount fit in the build volume of a typical 3D printer?
- Are there thin sections of the model which would be weak when printed?
- How much weight does the model hold before deforming, using a finite element analysis?
- Can the screw / nail holes in the mount be accessed by a screwdriver / hammer? (Define exclusion zones around the holes and see if they intersect the mount).
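The checks above could be rolled into a single weighted rubric score per design; the check names and weights below are made up for illustration and are not part of the actual eval:

```python
# Illustrative rubric scorer; the hard "holds the object" requirement
# is weighted heaviest, usability checks less so.
RUBRIC = {
    "holds_object": 4.0,
    "no_floating_parts": 2.0,
    "survives_perturbation": 2.0,
    "object_removable": 1.0,
    "fits_build_volume": 1.0,
}

def score_design(checks: dict) -> float:
    """Weighted fraction of rubric checks passed, in [0, 1].

    `checks` maps check name -> bool; missing checks count as failed.
    """
    total = sum(RUBRIC.values())
    earned = sum(w for name, w in RUBRIC.items() if checks.get(name, False))
    return earned / total
```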
