ChatGPT 3.5, and even o1-preview, would sometimes make programs worse than they started, usually by making math errors that changed the real-valued program. That rarely happens with o1 and basically not at all with o3, so that's a big improvement. Also, o3-mini-high basically crushes the "easy" problems like expq2, at times getting better results than Herbie. For example, expq2 is the program:
\[ \frac{e^x}{e^x - 1} \]
Herbie has the uninspired solution
\[ \frac{e^x}{\mathsf{expm1}(x)} \]
It's fine, but o3-mini-high has the much better
\[ \frac{-1}{\mathsf{expm1}(-x)} \]
It's faster (you don't need the exp call) and also more accurate, most likely because of overflow: for large x, e^x overflows to infinity, so Herbie's version computes inf/inf and gets NaN, while expm1(-x) just approaches -1 and the o3 version returns the correct limit of 1. Whatever the full story, it's clearly the best solution. So well done, o3-mini-high! I'll add that in my earlier blog post there was something similar with eccentricity, but Herbie now achieves, I think, a similarly good result to o3-mini-high.
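To make that concrete, here's a minimal sketch in Python comparing the three forms (the names naive, herbie, and o3 are mine; note that Python's math.exp raises OverflowError where C's exp would return infinity):

```python
import math

def naive(x):    # e^x / (e^x - 1): cancellation near 0; exp overflows for large x
    return math.exp(x) / (math.exp(x) - 1.0)

def herbie(x):   # e^x / expm1(x): fixes the cancellation, still needs exp(x)
    return math.exp(x) / math.expm1(x)

def o3(x):       # -1 / expm1(-x): no exp call; expm1(-x) -> -1 as x grows
    return -1.0 / math.expm1(-x)

for x in (1e-10, 1.0, 710.0):  # 710 is past exp's overflow threshold (~709.78)
    for name, f in (("naive", naive), ("herbie", herbie), ("o3", o3)):
        try:
            print(f"{name}({x}) = {f(x)}")
        except OverflowError:
            # Python raises here; C's exp would return inf, giving inf/inf = NaN
            print(f"{name}({x}) overflows in exp")
```

At x = 1e-10 the naive form loses about eight digits to cancellation while the other two agree; at x = 710 only the o3 form survives, returning exactly 1.0.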
Some results could also be improved with minor work. For example, in csqrt, Herbie's weakest example, o3-mini-high got basically the right answer but fumbled by forgetting to use hypot; if it hadn't forgotten, it actually would have beaten Herbie and achieved a basically perfect result. But it's possible that prompting it with a list of supported functions would help.
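For context on why hypot matters here: it computes sqrt(x*x + y*y), which csqrt needs for the magnitude of a complex number, without the intermediate squares overflowing. A tiny illustration:

```python
import math

a, b = 1e200, 1e200          # components of a complex number z = a + b*i

# Naive magnitude: a*a overflows to inf, so the whole result is inf.
naive_mag = math.sqrt(a * a + b * b)

# hypot rescales internally, so the true magnitude survives.
hypot_mag = math.hypot(a, b)

print(naive_mag)   # inf
print(hypot_mag)   # 1.4142135623730952e+200
```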
On the other hand, on the three most "real-world" examples, mixing, maksimov2, and jcobi, all of which are drawn from real code bases or papers rather than textbooks, o3-mini-high really struggled. I think this is a case where feedback from actually evaluating candidate programs and seeing how they do would be very valuable, and the LLMs don't have that; a sketch of what that loop could measure follows.
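Here's my own toy harness, not Herbie's actual machinery, showing the shape of that feedback: score a candidate rewrite against a high-precision reference (via mpmath here) on sampled inputs, the kind of error measurement Herbie performs internally:

```python
import math
import random
from mpmath import mp, mpf

mp.dps = 50  # reference precision well beyond float64

def candidate(x):                 # the float64 rewrite being scored
    return -1.0 / math.expm1(-x)

def reference(x):                 # e^x / (e^x - 1) in 50-digit arithmetic
    ex = mp.exp(mpf(x))
    return ex / (ex - 1)

errs = []
for _ in range(1000):
    x = random.uniform(-20.0, 20.0)
    if x == 0.0:                  # both forms are singular at 0
        continue
    ref = float(reference(x))
    got = candidate(x)
    errs.append(abs(got - ref) / abs(ref))

print("max relative error:", max(errs))
```

A score like this, fed back after each attempt, is exactly the signal a fine-tuned or RL-trained model could learn from.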
In short, I think o3-mini-high is close to being competitive with Herbie, at least on easier textbook examples, and it's possible that some kind of fine-tuned or reinforcement-learning setup would actually beat Herbie, at least when it comes to maximum accuracy. Of course, Herbie is still faster to run, and it does produce a Pareto curve of outputs, and so on, but LLMs are close to competitive on this task.