Gemini 3 Deep Think drew me a good SVG of a pelican riding a bicycle

simonwillison.net

130 points by stared 19 hours ago · 61 comments

segmondy 18 hours ago

For those claiming they rigged it: do you have any concrete evidence? What if the models have just gotten really good?

I just asked Gemini Pro to generate an SVG of an octopus dunking a basketball and it did a great job, and that's not even the Deep Think model. Then I tried "generate an SVG of a raccoon at a beach drinking a beer". You can go try this out yourself: ask it to generate anything you want in SVG. Use your imagination.

Rant: this is why AI is going to take over; folks are not even trying.

  • WarmWash 18 hours ago

    Simon has a private set of SVG tests he uses as well. He said that the private ones were just as impressive.

  • JumpCrisscross 18 hours ago

    > What if the models have just gotten really good?

    Kagi Assistant remains my main way of interacting with AI. One of its benefits is you're encouraged to try different models.

    The heterogeneity in competence, particularly per unit of time, is growing rapidly. If I'm extrapolating image-creation capabilities from Claude, I'm going to underestimate what Gemini can do without fuckery. Likewise, if I'm using Grok all day, Gemini and Claude will seem unbelievably competent when it comes to deep research.

  • colecut 18 hours ago

    and it will be folks using AI taking over for at least a while...

    Some people try, most people don't.

    AI makes doing almost anything easier for the people who do.

    Despite the prophesied near-term obliteration of white collar work, I've never felt luckier to work in software.

  • raincole 17 hours ago

    Every bit of improvement in AI ability will have its corresponding denial phrase. Some people still think AI can't generate the correct number of fingers today.

    • halJordan 16 hours ago

      I love to hate it when someone unironically thinks asking an LLM how many letters are in a word is a good test.

      • Jerrrrrrrry 15 hours ago

        It is a good test now, for reasoning models.

        It was a terrible test for pure tokenized models, because the logit that carries the carry digit during summation has a decent chance of getting lost.

        SOTA models should reason to generate a function that returns the count of a given character, evaluate the function with tests, and use it for the output.
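
        A minimal sketch of that tool-use pattern (the function name and the self-checks are illustrative, not taken from any particular model's reasoning trace):

```python
def count_char(word: str, char: str) -> int:
    """Count occurrences of a character by iterating over the raw
    string, sidestepping tokenization entirely."""
    return sum(1 for c in word if c == char)

# Self-checks a reasoning model could run before committing to an answer:
assert count_char("strawberry", "r") == 3
assert count_char("mississippi", "s") == 4
```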

  • irthomasthomas 17 hours ago

    Why frame it as rigging? I assume they would teach the models to improve on tasks the public find interesting. Then we just have to come up with more challenges for it.

  • bayindirh 17 hours ago

    > For those claiming they rigged it.

    I don't think they "rigged" it, but it might have been given a bit of an extra push, since this benchmark has been going for a very long time now.

    Another benchmark is running at [0], and it's pretty interesting. A model with a perfect score "borks" in the next iteration, for example.

    > Rant: This is why AI is going to take over, folks are not even trying the least.

    It might be drawing things all right, at least in some cases. I seldom use it except when hours of my own research don't take me where I want to go, and guess what? The AI can't get there either. It hallucinates things, makes up stuff, etc. For a couple of things I asked, it managed to find a single reference, and that was exactly the thing I was looking for; so in my case it works only rarely.

    Rant: this is why people are delusional. They test the happy path and claim it knows all the paths, and then some.

    [0]: https://clocks.brianmoore.com/

  • ej88 16 hours ago

    "not enough people are emotionally prepared for if it’s not a bubble"

  • dw_arthur 17 hours ago

    Everyone should have their own private evals for models. If I ask a question and a model flat-out gets it wrong, sometimes I will put it in my test-question bank.

vessenes 19 hours ago

Simon notes this benchmark is win-win, since he loves pictures of pelicans riding bicycles — if they spend time benchmaxxing it’s like free pelicans for him.

He originally promised to generate a bunch more animals when we got a “good” pelican. This is not a good pelican. This is an OUTSTANDING pelican, a great bicycle, and it even has a little sun ray over the ocean marked out. I’d like to see more animals please Simon!

  • hnuser123456 18 hours ago

    It is visually outstanding. The only thing that sticks out to me is that the steering column bends forward toward the ground (negative trail), which would make it oversteer rather than self-stabilize. Interestingly, there's a slight positive-trail bend in the second one, though.
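
    For reference, mechanical trail follows from wheel radius, head angle, and fork offset via the standard bicycle-geometry formula; the numbers below are typical road-bike values, not measurements taken from the drawing:

```python
import math

def trail(wheel_radius: float, head_angle_deg: float, fork_offset: float) -> float:
    """Mechanical trail: horizontal distance from where the steering
    axis meets the ground to the tire contact patch. Positive trail
    makes the front wheel self-center; negative trail oversteers."""
    theta = math.radians(head_angle_deg)  # head angle measured from horizontal
    return (wheel_radius * math.cos(theta) - fork_offset) / math.sin(theta)

# Typical road bike: ~0.34 m wheel radius, 73-degree head angle, 45 mm offset
print(trail(0.34, 73.0, 0.045))  # positive -> self-stabilizing
# An offset larger than R*cos(theta) flips the sign, like the drawn fork:
print(trail(0.34, 73.0, 0.120))  # negative -> oversteer
```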

    • ManuelKiessling 16 hours ago

      > the steering column bends out forwards towards the ground

      It’s really fascinating in a way because I would have thought that this would be one of the least difficult parts of the challenge.

  • romanhn 18 hours ago

    Agreed, good is quite an understatement. Every item is drawn superbly, and the basket with the fish is just great. Feels like a big jump over the other models (though granted, this is such a known "benchmark" by now, it's likely gamed to some extent).

    • robocat 12 hours ago

      If we took a random sample of people from around the world, I wonder how long it would take them to be trained to create a better SVG.

  • alterom 18 hours ago

    > a great bicycle

    It's not. Sorry.

    Go look at some real bicycles for reference.

    • sdenton4 18 hours ago

      This is a very reasonable drawing of a bicycle. It has a solid rear triangle, and forward swept front fork, which is an important detail for actually being able to steer the bike. The drivetrain is single speed, but that's fine, and the wheels are radially laced, which is also fine: both of those simplified details are things which occur in real bicycles.

      • alterom 13 hours ago

        >This is a very reasonable drawing of a bicycle

        Reasonable? Sure. Great? Absolutely not.

        Let's not move the goal posts here.

        > and forward swept front fork, which is an important detail for actually being able to steer the bike.

        ...but no crown or any sort of indication that the fork is separate from the frame, making it impossible to steer. Yay.

        > the drivetrain is single speed, but that's fine, and the wheels are radially laced, which is also fine: both of those simplified details are things which occur in real bicycles.

        Sure! Now do the bicycle saddle.

    • losthubble 18 hours ago

      What part is missing? It appears to have all the core parts of a bicycle to me?

      • alterom 13 hours ago

        > What part is missing? It appears to have all the core parts of a bicycle to me?

        * I am not an expert in pelican anatomy, but it appears that the bicycle's saddle, if present, would be very deeply inside this pelican (try drawing it onto the frame, and see what you get). Let's be kind, and assume that the seat is simply missing.

        * The crown on the fork[1] is missing. The fork is not separate from the frame; so you can't turn the front wheel.

        * Neither of the pelican's feet is on the pedals (the left foot appears to be missing altogether).

        * The wheels are way too small for that frame.

        Note: Brakes and gears are nowhere to be seen, but OK, let's just assume it's a fixie (or a single-speed with a coaster brake). Let's also assume the red thing on the basket is there for decoration.

        Also, the spokes on a typical bicycle wheel follow a cross-lacing pattern [2], at an angle to the radial direction, and you can see them crossing each other. But OK, let's let that slide.

        My point is: sure, this thing at least passes as a bicycle. But it's not a good bicycle drawing.

        [1] https://en.wikipedia.org/wiki/Bicycle_fork

        [2] https://spokecalc.io/how-to-lace-a-wheel.php

        • Firerouge 13 hours ago

          It's fun when a subject matter expert shows up and blows your mind with their level of knowledge on normally unnoticed intricate details that go into everyday things.

        • HDThoreaun 9 hours ago

          I just think this is insanely pedantic. Most art is not photorealistic, and in fact imo having a crown on the fork may very well make this image worse by taking away from the aesthetic.

          • alterom 7 hours ago

            Sure, if the "uncanny valley" aesthetic is what you're going for, then by all means, tell the machine to shrink the wheels further.

            It's just not what makes a great bicycle picture.

            Also, that bicycle's seatpost must be outright penetrating that pelican's cloaca.

            You can say I'm being anal about it, and you'll have a point. But it's not that pedantic to notice it.

            • HDThoreaun 7 hours ago

              I think the seat post and the angle of the fork are issues but I just don’t think that the greatness of an image of a bicycle has anything to do with whether you can tell that the fork is a separate piece of metal from the frame. The prompt here isn’t about making an image so that an alien who has never seen a bike can make one, it’s about people recognizing that it’s a normal looking bike and the emotional response that comes from that. The emotional response the vast majority of people have looking at this isn’t “that bike won’t ride”

    • romanhn 18 hours ago

      A better comparison would be the monstrosities generated by older models.

      • alterom 14 hours ago

        Those were shit.

        This one is somewhat OK. Not good, merely passing for a bicycle-shaped object.

rustyhancock 19 hours ago

The competition between models is so intense right now that they are definitely benchmaxxing pelican-on-bike SVGs and Will Smith spaghetti-dinner videos.

  • bonesss 18 hours ago

    Parallel hypothesis: the competition between models is so intense that any high-engagement, high-relevance web discussion about any LLM/AI generation is going to hit the self-guided, self-reinforced model training and result in de facto benchmaxxing.

    Which is only to say: if we HN-front-page it, they will come (generate).

  • staredOP 18 hours ago

    There was Lenna for digital image compression (https://en.wikipedia.org/wiki/Lenna).

    A pelican on a bike is SFW, inclusive, yet cool.

    It is not a full benchmark - rather a litmus test.

  • thatguysaguy 18 hours ago

    You can just try other svgs, I got some pretty good ones.

    (*Disclaimer: I work for Google, but also I have zero idea about what they trained deepthink on)

  • bayindirh 19 hours ago

    So, again, when the indicator becomes a target, it stops being a good indicator.

    • JumpCrisscross 18 hours ago

      > when the indicator becomes a target, it stops being a good indicator

      But it's still a fair target. Unless it's hard coded into Gemini 3 DT, for which we have no evidence and decent evidence against, I'd say it's still informative.

    • kakugawa 18 hours ago

      That's how you know you've made it: when your pet benchmark becomes a target.

    • rcbdev 18 hours ago

      Goodhart's law in action.

  • yieldcrv 18 hours ago

    Note that, this benchmark aside, they've gotten really good at SVGs. I used to rely on the Noun Project for icons, and sometimes various libraries, but now coding agents just synthesize an SVG tag in the code and draw all the icons.
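
    The pattern being described, an agent emitting a small inline SVG instead of pulling in an icon library, looks roughly like this (a hand-written illustration, not actual agent output):

```python
# A minimal inline "search" icon of the kind an agent might emit in
# place of an icon-library import: one circle plus one line for the handle.
SEARCH_ICON = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" '
    'fill="none" stroke="currentColor" stroke-width="2">'
    '<circle cx="10" cy="10" r="6"/>'
    '<line x1="14.5" y1="14.5" x2="21" y2="21"/>'
    '</svg>'
)
```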

rcarmo 19 hours ago

I don't think this is a good "benchmark" anymore. It's probably on everyone's training set by now.

  • staticassertion 18 hours ago

    I think it could still be an interesting benchmark. Like, assuming AI companies are genuinely trying to solve this pelican problem, how well do they solve it? That seems valid, and the assumption here is that the approach they take could generalize, which seems plausible.

  • HDThoreaun 9 hours ago

    The point of this benchmark is that making decent SVG art is actually useful. Simon has private image prompts he uses; since he didn't say Gemini failed at those, it is reasonable to assume those were also successful.

WarmWash 18 hours ago

Are AI labs training on the bike Pelican?

From the blog:

>The strongest argument is that they would get caught. If a model finally comes out that produces an excellent SVG of a pelican riding a bicycle you can bet I’m going to test it on all manner of creatures riding all sorts of transportation devices. If those are notably worse it’s going to be pretty obvious what happened.

He mentioned in the Deep Think thread the other day that his secret test set also was impressive.

fulafel 6 hours ago

Has anyone benchmarked humans on this task? Without visual feedback.

alestainer 18 hours ago

Interesting thing: I've got my own internal request that is similar to this pelican, and there has been zero progress on it in the past ~2 years. That might have at least a couple of explanations:

1. Spillage into the pre-training: some real artist had drawn a pelican riding a bicycle.

2. Seeing it treated as an important proxy for model intelligence in the training data might affect the allocation of compute toward solving this problem, either through engineers or through the model itself finding the texts about this challenge.

tylervigen 18 hours ago

That’s among the most artistic SVGs I’ve ever seen, period.

Springtime 18 hours ago

I have wondered if, with these tests, it'll reach a point where online models cheat by generating a line-art raster reference and then behind the scenes deciding how to vectorize it in the most minimalist way (e.g. using strokes and shape elements, etc., rather than naively using path outlines for all forms).

  • simonw 16 hours ago

    This Deep Think one was so good that I did get suspicious that maybe it was at least rendering the SVG to an image and then "looking" at the image and tweaking it over a few iterations.

    But the reasoning trace doesn't hint at that and looks legit to me: https://gist.github.com/simonw/7e317ebb5cf8e75b2fcec4d0694a8...

    I also asked Deep Think what tools it has access to and it has Python and Bash but no internet access, and as far as I can tell that environment doesn't have any libraries or tools installed that can render an SVG to an image format that it could view.

  • taberiand 18 hours ago

    Is that cheating, or is that just working smarter not harder?

    • Springtime 18 hours ago

      The interesting aspect of the ongoing tests, I feel, is seeing how models can plan out an image directly using SVG primitives solely through reasoning (code-to-code). If they have a reference, then it's a different type of challenge (optimizing for a trace).

aidos 19 hours ago

The bicycles are getting pretty cyclable now. I’m enjoying this pelican that’s already sliced and ready to bbq.

bfung 18 hours ago

In the spirit of the Winter Olympics, I vote "Lion on a bobsled" for the next benchmark. :)

stephc_int13 18 hours ago

Many tests are asymmetrical: they can reliably show an issue or abnormality, but they are a lot less reliable on the other side of the curve.

manojlds 18 hours ago

It's funny how I can know where the post is from just by looking at the title (and it's not just about pelicans)

throwaway333444 19 hours ago

Since it’s a* FAQ… Also that pelican is pretty fly

  • bstsb 18 hours ago

    read it aloud. “since it’s an FAQ”, where FAQ is pronounced “eff-ay-queue”

pcmaffey 10 hours ago

Now animate it with SMIL

kittbuilds 18 hours ago

SVG generation is a surprisingly good benchmark for spatial reasoning because it forces the model to work in a coordinate system with no visual feedback loop. You have to hold a mental model of what the output looks like while emitting raw path data and transforms. It's closer to how a blind sculptor works than how an image diffusion model works.

What I find interesting is that Deep Think's chain-of-thought approach helps here — you can actually watch it reason about where the pedals should be relative to the wheels, which is something that trips up models that try to emit the SVG in one shot. The deliberative process maps well to compositional visual tasks.
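
As an illustration of the coordinate bookkeeping involved, placing pedals relative to the bottom bracket is pure trigonometry that the model must carry out with no visual feedback (a hypothetical helper, not Deep Think's actual process):

```python
import math

def pedal_positions(bb_x, bb_y, crank_len, angle_deg):
    """Pedals sit at opposite ends of the crank, rotated about the
    bottom bracket: the relation a model must hold in its head while
    emitting raw SVG path and transform data."""
    a = math.radians(angle_deg)
    dx, dy = crank_len * math.cos(a), crank_len * math.sin(a)
    return (bb_x + dx, bb_y + dy), (bb_x - dx, bb_y - dy)

# Bottom bracket at (120, 150) in SVG user units, 25-unit cranks at 30 degrees:
front, back = pedal_positions(120, 150, 25, 30)
pedals_svg = "".join(
    f'<circle cx="{x:.1f}" cy="{y:.1f}" r="4"/>' for x, y in (front, back)
)
```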

  • actsasbuffoon 10 hours ago

    Yeah, spatial reasoning has been a weak spot for LLMs. I’m actually building a new code exercise for my company right now where the candidate is allowed to use any AI they want, but it involves spatial reasoning. I ran Opus 4.6 and Codex 5.3 (xhigh) on it and both came back with passable answers, but I was able to double the score doing it by hand.

    It’ll be interesting to see what happens if a candidate ever shows up and wants to use Deep Think. Might blow right through my exercise.

bulletsvshumans 19 hours ago

They rigged it.
