Ask HN: Why can't image generation models spell?

40 points by adam_gyroscope 2 years ago · 61 comments


I've tried using the various state-of-the-art commercially available models to generate artwork for my kid's room, with his name in letters. None can render an image with his name properly spelled, and even after long chats where I explain the problem, the models fail to spell correctly. It's a common five-letter name. Why can't the models spell?

kelseyfrog 2 years ago

Because tokenization is an imperfect way to encode the required information.

Think about how we chunk words[1] and recognize them. We have whole-word (shape) recognition, morpheme recognition, and spelling (letter-by-letter chunking). Text models receive tokens (akin to morpheme chunks) and don't have access to the underlying letters (spelling data) unless that was part of their training. For the most part, individual letters, something I think we can agree is necessary for rendering text, are not accessible.
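
To make that concrete, here is a minimal sketch (assuming the tiktoken library and OpenAI's cl100k_base encoding; the exact chunking varies by tokenizer) of what a text model actually receives:

    # The model sees opaque integer IDs for multi-letter chunks, not the
    # individual characters that make up the word.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    word = "gyroscope"
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]

    print(ids)     # a short list of subword token IDs
    print(pieces)  # multi-letter chunks (tokenizer-dependent), not letters

Nothing in those IDs says which letters they contain; the model can only infer spellings statistically.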

An appropriate analogy is an illiterate artist: someone who can hear chunks of words and recognize them verbally, asked to do their best job at painting text. They can piece together letter clusters based on inference, but they cannot spell.

1. https://en.m.wikipedia.org/wiki/Chunking_(psychology)

  • gwern 2 years ago

    Correct. Most image models to date have used very small text models, which are unable to spell many words (https://arxiv.org/abs/2108.11193) - if you cannot spell a word letter by letter because you don't know what the letters are, how are you going to generate a pixel image of each of those letters...? (It is not as if letters were intrinsically hard. GANs were generating letters easily like 7 years ago.) Their general inability is due to BPEs and non-character-based tokenization.

    This was noted in the DALL-E 2 paper (https://arxiv.org/pdf/2204.06125.pdf#page=16&org=openai), and it can be experimentally established by swapping out even a very large LLM like PaLM for a humble, small, weak, but not badly-tokenized ByT5 and noting the instant solution of the 'problem' (https://arxiv.org/abs/2212.10562#google). Skip to the appendix of the second paper if you have any doubts about the difference that switching to ByT5 makes in terms of spelling. The solution is just scaling up the LLMs (which is necessary to get better instruction-following and image quality in general, quite aside from spelling inside images) and eventually switching to character tokenization.* See, as always, https://gwern.net/gpt-3#bpes
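
    For anyone who wants to see the difference directly, a minimal sketch (assuming the Hugging Face transformers library and the google/byt5-small checkpoint):

        # ByT5 tokenizes at the byte level: each UTF-8 byte b becomes id b + 3
        # (ids 0-2 are reserved for special tokens), so every letter of every
        # word is individually visible to the model.
        from transformers import AutoTokenizer

        byt5 = AutoTokenizer.from_pretrained("google/byt5-small")

        ids = byt5("spelling")["input_ids"]
        print([chr(i - 3) for i in ids[:-1]])  # drop the trailing </s>; one token per letter

    A BPE tokenizer would instead hand the model one or two opaque IDs for the same word, with the letters recoverable only if they happened to be memorized during training.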

    (Hands and cats, however, are just genuinely difficult and require biting the bullet of scaling. And I wonder if it will take video supervision to truly solve them?)

    * on a recent-news note, I suspect Claude-3 may have done something interesting with tokenization - possibly but not necessarily switching to character/byte encodings - and this is part of why it confabulates in ways unusual for ChatGPT but also is a lot more pleasant to use.

  • bonzini 2 years ago

    > An appropriate analogy is an illiterate artist.

    Or tattoos with Chinese/Japanese characters. More often than not they use the "wrong" character (even if it is technically correct), and the calligraphy is not considered artistic by native speakers.

  • barkingcat 2 years ago

    A good analogy is asking an English-speaking tattoo artist who doesn't know any Japanese to tattoo a kanji character.

    The tattoo artist knows roughly what the character's shape is, but has no idea how to write it.

    This leads to hilarious tattoos.

  • radarsat1 2 years ago

    > Text models receive tokens (akin to morpheme chunks) and don't have access to the underlying letters

    Heh, this gives me the idea of training an image generator not on tokens but on rendered text of the prompt in some 8x8 font. That would be a fun experiment ;)
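
    A rough sketch of the conditioning side of that experiment (hypothetical; assuming Pillow and NumPy, with PIL's small built-in bitmap font standing in for a true 8x8 one, and "JAMES" standing in for the actual prompt text):

        # Render the prompt into a binary bitmap that could be fed to the
        # generator as an extra conditioning channel instead of token IDs.
        import numpy as np
        from PIL import Image, ImageDraw

        def prompt_to_bitmap(prompt: str, width: int = 512, height: int = 64) -> np.ndarray:
            img = Image.new("L", (width, height), color=0)
            ImageDraw.Draw(img).text((0, 0), prompt, fill=255)  # default bitmap font
            return (np.asarray(img) > 127).astype(np.float32)

        cond = prompt_to_bitmap('a banner that says "JAMES"')
        print(cond.shape)  # (64, 512): a spatial map the generator could attend to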

    • kelseyfrog 2 years ago

      That would be an interesting experiment!

      I'd also be curious how text rendering performance changes if the tokenizer could be made aware of quoted strings and tokenized their contents as individual characters instead. Surely someone has tried this (right?), but I haven't seen it in the literature.
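
      A toy sketch of that scheme (hypothetical; bpe_encode stands in for whatever subword tokenizer the model normally uses):

          import re

          def hybrid_tokenize(prompt, bpe_encode):
              # Odd-indexed pieces from the split are the quoted spans.
              parts = re.split(r'"([^"]*)"', prompt)
              tokens = []
              for i, part in enumerate(parts):
                  if i % 2 == 1:
                      tokens.extend(list(part))        # character-level inside quotes
                  else:
                      tokens.extend(bpe_encode(part))  # normal subword tokens outside
              return tokens

          # str.split stands in for a real BPE encoder here:
          print(hybrid_tokenize('a sign that says "JAMES"', str.split))
          # ['a', 'sign', 'that', 'says', 'J', 'A', 'M', 'E', 'S']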

candrewlee14 2 years ago

Very interesting to me that our dreams make some of the same mistakes. Some of the usual reality checks to know if you’re dreaming:

- looking at your hands

- looking at clocks

- trying to read

It’s funny that diffusion models often make those exact same mistakes. There’s clearly a similar failure mode where both are drawing from a distribution and losing fine details. Has this been studied?

  • krapp 2 years ago

    I think this is a phenomenon with a lot of variance. I'm able to read in my dreams and look at clocks. I don't recall whether or not I've ever looked at my hands in my dreams, but I also don't recall seeing sixteen fingers on my hand. What I can't ever seem to do is turn lights on or off, which is apparently a common thing.

    I don't know that it's necessarily the case that there's a strong relationship between the way these models work and the way human brains, particularly dreaming, work.

    • card_zero 2 years ago

      I theorise that dreams are more garbled than we realise, filled with pseudo-language and incoherent vague impressions, and ambiguous images, and that memories of dreams are actually post-hoc reconstructions formed at the point of waking, and are not to be trusted. So maybe in the dream there was a sense of a presence and a nearby blob with appendages that could be a hand, but with uncountable fingers, and it could also be an octopus, and it moves restlessly around: on waking, this resolves into a much more sensible memory of a person waving a hand.

      The similarity of things observed in dreams to AI is then because both procedures involve constructing coherence out of noise. "Gradient descent" or something, I wouldn't really know about that. Pareidolia.

      • krapp 2 years ago

        >I theorise that dreams are more garbled than we realise, filled with pseudo-language and incoherent vague impressions, and ambiguous images, and that memories of dreams are actually post-hoc reconstructions formed at the point of waking, and are not to be trusted.

        This seems more like a post-hoc rationalization than a theory. If you can't trust memories of dreams, how can you even know you dream at all? What do you even base your assumptions on?

        • bigfudge 2 years ago

          I think it's a fair assumption that you can't trust memories of dreams. Heck, you can't even trust memories of eyewitnesses very much. We are all doing a -lot- of active inference and post-hoc reconstruction all the time. We just don't notice the gaps because 1. we are good at it and are often right, and 2. people rarely call you out on mistakes even if they're obvious, because it's impolite or they are not sure either.

        • card_zero 2 years ago

          True. To be honest this is just based on having spent a lot of time half-asleep (for science), and trying to remember what's going on, and imagining that I'm better at it than you. I may be dead wrong. I get the impression of a process of rationalization at the point of waking up (which I like to do slowly). But then, you got an equally persuasive impression of a vivid detailed dream, probably, so at this point it's just one impression against another, that's qualia for you.

      • Wowfunhappy 2 years ago

        > I theorise that dreams are more garbled than we realise [...] and that memories of dreams are actually post-hoc reconstructions formed at the point of waking

        I theorize that our experience of life is more garbled than we realize, and that memories of life are actually post-hoc reconstructions formed at the point of recall.

        • card_zero 2 years ago

          Yeah, that sounds about right. Then we get into the territory of the Mandela effect and false memories and suggestibility, and the idea that memories in general are fabrications, or guesses, to explain the facts (including neurological facts like sensations and emotions?) with a narrative, and - yeah, we can't process the world without imposing theories on it: for instance the raw physical world has no definite boundaries between objects, those are just ideas to organize it with.

    • dylan604 2 years ago

      > don't recall seeing sixteen fingers on my hand.

      to be fair, I don't recall that from generative images either. it's more of an 11 or 12 fingers type of situation

      • dragonwriter 2 years ago

        Extra hands as well as 4-7 fingers per hand are not that uncommon, so 16 total fingers is not unreasonable, though it's a bit much (as is 11-12) for one hand.

        • dylan604 2 years ago

          no, i meant 11-12 in total, suggesting that i've typically seen 6 fingers per hand rather than 7 or 8. AI is clearly trying to mess with Iñigo Montoya

  • zeta0134 2 years ago

    I have never once encountered a piano keyboard in a dream with enough octaves to actually play. My dream self will now reliably recognize this and voice frustration at being stuck in a dream, unable to practice.

    • hn_throwaway_99 2 years ago

      > My dream self will now reliably recognize this and voice frustration at being stuck in a dream, unable to practice.

      If you recognize you're actually dreaming, though, is this a lucid dream?

      I've only had a couple lucid dreams in my life, and I thought they were some of the most awesome things I've ever experienced. Real life holodeck!

      • TacticalCoder 2 years ago

        > If you recognize you're actually dreaming, though, is this a lucid dream?

        I don't know, but my absolute favorite when I recognize it's a dream (and I've "trained" myself to do that) is to say: "Nice, this is a dream, so I can fly!". It's an awesome, awesome, awesome feeling. Usually doesn't last long, but it's a cool thing to do.

      • Snacklive 2 years ago

        As someone who almost always knows that I'm dreaming, I would say no/maybe.

        While most of the time I recognize that I'm dreaming, only rarely can I actually control the outcome, and even more rarely have I gained admin control to change anything. I think those are what people usually mean when talking about lucid dreams.

        Also, if I start doing weird stuff or making it obvious that I'm aware of my status, the NPCs in my dream become apathetic and will plainly ask me to stop playing and just wake up.

        OK, that sounds like nightmare material now that I've written it, but it's not scary at all; they seem more annoyed than anything else.

  • dudeinjapan 2 years ago

    Tying my shoes in a dream is damn near impossible. I just woke up in frustration.

disconcision 2 years ago

(the following is speculation)

text, like hands, belongs to a class of imagery satisfying two characteristics: 1) they are intricately structured, having many subcomponents with precise spatial interrelationships over a range of scales; there are a lot of ways to make things that are like text/hands except wrong. 2) the average person is intimately familiar with said structures, having spent thousands of hours looking at them while performing complex tasks involving a visuospatial feedback loop.

image generation models tend to have trouble with (1), but people only tend to notice it when paired with (2).

(1) can be improved by scale and more balanced training data; consider that for a person, their own hands are very frequently in their own field of view, but the photos they take only rarely feature hands as the focus. this creates a differential bias.

as for (2), image models tend to generate all kinds of implausibilities that the average person doesn't notice. try generating a complex landscape and ask a geologist how it formed.

  • Wowfunhappy 2 years ago

    > (1) [Hands] are intricately structured, having many subcomponents with precise spatial interrelationships over a range of scales; there are a lot of ways to make things that are like text/hands except wrong. (2) The average person is intimately familiar with said structures, having spent thousands of hours looking at them while performing complex tasks involving a visuospatial feedback loop.

    Shouldn't this apply even more strongly to faces versus hands? AI seems to have a significantly easier time with those.

    • dragonwriter 2 years ago

      Faces don't have lots of repeating similar subcomponents beyond some things that are just two items in bilateral symmetry (teeth are a big exception, and teeth, when visible, can be a problem.)

      And, actually, faces can still be a problem too, especially outside of closeups of just the face, which is why a separate face-restoration pass with a GAN, or an inpainting pass over the face with the same or a different diffusion model, is common.
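
      A minimal sketch of that second pass (assuming the diffusers library and the runwayml/stable-diffusion-inpainting checkpoint; the face box is hard-coded here, where a real pipeline would get it from a face detector):

          import torch
          from diffusers import StableDiffusionInpaintPipeline
          from PIL import Image, ImageDraw

          pipe = StableDiffusionInpaintPipeline.from_pretrained(
              "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
          ).to("cuda")

          image = Image.open("generated.png").convert("RGB").resize((512, 512))

          # White-on-black mask over the face region to be regenerated.
          mask = Image.new("L", image.size, 0)
          ImageDraw.Draw(mask).rectangle([180, 100, 330, 280], fill=255)

          fixed = pipe(
              prompt="a detailed, natural human face",
              image=image,
              mask_image=mask,
          ).images[0]
          fixed.save("face_fixed.png")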

      • disconcision 2 years ago

        faces are simpler since, unlike hands, most of their major constituents are at fixed relative positions to each other. but the flip side is that people are hyper-biased towards attending to facial details, hence why they were basically the first special case to be handled

    • wongarsu 2 years ago

      Faces are probably vastly over-represented in the training data. Normal people and professional photographers alike love photographing faces, at zoom levels that are quite rare to experience in real life.

db48x 2 years ago

For the same reason they cannot do hands very well. There just aren't enough bits in their internal representations to encode specific details of that type. Scaling up the model can help, but that requires something on the order of 100× as much compute time and storage. Since the model you are using may already have taken months and dozens or hundreds of millions of dollars to create, the company behind it may be reluctant to spend years and billions to replace it.

  • adam_gyroscopeOP 2 years ago

    Given the amount of training image data with hands (and text in the image), I don't understand the lack of specific detail. For hands it's even stranger - the models add fingers, for example, which seems like something the training data never shows.

    • thfuran 2 years ago

      I don't think it's nearly as simple as there not being enough bits. I think the structure of the model just isn't designed to effectively encode that sort of information. The average real finger is between other fingers, which is probably why generated hands sometimes end up with too many.

raesene9 2 years ago

My experience is that, whilst it's not perfect, modern models can create images with correct text relatively often.

As a test I just tried ChatGPT with the prompt:

Hi ChatGPT can you give me a picture of a banner that says "Hacker news"

And the resultant image does indeed have that text on it. Where I've seen this approach fall down is where the text is long and/or complex, or the words are uncommon.

So while there's some way to go, things are definitely improving here.

reader5000 2 years ago

My intuition would be that both text and fingers are high-frequency parts of the image, and most image patches across most images don't have that much high-frequency content. Text and fingers are relatively rare in images compared to e.g. clouds and textures on surfaces. Because of the rarity and difficulty of text image patches, the models just don't dedicate that many parameters to them.

This could be entirely wrong however.

It would be interesting to see what would happen on a dataset with nothing but text.

swyx 2 years ago

the question is incorrect. they -can- spell. sometimes very long phrases. five-letter words should be no problem. have you tried using ideogram? or even just dalle3 prompted well? https://twitter.com/swyx/status/1765091085943218571

in other words.. what have you actually tried? be specific.

  • geor9e 2 years ago

    Exactly. The premise was definitely true years ago, but today it's not hard to get correct spelling out of the top models. Ask DALL-E 3 to make a picture with some text, and it will spit out 4 images. Usually 2 or 3 are perfectly spelled. Lesser or older diffusion models (whatever OP is using) sometimes mix Cyrillic and Latin letters, or invent plausible-looking letters that don't exist in any language. But think about how they work - they are trained to turn pure pixel noise into a more plausible array of pixels for the prompt. The output is pretty close to plausible text - misspelling is a nitpick next to everything else it gets right. Technology progresses over time.

    • gwern 2 years ago

      > Ask DALL-E 3 to make a picture with some text, and it will spit out 4 images. Usually 2 or 3 are perfectly spelled.

      Yes, this is the 'miracle of spelling' (as https://arxiv.org/abs/2212.10562#google calls it): for many words, larger models can manage to deduce the spelling somehow despite the tokenization. It may even fool you into thinking it understands spelling in general. But if you ask DALL-E 3 to generate a random string of ASCII, you'll quickly discover the limits to the 'miracle'.

    • adam_gyroscopeOP 2 years ago

      I tried GPT-4, Gemini, and Stable Diffusion. I had sorta assumed DALL-E wasn't the hotness anymore, and had never heard of Ideogram - I'll try both!

huevosabio 2 years ago

Try Ideogram! V1 is pretty good at spelling out words.

https://ideogram.ai/login

jvm___ 2 years ago

Because they can only generate something that looks LIKE the thing.

Generating a cat that looks LIKE a cat is fine because there are differences between cats.

The problem is that you can't make something that just looks LIKE a letter K. It needs to satisfy the rules of K; it can't merely resemble a K while actually being some made-up character.

They're LIKE generators and have trouble with the bits that need to be exact.

  • dylan604 2 years ago

    > The problem is that you can't make something that looks LIKE a letter K, it needs to satisfy the rules of K and can't just look LIKE a K

    there are several fonts on the market that would profoundly disagree with this.

  • amelius 2 years ago

    Sorry, but this is not a very scientific answer.

    • spiderxxxx 2 years ago

      The asker didn't ask for a very scientific answer. They've come across a limitation and are asking why. Any answer is still an answer, sorry if it didn't match your criteria.

spiderxxxx 2 years ago

What is spelling but putting into letters what is heard? If you can give it the text, then it's not spelling, is it - it would be copying! I think you want to give it ideas that will translate into words, which is certainly not spelling. It's creating. SD models start with simple static and are asked to find some pattern and expand upon it until it matches the prompt better. Letters on a sign, for example, are not right or wrong by their placement, but by the judgement of someone who has knowledge of the language. "Sale" might mean "salt" if you speak Italian, but it might mean there's a discount for a short period of time if you speak English. "Sensibel" might seem like a misspelling of "sensible" if you speak English, but it's perfectly correct spelling in German.

My suggestion is to use Image to Image: start with the text of your son's name, give it some Gaussian-noise background, and then paint out the parts you want to keep.
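
A rough sketch of that workflow (assuming the diffusers and Pillow libraries; "JAMES" stands in for the actual name, and the font path is an assumption for a typical Linux install):

    import numpy as np
    import torch
    from diffusers import StableDiffusionImg2ImgPipeline
    from PIL import Image, ImageDraw, ImageFont

    # 1. Render the name over a Gaussian-noise background.
    noise = np.random.normal(127, 40, (512, 512, 3)).clip(0, 255).astype("uint8")
    init = Image.fromarray(noise)
    font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 110)
    ImageDraw.Draw(init).text((40, 190), "JAMES", fill="white", font=font)

    # 2. Let img2img restyle the canvas while (mostly) preserving the lettering.
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    out = pipe(
        prompt='a colorful nursery wall banner that says "JAMES"',
        image=init,
        strength=0.5,  # lower strength keeps more of the original lettering
    ).images[0]
    out.save("banner.png")

The strength value is the knob here: too low and the result still looks like noise with plain text on it, too high and the model starts "correcting" the letters away.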

Frummy 2 years ago

It's not like writing, where it grabs a virtual pen and draws the letters one by one, or types the letters one by one. The image is diffused everywhere at once, and if you, for example, close your eyes and imagine the word "generation" in symbols as an image, it's quite difficult to visualise all the details in the right order and at the right level of detail. The model does it over multiple denoising steps, but it starts with a slightly defined "hunch" at full image size and just makes that initial hunch more legible with each step.

Zetobal 2 years ago

It's the dataset: most of the images that were tagged don't have the shown text in their captions. We do a lot of car LoRAs, and if you tag the shown text (on the number plate, for example), you can prompt/replace it in your prompt without problems.

Newer models like Stable Cascade or SD3 are using multimodal LLMs to caption images, including any text in them. DALL-E was at the forefront because they had access to GPT-4 Vision before everyone else. You will see that all new models will be able to spell. The problems we see are still mostly because of GIGO (garbage in, garbage out).

  • robertclaus 2 years ago

    This was my feeling too. Considering a bunch of new image models are coming out at once, all claiming they can spell now, it was likely just a training-set issue - the caption generators just needed to be told to include any text in the images.

bxguff 2 years ago

I think it just boils down to what they were trained on. Some models do better when the training sets are more specific, even if they're smaller, so the engineers chase better wholesale performance while leaving some of the weirder edge cases, e.g. text generation, to be cleaned up later. Maybe start with the image and try adding the text afterwards in a separate prompt, if you haven't already?

d-z-m 2 years ago

If you have the resources to run a model locally, you can try ControlNet with a reference image. If you dial the strength up enough, it will coerce the model into spelling correctly (at the cost of the generated image looking an awful lot like the reference image: font/size/etc.).
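
A minimal sketch of that setup (assuming the diffusers library with the lllyasviel/sd-controlnet-canny checkpoint; "ref_text.png" is a hypothetical Canny edge map of the name rendered in the desired font):

    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from diffusers.utils import load_image

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    ref = load_image("ref_text.png")  # edge map of the rendered name

    out = pipe(
        prompt='a playful wooden nursery sign that says "JAMES"',
        image=ref,
        controlnet_conditioning_scale=1.2,  # "dial the strength up"
    ).images[0]
    out.save("sign.png")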

sfmz 2 years ago

Many articles/videos are available about how Midjourney can spell since version 6.

https://medium.com/community-driven-ai/midjourney-can-spell-...

nextaccountic 2 years ago

Can't SD3 spell?

  • dragonwriter 2 years ago

    From what I’ve seen, on lettering SD3 > Stable Cascade > SDXL (SD1.x and SD2.x aren't even worth considering), but none do it great consistently.

MatthiasPortzel 2 years ago

Because they’re trained to generate images, not words.

There’s an anecdote about blind men whose sight was restored. They were adult men, who had felt cubes and heard about cubes, and could describe a cube. After their sight was restored, they were shown a cube and a sphere and were asked to identify them by sight. They were unable to, having never seen these objects before.

Many people (including very smart people) make the mistake of equating all forms of intelligence. They assume that computer programs have an intelligence level, and should be able to handle all tasks below that intelligence level, but machine learning models break our intuition for this. A model which has been trained on stock market data and is extremely intelligent in this area may be able to predict the stock market tomorrow. But if it has not been trained on words then it is no more able to write a sentence than a newborn baby. ChatGPT can eloquently generate words but it is completely unable to generate or understand pictures. (Ask ChatGPT to generate some ASCII art.) Eventually OpenAI will create a sophisticated multi-modal model capable of generating poems or reading words in an image or predicting the stock market, but this model will be completely unable to answer questions about the physical world, because it's only been trained on words and images.

  • ssl-3 2 years ago

    > ChatGPT can eloquently generate words but it is completely unable to generate or understand pictures. (Ask ChatGPT to generate some ASCII art.)

    Ok. I did both things.

    I took a photo of my feet up on a stool in my living room, and told ChatGPT to describe it.

    It was reasonably (and rather surprisingly) successful.

    I also told it to generate an ASCII image of a car. It did that, too.

    • Retr0id 2 years ago

      GP likely meant GPT-4 (the LLM), not ChatGPT the product as a whole (which has additional models for doing things with images).

      Feed original (not copy-pasted from the web) ASCII art of a foot into GPT-4 and I'd be very impressed if it can tell you it's a foot.

      I'm actually mildly impressed it could generate ASCII art of a car, because that's a lot better than I've been able to get out of it (albeit on GPT-3.5). Try anything more complex and I believe you'll see its limitations.

  • disconcision 2 years ago

    chatgpt can in fact both read words in images and generate images

    • ericbarrett 2 years ago

      But ChatGPT is a product, not a model. Might there be an augmenting OCR process feeding it the image's text?

      • dragonwriter 2 years ago

        With data analysis enabled, I’ve seen it try to use pytesseract to do OCR when you ask it to extract text from an image. (I’ve seen it fail with a timeout trying that on an image that it worked almost instantly on with a similar approach on my laptop.)

      • exe34 2 years ago

        CogVLM and Llava can both read text from images I think, although I haven't tried this myself. They certainly describe images very accurately, including giving the pixel location.

Wowfunhappy 2 years ago

This isn't a scientific answer, but it feels pretty intuitive to me that writing and drawing are very different skills.

If an AI that could draw were also able to write, that would be artificial general intelligence. And pretty much everyone seems to agree we don't have that yet.
