I was playing around with a vision language model, and it thinks this Xfinity logo is a hate symbol.

It’s also overly careful about copyright.

And here are the results on the AI logo Rorschach test.[1]

The model makes these predictions because I tricked it with adversarial examples. If you look closely you’ll see weird noise patterns in the images. Those are modifications I added to make the model output exactly what I want.
First I’ll explain how it works. At the end of the post there is a link to code you can use to make your own images.
How it works (high level)
Normally when we train a neural network we have an input and target output, and use gradient descent on the model parameters to maximize the likelihood of the output.
Adversarial examples are a slight twist on this: we fix the output and the model, and use gradient ascent to modify the input so that the model is as likely as possible to give the desired output.
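Written out as update rules, the difference is only in which variable gets the gradient. The notation here is mine, added for illustration: θ is the model parameters, x the input image, y the target output, and η a step size. Training is shown as ascent on the log-likelihood, which is the same thing as descent on its negative.

```latex
\begin{align*}
\text{training:} \quad & \theta \leftarrow \theta + \eta \, \nabla_{\theta} \log p_{\theta}(y \mid x) \\
\text{adversarial example:} \quad & x \leftarrow x + \eta \, \nabla_{x} \log p_{\theta}(y \mid x)
\end{align*}
```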
Basically, that’s all you need to do. I expected I would need some tricks like a term in the loss function that kept the image close to the original, but surprisingly it just worked, with only small changes to the input image.
Advice for creating your own: Some input-output pairs are easier than others. For example, if the input question is “What is shown in this image?”, the model has a high probability of starting its answer with “The image…” (at least, it did for the input images I used). If you try a different output, it might take longer or require larger changes to the input image. When creating your own adversarial images, make sure to play around with different inputs and outputs to see what works best.
Deeper details
I used the Llava 7B model (paper), a vision language model that takes text and image as input, and outputs text (e.g. you can ask questions about the image).
The image pre-processing takes the raw pixel values, divides by 255, then normalizes them (subtracts a per-channel mean and divides by a per-channel standard deviation). We’ll use these normalized pixel values for optimization. Afterwards we’ll need to convert back to ordinary pixel values.
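Here is a minimal sketch of that round trip, not the notebook's actual code. The mean and standard deviation are the values published with OpenAI's CLIP preprocessing, which I am assuming apply here, so check your processor's config. The function names are made up for illustration, and resizing is left out.

```python
import torch

# Assumption: per-channel mean/std from OpenAI's CLIP preprocessing.
# Verify these against the image processor config of the model you use.
MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(3, 1, 1)
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(3, 1, 1)


def normalize(pixels_uint8: torch.Tensor) -> torch.Tensor:
    """(3, H, W) uint8 pixels -> normalized float values."""
    return (pixels_uint8.float() / 255.0 - MEAN) / STD


def denormalize(normalized: torch.Tensor) -> torch.Tensor:
    """Normalized float values -> (3, H, W) uint8 pixels (rounded, clamped to 0..255)."""
    pixels = (normalized * STD + MEAN) * 255.0
    return pixels.round().clamp(0, 255).to(torch.uint8)


torch.manual_seed(0)
image = torch.randint(0, 256, (3, 4, 4), dtype=torch.uint8)  # random stand-in image
round_trip = denormalize(normalize(image))
```

An unmodified image survives the round trip unchanged. Rounding only starts to matter once the normalized values have been nudged by the optimizer.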
Modifying the normalized pixel values is a simple PyTorch gradient ascent loop. At each step, run the inputs through the model, calculate the logprobs of the desired output tokens, and maximize these logprobs (PyTorch optimizers minimize by default, so I actually minimize the negative logprobs).
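The shape of that loop can be sketched on a toy problem. Everything here is a stand-in I made up for illustration: `toy_model` is a small frozen linear layer playing the role of the VLM, the "image" is an 8×8 random tensor, and the learning rate and step count are arbitrary. The real notebook runs Llava instead.

```python
import torch

torch.manual_seed(0)
VOCAB, SEQ_LEN = 10, 3

# Hypothetical stand-in for the frozen VLM: maps a flattened (3, 8, 8) "image"
# to one row of token logits per output position.
toy_model = torch.nn.Linear(3 * 8 * 8, SEQ_LEN * VOCAB)
for param in toy_model.parameters():
    param.requires_grad_(False)  # the model stays fixed; only the input changes


def target_logprob(image: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
    """Sum of log-probabilities the toy model assigns to the target tokens."""
    logits = toy_model(image.flatten()).view(SEQ_LEN, VOCAB)
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs[torch.arange(SEQ_LEN), target_tokens].sum()


image = torch.randn(3, 8, 8, requires_grad=True)  # the thing we optimize
target = torch.tensor([4, 1, 7])                  # desired output tokens
optimizer = torch.optim.Adam([image], lr=0.05)

start = target_logprob(image, target).item()
for step in range(200):
    optimizer.zero_grad()
    loss = -target_logprob(image, target)  # minimizing this maximizes the logprobs
    loss.backward()                        # gradients flow to the image, not the weights
    optimizer.step()
end = target_logprob(image, target).item()

predicted = toy_model(image.flatten()).view(SEQ_LEN, VOCAB).argmax(dim=-1)
```

The only unusual part compared to a training loop is that the optimizer is handed `[image]` rather than the model's parameters.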
One trick I found that helped: instead of directly trying to optimize the probability of the full output token sequence, I do it token by token. First do gradient ascent on the input to maximize the likelihood of outputting the first token of the desired output. Once it can do that, do 2 tokens, then 3 etc.
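One way to express that schedule, again on a made-up toy model rather than Llava: keep a `prefix_len`, optimize only the first `prefix_len` target tokens, and grow it whenever the model already outputs that prefix. Note the simplification: in the toy, each position's logits depend only on the image, whereas a real autoregressive model also conditions each token on the previous ones.

```python
import torch

torch.manual_seed(0)
VOCAB, SEQ_LEN = 10, 4

# Hypothetical frozen stand-in for the VLM (see the simplification noted above).
toy_model = torch.nn.Linear(3 * 8 * 8, SEQ_LEN * VOCAB)
for param in toy_model.parameters():
    param.requires_grad_(False)


def logits_for(image: torch.Tensor) -> torch.Tensor:
    """Per-position token logits, shape (SEQ_LEN, VOCAB)."""
    return toy_model(image.flatten()).view(SEQ_LEN, VOCAB)


image = torch.randn(3, 8, 8, requires_grad=True)
target = torch.tensor([4, 1, 7, 2])
optimizer = torch.optim.Adam([image], lr=0.05)

prefix_len = 1  # start by optimizing only the first target token
for step in range(2000):
    logits = logits_for(image)
    predicted = logits.argmax(dim=-1)

    # Grow the prefix for as long as the model already outputs it correctly.
    while prefix_len <= SEQ_LEN and torch.equal(predicted[:prefix_len], target[:prefix_len]):
        prefix_len += 1
    if prefix_len > SEQ_LEN:
        break  # the full target sequence is produced

    # Maximize the logprobs of just the current prefix of the target.
    logprobs = torch.log_softmax(logits, dim=-1)
    loss = -logprobs[torch.arange(prefix_len), target[:prefix_len]].sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

final_prediction = logits_for(image.detach()).argmax(dim=-1)
```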
Another tricky thing is that we optimize normalized pixel values, which are continuous, but real pixel values must be integers. When we convert the normalized values back to an image, rounding can change the model’s output so that it no longer says exactly what we wanted. To deal with this, I put a check in the loop that converts back to an image and tests whether that image gives the desired output. Once it does, the loop stops.
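A sketch of that stopping check, under the same caveats as before: the frozen linear `toy_model` is a made-up stand-in for the VLM, the mean and standard deviation are assumed to be the CLIP preprocessing values, and the helper names are hypothetical. The important detail is that the check runs on the rounded integer image, not on the continuous values being optimized.

```python
import torch

torch.manual_seed(0)
VOCAB = 10

# Assumption: CLIP-style per-channel mean/std; check your processor's config.
MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(3, 1, 1)
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(3, 1, 1)

# Hypothetical frozen stand-in for the VLM: predicts a single token from the image.
toy_model = torch.nn.Linear(3 * 8 * 8, VOCAB)
for param in toy_model.parameters():
    param.requires_grad_(False)


def to_normalized(pixels: torch.Tensor) -> torch.Tensor:
    return (pixels.float() / 255.0 - MEAN) / STD


def to_pixels(normalized: torch.Tensor) -> torch.Tensor:
    return ((normalized * STD + MEAN) * 255.0).round().clamp(0, 255).to(torch.uint8)


def predict(normalized: torch.Tensor) -> int:
    return toy_model(normalized.flatten()).argmax().item()


original = torch.randint(0, 256, (3, 8, 8), dtype=torch.uint8)
x = to_normalized(original).clone().requires_grad_(True)
target_token = (predict(x) + 1) % VOCAB  # any token the model does not already output
optimizer = torch.optim.Adam([x], lr=0.01)

for step in range(1000):
    # Check the image as it would actually be saved: rounded to integer pixels.
    with torch.no_grad():
        if predict(to_normalized(to_pixels(x))) == target_token:
            break
    loss = -torch.log_softmax(toy_model(x.flatten()), dim=-1)[target_token]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

adversarial = to_pixels(x.detach())  # final integer-valued image
```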
After this, we have everything needed to run the code. Generating the images only takes a few minutes, although the exact amount of time will depend on what your chosen inputs and outputs are.
Side note: I initially tried a different model that had a smaller image input size (the Llava processor resizes images to 336×336; the other model used smaller inputs). This didn’t work as well: the modifications required to make the adversarial images were large and ugly, and made the picture look very different. I hypothesize that larger inputs are important because it’s easier to optimize in a higher-dimensional space, but I did not rigorously test this.
Do these adversarial images generalize?
The Llava model is made by connecting the CLIP visual encoder with Vicuna through a linear projection, and fine-tuning the projection + LLM. I thought maybe CLIP (and other VLMs that use CLIP) could be affected by the adversarial examples since they use the same image encoder. However, when I tested on CLIP, my adversarial images did not confuse it, so these adversarial examples appear to be specific to this model.
That said, the adversarial images do still have some effect when I use a different prompt with the same image. This one gave a pretty funny result:

For code to create your own adversarial images, see: https://github.com/rosmineb/vlm_adversarial_examples/blob/main/vlm_adversarial.ipynb
Acknowledgements
Thanks to @algomancer for supporting this and other projects, and thanks to Fufu for his modeling service for image 2 (and for tolerating an AI image much less handsome than he is).
[1] If you work at one of these companies and are thinking about hiring me, please note that I did not say your company logo looks like a butthole. I meant that your company’s logo is a psychological canvas onto which people project their unresolved issues with buttholes.