Before we dive into the project, we should at least provide a brief explanation of Stable Diffusion. Since it’s already been the subject of mass interweb hysteria for quite some time, I’m not going to go into any great detail.
Stable Diffusion
Stable Diffusion is a generative art model originally put forth by the Computer Vision research group at Ludwig Maximilian University. Why is this significant? Although papers on generative art systems had been published prior to Stable Diffusion, the key difference is that a proper model was not only trained but, more importantly, freely released to the public thanks to Stability AI. This is in stark contrast to DALL-E, where OpenAI keeps a tight proprietary lid on its models.
So how good is Stable Diffusion, you ask? Feast your eyes on this stunning 4K render of Julie Andrews from the beloved Disney classic Mary Poppins, generated when you get too excited and accidentally misspell her last name in the prompt as “Ploppins”:

Fantastic. Okay, but seriously, how about this one?

As time has progressed, the open-source community has coalesced around making Stable Diffusion more approachable and deployable, even on machines with modestly powerful discrete GPUs.
Ever since seeing Midjourney, a popular AI generative art system fully accessible from Discord, I’d played around with the idea of having a local generative art system available in a similarly accessible manner, but with the added goal of building a game around it.
Stable Diffusion and other models are particularly strong at taking existing concepts and applying different art styles to them.

Here we have four different takes on the Vulcan Spock: realism, line drawing, cubism, and pop art, respectively.
This looks like a perfect application for the classic game of Pictionary. Pictionary, for the uninitiated, is a game where one person is given a cue card and then attempts to get the other players to guess what’s on it by drawing it on a piece of paper.
Hosting Models Locally
By repurposing an old RTX 2060 Linux laptop from a few years ago, we can use it as a dedicated diffusion machine.
So we fetch the latest version of AUTOMATIC1111, one of the most active and popular web UIs for Stable Diffusion. Bonus: it’s capable of running on relatively modest hardware, in my case an NVIDIA card with 6 GB of VRAM.
Unfortunately for us, it uses Gradio to generate its web interface, which means all the endpoints are obscurely named.
Digging around for a bit, we find that TomJamesPearce put together a simple proof-of-concept API using uvicorn and FastAPI on top of it. Not knowing whether it was going to be extended, and having more familiarity with JavaScript, I wrote a minimal Node Express server with super basic header-based authentication that would call this API.
Using this API, we can get back a base64-encoded image.
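For flavor, here’s roughly what that proxy looks like. This is a minimal sketch: the /generate and txt2img route names, the ports, and the response shape are illustrative stand-ins, not the exact contract of the proof-of-concept API.

```js
// Minimal Express proxy sketch. Route names, ports, and the response shape
// below are assumptions, not the exact PoC API contract.
const express = require('express');
const app = express();
app.use(express.json());

// Super basic header-based auth: reject anything without our shared secret.
app.use((req, res, next) => {
  if (req.get('x-api-key') !== process.env.BOT_ROSS_KEY) return res.sendStatus(401);
  next();
});

app.post('/generate', async (req, res) => {
  // Forward the prompt to the diffusion machine (Node 18+ global fetch).
  const response = await fetch('http://localhost:7860/txt2img', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt: req.body.prompt, steps: 30, width: 512, height: 512 }),
  });
  const { images } = await response.json();
  res.json({ image: images[0] }); // base64-encoded PNG
});

app.listen(3000);
```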
The Discord Bot
First, let’s get our bot registered with Discord. As this was my first time building a Discord bot, I was a bit apprehensive, but it turns out it’s as simple as logging into the developer portal, registering an app, and then adding a bot. Here’s our initial bot’s profile:

From there, we take our bot token, plop it into an environment variable, and bring in the discord.js npm package.
Discord bots are relatively simple. Register a set of slash (/) commands, create a client, subscribe to the appropriate events, and you’re off to the races.
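To make that concrete, here’s a minimal sketch of the bot skeleton using discord.js v14 conventions; the command name, intents, and environment variable names are placeholders for illustration.

```js
const { Client, GatewayIntentBits, REST, Routes, SlashCommandBuilder } = require('discord.js');

// The one command we care about.
const commands = [
  new SlashCommandBuilder()
    .setName('pictionary')
    .setDescription('Start a new round of diffusion pictionary')
    .toJSON(),
];

const client = new Client({
  intents: [
    GatewayIntentBits.Guilds,
    GatewayIntentBits.GuildMessages,
    GatewayIntentBits.MessageContent, // needed to read guesses typed in the channel
  ],
});

client.on('interactionCreate', async (interaction) => {
  if (!interaction.isChatInputCommand()) return;
  if (interaction.commandName === 'pictionary') {
    await interaction.reply('Starting a new round…');
    // kick off image generation and guess handling here
  }
});

(async () => {
  // Register the slash command (guild-scoped so it shows up immediately).
  const rest = new REST({ version: '10' }).setToken(process.env.BOT_TOKEN);
  await rest.put(
    Routes.applicationGuildCommands(process.env.CLIENT_ID, process.env.GUILD_ID),
    { body: commands },
  );
  await client.login(process.env.BOT_TOKEN);
})();
```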
Schema
We’ll need a set of “pictionary cards”, so let’s create a simple schema for a pack:
/**
* A Category pack
* @typedef {Object} Pack
* @property {string} name - pack name
* @property {string[]} words - words in the pack
* @property {string} description - pack description
* @property {string[]} tags - Suffixes to a word
 */

What are the tags for? Some concepts are a little too general, so tags are there to add guidance to SD while not interfering with the words themselves. If we had a pack for Disney characters, a render for the word “Mickey Mouse” would be simple since it’s unambiguous, but what about this one:

That’s what happens when we generate an image with the prompt “Basil”, the name of Disney’s mouse detective, without any tags.
Here’s another render with the prompt “Basil, Disney”:

That’s a little more guessable. So we separate the word from the tags.
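As an illustration, a Disney pack might look like the following; the specific words and tags here are made up for this post.

```js
// A hypothetical Disney pack: the words stay clean for guessing, while the
// tags quietly steer Stable Diffusion toward the right concept.
const disneyPack = {
  name: 'Disney',
  description: 'Characters from Disney animated films',
  words: ['Mickey Mouse', 'Basil', 'Elsa', 'Stitch'],
  tags: ['Disney', 'animated film'],
};
```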
From here, a player types pictionary to start a round, we pick a random word, generate the image and wait for a player to type the correct response. Simple, right?
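Sketched in discord.js terms, a round might look roughly like this; generateImage() is a hypothetical helper that calls our Express proxy and resolves with the base64 string.

```js
const { AttachmentBuilder } = require('discord.js');

// Hypothetical sketch of a single round. generateImage(prompt) is assumed
// to call our Express proxy and resolve with a base64-encoded PNG.
async function playRound(interaction, pack) {
  const word = pack.words[Math.floor(Math.random() * pack.words.length)];
  const prompt = [word, ...pack.tags].join(', ');

  const base64 = await generateImage(prompt);
  const image = new AttachmentBuilder(Buffer.from(base64, 'base64'), { name: 'round.png' });
  await interaction.reply({ content: 'Guess the word!', files: [image] });

  // Wait up to 60 seconds for any message in the channel that contains the word.
  const guesses = await interaction.channel.awaitMessages({
    filter: (msg) => msg.content.toLowerCase().includes(word.toLowerCase()),
    max: 1,
    time: 60_000,
  });

  const winner = guesses.first();
  await interaction.followUp(
    winner
      ? `${winner.author} got it! The word was **${word}**.`
      : `Time's up! The word was **${word}**.`
  );
}
```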
Styles
Now this is fine as a proof of concept and works well enough, but let’s leverage Stable Diffusion’s ability to render images in different styles. Furthermore, let’s add a random set of modifiers - stuff like highly realistic, 3D, steampunk, etc.
In the end, we landed on the following template:
[word] in the style of [artist], [medium], [modifiers]
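Stitching the template together is just string concatenation with some random picks. The artist, medium, and modifier lists below are illustrative, not the actual ones baked into the bot.

```js
// Illustrative lists; the real bot's picks live elsewhere.
const ARTISTS = ['Vincent van Gogh', 'Hayao Miyazaki', 'Norman Rockwell'];
const MEDIUMS = ['oil painting', 'line drawing', 'watercolor'];
const MODIFIERS = ['highly realistic', '3D', 'steampunk', 'pop art'];

const pick = (list) => list[Math.floor(Math.random() * list.length)];

// [word] in the style of [artist], [medium], [modifiers]
function buildPrompt(word, tags = []) {
  return `${[word, ...tags].join(', ')} in the style of ${pick(ARTISTS)}, ${pick(MEDIUMS)}, ${pick(MODIFIERS)}`;
}

// e.g. buildPrompt('Basil', ['Disney'])
//   -> "Basil, Disney in the style of Hayao Miyazaki, watercolor, steampunk"
```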
Here’s an example of something generated:


Play testing
Let’s give it a shot on Discord:

Sigh. It takes between 10 and 15 seconds to generate a 512×512 image at 30 steps (iterations). This isn’t terrible, but since it’s a measurable amount of time, the players are just kind of twiddling their virtual thumbs until Stable Diffusion spits out our image. What if we could make this slowness an asset?
That got me thinking about college bowl trivia. In contrast with your local trivia night, which usually features pretty straightforward questions and answers, a college bowl question is often an entire paragraph of highly specific clues that start incredibly obscure and gradually become more general as the card is read. In this manner, it rewards players with deep knowledge of the subject, who can buzz in earlier as the clue is read aloud.
For example: “Early in this novel, a rule is established that only the person holding a conch shell may talk during group meetings. Glasses belonging to (*) Piggy are broken by the choirboy Jack. School-aged children are stranded on an island in—for 10 points—name this novel by William Golding.”
So… what if we just started displaying the image immediately? Each time a certain number of steps has been generated, we update the existing image in Discord, giving the players an interactive slideshow which gets progressively more detailed as the round continues.
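One way to wire that up, assuming our proxy grows a hypothetical /progress endpoint that returns the latest preview as base64, is to poll while the full render is in flight and edit the original message in place.

```js
const { AttachmentBuilder } = require('discord.js');

// Kick off the full render, then poll a hypothetical /progress endpoint on our
// proxy for preview images and swap them into the original Discord message.
async function streamRoundImage(message, prompt) {
  const generation = generateImage(prompt); // resolves with the final base64 image

  const timer = setInterval(async () => {
    const res = await fetch('http://localhost:3000/progress', {
      headers: { 'x-api-key': process.env.BOT_ROSS_KEY },
    });
    const { preview } = await res.json(); // base64 of the latest completed step
    if (!preview) return;
    await message.edit({
      attachments: [], // drop the previous frame...
      files: [new AttachmentBuilder(Buffer.from(preview, 'base64'), { name: 'round.png' })],
    });
  }, 3_000);

  const finalImage = await generation;
  clearInterval(timer);
  await message.edit({
    attachments: [],
    files: [new AttachmentBuilder(Buffer.from(finalImage, 'base64'), { name: 'round.png' })],
  });
}
```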

Success! Our overall logic now looks like this:

And now we’re finally ready to play a few rounds!
Post mortem
Originally, we had thought that using a constant seed would be better, since it would help ensure a certain level of image consistency between successive iterations. Unfortunately, this has the side effect that if the specified seed isn’t particularly relevant to the prompt, it can make the experience frustrating for players. Take a look at this image sequence.

It’s like some unholy fusion of Oswald Cobblepot and Shrek. None of these iterative images seem to have any connection to the prompt. Care to guess what the original prompt was?

However, if we randomize the seed every iteration, it’s more likely that at least one of the progressively generated step-based images will bear a passing resemblance to our original prompt.
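In practice that’s just a tweak to the request payload; the field names below follow AUTOMATIC1111-style conventions and may not match the proof-of-concept API exactly.

```js
// Field names follow AUTOMATIC1111-style payloads; other backends may differ.
const payload = {
  prompt: 'Basil, Disney in the style of Hayao Miyazaki, watercolor',
  steps: 30,
  width: 512,
  height: 512,
  seed: -1, // -1 = pick a fresh random seed for every generation
};
```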
What’s left for Bot Ross?
There’s a lot of stuff that we could add – for one, it would be great to make this bot publicly available so that anyone could invite it to their servers, but it doesn’t really scale… to put it delicately.
We generate our pictionary images “on demand” for any given game, and since I don’t have a load balancer sitting in front of a couple dozen RTX builds just lying around the house, I would have to revise the design.
To build it in a more scalable manner, we could go one of two routes.
Route 1: Continual pregeneration
Instead of having the Discord bot request images in real time, we’d have a worker thread in our Node communicator that would enumerate a complete list of “prompt permutations” and churn out an endless stream of images to be uploaded to an S3 bucket and stacked like cordwood. Then our communicator would parse an incoming prompt and fetch the “cached” image sequence from S3.
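A rough sketch of what that worker might look like; the bucket name, key scheme, and the pick()/generateImage() helpers from the earlier sketches are all illustrative.

```js
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({ region: 'us-east-1' });

// Walk the prompt permutations (a representative subset here), render each,
// and stack the results in S3 for later retrieval by the communicator.
async function pregenerate(packs, artists, mediums, modifiers) {
  for (const pack of packs) {
    for (const word of pack.words) {
      for (const artist of artists) {
        const prompt = `${word} in the style of ${artist}, ${pick(mediums)}, ${pick(modifiers)}`;
        const base64 = await generateImage(prompt); // same hypothetical helper as before
        await s3.send(new PutObjectCommand({
          Bucket: 'bot-ross-renders',               // illustrative bucket name
          Key: `${pack.name}/${word}/${artist}.png`, // illustrative key scheme
          Body: Buffer.from(base64, 'base64'),
          ContentType: 'image/png',
        }));
      }
    }
  }
}
```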
This feels… fine I guess, but also kind of a cop-out, and not as cool as a “realtime diffusion pictionary game”.
Route 2: Multi-game Sync
Our Discord bot would take the first channel to “click” New Round and kick off a sequence of generations. Any subsequent guild/channel would effectively be joining that existing round, though they would be unaware of it.
This comes with the caveat that channels cannot pick specific category packs, since everybody is locked into the same ongoing game.
How can I play it?
If you really, really, really want to play it, shoot me a message on Discord (wunderbaba) and I’ll invite you to my guild to try it out.
All the code (such as it is) is fully available on my GitHub. It was written as a proof of concept and is about as robust as Samuel L. Jackson’s character from the movie Unbreakable.
Credits
This project would not have been possible without the tireless efforts of the AI open-source community at large. A big shoutout to AUTOMATIC1111, who currently maintains one of the best Gradio UI/UX frontends for running Stable Diffusion.