Omri Avrahami1,2, Amir Hertz1, Yael Vinker1,3, Moab Arar1,3, Shlomi Fruchter1, Ohad Fried4, Daniel Cohen-Or1,3, Dani Lischinski1,2
1Google Research, 2The Hebrew University of Jerusalem, 3Tel Aviv University, 4Reichman University
SIGGRAPH 2024
The Chosen One - given a text prompt describing a character, our method distills a representation that enables consistent depiction of the same character in novel contexts.
Abstract
Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, users of these models struggle with the generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development, asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach.
Video
Method
Our fully-automated solution to the task of consistent character generation is based on the assumption that a sufficiently large set of generated images, for a certain prompt, will contain groups of images with shared characteristics. Given such a cluster, one can extract a representation that captures the "common ground" among its images. Repeating the process with this representation, we can increase the consistency among the generated images, while still remaining faithful to the original input prompt.
We start by generating a gallery of images based on the provided text prompt, and embed them in a Euclidean space using a pre-trained feature extractor. Next, we cluster these embeddings, and choose the most cohesive cluster to serve as the input for a personalization method that attempts to extract a consistent identity. We then use the resulting model to generate the next gallery of images, which should exhibit more consistency, while still depicting the input prompt. This process is repeated iteratively until convergence.
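The clustering step above can be sketched in code. In this illustrative Python sketch (not the paper's implementation), random vectors stand in for the embeddings produced by the pre-trained feature extractor, K-Means stands in for the clustering step, and the most cohesive cluster is taken to be the one whose members lie closest, on average, to their centroid; the personalization step that follows is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def most_cohesive_cluster(embeddings, n_clusters=5):
    """Cluster image embeddings and return the indices of the most
    cohesive cluster (smallest mean distance to its centroid)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    best_label, best_score = None, np.inf
    for label in range(n_clusters):
        members = embeddings[km.labels_ == label]
        if len(members) < 2:  # skip degenerate clusters
            continue
        score = np.linalg.norm(members - km.cluster_centers_[label], axis=1).mean()
        if score < best_score:
            best_label, best_score = label, score
    return np.where(km.labels_ == best_label)[0]

# Stand-in for feature-extractor embeddings of a 60-image generated gallery.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(60, 128))
cohesive_idx = most_cohesive_cluster(gallery)
```

In the full pipeline, the images selected by `cohesive_idx` would be fed to a personalization method to distill a consistent identity, which then conditions the next round of generation.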
Results
Consistent Characters Examples
Using our method, we can generate a consistent character in novel scenes. Note that our method works across different character types (e.g., humans, animals) and styles (e.g., photorealistic, rendered).
"A photo of a 50 years old man with curly hair"

"in the park"

"reading a book"

"at the beach"

"holding an avocado"
"A portrait of a man with a mustache and a hat, fauvism"

"in the park"

"reading a book"

"at the beach"

"holding an avocado"
"A rendering of a cute albino porcupine, cozy indoor lighting"

"in the park"

"reading a book"

"at the beach"

"holding an avocado"
"a 3D animation of a happy pig"

"in the park"

"reading a book"

"at the beach"

"holding an avocado"
"a sticker of a ginger cat"

"in the park"

"reading a book"

"at the beach"

"holding an avocado"
Life Story
Using our method, we can generate a consistent life story, depicting the same character at different stages of life.
"a photo of a man with short black hair"

"as a baby"

"as a small child"

"as a teenager"

"with his first girlfriend"

"before the prom"

"as a soldier"

"in the college campus"

"sitting in a lecture"

"playing football"

"drinking a beer"

"studying in his room"

"happy with his accepted paper"

"giving a talk in a conference"

"graduating from college"

"a profile picture"

"working in a coffee shop"

"in his wedding"

"with his small child"

"as a 50 years old man"

"as a 70 years old man"

"a watercolor painting"

"a pencil sketch"
"a rendered avatar"

"a 2D animation"

"a graffiti"
Story Illustration
Our method can be used for story illustration. For example, we can illustrate the following story:
"This is a story about Jasper, a cute mink with a brown jacket and red pants. Jasper started his day by jogging on the beach, and afterwards, he enjoyed a coffee meetup with a friend in the heart of New York City. As the day drew to a close, he settled into his cozy apartment to review a paper"

Scene 1

Scene 2

Scene 3

Scene 4
Local Text-Driven Image Editing
Our method can be integrated with Blended Latent Diffusion for the task of consistent local text-driven image editing:
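Blended Latent Diffusion keeps edits local by compositing, at each denoising step, the newly generated latent inside the user-provided mask with an appropriately noised version of the source latent outside it. A minimal NumPy sketch of that blending operation (the function name and toy latents are illustrative, not taken from the paper's code):

```python
import numpy as np

def blend_step(edited_latent, noised_source_latent, mask):
    """One blending step: keep the newly generated content inside the
    mask and the noised source content outside it."""
    return mask * edited_latent + (1.0 - mask) * noised_source_latent

# Toy example on a 4x4 "latent": edit only the top half.
edited = np.ones((4, 4))
source = np.zeros((4, 4))
mask = np.zeros((4, 4))
mask[:2, :] = 1.0
blended = blend_step(edited, source, mask)
```

Applied at every denoising step, this composite constrains the generation to the masked region while leaving the rest of the image untouched.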

Input image + mask

"sitting"

"jumping"

"wearing sunglasses"
Additional Pose Control
Our method can be integrated with ControlNet for the task of consistent pose-driven image generation:

Input pose 1

Result 1

Input pose 2

Result 2
BibTeX
If you find this research useful, please cite the following:
@inproceedings{avrahami2024chosen,
author = {Avrahami, Omri and Hertz, Amir and Vinker, Yael and Arar, Moab and Fruchter, Shlomi and Fried, Ohad and Cohen-Or, Daniel and Lischinski, Dani},
title = {The Chosen One: Consistent Characters in Text-to-Image Diffusion Models},
year = {2024},
isbn = {9798400705250},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3641519.3657430},
doi = {10.1145/3641519.3657430},
abstract = {Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, users of these models struggle with the generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development, asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach.},
booktitle = {ACM SIGGRAPH 2024 Conference Papers},
articleno = {26},
numpages = {12},
keywords = {Consistent characters generation},
location = {Denver, CO, USA},
series = {SIGGRAPH '24}
}