¹NVIDIA, ²Tel Aviv University, ³Independent
Accepted to SIGGRAPH 2024
TLDR: We enable Stable Diffusion XL (SDXL) to generate consistent subjects across a series of images, without additional training.
We present ConsiStory, a training-free approach that enables consistent subject generation in pretrained text-to-image models. It requires no finetuning or personalization, and as a result takes ~10 seconds per generated image on an H100 (20× faster than previous state-of-the-art methods). We enhance the model with a subject-driven shared attention block and correspondence-based feature injection that promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. ConsiStory naturally extends to multi-subject scenarios and even enables training-free personalization for common objects.
Consistent Set Generations
Abstract
Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.
How does it work?
Architecture outline (left): Given a set of prompts, at every generation step we localize the subject in each generated image I_i.
We utilize the cross-attention maps accumulated up to the current generation step to create subject masks M_i.
Then, we replace the standard self-attention layers in the U-Net decoder with Subject-Driven Self-Attention (SDSA) layers that share information between subject instances.
We also add Feature Injection for additional refinement.
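To make the mask step concrete, here is a minimal PyTorch sketch of deriving subject masks from stored cross-attention maps. The tensor shapes, the subject_token_idx argument, and the fixed 0.3 cutoff are illustrative assumptions standing in for the automatic thresholding used in practice; this is not the official implementation.

import torch

def subject_masks_from_cross_attn(cross_attn_maps, subject_token_idx):
    """Sketch: derive per-image subject masks M_i from cross-attention maps.

    cross_attn_maps:   [batch, heads, patches, text_tokens] attention probabilities,
                       averaged over the generation steps seen so far.
    subject_token_idx: index of the subject's token in the prompt (assumed known).
    """
    # Average attention toward the subject token over heads -> [batch, patches]
    attn = cross_attn_maps[..., subject_token_idx].mean(dim=1)
    # Normalize each map to [0, 1]
    lo = attn.amin(dim=-1, keepdim=True)
    hi = attn.amax(dim=-1, keepdim=True)
    attn = (attn - lo) / (hi - lo + 1e-8)
    # Binarize; the fixed cutoff stands in for an automatic (e.g. Otsu-style) threshold.
    return attn > 0.3  # [batch, patches] boolean subject masks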
Subject-Driven Self-Attention: We extend the self-attention layer so that the Query from a generated image I_i also has access to the Keys from all other images in the batch
(I_j, where j ≠ i), restricted by their subject masks M_j.
To enrich diversity we: (1) weaken the SDSA via dropout, and (2) blend the Query features with vanilla Query features from a non-consistent sampling step, yielding Q*.
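The following is a minimal single-head PyTorch sketch of this extended attention, under assumed shapes and parameter values; the dropout rate and blend weight are illustrative, and projection weights and multi-head handling are omitted for brevity.

import torch

def subject_driven_self_attention(Q, K, V, subject_masks, dropout_p=0.5):
    """Sketch of SDSA for one attention head.

    Q, K, V:       [batch, patches, dim] self-attention projections, one row per image.
    subject_masks: [batch, patches] boolean masks M_i marking subject patches.
    dropout_p:     probability of dropping a shared subject patch (assumed value).
    """
    b, t, d = Q.shape
    scale = d ** -0.5
    # Randomly drop subject patches to weaken SDSA and enrich layout diversity.
    keep = subject_masks & (torch.rand_like(subject_masks, dtype=torch.float) > dropout_p)

    out = []
    for i in range(b):
        K_ext, V_ext = [K[i]], [V[i]]      # image i always sees its own Keys/Values
        for j in range(b):
            if j != i:                      # other images contribute only their subject patches
                K_ext.append(K[j][keep[j]])
                V_ext.append(V[j][keep[j]])
        K_ext, V_ext = torch.cat(K_ext), torch.cat(V_ext)
        attn = (Q[i] @ K_ext.T * scale).softmax(dim=-1)
        out.append(attn @ V_ext)
    return torch.stack(out)                 # [batch, patches, dim]

def blend_queries(Q, Q_vanilla, blend=0.9):
    """Blend SDSA Queries with Queries from a vanilla, non-consistent sampling
    step, yielding Q*; the blend weight here is an assumed value."""
    return blend * Q_vanilla + (1.0 - blend) * Q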
Feature Injection (right): To further refine the subject's identity across images, we introduce a mechanism for blending features within the batch.
We extract a patch correspondence map between each pair of images, and then inject features between images based on that map.
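A minimal sketch of correspondence-based injection between one pair of images, assuming patch features are compared by cosine similarity; the blend strength and similarity threshold are illustrative values, not the paper's settings.

import torch
import torch.nn.functional as F

def feature_injection(feats_i, feats_j, alpha=0.8, sim_threshold=0.5):
    """Sketch: inject features from image j into image i via patch correspondences.

    feats_i, feats_j: [patches, dim] intermediate diffusion features of two images.
    alpha:            blend strength toward the corresponding patch (assumed value).
    sim_threshold:    only inject where the correspondence is confident (assumed value).
    """
    # Patch correspondence map: cosine similarity between all patch pairs.
    sim = F.normalize(feats_i, dim=-1) @ F.normalize(feats_j, dim=-1).T  # [patches, patches]
    best_sim, best_idx = sim.max(dim=-1)  # nearest patch in image j for each patch in i
    # Blend each patch of image i toward its corresponding patch in image j.
    injected = (1.0 - alpha) * feats_i + alpha * feats_j[best_idx]
    return torch.where((best_sim > sim_threshold).unsqueeze(-1), injected, feats_i)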
Comparison To Current Methods
We evaluated our method against IP-Adapter, TI, and DB-LoRA. Some methods failed to maintain consistency (TI) or to follow the
prompt (IP-Adapter). Others alternated between maintaining consistency and following the text, but did not achieve both (DB-LoRA). Our method successfully followed the
prompt while maintaining consistency.
Quantitative Evaluation
Automatic Evaluation (left): ConsiStory (green)
achieves the best balance between Subject Consistency and Textual
Similarity. Encoder-based methods such as ELITE and IP-Adapter often
overfit to visual appearance, while optimization-based methods such as
DB-LoRA and TI do not reach subject consistency as high as ours.
Different points of our method correspond to different self-attention dropout values. Error bars denote S.E.M.
User Study (right): Results indicate a notable preference among participants
for our generated images, both in terms of Subject Consistency (Visual)
and Textual Similarity (Textual).
Multiple Consistent Subjects
ConsiStory can generate image sets with multiple consistent subjects.
ControlNet Integration
Our method can be integrated with ControlNet to generate a consistent character with pose control.
Training-Free Personalization
We utilize edit-friendly inversion to invert two real images per subject.
These inverted images are then used as anchors
in our method, enabling training-free personalization.
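As a rough sketch of this setup (the function name and batch layout are assumptions, not the paper's code): the inverted latents are simply placed alongside fresh noise in one denoising batch, and the shared-attention mechanism lets the new images draw on the anchors' subject patches.

import torch

def build_personalization_batch(anchor_latents, n_new, generator=None):
    """Sketch: assemble a denoising batch where inverted real images act as anchors.

    anchor_latents: [n_anchors, C, H, W] latents obtained by inverting the real
                    subject images with an edit-friendly inversion method.
    n_new:          number of new images to generate alongside the anchors.
    """
    fresh = torch.randn((n_new, *anchor_latents.shape[1:]), generator=generator,
                        dtype=anchor_latents.dtype, device=anchor_latents.device)
    latents = torch.cat([anchor_latents, fresh], dim=0)
    # During Subject-Driven Self-Attention, the new images read Keys/Values from the
    # anchors' subject patches, so the real subject's identity propagates to them.
    is_anchor = torch.tensor([True] * len(anchor_latents) + [False] * n_new)
    return latents, is_anchor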

Seed Variation
Given different starting noise, ConsiStory generates different, yet internally consistent, sets of images.
Ethnic Diversity
The underlying SDXL model may exhibit biases towards certain ethnic groups, and our approach inherits these biases. However, our method can generate consistent subjects from diverse groups when they are specified in the prompt.
BibTeX
@article{tewel2024training,
title={Training-free consistent text-to-image generation},
author={Tewel, Yoad and Kaduri, Omri and Gal, Rinon and Kasten, Yoni and Wolf, Lior and Chechik, Gal and Atzmon, Yuval},
journal={ACM Transactions on Graphics (TOG)},
volume={43},
number={4},
pages={1--18},
year={2024},
publisher={ACM New York, NY, USA}
}