¹NVIDIA, ²Tel Aviv University, ³Independent
Accepted to SIGGRAPH 2024
TLDR: We enable Stable Diffusion XL (SDXL) to generate consistent subjects across a series of images, without additional training.
We present ConsiStory, a training-free approach that enables consistent subject generation in pretrained text-to-image models. It requires no finetuning or personalization, and as a result takes ~10 seconds per generated image on an H100 (20× faster than previous state-of-the-art methods). We enhance the model with a subject-driven shared attention block and correspondence-based feature injection that promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. ConsiStory naturally extends to multi-subject scenarios and even enables training-free personalization for common objects.
Consistent Set Generations
Abstract
Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.
How does it work?
Architecture outline (left): Given a set of prompts, at every generation step we localize the subject in each generated image I_i.
We utilize the cross-attention maps accumulated up to the current generation step to create subject masks M_i.
Then, we replace the standard self-attention layers in the U-Net decoder with Subject-Driven Self-Attention (SDSA) layers that share information between subject instances.
We also add Feature Injection for additional refinement.
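To make the mask step concrete, here is a minimal PyTorch sketch of deriving subject masks from stored cross-attention maps. The tensor shapes, the subject_token_idx argument, and the fixed 0.3 cutoff are illustrative assumptions standing in for the automatic thresholding used in practice; this is not the official implementation.

import torch

def subject_masks_from_cross_attn(cross_attn_maps, subject_token_idx):
    """Sketch: derive per-image subject masks M_i from cross-attention maps.

    cross_attn_maps:   [batch, heads, patches, text_tokens] attention probabilities,
                       averaged over the generation steps seen so far.
    subject_token_idx: index of the subject's token in the prompt (assumed known).
    """
    # Average attention toward the subject token over heads -> [batch, patches]
    attn = cross_attn_maps[..., subject_token_idx].mean(dim=1)
    # Normalize each map to [0, 1]
    lo = attn.amin(dim=-1, keepdim=True)
    hi = attn.amax(dim=-1, keepdim=True)
    attn = (attn - lo) / (hi - lo + 1e-8)
    # Binarize; the fixed cutoff stands in for an automatic (e.g. Otsu-style) threshold.
    return attn > 0.3  # [batch, patches] boolean subject masks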
Subject-Driven Self-Attention: We extend the self-attention layer so that the Query from a generated image I_i also has access to the Keys from all other images in the batch
(I_j, where j ≠ i), restricted by their subject masks M_j.
To enrich diversity we: (1) weaken the SDSA via dropout, and (2) blend the Query features with vanilla Query features from a non-consistent sampling step, yielding Q*.
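The following is a minimal single-head PyTorch sketch of this extended attention, under assumed shapes and parameter values; the dropout rate and blend weight are illustrative, and projection weights and multi-head handling are omitted for brevity.

import torch

def subject_driven_self_attention(Q, K, V, subject_masks, dropout_p=0.5):
    """Sketch of SDSA for one attention head.

    Q, K, V:       [batch, patches, dim] self-attention projections, one row per image.
    subject_masks: [batch, patches] boolean masks M_i marking subject patches.
    dropout_p:     probability of dropping a shared subject patch (assumed value).
    """
    b, t, d = Q.shape
    scale = d ** -0.5
    # Randomly drop subject patches to weaken SDSA and enrich layout diversity.
    keep = subject_masks & (torch.rand_like(subject_masks, dtype=torch.float) > dropout_p)

    out = []
    for i in range(b):
        K_ext, V_ext = [K[i]], [V[i]]      # image i always sees its own Keys/Values
        for j in range(b):
            if j != i:                      # other images contribute only their subject patches
                K_ext.append(K[j][keep[j]])
                V_ext.append(V[j][keep[j]])
        K_ext, V_ext = torch.cat(K_ext), torch.cat(V_ext)
        attn = (Q[i] @ K_ext.T * scale).softmax(dim=-1)
        out.append(attn @ V_ext)
    return torch.stack(out)                 # [batch, patches, dim]

def blend_queries(Q, Q_vanilla, blend=0.9):
    """Blend SDSA Queries with Queries from a vanilla, non-consistent sampling
    step, yielding Q*; the blend weight here is an assumed value."""
    return blend * Q_vanilla + (1.0 - blend) * Q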
Feature Injection (right): To further refine the subject's identity across images, we introduce a mechanism for blending features within the batch.
We extract a patch correspondence map between each pair of images, and then inject features between images based on that map.
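A minimal sketch of correspondence-based injection between one pair of images, assuming patch features are compared by cosine similarity; the blend strength and similarity threshold are illustrative values, not the paper's settings.

import torch
import torch.nn.functional as F

def feature_injection(feats_i, feats_j, alpha=0.8, sim_threshold=0.5):
    """Sketch: inject features from image j into image i via patch correspondences.

    feats_i, feats_j: [patches, dim] intermediate diffusion features of two images.
    alpha:            blend strength toward the corresponding patch (assumed value).
    sim_threshold:    only inject where the correspondence is confident (assumed value).
    """
    # Patch correspondence map: cosine similarity between all patch pairs.
    sim = F.normalize(feats_i, dim=-1) @ F.normalize(feats_j, dim=-1).T  # [patches, patches]
    best_sim, best_idx = sim.max(dim=-1)  # nearest patch in image j for each patch in i
    # Blend each patch of image i toward its corresponding patch in image j.
    injected = (1.0 - alpha) * feats_i + alpha * feats_j[best_idx]
    return torch.where((best_sim > sim_threshold).unsqueeze(-1), injected, feats_i)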
Comparison To Current Methods
We evaluated our method against IP-Adapter, TI, and DB-LoRA. Some methods failed to maintain consistency (TI) or to follow the
prompt (IP-Adapter). Others alternated between maintaining consistency and following the text, but did not achieve both (DB-LoRA). Our method successfully followed the
prompt while maintaining consistency.
Quantitative Evaluation
Automatic Evaluation (left): ConsiStory (green)
achieves the best balance between Subject Consistency and Textual
Similarity. Encoder-based methods such as ELITE and IP-Adapter often
overfit to visual appearance, while optimization-based methods such as
DB-LoRA and TI do not reach subject consistency as high as ours.
Different points of our method correspond to different self-attention dropout values. Error bars denote S.E.M.
User Study (right): Results indicate a notable preference among participants
for our generated images, both in terms of Subject Consistency (Visual)
and Textual Similarity (Textual).
Multiple Consistent Subjects
ConsiStory can generate image sets with multiple consistent subjects.
ControlNet Integration
Our method can be integrated with ControlNet to generate a consistent character with pose control.
Training-Free Personalization
We utilize edit-friendly inversion to invert two real images per subject.
These inverted images are then used as anchors
in our method, enabling training-free personalization.
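As a rough sketch of this setup (the function name and batch layout are assumptions, not the paper's code): the inverted latents are simply placed alongside fresh noise in one denoising batch, and the shared-attention mechanism lets the new images draw on the anchors' subject patches.

import torch

def build_personalization_batch(anchor_latents, n_new, generator=None):
    """Sketch: assemble a denoising batch where inverted real images act as anchors.

    anchor_latents: [n_anchors, C, H, W] latents obtained by inverting the real
                    subject images with an edit-friendly inversion method.
    n_new:          number of new images to generate alongside the anchors.
    """
    fresh = torch.randn((n_new, *anchor_latents.shape[1:]), generator=generator,
                        dtype=anchor_latents.dtype, device=anchor_latents.device)
    latents = torch.cat([anchor_latents, fresh], dim=0)
    # During Subject-Driven Self-Attention, the new images read Keys/Values from the
    # anchors' subject patches, so the real subject's identity propagates to them.
    is_anchor = torch.tensor([True] * len(anchor_latents) + [False] * n_new)
    return latents, is_anchor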

Seed Variation
Given different starting noise, ConsiStory generates different, yet internally consistent, sets of images.
Ethnic Diversity
The underlying SDXL model may exhibit biases towards certain ethnic groups, and our approach inherits these biases. However, our method can generate consistent subjects from diverse groups when they are specified in the prompt.
BibTeX
@article{tewel2024training,
title={Training-free consistent text-to-image generation},
author={Tewel, Yoad and Kaduri, Omri and Gal, Rinon and Kasten, Yoni and Wolf, Lior and Chechik, Gal and Atzmon, Yuval},
journal={ACM Transactions on Graphics (TOG)},
volume={43},
number={4},
pages={1--18},
year={2024},
publisher={ACM New York, NY, USA}
}