Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
Stability AI¹, LMU Munich², Birchlabs³, Independent Researchers⁴
ICML 2024
*Indicates Equal Contribution
Samples generated directly in RGB pixel space using our HDiT models trained on FFHQ-1024² and ImageNet-256².
Abstract
We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high resolution (e.g. 1024²) directly in pixel space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet-256², and sets a new state-of-the-art for diffusion models on FFHQ-1024².
Efficiency

Scaling of computational cost with target resolution for our HDiT-B/4 model vs. DiT-B/4 (Peebles & Xie, 2023), both in pixel space. At megapixel resolutions, our model incurs less than 1% of the computational cost of a comparably sized standard diffusion transformer (DiT).
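To illustrate where this gap comes from: global self-attention over all patch tokens costs quadratically in the token count, while fixed-size neighborhood attention, as used at HDiT's high-resolution levels, costs linearly. The sketch below is a rough back-of-the-envelope calculation in Python, with an assumed kernel size and model width rather than measured FLOPs, showing how the ratio between the two shrinks as resolution grows.

# Rough back-of-the-envelope scaling comparison (illustrative only, not measured FLOPs).
def tokens(resolution, patch_size=4):
    """Number of patch tokens for a square image."""
    return (resolution // patch_size) ** 2

def global_attention_cost(n, dim=768):
    """Global self-attention: every token attends to every token, ~O(n^2 * d)."""
    return n ** 2 * dim

def neighborhood_attention_cost(n, kernel=7, dim=768):
    """Neighborhood attention: each token attends to a fixed k x k window, ~O(n * k^2 * d)."""
    return n * kernel ** 2 * dim

for res in (256, 512, 1024, 2048):
    n = tokens(res)
    ratio = neighborhood_attention_cost(n) / global_attention_cost(n)
    print(f"{res}x{res}: {n} tokens, neighborhood/global attention cost ratio ~ {ratio:.4f}")

Note that the full model's savings also come from processing most blocks at downsampled token resolutions, so the measured gap in the plot is not driven by the attention mechanism alone.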
High-level Architecture Overview

High-level overview of our HDiT architecture, specifically the version for ImageNet at an input resolution of 256² with patch size p = 4, which has three levels. For any doubling in target resolution, another neighborhood attention block is added. "lerp" denotes a linear interpolation with learnable interpolation weight. All HDiT blocks have the noise level and the conditioning (embedded jointly using a mapping network) as additional inputs.
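The sketch below is a minimal PyTorch-style rendering of this layout: two high-resolution levels (neighborhood attention in the paper), a low-resolution global-attention level in the middle, and skip connections merged by a learnable lerp. The block, downsampling, and upsampling modules here are simplified stand-ins of our own, not the reference implementation; see the k-diffusion repository for the actual code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LerpSkip(nn.Module):
    """Merge the upsampled path with the skip connection via a learnable interpolation weight ("lerp")."""
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.tensor(0.5))

    def forward(self, x_up, x_skip):
        return torch.lerp(x_skip, x_up, self.weight)

class PlaceholderBlock(nn.Module):
    """Stand-in for an HDiT block; the real blocks use neighborhood or global
    self-attention and take the jointly embedded noise level / conditioning."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.cond_proj = nn.Linear(dim, dim)

    def forward(self, x, cond):
        # x: (batch, height, width, dim) token grid; cond: (batch, dim) mapping-network output
        return x + self.ff(self.norm(x + self.cond_proj(cond)[:, None, None, :]))

class HourglassSketch(nn.Module):
    """Three-level hourglass: two high-resolution levels around one low-resolution (global-attention) level."""
    def __init__(self, dim):
        super().__init__()
        self.enc1, self.enc2, self.mid = PlaceholderBlock(dim), PlaceholderBlock(dim), PlaceholderBlock(dim)
        self.dec2, self.dec1 = PlaceholderBlock(dim), PlaceholderBlock(dim)
        self.merge1, self.merge2 = LerpSkip(), LerpSkip()

    def down(self, x):  # crude stand-in for the model's token merging
        return F.avg_pool2d(x.permute(0, 3, 1, 2), 2).permute(0, 2, 3, 1)

    def up(self, x):    # crude stand-in for the model's token splitting
        return F.interpolate(x.permute(0, 3, 1, 2), scale_factor=2).permute(0, 2, 3, 1)

    def forward(self, x, cond):
        h1 = self.enc1(x, cond)                            # level 1: full token resolution
        h2 = self.enc2(self.down(h1), cond)                # level 2: half resolution
        m = self.mid(self.down(h2), cond)                  # level 3: global attention at quarter resolution
        d2 = self.dec2(self.merge2(self.up(m), h2), cond)
        d1 = self.dec1(self.merge1(self.up(d2), h1), cond)
        return d1

model = HourglassSketch(dim=64)
x = torch.randn(1, 64, 64, 64)     # e.g. a 256x256 image at patch size 4 -> 64x64 token grid
cond = torch.randn(1, 64)          # joint noise-level + class embedding from the mapping network
print(model(x, cond).shape)        # torch.Size([1, 64, 64, 64])

As the caption notes, the real model gains a further neighborhood attention level for each doubling of the target resolution, rather than being fixed at three levels.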
Works Building upon HDiT
There have been a number of cool works that use or build upon HDiT, including:
- DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models (Rosu et al., CVPR 2025)
- Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space (Liu et al., CVPR 2025)
- Fast LiDAR Data Generation with Rectified Flows (Nakashima et al., ICRA 2025)
- CryoFM: A Flow-based Foundation Model for Cryo-EM Densities (Zhou et al., ICLR 2025)
- Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration (Ohayon et al., ICLR 2025)
- ARFree: Autoregression-free video prediction using diffusion model for mitigating error propagation (Ko et al., ICIP 2025)
- Latent Posterior-Mean Rectified Flow for Higher-Fidelity Perceptual Face Restoration (Luo et al., 2025)
- VideoPDE: Unified Generative PDE Solving via Video Inpainting Diffusion Models (Li et al., 2025)
- InvFussion: Bridging Supervised and Zero-shot Diffusion for Inverse Problems (Elata et al., 2025)
- LeDiFlow: Learned Distribution-guided Flow Matching to Accelerate Image Generation (Zwick et al., 2025)
- VariFace: Fair and Diverse Synthetic Dataset Generation for Face Recognition (Yeung et al., 2025)
- Phy-Diff: Physics-guided Hourglass Diffusion Model for Diffusion MRI Synthesis (Zhang et al., MICCAI 2024)
- Second Edition FRCSyn Challenge at CVPR 2024: Face Recognition Challenge in the Era of Synthetic Data (DeAndres-Tame et al., 2024)
- Pictures Of MIDI: Controlled Music Generation via Graphical Prompts for Image-Based Diffusion Inpainting (Hawley, 2024)
- joliGEN: Generative AI Image & Video Toolset with GANs, Diffusion and Consistency Models for Real-World Applications
If you use or build upon HDiT in your work and would like to be listed here, please let us know.
Supplementary Material
We provide the 50k generated samples used for FID computation for our 557M ImageNet model without CFG (part 1, 2, 3, 4, 5, 6, 7, 8), with CFG = 1.3 (part 1, 2, 3, 4, 5, 6, 7, 8), and for our FFHQ-1024² model without CFG (part 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21).
BibTeX
@InProceedings{crowson2024hourglass,
title = {Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers},
author = {Crowson, Katherine and Baumann, Stefan Andreas and Birch, Alex and Abraham, Tanishq Mathew and Kaplan, Daniel Z and Shippole, Enrico},
booktitle = {Proceedings of the 41st International Conference on Machine Learning},
pages = {9550--9575},
year = {2024},
editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
volume = {235},
series = {Proceedings of Machine Learning Research},
month = {21--27 Jul},
publisher = {PMLR},
pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/crowson24a/crowson24a.pdf},
url = {https://proceedings.mlr.press/v235/crowson24a.html},
abstract = {We present the Hourglass Diffusion Transformer (HDiT), an image-generative model that exhibits linear scaling with pixel count, supporting training at high resolution (e.g. $1024 \times 1024$) directly in pixel-space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet $256^2$, and sets a new state-of-the-art for diffusion models on FFHQ-$1024^2$. Code is available at https://github.com/crowsonkb/k-diffusion.}
}