
A Lightweight On-Device Unified Model for Image Generation and Editing
On-Device Demo
Real-time generation & editing on iPhone 17 Pro — no cloud, fully on-device.
Human Portrait & Style Transfer
Nature Landscape & Background Change
Samples
Click on any image to view it in full resolution along with the prompt.
Generated Samples
Text-to-Image generation results
Edited Samples
Text-guided image editing results
About DreamLite
In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B parameters) that supports both text-to-image (T2I) generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. To stabilize training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After SFT and RL, DreamLite outperforms existing on-device models and remains competitive with several server-side models on both generation and editing tasks. By employing step distillation, we further achieve 4-step inference, enabling DreamLite to generate or edit a 1024 × 1024 image in ~3 s (using 4-bit Qwen VL and fp16 VAE + U-Net) on an iPhone 17 Pro.
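To make the in-context conditioning concrete, here is a minimal plain-Python sketch (toy shapes; the function name and layout are our illustration, not the paper's implementation) of spatially concatenating a condition latent with the noisy latent so one network serves both generation and editing:

```python
def concat_latents(noise_latent, cond_latent=None):
    """Concatenate a condition latent to the noisy latent along width.

    Latents are nested lists shaped [C][H][W]. For pure T2I there is no
    condition, so the noisy latent passes through unchanged; for editing,
    the source-image latent is placed side by side with the noise so the
    U-Net sees both in one spatial canvas.
    """
    if cond_latent is None:
        return noise_latent
    return [
        [cond_row + noise_row for cond_row, noise_row in zip(c_ch, n_ch)]
        for c_ch, n_ch in zip(cond_latent, noise_latent)
    ]

# toy 1-channel 2x2 latents
noise = [[[0.0, 0.0], [0.0, 0.0]]]
cond  = [[[1.0, 1.0], [1.0, 1.0]]]
joint = concat_latents(noise, cond)   # editing input: 2x4 (condition | noise)
t2i   = concat_latents(noise)         # T2I input stays 2x2
```

Because the condition enters as extra spatial context rather than extra input channels, the same U-Net weights and input signature handle both tasks.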
Our contributions are summarized as follows:
- We propose, to the best of our knowledge, the first unified on-device model that supports both text-to-image generation and text-based image editing, eliminating the need to deploy two separate models.
- We introduce an in-context conditioning mechanism for UNet to unify generation and editing, and propose a task-progressive joint pretraining scheme (i.e., T2I → Edit → Unified Joint Training) to stably train the model.
- DreamLite achieves competitive performance on standard benchmarks and consistently outperforms prior mobile models. Once deployed on a mobile device, DreamLite can generate or edit a 1024 × 1024 image in under 5 s.
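The task-progressive schedule from the second contribution can be sketched as a stage-dependent task sampler. This is our illustration only: the stage names and the 50/50 mix in the unified stage are assumptions, not the paper's actual ratios.

```python
import random

# Stages of task-progressive joint pretraining: T2I → Edit → Unified.
STAGES = ("t2i", "edit", "unified")

def sample_task(stage, rng=random):
    """Pick the task for one training batch under a given stage.

    The "t2i" and "edit" stages each train a single task; the final
    "unified" stage mixes both (50/50 here is an illustrative choice).
    """
    if stage == "t2i":
        return "t2i"
    if stage == "edit":
        return "edit"
    if stage == "unified":
        return "t2i" if rng.random() < 0.5 else "edit"
    raise ValueError(f"unknown stage: {stage}")

# A training run walks STAGES in order, so the compact model masters
# generation alone before editing is introduced, then trains on both.
```

Sequencing the stages this way mirrors the paper's observation that training the joint objective from scratch is unstable for a 0.39B model.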
Model Architecture
Overview of the proposed framework and its key components.
Visual Results
Main Results
Table 1. Comparison with existing methods on GenEval, DPG, ImgEdit and GEdit-EN Benchmarks.
| Method | Params | GenEval ↑ | DPG ↑ | ImgEdit ↑ | GEdit-EN-Q ↑ |
|---|---|---|---|---|---|
| FLUX.1-Dev / Kontext | 12B | 0.67 | 84.0 | 3.76 | 6.79 |
| BAGEL | 7B | 0.82 | 85.1 | 3.42 | 7.20 |
| OmniGen2 | 4B | 0.80 | 83.6 | 3.44 | 6.79 |
| LongCat-Image / Edit | 6B | 0.87 | 86.6 | 4.49 | 7.55 |
| DeepGen1.0 | 2B | 0.83 | 84.6 | 4.03 | 7.54 |
| SANA-1.6B | 1.6B | 0.67 | 84.8 | - | - |
| MEISSONIC | 1B | 0.54 | 65.3 | - | - |
| VIBE | 1.6B | - | - | 3.85 | 7.28 |
| SANA-0.6B | 0.6B | 0.64 | 83.6 | - | - |
| SnapGen++ (small) | 0.4B | 0.66 | 85.2 | - | - |
| EditMGT | 0.96B | - | - | 2.89 | 6.33 |
| DreamLite (Ours) | 0.39B | 0.72 | 85.8 | 4.11 | 6.88 |
Table 2. Ablation study on GenEval and ImgEdit benchmarks. "TPJ" denotes "Task-progressive Joint".
| Experiments | Mechanism | Training Stage | GenEval ↑ | ImgEdit ↑ |
|---|---|---|---|---|
| Text-to-Image Pretraining | - | T2I | 0.70 | - |
| Condition Mechanism | Pix2Pix | T2I → Edit | 0.56 | 3.67 |
| Condition Mechanism | Pix2Pix | T2I → Edit → Unified | 0.61 | 3.65 |
| Training Strategy | In-context | T2I → T2I | 0.65 | - |
| Training Strategy | In-context | T2I → Edit | 0.64 | 3.88 |
| Training Strategy | In-context | T2I → Unified | 0.65 | 3.14 |
| Training Strategy | In-context | T2I → Edit → Unified | 0.71 | 3.94 |
| Reinforcement Learning | In-context | TPJ Pretrain → RLHF | 0.72 | 4.11 |
| Step Distillation | In-context | TPJ Pretrain → RLHF → DMD | 0.70 | 3.80 |
Roadmap & Contact
Our release plan and how to reach us.
BibTeX
@article{feng2026dreamlite,
title={DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing},
author={Kailai Feng and Yuxiang Wei and Bo Chen and Yang Pan and Hu Ye and Songwei Liu and Chenqian Yan and Yuan Gao},
journal={arXiv preprint arXiv:2603.28713},
year={2026}
}