
A Lightweight On-Device Unified Model for Image Generation and Editing
On-Device Demo
Real-time generation & editing on iPhone 17 Pro — no cloud, fully on-device.
Human Portrait & Style Transfer
Nature Landscape & Background Change
Samples
Click on any image to view it in full resolution along with the prompt.
Generated Samples
Text-to-Image generation results
Edited Samples
Text-guided image editing results
About DreamLite
In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B parameters) that supports both text-to-image (T2I) generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. To stabilize training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After SFT and RL, DreamLite outperforms existing on-device models and remains competitive with several server-side models on both generation and editing tasks. By employing step distillation, we further achieve 4-step inference, enabling DreamLite to generate or edit a 1024 × 1024 image in ~3 s (using 4-bit Qwen VL and fp16 VAE + U-Net) on an iPhone 17 Pro.
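To make the in-context conditioning concrete, here is a minimal plain-Python sketch (toy shapes; the function name and layout are our illustration, not the paper's implementation) of spatially concatenating a condition latent with the noisy latent so one network serves both generation and editing:

```python
def concat_latents(noise_latent, cond_latent=None):
    """Concatenate a condition latent to the noisy latent along width.

    Latents are nested lists shaped [C][H][W]. For pure T2I there is no
    condition, so the noisy latent passes through unchanged; for editing,
    the source-image latent is placed side by side with the noise so the
    U-Net sees both in one spatial canvas.
    """
    if cond_latent is None:
        return noise_latent
    return [
        [cond_row + noise_row for cond_row, noise_row in zip(c_ch, n_ch)]
        for c_ch, n_ch in zip(cond_latent, noise_latent)
    ]

# toy 1-channel 2x2 latents
noise = [[[0.0, 0.0], [0.0, 0.0]]]
cond  = [[[1.0, 1.0], [1.0, 1.0]]]
joint = concat_latents(noise, cond)   # editing input: 2x4 (condition | noise)
t2i   = concat_latents(noise)         # T2I input stays 2x2
```

Because the condition enters as extra spatial context rather than extra input channels, the same U-Net weights and input signature handle both tasks.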
Our contributions are summarized as follows:
- We propose, to the best of our knowledge, the first unified on-device model that supports both text-to-image generation and text-based image editing, eliminating the need to deploy two separate models.
- We introduce an in-context conditioning mechanism for UNet to unify generation and editing, and propose a task-progressive joint pretraining scheme (i.e., T2I → Edit → Unified Joint Training) to stably train the model.
- DreamLite achieves competitive performance on standard benchmarks and consistently outperforms prior mobile models. Once deployed on a mobile device, DreamLite can generate or edit a 1024 × 1024 image in under 5 s.
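The task-progressive schedule from the second contribution can be sketched as a stage-dependent task sampler. This is our illustration only: the stage names and the 50/50 mix in the unified stage are assumptions, not the paper's actual ratios.

```python
import random

# Stages of task-progressive joint pretraining: T2I → Edit → Unified.
STAGES = ("t2i", "edit", "unified")

def sample_task(stage, rng=random):
    """Pick the task for one training batch under a given stage.

    The "t2i" and "edit" stages each train a single task; the final
    "unified" stage mixes both (50/50 here is an illustrative choice).
    """
    if stage == "t2i":
        return "t2i"
    if stage == "edit":
        return "edit"
    if stage == "unified":
        return "t2i" if rng.random() < 0.5 else "edit"
    raise ValueError(f"unknown stage: {stage}")

# A training run walks STAGES in order, so the compact model masters
# generation alone before editing is introduced, then trains on both.
```

Sequencing the stages this way mirrors the paper's observation that training the joint objective from scratch is unstable for a 0.39B model.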
Model Architecture
Overview of the proposed framework and its key components.
Visual Results
Main Results
Table 1. Comparison with existing methods on GenEval, DPG, ImgEdit and GEdit-EN Benchmarks.
| Method | Params | GenEval ↑ | DPG ↑ | ImgEdit ↑ | GEdit-EN-Q ↑ |
|---|---|---|---|---|---|
| FLUX.1-Dev / Kontext | 12B | 0.67 | 84.0 | 3.76 | 6.79 |
| BAGEL | 7B | 0.82 | 85.1 | 3.42 | 7.20 |
| OmniGen2 | 4B | 0.80 | 83.6 | 3.44 | 6.79 |
| LongCat-Image / Edit | 6B | 0.87 | 86.6 | 4.49 | 7.55 |
| DeepGen1.0 | 2B | 0.83 | 84.6 | 4.03 | 7.54 |
| SANA-1.6B | 1.6B | 0.67 | 84.8 | - | - |
| MEISSONIC | 1B | 0.54 | 65.3 | - | - |
| VIBE | 1.6B | - | - | 3.85 | 7.28 |
| SANA-0.6B | 0.6B | 0.64 | 83.6 | - | - |
| SnapGen++ (small) | 0.4B | 0.66 | 85.2 | - | - |
| EditMGT | 0.96B | - | - | 2.89 | 6.33 |
| DreamLite (Ours) | 0.39B | 0.72 | 85.8 | 4.11 | 6.88 |
Table 2. Ablation study on GenEval and ImgEdit benchmarks. "TPJ" denotes "Task-progressive Joint".
| Experiments | Mechanism | Training Stage | GenEval ↑ | ImgEdit ↑ |
|---|---|---|---|---|
| Text-to-Image Pretraining | - | T2I | 0.70 | - |
| Condition Mechanism | Pix2Pix | T2I → Edit | 0.56 | 3.67 |
| Condition Mechanism | Pix2Pix | T2I → Edit → Unified | 0.61 | 3.65 |
| Training Strategy | In-context | T2I → T2I | 0.65 | - |
| Training Strategy | In-context | T2I → Edit | 0.64 | 3.88 |
| Training Strategy | In-context | T2I → Unified | 0.65 | 3.14 |
| Training Strategy | In-context | T2I → Edit → Unified | 0.71 | 3.94 |
| Reinforcement Learning | In-context | TPJ Pretrain → RLHF | 0.72 | 4.11 |
| Step Distillation | In-context | TPJ Pretrain → RLHF → DMD | 0.70 | 3.80 |
Roadmap & Contact
Our release plan and how to reach us.
BibTeX
@article{feng2026dreamlite,
title={DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing},
author={Kailai Feng and Yuxiang Wei and Bo Chen and Yang Pan and Hu Ye and Songwei Liu and Chenqian Yan and Yuan Gao},
journal={arXiv preprint arXiv:2603.28713},
year={2026}
}