Self-Refining Video Sampling


1KAIST    2NYU    3NTU Singapore    4DeepAuto.ai

*Equal Contribution, Equal Advising


[TL;DR] We present a self-refining video sampling method that reuses a pre-trained video generator as a denoising autoencoder to iteratively refine latents. With ~50% additional NFEs, it improves physical realism (e.g., motion coherence and physics alignment) without any external verifier, training, or dataset.


Methods


Flow Matching as Denoising Autoencoder

We revisit the connection between diffusion models and denoising autoencoders (DAEs) [1-2], and extend it to interpret flow matching as a DAE from a training-objective perspective. In particular, up to a time-dependent weighting, the flow-matching loss is equivalent to a standard DAE [3] reconstruction objective, for which the model learns to denoise the corrupted input $z_t$ back to the clean sample $z_1$.
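
Concretely, here is a short derivation sketch, assuming the rectified-flow interpolation $z_t = (1-t)\,z_0 + t\,z_1$ with Gaussian noise $z_0$ and data $z_1$ (the paper's exact parameterization and weighting may differ):

$$
\begin{aligned}
\mathcal{L}_{\mathrm{FM}}
  &= \mathbb{E}_{t,\,z_0,\,z_1}\big[\,\| v_\theta(z_t,t) - (z_1 - z_0) \|^2\,\big], \\
\hat{z}_1(z_t,t)
  &:= z_t + (1-t)\, v_\theta(z_t,t)
  \quad \text{(one-step clean-sample prediction)}, \\
\mathcal{L}_{\mathrm{FM}}
  &= \mathbb{E}_{t,\,z_0,\,z_1}\Big[ \tfrac{1}{(1-t)^2}\, \| \hat{z}_1(z_t,t) - z_1 \|^2 \Big],
\end{aligned}
$$

i.e., a DAE reconstruction loss on $(z_t, z_1)$ with time-dependent weight $1/(1-t)^2$.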

Thus, a flow-matching-trained model can be used as a time-conditioned DAE. At inference time, we repurpose the pre-trained model as a DAE, i.e., a self-refiner that iteratively refines samples toward the data manifold.

In practice, only 2–3 iterations are sufficient to improve temporal coherence and physical plausibility. The refined state $z_t^*:= z^{(K)}_{t}$ is then used as the updated state and passed to the next ODE step, enabling plug-and-play integration with existing solvers by simply replacing $z_t$ with $z_t^*$.
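
For intuition, the sketch below (PyTorch-style, under assumptions) shows how this Predict-and-Perturb (P&P) refinement can be dropped into a plain Euler solver. It assumes the interpolation $z_t = (1-t)\,\epsilon + t\,z_1$ with noise $\epsilon$ and clean sample $z_1$, omits classifier-free guidance, and the names predict_clean, self_refine, sample, and refine_steps are illustrative rather than the actual implementation.

import torch

@torch.no_grad()
def predict_clean(v_theta, z_t, t, cond):
    # One-step clean-sample prediction from the velocity model,
    # assuming z_t = (1 - t) * noise + t * data.
    return z_t + (1.0 - t) * v_theta(z_t, t, cond)

@torch.no_grad()
def self_refine(v_theta, z_t, t, cond, num_iters=2):
    # Predict-and-Perturb: map the current state to a clean-sample
    # estimate, then re-noise it back to time t; K = 2-3 iterations.
    for _ in range(num_iters):
        z1_hat = predict_clean(v_theta, z_t, t, cond)   # Predict
        eps = torch.randn_like(z_t)
        z_t = t * z1_hat + (1.0 - t) * eps              # Perturb
    return z_t

@torch.no_grad()
def sample(v_theta, z0, cond, timesteps, refine_steps=()):
    # Plain Euler solver: at selected steps, z_t is replaced by its
    # refined counterpart z_t^* before taking the next ODE step.
    z = z0
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        if i in refine_steps:
            z = self_refine(v_theta, z, t, cond, num_iters=2)
        z = z + (t_next - t) * v_theta(z, t, cond)
    return z

Since each refinement iteration costs one extra forward pass, applying it at only a subset of timesteps keeps the overhead around the ~50% additional NFEs mentioned above.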


[1] Pascal Vincent, A Connection Between Score Matching and Denoising Autoencoders, Neural Computation, 2011
[2] Song & Ermon, Generative Modeling by Estimating Gradients of the Data Distribution, NeurIPS, 2019
[3] Bengio et al., Generalized Denoising Auto-encoders as Generative Models, NeurIPS, 2013


Uncertainty-aware Predict-and-Perturb

However, we observe that multiple P&P updates ($K \geq 3$) with classifier-free guidance can lead to over-saturation or over-simplification in static regions such as the background. To address this, we propose an Uncertainty-aware P&P that selectively refines only the locally uncertain regions. Specifically, we estimate the uncertainty of the prediction, $\mathbf{U}=\|\hat{z}_1^{(k)}-\hat{z}_1^{(k+1)}\|_1$, which measures how sensitive the prediction is to the Perturb step.


Here, we apply the P&P update only to regions where the uncertainty $\mathbf{U}$ exceeds a predefined threshold $\tau$. Notably, this requires no additional function evaluations (NFEs), since both predictions are already computed during the P&P process. Please refer to the paper for more details.
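
As an illustration, here is a sketch of the masked update under assumptions: the uncertainty map is averaged over the channel dimension before thresholding, and the refined state is kept only where $\mathbf{U} > \tau$; the function name and the reduction choice are illustrative, not necessarily what the paper uses.

import torch

@torch.no_grad()
def uncertainty_masked_update(z_prev, z_refined, z1_hat_k, z1_hat_k1, tau):
    # Uncertainty U: L1 distance between consecutive clean-sample
    # predictions, averaged over channels (assumed reduction).
    U = (z1_hat_k - z1_hat_k1).abs().mean(dim=1, keepdim=True)
    mask = (U > tau).to(z_refined.dtype)
    # Accept the refined state only in uncertain regions; keep the
    # previous state in confident regions such as static background.
    return mask * z_refined + (1.0 - mask) * z_prev

Both $\hat{z}_1^{(k)}$ and $\hat{z}_1^{(k+1)}$ are produced by the P&P iterations themselves, so the mask adds no extra NFEs.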

Results


Motion Enhanced Video Generation with Wan2.2-A14B T2V

Wan2.2-A14B already generates human motion reasonably well, but our method still improves it substantially.

Image-to-Video Robotics Generation with Cosmos-Predict2.5-2B

Our method is also applicable to Image-to-Video generation. Applied to Cosmos-Predict2.5-2B, it reduces common robotics artifacts such as unstable grasps and implausible interactions, and produces more consistent motion than rejection sampling with Cosmos-Reason1-7B (best-of-4). This may be useful for downstream tasks such as vision-language-action (VLA) models, where even small artifacts can significantly affect perception and action.

Table: PAI-Bench-G evaluation results on robotics I2V generation. Grasp and Robot-QA are measured by Gemini 3 Flash.

Additional Results

See more results in our paper!

Citation


@article{jang2026selfrefining,
    title={Self-Refining Video Sampling}, 
    author={Sangwon Jang and Taekyung Ki and Jaehyeong Jo and Saining Xie and Jaehong Yoon and Sung Ju Hwang},
    year={2026},
    journal={arXiv preprint arXiv:2601.18577},
}
            

Acknowledgement


This page is based on REPA.