USD

Optimized View and Geometry Distillation from Multi-view Diffuser

IJCAI 2025

Youjia Zhang¹, Zikai Song¹, Junqing Yu¹, Yawei Luo², Wei Yang¹

¹ Huazhong University of Science and Technology ² Zhejiang University

Abstract

Our technique produces multi-view images and geometries that are comparable, and sometimes superior (particularly for irregular camera poses), to those of concurrent methods such as SyncDreamer and Wonder3D, without training on large-scale data.

Generating multi-view images from a single input view with image-conditioned diffusion models is a recent advance that has shown considerable potential. However, issues such as inconsistency across synthesized views and over-smoothing in the extracted geometry persist. Previous methods integrate multi-view consistency modules or impose additional supervision to enhance view consistency, but they compromise the flexibility of camera positioning and limit the versatility of view synthesis. In this study, we treat the radiance field optimized during geometry extraction as a more rigid consistency prior than the volume and ray aggregation used in previous works. We further identify and rectify a critical bias in the traditional process of optimizing a radiance field through score distillation from a multi-view diffuser. We introduce Unbiased Score Distillation (USD), which uses unconditional noise predictions from a 2D diffusion model and greatly improves the fidelity of the radiance field. Using the rendered views from the optimized radiance field as a basis, we develop a two-step specialization process for a 2D diffusion model, making it adept at object-specific denoising and at generating high-quality multi-view images. Finally, we recover faithful geometry and texture directly from the refined multi-view images. Empirical evaluations demonstrate that our optimized geometry and view distillation technique produces results comparable to state-of-the-art models trained on extensive datasets, while maintaining freedom in camera positioning.
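For readers unfamiliar with score distillation, the block below writes out the standard score distillation sampling (SDS) gradient and the classifier-free guidance decomposition that such methods commonly build on. This is background only, not the USD formulation: the symbols (radiance field parameters θ, rendered view x, condition c, guidance scale s, weighting w(t)) are the usual conventions and are assumptions here, since this page does not state the USD formula.

```latex
% Standard SDS gradient for a rendered view x = g(\theta), with forward-noised sample x_t
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t; c, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\,x + \sqrt{1-\bar{\alpha}_t}\,\epsilon

% Classifier-free guidance: the prediction mixes a conditional and an unconditional term
\hat{\epsilon}_\phi(x_t; c, t)
  = \epsilon_\phi(x_t; \varnothing, t)
  + s\,\bigl(\epsilon_\phi(x_t; c, t) - \epsilon_\phi(x_t; \varnothing, t)\bigr)
```

The unconditional term \epsilon_\phi(x_t; \varnothing, t) is the quantity the abstract refers to when it speaks of unconditional noise from a 2D diffusion model; the exact way USD uses it is given in the paper, not on this page.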

Overall pipeline of our approach

Denoising with unconditional noise

The unconditional noise predicted by the Zero-1-to-3 model tends to be biased. The right subfigure shows the averaged difference between the predicted noise and the added noise. We use the unconditional noise predicted by Zero-1-to-3 to remove noise from the noisy input and recover the original image. Even though only a very low level of noise has been added, the denoised result deviates substantially from the original image. In contrast, when we denoise with the unconditional noise predicted by Stable Diffusion, only subtle details change, while the main structure and identity of ‘Mario’ are preserved.
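The denoising check described above can be illustrated with a minimal NumPy sketch. It uses the standard DDPM forward process and one-step estimate of the clean image; it does not call Zero-1-to-3 or Stable Diffusion. The linear noise schedule, the timestep, the toy 8×8 "image", and the constant offset standing in for a biased noise prediction are all assumptions made for illustration.

```python
import numpy as np

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal rate for a linear beta schedule (assumed, DDPM-style)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)[t]

def add_noise(x0, eps, a_bar):
    """Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

def denoise_one_step(x_t, eps_pred, a_bar):
    """One-step estimate of x0 from a noise prediction."""
    return (x_t - np.sqrt(1.0 - a_bar) * eps_pred) / np.sqrt(a_bar)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # toy stand-in for an image
eps = rng.standard_normal(x0.shape)        # the noise that is actually added
a_bar = alpha_bar(t=50)                    # low noise level (assumed timestep)
x_t = add_noise(x0, eps, a_bar)

# Unbiased prediction: the estimate recovers x0 up to floating-point error.
exact = denoise_one_step(x_t, eps, a_bar)

# Hypothetical biased prediction: the true noise plus a constant offset.
bias = 0.5
biased = denoise_one_step(x_t, eps + bias, a_bar)

err_exact = np.abs(exact - x0).max()
err_biased = np.abs(biased - x0).max()
print(f"max error, unbiased prediction: {err_exact:.2e}")
print(f"max error, biased prediction:   {err_biased:.2e}")
```

In this sketch, a bias b in the predicted noise shifts the recovered image by b · sqrt(1 − ᾱ_t) / sqrt(ᾱ_t), so the error comes entirely from the prediction and not from the amount of noise that was added.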

Unbiased Sampling of Multi-view Diffuser

More results

Citation
