MV-Adapter: Multi-view Consistent Image Generation Made Easy

Here we show that MV-Adapter generates viewpoints with elevation ranging from 0 to 30 degrees.
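To make the elevation range concrete, the following is a minimal sketch of how camera centers on an orbit around the object could be placed for a given elevation. The function name `orbit_camera_positions`, the evenly spaced azimuths, the unit radius, and the z-up convention are illustrative assumptions, not details taken from MV-Adapter.

```python
import math


def orbit_camera_positions(num_views, elevation_deg, radius=1.0):
    """Return (x, y, z) camera centers on a circle around the origin (z is up).

    Hypothetical helper: evenly spaced azimuths and a fixed radius are
    assumptions made for illustration only.
    """
    elevation = math.radians(elevation_deg)
    positions = []
    for i in range(num_views):
        azimuth = 2.0 * math.pi * i / num_views
        x = radius * math.cos(elevation) * math.cos(azimuth)
        y = radius * math.cos(elevation) * math.sin(azimuth)
        z = radius * math.sin(elevation)
        positions.append((x, y, z))
    return positions


# Six views at the two ends of the elevation range mentioned above.
level_views = orbit_camera_positions(num_views=6, elevation_deg=0.0)
raised_views = orbit_camera_positions(num_views=6, elevation_deg=30.0)
```

Every camera stays at the same distance from the origin, and only its height changes with elevation.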

Abstract

Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to (1) high computational costs, especially with large base models and high-resolution images, and (2) degradation in image quality due to optimization difficulties and scarce high-quality 3D data. In this paper, we propose the first adapter-based solution for multi-view image generation, and introduce MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. By updating fewer parameters, MV-Adapter enables efficient training and preserves the prior knowledge embedded in pre-trained models, mitigating overfitting risks. To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and a parallel attention architecture, so that the adapter inherits the powerful priors of the pre-trained models when modeling the novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation, and opens up new possibilities due to its efficiency, adaptability and versatility.

Method

MV-Adapter is a plug-and-play adapter that learns multi-view priors transferable to derivatives of T2I models without specific tuning, and enables T2I models to generate multi-view consistent images under various conditions. At inference time, our MV-Adapter, which contains a condition guider (yellow) and the decoupled attention layers (blue), can be directly inserted into a personalized or distilled T2I model to constitute the multi-view generator.

Our MV-Adapter consists of two components: (1) a condition guider that encodes the camera or geometry condition; (2) decoupled attention layers that contain multi-view attention layers for learning multi-view consistency, and optional image cross-attention layers to support image-conditioned generation. For the image cross-attention layers, we use the pre-trained U-Net to encode the reference image and extract fine-grained information.
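The decoupled, parallel layout described above can be illustrated with a small dependency-free sketch. It is not the authors' implementation. It assumes single-head attention, full attention across all views for the multi-view branch (the actual layers may restrict which tokens attend to each other), and a zero-initialized output projection on the duplicated layer so that the adapter starts as a no-op. The optional image cross-attention branch is omitted, and all function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, context, weights):
    """Single-head attention; `weights` holds the q, k, v and output projections."""
    q = matmul(queries, weights["q"])
    k = matmul(context, weights["k"])
    v = matmul(context, weights["v"])
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    probs = [softmax(r) for r in scores]
    return matmul(matmul(probs, v), weights["out"])


def duplicate_for_adapter(self_weights):
    """Copy the self-attention weights, then zero the output projection (assumed init)."""
    copied = {name: [row[:] for row in mat] for name, mat in self_weights.items()}
    copied["out"] = [[0.0] * len(row) for row in copied["out"]]
    return copied


def parallel_attention(views, self_weights, mv_weights):
    """Add the frozen self-attention branch and the multi-view branch to the input."""
    all_tokens = [token for view in views for token in view]
    outputs = []
    for view in views:
        spatial = attention(view, view, self_weights)    # original layer, one view
        multi = attention(view, all_tokens, mv_weights)  # adapter layer, all views
        outputs.append([
            [x + s + m for x, s, m in zip(x_row, s_row, m_row)]
            for x_row, s_row, m_row in zip(view, spatial, multi)
        ])
    return outputs


# Toy setup: 3 views, 2 tokens per view, 4 channels, fixed seed for repeatability.
rng = random.Random(0)
dim, num_views, tokens_per_view = 4, 3, 2
self_weights = {
    name: [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    for name in ("q", "k", "v", "out")
}
mv_weights = duplicate_for_adapter(self_weights)
views = [
    [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(tokens_per_view)]
    for _ in range(num_views)
]
adapted = parallel_attention(views, self_weights, mv_weights)
```

Because the new branch is added alongside the original self-attention rather than inserted in series after it, the base model's features pass through unchanged until the adapter weights are trained. Under the zero-initialization assumed here, `adapted` equals the output of the original self-attention block alone.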

Text-to-Multiview

Image-to-Multiview

Sketch-to-Multiview (with ControlNet)

Text-condition 3D Generation

Image-condition 3D Generation

BibTeX

@article{huang2024mvadapter,
  title={MV-Adapter: Multi-view Consistent Image Generation Made Easy},
  author={Huang, Zehuan and Guo, Yuanchen and Wang, Haoran and Yi, Ran and Ma, Lizhuang and Cao, Yan-Pei and Sheng, Lu},
  journal={arXiv preprint arXiv:2412.03632},
  year={2024}
}