Parts2Whole: Generalizable Multi-Part Portrait Customization

TIP 2025

Beihang University
Teaser Image

Parts2Whole generates realistic human images in various postures from reference images of human parts, in any quantity and from different sources.

Abstract

Multi-part portrait customization aims to generate realistic human images by assembling specified body parts from multiple reference images, with significant applications in digital human creation. Existing customization methods typically follow one of two approaches: (1) test-time fine-tuning, which learns concepts effectively but is time-consuming and struggles with multi-part composition; or (2) generalizable feed-forward methods, which are efficient but lack fine-grained control over appearance. To address these limitations, we present Parts2Whole, a diffusion-based generalizable portrait generator that harmoniously integrates multiple reference parts into high-fidelity human images through our proposed multi-reference mechanism. To adequately characterize each part, we propose a detail-aware appearance encoder that is initialized from the pre-trained denoising U-Net and thus inherits its powerful image priors, enabling it to encode detailed information from the reference images. The extracted features are incorporated into the denoising U-Net through a shared self-attention mechanism, which we enhance with mask information for precise part selection. Additionally, we integrate pose-map conditioning to control the target posture of the generated portraits, allowing more flexible customization. Extensive experiments demonstrate the superiority of our approach over existing methods, as well as its applicability to related tasks such as pose transfer and pose-guided human image generation, showcasing its versatile conditioning.
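
As a concrete illustration of the encoder initialization described above, the following is a minimal PyTorch sketch that copies the denoising U-Net's structure and pre-trained weights into the appearance encoder. It assumes a diffusers-style Stable Diffusion backbone; the specific checkpoint name is our assumption for illustration, not taken from the paper.

import copy

from diffusers import UNet2DConditionModel

# Pre-trained denoising U-Net of the text-to-image backbone
# (the checkpoint name is an assumption for illustration).
denoising_unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# The appearance encoder shares the U-Net's architecture and starts from
# the same pre-trained weights, inheriting the backbone's image priors.
appearance_encoder = copy.deepcopy(denoising_unet)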

Method

Method Overview

Overview of Parts2Whole. Built on a text-to-image diffusion model, our method employs an appearance encoder that encodes the various parts of the human appearance into multi-scale feature maps. We construct this encoder by copying the network structure and pre-trained weights of the denoising U-Net. The features extracted from the reference images, together with their textual labels, are injected into the generation process layer by layer through a shared attention mechanism. To precisely select the specified parts from the reference images, we enhance the vanilla self-attention mechanism with the subject masks of the reference images. One U-Net block is illustrated on the right.
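
To make the mask-enhanced shared self-attention concrete, here is a minimal PyTorch sketch of one way to realize it. The tensor layout, the additive attention bias, and all names below are our assumptions rather than the paper's exact implementation.

import torch
import torch.nn.functional as F

def shared_self_attention(q, k, v, ref_k, ref_v, ref_mask, num_heads=8):
    # q, k, v:      (B, N, C) target tokens from a denoising U-Net block.
    # ref_k, ref_v: (B, M, C) reference tokens from the matching
    #               appearance-encoder block.
    # ref_mask:     (B, M) boolean, True where a reference token lies
    #               inside the specified subject region.
    B, N, C = q.shape

    # Target tokens attend jointly to themselves and to reference tokens.
    k_all = torch.cat([k, ref_k], dim=1)  # (B, N + M, C)
    v_all = torch.cat([v, ref_v], dim=1)

    # Additive bias: reference tokens outside the subject mask get -inf,
    # so only the specified parts can be selected from each reference.
    bias = q.new_zeros(B, 1, N, k_all.shape[1])
    bias[..., N:].masked_fill_(~ref_mask[:, None, None, :], float("-inf"))

    def heads(x):  # (B, L, C) -> (B, num_heads, L, C // num_heads)
        return x.view(B, -1, num_heads, C // num_heads).transpose(1, 2)

    out = F.scaled_dot_product_attention(
        heads(q), heads(k_all), heads(v_all), attn_mask=bias
    )
    return out.transpose(1, 2).reshape(B, N, C)

In this sketch, keys and values from the appearance encoder are simply concatenated with the target's own keys and values, while the mask bias excludes reference tokens outside the specified subject region.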

Customize Your Whole Body

Result 1

Parts2Whole can generate human images conditioned on parts selected from different people, for example, the face from person A, the hair or headwear from person B, the upper-body clothes from person C, and the lower-body clothes from person D.

Specify Any Part

Result 2

Parts2Whole supports generating human images from varying numbers of condition images, from a single hair or face input to arbitrary combinations such as "Face + Hair", "Face + Clothes", and "Upper body clothes + Lower body clothes"; a sketch of this input format follows.
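
Purely for illustration, the multi-reference input can be pictured as a labeled dictionary in which each entry pairs a source image with its subject mask; the file names and dictionary layout below are hypothetical, not the released interface.

from PIL import Image

# Hypothetical input specification: any subset of labeled parts may be
# provided, each taken from a different source image and paired with its
# subject mask (all paths are invented for illustration).
references = {
    "face":          (Image.open("person_a.png"), Image.open("mask_a_face.png")),
    "hair":          (Image.open("person_b.png"), Image.open("mask_b_hair.png")),
    "upper clothes": (Image.open("person_c.png"), Image.open("mask_c_upper.png")),
    "lower clothes": (Image.open("person_d.png"), Image.open("mask_d_lower.png")),
}

# Dropping entries yields the smaller combinations mentioned above,
# e.g. {"face": ...} alone, or {"face": ..., "hair": ...}.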

BibTeX

@article{fan2025parts2whole,
  title={Parts2Whole: Generalizable Multi-Part Portrait Customization},
  author={Fan, Hongxing and Huang, Zehuan and Wang, Lipeng and Chen, Haohua and Yin, Li and Sheng, Lu},
  journal={IEEE Transactions on Image Processing},
  year={2025},
  publisher={IEEE}
}