This paper aims to bring fine-grained expression control to portrait generation while maintaining high-fidelity identity. This is challenging due to the mutual interference between expression and identity. On one hand, fine-grained expression control signals inevitably introduce appearance-related semantics (e.g., facial contours and proportions), which affect the identity of the generated portrait. On the other hand, even coarse-grained expression control can cause facial changes that compromise identity, since both act directly on the facial region. These limitations remain unaddressed by previous generation methods, which rely primarily on coarse control signals or two-stage inference that integrates portrait animation. Here, we introduce EmojiDiff, the first end-to-end solution that enables simultaneous control of extremely detailed, RGB-level expression and high-fidelity identity in portrait generation. To address the above challenges, EmojiDiff adopts a two-stage scheme of decoupled training followed by fine-tuning. For decoupled training, we propose ID-irrelevant Data Iteration (IDI) to synthesize cross-identity expression pairs by manipulating expression and identity separately, yielding stable, high-quality data generation. Training the model on this data, we effectively disentangle the fine expression features in the expression template from other extraneous information (e.g., identity, skin). Subsequently, we present ID-enhanced Contrast Alignment (ICA) for further fine-tuning. ICA achieves rapid reconstruction and joint supervision of identity and expression information, thereby aligning the identity representations of images generated with and without expression control. Experimental results demonstrate that our method substantially outperforms competing approaches, achieves precise expression control with well-preserved identity, and generalizes well to various diffusion models.
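To make the joint supervision in ICA more concrete, below is a minimal sketch (PyTorch-style Python) of an objective that aligns the identity representation of an expression-controlled generation with that of an uncontrolled one while supervising expression. The encoder interfaces (`id_encoder`, `expr_encoder`), the loss forms, and the weights are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def ica_losses(img_ctrl, img_free, expr_template,
               id_encoder, expr_encoder,
               w_align=1.0, w_expr=1.0):
    """Hypothetical joint objective in the spirit of ICA (not the official loss).

    img_ctrl:      image generated *with* expression control
    img_free:      image generated *without* expression control (same prompt/identity)
    expr_template: RGB expression template used as the control signal
    """
    # Identity alignment: the controlled and control-free generations should
    # share the same identity embedding.
    id_ctrl = id_encoder(img_ctrl)
    id_free = id_encoder(img_free)
    align_loss = 1.0 - F.cosine_similarity(id_ctrl, id_free, dim=-1).mean()

    # Expression supervision: the controlled generation should match the
    # expression template in expression-embedding space.
    expr_loss = F.mse_loss(expr_encoder(img_ctrl), expr_encoder(expr_template))

    return w_align * align_loss + w_expr * expr_loss
```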
To integrate RGB-driven expression control into diffusion models, we synthesize cross-identity data for decoupled training of the model and mitigate the negative impact on the original identity through contrastive-alignment fine-tuning. Before decoupled training, the base expression controller (i.e., the Base E-Adapter) is trained on same-identity triplet data. Next, the trained Base E-Adapter and FaceFusion are used to alter the identity of portraits while keeping their expressions unchanged, thereby creating cross-identity expression pairs. The Refined E-Adapter is then trained on this newly synthesized data in a disentangled manner, enabling dual control of identity and expression without ID leakage. Finally, the Refined E-Adapter is fine-tuned with expression and identity losses based on ANI.
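The cross-identity data synthesis above can be summarized as a simple loop. The sketch below assumes hypothetical interfaces for the Base E-Adapter, FaceFusion, and a reference-identity sampler; it only illustrates how expression templates get paired with identity-swapped targets, not the actual pipeline.

```python
def synthesize_cross_identity_pairs(portraits, base_e_adapter, face_fusion,
                                    sample_reference_identity):
    """Hypothetical sketch: build (expression template, cross-identity target) pairs."""
    pairs = []
    for portrait in portraits:
        # Regenerate the portrait using its own expression as the RGB control template.
        regenerated = base_e_adapter.generate(expression_template=portrait)
        # Pick a different identity and swap it in while keeping the expression fixed.
        reference = sample_reference_identity(exclude=portrait)
        swapped = face_fusion(source_identity=reference, target=regenerated)
        # The original portrait supplies the expression; the swapped image supplies
        # a target with a different identity but the same expression.
        pairs.append((portrait, swapped))
    return pairs
```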
@article{jiang2024emojidiff,
title={EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation},
author={Liangwei Jiang and Ruida Li and Zhifeng Zhang and Shuo Fang and Chenguang Ma},
journal={arXiv preprint arXiv:2412.01254},
year={2024},
}