EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation

Liangwei Jiang Ruida Li Zhifeng Zhang Shuo Fang Chenguang Ma

Terminal Technology Department, Alipay, Ant Group.

EmojiDiff is an end-to-end solution that combines fine-grained expression control with high-fidelity identity preservation and adapts readily to various diffusion models.

Abstract

This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions, and even when supplemented with control signals such as facial landmarks, they struggle to generate accurate and vivid expressions that follow user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution that enables simultaneous fine-grained control of expression and identity. Unlike conventional methods that rely on coarse control signals, our method directly accepts RGB expression images as input templates, providing extremely accurate and fine-grained expression control during the diffusion process. At its core, an innovative decoupled scheme disentangles expression features in the expression template from other extraneous information, such as identity, skin, and style. On one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation for filtering out identity information hidden in the expression templates. On the other hand, we carefully investigate the function of each network layer and select expression-sensitive layers for injecting reference expression features, effectively preventing style leakage from the expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on identity preservation. Experimental results demonstrate that our method remarkably outperforms its counterparts, achieves precise expression control while maintaining identity with high fidelity, and generalizes well to various diffusion models.
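To make the layer-selection idea concrete, below is a minimal sketch, under assumptions, of injecting reference expression features only into a chosen subset of expression-sensitive attention layers while leaving all other layers untouched. The class and argument names (ExpressionAdapter, attach_adapters, expr_dim, the layer naming) are illustrative, not the released implementation.

import torch
import torch.nn as nn


class ExpressionAdapter(nn.Module):
    """Residual cross-attention over reference expression tokens."""

    def __init__(self, hidden_dim: int, expr_dim: int, num_heads: int = 8):
        super().__init__()
        # Map expression tokens into the feature space of the host layer.
        self.proj = nn.Linear(expr_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Zero-initialized gate: the base diffusion model is unchanged until training moves it.
        self.scale = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states: torch.Tensor, expr_tokens: torch.Tensor) -> torch.Tensor:
        expr = self.proj(expr_tokens)                  # (B, N_expr, hidden_dim)
        out, _ = self.attn(hidden_states, expr, expr)  # queries from UNet features, keys/values from expression
        return hidden_states + self.scale * out        # residual injection


def attach_adapters(layer_dims: dict, sensitive_layers: set, expr_dim: int) -> nn.ModuleDict:
    """Build adapters only for layers judged expression-sensitive; others stay unmodified."""
    return nn.ModuleDict({
        name.replace(".", "__"): ExpressionAdapter(dim, expr_dim)  # ModuleDict keys cannot contain dots
        for name, dim in layer_dims.items()
        if name in sensitive_layers
    })

The zero-initialized gate is a common adapter-style choice: it guarantees the pretrained model's behavior is preserved at the start of training, while the restricted layer set is what blocks style and identity cues in the template from reaching the rest of the network.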

Method

To integrate expression control into diffusion models, EmojiDiff trains the model on cross-identity triplet data and mitigates the negative impact on the original structure through contrastive alignment. The method proceeds in four stages. First, the fundamental expression controller (the Base E-Adapter) is trained with same-identity triplet data. Next, the trained Base E-Adapter and FaceFusion are used to alter the identity of portraits while keeping their expressions consistent, thereby creating cross-identity expression pairs. Subsequently, the Refined E-Adapter is trained on the newly synthesized data, enabling dual control of identity and expression without identity leakage from the expression template. Finally, the Refined E-Adapter is fine-tuned with additional expression and identity losses (ID-enhanced Contrast Alignment), further minimizing its negative impact on identity preservation.
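As a concrete illustration of the final stage, here is a minimal sketch, under assumptions, of an ID-enhanced Contrast Alignment fine-tuning step that adds identity and expression terms to the standard denoising objective. The helper names (face_embedder, expr_embedder, the adapter_outputs keys) and the loss weights are hypothetical, not the paper's exact formulation.

import torch
import torch.nn.functional as F


def ica_finetune_step(
    adapter_outputs: dict,         # predictions made with the Refined E-Adapter enabled
    id_reference: torch.Tensor,    # face embedding of the identity reference image
    expr_reference: torch.Tensor,  # embedding of the RGB expression template
    face_embedder,                 # frozen face-recognition encoder
    expr_embedder,                 # frozen expression encoder
    lambda_id: float = 1.0,
    lambda_expr: float = 1.0,
) -> torch.Tensor:
    """Combine denoising, identity, and expression losses for one fine-tuning step."""
    # Standard diffusion reconstruction term on the predicted noise.
    loss_diff = F.mse_loss(adapter_outputs["noise_pred"], adapter_outputs["noise_gt"])

    # Identity term: the decoded image should stay close to the reference identity.
    id_pred = face_embedder(adapter_outputs["decoded_image"])
    loss_id = 1.0 - F.cosine_similarity(id_pred, id_reference, dim=-1).mean()

    # Expression term: the decoded image should match the template's expression.
    expr_pred = expr_embedder(adapter_outputs["decoded_image"])
    loss_expr = 1.0 - F.cosine_similarity(expr_pred, expr_reference, dim=-1).mean()

    return loss_diff + lambda_id * loss_id + lambda_expr * loss_expr

In such a setup the identity and expression encoders would stay frozen, so gradients from the extra terms only update the Refined E-Adapter weights rather than the base diffusion model.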



Comparison Results

Gallery

SD1.5-based Results

SDXL-based Results

Dataset

BibTeX

@article{jiang2024emojidiff,
  title={EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation},
  author={Jiang, Liangwei and Li, Ruida and Zhang, Zhifeng and Fang, Shuo and Ma, Chenguang},
  journal={arXiv preprint arXiv:2412.01254},
  year={2024}
}