
Abstract

Audio-driven talking head generation holds significant potential for film production. While existing 3D methods have advanced motion modeling and content synthesis, they often produce rendering artifacts, such as motion blur, temporal jitter, and local penetration, due to limitations in representing stable, fine-grained motion fields. Through systematic analysis, we reformulate talking head generation into a unified framework comprising three steps: video preprocessing, motion representation, and rendering reconstruction. This framework underpins our proposed M2DAO-Talker, which addresses current limitations via multi-granular motion decoupling and alternating optimization. Specifically, we devise a novel 2D portrait preprocessing pipeline that extracts frame-wise deformation control conditions (motion region segmentation masks and camera parameters) to facilitate motion representation. To improve motion modeling, we develop a multi-granular motion decoupling strategy that independently models non-rigid (oral and facial) and rigid (head) motions for improved reconstruction accuracy. A motion consistency constraint enforces head-torso kinematic consistency, thereby mitigating penetration artifacts caused by motion aliasing. In addition, an alternating optimization strategy iteratively refines facial and oral motion parameters, enabling more realistic video generation. Experiments across multiple datasets show that M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness over TalkingGaussian, while running at an inference speed of 150 FPS.
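To put the reported PSNR gain in perspective, the short derivation below uses the standard PSNR definition. The peak value \(\mathrm{MAX}\) and the assumption that both methods are evaluated with the same peak value and over the same pixels are ours; the exact evaluation protocol is not stated on this page.

\[
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right),
\qquad
\Delta\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MSE}_{\text{baseline}}}{\mathrm{MSE}_{\text{ours}}}\right).
\]

Under these assumptions, a gain of 2.43 dB gives

\[
\frac{\mathrm{MSE}_{\text{baseline}}}{\mathrm{MSE}_{\text{ours}}} = 10^{2.43/10} \approx 1.75,
\]

that is, roughly a 43% reduction in mean squared error relative to the baseline.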

Overall Pipeline

The M2DAO-Talker pipeline comprises three stages. i) Video Preprocessing: key features are extracted from the input video, including the portrait image \(I\), mouth movement feature \(a\), semantic mask \(M\), background image \(I_{\text{bg}}\), facial expression feature \(e\), and camera projection parameters. ii) Motion Representation: head motion is decomposed into three components: head rotation, facial expressions, and oral movements. Head rotation is parameterized by a scaling matrix \(T\), a rotation matrix \(R\), and a focal length \(F\), while facial expressions and oral movements are modeled by two separate motion branches based on 3D Gaussian primitives. Region-specific supervision is applied to each branch using \(I\odot M\) as a localized training signal, and facial deformations are regularized to remain coherent with torso motion. iii) Rendering Reconstruction: alternating optimization cyclically updates the parameters of the Face Branch and the Inside Mouth Branch in turn, using the full portrait \(I\) as ground truth.
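The region-specific supervision and the alternating update schedule described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the 3D Gaussian renderer is replaced by a toy compositing function, each "branch" is just a per-pixel intensity map, and the names `masked_mse`, `compose`, `alternating_fit`, `period`, and `lr` are hypothetical choices made for illustration.

```python
import numpy as np


def masked_mse(pred, target, mask):
    """Mean squared error over pixels where mask == 1, i.e. a loss on I ⊙ M."""
    diff = (pred - target) * mask
    return float((diff ** 2).sum() / max(mask.sum(), 1.0))


def compose(face, mouth, m_face, m_mouth):
    """Toy stand-in for rendering: paste each branch's output into its region."""
    return face * m_face + mouth * m_mouth


def alternating_fit(target, m_face, m_mouth, steps=200, period=10, lr=0.5):
    """Alternate gradient steps between two branches against the full image.

    For `period` steps only the face branch is updated while the mouth branch
    is frozen, then the roles swap. `period` and `lr` are assumed values.
    """
    face = np.zeros_like(target)
    mouth = np.zeros_like(target)
    for step in range(steps):
        pred = compose(face, mouth, m_face, m_mouth)
        residual = pred - target  # gradient of 0.5 * ||pred - target||^2 w.r.t. pred
        if (step // period) % 2 == 0:
            face -= lr * residual * m_face    # update face branch only
        else:
            mouth -= lr * residual * m_mouth  # update mouth branch only
    return face, mouth


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((8, 8))          # stands in for the portrait I
    m_mouth = np.zeros((8, 8))
    m_mouth[5:7, 3:5] = 1.0             # hypothetical inside-mouth region
    m_face = 1.0 - m_mouth              # remaining pixels belong to the face
    face, mouth = alternating_fit(image, m_face, m_mouth)
    recon = compose(face, mouth, m_face, m_mouth)
    print("face-region error :", masked_mse(recon, image, m_face))
    print("mouth-region error:", masked_mse(recon, image, m_mouth))
```

In this toy setting the two regions are disjoint, so either update order converges; the point of the sketch is only the structure of the schedule, in which one branch is frozen while the other is refined against the same full-image target.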

Self-Reconstruction Comparison

Comparison with NeRF

From left to right, the results are Ground Truth, SyncTalk (CVPR 2024), and Ours.

Comparison with 3DGS

From left to right, the results are Ground Truth, GaussianTalker (ACM MM 2024), TalkingGaussian (ECCV 2024), and Ours.

Lip-Synchronization Comparison

Comparison with 3D

From left to right, the results are SyncTalk (CVPR 2024), GaussianTalker (ACM MM 2024), TalkingGaussian (ECCV 2024), and Ours.

Comparison with 2D

From left to right, the results are DINet (AAAI 2023), IP_LAP (CVPR 2023), TalkLip (CVPR 2023), and Ours.

More Live Show Results

From left to right, the results are Ground Truth, SyncTalk (CVPR 2024), TalkingGaussian (ECCV 2024), and Ours.

M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation