Wenkun He1, 2 , Yun Liu1, 2, 3 , Ruitao Liu1 , Li Yi†, 1, 2, 3
1 Tsinghua University, 2 Shanghai Qi Zhi Institute, 3 Shanghai Artificial Intelligence Laboratory
Synthesizing realistic human-object interaction motions is a critical problem in VR/AR and human animation. Unlike the commonly studied scenarios involving a single human or hand interacting with one object, we address a more generic multi-body setting with arbitrary numbers of humans, hands, and objects. This complexity introduces significant challenges in synchronizing motions due to the high correlations and mutual influences among bodies. To address these challenges, we introduce SyncDiff, a novel method for multi-body interaction synthesis using a synchronized motion diffusion strategy. SyncDiff employs a single diffusion model to capture the joint distribution of multi-body motions. To enhance motion fidelity, we propose a frequency-domain motion decomposition scheme. Additionally, we introduce a new set of alignment scores to emphasize the synchronization of different body motions. SyncDiff jointly optimizes both data sample likelihood and alignment likelihood through an explicit synchronization strategy. Extensive experiments across four datasets with various multi-body configurations demonstrate the superiority of SyncDiff over existing state-of-the-art motion synthesis methods.
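The abstract's frequency-domain motion decomposition could, for illustration, be sketched as a temporal DCT band split of per-joint trajectories into a coarse (low-frequency) component and a fine-grained (high-frequency) residual. This is a minimal sketch under our own assumptions (band count, function names, and the DCT choice are not taken from the paper):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of shape (n, n)."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # time index
    M = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    M[0] /= np.sqrt(2.0)        # scale the DC row so M is orthonormal
    return M

def decompose_motion(motion, num_low=8):
    """Split a motion trajectory (T frames x D dims) into low- and
    high-frequency parts via a temporal DCT (illustrative sketch)."""
    T = motion.shape[0]
    M = dct_matrix(T)
    coeffs = M @ motion                 # temporal frequency coefficients
    low_coeffs = np.zeros_like(coeffs)
    low_coeffs[:num_low] = coeffs[:num_low]   # keep the coarse bands
    low_motion = M.T @ low_coeffs             # inverse DCT (M is orthonormal)
    high_motion = motion - low_motion         # fine-grained residual
    return low_motion, high_motion

# Example: a noisy circular trajectory for one joint over 64 frames
t = np.linspace(0, 2 * np.pi, 64)
motion = np.stack([np.sin(t), np.cos(t)], axis=1) + 0.05 * np.random.randn(64, 2)
low, high = decompose_motion(motion)
# The two components sum back to the original motion exactly
assert np.allclose(low + high, motion)
```

A diffusion model can then operate on the two bands separately, so that coarse trajectory shape and fine pose detail are denoised at appropriate scales.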
We will present qualitative results in the order of TACO, CORE4D, OAKINK2, GRAB, and Failure Cases. The statistics are as follows:
TACO: 30 samples with comparison to baselines or ablation study results, 24 single samples of our method (Gallery). 54 samples in total.
CORE4D: 28 samples with comparison to baselines or ablation study results, 24 single samples of our method (Gallery). 52 samples in total.
OAKINK2: 18 samples with comparison to baselines or ablation study results, 24 single samples of our method (Gallery). 42 samples in total.
GRAB: 4 samples with comparison to baselines (MACS, DiffH2O), 8 samples with two different synthesized interactions (Gallery). 12 samples in total.
Failure Cases: 6 samples, 3 from CORE4D and 3 from TACO, as stated in Section F of the supplementary document.
Removing explicit synchronization (w/o exp sync) or the alignment loss (w/o align loss) leads to contact loss, desynchronized motions, or abnormal shaking.
Among the three baselines, OMOMO, with its stage-wise diffusion, better preserves hand-object alignment, but still falls short of our synchronization strategies. These methods often fail to complete tasks because they do not model the joint distribution of all bodies and optimize it with a single diffusion model. As a result, object trajectories can be suboptimal, which in turn prevents effective collaboration between the two humans.
w/o decompose induces unnatural walking poses (e.g., sliding on the ground) and occasionally unnatural joint rotations.
Synchronization mechanisms still play a significant role in human body-object interaction synthesis.
OAKINK2 places higher demands on fine-grained motion control, which requires the combination of our two synchronization mechanisms.
Post-grasp setting. GRAB has slightly lower requirements for synchronization. DiffH2O performs better than MACS, but both are outperformed by our method, particularly in scenarios involving small objects or hard-to-grasp regions.
Here we synthesize complete motion sequences rather than only post-grasp ones. We sample two different trajectories from different initializations to demonstrate the diversity of our method.
Samples from CORE4D suffer from interference by irrelevant bodies, such as interruptions from a person who is about to join the task, or who has already completed it and is about to leave. Samples from TACO suffer from unstable object rotations.
Our code repository is at https://github.com/WenkunHe/SyncDiff. Code will be released soon.
Wenkun He: wenkunhe2003@hotmail.com
Li Yi: ericyi0124@gmail.com