SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis

Wenkun He, Yun Liu, Ruitao Liu, Li Yi

Tsinghua University

SyncDiff Demo

SyncDiff is a unified framework for synthesizing synchronized multi-body interaction motions involving any number of hands, humans, and rigid objects. SyncDiff introduces two novel multi-body motion synchronization mechanisms: alignment scores for training and an explicit synchronization strategy for inference. With these mechanisms, the synthesized results effectively avoid interpenetration, contact loss, and asynchronous human-object interactions across diverse scenarios, as shown in the figure above.

Abstract

Synthesizing realistic human-object interaction motions is a critical problem in VR/AR and human animation. Unlike the commonly studied scenarios involving a single human or hand interacting with one object, we address a more generic multi-body setting with arbitrary numbers of humans, hands, and objects. This complexity introduces two significant challenges due to the high correlations and mutual influences among bodies, to which we propose corresponding solutions. First, to satisfy the high demands for synchronization of different body motions, we mathematically derive a new set of alignment scores during the training process, and use maximum likelihood sampling on a dynamic graphical model for explicit synchronization during inference. Second, the high-frequency interactions between objects are often overshadowed by the large-scale low-frequency movements. To address this, we introduce frequency decomposition and explicitly represent high-frequency components in the frequency domain. Extensive experiments across five datasets with various multi-body configurations demonstrate the superiority of SyncDiff over existing state-of-the-art motion synthesis methods.
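For intuition, below is a minimal sketch of what the explicit synchronization step could look like at inference time, assuming the diffusion model yields several candidate samples per body and a pairwise alignment score between interacting bodies. The stand-in score function, the brute-force search, and all names here are illustrative assumptions for exposition, not the paper's derived alignment scores or sampling procedure.

```python
import itertools
import numpy as np

def alignment_score(motion_a, motion_b):
    # Stand-in pairwise log-score: rewards small relative velocity between
    # two (T, D) trajectories, i.e., motions that move consistently together.
    rel_velocity = np.diff(motion_a - motion_b, axis=0)
    return -float(np.mean(np.square(rel_velocity)))

def synchronize(candidates, edges):
    # candidates[i]: array of shape (K_i, T, D), K_i diffusion samples for body i.
    # edges: list of (i, j) body pairs that interact (the interaction graph).
    # Returns one motion per body, maximizing the summed pairwise scores.
    best_choice, best_score = None, -np.inf
    for choice in itertools.product(*(range(len(c)) for c in candidates)):
        score = sum(alignment_score(candidates[i][choice[i]],
                                    candidates[j][choice[j]])
                    for i, j in edges)
        if score > best_score:
            best_choice, best_score = choice, score
    return [candidates[i][k] for i, k in enumerate(best_choice)]
```

Brute force is exponential in the number of bodies; on a tree-structured interaction graph, exact maximum-likelihood selection can instead be done efficiently with max-product dynamic programming.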

Visual Results

We present qualitative results in the order TACO, CORE4D, OAKINK2, GRAB, and BEHAVE. The sample statistics are as follows:

TACO: 30 samples with comparison to baselines (MACS, DiffH2O) or ablation study results, 24 single samples of our method (Gallery). 54 samples in total.

CORE4D: 28 samples with comparison to baselines (OMOMO, CG-HOI) or ablation study results, 24 single samples of our method (Gallery). 52 samples in total.

OAKINK2: 18 samples with comparison to baselines (MACS, DiffH2O) or ablation study results, 24 single samples of our method (Gallery). 42 samples in total.

GRAB: 8 samples with comparison to baselines (MACS, DiffH2O), 4 samples with two different synthesized interactions (Gallery). 12 samples in total.

BEHAVE: 12 samples with comparison to baselines (OMOMO, CG-HOI).

TACO

Comparison to MACS & DiffH2O (12 Samples)

Comparison to w/o decompose (6 Samples)

w/o decompose causes over-smoothed trajectories, and the two objects often remain nearly static relative to each other. This is because the semantically meaningful high-frequency components need to be represented separately to avoid being overshadowed by large-scale movements, as the sketch below illustrates.
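As a concrete illustration, here is a minimal sketch of such a frequency decomposition, assuming a DCT-based split along the time axis; the transform choice and the cutoff `num_low` are our illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np
from scipy.fft import dct, idct

def decompose(motion, num_low=8):
    # Split a (T, D) trajectory along time into a low-frequency part
    # (large-scale movement) and a high-frequency part (fine interaction
    # details), keeping the latter in the frequency domain.
    coeffs = dct(motion, type=2, norm="ortho", axis=0)
    low_band, high_band = coeffs.copy(), coeffs.copy()
    low_band[num_low:] = 0.0    # keep only the coarse coefficients
    high_band[:num_low] = 0.0   # keep only the fine coefficients
    low_motion = idct(low_band, type=2, norm="ortho", axis=0)
    return low_motion, high_band

# The original motion is recovered as
# low_motion + idct(high_band, type=2, norm="ortho", axis=0).
```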

Comparison to w/o exp sync & w/o align loss (12 Samples)

Removing either mechanism (w/o exp sync or w/o align loss) leads to contact loss, desynchronized motions, or abnormal shaking.

Gallery (24 Samples)

CORE4D

Comparison to OMOMO & CG-HOI (12 Samples)

Both baselines roughly maintain hand-object alignment: OMOMO through its stagewise diffusion, and CG-HOI through cross-attention between bodies and contact maps. However, they still fall short of our synchronization strategies, especially when the timing of cooperation between the two individuals must be precisely orchestrated. This is because neither baseline jointly optimizes all bodies within a single diffusion model at both training and inference time, which can yield unfavorable object trajectories and, in turn, ineffective collaboration between the two humans.

Comparison to w/o decompose (6 Samples)

w/o decompose induces unnatural walking poses (e.g., feet sliding on the ground) and occasionally unnatural joint rotations.

Comparison to w/o exp sync & w/o align loss (4 Samples)

The synchronization mechanisms still play a significant role when full human bodies, rather than only hands, interact with objects.

Comparison to w/o exp sync (6 Samples)

Gallery (24 Samples)

OAKINK2

Comparison to MACS & DiffH2O (8 Samples)

Comparison to w/o exp sync & w/o align loss (4 Samples)

OAKINK2 places higher demands on fine-grained motion control, which requires the combination of both of our synchronization mechanisms.

Comparison to w/o exp sync (6 Samples)

Gallery (24 Samples)

GRAB

Comparison to MACS & DiffH2O (8 Samples)

These samples use the post-grasp setting. GRAB has slightly lower synchronization requirements than the other datasets. DiffH2O performs better than MACS, but both are outperformed by our method, particularly in scenarios involving small objects or tricky grasping areas.

Multiple Synthesized Results with One Condition (4 Samples)

Here SyncDiff synthesizes complete motion sequences rather than only post-grasp ones. We sample two different trajectories from different noise initializations to demonstrate the diversity of our method.

BEHAVE

Comparison to OMOMO & CG-HOI (12 Samples)

Because of the relatively simple one-human-one-object setting and the limited motion semantics (the samples mostly consist of basic actions such as picking up, putting down, and lateral or rotational movement), the gap between our method and the baselines is less pronounced than on the other four datasets. In the samples above, the advantages of our method mainly manifest as more complete motions, less interpenetration and contact loss, and more natural human postures.

Code

The code will be made public at an appropriate time (expected July 2025).

Contact Us

Wenkun He: wenkunhe2003@hotmail.com

Li Yi: ericyi0124@gmail.com