Zeyu Ling1,2, Bo Han1, Shiyang Li1, Hongdeng Shen3,2, Jikang Cheng3,2, Changqing Zou1,2†
1Zhejiang University  2Zhejiang Lab  3University of Chinese Academy of Sciences
1 The source code can be found at https://github.com/ZeyuLing/MotionLLaMA.git.
Abstract
This paper introduces MotionLLaMA, a unified framework for motion synthesis and comprehension, along with a novel full-body motion tokenizer called the HoMi Tokenizer. MotionLLaMA is developed based on three core principles. First, it establishes a powerful unified representation space through the HoMi Tokenizer. Using a single codebook, the HoMi Tokenizer in MotionLLaMA achieves reconstruction accuracy comparable to residual vector quantization tokenizers that use six codebooks, outperforming all existing single-codebook tokenizers. Second, MotionLLaMA integrates a large language model to tackle various motion-related tasks. This integration bridges various modalities, facilitating both comprehensive and intricate motion synthesis and comprehension. Third, MotionLLaMA introduces the MotionHub dataset, currently the most extensive multimodal, multitask motion dataset, which enables the fine-tuning of large language models. Extensive experimental results demonstrate that MotionLLaMA not only covers the widest range of motion-related tasks but also achieves state-of-the-art (SOTA) performance in motion completion, interaction dual-person text-to-motion, and all comprehension tasks, while reaching performance comparable to SOTA in the remaining tasks. The code and MotionHub dataset are publicly available1.
1 Introduction
Motion synthesis and comprehension play a crucial role in the fields of computer vision and computer graphics. These techniques are widely applied in virtual reality, animation production, and motion analysis. Methods based on deep learning have come to dominate these domains due to their powerful learning and generalization capabilities.
However, traditional approaches often treat motion synthesis and comprehension as two independent tasks. Motion synthesis typically relies on frameworks such as VAEs[5], GANs[43], and DDPMs[44] to generate human motions; these methods are usually restricted to single-person motions and lack detail in fine-grained parts such as the fingers and hands. In contrast, motion comprehension more commonly adopts a sequence-to-sequence paradigm to parse and interpret motion signals[20, 14, 18, 16].
With the rapid progress of Multimodal LLMs[1, 51], integrating motion synthesis and comprehension has emerged as a new research trend[23, 33, 63, 50]. Leveraging the scaling laws of LLMs allows for a deeper exploration of the interactions between multimodal signals and human motions, and the tokenizer-plus-autoregressive-model mechanism facilitates a unified framework capable of addressing various motion-centric tasks. Although these motion LLMs integrate multiple motion synthesis and comprehension tasks, they either struggle to adequately cover a broad range of motion-centric tasks or fail to achieve fine-grained motion synthesis[63, 23, 33].
In this paper, we propose a unified framework for fine-grained motion synthesis and comprehension. The framework includes a Holistic Motion (HoMi) Tokenizer and a LLaMA-3 model[13, 45] fine-tuned using LoRA[21]. The HoMi Tokenizer encodes full-body motion and spatial position information into discrete motion tokens, forming a motion vocabulary that serves as an extended lexicon for the LLaMA-3 model.
Fine-tuning a LLaMA-scale large language model requires a substantial amount of training data. Previous LLM-based motion generation models[23, 33, 63] were trained on relatively small datasets, limiting their success to models at the scale of Flan-T5[41, 8]. To address this data insufficiency, we consolidate numerous open-source motion datasets, conduct re-annotation, filtering, and standardization, and construct MotionHub, the most comprehensive multi-task human motion dataset to date, containing both single-person and dual-person motion data. Samples covering the different tasks in MotionHub are shown in Fig. 1.
The contributions of this work can be summarized as follows:
- •
We propose MotionLLaMA, which generates and understands both single-person and dual-person motions in a unified autoregressive paradigm and supports 9 different motion-related tasks, including 3 comprehension tasks and 6 generation tasks. MotionLLaMA achieves SOTA performance on motion completion, interaction dual-person text-to-motion, and all comprehension tasks, and reaches a level comparable to SOTA methods on the remaining tasks.
- •
We introduce the HoMi Tokenizer, a holistic motion VQ-VAE[47] built on a single codebook. Through separate encoding of the body and hands, codebook utilization analysis, and a bidirectional fast Fourier transform gating mechanism, it achieves reconstruction performance comparable to that of residual VQ-VAE[59] (RVQ) while using only a single codebook, and attains the best performance among all single-codebook models on the MotionHub dataset.
- •
We present the most extensive multimodal, multitask motion dataset to date (see Fig. 1 for reference), which includes 131,515 single-person motion entries paired with 269,880 single-person motion descriptions, and 21,021 dual-person motion entries paired with 62,791 dual-person motion descriptions. In total, 70.10 hours of motion data are accompanied by audio.
2 Related Work
2.1 Motion-Centric Cross-Modal Generation
Motion-centric cross-modal generation encompasses two core task families: motion synthesis and motion comprehension. Motion synthesis includes text-to-motion, music-to-dance, and speech-to-gesture, while motion comprehension involves motion-to-text and dance-to-music.
Text-to-motion aims to generate motions that align with semantic descriptions. Recent studies have primarily employed diffusion models[44, 6, 29] and autoregressive Transformer models[61, 19, 32] for this purpose. Music-to-dance focuses on generating dance motions that are consistent with the rhythm of the music. Unlike semantic signals, musical signals do not provide explicit descriptions, resulting in greater freedom[69]. From a modeling perspective, recent approaches have converged with text-to-motion methodologies, utilizing diffusion[46, 26, 2] and autoregressive models[42, 25, 24], although they adopt different logic for integrating conditional signals. Speech-to-gesture emphasizes generating facial and hand motions that correspond to speech. Compared to the aforementioned motion generation tasks, it requires more refined modeling of hand and facial movements, often separating these parts from the torso for more precise modeling[56, 31, 30]. Although these works have achieved promising results, they are often limited to a single task.
In dance-to-music tasks, music is typically represented as discrete tokens, obtained either from the inherently discrete notation of symbolic music or by discretizing waveform music. Foley Music[14] focuses on generating symbolic music that aligns with performance motions, while Dance2MIDI[20] emphasizes dance motion, generating rhythm first and then conditionally generating notes for the other instruments. Taking video as input, CMT[12] establishes three types of relationships between video and music, whereas D2M-GAN[70] introduces a GAN-based approach for generating waveform music tokens. Motion-to-text tasks similarly cast generation as a sequence-to-sequence translation problem[18, 16, 52, 40].
2.2 Multimodal LLMs
Large Language Models (LLMs) are advanced pre-trained language models, such as GPT-3[4], GPT-4[1], PaLM[7], and LLaMA[45], developed through extensive training on massive text datasets. By scaling both the dataset size and model architecture, LLMs exhibit remarkable emergent abilities, including instruction following[37], In-Context Learning (ICL)[4], and Chain-of-Thought (CoT) reasoning[49]. Multimodal LLMs (MLLMs) extend these capabilities by integrating and processing information from multiple modalities. Since the release of GPT-4o[1], research on MLLMs has intensified, focusing initially on generating text from image[3], video[60], and audio inputs[10]. Subsequent studies have broadened the scope to enhance support for diverse input and output modalities[64, 35]. Notably, NExT-GPT[51] utilizes multimodal adaptors and diffusion decoders, enabling it to perform tasks involving arbitrary combinations of text, images, videos, and audio.
2.3 Motion LLMs
Recently, several studies have attempted to apply MLLMs to the domain of human motion, aiming to address various motion-centric generative tasks. MotionGPT[23] is the first to integrate the T5 model[41] into the motion domain, utilizing a motion tokenizer to enable mutual transformation between motion and text. LMM[63], built upon FineMoGen[62], introduces an MoE architecture for motion generation, while UDE-2[68] constructs a T5-like architecture for condition-to-motion transformation. Similarly, M3GPT[33] adopts T5 as its language-model backbone and pairs it with conditional modality tokenizers to establish a motion-focused LLM. UniMuMo[55] concentrates primarily on dance movements and additionally supports conversion between music and text. While these methods are capable of handling various motion-centric generative tasks, their models remain at the million-parameter scale. The concurrent work MotionAgent[50] reaches a billion-parameter scale but struggles with fine-grained motion synthesis, particularly in generating detailed hand motions. In contrast, we propose a billion-parameter motion LLM based on LLaMA[13], which more effectively harnesses the potential of LLMs and covers the most comprehensive set of generative tasks, as detailed in Tab. 1.
3 Method
The MotionLLaMA pipeline is depicted in Fig. 2. We employ multimodal tokenizers to transform three distinct modalities (text, audio, and motion) into discrete tokens within a unified multimodal vocabulary. For textual signals, we directly leverage the pre-trained LLaMA-3 tokenizer. For motion signals, we introduce a novel Holistic Motion Tokenizer based on VQ-VAE, specifically designed to encode full-body motion, encompassing both body and hand components. Additionally, we utilize WavTokenizer[22] to perform audio tokenization. The resulting sequence of token indices is then fed into the causal language model (LLaMA-3.2-Instruct), which is trained and performs inference in an autoregressive manner.
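To make the pipeline concrete, the sketch below shows one plausible way the outputs of the three tokenizers could be merged into a single index sequence for the causal language model; the vocabulary layout, offsets, and function names are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of the unified token pipeline (offsets and names are assumptions).
import torch

TEXT_VOCAB = 128_256          # LLaMA-3.2 text vocabulary (Sec. 4.1)
AUDIO_CODES = 4_096           # WavTokenizer codebook size
MOTION_CODES = 2_048          # HoMi codebook size

AUDIO_OFFSET = TEXT_VOCAB                 # audio codes appended after text ids
MOTION_OFFSET = TEXT_VOCAB + AUDIO_CODES  # motion codes appended after audio ids

def to_unified_ids(text_ids, audio_codes, motion_codes):
    """Map modality-specific indices into one shared vocabulary."""
    return torch.cat([
        text_ids,                          # already in [0, TEXT_VOCAB)
        audio_codes + AUDIO_OFFSET,        # shifted into the audio range
        motion_codes + MOTION_OFFSET,      # shifted into the motion range
    ])

# Example: a caption, one second of audio (40 tokens), and a short motion clip.
text_ids = torch.randint(0, TEXT_VOCAB, (12,))
audio_codes = torch.randint(0, AUDIO_CODES, (40,))
motion_codes = torch.randint(0, MOTION_CODES, (30,))
sequence = to_unified_ids(text_ids, audio_codes, motion_codes)
# `sequence` is what the causal LM consumes autoregressively.
```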
3.1 Holistic Motion Tokenizer
The discretization in raw VQ-VAEs can result in a loss of reconstruction accuracy. RVQ[59] addresses this by iteratively encoding the residuals across multiple quantization layers, thereby capturing finer details and enhancing reconstruction precision. Despite these advancements, RVQ increases computational complexity and complicates the sampling process during generation due to amplified dependencies between codebooks. These challenges highlight the need for more streamlined quantization methods that optimize generation performance while maintaining accuracy. Consequently, we introduce the Holistic Motion (HoMi) Tokenizer, illustrated in Fig. 3.
The tokenizer first applies the Fast Fourier Transform (FFT) along the temporal dimension of the input motion to convert it into the frequency domain, capturing global temporal frequency information. It then performs a second FFT along the feature channels, capturing cross-channel frequency correlations. This approach enables the model to understand complex frequency patterns inherent in motions. After the FFT operations, a gating function is applied to modulate the frequency-domain features. This mechanism selectively emphasizes important frequency components while suppressing less relevant ones, effectively allowing the network to focus on critical motion cues.
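The following is a minimal sketch of this bidirectional FFT gating idea, assuming a single linear projection over the magnitude spectrum produces the gate; the exact layer layout inside the HoMi Tokenizer is not specified here.

```python
# Illustrative re-implementation of frequency-aware gating (an assumption, not the released code).
import torch
import torch.nn as nn

class FrequencyAwareGating(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels)
        freq_t = torch.fft.fft(x, dim=1)        # FFT along the temporal dimension
        freq_tc = torch.fft.fft(freq_t, dim=2)  # second FFT along the feature channels
        # Use the magnitude spectrum to decide which components to emphasize.
        gate = torch.sigmoid(self.proj(freq_tc.abs()))
        return x * gate                          # emphasize / suppress motion features

x = torch.randn(2, 64, 768)
y = FrequencyAwareGating(768)(x)   # same shape as the input
```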
Because hand movements are often largely independent of torso movements, and because the two differ significantly in amplitude and frequency, it is necessary to encode the hands and torso separately. Unlike works such as HumanTomato[32] and EMAGE[31], we do not use additional codebooks to discretize the hands and torso separately. Instead, we use independent encoders, $\mathcal{E}_h$ for the hands and $\mathcal{E}_b$ for the torso, to encode them individually, and then fuse the representations using a small MLP to obtain the full-body motion token. The encoding process can be expressed as:
$z = \operatorname{MLP}\left(\mathcal{E}_h(m_h) \oplus \mathcal{E}_b(m_b)\right)$ (1)
The operation $\oplus$ represents concatenation along the channel dimension, and $m_h$ and $m_b$ denote the hand and torso parts of the input motion, respectively.
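A minimal sketch of the split encoding in Eq. (1) is given below; the channel split between hands and torso, the convolutional encoder depth, and the fusion width are simplifying assumptions, and the quantization step is omitted.

```python
# Sketch of the hand/torso split encoding of Eq. (1); dimensions are illustrative.
import torch
import torch.nn as nn

class HoMiEncoder(nn.Module):
    def __init__(self, body_dim: int, hand_dim: int, width: int = 768):
        super().__init__()
        # Independent temporal encoders (stride 4 matches the downsampling rate in Sec. 4.1).
        self.enc_body = nn.Conv1d(body_dim, width, kernel_size=4, stride=4)
        self.enc_hand = nn.Conv1d(hand_dim, width, kernel_size=4, stride=4)
        self.fuse = nn.Sequential(nn.Linear(2 * width, 2 * width), nn.ReLU(),
                                  nn.Linear(2 * width, 2 * width))

    def forward(self, body, hand):
        # body: (B, T, body_dim), hand: (B, T, hand_dim)
        zb = self.enc_body(body.transpose(1, 2))        # (B, width, T/4)
        zh = self.enc_hand(hand.transpose(1, 2))        # (B, width, T/4)
        z = torch.cat([zb, zh], dim=1).transpose(1, 2)  # concatenate along channels
        return self.fuse(z)                             # fused pre-quantization feature

body = torch.randn(1, 64, 22 * 3)   # torso joints (illustrative split)
hand = torch.randn(1, 64, 30 * 3)   # finger joints
z = HoMiEncoder(22 * 3, 30 * 3)(body, hand)   # (1, 16, 1536); then quantized against the codebook
```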
To determine the optimal codebook size for this tokenizer, we first trained it with an extensive codebook. Upon completion of training, we tokenized the entire dataset and visualized the utilization rate of each code, as illustrated in Fig. 4. The analysis revealed that usage was highly concentrated in a small subset of the codes. Consequently, we concluded that a codebook size of 2,048 is most appropriate, outperforming the smaller size used in previous settings by providing sufficient representational capacity without introducing redundancy.
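The utilization analysis itself reduces to a code-frequency histogram over the tokenized dataset, as in the sketch below; the `tokenize` call stands in for the trained HoMi encoder and is hypothetical.

```python
# Sketch of the codebook-utilization analysis used to pick the codebook size.
import torch

def code_usage_histogram(all_token_ids: torch.Tensor, codebook_size: int) -> torch.Tensor:
    """Count how often each code index appears over the whole dataset."""
    counts = torch.bincount(all_token_ids.flatten(), minlength=codebook_size)
    return counts.float() / counts.sum()

# token_ids = torch.cat([tokenize(m) for m in dataset])   # hypothetical tokenization pass
token_ids = torch.randint(0, 8192, (100_000,))            # stand-in data for the example
usage = code_usage_histogram(token_ids, 8192)
active = (usage > 1e-5).sum()        # number of codes that are actually used
```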
3.2 Audio Tokenizer with Single Codebook
Previous audio codec approaches were difficult to apply to multimodal large language model training due to their low compression rates[66] and excessive codebook hierarchies[11]. We adopt WavTokenizer[22], which is a discrete acoustic codec model based on the VQ-GAN framework, featuring a convolutional encoder, a single-layer quantizer, and an enhanced decoder with inverse Fourier transform. It achieves extreme compression by representing one second of 24 kHz audio with only 40 or 75 tokens, significantly fewer than previous models[11, 9, 54].
3.3 Training Strategy
After completing the training of the multimodal tokenizers, all tokenizers are subsequently frozen. Each code within the codebooks of the motion and audio tokenizers is encapsulated as a specialized token in the format <|[MODAL]_[IDX]|> and incorporated into the vocabulary of the language model. For each specific modality sub-sequence, we use modality-specific start and end tokens to delineate different data types, enabling the model to distinguish and process each modality accurately. This clear separation enhances multi-task learning by allowing the model to specialize in handling diverse inputs while facilitating effective parameter sharing. Consequently, the approach improves the model’s ability to manage multiple motion-related tasks with greater accuracy and coherence.
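A sketch of how such special tokens might be constructed and appended to the language-model vocabulary is shown below; the token spellings follow the <|[MODAL]_[IDX]|> format described above, while the tokenizer calls follow the Hugging Face convention and are assumptions about the implementation rather than the released code.

```python
# Sketch of extending the LLM vocabulary with motion/audio codes and boundary tokens.
MOTION_CODES, AUDIO_CODES = 2048, 4096

new_tokens = (
    [f"<|MOTION_{i}|>" for i in range(MOTION_CODES)]
    + [f"<|AUDIO_{i}|>" for i in range(AUDIO_CODES)]
    + ["<|motion_start|>", "<|motion_end|>", "<|audio_start|>", "<|audio_end|>"]
)
# tokenizer.add_tokens(new_tokens)                 # extend the LLaMA tokenizer (HF-style call)
# model.resize_token_embeddings(len(tokenizer))    # grow the embedding table accordingly

def wrap_motion(codes):
    """Delimit a motion sub-sequence with modality-specific boundary tokens."""
    return "<|motion_start|>" + "".join(f"<|MOTION_{c}|>" for c in codes) + "<|motion_end|>"

print(wrap_motion([5, 17, 1023]))
```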
The training of MotionLLaMA is divided into two stages: 1) unconditional pre-training, and 2) instruction multi-task tuning. During the pre-training stage, single-person and multi-person motion and audio data are fed into MotionLLaMA for autoregressive learning. The primary objectives of this phase are, first, to enable MotionLLaMA to recognize tokens from multiple modalities and to optimize the new parameters introduced in the token embedding layer, and second, to familiarize the model with the fundamental structure of multimodal tokens, wherein each type of modality data is encapsulated with specific start and end delimiters. During this phase, we employ LoRA[21] to optimize the decoder layers of MotionLLaMA while simultaneously fine-tuning the weights of the newly introduced token embeddings. Concurrently, we freeze the weights of the original text token embeddings so that MotionLLaMA retains its inherent text comprehension and generation capabilities to the fullest extent.
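One common way to train only the newly added embedding rows while freezing the original text embeddings is a gradient mask on the embedding weight; the sketch below assumes the new tokens occupy the rows appended after the original vocabulary, and the paper does not specify the exact mechanism used.

```python
# Hedged sketch: zero out gradients for the original text-token rows of the embedding.
import torch

def freeze_original_rows(embedding: torch.nn.Embedding, num_original: int):
    mask = torch.zeros(embedding.num_embeddings, 1)
    mask[num_original:] = 1.0                      # only newly added rows receive gradients

    def hook(grad):
        return grad * mask.to(grad.device)

    embedding.weight.register_hook(hook)

# 128,256 text tokens + 2,048 motion codes + 4,096 audio codes + 16 special tokens (Sec. 4.1).
emb = torch.nn.Embedding(128_256 + 2_048 + 4_096 + 16, 2048)
freeze_original_rows(emb, num_original=128_256)
```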
During the instruction-tuning phase, we design approximately one hundred prompt templates covering a dozen motion-related training tasks and learn all tasks jointly, without fine-tuning on any single task. In this phase, we entirely freeze the token embedding layer and continue to optimize the decoder layers exclusively through LoRA[21]. This approach preserves the foundational text comprehension and generation capabilities of MotionLLaMA while refining the decoder layers to better handle the newly introduced multimodal tasks.
4 Experiments
4.1 Implementation Details
Multi-modal Tokenizers. We utilize the native text tokenizer from LLaMA 3.2 as the text tokenizer, configured with a codebook size of 128,256. The WavTokenizer-unify-large variant serves as the audio tokenizer for both music and speech, featuring a codebook size of 4,096, which enables the discretization of audio signals into audio tokens at 40 frames per second (FPS). The HoMi Tokenizer employs a codebook size of 2,048 and a code dimension of 1,536, with the hand and body branches comprising 768 channels each and a downsampling rate of 4. Additionally, we incorporate a total of 16 special tokens across all modalities to mark the beginning and end of each modality. For multi-person tasks as well as motion completion or in-between tasks, we use special symbols to mark the starting points of each individual or motion segment.
LLM. We initialize the model with the LLaMA 3.2 Instruct version and fine-tune it using LoRA[21]. The LoRA parameters are set with a rank of 64, an alpha of 128, and a dropout rate of 0.05, and LoRA is applied to all linear layers within the attention and feed-forward networks. The learning rate is kept uniform across both phases. In the pre-training and instruction fine-tuning phases, the model is trained for 500 epochs across all tasks. We trained two versions of the model: an expert model and a unified model. Each expert model was trained on a pair of related tasks (e.g., T2M and M2T share the same model), while the unified model was trained on all tasks, including several tasks not involved in the evaluation, such as unconditional generation. During inference, the temperature is set to 0.6 and top-p to 0.9.
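Expressed with the Hugging Face peft and transformers APIs, the LoRA and sampling settings above could look like the sketch below; the model identifier and the list of target modules (the standard LLaMA projection layers) are assumptions, not confirmed details of the released training script.

```python
# Sketch of the LoRA configuration and sampling parameters listed above.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")  # size is an assumption
lora_cfg = LoraConfig(
    r=64, lora_alpha=128, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],   # attention + FFN linear layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# Inference-time sampling configuration (temperature 0.6, top-p 0.9).
generation_kwargs = dict(do_sample=True, temperature=0.6, top_p=0.9)
```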
4.2 Evaluation Setup
We perform joint training across all tasks using the MotionHub dataset (detailed in Sec. A) and subsequently evaluate the model on the following tasks: T2M, M2T, M2D, D2M, S2G, IT2M, IM2T, Motion Inbetweening, and Motion Completion. The MotionHub dataset uniformly employs normalized global joint coordinates based on the SMPL-H skeletal topology to represent human motion.
Multiple evaluation metrics are applied in this study, each tailored to the specific requirements of the corresponding tasks (detailed in Sec. B).
- •
Text-to-Motion. Motion quality is assessed using FID, semantic alignment via Multimodal Distance and R-Precision, and diversity through the Diversity metric.
- •
Motion-to-Text. Evaluation, following [23, 18], includes Multimodal Distance, R-Precision, and standard NLP metrics (BLEU[36], ROUGE[28], CIDEr[48], and BERTScore[65]).
- •
Interaction Tasks. The evaluation metrics for Interaction Motion are identical to those used for single-person motions, with the sole difference being the use of a dual-person TMR[39], which we have trained as a feature extractor.
- •
Music-to-Dance. Following [33], we use Kinetic FID ($\mathrm{FID}_k$), Diversity ($\mathrm{Div}_k$), and the Beat Alignment Score as metrics. Previous works do not support full-body dance generation, so for a fair comparison we compute metrics using only the torso portion of the generated dance.
- •
Dance-to-Music. Following [58, 33], Beats Coverage Score (BCS), Beats Hit Score (BHS), and F1 Score assess rhythmic alignment between dance motion and generated music.
- •
Motion Completion. For Motion Inbetweening and Motion Prediction, we evaluate quality and variability with FID and Diversity, and accuracy with Average Displacement Error (ADE) and Final Displacement Error (FDE).
- •
Speech-to-Gesture. Following [57, 31], we use FGD (Fréchet Gesture Distance) for gesture quality, BA (Beat Alignment) for rhythmic consistency, and L1Div for gesture diversity. Unlike dance-music alignment, a BA closer to the ground truth is preferred over higher values.
- •
Motion Reconstruction. We evaluate reconstruction accuracy using Mean Per Joint Position Error (MPJPE), Normalized MPJPE (N-MPJPE), ADE, and FDE; a reference sketch of these displacement metrics follows this list.
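Below is a reference sketch of the displacement and reconstruction metrics, assuming joint positions shaped (T, J, 3); whether ADE and FDE are computed over the root trajectory or the full pose is not stated in the text, so the sketch uses the root joint as an assumption.

```python
# Reference sketch of MPJPE / ADE / FDE; not the official evaluation script.
import torch

def mpjpe(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean per-joint position error over all frames and joints."""
    return (pred - gt).norm(dim=-1).mean()

def ade(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Average displacement error of the root-joint trajectory (root assumed at index 0)."""
    return (pred[:, 0] - gt[:, 0]).norm(dim=-1).mean()

def fde(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Final-frame displacement error of the root joint."""
    return (pred[-1, 0] - gt[-1, 0]).norm()

pred, gt = torch.randn(120, 52, 3), torch.randn(120, 52, 3)
print(mpjpe(pred, gt), ade(pred, gt), fde(pred, gt))
```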
4.3 Text-Motion Bidirectional Association
We follow [39] to train the TMR motion-text alignment evaluation model on the MotionHub dataset for benchmarking M2T and T2M tasks. We replicated previous state-of-the-art approaches for the M2T and T2M tasks on the MotionHub dataset to facilitate comparative analysis.
M2T & IM2T. Our results on M2T and IM2T are shown in Tab. 2. The upper part of the table presents the results for the M2T task, while the lower part shows the results for the IM2T task. Leveraging the powerful contextual correlation and text generation capabilities of LLMs, our approach demonstrates exceptional performance both in the accuracy of motion matching and in the quality of the generated text. In single-person M2T, we outperform previous SOTA methods across all metrics, with the only exception being a slightly lower BERTScore compared to MotionGPT. In dual-person M2T, MotionLLaMA accurately describes dual-person motions and outperforms all previous methods: our results significantly exceed theirs on the R-Precision metric, while also outperforming or matching them on the remaining evaluation metrics.
T2M & IT2M. The upper half of Tab. 3 presents the evaluation results for the T2M task, while the lower half displays those for the IT2M task. The dual-person interactive motions generated by MotionLLaMA surpass previous methods in diversity and in alignment with the motion descriptions.
4.4 Motion-to-Motion Tasks
Motion completion and motion in-betweening results are reported in Tab. 4. MotionLLaMA achieves state-of-the-art performance in both motion quality and displacement prediction accuracy for the motion in-betweening task, with displacement accuracy significantly surpassing that of previous methods.
4.5 Music-Dance Bidirectional Association
MotionHub integrates the FineDance[26] and AIST++[25] dance datasets. We tested the Music-to-Dance and Dance-to-Music tasks separately on both datasets.
M2D. As shown in Tab. 6, on the AIST++ dataset MotionLLaMA achieves better rhythmic alignment and diversity, although its $\mathrm{FID}_k$ score is slightly higher. In contrast, on the FineDance dataset MotionLLaMA achieves the best $\mathrm{FID}_k$ score while demonstrating comparable diversity and rhythmic alignment. Overall, MotionLLaMA reaches a level of performance on the music-to-dance task comparable to the previous state of the art.
D2M. Tab. 7 shows our results on the D2M task. Following previous works[33], we evaluate the quality of the generated background music using three metrics: Beats Coverage Score (BCS), Beats Hit Score (BHS), and F1 Score, which are detailed in Sec. B. Despite being trained on only approximately 2,000 music-dance pairs from the AIST++ and FineDance datasets, the background music generated by MotionLLaMA demonstrates exceptional rhythmic consistency with the dance movements, even achieving perfect scores on the BCS metric. The rhythmic alignment of the generated music significantly outperforms that of the other methods in the comparison, and MotionLLaMA achieves SOTA performance across all three metrics.
4.6 Speech-to-Gesture
Following [57], we trained a gesture variational autoencoder (VAE) to extract gesture feature vectors for evaluation metrics such as the Fréchet Gesture Distance (FGD). The evaluation results are presented in Tab. 8. The generated co-speech motion exhibits more realistic synchronization with the rhythm of speech than other methods, while the FGD and L1Div metrics are closely aligned with those of alternative approaches.
4.7 Reconstruction Quality Evaluation
As demonstrated in Tab. 5, the HoMi Tokenizer outperforms other methods in both body and displacement reconstruction quality, including RVQ, which utilizes six codebooks. Its hand reconstruction quality surpasses all single-codebook methods, though it is slightly inferior to that of RVQ. As mentioned in Sec. 3.1, we also compared against the RVQ + Residual Transformer combination and found that it cannot reach the precision of RVQ alone, validating the point raised in Sec. 3.1 that higher reconstruction accuracy from multiple codebooks does not necessarily translate into superior motion quality in generative scenarios.
4.8 Ablation Study
Effectiveness Analysis of HoMi Tokenizer Components. In this experiment, we evaluated the impact of integrating the separate encoders for the torso and hands (STH) and the Frequency-Aware Motion Gating (FAMG) structure on the reconstruction quality of the HoMi Tokenizer. As illustrated in Tab. 9, the HoMi Tokenizer achieves the highest motion reconstruction performance when both components are incorporated.
5 Conclusion
In summary, this paper presents MotionLLaMA, which unifies fine-grained motion synthesis and comprehension across diverse tasks. Through a new tokenizer called HoMi, MotionLLaMA establishes a high-precision, unified representation space, surpassing prior tokenizers in reconstruction accuracy and enabling seamless multimodal integration. Supported by the comprehensive MotionHub dataset, this model achieves state-of-the-art performance across an extensive range of motion-centric tasks. MotionLLaMA thus sets a new benchmark in motion modeling, offering a scalable solution for advancing multimodal motion synthesis and comprehension in computer vision and graphics.
References
- Achiam etal. [2023]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etal.Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
- Alexanderson etal. [2023]Simon Alexanderson, Rajmund Nagy, Jonas Beskow, and GustavEje Henter.Listen, denoise, action! audio-driven motion synthesis with diffusion models.TOG, 42(4):1–20, 2023.
- Awadalla etal. [2023]Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, PangWei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt.Openflamingo: An open-source framework for training large autoregressive vision-language models.arXiv preprint arXiv:2308.01390, 2023.
- Brown etal. [2020]Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, JaredD Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.Language models are few-shot learners.In Advances in Neural Information Processing Systems, pages 1877–1901, 2020.
- Cai etal. [2021]Yujun Cai, Yiwei Wang, Yiheng Zhu, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, Chuanxia Zheng, Sijie Yan, Henghui Ding, etal.A unified 3d human motion synthesis model via conditional variational auto-encoder.In ICCV, pages 11645–11655, 2021.
- Chen etal. [2023]Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu.Executing your commands via motion diffusion in latent space.In CVPR, pages 18000–18010, 2023.
- Chowdhery etal. [2023]Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, HyungWon Chung, Charles Sutton, Sebastian Gehrmann, etal.Palm: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240):1–113, 2023.
- Chung etal. [2024]HyungWon Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, etal.Scaling instruction-finetuned language models.Journal of Machine Learning Research, 25(70):1–53, 2024.
- Défossez etal. [2023]Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi.High fidelity neural audio compression.Transactions on Machine Learning Research, 2023.
- Deshmukh etal. [2023]Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang.Pengi: An audio language model for audio tasks.In NeurIPS, pages 18090–18108, 2023.
- Dhariwal etal. [2020]Prafulla Dhariwal, Heewoo Jun, Christine Payne, JongWook Kim, Alec Radford, and Ilya Sutskever.Jukebox: A generative model for music.arXiv preprint arXiv:2005.00341, 2020.
- Di etal. [2021]Shangzhe Di, Zeren Jiang, Si Liu, Zhaokai Wang, Leyan Zhu, Zexin He, Hongming Liu, and Shuicheng Yan.Video background music generation with controllable music transformer.In ACM Multimedia, pages 2037–2045, 2021.
- Dubey etal. [2024]Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, etal.The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024.
- Gan etal. [2020]Chuang Gan, Deng Huang, Peihao Chen, JoshuaB Tenenbaum, and Antonio Torralba.Foley music: Learning to generate music from videos.In ECCV, pages 758–775, 2020.
- Gopinath and Won [2020]Deepak Gopinath and Jungdam Won.Fairmotion-tools to load, process and visualize motion capture data.https://github.com/facebookresearch/fairmotion, 2020.
- Goutsu and Inamura [2021]Yusuke Goutsu and Tetsunari Inamura.Linguistic descriptions of human motion with generative adversarial seq2seq learning.In ICRA, pages 4281–4287. IEEE, 2021.
- Guo etal. [2022a]Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng.Generating diverse and natural 3d human motions from text.In CVPR, pages 5152–5161, 2022a.
- Guo etal. [2022b]Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng.Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts.In ECCV, pages 580–597. Springer, 2022b.
- Guo etal. [2024]Chuan Guo, Yuxuan Mu, MuhammadGohar Javed, Sen Wang, and Li Cheng.Momask: Generative masked modeling of 3d human motions.In CVPR, pages 1900–1910, 2024.
- Han etal. [2024]Bo Han, Yuheng Li, Yixuan Shen, Yi Ren, and Feilin Han.Dance2midi: Dance-driven multi-instrument music generation.Computational Visual Media, 10(4):791–802, 2024.
- Hu etal. [2021]EdwardJ Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, etal.Lora: Low-rank adaptation of large language models.In ICLR, 2021.
- Ji etal. [2024]Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, etal.Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling.arXiv preprint arXiv:2408.16532, 2024.
- Jiang etal. [2023]Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen.Motiongpt: Human motion as a foreign language.In NeurIPS, pages 20067–20079, 2023.
- Li etal. [2022]Buyu Li, Yongchi Zhao, Shi Zhelun, and Lu Sheng.Danceformer: Music conditioned 3d dance generation with parametric motion transformer.In AAAI, pages 1272–1279, 2022.
- Li etal. [2021]Ruilong Li, Shan Yang, DavidA Ross, and Angjoo Kanazawa.Ai choreographer: Music conditioned 3d dance generation with aist++.In ICCV, pages 13401–13412, 2021.
- Li etal. [2023]Ronghui Li, Junfan Zhao, Yachao Zhang, Mingyang Su, Zeping Ren, Han Zhang, Yansong Tang, and Xiu Li.Finedance: A fine-grained choreography dataset for 3d full body dance generation.In ICCV, pages 10234–10243, 2023.
- Liang etal. [2024]Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, and Lan Xu.Intergen: Diffusion-based multi-human motion generation under complex interactions.IJCV, pages 1–21, 2024.
- Lin [2004]Chin-Yew Lin.Rouge: A package for automatic evaluation of summaries.In Text summarization branches out, pages 74–81, 2004.
- Ling etal. [2024]Zeyu Ling, Bo Han, Yongkang Wong, Han Lin, Mohan Kankanhalli, and Weidong Geng.Mcm: Multi-condition motion synthesis framework.In IJCAI, pages 1083–1091, 2024.
- Liu etal. [2022]Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng.Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis.In ECCV, pages 612–630. Springer, 2022.
- Liu etal. [2024]Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and MichaelJ. Black.Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling.In CVPR, pages 1144–1154, 2024.
- Lu etal. [2024]Shunlin Lu, Ling-Hao Chen, Ailing Zeng, Jing Lin, Ruimao Zhang, Lei Zhang, and Heung-Yeung Shum.Humantomato: Text-aligned whole-body motion generation.In ICML, 2024.
- Luo etal. [2024]Mingshuang Luo, Ruibing Hou, Hong Chang, Zimo Liu, Yaowei Wang, and Shiguang Shan.An advanced multimodal, multitask framework for motion comprehension and generation.arXiv preprint arXiv:2405.16273, 2024.
- McFee etal. [2015]Brian McFee, Colin Raffel, Dawen Liang, DanielPW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto.librosa: Audio and music signal analysis in python.In SciPy, pages 18–24, 2015.
- Moon etal. [2024]Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu, etal.Anymal: An efficient and scalable any-modality augmented language model.In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1314–1332, 2024.
- Papineni etal. [2002]Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu.Bleu: a method for automatic evaluation of machine translation.In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
- Peng etal. [2023]Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao.Instruction tuning with gpt-4.arXiv preprint arXiv:2304.03277, 2023.
- Petrovich etal. [2022]Mathis Petrovich, MichaelJ. Black, and Gül Varol.TEMOS: Generating diverse human motions from textual descriptions.In European Conference on Computer Vision (ECCV), 2022.
- Petrovich etal. [2023]Mathis Petrovich, MichaelJ Black, and Gül Varol.Tmr: Text-to-motion retrieval using contrastive 3d human motion synthesis.In ICCV, pages 9488–9497, 2023.
- Plappert etal. [2018]Matthias Plappert, Christian Mandery, and Tamim Asfour.Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks.Robotics and Autonomous Systems, 109:13–26, 2018.
- Raffel etal. [2020]Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ Liu.Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020.
- Siyao etal. [2022]Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, ChenChange Loy, and Ziwei Liu.Bailando: 3d dance generation by actor-critic gpt with choreographic memory.In CVPR, pages 11050–11059, 2022.
- Sun etal. [2020]Guofei Sun, Yongkang Wong, Zhiyong Cheng, MohanS Kankanhalli, Weidong Geng, and Xiangdong Li.Deepdance: music-to-dance motion choreography with adversarial learning.IEEE Transactions on Multimedia, 23:497–509, 2020.
- Tevet etal. [2023]Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and AmitHaim Bermano.Human motion diffusion model.In ICLR, 2023.
- Touvron etal. [2023]Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, etal.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023.
- Tseng etal. [2023]Jonathan Tseng, Rodrigo Castellon, and Karen Liu.EDGE: Editable dance generation from music.In CVPR, pages 448–458, 2023.
- Van DenOord etal. [2017]Aaron Van DenOord, Oriol Vinyals, etal.Neural discrete representation learning.NeurIPS, 30, 2017.
- Vedantam etal. [2015]Ramakrishna Vedantam, C LawrenceZitnick, and Devi Parikh.Cider: Consensus-based image description evaluation.In CVPR, pages 4566–4575, 2015.
- Wei etal. [2024]Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, EdH. Chi, QuocV. Le, and Denny Zhou.Chain-of-thought prompting elicits reasoning in large language models.In NeurIPS, 2024.
- Wu etal. [2024]Qi Wu, Yubo Zhao, Yifan Wang, Yu-Wing Tai, and Chi-Keung Tang.Motionllm: Multimodal motion-language learning with large language models.arXiv preprint arXiv:2405.17013, 2024.
- Wu etal. [2023a]Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua.Next-gpt: Any-to-any multimodal llm.arXiv preprint arXiv:2309.05519, 2023a.
- Wu etal. [2023b]Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov.Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation.In ICASSP, pages 1–5. IEEE, 2023b.
- Xu etal. [2024]Liang Xu, Xintao Lv, Yichao Yan, Xin Jin, Shuwen Wu, Congsheng Xu, Yifan Liu, Yizhou Zhou, Fengyun Rao, Xingdong Sheng, etal.Inter-x: Towards versatile human-human interaction analysis.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22260–22271, 2024.
- Yang etal. [2023]Dongchao Yang, Songxiang Liu, Rongjie Huang, Jinchuan Tian, Chao Weng, and Yuexian Zou.Hifi-codec: Group-residual vector quantization for high fidelity audio codec.arXiv preprint arXiv:2305.02765, 2023.
- Yang etal. [2024]Han Yang, Kun Su, Yutong Zhang, Jiaben Chen, Kaizhi Qian, Gaowen Liu, and Chuang Gan.Unimumo: Unified text, music and motion generation.arXiv preprint arXiv:2410.04534, 2024.
- Yi etal. [2023]Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, and MichaelJ Black.Generating holistic 3d human motion from speech.In CVPR, pages 469–480, 2023.
- Yoon etal. [2020]Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee.Speech gesture generation from the trimodal context of text, audio, and speaker identity.TOG, 39(6):1–16, 2020.
- Yu etal. [2023]Jiashuo Yu, Yaohui Wang, Xinyuan Chen, Xiao Sun, and Yu Qiao.Long-term rhythmic video soundtracker.In ICML, pages 40339–40353. PMLR, 2023.
- Zeghidour etal. [2021]Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi.Soundstream: An end-to-end neural audio codec.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
- Zhang etal. [2023a]Hang Zhang, Xin Li, and Lidong Bing.Video-LLaMA: An instruction-tuned audio-visual language model for video understanding.In EMNLP 2023 Demo, pages 543–553, 2023a.
- Zhang etal. [2023b]Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen, and Ying Shan.Generating human motion from textual descriptions with discrete representations.In CVPR, pages 14730–14740, 2023b.
- Zhang etal. [2023c]Mingyuan Zhang, Huirong Li, Zhongang Cai, Jiawei Ren, Lei Yang, and Ziwei Liu.Finemogen: Fine-grained spatio-temporal motion generation and editing.NeurIPS, 2023c.
- Zhang etal. [2024a]Mingyuan Zhang, Daisheng Jin, Chenyang Gu, Fangzhou Hong, Zhongang Cai, Jingfang Huang, Chongzhi Zhang, Xinying Guo, Lei Yang, Ying He, etal.Large motion model for unified multi-modal motion generation.arXiv preprint arXiv:2404.01284, 2024a.
- Zhang etal. [2024b]Renrui Zhang, Jiaming Han, Chris Liu, Aojun Zhou, Pan Lu, Yu Qiao, Hongsheng Li, and Peng Gao.LLaMA-adapter: Efficient fine-tuning of large language models with zero-initialized attention.In ICLR, 2024b.
- Zhang* etal. [2020]Tianyi Zhang*, Varsha Kishore*, Felix Wu*, KilianQ. Weinberger, and Yoav Artzi.Bertscore: Evaluating text generation with bert.In International Conference on Learning Representations, 2020.
- Zhang etal. [2024c]Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu.Speechtokenizer: Unified speech tokenizer for speech language models.In ICLR, 2024c.
- Zheng etal. [2022]Chuanxia Zheng, Tung-Long Vuong, Jianfei Cai, and Dinh Phung.Movq: Modulating quantized vectors for high-fidelity image generation.NeurIPS, 35:23412–23425, 2022.
- Zhou etal. [2023]Zixiang Zhou, Yu Wan, and Baoyuan Wang.A unified framework for multimodal, multi-part human motion synthesis.arXiv preprint arXiv:2311.16471, 2023.
- Zhu etal. [2023a]Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu Zhang, Jiaxin Shi, Feng Gao, Qi Tian, and Yizhou Wang.Human motion generation: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023a.
- Zhu etal. [2022]Ye Zhu, Kyle Olszewski, Yu Wu, Panos Achlioptas, Menglei Chai, Yan Yan, and Sergey Tulyakov.Quantized gan for complex music generation from dance videos.In ECCV, pages 182–199. Springer, 2022.
- Zhu etal. [2023b]Ye Zhu, Yu Wu, Kyle Olszewski, Jian Ren, Sergey Tulyakov, and Yan Yan.Discrete contrastive diffusion for cross-modal music and image generation.In The Eleventh International Conference on Learning Representations, 2023b.
Supplementary Material
Appendix A MotionHub
In this section, we present the details of the MotionHub dataset, a novel large-scale motion dataset specifically curated for multi-task learning. MotionHub integrates multiple modalities—text, motion, and audio—offering a robust and diverse resource aimed at advancing research in multimodal human motion synthesis and comprehension.
We aggregated publicly available human motion datasets from diverse domains, covering everyday motions, dance movements, speech gestures, and dual-person interactive motions. Some datasets, particularly those relying on monocular vision-based motion capture, include samples with suboptimal acquisition quality. To enhance data integrity, each dataset was rendered into video format and manually filtered. All datasets utilized in MotionHub adhere to SMPL-series skeletal topologies (SMPL, SMPL-H, SMPL-X). Tab. 10 lists the public datasets incorporated into MotionHub, with MotionX encompassing several sub-datasets. MotionHub, constructed from these public datasets, is enriched and standardized through the following steps:
- 1.
Unified Coordinate System and Scale. Different motion datasets utilize distinct world coordinate systems and are recorded by different motion capture performers, resulting in significant discrepancies in data distribution that hinder the ability of neural networks to learn consistent patterns. To address this, we select one performer as the reference for all motion data. The position of the root joint in the first frame of this performer is designated as the origin of the y-axis, while the forward-facing direction, represented by the normal vector of the plane formed by the shoulder and hip joints, defines the positive x-axis. This approach not only standardizes the motion data but also preserves the relative positional relationships among multiple characters within the motions.
- 2.
Manual Annotation and Filter. For datasets lacking textual annotations and detailed labels, we rendered the motion data into action videos for manual annotation. Additionally, we required annotators to review and filter out the lower-quality motion data.
- 3.
Recaption and Label Sentence Construction. Some interaction motion datasets, such as InterX[53] and InterHuman[27], provide a general description of the interactions but lack individual descriptions. However, it is both reasonable and meaningful to decompose these motions into single-person motions. For these datasets, we utilized ChatGPT-4o to rewrite the descriptions of dual-person motions, resulting in corresponding single-person descriptions. Additionally, datasets like VirtualHuman, AIST++, and BEAT offer detailed action labels and relevant information, which we organized into coherent sentences using ChatGPT-4o to create textual descriptions.
- 4.
Text Normalization and Correction. We normalize simple, non-sentential annotation texts into complete sentences with subjects, predicates, and objects. Additionally, we address issues such as grammatical errors and ambiguous descriptions within the annotation texts to enhance clarity and coherence.
Appendix B Evaluation Metrics
In this section, we elucidate the computational methodologies for the evaluation metrics associated with each task.
Text-to-Motion. Following [17], we assess the quality of generated motions using the Fréchet Inception Distance (FID), evaluate the alignment between motions and textual descriptions with Multimodal Distance and R-Precision, and measure the diversity of the motions using the Diversity metric. FID quantifies the distance between the statistical distributions of the generated motions and the ground-truth motions. Multimodal Distance evaluates the distance between the feature vectors of the generated motions and their corresponding conditional captions. Conversely, R-Precision measures the proportion of generated motions within a batch of size $B$ whose feature vectors rank among the top $k$ closest to those of their respective conditional captions; the calculation can be expressed as follows:
$\text{R-Precision@}k = \frac{1}{B}\sum_{i=1}^{B}\mathbb{1}\!\left[\operatorname{rank}\!\big(d(f^{m}_{i}, f^{t}_{i})\big) \le k\right]$ (2)

where $f^{m}_{i}$ and $f^{t}_{i}$ are the extracted features of the $i$-th motion and its paired caption, $d(\cdot,\cdot)$ is the Euclidean distance, and the rank is computed against all $B$ captions in the batch.
Based on the aforementioned descriptions, the computation of these metrics relies on motion latents that effectively capture motion quality and align with textual descriptions. For single-person motion and text data, we follow [32, 39], employing contrastive learning techniques to align motion latents with text latents. In the case of two-person motions, we concatenate the two motions along the channel dimension to serve as the motion input to TMR, while using dual-person motion descriptions as the textual input. We trained the TMR model on the MotionHub dataset using single-person motions for 1,300 epochs and dual-person motions for 1,500 epochs, respectively.
The Diversity metric quantifies the variability among the actions generated by the model. To compute this metric, we randomly select 300 motion samples from all generated motions, extract their feature vectors, and calculate the average L2 distance between these samples.
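A compact sketch of R-Precision and the Diversity metric is given below, assuming the TMR features have already been extracted; batching and the exact sampling protocol are simplified.

```python
# Sketch of R-Precision and Diversity over precomputed motion/text features.
import torch

def r_precision(motion_feats, text_feats, top_k=3):
    """Fraction of motions whose paired caption ranks in the top-k by Euclidean distance."""
    dists = torch.cdist(motion_feats, text_feats)          # (B, B) distance matrix
    ranks = dists.argsort(dim=1)                           # per-motion ranking of captions
    target = torch.arange(len(motion_feats)).unsqueeze(1)  # index of the paired caption
    return (ranks[:, :top_k] == target).any(dim=1).float().mean()

def diversity(motion_feats, num_pairs=300):
    """Average L2 distance between randomly drawn pairs of motion features."""
    idx_a = torch.randint(0, len(motion_feats), (num_pairs,))
    idx_b = torch.randint(0, len(motion_feats), (num_pairs,))
    return (motion_feats[idx_a] - motion_feats[idx_b]).norm(dim=-1).mean()

feats_m, feats_t = torch.randn(32, 512), torch.randn(32, 512)
print(r_precision(feats_m, feats_t), diversity(torch.randn(1000, 512)))
```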
Motion-to-Text. We employ R-Precision and Multimodal Distance to evaluate the alignment between ground truth motions and model-generated captions, utilizing the same computation methods as those applied in text-to-motion tasks. To assess the similarity between the generated captions and the ground truth captions, we utilize traditional NLP metrics, including BLEU[36], ROUGE-L[28], CIDEr-D[48], and BERTScore[65].
Music-to-Dance. For dance motions, we adhere to previous works and use FairMotion[15] to extract the kinetic features of the motions. The Kinetic Fréchet Inception Distance ($\mathrm{FID}_k$) measures the statistical distance between the kinetic features of generated dance motions and those of ground-truth motions to evaluate dance quality. $\mathrm{Div}_k$ computes the pairwise L2 distances between the kinetic features of all generated dance motions to assess their diversity. The Beat Alignment Score (BAS) measures how closely each motion beat matches its nearest musical beat, as defined by the following formula:
$\mathrm{BAS} = \frac{1}{|B^d|}\sum_{t^d \in B^d}\exp\!\left(-\frac{\min_{t^m \in B^m}\lVert t^d - t^m\rVert^2}{2\sigma^2}\right)$ (3)
In this framework, $B^d$ denotes the set of beats of the dance motion, calculated from the maximum local velocity of the dance movements, and $B^m$ represents the beats of the music, extracted using the Librosa[34] library. However, the BAS metric has certain limitations, as it does not penalize dance motions with an excessive number of beats. Consequently, in some cases, the generated dance motions may exhibit higher BAS scores than the ground truth.
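A minimal implementation of Eq. (3), assuming beat times in seconds and a fixed sigma, could look as follows; the beat extraction itself (velocity extrema for motion, Librosa beat tracking for music) is abstracted away, and the sigma value is an assumption.

```python
# Sketch of the Beat Alignment Score of Eq. (3).
import numpy as np

def beat_align_score(dance_beats, music_beats, sigma=3.0):
    """Mean exp(-min squared gap / 2*sigma^2) between each dance beat and its nearest music beat."""
    dance_beats, music_beats = np.asarray(dance_beats), np.asarray(music_beats)
    scores = []
    for b in dance_beats:
        nearest = np.min((music_beats - b) ** 2)          # squared gap to the closest music beat
        scores.append(np.exp(-nearest / (2 * sigma ** 2)))
    return float(np.mean(scores))

print(beat_align_score([0.5, 1.1, 1.9], [0.5, 1.0, 1.5, 2.0]))
```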
Dance-to-Music. Following [58, 33], we assess the beat accuracy of the generated music. To evaluate the alignment between the kinematic beats of the input dances and the beats of the generated music, we employ two metrics. Let $B_m$ denote the total number of musical beats, $B_k$ the total number of kinematic beats, and $B_a$ the number of kinematic beats that are aligned with musical beats. Following the implementation of LORIS[58], multiple beats occurring within a single second are treated as one. The two metrics are defined as follows:
- •
Beat Coverage ($\mathrm{BCS} = B_k / B_m$): measures the proportion of kinematic beats relative to musical beats.
- •
Beat Hit Rate ($\mathrm{BHS} = B_a / B_k$): quantifies the ratio of aligned kinematic beats to the total number of kinematic beats.
Consistent with [58, 33], we compute the F1 score for Beat Coverage Score (BCS) and Beat Hit Score (BHS) using the following formula:
$F1 = \dfrac{2 \cdot \mathrm{BCS} \cdot \mathrm{BHS}}{\mathrm{BCS} + \mathrm{BHS}}$ (4)
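Once the beat times are available, the three scores reduce to simple counting, as in the sketch below; beats falling within the same second are merged as described above, and the alignment tolerance is an assumption.

```python
# Sketch of BCS, BHS, and their F1 combination from lists of beat times in seconds.
import numpy as np

def beat_scores(kinematic_beats, musical_beats, tol=0.5):
    """Return (BCS, BHS, F1) following the definitions above."""
    kin = np.unique(np.floor(kinematic_beats))     # merge beats within the same second
    mus = np.unique(np.floor(musical_beats))
    aligned = sum(1 for b in kin if np.min(np.abs(mus - b)) <= tol)
    bcs = len(kin) / max(len(mus), 1)              # B_k / B_m
    bhs = aligned / max(len(kin), 1)               # B_a / B_k
    f1 = 2 * bcs * bhs / max(bcs + bhs, 1e-8)
    return bcs, bhs, f1

print(beat_scores([0.4, 1.2, 2.1, 3.0], [0.5, 1.0, 2.0, 3.0, 4.0]))
```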
Speech-to-Gesture. Following [31], we use FGD to evaluate the quality of gesture motions, L1Div to measure their diversity, and Beat Alignment to assess the rhythmic consistency between gesture motions and speech. L1Div is computed as the average absolute deviation of all generated gesture features from their mean. To extract gesture latents for computing FGD, we trained a gesture VAE[38] on the BEATv2[31] dataset.
Appendix C Reconstruction Quality of Single-Codebook VQ-VAE
Previous studies[23, 61] have employed fixed VQ-VAE parameters and utilized the HumanML3D dataset for motion representation. However, the introduction of the larger MotionHub dataset and the inclusion of multi-person motions render the practical experiences gained from VQ-VAE implementations on smaller datasets like HumanML3D unreliable. Consequently, we conducted experiments to evaluate different motion representations and determine the optimal number of codes in the codebook.
C.1 Motion Representation
Previous studies have employed HumanML3D as the motion representation. However, this representation exhibits high redundancy and lacks global positional information, rendering it unsuitable for multi-person motions. Therefore, it is essential to identify a motion representation that incorporates global positional information while maintaining low redundancy. We first review commonly employed centralized motion representations and subsequently present our experimental findings.
C.1.1 Preliminary
To accurately represent the motion of a human body, it is essential to define at least the following two quantities:
- •
The human skeletal topology $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $|\mathcal{V}| = N$. Here, $N$ denotes the number of abstracted skeletal joints, $\mathcal{V}$ represents the joints of the human body, and $\mathcal{E}$ represents the connections between the joints. This human skeletal topology is also referred to as a kinematic tree, due to its adherence to the mathematical characteristics of a tree.
- •
The positions of the joints at each time step $t$, denoted as $P_t \in \mathbb{R}^{N \times 3}$. The positions of the joints can be derived from various representation quantities, such as linear velocity, angular velocity, and rotational angles between joints. We collectively refer to these feasible representation quantities as “motion representation”.
In deep learning-based motion synthesis tasks, the kinematic tree is often predefined by the dataset, while the choice of motion representation is diverse and can significantly influence the synthesis results.
C.1.2 Impact of Motion Representations.
Our task involves the generation and understanding of multi-person motions. Current Motion VQ-VAE training often utilizes HumanML3D-style motion data as input. However, HumanML3D transforms all motion coordinates from a global coordinate system to a relative coordinate system, where the origin is set at the initial frame’s foot position, and the z-axis aligns with the initial frame’s facing direction. This transformation discards global position and orientation information of human motions, which is critical for generating multi-person motions. Moreover, studies like HumanTomato[32] have pointed out that HumanML3D-style data representation is highly redundant. Therefore, we start with the simplest representation, using global coordinates for each joint, and investigate the most efficient method for representing motion data.
Global Joint Coordinates. As mentioned in Sec. C.1.1, the simplest motion representation is the global joint coordinates. Due to differences in coordinate systems and motion capture actors, the data distribution across different motion capture datasets may vary significantly. Therefore, we apply the following operations for standardization.
- •
Unified Coordinate System: For single-person motion data, we define the xz-plane as the horizontal plane corresponding to the lowest point of all joints in each motion sequence, with the positive z-axis aligned with the character's facing direction in the first frame and the y-axis aligned with the vertical line passing through the root joint in the first frame. For two-person motion data, the xz-plane is defined based on the lowest point of all joints of both individuals, and the xz-plane and y-axis are determined in the same manner from the position and orientation of one of the individuals in the first frame (a minimal normalization sketch follows this list).
- •
Motion Retargeting: To mitigate data distribution differences caused by variations in the body shapes of motion capture actors, motions are often retargeted to a standard skeleton.
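Below is a minimal sketch of the unified coordinate system step; the SMPL joint indices and the facing-direction computation (shoulder/hip vectors projected onto the ground plane) are simplifying assumptions rather than the exact preprocessing code.

```python
# Sketch of coordinate-system canonicalization for a (T, J, 3) global-joint sequence.
import numpy as np

def canonicalize(joints, root=0, l_shoulder=16, r_shoulder=17, l_hip=1, r_hip=2):
    """Normalize a motion so the floor is at y=0, frame-0 root is at the origin, facing +z."""
    joints = joints.copy()
    joints[..., 1] -= joints[..., 1].min()                   # floor at the lowest joint
    joints[..., [0, 2]] -= joints[0, root, [0, 2]]            # frame-0 root over the origin

    # Facing direction in frame 0 from the shoulder/hip vectors, projected to the ground.
    across = (joints[0, l_shoulder] - joints[0, r_shoulder]) + (joints[0, l_hip] - joints[0, r_hip])
    forward = np.cross(across, np.array([0.0, 1.0, 0.0]))
    forward /= np.linalg.norm(forward) + 1e-8

    angle = np.arctan2(forward[0], forward[2])                # rotate the facing onto +z
    c, s = np.cos(-angle), np.sin(-angle)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return joints @ rot.T

motion = np.random.randn(120, 52, 3)
print(canonicalize(motion).shape)   # (120, 52, 3)
```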
Translation and Rotation. Using coordinates to represent joint positions is simple and straightforward, but it overlooks the fact that bone lengths in the human body should remain constant during motion. The joint coordinates reconstructed by neural networks do not theoretically guarantee compliance with the predefined human skeletal topology. Therefore, a common approach is to use the root joint's displacement and the rotational angles of each joint as the motion representation.
Hybrid Motion Representation. Works such as T2M[17] (HumanML3D) and InterGen[27] (InterHuman) propose including both joint coordinates and rotations in the motion representation, while also incorporating joint velocities and binary features indicating whether the feet are in contact with the ground in each frame. A HumanML3D-style motion vector can be represented as the tuple $(\dot{r}^a, \dot{r}^x, \dot{r}^z, r^y, \mathbf{j}^p, \mathbf{j}^v, \mathbf{j}^r, \mathbf{c}^f)$. In this representation, the positional information of the root joint is expressed in terms of its velocities: the angular velocity around the y-axis (the gravitational axis) $\dot{r}^a$, the linear velocities on the x and z axes within the xz-plane (ground plane) $\dot{r}^x$ and $\dot{r}^z$, and the height of the root joint $r^y$. The terms $\mathbf{j}^p$, $\mathbf{j}^v$, $\mathbf{j}^r$, and $\mathbf{c}^f$ respectively represent the joint positions, joint velocities, joint rotation angles, and whether the feet are in contact with the ground. InterHuman and HumanML3D share a consistent format; however, to preserve global positional information, InterHuman uses global joint coordinates as $\mathbf{j}^p$. These two motion representations are the most widely used in the field of motion generation. However, since rotational angles, velocities, and foot contact information can all be derived from joint coordinates, works such as HumanTomato[32] have pointed out that the hybrid representation is quite redundant.
To investigate the impact of the various normalization steps and representation components on the reconstruction performance of the Motion VQ-VAE, we conducted extensive experiments. As shown in Tab. 11, using the most straightforward global joint coordinates as the motion representation yields the highest reconstruction accuracy. Although the InterHuman motion representation introduces rich supplementary information, including velocity, foot contact, and rotation, it does not improve reconstruction quality. Furthermore, the translation- and rotation-based motion representation, while decoupling the motion representation from the skeletal topology, accumulates errors due to inaccuracies in predicting each joint's rotation. Consequently, this approach leads to inferior reconstruction quality compared to directly predicting joint positions.
Codebook Size. To fit a large-scale dataset like MotionHub, which contains a vast number of samples and a wide variety of motion types, we require a sufficiently large codebook. As shown in Tab. 12, we experimented with codebook sizes of 1024, 2048, 4096, and 8192. The results demonstrate that a codebook size of 2048 achieves the optimal reconstruction performance under our experimental setup, validating our observations on code utilization presented in Sec. 3.1.
Effectiveness Analysis of HoMi Tokenizer Components. In this experiment, we evaluated the impact of integrating the Hands-Torso Disentangled Encoder (HTDE) and the Frequency-Aware Motion Gating (FAMG) structure on the reconstruction quality of the HoMi Tokenizer. As illustrated in Tab. 9, the HoMi Tokenizer achieves the highest motion reconstruction performance when both components are incorporated.
Appendix D More Results
We have performed visualizations for each task. Additional visualization results are available in the supplementary materials, and we have included the visualization output (10 samples or more for each task) along with the code in our submission package.
Fig. 5 illustrates the outcomes of our S2G and T2M tasks, demonstrating that the generated results exhibit a degree of consistency with the ground truth while also showcasing notable diversity.
As shown in Fig. 6, we present the visualization results for the M2D and IT2M tasks. Although our performance on the M2D task does not achieve state-of-the-art (SOTA) metrics, the visualizations indicate that the quality of the generated dances and their beat alignment are satisfactory.