DOI: 10.1145/3394171.3413635
Research article

Action2Motion: Conditioned Generation of 3D Human Motions

Published: 12 October 2020

Abstract

Action recognition is a relatively established task: given an input sequence of human motion, the goal is to predict its action category. This paper considers a relatively new problem that can be thought of as the inverse of action recognition: given a prescribed action type, we aim to generate plausible human motion sequences in 3D. Importantly, the set of generated motions is expected to remain diverse, exploring the entire action-conditioned motion space, while each sampled sequence should faithfully resemble natural human body articulation dynamics. Motivated by these objectives, we follow the kinematics of the human body by adopting Lie algebra theory to represent natural human motions, and we propose a temporal Variational Auto-Encoder (VAE) that encourages diverse sampling of the motion space. A new 3D human motion dataset, HumanAct12, is also constructed. Empirical experiments over three distinct human motion datasets (including ours) demonstrate the effectiveness of our approach.
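The temporal VAE described in the abstract generates a pose per time step, conditioned on the action category, with a fresh latent sample at each step driving diversity. The sketch below is a minimal numpy illustration of that generation loop only; the weights are randomly initialized stand-ins for trained parameters, and all names and dimensions (`generate`, `W_h`, `W_p`, `Z_DIM`, etc.) are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_ACTIONS, Z_DIM, H_DIM, POSE_DIM = 12, 8, 32, 72  # illustrative sizes

# Randomly initialized weights stand in for trained decoder parameters.
W_h = rng.normal(scale=0.1, size=(H_DIM, H_DIM + Z_DIM + NUM_ACTIONS))
W_p = rng.normal(scale=0.1, size=(POSE_DIM, H_DIM))

def generate(action_id, num_frames):
    """Sample one motion sequence for a given action: draw a latent z per
    frame and roll it through a recurrent decoder to emit pose parameters."""
    c = np.zeros(NUM_ACTIONS)
    c[action_id] = 1.0                                 # action condition (one-hot)
    h = np.zeros(H_DIM)                                # recurrent state
    poses = []
    for _ in range(num_frames):
        z = rng.standard_normal(Z_DIM)                 # per-frame latent -> diversity
        h = np.tanh(W_h @ np.concatenate([h, z, c]))   # RNN-style state update
        poses.append(W_p @ h)                          # decode pose parameters
    return np.stack(poses)
```

Because a new latent is drawn per frame, two calls with the same action produce different but same-shaped sequences, which is the diversity property the abstract emphasizes.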

Supplementary Material

MP4 File (3394171.3413635.mp4)
We propose a VAE-based network for natural and diverse human motion generation conditioned only on action types. The VAE, built on an RNN architecture, handles stochastic temporal pose generation, while Lie algebra naturally represents human poses. Our method outperforms the comparison baselines over three datasets. We also release a new action-annotated human motion dataset.
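Representing a joint rotation as an element of the Lie algebra so(3) (an axis-angle vector) and mapping it to a rotation matrix via the exponential map is the standard construction behind such Lie-algebra pose representations. A minimal sketch using Rodrigues' formula follows; the function name `exp_so3` is illustrative, not from the paper's code.

```python
import numpy as np

def exp_so3(omega):
    """Exponential map so(3) -> SO(3): convert an axis-angle vector omega
    into a 3x3 rotation matrix via Rodrigues' formula."""
    theta = np.linalg.norm(omega)      # rotation angle is the vector norm
    if theta < 1e-8:
        return np.eye(3)               # near-zero rotation: identity
    k = omega / theta                  # unit rotation axis
    K = np.array([[0.0, -k[2], k[1]],  # skew-symmetric cross-product matrix
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```

A chain of such per-joint rotations, composed along the skeleton hierarchy, reconstructs joint positions from the compact Lie-algebra parameters.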



Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN:9781450379885
DOI:10.1145/3394171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. 3D animation
  2. 3D motion generation
  3. Lie algebra
  4. variational auto-encoder


Funding Sources

  • NSERC Discovery Grant

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%



Cited By

  • (2024)ASMNet: Action and Style-Conditioned Motion Generative Network for 3D Human Motion GenerationCyborg and Bionic Systems10.34133/cbsystems.00905Online publication date: 6-Feb-2024
  • (2024)Incorporating variational auto-encoder networks for text-driven generation of 3D motion human bodyJournal of Image and Graphics10.11834/jig.23029129:5(1434-1446)Online publication date: 2024
  • (2024)AdaptControl: Adaptive Human Motion Control and Generation via User Prompt and Spatial Trajectory GuidanceProceedings of the 5th International Workshop on Human-centric Multimedia Analysis10.1145/3688865.3689476(13-22)Online publication date: 28-Oct-2024
  • (2024)Generation of Novel Fall Animation with Configurable AttributesProceedings of the 9th International Conference on Movement and Computing10.1145/3658852.3659087(1-6)Online publication date: 30-May-2024
  • (2024)Flexible Motion In-betweening with Diffusion ModelsACM SIGGRAPH 2024 Conference Papers10.1145/3641519.3657414(1-9)Online publication date: 13-Jul-2024
  • (2024)Hierarchical Semantics Alignment for 3D Human Motion RetrievalProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657804(1083-1092)Online publication date: 10-Jul-2024
  • (2024)State of the Art on Diffusion Models for Visual ComputingComputer Graphics Forum10.1111/cgf.1506343:2Online publication date: 30-Apr-2024
  • (2024)MotionGPT: Human Motion Synthesis with Improved Diversity and Realism via GPT-3 Prompting2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)10.1109/WACV57701.2024.00499(5058-5068)Online publication date: 3-Jan-2024
  • (2024)Few-shot generative model for skeleton-based human action synthesis using cross-domain adversarial learning2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)10.1109/WACV57701.2024.00390(3934-3943)Online publication date: 3-Jan-2024
  • (2024)Using LLMs to Animate Interactive Story Characters with Emotions and Personality2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW)10.1109/VRW62533.2024.00124(632-635)Online publication date: 16-Mar-2024
