short-paper

Motion-DVAE: Unsupervised learning for fast human motion denoising

Published: 15 November 2023

Abstract

Pose and motion priors are crucial for recovering realistic and accurate human motion from noisy observations. Substantial progress has been made on pose and shape estimation from images, and recent works have shown impressive results using priors to refine frame-wise predictions. However, many motion priors model only transitions between consecutive poses and are used in time-consuming optimization procedures, which is problematic for the many applications that require real-time motion capture. We introduce Motion-DVAE, a motion prior that captures the short-term dependencies of human motion. As a member of the dynamical variational autoencoder (DVAE) family of models, Motion-DVAE combines the generative capability of VAE models with the temporal modeling of recurrent architectures. Together with Motion-DVAE, we introduce an unsupervised learned denoising method that unifies regression- and optimization-based approaches in a single framework for real-time 3D human pose estimation. Experiments show that the proposed approach reaches performance competitive with state-of-the-art methods while being much faster.
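
For orientation, the general form shared by models of the DVAE family can be sketched as below. This is the generic DVAE formulation, not the specific Motion-DVAE model: the notation is assumed here ($x_{1:T}$ for the observed pose sequence, $z_{1:T}$ for the latent sequence, $\theta$ and $\phi$ for the generative and inference parameters), and the abstract does not state which of these dependencies Motion-DVAE actually keeps.

```latex
% Generic DVAE generative model (causal factorization over time)
p_\theta(x_{1:T}, z_{1:T})
  = \prod_{t=1}^{T}
    p_\theta(x_t \mid x_{1:t-1}, z_{1:t}) \,
    p_\theta(z_t \mid x_{1:t-1}, z_{1:t-1})

% Generic inference model (approximate posterior)
q_\phi(z_{1:T} \mid x_{1:T})
  = \prod_{t=1}^{T} q_\phi(z_t \mid z_{1:t-1}, x_{1:T})

% Training objective: evidence lower bound (ELBO)
\mathcal{L}(\theta, \phi; x_{1:T})
  = \mathbb{E}_{q_\phi(z_{1:T} \mid x_{1:T})}
    \left[ \ln p_\theta(x_{1:T}, z_{1:T})
         - \ln q_\phi(z_{1:T} \mid x_{1:T}) \right]
```

Individual DVAE models are obtained by dropping some of the conditioning variables above; recurrent networks are typically what summarizes the past $x_{1:t-1}, z_{1:t-1}$, which is the sense in which the family combines VAE-style generation with recurrent temporal modeling.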

Supplementary Material

Detailed calculations, qualitative results and supplementary experiments (mig23-18-supplementary.zip)

Information

Published In

MIG '23: Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games
November 2023
224 pages
ISBN: 9798400703935
DOI: 10.1145/3623264
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Motion prior
  2. dynamical variational autoencoder
  3. motion denoising
  4. unsupervised learning

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Funding Sources

  • ANR

