DOI: 10.1145/3341162.3344861
Research article | Open access

Audio-visual TED corpus: enhancing the TED-LIUM corpus with facial information, contextual text and object recognition

Published: 09 September 2019

Abstract

We present a variety of new visual features extending the TED-LIUM corpus. We re-aligned the original TED talk audio transcriptions with the official TED.com videos. Using state-of-the-art models for face and facial landmark detection, optical character recognition, and object detection and classification, we extract four new visual features that can be used in Large-Vocabulary Continuous Speech Recognition (LVCSR) systems: facial images, facial landmarks, text, and objects in the scene. The facial images and landmarks can be combined with audio for audio-visual acoustic modeling, where the visual modality provides robust features in adverse acoustic environments. The contextual information, i.e. the extracted text and the detected objects in the scene, can serve as prior knowledge for building contextual language models. Experimental results show the efficacy of using visual features on top of acoustic features for speech recognition in overlapping-speech scenarios.
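The re-alignment described above implies mapping each transcript segment, given as audio timestamps, onto the video frames from which per-frame visual features (faces, landmarks, on-screen text, objects) are extracted. A minimal sketch of such a mapping, assuming a fixed frame rate; the function name and the 25 fps default are illustrative assumptions, not the authors' code:

```python
# Illustrative sketch (not from the paper): map a transcript segment,
# given in seconds, to the half-open range of video frame indices that
# visual feature extraction would cover for that utterance.

def segment_to_frame_range(start_s: float, end_s: float, fps: float = 25.0) -> range:
    """Return frame indices covering [start_s, end_s) at the given frame rate."""
    first = round(start_s * fps)   # first frame at the segment start
    last = round(end_s * fps)      # frame just past the segment end
    return range(first, last)

# Example: a 2.4-second utterance starting at 10.0 s in a 25 fps video
frames = segment_to_frame_range(10.0, 12.4)
print(frames.start, len(frames))  # 250 60
```

Per-frame visual features extracted over this range can then be paired with the acoustic features of the same utterance for audio-visual acoustic modeling.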


Cited By

  • (2021) Tree-Constrained Pointer Generator for End-to-End Contextual Speech Recognition. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 780-787. DOI: 10.1109/ASRU51503.2021.9687915. Online publication date: 13 December 2021.
  • (2021) Audio to Video: Generating a Talking Fake Agent. Progress in Intelligent Decision Science, 212-227. DOI: 10.1007/978-3-030-66501-2_17. Online publication date: 30 January 2021.


Published In

UbiComp/ISWC '19 Adjunct: Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers
September 2019
1234 pages
ISBN:9781450368698
DOI:10.1145/3341162

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. audio-visual speech recognition
  2. multi-modal interaction
  3. speech recognition corpus

Conference

UbiComp '19

Acceptance Rates

Overall Acceptance Rate 764 of 2,912 submissions, 26%
