DOI: 10.1145/3341162.3344861
Research article | Open access

Audio-visual TED corpus: enhancing the TED-LIUM corpus with facial information, contextual text and object recognition

Published: 09 September 2019

Abstract

We present a variety of new visual features extending the TED-LIUM corpus. We re-aligned the original TED talk audio transcriptions with the official TED.com videos. Using state-of-the-art models for face and facial landmark detection, optical character recognition, and object detection and classification, we extract four new visual features that can be used in Large-Vocabulary Continuous Speech Recognition (LVCSR) systems: facial images, facial landmarks, text, and objects in the scene. The facial images and landmarks can be combined with audio for audio-visual acoustic modeling, where the visual modality provides robust features in adverse acoustic environments. The contextual information, i.e. the extracted text and the detected objects in the scene, can serve as prior knowledge for building contextual language models. Experimental results show the efficacy of using visual features on top of acoustic features for speech recognition in overlapping-speech scenarios.
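The re-alignment described above implies mapping each transcript segment, given as audio timestamps, onto the video frames from which per-frame visual features (faces, landmarks, on-screen text, objects) are extracted. A minimal sketch of such a mapping, assuming a fixed frame rate; the function name and the 25 fps default are illustrative assumptions, not the authors' code:

```python
# Illustrative sketch (not from the paper): map a transcript segment,
# given in seconds, to the half-open range of video frame indices that
# visual feature extraction would cover for that utterance.

def segment_to_frame_range(start_s: float, end_s: float, fps: float = 25.0) -> range:
    """Return frame indices covering [start_s, end_s) at the given frame rate."""
    first = round(start_s * fps)   # first frame at the segment start
    last = round(end_s * fps)      # frame just past the segment end
    return range(first, last)

# Example: a 2.4-second utterance starting at 10.0 s in a 25 fps video
frames = segment_to_frame_range(10.0, 12.4)
print(frames.start, len(frames))  # 250 60
```

Per-frame visual features extracted over this range can then be paired with the acoustic features of the same utterance for audio-visual acoustic modeling.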


Cited By

  • (2021) Tree-Constrained Pointer Generator for End-to-End Contextual Speech Recognition. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 780-787. DOI: 10.1109/ASRU51503.2021.9687915. Online publication date: 13 December 2021.
  • (2021) Audio to Video: Generating a Talking Fake Agent. Progress in Intelligent Decision Science, 212-227. DOI: 10.1007/978-3-030-66501-2_17. Online publication date: 30 January 2021.


Published In

UbiComp/ISWC '19 Adjunct: Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers
September 2019
1234 pages
ISBN:9781450368698
DOI:10.1145/3341162

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. audio-visual speech recognition
  2. multi-modal interaction
  3. speech recognition corpus

Conference

UbiComp '19

Acceptance Rates

Overall Acceptance Rate 764 of 2,912 submissions, 26%
