Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

Lin, Yist Y.; Han, Tao; Xu, Haihua; Pham, Van Tung; Khassanov, Yerbolat; Chong, Tze Yuang; He, Yi; Lu, Lu; Ma, Zejun

arXiv:2210.15876 (eess)

[Submitted on 28 Oct 2022 (v1), last revised 25 May 2023 (this version, v2)]

Abstract: One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance degrades when training and test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate the train-test utterance length mismatch issue for the short-video ASR task. Specifically, we are motivated by the observation that our human-transcribed training utterances of short-video spontaneous speech tend to be much shorter (~3 seconds on average), while our test utterances, generated by a voice activity detection front-end, are much longer (~10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, we observe that the proposed RUC method significantly improves long-utterance recognition without a performance drop on short utterances. Overall, it achieves a 5.72% word error rate reduction on average across 15 languages and improved robustness to varying utterance lengths.
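The abstract does not say how utterances are selected or how many are joined, so the following is only a minimal sketch of on-the-fly concatenation, not the authors' implementation. It assumes each training example is a (waveform samples, transcript) pair, that a batch is shuffled and then split into groups of random size, and that transcripts can be joined with a space. The function name `random_utterance_concat` and the parameter `max_concat` are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Assumed representation: (waveform samples, transcript text).
Utterance = Tuple[List[float], str]


def random_utterance_concat(
    batch: Sequence[Utterance],
    max_concat: int = 3,
    rng: Optional[random.Random] = None,
) -> List[Utterance]:
    """Join randomly chosen utterances from a batch into longer ones.

    Hypothetical sketch: the batch is shuffled, then consumed in groups
    of 1..max_concat utterances. Each group becomes one training example
    whose audio and transcript are the concatenation of its members.
    """
    if rng is None:
        rng = random.Random()

    order = list(range(len(batch)))
    rng.shuffle(order)

    augmented: List[Utterance] = []
    start = 0
    while start < len(order):
        group_size = rng.randint(1, max_concat)  # inclusive on both ends
        group = order[start:start + group_size]

        samples: List[float] = []
        texts: List[str] = []
        for idx in group:
            waveform, transcript = batch[idx]
            samples.extend(waveform)   # append audio end to end
            texts.append(transcript)   # keep labels in the same order

        # Assumption: a space is a valid separator for the target language.
        augmented.append((samples, " ".join(texts)))
        start += group_size

    return augmented
```

Because the grouping is redrawn on every call, applying the function to each batch during training produces different concatenations in each epoch, which is one way to read "on-the-fly". With `max_concat=1` the batch is only shuffled, so the original short utterances stay available to the model.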

Comments:	5 pages, 3 figures, 4 tables
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2210.15876 [eess.AS]
	(or arXiv:2210.15876v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2210.15876

Submission history

From: Yerbolat Khassanov
[v1] Fri, 28 Oct 2022 03:54:57 UTC (80 KB)
[v2] Thu, 25 May 2023 05:32:24 UTC (87 KB)
