
Regularized minimum variance distortionless response-based cepstral features for robust continuous speech recognition

Published: 01 October 2015

Highlights

• We study low-variance and robust features for speech recognition systems on the AURORA-4 corpus.
• We propose to compute cepstral features from regularized MVDR (RMVDR) spectral estimates, denoted RMVDR-based cepstral coefficient (RMCC) features.
• A sigmoid-shaped auditory-domain weighting rule is proposed for speech spectrum enhancement and incorporated into the RMCC framework.
• We incorporate the medium duration power bias subtraction (MDPBS) method into the RMCC framework.
• Two robust front-ends, robust RMCC (RRMCC) and normalized RMCC (NRMCC), are proposed for speech recognition.

Abstract

In this paper, we present robust feature extractors that incorporate a regularized minimum variance distortionless response (RMVDR) spectrum estimator, instead of the discrete Fourier transform-based direct spectrum estimator used in many front-ends including the conventional MFCC, to estimate the speech power spectrum. Direct spectrum estimators, e.g., the single-tapered periodogram, have high variance and perform poorly under noisy and adverse conditions. To reduce this performance drop, we propose to increase the robustness of speech recognition systems by extracting more robust features based on the regularized MVDR technique. The RMVDR spectrum estimator has low spectral variance and is robust to mismatched conditions. Based on the RMVDR spectrum estimator, three robust acoustic front-ends are proposed, namely, regularized MVDR-based cepstral coefficients (RMCC), robust RMVDR cepstral coefficients (RRMCC), and normalized RMVDR cepstral coefficients (NRMCC). In addition to the RMVDR spectrum estimator, RRMCC and NRMCC also utilize auditory-domain spectrum enhancement methods, the auditory spectrum enhancement (ASE) and medium duration power bias subtraction (MDPBS) techniques, respectively, to improve the robustness of the feature extraction method.
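The MVDR (Capon) spectrum that underlies these front-ends can be sketched as follows. This is a minimal illustration, not the paper's implementation: in particular, plain diagonal loading of the autocorrelation matrix (the `reg` parameter, a hypothetical name) stands in for the penalized-LP regularization the paper adopts from Ekman et al. (2007).

```python
import numpy as np

def mvdr_spectrum(frame, order=12, n_fft=512, reg=1e-3):
    """Capon/MVDR spectral estimate of one windowed speech frame.

    `reg` applies simple diagonal loading to the autocorrelation
    matrix -- a stand-in for the paper's regularization, not a
    reproduction of it.
    """
    n = len(frame)
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(frame[:n - k], frame[k:]) / n
                  for k in range(order + 1)])
    # Toeplitz autocorrelation matrix with diagonal loading
    R = np.array([[r[abs(i - j)] for j in range(order + 1)]
                  for i in range(order + 1)])
    R += reg * r[0] * np.eye(order + 1)
    R_inv = np.linalg.inv(R)
    # Capon formula: S(w) = 1 / (e(w)^H R^{-1} e(w)) at each frequency bin
    k = np.arange(order + 1)
    spec = np.empty(n_fft // 2 + 1)
    for b in range(n_fft // 2 + 1):
        e = np.exp(-2j * np.pi * b * k / n_fft)      # steering vector e(w)
        spec[b] = 1.0 / np.real(np.conj(e) @ R_inv @ e)
    return spec
```

For a 1 kHz tone sampled at 8 kHz, the estimate peaks near bin 64 (1000/8000 × 512), illustrating the low-variance, peaked character of MVDR estimates compared with the raw periodogram.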
Speech recognition experiments are conducted on the AURORA-4 large vocabulary continuous speech recognition (LVCSR) corpus, and performance is compared with the Mel frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), MVDR spectrum estimator-based MFCC, perceptual MVDR (PMVDR), cochlear filterbank cepstral coefficients (CFCC), power normalized cepstral coefficients (PNCC), ETSI advanced front-end (ETSI-AFE), and the robust feature extractor (RFE) of Alam et al. (2012). Experimental results demonstrate that the proposed robust feature extractors outperform the other robust front-ends in terms of percentage word error rate on the AURORA-4 LVCSR task under both clean and multi-condition training. Under clean training conditions, on average, the RRMCC and NRMCC provide significant reductions in word error rate over the rest of the front-ends. Under multi-condition training, the RMCC, RRMCC, and NRMCC perform slightly better in terms of average word error rate than the rest of the front-ends used in this work.
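The highlights describe computing cepstral features from the RMVDR spectral estimate. A standard MFCC-style back end (Mel filterbank, log compression, DCT-II) applied to any power-spectrum estimate might look like the sketch below; the filterbank size, number of cepstra, and sampling rate are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def cepstra_from_spectrum(power_spec, n_filters=23, n_ceps=13, fs=8000):
    """MFCC-style back end: Mel filterbank -> log -> DCT-II.

    `power_spec` is one frame's power spectrum (n_fft//2 + 1 bins),
    e.g. an MVDR or periodogram estimate.
    """
    n_bins = len(power_spec)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter edge frequencies equally spaced on the Mel scale
    pts = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.round(pts / (fs / 2.0) * (n_bins - 1)).astype(int)
    # Triangular filters between consecutive edge bins
    fb = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for b in range(l, c):
            fb[i, b] = (b - l) / max(c - l, 1)
        for b in range(c, r):
            fb[i, b] = (r - b) / max(r - c, 1)
    log_e = np.log(fb @ power_spec + 1e-10)
    # DCT-II decorrelates the log filterbank energies into cepstra
    n = np.arange(n_filters)
    return np.array([np.sum(log_e * np.cos(np.pi * k * (2 * n + 1)
                                           / (2 * n_filters)))
                     for k in range(n_ceps)])
```

Swapping the periodogram for the RMVDR estimate in the first stage, while keeping a back end of this general shape, is the change the RMCC family makes to the conventional MFCC pipeline.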

References

[1]
Alam, M.J., Ouellet, P., Kenny, P., O'Shaughnessy, D., 2011. Comparative evaluation of feature normalization techniques for speaker verification. In: Proc. NOLISP, LNAI 7015, Las Palmas, Spain, November 2011, pp. 246-253.
[2]
Alam, M.J., Kenny, P., O'Shaughnessy, D., 2012. Robust feature extraction for speech recognition by enhancing auditory spectrum. In: Proc. INTERSPEECH, Portland, Oregon, September 2012.
[3]
Alam, M.J., Kenny, P., O'Shaughnessy, D., 2013a. Smoothed nonlinear energy operator-based amplitude modulation features for robust speech recognition. In: Proc. NOLISP, LNAI 7911, 2013, pp. 168-175.
[4]
Alam, M.J., Kenny, P., O'Shaughnessy, D., 2013b. Speech recognition using regularized minimum variance distortion-less response spectrum-estimation based cepstral features. In: Proc. ICASSP, Vancouver, Canada, May 2013. <http://www.crim.ca/perso/patrick.kenny/Alam_icassp2013.pdf>.
[5]
Alam, M.J., O'Shaughnessy, D., Kenny, P., 2013c. A novel feature extractor employing regularized MVDR spectrum estimator and subband spectrum enhancement technique. In: Proc. WOSSPA, Algiers, Algeria, May 2013. <http://www.crim.ca/perso/patrick.kenny/Alam_WOSSPA_2013.pdf>.
[6]
Alam, M.J., Kenny, P., O'Shaughnessy, D., 2013d. Regularized MVDR spectrum estimation-based robust feature extractors for speech recognition. In: Proc. INTERSPEECH, Lyon, France, August 2013.
[7]
Alam, M.J., Gupta, V., Kenny, P., Dumouchel, P., 2014a. Use of multiple front-ends and i-vector based speaker adaptation for robust speech recognition. In: Proc. REVERB Challenge, Florence, Italy, May 2014.
[8]
M.J. Alam, P. Kenny, D. O'Shaughnessy, Robust feature extraction based on an asymmetric level dependent auditory filterbank and a subband spectrum enhancement technique, Digit. Signal Process., 29 (2014) 147-157.
[9]
B. Atal, Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification, J. Acoust. Soc. Am., 55 (1974) 1304-1312.
[10]
Au Yeung, S.-K., Siu, M.-H., 2004. Improved performance of Aurora-4 using HTK and unsupervised MLLR adaptation. In: Proceedings of the Int. Conference on Spoken Language Processing, Jeju, Korea, 2004.
[11]
J. Capon, High-resolution frequency-wavenumber spectrum analysis, Proc. IEEE, 57 (1969) 1408-1418.
[12]
Y.B. Chiu, B. Raj, R.M. Stern, Learning-based auditory encoding for robust speech recognition, IEEE Trans. Audio Speech Lang. Process., 20 (2012) 900-914.
[13]
I. Cohen, Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging, IEEE Trans. Speech Audio Process., 11 (2003) 466-475.
[14]
S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Process., 28 (1980) 357-366.
[15]
A. de la Torre, A.M. Peinado, J.C. Segura, J.L. Perez-Cordoba, M.C. Benitez, A.J. Rubio, Histogram equalization of speech representation for robust speech recognition, IEEE Trans. Speech Audio Process., 13 (2005) 355-366.
[16]
Dharanipragada, S., Rao, B.D., 2001. MVDR based feature extraction for robust speech recognition. In: Proc. ICASSP, 2001, pp. 309-312.
[17]
P.M. Djuric, S.M. Kay, Spectrum Estimation and Modeling, Digital Signal Processing Handbook, CRC Press LLC, 1999.
[18]
Dubnov, S., 2006. YASAS - yet another sound analysis - synthesis method. In: Proc. of ICMC, New Orleans, 2006.
[19]
J. Durbin, The fitting of time-series models, Rev. Int. Stat. Inst., 28 (1960) 233-244.
[20]
L.A. Ekman, W.B. Kleijn, M.N. Murthi, Regularized linear prediction of speech, IEEE Trans. Audio Speech Lang. Process., 16 (2007) 65-72.
[21]
Ellis, D.P.W. PLP and RASTA (and MFCC, and Inversion) in Matlab. <http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/>.
[22]
ETSI ES 202 050, 2003. Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Advanced Front-end Feature Extraction Algorithm; Compression Algorithms.
[23]
S. Furui, Cepstral analysis technique for automatic speaker verification, IEEE Trans. Acoust. Speech Signal Process., 29 (1981) 254-272.
[24]
Gerkmann, T., Hendriks, R.C., 2011. Noise power estimation based on the probability of speech presence. In: Proc. IEEE WASPAA, New York, October 2011, pp. 145-148.
[25]
C. Hanilci, T. Kinnunen, F. Ertas, R. Saeidi, J. Pohjalainen, P. Alku, Regularized all-pole models for speaker verification under noisy environments, IEEE Signal Process. Lett., 19 (2012) 163-166.
[26]
H. Hermansky, Perceptual linear prediction analysis of speech, J. Acoust. Soc. Am., 87 (1990) 1738-1752.
[27]
H. Hermansky, N. Morgan, RASTA processing of speech, IEEE Trans. Speech Audio Process., 2 (1994) 578-589.
[28]
Hilger, F., Ney, H., 2001. Quantile based histogram equalization for noise robust speech recognition. In: Proc. EUROSPEECH, 2001, pp. 1135-1138.
[29]
X. Huang, A. Acero, H. Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice-Hall PTR, Upper Saddle River, New Jersey, 2001.
[30]
T. Irino, R.D. Patterson, A compressive gammachirp auditory filter for both physiological and psychophysical data, J. Acoust. Soc. Am., 109 (2001) 2008-2022.
[31]
Kim, C., Stern, R.M., 2010. Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring. In: IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, March 2010, pp. 4574-4577.
[32]
N. Levinson, The Wiener RMS (root mean square) error criterion in filter design and prediction, J. Math. Phys., 25 (1947) 261-278.
[33]
Li, Q., Huang, Y., 2010. Robust speaker identification using an auditory-based feature. In: Proc. ICASSP, 2010, pp. 4514-4517.
[34]
Liu, F.H., Stern, R.M., Huang, X., Acero, A., 1993. Efficient cepstral normalization for robust speech recognition. In: Proc. ARPA Human Language Technology Workshop '93, Princeton, NJ, March 1993, pp. 69-74.
[35]
R. Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics, IEEE Trans. Speech Audio Process., 9 (2001) 504-512.
[36]
Mitra, V., Franco, H., Graciarena, M., Mandal, A., 2012. Normalized amplitude modulation features for large vocabulary noise-robust speech recognition. In: Proc. of ICASSP, 2012, pp. 4117-4120.
[37]
Murthi, M.N., Kleijn, W.B., 2000. Regularized linear prediction all-pole models. In: IEEE Speech Coding Workshop, 2000, pp. 96-98.
[38]
M.N. Murthi, B.D. Rao, All-pole modeling of speech based on the minimum variance distortionless response spectrum, IEEE Trans. Speech Audio Process., 8 (2000) 221-239.
[39]
D. O'Shaughnessy, Speech Communication - Human and Machine, IEEE Press, 2000.
[40]
Parihar, N., Picone, J., Pearce, D., Hirsch, H.G., 2004. Performance analysis of the Aurora large vocabulary baseline system. In: Proceedings of the European Signal Processing Conference, Vienna, Austria, 2004.
[41]
Pelecanos, J., Sridharan, S., 2001. Feature warping for robust speaker verification. In: Proc. Speaker Odyssey: The Speaker Recognition Workshop, Crete, Greece, 2001, pp. 213-218.
[42]
Ravindran, S., Anderson, D.V., Slaney, M., 2006. Improving the noise-robustness of mel-frequency cepstral coefficients for speech processing. In: Proc. SAPA, 2006. <http://www.sapaworkshops.org/2006/papers/131.pdf>.
[43]
Results of ASR task in the REVERB Challenge, 2014. <http://reverb2014.dereverberation.com/result_asr.html>.
[44]
Sarikaya, R., Hansen, J.H.L., 2001. Analysis of the root-cepstrum for acoustic modeling and fast decoding in speech recognition. In: Proc. EUROSPEECH.
[45]
Seltzer, M.L., Yu, D., Wang, Y., 2013. An investigation of deep neural networks for noise robust speech recognition. In: Proc. of ICASSP, 2013. <http://mi.eng.cam.ac.uk/~yw293/pdfs/ICASSP2013a.pdf>.
[46]
T. Shimamura, N.D. Nguyen, Autocorrelation and double autocorrelation based spectral representations for noisy word recognition systems, Proc. Interspeech (2010) 1712-1715.
[47]
The KALDI Speech Recognizer. <http://kaldi.sourceforge.net/>.
[48]
The REVERB Challenge Corpus. <http://reverb2014.dereverberation.com/data.html>.
[49]
van Hout, J., Alwan, A., 2012. A novel approach to soft-mask estimation and log-spectral enhancement for robust speech recognition. In: Proc. of ICASSP, 2012, pp. 4105-4108.
[50]
O. Viikki, K. Laurila, Cepstral domain segmental feature vector normalization for noise robust speech recognition, Speech Commun., 25 (1998) 133-147.
[51]
G. von Békésy, E.G. Wever, Experiments in Hearing, McGraw-Hill, New York, 1960.
[52]
Weng, C., Yu, D., Watanabe, S., Juang, B.-H., 2014. Recurrent deep neural networks for robust speech recognition. In: Proc. of ICASSP, 2014. <http://research.microsoft.com/pubs/217320/RDNN-Robust-CIASSP2014-published.pdf>.
[53]
M.C. Wolfel, J.W. McDonough, Minimum variance distortionless response spectral estimation: review and refinements, IEEE Signal Process. Mag., 22 (2005) 117-126.
[54]
Wolfel, M., Yang, Q., Jin, Q., Schultz, T., 2009. Speaker identification using warped MVDR cepstral features. In: Proc. Interspeech, 2009, pp. 912-915.
[55]
Xiang, B., Chaudhari, U.V., Navratil, J., Ramaswamy, G.N., Gopinath, R.A., 2002. Short-time Gaussianization for robust speaker verification. In: IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, Orlando, Florida, USA, 2002, pp. 681-684.
[56]
Xiong, X., 2009. Robust Speech Features and Acoustic Models for Speech Recognition. PhD Thesis, NTU, Singapore.
[57]
Yapanel, U.H., Dharanipragada, S., 2003. Perceptual MVDR cepstral coefficients (PMCCs) for robust speech recognition. In: Proc. ICASSP, 2003, pp. 644-647.
[58]
U.H. Yapanel, J.H.L. Hansen, A new perceptually motivated MVDR-based acoustic front-end (PMVDR) for robust automatic speech recognition, Speech Commun., 50 (2008) 142-152.
[59]
S.J. Young, HTK Book, Entropic Cambridge Research Laboratory Ltd., 2006.
Published In

Speech Communication, Volume 73, Issue C, October 2015, 93 pages

Publisher

Elsevier Science Publishers B. V., Netherlands

      Author Tags

      1. ASE
      2. Feature normalization
      3. Multi-condition training
      4. Regularized MVDR
      5. Robust feature extraction
      6. Speech recognition
