Vocal tract length normalization for automatic speech recognition

Sharma, Shubham

Date

2014

Abstract

Various factors affect the performance of Automatic Speech Recognition (ASR) systems. This thesis addresses speaker differences due to variations in vocal tract length (VTL). Vocal Tract Length Normalization (VTLN) has become an integral part of ASR systems, and different methods have been studied to compensate for these differences in the spectral domain. In this thesis, several state-of-the-art methods are implemented and discussed in detail. For example, the method of Lee and Rose uses a maximum likelihood-based approach: it performs a grid search over a range of warping factors to obtain the optimal warping factor for each speaker. The method of Umesh et al., on the other hand, uses the scale transform to obtain VTL-normalized features. Frequency warping is the basis of such normalization techniques. Mel scale warping is the most widely accepted for compensating for speaker differences, as it is inspired by the hearing process of the human ear. This thesis proposes the use of Bark scale-based warping. The Bark scale is based on the perception of loudness by the human ear, in contrast with the Mel scale, which is based on pitch perception. Bark scale-based warping provides improved recognition accuracy under mismatched conditions (i.e., training on male speakers and testing on female speakers, or vice versa).

The performance of the different methods has been tested on different ASR tasks in English, Gujarati, and Marathi. The TIMIT database is used for English, and the details of database collection for Gujarati and Marathi are discussed. VTLN improves performance over state-of-the-art Mel-frequency cepstral coefficient (MFCC) features alone for almost all applications considered in this thesis. One of the major tasks of this thesis is the development of Phonetic Engines (PEs) using VTLN in three different modes of speech, viz., read, spontaneous, and lecture mode, in Gujarati and Marathi. The Lee-Rose method is used for the design of the PEs, and the VTLN-based method achieves improved accuracy compared to MFCCs. In addition, a template matching experiment is performed for spoken keyword spotting using the various VTL-normalized features under study and MFCCs. Better precision and lower equal error rates (EER) are obtained using VTL-normalized Scale Transform Cepstral Coefficients (STCC). This suggests that VTLN-based features can be useful for larger applications such as audio search and spoken term detection (STD).
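
As an illustration of the ideas named in the abstract, the following is a minimal Python sketch of the two perceptual frequency scales and of a maximum likelihood grid search over warping factors. It is not the thesis implementation. The Mel and Bark formulas are common textbook approximations and may differ from the variants used in the thesis. The warping-factor grid (0.88 to 1.12 in steps of 0.02), the linear warp f -> alpha * f, and the function names are assumptions made here for illustration. The likelihood is a toy Gaussian score on formant positions, standing in for the acoustic-model likelihood that a real system would compute.

```python
import math


def hz_to_mel(f_hz):
    """Map frequency in Hz to Mel (a common approximation; assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def hz_to_bark(f_hz):
    """Map frequency in Hz to Bark (Traunmueller's approximation; assumed here)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def toy_log_likelihood(alpha, observed_hz, reference_hz, sigma_hz=50.0):
    """Toy stand-in for an acoustic-model log-likelihood.

    Scores how well the linearly warped observations (alpha * f) match
    the reference frequencies under an isotropic Gaussian (constant dropped).
    """
    squared_error = sum(
        (alpha * f_obs - f_ref) ** 2
        for f_obs, f_ref in zip(observed_hz, reference_hz)
    )
    return -squared_error / (2.0 * sigma_hz ** 2)


def grid_search_warping_factor(observed_hz, reference_hz, alphas):
    """Return the warping factor on the grid with the highest toy likelihood."""
    return max(
        alphas,
        key=lambda a: toy_log_likelihood(a, observed_hz, reference_hz),
    )


# Hypothetical grid of 13 warping factors: 0.88, 0.90, ..., 1.12.
ALPHA_GRID = [round(0.88 + 0.02 * i, 2) for i in range(13)]


if __name__ == "__main__":
    # Hypothetical reference formants and a speaker whose formants are
    # uniformly lower by a factor of 1.10.
    reference = [500.0, 1500.0, 2500.0]
    observed = [f / 1.10 for f in reference]
    print(grid_search_warping_factor(observed, reference, ALPHA_GRID))
```

In a full recognizer, the toy score would be replaced by the likelihood of the warped feature sequence given the acoustic model and the transcription, evaluated once per candidate warping factor for each speaker.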

URI

http://drsr.daiict.ac.in/handle/123456789/487

Collections

M Tech Dissertations [923]