Please use this identifier to cite or link to this item: http://drsr.daiict.ac.in//handle/123456789/487
Title: Vocal tract length normalization for automatic speech recognition
Authors: Patil, Hemant A.
Sharma, Shubham
Keywords: Signal processing
Automatic speech recognition
Speech processing systems
Speech synthesis
Issue Date: 2014
Publisher: Dhirubhai Ambani Institute of Information and Communication Technology
Citation: Sharma, Shubham (2014). Vocal tract length normalization for automatic speech recognition. Dhirubhai Ambani Institute of Information and Communication Technology, xii, 80 p. (Acc.No: T00450)
Abstract: Various factors affect the performance of Automatic Speech Recognition (ASR) systems. This thesis addresses speaker differences caused by variations in vocal tract length (VTL). Vocal Tract Length Normalization (VTLN) has become an integral part of ASR systems, and different methods have been studied to compensate for these differences in the spectral domain. In this thesis, various state-of-the-art methods are implemented and discussed in detail. For example, the method of Lee and Rose uses a maximum likelihood-based approach: it performs a grid search over a range of warping factors to obtain the optimal warping factor for each speaker. The method of Umesh et al., on the other hand, uses the scale transform to obtain VTL-normalized features. Frequency warping is the basis of such normalization techniques. Mel scale warping is the most widely accepted for compensating for speaker differences, as it is inspired by the hearing process of the human ear. The use of Bark scale-based warping is proposed in this thesis. The Bark scale is based on the perception of loudness by the human ear, in contrast with the Mel scale, which is based on pitch perception. Bark scale-based warping provides improved recognition accuracy under mismatched conditions (i.e., training on male speakers and testing on female speakers, or vice versa). The performance of the different methods has been tested on different ASR tasks in English, Gujarati and Marathi. The TIMIT database is used for English, and the details of database collection for Gujarati and Marathi are discussed. VTLN improves on state-of-the-art MFCC features alone for almost all applications considered in this thesis. One of the major tasks in this thesis is the development of Phonetic Engines (PEs) using VTLN in three different modes of speech, viz., read, spontaneous and lecture mode, in Gujarati and Marathi. The Lee-Rose method is used for the design of the PEs, and improved accuracy is achieved with the VTLN-based method compared to MFCCs. In addition, a template matching experiment is performed using the various VTL-normalized features under study and MFCCs for spoken keyword spotting. Better precision and lower equal error rates (EER) are obtained using VTL-normalized Scale Transform Cepstral Coefficients (STCC). This suggests that VTLN-based features can be useful for larger applications such as audio search and spoken term detection (STD).
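Illustrative sketch (not taken from the thesis): the grid search over warping factors and the Mel and Bark scales mentioned in the abstract can be outlined roughly as follows. The Mel and Bark formulas are common textbook approximations and may differ from those used in the thesis. The grid of 13 factors from 0.88 to 1.12, the linear warping function, and the toy formant-matching score are assumptions made for illustration; a maximum likelihood approach would instead score warped features against an acoustic model.

```python
import math


def hz_to_mel(f_hz):
    """Mel scale, using one common approximation (2595 * log10(1 + f/700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def hz_to_bark(f_hz):
    """Bark scale, using Traunmueller's approximation (one of several variants)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def warp_frequency(f_hz, alpha):
    """Linear frequency warping: scale the frequency axis by alpha (assumed form)."""
    return alpha * f_hz


def grid_search_alpha(score, alphas):
    """Return the candidate warping factor with the highest score."""
    return max(alphas, key=score)


# Assumed grid of candidate warping factors: 0.88, 0.90, ..., 1.12 (13 values).
ALPHAS = [round(0.88 + 0.02 * i, 2) for i in range(13)]

# Hypothetical reference formant frequencies (Hz), used only by the toy score.
REFERENCE_FORMANTS = [500.0, 1500.0, 2500.0]


def make_toy_score(speaker_formants):
    """Build a toy score: negative squared distance between the speaker's
    warped formants and the reference. This is a stand-in for the
    likelihood of warped features under an acoustic model."""
    def score(alpha):
        return -sum(
            (warp_frequency(f, alpha) - r) ** 2
            for f, r in zip(speaker_formants, REFERENCE_FORMANTS)
        )
    return score


# Toy speaker whose formants sit higher than the reference,
# as would be expected for a shorter vocal tract.
speaker = [f / 0.94 for f in REFERENCE_FORMANTS]
best_alpha = grid_search_alpha(make_toy_score(speaker), ALPHAS)
print("selected warping factor:", best_alpha)
print("1000 Hz in Mel:", hz_to_mel(1000.0), "in Bark:", hz_to_bark(1000.0))
```

In this toy setup the search selects the factor that maps the speaker's formants back onto the reference; the exhaustive search over a small fixed grid is the part that mirrors the approach described in the abstract.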
URI: http://drsr.daiict.ac.in/handle/123456789/487
Appears in Collections: M Tech Dissertations

Files in This Item:
201211015.pdf (3.14 MB, Adobe PDF, Restricted Access)