Vowel landmark detection for speech recognition
Abstract
Landmarks are the time instants in a speech utterance that mark important events such as vowels, glides and consonants. This thesis proposes a novel Vowel Landmark Detection (VLD) algorithm to locate vowel landmarks and hence the nucleus of each vowel. VLD has potential applications in Automatic Speech Recognition (ASR) and Automatic Phonetic Segmentation (APS). The proposed VLD method uses speech source information to detect vowel landmarks, which are points of high sonority. The excitation peaks in the Hilbert envelope (HE) of the Teager energy profile of the zero frequency filtered (ZFF) speech signal can be interpreted as a perceptually significant feature that contributes to loudness. The performance of the proposed VLD method is compared with an existing loudness-based method, and results are reported on the TIMIT and NTIMIT corpora. The proposed VLD algorithm achieves a detection rate of 85.48 % on TIMIT and 83.97 % on NTIMIT, which is 5.06 % and 7.51 % higher, respectively, than the existing loudness-based method. In addition, this thesis proposes the use of the VLD algorithm for two low-resource Indian languages, viz., Gujarati and Marathi. For both languages, results are reported on speech recorded in three modes, viz., read, spontaneous and lecture, with manual phonetic transcription by transcribers serving as ground truth. For Gujarati, the proposed VLD algorithm achieves detection rates of 78.92 %, 76.40 % and 73.89 % in lecture, spontaneous and read mode, respectively, which are 8.79 %, 7.23 % and 7.17 % higher than the loudness-based method. Similarly, for Marathi, it achieves detection rates of 76.93 %, 75.16 % and 73.93 % in lecture, spontaneous and read mode, respectively, which are 7.52 %, 7.43 % and 7.82 % higher than the loudness-based method. The proposed algorithm is also shown to be robust against signal degradation such as white noise.
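The signal chain named above (ZFF, then Teager energy, then Hilbert envelope, then peak picking) can be illustrated with a minimal Python sketch using NumPy. This is not the thesis implementation: the abstract gives no parameter values, so the function names, the 10 ms trend-removal window, the 50 ms smoothing window, the 100 ms minimum peak spacing and the relative height threshold are all assumptions chosen only for illustration.

```python
import numpy as np


def zero_frequency_filter(x, fs, win_ms=10.0, passes=3):
    """ZFF sketch: difference, two zero-frequency resonators, trend removal.

    win_ms and passes are assumed values, not taken from the thesis.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.diff(x, prepend=x[0])           # remove any DC offset
    for _ in range(4):                     # two resonators = 1 / (1 - z^-1)^4
        y = np.cumsum(y)
    half = max(1, int(round(win_ms * 1e-3 * fs / 2)))
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    for _ in range(passes):                # subtract the local mean (trend)
        y = y - np.convolve(y, kernel, mode="same")
    return y                               # note: samples near the edges are unreliable


def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    x = np.asarray(x, dtype=np.float64)
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi


def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed through the FFT."""
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * gain))


def vowel_landmark_candidates(x, fs, smooth_ms=50.0, min_gap_ms=100.0,
                              rel_height=0.1):
    """Return candidate landmark times in seconds (hypothetical peak picker)."""
    env = hilbert_envelope(teager_energy(zero_frequency_filter(x, fs)))
    width = max(1, int(smooth_ms * 1e-3 * fs))
    contour = np.convolve(env, np.ones(width) / width, mode="same")
    gap = int(min_gap_ms * 1e-3 * fs)
    threshold = rel_height * contour.max()
    peaks = []
    for i in range(1, len(contour) - 1):
        is_peak = contour[i - 1] <= contour[i] > contour[i + 1]
        if not is_peak or contour[i] <= threshold:
            continue
        if peaks and i - peaks[-1] < gap:
            if contour[i] > contour[peaks[-1]]:
                peaks[-1] = i              # keep the stronger of two close peaks
        else:
            peaks.append(i)
    return np.array(peaks, dtype=np.float64) / fs
```

Calling `vowel_landmark_candidates(signal, fs)` on a mono waveform returns a sorted array of candidate times. Comparing those times against transcribed vowel segments, as the thesis does for TIMIT and NTIMIT, would need a separate scoring step that is not sketched here.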
The second part of the thesis addresses recognition of the detected vowel landmarks. A formant-based technique is used to recognize the detected vowels. Results are reported on the phonetically transcribed TIMIT corpus. The recognition rate is 32.16 % on the correctly detected vowels (i.e., out of 78374 vowels, 66994 are detected correctly, and 21545 of those are recognized). The proposed method is very fast and requires no training.
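The counts quoted above can be checked with a few lines of Python. The three counts come from the abstract. The constant names and the `percent` helper are illustrative, and the end-to-end figure (recognized vowels over all vowels) is derived here from those counts rather than reported in the thesis.

```python
# Counts quoted in the abstract for the TIMIT corpus.
TOTAL_VOWELS = 78374      # vowels in the phonetic transcription
DETECTED = 66994          # vowels detected correctly
RECOGNIZED = 21545        # detected vowels that were also recognized


def percent(part, whole):
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


detection_rate = percent(DETECTED, TOTAL_VOWELS)
recognition_rate = percent(RECOGNIZED, DETECTED)
end_to_end_rate = percent(RECOGNIZED, TOTAL_VOWELS)   # derived, not in the abstract

print(f"detection   : {detection_rate:.2f} %")    # 85.48 %
print(f"recognition : {recognition_rate:.2f} %")  # 32.16 %
print(f"end to end  : {end_to_end_rate:.2f} %")   # 27.49 %
```

The first ratio matches the 85.48 % detection rate reported for TIMIT. The second confirms that the 32.16 % recognition rate is measured against correctly detected vowels, not against all vowels in the corpus.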
Collections
- M Tech Dissertations [923]