Design of QbE-STD System: audio representation and matching perspective

Madhavi, Maulik C.

Date

2017

Abstract

The retrieval of spoken documents and the detection of a query (keyword) within an audio document have attracted considerable research interest. The problem of retrieving audio documents and detecting the query using a spoken form of the query is widely known as Query-by-Example Spoken Term Detection (QbE-STD). This thesis presents the design of a QbE-STD system from the representation and matching perspective. The speech spectrum is known to be affected by variations in the length of a speaker's vocal tract, owing to the inverse relation between formants and vocal tract length. The process of compensating for the spectral variation caused by vocal tract length is popularly known as Vocal Tract Length Normalization (VTLN), especially in the speech recognition literature. VTLN is an important speaker normalization technique for speech recognition tasks. In this context, this thesis proposes the use of a Gaussian posteriorgram of VTL-warped spectral features for the QbE-STD task. This study presents a novel use of the Gaussian Mixture Model (GMM) framework for VTLN warping factor estimation. In particular, the presented GMM framework does not require phoneme-level transcription and hence can be useful for unsupervised tasks. In addition, we propose the use of a mixture of GMMs for posteriorgram design. Speech data exhibits acoustically similar broad phonetic structures. To capture this broad phonetic structure, we exploit supplementary knowledge of broad phoneme classes (such as vowels, semi-vowels, nasals, fricatives, and plosives) for the training of the GMMs. The mixture of GMMs is tied to the GMMs of these broad phoneme classes. A GMM trained without supervision assumes uniform priors for each Gaussian component, whereas a mixture of GMMs assigns the prior probability based on the broad phoneme class. The novelty of our work lies in the prior probability assignments (as weights of the mixture of GMMs) for better Gaussian posteriorgram design.
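As a rough illustration of what a Gaussian posteriorgram is, the sketch below computes, for each feature frame, the posterior probability of every Gaussian component given its prior weight, mean, and diagonal variance. It is not the thesis implementation: the function names, the diagonal-covariance assumption, and the toy one-dimensional values are all hypothetical, chosen only to show how changing the prior weights changes the posteriors.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gaussian_posteriorgram(frames, weights, means, variances):
    """Return, for each frame, the posterior probability of every component."""
    posteriorgram = []
    for x in frames:
        # Unnormalised log posterior: log prior + log likelihood.
        log_p = [
            math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        # Normalise with the log-sum-exp trick for numerical stability.
        peak = max(log_p)
        total = sum(math.exp(lp - peak) for lp in log_p)
        posteriorgram.append([math.exp(lp - peak) / total for lp in log_p])
    return posteriorgram


# Toy 1-D example with two components (all values are illustrative only).
frames = [[0.0], [1.0], [0.5]]
means = [[0.0], [1.0]]
variances = [[0.25], [0.25]]

# Equal prior weights versus non-uniform prior weights (hypothetical numbers).
uniform = gaussian_posteriorgram(frames, [0.5, 0.5], means, variances)
skewed = gaussian_posteriorgram(frames, [0.8, 0.2], means, variances)
```

For a frame that is equally likely under both components (the third toy frame), the posterior reduces to the prior weights, which is why the choice of priors matters most for acoustically ambiguous frames.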
In realistic scenarios, there is a need to retrieve a query that does not appear exactly in the spoken document; the instance that does appear might have a different suffix, prefix, or word order. The Dynamic Time Warping (DTW) algorithm aligns two sequences monotonically and hence is not suitable for partial matching between the frame sequences of the query and the test utterance. We propose a novel partial matching approach between the spoken query and the utterance using a modified DTW algorithm, in which multiple warping paths are constructed for each query and test utterance pair. This partial matching approach improves the detection of non-exact queries in realistic scenarios, where both exact and non-exact queries are present. Next, we address the research issue associated with the search complexity of the DTW algorithm and suggest two approaches, namely, a feature reduction approach and a segment-level Bag-of-Acoustic-Words (BoAW) model. In the feature reduction approach, the number of feature vectors is reduced by averaging across consecutive frames within phonetic boundaries. Fewer feature vectors require fewer comparison operations, which speeds up the DTW search. In the BoAW model, we construct term frequency-inverse document frequency (tf-idf) vectors at the segment level to retrieve audio documents. The proposed segment-level BoAW model matches a test utterance with a query using the tf-idf vectors, and the scores obtained are used to rank the test utterances. Both of these search space reduction approaches speed up execution with a slight degradation in search performance. We propose two-stage approaches for re-scoring the detection hypothesis with the help of acoustic features and detection sources. First, we explored several acoustic features to re-score the detection hypothesis. 
The second approach considers additional detection sources, such as the depth of the detection valley and term frequency, the Self-Similarity Matrix (SSM), Pseudo Relevance Feedback (PRF), and a weighted mean feature with Gaussian and phonetic posteriorgrams. These two-stage approaches improve the detection performance by re-scoring the hypotheses of a single QbE-STD system. Finally, the thesis concludes by presenting a few miscellaneous studies, a summary of the entire thesis, and a few potential future research directions.
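To make the matching side more concrete, the following sketch shows two generic building blocks that the abstract refers to: averaging consecutive frames within given segment boundaries, and a DTW search that lets the query start and end anywhere in the utterance. This is the standard subsequence-DTW formulation, not the modified multi-path DTW proposed in the thesis. The function names, the negative-log inner-product distance, the length normalisation, and the assumption that segment boundaries are supplied as frame indices are all illustrative choices.

```python
import math


def frame_distance(p, q):
    """Negative log of the inner product between two posterior vectors."""
    return -math.log(max(sum(a * b for a, b in zip(p, q)), 1e-12))


def average_segments(frames, boundaries):
    """Replace each segment [start, end) of frames by the mean of its frames."""
    reduced = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        segment = frames[start:end]
        dim = len(segment[0])
        reduced.append(
            [sum(f[d] for f in segment) / len(segment) for d in range(dim)]
        )
    return reduced


def subsequence_dtw(query, utterance):
    """Return (normalised cost, end frame) of the best query match in utterance."""
    n, m = len(query), len(utterance)
    inf = float("inf")
    # acc[i][j]: best accumulated cost of aligning query[:i + 1]
    # to a stretch of the utterance that ends at frame j.
    acc = [[inf] * m for _ in range(n)]
    for j in range(m):
        # The query may start at any utterance frame at no extra cost.
        acc[0][j] = frame_distance(query[0], utterance[j])
    for i in range(1, n):
        for j in range(m):
            best_prev = acc[i - 1][j]
            if j > 0:
                best_prev = min(best_prev, acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = frame_distance(query[i], utterance[j]) + best_prev
    # The query may also end at any utterance frame.
    end = min(range(m), key=lambda j: acc[n - 1][j])
    return acc[n - 1][end] / n, end


# Toy posteriorgrams over two acoustic classes (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
utterance = [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
cost, end_frame = subsequence_dtw(query, utterance)
```

The accumulated-cost table has one cell per query-frame and utterance-frame pair, so shortening either sequence with `average_segments` before the search reduces the number of distance computations proportionally.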

URI

http://drsr.daiict.ac.in//handle/123456789/649

Collections

PhD Theses [87]