Please use this identifier to cite or link to this item: http://drsr.daiict.ac.in//handle/123456789/624
Title: Representation learning for speech technology applications using deep learning methods
Authors: Patil, Hemant A.
Soni, Meetkumar H.
Keywords: Learning Method
Artificial Neural Network
Subband Autoencoder
Natural and Spoofed Speech
Issue Date: 2016
Publisher: Dhirubhai Ambani Institute of Information and Communication Technology
Citation: Soni, Meetkumar H. (2016). Representation learning for speech technology applications using deep learning methods. Dhirubhai Ambani Institute of Information and Communication Technology, xiii, 97p. (Acc.No: T00587)
Abstract: In context of speech, deep learning architectures such as autoencoder and ConvolutionalNeural Network (CNN) are being used in the field of Automatic SpeechRecognition (ASR). However, they are seldom used as representation learning architectureto extract features, rather they are being used as performance enhancementof handcrafted features or as a strong back-end for performance improvement.Mainly, autoencoder features, which are proven to have excellent reconstructionability of speech spectrum are still unexplored in many speech technologyapplication. The lack of use of autoencoder features motivated the authors toexplore their usefulness in various speech technology applications such as nonintrusiveobjective quality assessment of noise suppressed speech, detection ofspoofed speech as a front-end of Automatic Speaker Verification (ASV) systemsand in Automatic Speech Recognition (ASR) systems as primary acoustic features.Moreover, the reasons for unpopularity of using autoencoder features motivatedauthors to overcome their limitations by modifying architecture of autoencoderand developing a new one, namely, subband autoencoder (SBAE). Proposed SBAEarchitecture is inspired by Human Auditory System (HAS) and extracts more interpretablefeatures from speech spectrum than autoencoder features in nonlinear,unsupervised manner. The performance of autoencoder features and SBAEfeatures is compared with state-of-the-art handcrafted features used in that particularapplication. Results of experiments for quality assessment of noise suppressedspeech suggest that autoencoder features and SBAE features perform significantlybetter and give more robust performance than state-of-the-art Mel filterbankenergies (FBE) with SBAE features giving the best performance. For spoofdetection task, SBAE features gave overall better performance than state-of-theartMel-Frequency Cepstral Coefficients (MFCC). However the best performancewas achieved using score-level fusion of both features. Autoencoder features performedpoorly in spoof detection task. In ASR experiments, FBE performed betterthan autoencoder features and SBAE features. However, when system-level combinationwas done, SBAE features improved performance of FBEs significantly,which suggests complementary nature of information captured by both features. System combination of SBAE features and FBE gave better performance than systemcombination of FBE and autoencoder features. For quality assessment of synthesizedspeech task, SBAE features performed significantly better than FBE. Theresults of these experiments suggest that proposed SBAE architecture improvesover traditional autoencoder in terms of usefulness of extracted features. The natureof the features extracted by SBAE was complementary to that of MFCC orFBE due to nonlinear processing involved in their extraction.
URI: http://drsr.daiict.ac.in/handle/123456789/624
Appears in Collections:M Tech Dissertations

Files in This Item:
File Description SizeFormat 
201411038.pdf
  Restricted Access
4.9 MBAdobe PDFThumbnail
View/Open Request a copy


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.