Representation learning for speech technology applications using deep learning methods

Soni, Meetkumar H.

Please use this identifier to cite or link to this item: http://drsr.daiict.ac.in//handle/123456789/624

Title:	Representation learning for speech technology applications using deep learning methods
Authors:	Patil, Hemant A.; Soni, Meetkumar H.
Keywords:	Learning Method; Artificial Neural Network; Subband Autoencoder; Natural and Spoofed Speech
Issue Date:	2016
Publisher:	Dhirubhai Ambani Institute of Information and Communication Technology
Citation:	Soni, Meetkumar H. (2016). Representation learning for speech technology applications using deep learning methods. Dhirubhai Ambani Institute of Information and Communication Technology, xiii, 97p. (Acc.No: T00587)
Abstract:	In the context of speech, deep learning architectures such as the autoencoder and the Convolutional Neural Network (CNN) are being used in the field of Automatic Speech Recognition (ASR). However, they are seldom used as representation learning architectures to extract features; rather, they are used to enhance the performance of handcrafted features or as a strong back-end for performance improvement. In particular, autoencoder features, which are proven to have excellent reconstruction ability for the speech spectrum, are still unexplored in many speech technology applications. The lack of use of autoencoder features motivated the authors to explore their usefulness in various speech technology applications, such as non-intrusive objective quality assessment of noise-suppressed speech, detection of spoofed speech as a front-end of Automatic Speaker Verification (ASV) systems, and ASR systems, where they serve as primary acoustic features. Moreover, the reasons for the unpopularity of autoencoder features motivated the authors to overcome their limitations by modifying the architecture of the autoencoder and developing a new one, namely, the subband autoencoder (SBAE). The proposed SBAE architecture is inspired by the Human Auditory System (HAS) and extracts more interpretable features from the speech spectrum than autoencoder features, in a nonlinear, unsupervised manner. The performance of autoencoder features and SBAE features is compared with the state-of-the-art handcrafted features used in each particular application. Results of experiments on quality assessment of noise-suppressed speech suggest that autoencoder features and SBAE features perform significantly better and give more robust performance than state-of-the-art Mel filterbank energies (FBE), with SBAE features giving the best performance. For the spoof detection task, SBAE features gave better overall performance than state-of-the-art Mel-Frequency Cepstral Coefficients (MFCC). However, the best performance was achieved using score-level fusion of both features. Autoencoder features performed poorly in the spoof detection task. In ASR experiments, FBE performed better than autoencoder features and SBAE features. However, when system-level combination was done, SBAE features improved the performance of FBE significantly, which suggests the complementary nature of the information captured by both features. System combination of SBAE features and FBE gave better performance than system combination of FBE and autoencoder features. For the task of quality assessment of synthesized speech, SBAE features performed significantly better than FBE. The results of these experiments suggest that the proposed SBAE architecture improves over the traditional autoencoder in terms of the usefulness of the extracted features. The nature of the features extracted by SBAE was complementary to that of MFCC or FBE due to the nonlinear processing involved in their extraction.
URI:	http://drsr.daiict.ac.in/handle/123456789/624
Appears in Collections:	M Tech Dissertations

Files in This Item:

File	Description	Size	Format
201411038.pdf	Restricted Access	4.9 MB	Adobe PDF
