
    Representation learning for speech technology applications using deep learning methods

    201411038.pdf (4.782Mb)
    Date
    2016
    Author
    Soni, Meetkumar H.
    Abstract
    In the context of speech, deep learning architectures such as the autoencoder and the Convolutional Neural Network (CNN) are widely used in Automatic Speech Recognition (ASR). However, they are seldom used as representation learning architectures to extract features; rather, they serve to enhance handcrafted features or act as a strong back-end for performance improvement. In particular, autoencoder features, which are proven to reconstruct the speech spectrum well, remain unexplored in many speech technology applications. This gap motivated the authors to explore their usefulness in several speech technology applications: non-intrusive objective quality assessment of noise-suppressed speech, detection of spoofed speech as a front-end for Automatic Speaker Verification (ASV) systems, and as primary acoustic features in Automatic Speech Recognition (ASR) systems. Moreover, the limitations that make autoencoder features unpopular motivated the authors to modify the autoencoder architecture and develop a new one, namely, the subband autoencoder (SBAE). The proposed SBAE architecture is inspired by the Human Auditory System (HAS) and extracts more interpretable features from the speech spectrum than a conventional autoencoder, in a nonlinear, unsupervised manner. The performance of autoencoder features and SBAE features is compared with the state-of-the-art handcrafted features used in each application. Experiments on quality assessment of noise-suppressed speech suggest that autoencoder features and SBAE features perform significantly better and more robustly than state-of-the-art Mel filterbank energies (FBE), with SBAE features giving the best performance. For the spoof detection task, SBAE features gave better overall performance than state-of-the-art Mel-Frequency Cepstral Coefficients (MFCC); however, the best performance was achieved by score-level fusion of both features.
    Autoencoder features performed poorly in the spoof detection task. In ASR experiments, FBE performed better than autoencoder features and SBAE features. However, when system-level combination was done, SBAE features improved the performance of FBE significantly, which suggests that the two feature sets capture complementary information. System combination of SBAE features and FBE gave better performance than system combination of FBE and autoencoder features. For the quality assessment of synthesized speech task, SBAE features performed significantly better than FBE. These results suggest that the proposed SBAE architecture improves over the traditional autoencoder in terms of the usefulness of the extracted features. The features extracted by the SBAE are complementary to MFCC or FBE due to the nonlinear processing involved in their extraction.
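    The core ideas in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the author's code: the encoder of a plain autoencoder connects every hidden unit to the full spectrum, while a subband restriction (the general idea behind the SBAE) ties each hidden unit to one band of spectral bins; score-level fusion then combines two classifiers' scores with a weight. The dimensions, weight values, subband layout, and fusion weight here are all assumptions made for the sketch.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def encode(spectrum, W, b):
        """Encoder of an autoencoder: hidden activations serve as learned features."""
        return sigmoid(W @ spectrum + b)

    def subband_mask(n_hidden, n_bins):
        """Restrict each hidden unit to one contiguous band of spectral bins,
        loosely mimicking the band-limited analysis of the auditory system."""
        mask = np.zeros((n_hidden, n_bins))
        edges = np.linspace(0, n_bins, n_hidden + 1).astype(int)
        for i in range(n_hidden):
            mask[i, edges[i]:edges[i + 1]] = 1.0
        return mask

    n_bins, n_hidden = 257, 32          # e.g. magnitude spectrum of a 512-point FFT frame
    spectrum = rng.random(n_bins)       # stand-in for one speech spectrum frame
    W = 0.1 * rng.standard_normal((n_hidden, n_bins))
    b = np.zeros(n_hidden)

    ae_features = encode(spectrum, W, b)                                  # dense connectivity
    sbae_features = encode(spectrum, W * subband_mask(n_hidden, n_bins), b)  # subband-restricted

    # Score-level fusion, as used for spoof detection: weighted sum of two
    # system scores (values here are placeholders).
    s_mfcc, s_sbae, alpha = 0.8, 0.6, 0.5
    fused_score = alpha * s_mfcc + (1 - alpha) * s_sbae
    ```

    In a real system the weights `W`, `b` would be learned by minimizing spectrum reconstruction error, and the fusion weight would be tuned on a development set; the sketch only shows where the subband constraint enters relative to a plain autoencoder.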
    URI
    http://drsr.daiict.ac.in/handle/123456789/624
    Collections
    • M Tech Dissertations [923]

    Resource Centre copyright © 2006-2017 
    Contact Us | Send Feedback
    Theme by 
    Atmire NV
     

     

