
    Representation learning for speech technology applications using deep learning methods

    201411038.pdf (4.782Mb)
    Date
    2016
    Author
    Soni, Meetkumar H.
    Abstract
    In the context of speech, deep learning architectures such as the autoencoder and the Convolutional Neural Network (CNN) are widely used in Automatic Speech Recognition (ASR). However, they are seldom used as representation learning architectures to extract features; rather, they serve to enhance handcrafted features or act as a strong back-end for performance improvement. In particular, autoencoder features, which are proven to reconstruct the speech spectrum well, remain unexplored in many speech technology applications. This gap motivated the authors to explore their usefulness in several speech technology applications: non-intrusive objective quality assessment of noise-suppressed speech, detection of spoofed speech as a front-end for Automatic Speaker Verification (ASV) systems, and as primary acoustic features in Automatic Speech Recognition (ASR) systems. Moreover, the limitations that make autoencoder features unpopular motivated the authors to modify the autoencoder architecture and develop a new one, namely, the subband autoencoder (SBAE). The proposed SBAE architecture is inspired by the Human Auditory System (HAS) and extracts more interpretable features from the speech spectrum than a conventional autoencoder, in a nonlinear, unsupervised manner. The performance of autoencoder features and SBAE features is compared with the state-of-the-art handcrafted features used in each application. Experiments on quality assessment of noise-suppressed speech suggest that autoencoder features and SBAE features perform significantly better and more robustly than state-of-the-art Mel filterbank energies (FBE), with SBAE features giving the best performance. For the spoof detection task, SBAE features gave better overall performance than state-of-the-art Mel-Frequency Cepstral Coefficients (MFCC); however, the best performance was achieved by score-level fusion of both features.
    Autoencoder features performed poorly in the spoof detection task. In ASR experiments, FBE performed better than autoencoder features and SBAE features. However, when system-level combination was done, SBAE features improved the performance of FBE significantly, which suggests that the two feature sets capture complementary information. System combination of SBAE features and FBE gave better performance than system combination of FBE and autoencoder features. For the quality assessment of synthesized speech task, SBAE features performed significantly better than FBE. These results suggest that the proposed SBAE architecture improves over the traditional autoencoder in terms of the usefulness of the extracted features. The features extracted by the SBAE are complementary to MFCC or FBE due to the nonlinear processing involved in their extraction.
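    The core ideas in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the author's code: the encoder of a plain autoencoder connects every hidden unit to the full spectrum, while a subband restriction (the general idea behind the SBAE) ties each hidden unit to one band of spectral bins; score-level fusion then combines two classifiers' scores with a weight. The dimensions, weight values, subband layout, and fusion weight here are all assumptions made for the sketch.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def encode(spectrum, W, b):
        """Encoder of an autoencoder: hidden activations serve as learned features."""
        return sigmoid(W @ spectrum + b)

    def subband_mask(n_hidden, n_bins):
        """Restrict each hidden unit to one contiguous band of spectral bins,
        loosely mimicking the band-limited analysis of the auditory system."""
        mask = np.zeros((n_hidden, n_bins))
        edges = np.linspace(0, n_bins, n_hidden + 1).astype(int)
        for i in range(n_hidden):
            mask[i, edges[i]:edges[i + 1]] = 1.0
        return mask

    n_bins, n_hidden = 257, 32          # e.g. magnitude spectrum of a 512-point FFT frame
    spectrum = rng.random(n_bins)       # stand-in for one speech spectrum frame
    W = 0.1 * rng.standard_normal((n_hidden, n_bins))
    b = np.zeros(n_hidden)

    ae_features = encode(spectrum, W, b)                                  # dense connectivity
    sbae_features = encode(spectrum, W * subband_mask(n_hidden, n_bins), b)  # subband-restricted

    # Score-level fusion, as used for spoof detection: weighted sum of two
    # system scores (values here are placeholders).
    s_mfcc, s_sbae, alpha = 0.8, 0.6, 0.5
    fused_score = alpha * s_mfcc + (1 - alpha) * s_sbae
    ```

    In a real system the weights `W`, `b` would be learned by minimizing spectrum reconstruction error, and the fusion weight would be tuned on a development set; the sketch only shows where the subband constraint enters relative to a plain autoencoder.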
    URI
    http://drsr.daiict.ac.in/handle/123456789/624
    Collections
    • M Tech Dissertations [923]

    Resource Centre copyright © 2006-2017 
    Contact Us | Send Feedback
    Theme by 
    Atmire NV
     

     

