• Login
    JavaScript is disabled for your browser. Some features of this site may not work without it.

    Browse

    All of DSpaceCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsThis CollectionBy Issue DateAuthorsTitlesSubjects

    My Account

    LoginRegister

    Statistics

    View Usage StatisticsView Google Analytics Statistics

    Generative Adversarial Networks for Speech Technology Applications

    Thumbnail
    View/Open
    201611055 (3.438Mb)
    Date
    2018
    Author
    Shah, Neil
    Metadata
    Show full item record
    Abstract
    The deep learning renaissance has enabled the machines to understand the observed data in terms of a hierarchy of representations. This allows the machines to learn complicated nonlinear relationships between the representative pairs. In context of the speech, deep learning architectures, such as Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs) are the traditional supervised learning algorithms employing Maximum Likelihood (ML)-based optimization. These techniques reduce the numerical estimates between the generated and the groundtruth. However, the performance gap between the generated representation and the groundtruth in various speech applications is due to the fact that the numerical estimation may not correlate with the human perception mechanism. On the other hand, the Generative Adversarial Networks (GANs) reduces the distributional divergence, rather than minimizing the numerical errors and hence, may synthesize the samples with improved perceptual quality. However, the vanilla GAN (v-GAN) architecture generates the spectrum that may belong to the true desired distribution but may not correspond to the given spectral frames at the input. To address this issue, the Minimum Mean Square Error (MMSE) regularized, MMSE-GAN and CNN-GAN architectures are proposed for the Speech Enhancement (SE) task. The objective evaluation shows the improvement in the speech quality and suppression of the background interferences over the state-ofthe- art techniques. The effectiveness of the proposed MMSE-GAN is explored in other speech technology applications, such as Non-Audible Murmur-to-Whisper Speech Conversion (NAM2WHSP), Query-by-Example Spoken Term Detection (QbE-STD), and Voice Conversion (VC). In QbE-STD, a DNN-based GAN with a cross-entropy regularization is proposed for extracting an unsupervised posterior feature representation (uGAN-PG), trained on labeled Gaussian Mixture Model (GMM) posteriorgram. Moreover, the ability of Wasserstein GAN (WGAN) in improving the optimization stability and providing a meaningful loss metric that correlates to the generated sample quality and the generator's convergence is also exploited. To that effect, MMSE-WGAN is proposed for the VC task and its performance is compared with the MMSE-GAN and DNN-based approaches.
    URI
    http://drsr.daiict.ac.in//handle/123456789/772
    Collections
    • M Tech Dissertations [820]

    Resource Centre copyright © 2006-2017 
    Contact Us | Send Feedback
    Theme by 
    Atmire NV
     

     


    Resource Centre copyright © 2006-2017 
    Contact Us | Send Feedback
    Theme by 
    Atmire NV