dc.description.abstract | The deep learning renaissance has enabled machines to understand the observed data in terms of a hierarchy of representations, allowing them to learn complicated nonlinear relationships between paired representations. In the context of speech, deep learning architectures, such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), are traditional supervised learning algorithms employing Maximum Likelihood (ML)-based optimization.
These techniques minimize a numerical error between the generated representation and the ground truth. However, the performance gap between the generated representation and the ground truth in various speech applications stems from the fact that such numerical error estimates may not correlate with the human perception mechanism.
On the other hand, Generative Adversarial Networks (GANs) reduce the distributional divergence rather than minimizing numerical errors and, hence, may synthesize samples with improved perceptual quality. However, the vanilla GAN (v-GAN) architecture generates a spectrum that may belong to the true desired distribution but may not correspond to the spectral frames given at the input. To address this issue, the Minimum Mean Square Error (MMSE)-regularized
MMSE-GAN and CNN-GAN architectures are proposed for the Speech Enhancement (SE) task. The objective evaluation shows improvements in speech quality and suppression of background interferences over state-of-the-art techniques. The effectiveness of the proposed MMSE-GAN is further explored in
other speech technology applications, such as Non-Audible Murmur-to-Whisper
Speech Conversion (NAM2WHSP), Query-by-Example Spoken Term Detection
(QbE-STD), and Voice Conversion (VC). For QbE-STD, a DNN-based GAN with cross-entropy regularization is proposed to extract an unsupervised posterior feature representation (uGAN-PG), trained on labeled Gaussian Mixture Model (GMM) posteriorgrams. Moreover, the ability of the Wasserstein GAN (WGAN) to improve optimization stability and to provide a meaningful loss metric that correlates with the quality of the generated samples and the generator's convergence is also exploited. To that end, MMSE-WGAN is proposed for the VC task and its performance is compared with the MMSE-GAN and DNN-based approaches. | |
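The MMSE regularization described above can be sketched as a combined generator objective: an adversarial term pulls the generated spectrum toward the true data distribution, while an MMSE term ties it to the ground-truth frames so the output still corresponds to the given input. The following is a minimal illustrative sketch, not the thesis implementation; the function name, the non-saturating adversarial form, and the weight `lam` are assumptions for illustration only.

```python
import math

def mmse_gan_generator_loss(disc_scores, generated, target, lam=100.0):
    """Illustrative MMSE-regularized GAN generator loss (a sketch).

    disc_scores: discriminator outputs on generated samples, each in (0, 1);
                 the adversarial term pushes them toward 1 (the "real" label).
    generated, target: generated and ground-truth spectral values; the MMSE
                 term penalizes their squared difference so the generated
                 spectrum stays aligned with the input frames.
    lam:         assumed regularization weight balancing the two terms.
    """
    eps = 1e-12  # guard against log(0)
    # Non-saturating adversarial term: -mean(log D(G(x)))
    adversarial = -sum(math.log(s + eps) for s in disc_scores) / len(disc_scores)
    # MMSE regularizer: mean squared error against the ground truth
    mmse = sum((g - t) ** 2 for g, t in zip(generated, target)) / len(generated)
    return adversarial + lam * mmse
```

With a perfect discriminator fooling (scores near 1) and a generated spectrum matching the ground truth, the loss approaches zero; any numerical mismatch raises it through the MMSE term, which is the mechanism that keeps v-GAN outputs tied to their input frames.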