dc.description.abstract | The deep learning renaissance has enabled machines to understand the observed data in terms of a hierarchy of representations, allowing them to learn complicated nonlinear relationships between paired representations. In the context of speech, deep learning architectures, such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), are traditional supervised learning algorithms employing Maximum Likelihood (ML)-based optimization.
These techniques minimize a numerical error between the generated representation and the ground truth. However, the performance gap between the generated representation and the ground truth in various speech applications stems from the fact that such numerical error estimates may not correlate with the human perception mechanism.
On the other hand, Generative Adversarial Networks (GANs) reduce the distributional divergence rather than minimizing numerical errors and, hence, may synthesize samples with improved perceptual quality. However, the vanilla GAN (v-GAN) architecture generates a spectrum that may belong to the true desired distribution but may not correspond to the spectral frames given at the input. To address this issue, the Minimum Mean Square Error (MMSE)-regularized
MMSE-GAN and CNN-GAN architectures are proposed for the Speech Enhancement (SE) task. The objective evaluation shows improvements in speech quality and suppression of background interferences over state-of-the-art techniques. The effectiveness of the proposed MMSE-GAN is further explored in
other speech technology applications, such as Non-Audible Murmur-to-Whisper
Speech Conversion (NAM2WHSP), Query-by-Example Spoken Term Detection
(QbE-STD), and Voice Conversion (VC). For QbE-STD, a DNN-based GAN with cross-entropy regularization is proposed to extract an unsupervised posterior feature representation (uGAN-PG), trained on labeled Gaussian Mixture Model (GMM) posteriorgrams. Moreover, the ability of the Wasserstein GAN (WGAN) to improve optimization stability and to provide a meaningful loss metric that correlates with the quality of the generated samples and the generator's convergence is also exploited. To that end, MMSE-WGAN is proposed for the VC task and its performance is compared with the MMSE-GAN and DNN-based approaches. | |
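The MMSE regularization described above can be sketched as a combined generator objective: an adversarial term pulls the generated spectrum toward the true data distribution, while an MMSE term ties it to the ground-truth frames so the output still corresponds to the given input. The following is a minimal illustrative sketch, not the thesis implementation; the function name, the non-saturating adversarial form, and the weight `lam` are assumptions for illustration only.

```python
import math

def mmse_gan_generator_loss(disc_scores, generated, target, lam=100.0):
    """Illustrative MMSE-regularized GAN generator loss (a sketch).

    disc_scores: discriminator outputs on generated samples, each in (0, 1);
                 the adversarial term pushes them toward 1 (the "real" label).
    generated, target: generated and ground-truth spectral values; the MMSE
                 term penalizes their squared difference so the generated
                 spectrum stays aligned with the input frames.
    lam:         assumed regularization weight balancing the two terms.
    """
    eps = 1e-12  # guard against log(0)
    # Non-saturating adversarial term: -mean(log D(G(x)))
    adversarial = -sum(math.log(s + eps) for s in disc_scores) / len(disc_scores)
    # MMSE regularizer: mean squared error against the ground truth
    mmse = sum((g - t) ** 2 for g, t in zip(generated, target)) / len(generated)
    return adversarial + lam * mmse
```

With a perfect discriminator fooling (scores near 1) and a generated spectrum matching the ground truth, the loss approaches zero; any numerical mismatch raises it through the MMSE term, which is the mechanism that keeps v-GAN outputs tied to their input frames.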