PhD Theses
Permanent URI for this collection: http://drsr.daiict.ac.in/handle/123456789/2
3 results
Item (Open Access): Design of spoof speech detection system: Teager energy-based approach (Dhirubhai Ambani Institute of Information and Communication Technology, 2021). Kamble, Madhu R.; Patil, Hemant A.

Automatic Speaker Verification (ASV) systems are vulnerable to various spoofing attacks, namely, Speech Synthesis (SS), Voice Conversion (VC), Replay, and Impersonation. The study of spoofing countermeasures has become increasingly important and is currently a critical area of research, which is the principal objective of this thesis. With the development of Neural Network-based techniques, in particular for machine-generated spoof speech signals, the Spoof Speech Detection (SSD) task becomes even more challenging. To encourage the development of countermeasures based on signal processing techniques or neural network-based features for the SSD task, standardized datasets were provided by the organizers of the ASVspoof challenge campaigns in 2015, 2017, and 2019. The front-end features extracted from the speech signal have a major impact in signal processing applications. The goal of feature extraction is to estimate meaningful information directly from the speech signal that can help the pattern classifier in speech, speaker, and emotion recognition, among other applications. Among the various spoofing attacks, speech synthesis, voice conversion, and replay have been identified as the most effective and accessible. Accordingly, this thesis investigates and develops a framework to extract discriminative features to detect these three spoofing attacks. The main contribution of the thesis is to propose various feature sets as front-end countermeasures for the SSD task using a traditional Gaussian Mixture Model (GMM)-based classification system.
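The GMM-based back-end referred to above typically scores a trial as a log-likelihood ratio between models trained on natural and on spoofed speech. The following is only a minimal sketch of that decision rule, assuming a single diagonal Gaussian per class (a 1-component GMM) and synthetic 2-D features in place of real cepstral features; it is not the thesis implementation.

```python
import numpy as np

def fit_gaussian(features):
    """Fit a single diagonal Gaussian (a 1-component GMM) to frame features."""
    mu = features.mean(axis=0)
    var = features.var(axis=0) + 1e-6  # variance floor for numerical safety
    return mu, var

def frame_loglik(features, mu, var):
    """Per-frame log-likelihood under a diagonal Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (features - mu) ** 2 / var).sum(axis=1)

def llr_score(test_features, natural_model, spoof_model):
    """Average log-likelihood ratio: positive suggests natural, negative spoof."""
    return (frame_loglik(test_features, *natural_model).mean()
            - frame_loglik(test_features, *spoof_model).mean())

# Synthetic 2-D "features": natural frames near 0, spoofed frames near 3
rng = np.random.default_rng(0)
natural = rng.normal(0.0, 1.0, size=(500, 2))
spoof = rng.normal(3.0, 1.0, size=(500, 2))
nat_model, spf_model = fit_gaussian(natural), fit_gaussian(spoof)
trial = rng.normal(0.0, 1.0, size=(100, 2))
print(llr_score(trial, nat_model, spf_model) > 0)  # natural-like trial scores positive
```

In practice each class model is a full multi-component GMM trained on cepstral features (e.g. TECC), but the scoring rule is the same likelihood-ratio comparison.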
The feature sets are based on the Teager Energy Operator (TEO) and the Energy Separation Algorithm (ESA), namely, Teager Energy Cepstral Coefficients (TECC), Energy Separation Algorithm Instantaneous Frequency Cepstral Coefficients (ESA-IFCC), Energy Separation Algorithm Instantaneous Amplitude Cepstral Coefficients (ESA-IACC), Amplitude-Weighted Frequency Cepstral Coefficients (AWFCC), and the Gabor Teager Filterbank (GTFB). The motivation for using TEO is its nonlinear speech production property. TEO is known to estimate the true total source energy, and it preserves the amplitude and frequency modulation of a resonant signal; hence, it improves the time-frequency resolution along with the representation of formant information. In addition, TEO has a noise suppression property and attempts to remove the distortion caused by the noise signal. In Chapter 3, we analyze the replay speech signal in terms of the reverberation that occurs during recording. Reverberation introduces delays and amplitude changes, producing close copies of the speech signal that significantly influence the replay components. To that effect, we propose to exploit the capability of the TEO to obtain a running estimate of subband energies for replay vs. genuine signals. We use a linearly-spaced Gabor filterbank to obtain narrowband filtered signals, since the TEO can track the instantaneous changes of a signal. In Chapter 4, we propose Instantaneous Amplitude (IA) and Instantaneous Frequency (IF) features using the Energy Separation Algorithm (ESA). The speech signal is passed through bandpass filters to obtain narrowband components, because speech is a combination of several monocomponent signals. To obtain narrowband filtered signals, we use linearly-spaced Butterworth and Gabor filterbanks. The instantaneous modulations help to understand the local characteristics of a non-stationary signal.
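The discrete-time TEO underlying these feature sets is commonly defined as Ψ[x(n)] = x(n)² − x(n−1)·x(n+1); for a pure tone A·cos(ωn) it returns the constant A²·sin²(ω), so a single operation jointly tracks amplitude and frequency. A small sketch of this property (illustrative only, not the thesis code):

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*cos(omega*n) the operator yields exactly A^2*sin(omega)^2,
# a constant that depends on both the amplitude and the frequency of the tone.
n = np.arange(1000)
A, omega = 2.0, 0.2
psi = teager_energy(A * np.cos(omega * n))
print(np.allclose(psi, (A * np.sin(omega)) ** 2))
```

Applied to the output of each Gabor subband filter, this running energy estimate is the basis of the TECC-style features described above.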
These IA and IF components capture the information present in the slowly-varying amplitude envelope and the fast-varying frequency. For replay speech, the slowly-varying temporal modulations have a distorted amplitude envelope, and the fast-varying temporal modulations do not preserve the harmonic structure of the natural speech signal. For a replay speech signal, the intermediate device characteristics and acoustic environment distort the spectral energy relative to the natural speech energy. In Chapter 5, we extend our earlier work with the generalized TEO, i.e., by varying the past and future sample instants with an arbitrary integer k, known as the lag parameter or dependency index, and name it the Variable-length Teager Energy Operator (VTEO). In Chapter 6, we propose the combination of Amplitude Modulation and Frequency Modulation (AM-FM) features for the replay Spoof Speech Detection (SSD) task. The AM components are known to be affected by noise (in this case, due to the replay mechanism). In particular, we explore how this damage in the AM component carries over to the corresponding Instantaneous Frequency (IF) for the SSD task. Thus, the novelty of the proposed Amplitude-Weighted Frequency Cepstral Coefficients (AWFCC) feature set lies in using frequency components along with squared-weighted amplitude components that are degraded due to replay noise. The AWFCC features contain the information of both the AM and FM components together and hence provide discriminative information in the spectral characteristics. The primary motivation of this thesis is to develop various countermeasures for the SSD task. Experimental results on standard spoofing databases show that the proposed feature sets perform better than the corresponding baseline systems.
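The VTEO of Chapter 5 generalizes the operator to Ψ_k[x(n)] = x(n)² − x(n−k)·x(n+k), with k = 1 recovering the classical TEO. A brief illustrative sketch (for a pure tone A·cos(ωn), the output becomes the constant A²·sin²(kω), so the lag k tunes the frequency sensitivity):

```python
import numpy as np

def vteo(x, k=1):
    """Variable-length TEO: psi_k[x](n) = x(n)^2 - x(n-k)*x(n+k); k is the lag."""
    x = np.asarray(x, dtype=float)
    return x[k:-k] ** 2 - x[:-2 * k] * x[2 * k:]

n = np.arange(1000)
A, omega = 1.5, 0.1
tone = A * np.cos(omega * n)
# k = 1 is the classical TEO; larger lags change the frequency sensitivity:
# for a pure tone the output is the constant A^2 * sin(k*omega)^2.
for k in (1, 2, 3):
    print(k, np.allclose(vteo(tone, k), (A * np.sin(k * omega)) ** 2))
```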
Inspired by the success in the SSD task, we applied TEO-based feature sets in a variety of speech and audio processing applications, namely, Automatic Speech Recognition (ASR), Acoustic Scene Classification (ASC), Voice Assistant (VA), and Whisper Speech Detection (WSD). In all these applications, our TEO-based feature sets gave consistently better performance than their respective baselines.

Item (Open Access): Voice conversion: alignment and mapping perspective (Dhirubhai Ambani Institute of Information and Communication Technology, 2019). Shah, Nirmesh J.; Patil, Hemant A.

Understanding how a particular speaker produces speech, and mimicking that speaker's voice, is a difficult research problem due to the sophisticated mechanism involved in speech production. Voice Conversion (VC) is a technique that modifies the perceived speaker identity in a given speech utterance from a source speaker to a particular target speaker without changing the linguistic content. Building each stand-alone VC system consists of two stages, namely, training and testing. First, speaker-dependent features are extracted from both speakers' training data. These features are time-aligned and corresponding pairs are obtained. Then a mapping function is learned from these aligned feature pairs. Once training is done, during the testing stage, features are extracted from the source speaker's held-out data. These features are converted using the mapping function and then passed through a vocoder to produce the converted voice. Hence, there are primarily three components of stand-alone VC system building, namely, the alignment step, the mapping function, and the speech analysis/synthesis framework. The major contributions of this thesis are towards identifying the limitations of existing techniques, improving them, and developing new approaches for the mapping and alignment stages of VC.
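For parallel training data, the time-alignment step described above is commonly implemented with Dynamic Time Warping (DTW); the thesis studies alignment strategies in much more depth, so the following is only a generic sketch of DTW over per-frame Euclidean distances, with toy 1-D "features" standing in for real spectral features:

```python
import numpy as np

def dtw_align(src, tgt):
    """Classic DTW over per-frame Euclidean distances; returns the warping
    path as (source_index, target_index) pairs."""
    ns, nt = len(src), len(tgt)
    cost = np.full((ns + 1, nt + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ns + 1):
        for j in range(1, nt + 1):
            d = np.linalg.norm(src[i - 1] - tgt[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack the optimal path from the end to the start
    path, i, j = [], ns, nt
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy 1-D "features": the target repeats one frame (a timing difference)
src = np.array([[0.0], [1.0], [2.0], [3.0]])
tgt = np.array([[0.0], [1.0], [1.0], [2.0], [3.0]])
print(dtw_align(src, tgt))
```

The aligned (source, target) frame pairs returned by the path are what the mapping function is subsequently trained on.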
In particular, a novel Amplitude Scaling (AS) method is proposed for Frequency Warping (FW)-based VC, which linearly transfers the amplitude of the frequency-warped spectrum using the knowledge of a Gaussian Mixture Model (GMM)-based converted spectrum, without adding spurious peaks. To overcome overfitting in Deep Neural Network (DNN)-based VC, pre-training is a popular remedy. However, pre-training is time-consuming and requires a separate network to learn the parameters of the main network. Hence, this thesis investigates whether the additional pre-training step can be avoided by using recent advances in deep learning. The ability of the Generative Adversarial Network (GAN) to estimate a probability density function (pdf) and generate realistic samples corresponding to a given source speaker's utterance has resulted in significant performance improvements in VC. The key limitation of a vanilla GAN-based system is that it may generate samples that do not correspond to the given source speaker's utterance. To address this issue, a Minimum Mean Squared Error (MMSE) regularized GAN (i.e., MMSE-GAN) is proposed in this thesis. Obtaining corresponding feature pairs in the context of both parallel and non-parallel VC is a challenging task. In this thesis, the strengths and limitations of the different existing alignment strategies are identified, and new alignment strategies are proposed for both parallel and non-parallel VC tasks. Wrongly aligned pairs affect the learning of the mapping function, which in turn deteriorates the quality of the converted voices. To remove such wrongly aligned pairs from the training data, an outlier-removal-based pre-processing technique is proposed for parallel VC.
In the case of non-parallel VC, a theoretical convergence proof is developed for the popular alignment technique, namely, the Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA). In addition, the use of dynamic features along with static features to compute the Nearest Neighbor (NN) aligned pairs in the existing INCA and Temporal Context (TC) INCA is proposed. Furthermore, a novel distance metric is learned for the NN-based search strategies, as Euclidean distance may not correlate well with perceptual distance. Moreover, a computationally simple Spectral Transition Measure (STM)-based phone alignment technique that does not require any a priori training data is also proposed for non-parallel VC. Both the parallel and the non-parallel alignment techniques generate one-to-many and many-to-one feature pairs. These pairs affect the learning of the mapping function and result in muffling and oversmoothing effects in VC. Hence, an unsupervised Vocal Tract Length Normalization (VTLN) posteriorgram and a novel inter-mixture-weighted GMM posteriorgram are proposed as speaker-independent representations in a two-stage mapping network, in order to eliminate the alignment step from the VC framework. In this thesis, an attempt is also made to use the Acoustic-to-Articulatory Inversion (AAI) technique for quality assessment of the voice-converted speech. Lastly, the proposed MMSE-GAN architecture is extended in the form of the Discover GAN (i.e., MMSE-DiscoGAN) for cross-domain VC applications (w.r.t. attributes of the speech production mechanism), namely, Non-Audible Murmur-to-WHiSPer (NAM2WHSP) speech conversion and WHiSPer-to-SPeeCH (WHSP2SPCH) conversion.
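The INCA alignment discussed above alternates a nearest-neighbour search with an update of the conversion function. A toy sketch of that loop under strong simplifying assumptions (a linear least-squares "conversion" and identical toy feature sets, purely to illustrate the alternation; real INCA uses richer mappings and non-parallel data):

```python
import numpy as np

def inca_step(src, tgt, convert):
    """One INCA iteration: convert source frames, then nearest-neighbour match
    each converted frame to a target frame."""
    conv = convert(src)
    d = np.linalg.norm(conv[:, None, :] - tgt[None, :, :], axis=2)
    return list(zip(range(len(src)), d.argmin(axis=1)))

def inca(src, tgt, iters=5):
    """Alternate NN alignment with a least-squares linear 'conversion' update."""
    W = np.eye(src.shape[1])  # start from an identity conversion
    pairs = []
    for _ in range(iters):
        pairs = inca_step(src, tgt, lambda x: x @ W)
        aligned_tgt = tgt[[j for _, j in pairs]]
        W, *_ = np.linalg.lstsq(src, aligned_tgt, rcond=None)
    return pairs

# Sanity check: with identical toy feature sets, the loop recovers the
# frame-to-frame correspondence (illustrative only, not real VC data).
src = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [3.0, 1.0]])
tgt = src.copy()
print([(int(i), int(j)) for i, j in inca(src, tgt)])
```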
Finally, the thesis summarizes the overall work presented and the limitations of the various approaches, along with future research directions.

Item (Open Access): Design of countermeasures for spoofed speech detection system (Dhirubhai Ambani Institute of Information and Communication Technology, 2017). Patel, Tanvina; Patil, Hemant A.

Automatic Speaker Verification (ASV) systems are vulnerable to spoofing attacks based on speech synthesis and voice conversion techniques. Recently, to encourage the development of anti-spoofing measures or countermeasures for the Spoofed Speech Detection (SSD) task, a standardized dataset was provided at the 'ASVspoof 2015 challenge' held at INTERSPEECH 2015. In the present work, using a traditional Gaussian Mixture Model (GMM)-based classification system, novel countermeasures are proposed considering three vital aspects of the speech production mechanism, i.e., the excitation source, the vocal tract system (i.e., filter), and the Source-Filter (S-F) interaction. Considering our competitive performance at the ASVspoof challenge, we first discuss system-based features, including the proposed Cochlear Filter Cepstral Coefficients and Instantaneous Frequency (CFCCIF) features. These use the envelope and average IF of each subband along with transient information. The transient variations estimated by the symmetric difference (CFCCIFS) gave better discrimination. Within the framework of system-based features, the Subband Autoencoder (SBAE) feature set, which embeds subband processing in the autoencoder architecture, is used. For source-based features, knowing that actual vocal fold movement is absent in machine-generated speech, the fundamental frequency (F0) contour and Strength of Excitation (SoE) are used as features. Next, since spoofed speech is easily predicted if generated by a simplified model, or difficult to predict due to artifacts, we propose the use of prediction-based methods.
These include Linear Prediction (LP), Long-Term Prediction (LTP), and Non-Linear Prediction (NLP) techniques. Lastly, the Fujisaki model is used to analyze prosodic differences in terms of accent and phrase between natural and spoofed speech. In addition to independently using source or system features, the time-varying dependencies, i.e., the S-F interaction features, are considered. This includes exploring features based on the residual information of the glottal excitation source and its fitted Liljencrants-Fant (LF) model, in both the time and frequency domains, for the SSD task.
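The prediction-based countermeasures above rest on how well an all-pole model predicts each frame. A hedged sketch of LP residual extraction via the autocorrelation (Yule-Walker) method, with a synthetic AR(1) signal standing in for speech (not the thesis implementation):

```python
import numpy as np

def lp_residual(x, order=10):
    """LP residual via the autocorrelation (Yule-Walker) method: fit an
    all-pole predictor, subtract the prediction, return the error signal."""
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])  # predictor coefficients
    pred = np.zeros_like(x)
    for k in range(1, order + 1):
        pred[k:] += a[k - 1] * x[:-k]
    return x - pred

# Toy check: a synthetic AR(1) signal is well modelled by linear prediction,
# so its residual energy is far below the signal energy.
rng = np.random.default_rng(1)
e = rng.normal(0.0, 0.1, 2000)
x = np.zeros(2000)
for n in range(1, 2000):
    x[n] = 0.9 * x[n - 1] + e[n]
res = lp_residual(x)
print(np.sum(res ** 2) < 0.5 * np.sum(x ** 2))
```

The idea carries over to spoof detection: how predictable a frame is, and what structure remains in the residual, differs between natural and machine-generated speech.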