Design of Spoof Speech Detection System: Teager Energy-Based Approach
Automatic Speaker Verification (ASV) systems are vulnerable to various spoofing attacks, namely, Speech Synthesis (SS), Voice Conversion (VC), Replay, and Impersonation. The study of spoofing countermeasures has therefore become a critical area of research, and it is the principal objective of this thesis. With the development of neural network-based techniques, in particular for machine-generated spoof speech signals, Spoof Speech Detection (SSD) systems will be challenged further. To encourage the development of countermeasures for the SSD task, based either on signal processing techniques or on neural network-based features, the organizers of the ASVspoof challenge campaigns provided standardized datasets in 2015, 2017, and 2019. The front-end features extracted from the speech signal have a large impact on the performance of such systems. The goal of feature extraction is to estimate, directly from the speech signal, the meaningful information that helps a pattern classifier in tasks such as speech, speaker, and emotion recognition. Among the various spoofing attacks, speech synthesis, voice conversion, and replay have been identified as the most effective and accessible forms of spoofing. Accordingly, this thesis investigates and develops a framework to extract discriminative features to counter these three spoofing attacks. The main contribution of the thesis is a set of proposed feature sets that serve as front-end countermeasures for the SSD task, used with a traditional Gaussian Mixture Model (GMM)-based classification system.
The feature sets are based on the Teager Energy Operator (TEO) and the Energy Separation Algorithm (ESA), namely, Teager Energy Cepstral Coefficients (TECC), Energy Separation Algorithm Instantaneous Frequency Cepstral Coefficients (ESA-IFCC), Energy Separation Algorithm Instantaneous Amplitude Cepstral Coefficients (ESA-IACC), Amplitude Weighted Frequency Cepstral Coefficients (AWFCC), and Gabor Teager Filterbank (GTFB). The motivation for using the TEO is its ability to capture nonlinear properties of speech production. The TEO is known to estimate the true total source energy, and it preserves the amplitude and frequency modulation of a resonant signal; hence, it improves the time-frequency resolution and the representation of formant information. In addition, the TEO has a noise suppression property and attempts to remove the distortion caused by noise. In Chapter 3, we analyze the replay speech signal in terms of the reverberation that occurs during recording. Reverberation introduces delays and amplitude changes, producing close copies of the speech signal that significantly affect the replay components. To that effect, we propose to exploit the capability of the TEO to compute a running estimate of subband energies for replay vs. genuine signals. We use a linearly spaced Gabor filterbank to obtain the narrowband filtered signals. The TEO has the property of tracking the instantaneous changes of a signal. In Chapter 4, we propose Instantaneous Amplitude (IA) and Instantaneous Frequency (IF) features using the Energy Separation Algorithm (ESA). Because speech is a combination of several monocomponent signals, the speech signal is passed through bandpass filters to obtain narrowband components. To obtain the narrowband filtered signals, we use linearly spaced Butterworth and Gabor filterbanks. The instantaneous modulations help to understand the local characteristics of a non-stationary signal.
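As a rough illustration of the TEO and ESA ideas described above, the sketch below computes the discrete Teager energy of a signal and then separates it into instantaneous amplitude and frequency. It assumes the DESA-2 variant of the ESA, which the abstract does not specify, and it omits the bandpass filterbank stage. The function names and the test tone are hypothetical and are not taken from the thesis.

```python
import math

def teager_energy(x):
    """Discrete TEO: psi[n] = x[n]^2 - x[n-1] * x[n+1], for n = 1 .. N-2."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def desa2(x):
    """DESA-2 estimates of instantaneous amplitude and frequency (rad/sample).

    Assumption: DESA-2 is used here only as one common ESA variant.
    Returns two lists aligned with samples n = 2 .. N-3 of x.
    """
    # Symmetric difference y[n] = x[n+1] - x[n-1]; y[i] belongs to n = i + 1.
    y = [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)]
    psi_x = teager_energy(x)  # psi_x[i] belongs to n = i + 1
    psi_y = teager_energy(y)  # psi_y[j] belongs to n = j + 2
    amplitude, frequency = [], []
    for j, py in enumerate(psi_y):
        px = psi_x[j + 1]  # same time index n = j + 2
        if px <= 0.0 or py <= 0.0:
            # The estimate is undefined here (e.g. silence or noisy samples).
            amplitude.append(float("nan"))
            frequency.append(float("nan"))
            continue
        c = max(-1.0, min(1.0, 1.0 - py / (2.0 * px)))  # keep acos in its domain
        frequency.append(0.5 * math.acos(c))
        amplitude.append(2.0 * px / math.sqrt(py))
    return amplitude, frequency

# Hypothetical test tone: amplitude 2.0, frequency 0.3 rad/sample.
tone = [2.0 * math.cos(0.3 * n + 0.5) for n in range(200)]
ia, inst_freq = desa2(tone)
```

For a pure sinusoid A cos(Ωn + φ), the Teager energy equals A² sin²(Ω), and DESA-2 recovers A and Ω for 0 < Ω < π/2. In a feature pipeline of the kind described here, these functions would be applied to each narrowband subband signal rather than to the full-band signal.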
These IA and IF components capture the information present in the slowly varying amplitude envelope and the fast-varying frequency. For replay speech, the slowly varying temporal modulations exhibit a distorted amplitude envelope, and the fast-varying temporal modulations do not preserve the harmonic structure of the natural speech signal. In a replay speech signal, the characteristics of the intermediate devices and the acoustic environment distort the spectral energy relative to that of natural speech. In Chapter 5, we extend our earlier work with the generalized TEO, i.e., by varying the past and future sample instants by a constant arbitrary integer k, also known as the lag parameter or dependency index, and name it the Variable length Teager Energy Operator (VTEO). In Chapter 6, we propose the combination of Amplitude Modulation and Frequency Modulation (AM-FM) features for the replay SSD task. The AM components are known to be affected by noise (in this case, noise due to the replay mechanism). In particular, we exploit this degradation of the AM component, together with the corresponding Instantaneous Frequency (IF), for the SSD task. Thus, the novelty of the proposed Amplitude Weighted Frequency Cepstral Coefficients (AWFCC) feature set lies in using frequency components along with squared weighted amplitude components that are degraded by replay noise. The AWFCC features contain the information of both the AM and FM components together and hence provide discriminative information about the spectral characteristics. The primary motivation of this thesis is to develop various countermeasures for the SSD task. The experimental results on the standard spoofing database show that the proposed feature sets perform better than the corresponding baseline systems.
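The generalized operator introduced in Chapter 5 can be sketched as follows, assuming the usual form in which the one-sample neighbours of the TEO are replaced by samples k instants in the past and future. The function name `vteo` and the test tone are hypothetical; k = 1 reduces to the standard TEO.

```python
import math

def vteo(x, k=1):
    """Variable-length TEO: psi_k[n] = x[n]^2 - x[n-k] * x[n+k], n = k .. N-1-k.

    k is the lag parameter (dependency index); k = 1 gives the standard TEO.
    """
    if k < 1:
        raise ValueError("lag k must be a positive integer")
    return [x[n] ** 2 - x[n - k] * x[n + k] for n in range(k, len(x) - k)]

# Hypothetical test tone: amplitude 2.0, frequency 0.3 rad/sample.
tone = [2.0 * math.cos(0.3 * n + 0.5) for n in range(200)]
energy_k1 = vteo(tone, k=1)
energy_k3 = vteo(tone, k=3)
```

For a sinusoid A cos(Ωn + φ), this operator returns the constant A² sin²(kΩ), so the lag k changes how strongly each frequency contributes to the estimated energy.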
Inspired by the success in the SSD task, we applied the TEO-based feature set to a variety of speech and audio processing applications, namely, Automatic Speech Recognition (ASR), Acoustic Scene Sound Classification (ASC), Voice Assistant (VA), and Whisper Speech Detection (WSD). In all these applications, our TEO-based feature set gave consistently better performance than the respective baselines.