Design of countermeasures for replay spoof speech attack
An Automatic Speaker Verification (ASV) system is a biometric person authentication system that verifies a claimed speaker's identity from his/her voice with the help of machines. ASV systems are vulnerable to various types of spoofing attacks, such as impersonation, speech synthesis (SS), voice conversion (VC), replay, and twins. The replay attack poses one of the most difficult challenges for the use of ASV systems in practical scenarios, as it requires neither specific expert knowledge nor advanced equipment. In this work, we present a standalone replay Spoof Speech Detection (SSD) task to classify natural vs. replayed speech. In earlier studies, researchers mainly used vocal tract system-based (segmental) information for replay SSD. However, during the replay mechanism, excitation source-based information is also affected by the recording environment and replay devices (in particular, the pitch (F0) source harmonics degrade in the higher frequency regions). Hence, in this thesis, we explore an excitation source-based feature set along with system-based features for the replay SSD task. In particular, we propose the novel Linear Frequency Residual Cepstral Coefficients (LFRCC) for the replay SSD task. The objective of using this feature set is to explore possible complementary excitation source information present in Linear Prediction (LP) residual-based features. In addition, we propose system-based features, namely, Instantaneous Amplitude (IA) and Instantaneous Frequency (IF) features obtained using the Hilbert Transform (HT) demodulation technique. The resulting HT-based Instantaneous Amplitude Cepstral Coefficients (IACC) and Instantaneous Frequency Cepstral Coefficients (IFCC) feature sets capture the information present in the slowly-varying envelope and in the fast-varying changes in frequency, respectively.
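As an illustration of the LP residual on which LFRCC is built, the following is a minimal sketch, not the thesis implementation. It estimates LP coefficients with the autocorrelation method and the Levinson-Durbin recursion, then inverse-filters the signal to obtain the residual. The function names, the LP order, and the synthetic test signal are assumptions made for this example; the abstract does not specify framing, windowing, LP order, or the cepstral stage that follows.

```python
import random


def autocorr(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of a sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations.

    Returns the prediction-error filter a = [1, a1, ..., ap]
    and the final prediction-error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def lp_residual(x, order):
    """Inverse-filter x: e[n] = x[n] + sum_{k=1..p} a_k * x[n-k]."""
    a, _ = levinson_durbin(autocorr(x, order), order)
    residual = [
        sum(a[k] * x[n - k] for k in range(order + 1) if n - k >= 0)
        for n in range(len(x))
    ]
    return residual, a


def demo_signal(n=2000, seed=0):
    """Hypothetical test signal: an AR(2) process driven by white noise."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n):
        x.append(1.5 * x[-1] - 0.7 * x[-2] + rng.gauss(0.0, 1.0))
    return x[2:]


if __name__ == "__main__":
    signal = demo_signal()
    residual, coeffs = lp_residual(signal, order=2)
    print("estimated prediction-error filter:", coeffs)
```

In practice the residual would be computed frame by frame on windowed speech, and cepstral coefficients would then be extracted from it; those steps are outside this sketch.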
Experiments were performed on the ASVspoof 2017 Challenge database with Gaussian Mixture Model (GMM) and Convolutional Neural Network (CNN) classifiers. The score-level fusion of source-based and system-based features significantly improved the performance. Furthermore, for a fixed feature set, fusing the GMM and CNN classifiers at the score level yields a significant reduction in % Equal Error Rate (EER). We have also analyzed the effect of classifier-level fusion for the replay SSD task.
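Score-level fusion and the EER metric mentioned above can be sketched as follows. This is a hedged example under stated assumptions: the weighted-sum fusion rule, the weight `alpha`, and the function names are illustrative and are not taken from the thesis. Higher scores are assumed to indicate natural (genuine) speech, and the EER is approximated at the threshold where the false acceptance and false rejection rates are closest.

```python
def fuse_scores(scores_a, scores_b, alpha=0.5):
    """Weighted-sum score-level fusion of two systems (alpha is an assumed weight)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(scores_a, scores_b)]


def equal_error_rate(genuine, spoof):
    """Approximate % EER from genuine and spoof trial scores.

    Assumes a higher score means the trial is more likely genuine.
    """
    thresholds = sorted(set(genuine) | set(spoof))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        far = sum(s >= t for s in spoof) / len(spoof)      # spoof accepted
        frr = sum(g < t for g in genuine) / len(genuine)   # genuine rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, 100.0 * (far + frr) / 2.0
    return eer


if __name__ == "__main__":
    # Toy scores for illustration only; these are not experimental results.
    gmm_genuine, gmm_spoof = [0.0, 2.0, 3.0, 4.0], [-2.0, -1.0, 0.5, 1.0]
    cnn_genuine, cnn_spoof = [1.0, 2.5, 3.0, 3.5], [-1.5, -1.0, 0.0, 0.5]
    fused_genuine = fuse_scores(gmm_genuine, cnn_genuine)
    fused_spoof = fuse_scores(gmm_spoof, cnn_spoof)
    print("GMM EER (%):", equal_error_rate(gmm_genuine, gmm_spoof))
    print("Fused EER (%):", equal_error_rate(fused_genuine, fused_spoof))
```

The same two functions apply whether the fused scores come from two feature sets with one classifier or from two classifiers with one feature set; the fusion weight would normally be tuned on a development set.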
- M Tech Dissertations