Design of countermeasures for replay spoof speech attack
Abstract
An Automatic Speaker Verification (ASV) system is a biometric person authentication
system that verifies a claimed speaker's identity from his/her voice with the
help of machines. ASV systems are vulnerable to various types of spoofing
attacks, such as impersonation, speech synthesis (SS), voice conversion (VC), replay,
and twins. Replay attacks pose one of the most difficult challenges for the use
of ASV systems in practical scenarios, as they do not require any specific expert
knowledge or advanced equipment. In this work, we address replay Spoof Speech
Detection (SSD) as a standalone task that classifies natural vs. replayed speech.
In earlier studies, researchers mainly used vocal tract system-based (segmental)
information for replay SSD. However, during the replay mechanism, excitation
source-based information is also affected by the recording environment and the
replay devices (in particular, the pitch (F0) source harmonics degrade in the
higher frequency regions). Hence, in this thesis, we have explored an excitation
source-based feature set along with system-based features for the replay SSD task.
In particular, we proposed novel Linear Frequency Residual Cepstral Coefficients
(LFRCC) for the replay SSD task. The objective of using this feature set
is to explore possible complementary excitation source information
present in the Linear Prediction (LP) residual-based features.
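
As a rough illustration of the LP residual on which the LFRCC features are
built, the sketch below computes the residual of a single speech frame with the
autocorrelation method. It is a minimal sketch, not the thesis implementation:
the function name lp_residual, the LP order of 12, and the small diagonal
regularisation are assumptions made for this example, and the subsequent
linear-frequency filterbank and cepstral steps are not shown.

```python
import numpy as np

def lp_residual(frame, order=12):
    """Return (residual, lp_coeffs) for one frame (hypothetical helper).

    The LP order and the regularisation constant are illustrative assumptions.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    # Small diagonal loading keeps the system solvable for near-silent frames.
    R = R + (1e-9 * r[0] + 1e-12) * np.eye(order)
    a = np.linalg.solve(R, r[1:])
    # Linear prediction: s_hat[t] = sum_k a[k] * s[t - k - 1].
    pred = np.zeros(n)
    for k in range(order):
        pred[k + 1:] += a[k] * frame[: n - k - 1]
    # The residual approximates the excitation source signal.
    return frame - pred, a
```

The residual has the same length as the input frame, so it can be passed to any
short-time spectral analysis in place of the original signal.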
In addition, we also proposed system-based features, namely, Instantaneous Amplitude
(IA) and Instantaneous Frequency (IF) features obtained using the Hilbert Transform
(HT) demodulation technique. These HT-based Instantaneous Amplitude Cepstral
Coefficients (IACC) and Instantaneous Frequency Cepstral Coefficients (IFCC)
feature sets capture the information present in the slowly-varying envelope
and in the fast-varying changes in frequency, respectively. Experiments were
performed on the ASVspoof 2017 Challenge database with Gaussian Mixture Model (GMM) and Convolutional
Neural Network (CNN) classifiers. The score-level
fusion of source-based and system-based features significantly improved
the performance. Furthermore, for a fixed feature set, fusing the GMM
and CNN classifiers at the score level yielded a significant reduction in
% Equal Error Rate (EER). We have also analyzed the effect of classifier-level
fusion for the replay SSD task.
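
To make the score-level fusion and the % EER metric mentioned above concrete,
the following is a minimal sketch under stated assumptions. The weighted-sum
rule, the weight alpha, and the function names are hypothetical choices for
this example and are not taken from the thesis. The sketch also assumes that
both systems produce scores on comparable scales and that a higher score means
"more likely natural speech".

```python
import numpy as np

def fuse_scores(scores_a, scores_b, alpha=0.5):
    """Weighted-sum fusion of two score lists (alpha is an assumed weight)."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    return alpha * scores_a + (1.0 - alpha) * scores_b

def equal_error_rate(natural, replayed):
    """Approximate % EER from natural and replayed trial scores."""
    natural = np.asarray(natural, dtype=float)
    replayed = np.asarray(replayed, dtype=float)
    # Use every observed score as a candidate threshold.
    thresholds = np.sort(np.concatenate([natural, replayed]))
    # False rejection: natural speech scored below the threshold.
    frr = np.array([(natural < t).mean() for t in thresholds])
    # False acceptance: replayed speech scored at or above the threshold.
    far = np.array([(replayed >= t).mean() for t in thresholds])
    # The EER is read off where the two error rates are closest.
    i = np.argmin(np.abs(far - frr))
    return 100.0 * (far[i] + frr[i]) / 2.0
```

In this sketch, the same fuse_scores call could combine two feature sets under
one classifier or two classifiers under one feature set; only the inputs change.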