Person recognition using humming, singing and speech
Abstract
In this thesis, a person recognition system is designed for three different speech-related biometric signals: humming, singing and normal speech. Because humming is a nasalised sound, we use Mel filterbank-based features for the person recognition task rather than the LP (Linear Prediction) model. The aim of this work is to observe which of the three biometric patterns performs best for person recognition. Since we found that person-specific information is not the same in any two of these biometric signals, the performance of each signal needs to be evaluated separately.
The first task in designing any person recognition system is data collection and corpus design. Hence, in this thesis, a corpus is first designed for humming, singing and speech. A total of 50 subjects were selected for recording. Data were collected in 4 different sessions for each subject in order to capture intersession variability. Each session consists of recordings of humming, singing and speech.
After data collection, features are extracted with a Mel filterbank, which follows human auditory perception; Mel Frequency Cepstral Coefficients (MFCC) are therefore used as the state-of-the-art feature set. Using this filterbank, experiments are carried out on intersession as well as within-session training-testing sets. Noise is then added to the database and the results are compared in order to evaluate the robustness of the system under noisy conditions. The feature extraction process is also modified using the TEO (Teager Energy Operator), giving Teager Energy Based MFCC (T-MFCC) features, and the results for T-MFCC are compared with those for MFCC. Score-level fusion of the T-MFCC and MFCC feature sets is also performed. The observations show that score-level fusion of MFCC and T-MFCC performs better than either feature set individually. Performance is measured for different values of the fusion weight, and the optimum fusion weight is determined for the humming, singing and speech signals.
The effects of the feature dimension and of the classifier order are also observed for the intersession experiment. After these studies, an inter-biometric-type experiment is performed. Based on the results of this experiment, Fisher's F-ratio is determined for all three biometric patterns (i.e., humming, singing and speech).
A new filterbank structure is proposed for all three biometric patterns. System performance is also measured with this new filterbank and compared with the results of all the previous experiments.
In all these experiments, the person-specific model is generated using a polynomial classifier. This classifier takes out-of-class information into account while creating the person-specific model.
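As an illustration of what the order of a polynomial classifier controls, the sketch below expands a feature vector into all monomials up to a given degree and scores a set of frames against a weight vector. It is only an assumed outline: the abstract gives no details of the training procedure, so training (fitting the weights so that in-class frames map towards 1 and out-of-class frames towards 0) is omitted. The names `poly_expand` and `score` are hypothetical.
```python
from itertools import combinations_with_replacement
from math import prod


def poly_expand(x, order):
    """Return all monomials of the entries of x with total degree <= order.

    For x = [x1, x2] and order = 2 the result is
    [1, x1, x2, x1*x1, x1*x2, x2*x2].
    """
    terms = []
    for degree in range(order + 1):
        for combo in combinations_with_replacement(range(len(x)), degree):
            # The empty combination (degree 0) gives the constant term 1.
            terms.append(prod(x[i] for i in combo))
    return terms


def score(frames, weights, order):
    """Average over frames of the inner product of weights and expanded frame."""
    total = 0.0
    for x in frames:
        total += sum(w * p for w, p in zip(weights, poly_expand(x, order)))
    return total / len(frames)


if __name__ == "__main__":
    print(poly_expand([2, 3], 2))
```
The number of expanded terms grows quickly with both the feature dimension and the order, which is one reason these two factors are studied together.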
Results are reported for several evaluation factors, such as the order of the polynomial classifier, the dimension of the feature vector and noisy environments. Performance is evaluated using DET (Detection Error Trade-off) curves, a NIST-standardized and widely accepted performance measure for speaker recognition applications.
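The operating points behind a DET curve can be sketched as follows, assuming that higher scores indicate a better match and that a trial is accepted when its score is at or above the threshold. The function names and the toy scores are hypothetical and are not taken from the thesis. A DET plot would place the miss and false-alarm rates on normal-deviate axes; the sketch only computes the rates and an approximate equal error rate (EER).
```python
def det_points(genuine, impostor):
    """Return (threshold, false_alarm_rate, miss_rate) per candidate threshold.

    genuine:  scores of trials where the claimed identity is correct
    impostor: scores of trials where the claimed identity is wrong
    """
    points = []
    for t in sorted(set(genuine) | set(impostor)):
        miss = sum(s < t for s in genuine) / len(genuine)
        false_alarm = sum(s >= t for s in impostor) / len(impostor)
        points.append((t, false_alarm, miss))
    return points


def equal_error_rate(genuine, impostor):
    """Approximate EER: the point where miss and false-alarm rates are closest."""
    _, false_alarm, miss = min(
        det_points(genuine, impostor), key=lambda p: abs(p[1] - p[2])
    )
    return (false_alarm + miss) / 2


if __name__ == "__main__":
    genuine_scores = [0.9, 0.8, 0.7, 0.4]   # toy values for illustration
    impostor_scores = [0.6, 0.3, 0.2, 0.1]  # toy values for illustration
    print(equal_error_rate(genuine_scores, impostor_scores))
```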
Collections
- M Tech Dissertations [923]