Person: Patil, Hemant


Name: Hemant Patil

Job Title: Faculty

Telephone: 079-68261650

Specialization: Speech Signal Processing, Speech and Speaker Recognition (Voice Biometrics), Development of Countermeasures for Spoofing Attacks on Automatic Speaker Verification, Voice Conversion

Biography

Hemant A. Patil received the Ph.D. degree from the Indian Institute of Technology (IIT) Kharagpur, India, in July 2006. Since 2007, he has been a faculty member at DA-IICT Gandhinagar, India, where he developed the Speech Research Lab, which is recognized as an ISCA speech lab. Dr. Patil is a member of IEEE, the IEEE Signal Processing Society, the IEEE Circuits and Systems Society, the International Speech Communication Association (ISCA), and EURASIP, and an affiliate member of the IEEE SLTC. He is a regular reviewer for ICASSP and INTERSPEECH, as well as for Speech Communication (Elsevier), Computer Speech and Language (Elsevier), International Journal of Speech Technology (Springer), and Circuits, Systems, and Signal Processing (Springer). He has published around 226 research publications in national and international conferences, journals, and book chapters. He visited the Department of ECE, University of Minnesota, Minneapolis, USA (May-July 2009) as a short-term scholar. He has been associated (as PI) with three MeitY-sponsored projects in ASR, TTS, and QbE-STD, and was co-PI for a DST-sponsored project on India Digital Heritage (IDH)-Hampi. His research interests include speech and speaker recognition, TTS, and infant cry analysis. He received the DST Fast Track Award for Young Scientists for infant cry analysis. He has co-edited a book on forensic speaker recognition with Dr. Amy Neustein (EIC, IJST, Springer), and is presently co-editing two books on speech technology for the medical domain. Dr. Patil has taken a lead role in organizing several ISCA-supported events, such as summer/winter schools and CEP workshops (on speaker and language recognition, speech source modeling, text-to-speech synthesis, the speech production-perception link, and advances in speech processing) and progress review meetings for two MeitY consortium projects, all at DA-IICT Gandhinagar. He has supervised four doctoral theses (including one on spoofing attacks) and 42 M.Tech. theses, and is presently supervising three doctoral students. Recently, he offered a joint tutorial with Prof. Haizhou Li during the Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2017 and again during INTERSPEECH 2018. He will offer a joint tutorial with Prof. H. Kawahara during APSIPA ASC 2018, Honolulu, USA, Nov. 12-15, 2018. He has been selected as an APSIPA Distinguished Lecturer (DL) for 2018-2019 and has delivered 11 APSIPA DLs in three countries: India, China, and Canada.

Publications by period: 2008-2009 (3), 2010-2019 (16), 2020-2024 (19)


Search Results

Now showing 1 - 10 of 38

Combining evidences from magnitude and phase information using VTEO for person recognition using humming
(Elsevier, 01-11-2018) Madhavi, Maulik C; Patil, Hemant; DA-IICT, Gandhinagar; Madhavi, Maulik C (200911036)
Most state-of-the-art speaker recognition systems use natural speech signals (i.e., real, spontaneous, or contextual speech) from the subjects. In this paper, recognition of a person is attempted from his or her hum with the help of machines. This kind of application can be useful for designing a person-dependent Query-by-Humming (QBH) system and hence plays an important role in music information retrieval (MIR) systems. In addition, it can also be useful for other speech technology applications such as human-computer interaction, speech prosody analysis of disordered speech, and speaker forensics. This paper develops a new feature extraction technique to exploit perceptually meaningful phase spectrum information (due to mel frequency warping, which imitates the human hearing process) along with magnitude spectrum information from the hum signal. In particular, the structure of the state-of-the-art feature set, namely, Mel Frequency Cepstral Coefficients (MFCCs), is modified to capture phase spectrum information. In addition, a new energy measure, namely, the Variable-length Teager Energy Operator (VTEO), is employed to compute the subband energies of the different time-domain subband signals (i.e., the outputs of the 24 triangular-shaped filters in the mel filterbank). We refer to this proposed feature set as MFCC-VTMP (i.e., mel frequency cepstral coefficients that capture perceptually meaningful magnitude and phase information via VTEO). The polynomial classifier (which is, in principle, similar to other discriminatively-trained classifiers such as the support vector machine (SVM) with a polynomial kernel) is used as the basis for all the experiments. The effectiveness of the proposed feature set is evaluated and consistently found to be better than the MFCC feature set across several evaluation factors, such as comparison with other phase-based features, the order of the polynomial classifier, the person (speaker) modeling approach (such as GMM-UBM and i-vector), the dimension of the feature vector, robustness under signal degradation conditions, static vs. dynamic features, feature discrimination measures, and intersession variability.
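The core signal-level idea here is the VTEO energy measure. Below is a minimal NumPy sketch of the operator as commonly defined, with the dependency index m as its parameter; it is not the authors' full MFCC-VTMP pipeline, which additionally involves the 24-filter mel filterbank and the phase-capture modification.

```python
import numpy as np

def vteo(x: np.ndarray, m: int = 1) -> np.ndarray:
    """Variable-length Teager Energy Operator (VTEO).

    psi_m[x(n)] = x(n)^2 - x(n - m) * x(n + m);
    m = 1 reduces to the classical Teager Energy Operator (TEO).
    """
    x = np.asarray(x, dtype=float)
    energy = np.zeros_like(x)
    energy[m:-m] = x[m:-m] ** 2 - x[:-2 * m] * x[2 * m:]
    return energy

# Toy check: for a pure tone A*sin(omega*n), VTEO equals the constant
# A^2 * sin^2(m * omega) at every interior sample.
n = np.arange(1000)
tone = np.sin(2 * np.pi * 0.05 * n)
print(vteo(tone, m=2)[2:6])
```

In the paper's setting this operator would be applied to each time-domain mel subband signal in place of the usual squared-magnitude energy.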

On Significance of Constant-Q Transform for Pop Noise Detection
(Elsevier, 11-06-2023) Khoria, Kuldeep; Patil, Ankur T; Patil, Hemant; DA-IICT, Gandhinagar; Khoria, Kuldeep (201911014); Patil, Ankur T (201621008)
Liveness detection has emerged as an important research issue for many biometrics, such as face, iris, and hand geometry, and significant research efforts are reported in the literature. However, less emphasis has been given to liveness detection for voice biometrics, i.e., Automatic Speaker Verification (ASV). Voice Liveness Detection (VLD) can be a potential technique to detect spoofing attacks on an ASV system. The presence of pop noise in the speech signal of a live speaker provides the discriminative acoustic cue to distinguish genuine vs. spoofed speech in the framework of VLD. Pop noise comes out as a burst at the lips, which is captured by the ASV system (since the speaker and microphone are close enough), indicating the liveness of the speaker and providing the basis of VLD. In this paper, we present a Constant-Q Transform (CQT)-based approach over the traditional Short-Time Fourier Transform (STFT)-based algorithm (baseline). With respect to Heisenberg's uncertainty principle in the signal processing framework, the CQT has variable spectro-temporal resolution, in particular, better frequency resolution in the low-frequency region and better temporal resolution in the high-frequency region, which can be effectively utilized to identify the low-frequency characteristics of pop noise. We have also compared the proposed algorithm with cepstral features, namely, Linear Frequency Cepstral Coefficients (LFCC) and Constant-Q Cepstral Coefficients (CQCC). The experiments are performed on the recently released POp noise COrpus (POCO) dataset with various statistical, discriminative, and deep learning-based classifiers, namely, the Gaussian Mixture Model (GMM), Support Vector Machine (SVM), Convolutional Neural Network (CNN), Light-CNN (LCNN), and Residual Network (ResNet). A significant improvement in performance, in particular, an absolute improvement of 14.23% and 10.95% in percentage classification accuracy on the development and evaluation sets, respectively, is obtained for the proposed CQT-based algorithm with the SVM classifier over the STFT-SVM (baseline) system. A similar trend of performance improvement is observed for the GMM, CNN, LCNN, and ResNet classifiers with the proposed CQT-based algorithm vs. the traditional STFT-based algorithm. The analysis is further extended by simulating the replay mechanism (in the standard framework of the ASVSpoof 2019 PA challenge dataset) on a subset of the POCO dataset in order to observe the effect of room acoustics on the performance of the VLD system. By embedding the moderate simulated replay mechanism in the POCO dataset, we obtained a percentage classification accuracy of 97.82% on the evaluation set.
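To make the front-end contrast concrete, a minimal sketch with librosa comparing the STFT and CQT magnitude representations is given below. The file path, sampling rate, and bin settings are illustrative assumptions, not the paper's configuration; POCO data handling and the classifiers are omitted.

```python
import numpy as np
import librosa

# "utterance.wav" is a placeholder path; POCO corpus handling is omitted.
y, sr = librosa.load("utterance.wav", sr=16000)

# STFT (baseline): fixed time-frequency resolution at all frequencies.
stft_mag = np.abs(librosa.stft(y, n_fft=512, hop_length=256))

# CQT: geometrically spaced bins give finer frequency resolution at low
# frequencies (where pop-noise energy concentrates) and finer temporal
# resolution at high frequencies.
cqt_mag = np.abs(librosa.cqt(y, sr=sr, hop_length=256,
                             fmin=librosa.note_to_hz("C1"),
                             n_bins=84, bins_per_octave=12))

# Log-compressed magnitudes would then feed an SVM/GMM/CNN classifier.
log_cqt = librosa.amplitude_to_db(cqt_mag, ref=np.max)
```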

Significance of Higher-Order Spectral Analysis in Infant Cry Classification
(Springer, 01-01-2018) Chittora, Anshu; Patil, Hemant; DA-IICT, Gandhinagar; Chittora, Anshu (201021012)
In this paper, higher-order spectral analysis is applied to infant cry signals to classify normal infant cries vs. pathological infant cries. From the family of higher-order spectra, the bispectrum is considered for the proposed task. The bispectrum is the Fourier transform of the third-order cumulant function. To extract features from the bispectrum, the application of the higher-order singular value decomposition (HOSVD) theorem is proposed. Experimental results show an average classification accuracy of … and a Matthews correlation coefficient (MCC) of 0.62 with the proposed bispectrum features. In all of the experiments reported in this paper, a support vector machine with a radial basis function kernel is used as the pattern classifier. The performance of the proposed features is also compared with state-of-the-art feature sets, such as linear frequency cepstral coefficients, mel frequency cepstral coefficients, perceptual linear prediction coefficients, linear prediction coefficients, linear prediction cepstral coefficients, and perceptual linear prediction cepstral coefficients, and is found to be better than that given by these feature sets. The proposed bispectrum-based features are shown to be robust under signal degradation or noisy conditions at various SNR levels. Performance in the presence of noise is compared with the state-of-the-art spectral feature sets using MCC scores. In addition, the effectiveness of cry-unit segmentation in the normal vs. pathological infant cry classification task is reported.
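As a rough illustration of the quantity the features are built from, here is a minimal direct (FFT-based) bispectrum estimate in NumPy. The segment length and averaging scheme are illustrative assumptions, and the HOSVD feature extraction itself is not reproduced.

```python
import numpy as np

def bispectrum(x: np.ndarray, nfft: int = 128) -> np.ndarray:
    """Direct (FFT-based) bispectrum estimate, averaged over segments.

    B(f1, f2) = E[X(f1) * X(f2) * conj(X(f1 + f2))], i.e., the Fourier
    transform of the third-order cumulant; it is non-trivial only for
    non-Gaussian, phase-coupled signal structure.
    """
    x = np.asarray(x, dtype=float)
    n_seg = len(x) // nfft
    idx = np.add.outer(np.arange(nfft), np.arange(nfft)) % nfft  # f1 + f2 (mod nfft)
    B = np.zeros((nfft, nfft), dtype=complex)
    for k in range(n_seg):
        seg = x[k * nfft:(k + 1) * nfft]
        X = np.fft.fft(seg - seg.mean())   # remove the mean before transforming
        B += np.outer(X, X) * np.conj(X[idx])
    return B / max(n_seg, 1)
```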

Replay Spoof Detection Using Energy Separation Based Instantaneous Frequency Estimation From Quadrature and In-Phase Components
(Elsevier, 01-01-2023) Gupta, Priyanka; Chodingala, Piyush; Patil, Hemant; DA-IICT, Gandhinagar; Gupta, Priyanka (201721001); Chodingala, Piyush (202015002)
Replay attacks on speech are becoming easier to mount with the advent of high-quality recording and playback devices. This makes replay attacks a major concern for the security of Automatic Speaker Verification (ASV) systems and voice assistants. In the past, auditory transform-based as well as Instantaneous Frequency (IF)-based features have been proposed for replay spoofed speech detection (SSD). In this context, IF has been estimated either as the derivative of the analytic phase via the Hilbert transform, or by using the high temporal resolution Teager Energy Operator (TEO)-based Energy Separation Algorithm (ESA). However, the excellent temporal resolution of ESA comes at the cost of relative phase information, and vice versa. To that effect, we propose novel Cochlear Filter Cepstral Coefficients-based Instantaneous Frequency using Quadrature Energy Separation Algorithm (CFCCIF-QESA) features, with excellent temporal resolution as well as relative phase information. CFCCIF-QESA is designed by exploiting the relative phase shift to estimate IF, without estimating the phase explicitly from the signal. To motivate and validate the effectiveness of the proposed QESA approach for IF estimation, we employ information-theoretic measures, such as Mutual Information (MI), Kullback-Leibler (KL) divergence, and Jensen-Shannon (JS) divergence. The proposed CFCCIF-QESA feature set is extensively evaluated on the standard, statistically meaningful ASVSpoof 2017 version 2.0 dataset, where it achieves improved performance compared to the CFCCIF-ESA and CQCC feature sets on GMM, CNN, and LCNN classifiers. Furthermore, in cross-database evaluation using ASVSpoof 2017 v2.0 and VSDC, CFCCIF-QESA also performs relatively better than CFCCIF-ESA and CQCC with the GMM classifier. However, for self-classification on the ASVSpoof 2019 PA data, CFCCIF-QESA outperforms only CFCCIF-ESA, whereas on the BTAS 2016 dataset it performs close to CFCCIF-ESA. Finally, results are presented for the case when the ASV system is not under attack.
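To make the ESA baseline concrete: the classical TEO-based energy separation (the DESA-1 algorithm of Maragos, Kaiser, and Quatieri) estimates instantaneous frequency from discrete differences, as sketched below. The paper's quadrature variant (QESA) and the cochlear-filter front end are not reproduced here.

```python
import numpy as np

def teo(x: np.ndarray) -> np.ndarray:
    """Teager Energy Operator: psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa1_frequency(x: np.ndarray, fs: float) -> np.ndarray:
    """DESA-1 instantaneous-frequency estimate (Hz), from the TEO of the
    signal and of its first difference."""
    y = np.diff(x)                        # y(n) = x(n) - x(n-1)
    psi_x = teo(x)
    psi_y = teo(y)
    num = psi_y[:-1] + psi_y[1:]          # psi[y(n)] + psi[y(n+1)]
    L = min(len(psi_x), len(num))
    cos_omega = 1.0 - num[:L] / (4.0 * psi_x[:L] + 1e-12)
    return np.arccos(np.clip(cos_omega, -1.0, 1.0)) * fs / (2 * np.pi)

# Toy check: a pure 500 Hz tone at fs = 8 kHz is recovered almost exactly.
fs, f0 = 8000, 500.0
t = np.arange(2000) / fs
print(desa1_frequency(np.sin(2 * np.pi * f0 * t), fs).mean())  # ~500.0
```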

Music footprint recognition via sentiment, identity, and setting identification
(Springer, 01-07-2022) Phatnani, Kirtana Sunil; Patil, Hemant; DA-IICT, Gandhinagar
Emotional contagion is said to occur when an origin (i.e., any sensory stimulus) emanating emotions causes the observer to feel the same emotions. In this paper, we explore the identification and quantification of emotional contagion produced by music in human beings. We survey 50 subjects, who answer what type of music they listen to when they are happy, excited, sad, angry, and affectionate. In the analysis of the distribution, we observe that the emotional state of the subjects predominantly influences the choice of tempo of the musical piece. We define the footprint in three dimensions, namely, sentiment, time, and identification. We unpack each song by unraveling sentiment analysis in time, using lexicons and tenses, along with identity via the pronouns used. In this study, we wish to quantify and visualize the emotional journey of the listener through music. The results can be extended to the elicitation of emotional contagion within any story, poem, or conversation.

Data Collection of Infant Cries for Research and Analysis
(Elsevier, 01-03-2017) Chittora, Anshu; Patil, Hemant; DA-IICT, Gandhinagar; Chittora, Anshu (201021012)
Analysis of infant cries may help in identifying the needs of infants, such as hunger, pain, and sickness, and thereby support the development of a tool or mobile application that can help parents monitor the needs of their infant. Analysis of the cries of infants suffering from neurological disorders and severe diseases, which can later result in motor and mental handicap, may prove helpful in the early diagnosis of pathologies and protect infants from such disorders. The development of an infant cry corpus is necessary for the analysis of infant cries and for the development of infant cry tools. An infant cry database is not available commercially for research, which limits the scope of research in this area. Because cry characteristics change with many factors, such as the reason for crying and the infant's health, weight, and age, care is required while designing a corpus for a particular research application of infant cry analysis and classification. In this paper, the ideal characteristics of such a corpus are proposed, along with the factors influencing infant cry characteristics, and experiences during data collection are shared. This study may help other researchers build an infant cry corpus for their specific problem of study. Justification of the proposed characteristics is also given, along with suitable examples.

Partial matching and search space reduction for QbE-STD
(Elsevier, 01-09-2017) Madhavi, Maulik C; Patil, Hemant; DA-IICT, Gandhinagar; Madhavi, Maulik C (200911036)
The Query-by-Example approach to spoken content retrieval has gained much attention because of its feasibility in the absence of speech recognition and its applicability in multilingual matching scenarios. This approach to retrieving spoken content is referred to as Query-by-Example Spoken Term Detection (QbE-STD). The state-of-the-art QbE-STD system performs matching between the frame sequences of the query and the test utterance via the Dynamic Time Warping (DTW) algorithm. In realistic scenarios, there is a need to retrieve a query that does not appear exactly in the spoken document; the appearing instance of the query may have a different suffix, prefix, or word order. The DTW algorithm monotonically aligns the two sequences and hence is not suitable for partial matching between the frame sequences of the query and the test utterance. In this paper, we propose a novel partial matching approach between the spoken query and the utterance using a modified DTW algorithm, in which multiple warping paths are constructed for each query and test utterance pair. Next, we address the search complexity of DTW and suggest two approaches, namely, a feature reduction approach and a Bag-of-Acoustic-Words (BoAW) model. In the feature reduction approach, the number of feature vectors is reduced by averaging across consecutive frames within phonetic boundaries. A smaller number of feature vectors requires fewer comparisons, and hence DTW speeds up the search computation: the search computation time is reduced by 46-49% with a slight degradation in performance compared to the no-feature-reduction case. In the BoAW model, we construct term frequency-inverse document frequency (tf-idf) vectors at the segment level to retrieve audio documents. The proposed segment-level BoAW model matches a test utterance with a query using tf-idf vectors, and the scores obtained are used to rank the test utterances. The BoAW model gave more than 80% recall at 70% top retrieval. To re-score the detections, we further employ the DTW search or the modified DTW search to retrieve the spoken query from the utterances selected by the BoAW model. QbE-STD experiments are conducted on international benchmarks, namely, MediaEval Spoken Web Search (SWS) 2013 and MediaEval Query-by-Example Search on Speech (QUESST) 2014.
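For context, here is a minimal sketch of the standard monotonic DTW matching that the paper modifies. The cosine frame distance and length normalization are common choices but are assumptions here, and the multiple-warping-path partial matching is not reproduced.

```python
import numpy as np

def dtw_distance(query: np.ndarray, utterance: np.ndarray) -> float:
    """Standard DTW between two feature sequences (frames x dims).

    Monotonic alignment of the full query against the full utterance; this
    is the baseline that partial-matching variants relax.
    """
    # Pairwise cosine distances between query frames and utterance frames.
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-12)
    u = utterance / (np.linalg.norm(utterance, axis=1, keepdims=True) + 1e-12)
    cost = 1.0 - q @ u.T
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    return acc[n, m] / (n + m)   # length-normalized alignment cost
```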

Auditory feature representation using convolutional restricted Boltzmann machine and Teager energy operator for speech recognition
(AIP, 01-06-2017) Sailor, Hardik B; Patil, Hemant; DA-IICT, Gandhinagar; Sailor, Hardik B (201321002)
In addition, for a hexagonal geometry, when the numbers of simultaneously communicating pairs differ across adjacent cells, the optimum number of gateways per cell that maximizes the system capacity is not a fixed number for all cases, but varies between a minimum value of one and a maximum value of six, depending on the number of cells that have two simultaneously communicating pairs.

Vulnerability Issues in Automatic Speaker Verification (ASV) Systems
(ACM DL, 10-02-2024) Gupta, Priyanka; Guido, Rodrigo Capobianco; Patil, Hemant; DA-IICT, Gandhinagar; Gupta, Priyanka (201721001)
The claimed identities of speakers can be verified by means of automatic speaker verification (ASV) systems, also known as voice biometric systems. Focusing on security and robustness against spoofing attacks on ASV systems, and observing that investigating the attacker's perspective can lead the way to preventing known and unknown threats, several countermeasures (CMs) have been proposed during the ASVspoof 2015, 2017, 2019, and 2021 challenge campaigns organized at INTERSPEECH conferences. Furthermore, there is a recent initiative to organize the ASVspoof 5 challenge with the objectives of collecting massive spoofing/deepfake attack data (phase 1) and designing a spoofing-aware ASV system that uses a single classifier for both ASV and CM, i.e., an integrated CM-ASV solution (phase 2). To that effect, this paper presents a survey of the diverse strategies and vulnerabilities explored to successfully attack an ASV system, such as target selection, the unavailability of global countermeasures to reduce the attacker's chances of exploiting weaknesses, state-of-the-art adversarial attacks based on machine learning, and deepfake generation. This paper also covers the possibility of attacks such as hardware attacks on ASV systems. Finally, we discuss several technological challenges from the attacker's perspective, which can be exploited to design better defence mechanisms for the security of ASV systems.

Residual Neural Network precisely quantifies dysarthria severity-level based on short-duration speech segments
(Elsevier, 01-07-2021) Gupta, Siddhant; Patil, Ankur T; Purohit, Mirali; Patel, Maitreya; Guido, Rodrigo Capobianco; Patil, Hemant; DA-IICT, Gandhinagar; Gupta, Siddhant (201911007); Patil, Ankur T (201621008); Purohit, Mirali (201811067); Patel, Maitreya (201601160)
Recently, we have witnessed Deep Learning methodologies gaining significant attention for severity-based classification of dysarthric speech. Detecting dysarthria and quantifying its severity are of paramount importance in various real-life applications, such as assessing patients' progress in treatment, which includes adequate planning of their therapy, and improving speech-based interactive systems so that they handle pathologically affected voices automatically. Notably, current speech-powered tools often deal with short-duration speech segments and, consequently, are less efficient in dealing with impaired speech, even when using Convolutional Neural Networks (CNNs). Thus, detecting dysarthria severity-level from short speech segments might help improve the performance and applicability of those systems. To achieve this goal, we propose a novel Residual Network (ResNet)-based technique which receives short-duration speech segments as input. A statistically meaningful objective analysis of our experiments, reported on the standard Universal Access corpus, exhibits average improvements of 21.35% and 22.48% over the baseline CNN in terms of classification accuracy and F1-score, respectively. For additional comparison, tests with Gaussian Mixture Models and Light CNNs were also performed. Overall, values of 98.90% and 98.00% for classification accuracy and F1-score, respectively, were obtained with the proposed ResNet approach, confirming its efficacy and practical applicability.
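As a rough sketch of the architectural ingredient named above, here is a minimal PyTorch residual block; the channel count and input shape are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal 2-D residual block of the kind ResNets stack.

    The identity skip connection lets gradients bypass the convolutions,
    which is what allows deeper models than a plain CNN baseline.
    """
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # add the identity skip connection

# A short-duration segment as a (batch, channels, freq, time) feature patch.
x = torch.randn(1, 32, 40, 25)
print(ResidualBlock(32)(x).shape)   # torch.Size([1, 32, 40, 25])
```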
 