Publications

Permanent URI for this collection: https://ir.daiict.ac.in/handle/123456789/32


Search Results

Now showing 1 - 10 of 19
  • Publication
    Combining evidences from magnitude and phase information using VTEO for person recognition using humming
    (Elsevier, 01-11-2018) Madhavi, Maulik C; Patil, Hemant; DA-IICT, Gandhinagar; Madhavi, Maulik C (200911036)
    Most state-of-the-art speaker recognition systems use natural speech signals (i.e., real, spontaneous, or contextual speech) from the subjects. In this paper, recognition of a person is attempted from his or her hum with the help of machines. This kind of application can be useful for designing a person-dependent Query-by-Humming (QBH) system and hence plays an important role in music information retrieval (MIR) systems. In addition, it can also be useful for other interesting speech technology applications such as human-computer interaction, speech prosody analysis of disordered speech, and speaker forensics. This paper develops a new feature extraction technique to exploit perceptually meaningful phase spectrum information (due to mel frequency warping that imitates the human hearing process) along with magnitude spectrum information from the hum signal. In particular, the structure of the state-of-the-art feature set, namely, Mel Frequency Cepstral Coefficients (MFCCs), is modified to capture the phase spectrum information. In addition, a new energy measure, namely, the Variable-length Teager Energy Operator (VTEO), is employed to compute subband energies of different time-domain subband signals (i.e., the outputs of the 24 triangular-shaped filters used in the mel filterbank). We refer to this proposed feature set as MFCC-VTMP (i.e., mel frequency cepstral coefficients that capture perceptually meaningful magnitude and phase information via VTEO). The polynomial classifier (which is in principle similar to other discriminatively trained classifiers, such as the support vector machine (SVM) with a polynomial kernel) is used as the basis for all the experiments. The effectiveness of the proposed feature set is evaluated and consistently found to be better than the MFCC feature set across several evaluation factors, such as comparison with other phase-based features, the order of the polynomial classifier, the person (speaker) modeling approach (such as GMM-UBM and i-vector), the dimension of the feature vector, robustness under signal degradation conditions, static vs. dynamic features, feature discrimination measures, and intersession variability.
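A minimal sketch of the variable-length Teager energy idea described above, assuming the common definition psi_k[x(n)] = x^2(n) - x(n-k)*x(n+k). The crude FFT-based band splitting below is only a stand-in for the 24-filter mel filterbank; the frame length, dependency index k, and log compression are illustrative assumptions, not the authors' exact MFCC-VTMP pipeline.

```python
import numpy as np


def vteo(x, k=1):
    """Variable-length Teager Energy Operator: psi_k[x(n)] = x(n)^2 - x(n-k)*x(n+k)."""
    x = np.asarray(x, dtype=float)
    psi = np.zeros_like(x)
    psi[k:-k] = x[k:-k] ** 2 - x[:-2 * k] * x[2 * k:]
    return psi


def subband_vteo_energies(frame, num_bands=24, k=2):
    """Mean VTEO energy per crude FFT subband (stand-in for mel-filter outputs)."""
    spectrum = np.fft.rfft(frame)
    edges = np.linspace(0, len(spectrum), num_bands + 1, dtype=int)
    energies = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.zeros_like(spectrum)
        band[lo:hi] = spectrum[lo:hi]              # keep a single band
        band_signal = np.fft.irfft(band, n=len(frame))
        energies.append(np.mean(np.abs(vteo(band_signal, k))))
    return np.log(np.array(energies) + 1e-12)      # log energies, as in MFCC


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal(400)               # stand-in for a 25 ms frame at 16 kHz
    print(subband_vteo_energies(frame).shape)      # (24,)
```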
  • Publication
    Significance of Higher-Order Spectral Analysis in Infant Cry Classification
    (Springer, 01-01-2018) Chittora, Anshu; Patil, Hemant; DA-IICT, Gandhinagar; Chittora, Anshu (201021012)
    In this paper, higher-order spectral analysis is applied to infant cry signals for the classification of normal versus pathological infant cries. From the family of higher-order spectra, the bispectrum is considered for the proposed task. The bispectrum is the Fourier transform of the third-order cumulant function. To extract features from the bispectrum, the application of the higher-order singular value decomposition theorem is proposed. Experimental results show an average classification accuracy of … and a Matthews correlation coefficient (MCC) of 0.62 with the proposed bispectrum features. In all of the experiments reported in this paper, a support vector machine with a radial basis function kernel is used as the pattern classifier. Performance of the proposed features is also compared with state-of-the-art methods such as linear frequency cepstral coefficients, Mel frequency cepstral coefficients, perceptual linear prediction coefficients, linear prediction coefficients, linear prediction cepstral coefficients, and perceptual linear prediction cepstral coefficients, and is found to be better than that given by these feature sets. The proposed bispectrum-based features are shown to be robust under signal degradation or noisy conditions at various SNR levels. Performance in the presence of noise is compared with the state-of-the-art spectral feature sets using MCC scores. In addition, the effectiveness of cry-unit segmentation in the normal versus pathological infant cry classification task is reported.
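A hedged sketch of a direct (FFT-averaging) bispectrum estimate of the kind the abstract builds on, i.e., the Fourier transform of the third-order cumulant. The segment length, window, and normalization are illustrative choices, and the HOSVD-based feature extraction is not shown.

```python
import numpy as np


def bispectrum(x, nfft=128, hop=64):
    """Direct bispectrum estimate: average X(f1) * X(f2) * conj(X(f1 + f2)) over segments."""
    x = np.asarray(x, dtype=float)
    win = np.hanning(nfft)
    half = nfft // 2
    f1, f2 = np.meshgrid(np.arange(half), np.arange(half))
    acc = np.zeros((half, half), dtype=complex)
    count = 0
    for start in range(0, len(x) - nfft + 1, hop):
        seg = x[start:start + nfft] * win
        X = np.fft.fft(seg - seg.mean())           # remove the segment mean
        acc += X[f1] * X[f2] * np.conj(X[f1 + f2])
        count += 1
    return acc / max(count, 1)


if __name__ == "__main__":
    t = np.arange(4096) / 8000.0
    toy_cry = np.sin(2 * np.pi * 400 * t) + 0.5 * np.sin(2 * np.pi * 800 * t)
    B = bispectrum(toy_cry)
    print(B.shape, float(np.abs(B).max()))
```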
  • Publication
    A novel approach to remove outliers for parallel voice conversion
    (Elsevier, 01-11-2019) Shah, Nirmesh J; Patil, Hemant; DA-IICT, Gandhinagar; Shah, Nirmesh J (201321009)
    Alignment is a key step before learning a mapping function between a source and a target speaker's spectral features in various state-of-the-art parallel-data Voice Conversion (VC) techniques. After alignment, some corresponding pairs are still inconsistent with the rest of the data and are considered outliers. These outliers shift the parameters of the mapping function from their true values and hence negatively affect the learning of the mapping function during the training phase of the VC task. To the best of the authors' knowledge, the effect of outliers (and hence, their removal) on the quality of the converted voice has not been explored much in the VC literature. Recent research has shown the effectiveness of outlier removal as a pre-processing step in VC. In this paper, we extend this study with detailed theory and analysis. The proposed method uses a score distance, estimated via Robust Principal Component Analysis (ROBPCA), to detect the outliers. In particular, the outliers are determined using a fixed cut-off on the score distances, based on the degrees of freedom of a chi-squared distribution, which is speaker-pair independent. The fixed cut-off rests on the assumption that the score distances follow the normal (i.e., Gaussian) distribution. However, this is a weak statistical assumption even when many samples are available. Hence, in this paper, we propose to explore speaker-pair-dependent cut-offs to detect the outliers. In addition, we present our results on two state-of-the-art databases, namely, CMU-ARCTIC and Voice Conversion Challenge (VCC) 2016, by developing various state-of-the-art VC methods. In particular, we present the effectiveness of outlier removal on Gaussian Mixture Model (GMM), Artificial Neural Network (ANN), and Deep Neural Network (DNN)-based VC techniques. Furthermore, we present subjective and objective evaluations using a 95% confidence interval for the statistical significance of the tests. We obtained an average 0.56% relative reduction in Mel Cepstral Distortion (MCD) with the proposed outlier removal approach as a pre-processing step. In particular, with the proposed speaker-pair-dependent cut-off, we observed relative improvements of 12.24% and 30.51% in speech quality, and absolute improvements of 39.7% and 4.27% in speaker similarity, for CMU-ARCTIC and VCC 2016, respectively.
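A rough sketch of the outlier-screening step discussed above. ROBPCA itself is not available in scikit-learn, so a robust covariance estimate (MinCovDet) stands in for it here; the chi-squared rule mirrors the fixed, speaker-pair-independent cut-off, while the quantile rule illustrates a speaker-pair-dependent alternative. Feature dimensions and thresholds are assumptions, not the paper's settings.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet


def outlier_mask(src_feats, tgt_feats, quantile=0.975, pair_dependent=True):
    """Return a boolean mask of aligned frame pairs to KEEP for training."""
    joint = np.hstack([src_feats, tgt_feats])           # aligned source-target pairs
    robust = MinCovDet(random_state=0).fit(joint)
    d2 = robust.mahalanobis(joint)                      # squared robust distances
    if pair_dependent:
        cutoff = np.quantile(d2, quantile)              # adapts to this speaker pair
    else:
        cutoff = chi2.ppf(quantile, df=joint.shape[1])  # fixed, distribution-based
    return d2 <= cutoff


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    src = rng.standard_normal((500, 4))
    tgt = src + 0.1 * rng.standard_normal((500, 4))
    tgt[:20] += 5.0                                     # simulated misalignments
    keep = outlier_mask(src, tgt)
    print(keep.sum(), "of", len(keep), "pairs kept")
```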
  • Publication
    Vocal Tract Length Normalization using a Gaussian Mixture Model Framework for Query-by-Example Spoken Term Detection
    (Elsevier, 01-11-2019) Madhavi, Maulik C; Patil, Hemant; DA-IICT, Gandhinagar
    In this work, we explored hierarchical MoS2 nanomaterials for soil moisture sensing (SMS) and tested their efficacy considering the operational aspects of the sensor. Carnation- and marigold-flower-like MoS2 nanostructures were prepared via facile hydrothermal processes with varying synthesis temperatures. The synthesized MoS2 nanostructures were well characterized by XRD, FTIR, FESEM, EDS, and HRTEM, and it is evident that the variation in hydrothermal temperature has a significant impact on the crystallinity, morphology, stoichiometry, dimensions, and lattice spacing. We found that hierarchical MoS2 marigold-flower-like nanostructures offer the highest sensitivity of about 2000 % when the gravimetric water content (GWC) is varied from 1 % to 20 % GWC, which is one of the highest reported for SMS. The sensors exhibit hysteresis of about ±4 % and response times of about 500 s. They were highly selective to moisture compared to other salts, such as Na, K, Cd, and Cu, present in the soil. The sensors were also unaffected by changing temperatures, with only a small 2–4 % variation between 20 °C and 65 °C.
  • Publication
    Design of Mixture of GMMs for Query-by-Example Spoken Term Detection
    (Elsevier, 04-05-2018) Madhavi, Maulik C; Patil, Hemant; DA-IICT, Gandhinagar; Madhavi, Maulik C (200911036)
    This paper presents the design of a mixture of Gaussian Mixture Models (GMMs) for Query-by-Example Spoken Term Detection (QbE-STD). Speech data contains acoustically similar broad phonetic structures. To capture this broad phonetic structure, we exploit additional information from broad phoneme classes (such as vowels, semi-vowels, nasals, fricatives, and plosives) for the training of the GMM. The mixture of GMMs is tied to the GMMs of these broad phoneme classes, i.e., each GMM expresses the probability density function (pdf) of a broad phoneme category. The Expectation Maximization (EM) algorithm is used to obtain the GMM for each broad phoneme class. Thus, a mixture of GMMs represents the spoken query with broad phonetic constraints. These constraints restrict the posterior probability within the broad class, which results in a better posteriorgram design. The novelty of our work lies in the prior probability assignments (as weights of the mixture of GMMs) for better Gaussian posteriorgram design. The proposed simple yet effective posteriorgram outperforms the Gaussian posteriorgram because of the implicit constraints supplied by broad phonetic posteriors. The Maximum Term Weighted Value (MTWV) for the SWS 2013 dataset is improved by 0.052 and 0.051 w.r.t. the Gaussian posteriorgram for Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP) features, respectively. We found that the proposed mixture-of-GMMs approach gave consistently better performance than the Gaussian posteriorgram across various evaluation factors, such as different cepstral representations, the number of Gaussian components, the number of spoken examples per query, and the amount of labeled data used for broad phoneme posterior computation.
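A small sketch of the tied, class-wise GMM posteriorgram idea from the abstract, assuming scikit-learn's GaussianMixture for the per-class EM training. The broad classes, priors taken from class frequencies, feature dimension, and mixture sizes are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

BROAD_CLASSES = ["vowel", "semivowel", "nasal", "fricative", "plosive"]


def train_class_gmms(features_by_class, components=4):
    """EM-train one GMM per broad phoneme class and estimate class priors."""
    gmms, priors = {}, {}
    total = sum(len(f) for f in features_by_class.values())
    for name, feats in features_by_class.items():
        gmms[name] = GaussianMixture(n_components=components,
                                     covariance_type="diag",
                                     random_state=0).fit(feats)
        priors[name] = len(feats) / total            # weight of this class's GMM
    return gmms, priors


def posteriorgram(frames, gmms, priors):
    """Per-frame posterior probability of each broad class (frames x classes)."""
    log_lik = np.stack([np.log(priors[c]) + gmms[c].score_samples(frames)
                        for c in BROAD_CLASSES], axis=1)
    log_lik -= log_lik.max(axis=1, keepdims=True)    # stabilise the softmax
    post = np.exp(log_lik)
    return post / post.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    data = {c: rng.standard_normal((200, 13)) + i for i, c in enumerate(BROAD_CLASSES)}
    gmms, priors = train_class_gmms(data)
    P = posteriorgram(rng.standard_normal((50, 13)), gmms, priors)
    print(P.shape, P.sum(axis=1)[:3])                # rows sum to 1
```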
  • Publication
    Data Collection of Infant Cries for Research and Analysis
    (Elsevier, 01-03-2017) Chittora, Anshu; Patil, Hemant; DA-IICT, Gandhinagar; Chittora, Anshu (201021012)
    Analysis of infant cries may help in identifying the needs of infants, such as hunger, pain, and sickness, and thereby in developing a tool or a mobile application that can help parents monitor the needs of their infant. Analysis of the cries of infants who are suffering from neurological disorders and severe diseases, which can later result in motor and mental handicap, may prove helpful in the early diagnosis of pathologies and protect infants from such disorders. The development of an infant cry corpus is necessary for the analysis of infant cries and for the development of infant cry tools. An infant cry database is not commercially available for research, which limits the scope of research in this area. Because cry characteristics change with many factors, such as the reason for crying and the infant's health, weight, and age, care is required while designing a corpus for a particular research application of infant cry analysis and classification. In this paper, the ideal characteristics of such a corpus are proposed along with the factors influencing infant cry characteristics, and experiences during data collection are shared. This study may help other researchers build an infant cry corpus for their specific problem of study. Justification of the proposed characteristics is also given, along with suitable examples.
  • Publication
    Partial matching and search space reduction for QbE-STD
    (Elsevier, 01-09-2017) Madhavi, Maulik C; Patil, Hemant; DA-IICT, Gandhinagar; Madhavi, Maulik C (200911036)
    The Query-by-Example approach to spoken content retrieval has gained much attention because of its feasibility in the absence of speech recognition and its applicability in multilingual matching scenarios. This approach to retrieving spoken content is referred to as Query-by-Example Spoken Term Detection (QbE-STD). The state-of-the-art QbE-STD system performs matching between the frame sequences of the query and the test utterance via the Dynamic Time Warping (DTW) algorithm. In realistic scenarios, there is a need to retrieve a query that does not appear exactly in the spoken document; the instance that does appear might have a different suffix, prefix, or word order. The DTW algorithm monotonically aligns the two sequences and hence is not suitable for partial matching between the frame sequences of the query and the test utterance. In this paper, we propose a novel partial matching approach between the spoken query and the utterance using a modified DTW algorithm, in which multiple warping paths are constructed for each query and test utterance pair. Next, we address the research issue associated with the search complexity of DTW and suggest two approaches, namely, a feature reduction approach and a Bag-of-Acoustic-Words (BoAW) model. In the feature reduction approach, the number of feature vectors is reduced by averaging across consecutive frames within phonetic boundaries. Fewer feature vectors require fewer comparisons, and hence DTW speeds up the search computation. The search computation time is reduced by 46–49% with a slight degradation in performance as compared to the no-feature-reduction case. In the BoAW model, we construct term frequency-inverse document frequency (tf-idf) vectors at the segment level to retrieve audio documents. The proposed segment-level BoAW model is used to match the test utterance with a query using tf-idf vectors, and the scores obtained are used to rank the test utterances. The BoAW model gave more than 80% recall at 70% top retrieval. To re-score the detections, we further employ the DTW search or the modified DTW search to retrieve the spoken query from the utterances selected by the BoAW model. QbE-STD experiments are conducted on different international benchmarks, namely, MediaEval Spoken Web Search (SWS) 2013 and MediaEval Query-by-Example Search on Speech (QUESST) 2014.
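A brief sketch of the feature-reduction step described above: consecutive frames inside the same (phone-like) segment are averaged into a single vector before the DTW search. The boundary list here is hypothetical; the paper's phonetic segmentation and its modified multi-path DTW are not reproduced.

```python
import numpy as np


def reduce_features(frames, boundaries):
    """Average consecutive feature vectors inside each [start, end) segment."""
    reduced = [frames[s:e].mean(axis=0) for s, e in boundaries if e > s]
    return np.vstack(reduced)


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    frames = rng.standard_normal((100, 39))          # e.g. MFCC + deltas
    # hypothetical segment boundaries (would come from a phonetic segmenter)
    cuts = [0, 12, 30, 55, 78, 100]
    boundaries = list(zip(cuts[:-1], cuts[1:]))
    reduced = reduce_features(frames, boundaries)
    print(frames.shape, "->", reduced.shape)         # far fewer DTW comparisons
```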
  • Publication
    Auditory feature representation using convolutional restricted Boltzmann machine and Teager energy operator for speech recognition
    (AIP, 01-06-2017) Sailor, Hardik B; Patil, Hemant; DA-IICT, Gandhinagar; Sailor, Hardik B (201321002)
    In addition, for a hexagonal geometry, when the number of simultaneously communicating pairs differs across adjacent cells, the optimum number of gateways per cell that maximizes the system capacity is not a fixed number for all cases, but varies between a minimum value of one and a maximum value of six, depending on the number of cells that have two simultaneously communicating pairs.
  • Publication
    Effectiveness of Teager energy operator for epoch detection from speech signals
    (Springer, 01-12-2011) Viswanath, Srikant; Patil, Hemant; DA-IICT, Gandhinagar; Viswanath, Srikant (200701125)
    In this paper, we present the problem of epoch detection from a different perspective that deals not only with the estimation of epoch instants (i.e., glottal activity) but also with quantification of the absence of epochs (i.e., no glottal activity) in the unvoiced regions of the speech signal. Most epoch detection methods perform significantly well in the voiced regions of speech but are not robust enough in the unvoiced regions, i.e., they detect a number of pseudo epochs there. We propose a simple method based on the Teager Energy Operator (TEO) which not only determines the epochs in voiced regions (due to its superior temporal resolution and its ability to capture airflow properties through the glottis) but is also very effective in unvoiced regions. Recently proposed methods such as the 0-Hz resonator-based method and the DYPSA method gave a combined rate (CR) (for detecting epochs in voiced and unvoiced regions of speech) of 74.7% and 60%, respectively, and a pseudo epoch rate (PER) (i.e., spurious epochs in the unvoiced regions of speech) of 62.9% and 54.04%, respectively. On the other hand, our proposed method gave a CR of 87% and a PER of 0.27%. This result suggests that the proposed method captures glottal activity more efficiently in both voiced and unvoiced regions of the speech signal. The performance of the proposed method is demonstrated on the publicly available CMU-ARCTIC database, using epoch information from the electro-glottograph (EGG) signal as ground truth for the estimation of glottal closure instants (GCIs). Due to the noise suppression capability of TEO, the proposed method is largely unaffected (i.e., robust) under signal degradations such as white, babble, high-frequency, and vehicle noise, as compared to the 0-Hz resonator and DYPSA methods.
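A minimal sketch of the discrete Teager Energy Operator, psi[x(n)] = x^2(n) - x(n-1)*x(n+1), followed by simple smoothing and peak-picking to expose candidate glottal activity. The smoothing window and threshold are illustrative choices and do not reproduce the paper's exact epoch-detection procedure.

```python
import numpy as np


def teo(x):
    """Discrete Teager Energy Operator: psi[x(n)] = x(n)^2 - x(n - 1) * x(n + 1)."""
    x = np.asarray(x, dtype=float)
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi


def candidate_epochs(x, fs=16000, win_ms=2.0, rel_threshold=0.1):
    """Sample indices where the smoothed TEO profile peaks above a relative threshold."""
    energy = np.abs(teo(x))
    win = max(int(fs * win_ms / 1000.0), 1)
    smooth = np.convolve(energy, np.ones(win) / win, mode="same")
    thresh = rel_threshold * smooth.max()
    inner = smooth[1:-1]
    peaks = (inner > smooth[:-2]) & (inner >= smooth[2:]) & (inner > thresh)
    return np.flatnonzero(peaks) + 1               # candidate glottal closure instants


if __name__ == "__main__":
    fs = 16000
    t = np.arange(int(0.05 * fs)) / fs             # 50 ms of a toy "voiced" signal
    voiced_like = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
    print(candidate_epochs(voiced_like, fs)[:5])
```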
  • Publication
    LP spectra vs. mel spectra for identification of professional mimics in Indian languages
    (Springer, 19-05-2009) Basu, T K; Patil, Hemant; DA-IICT, Gandhinagar
    Automatic Speaker Recognition (ASR) is an economical tool for voice biometrics because of the availability of low-cost and powerful processors. For an ASR system to be successful in practical environments, it must have high mimic resistance, i.e., the system should not be defeated by determined impersonators, who may be either identical twins or professional mimics. In this paper, we demonstrate the effectiveness of Linear Prediction (LP)-based features, viz., Linear Prediction Coefficients (LPC) and Linear Prediction Cepstral Coefficients (LPCC), over filterbank-based features such as Mel-Frequency Cepstral Coefficients (MFCC) and the newly proposed Teager energy-based MFCC (T-MFCC) for the identification of professional mimics in Indian languages. Results are reported for real and fictitious experiments. On the whole, it is observed that LP-based features perform better than filterbank-based features (an average jump of 23.21% and 31.43% for fictitious experiments with a professional mimic in Marathi and Hindi, respectively, and an average jump of 1.64% for real experiments with a professional mimic in Hindi), and we believe that this is the first time such results on the identification of professional mimics in ASR have been obtained. Analysis of the results is given with the help of the Mean Square Error (MSE) between training and testing utterances for the mimic's imitations of target speakers and the target speakers' normal voices. Fourier spectra and the corresponding LP spectra for a target speaker and the impersonations provided by a professional mimic are shown to justify the results. Finally, the dependence of LPC on the physiological characteristics of the vocal tract, and its relation to the problem addressed in this paper, is studied.
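A compact sketch contrasting the two spectral views compared above: an LP spectral envelope from autocorrelation-method LPC, and log mel-filterbank energies for the same frame. The LP order, FFT size, and 24-filter mel bank are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np
from scipy.linalg import solve_toeplitz


def lp_envelope(frame, order=12, nfft=512):
    """Autocorrelation-method LPC, returned as a spectral envelope (dB)."""
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])     # predictor coefficients
    A = np.fft.rfft(np.concatenate(([1.0], -a)), n=nfft)
    return -20 * np.log10(np.abs(A) + 1e-12)          # 1 / |A(e^jw)| in dB


def mel_spectrum(frame, fs=16000, n_mels=24, nfft=512):
    """Log mel-filterbank energies for the same frame (triangular filters)."""
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=nfft)) ** 2
    mel_edges = np.linspace(0, 2595 * np.log10(1 + (fs / 2) / 700), n_mels + 2)
    hz_edges = 700 * (10 ** (mel_edges / 2595) - 1)
    bins = np.floor((nfft + 1) * hz_edges / fs).astype(int)
    fbank = np.zeros(n_mels)
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, hi):
            w = (k - lo) / max(c - lo, 1) if k < c else (hi - k) / max(hi - c, 1)
            fbank[m - 1] += w * power[k]
    return np.log(fbank + 1e-12)


if __name__ == "__main__":
    rng = np.random.default_rng(4)
    frame = rng.standard_normal(400)                  # stand-in for a 25 ms frame
    print(lp_envelope(frame).shape, mel_spectrum(frame).shape)
```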