Please use this identifier to cite or link to this item: http://drsr.daiict.ac.in//handle/123456789/785
Title: Auditory representation learning
Authors: Patil, Hemant A.
Sailor, Hardik B.
Keywords: Representation learning
Deep learning
Filterbank learning
Speech Databases
Sound classification
Auditory model
Speech signal
Signal processing
Audio processing
Speech recognition
Speech detection
Issue Date: 2018
Publisher: Dhirubhai Ambani Institute of Information and Communication Technology
Citation: Sailor, Hardik B. (2018). Auditory Representation Learning. Dhirubhai Ambani Institute of Information and Communication Technology, xxv, 218 p. (Acc. No: T00688)
Abstract: Representation learning (RL), or feature learning, has had a significant impact on signal processing applications. The goal of RL approaches is to learn meaningful representations directly from the data that can be helpful to a pattern classifier. In particular, unsupervised RL has gained significant interest for feature learning in various signal processing areas, including speech and audio processing. Recently, various RL methods have been used to learn auditory-like representations from speech signals or their spectral representations. In this thesis, we propose a novel auditory representation learning model based on the Convolutional Restricted Boltzmann Machine (ConvRBM). Auditory-like subband filters are learned when the model is trained directly on raw speech and audio signals of arbitrary length. The learned auditory frequency scale is also nonlinear, similar to standard auditory frequency scales; however, the ConvRBM frequency scale is adapted to the statistics of the sounds. The primary motivation for developing our model is its application to the Automatic Speech Recognition (ASR) task. Experiments on standard ASR databases show that the ConvRBM filterbank performs better than the Mel filterbank. A stability analysis of the model is presented using the Lipschitz continuity condition. The proposed model is further improved using annealed dropout and Adam optimization. Noise-robust representations are achieved by combining the ConvRBM filterbank with energy estimation using the Teager Energy Operator (TEO). As part of the research work for a consortium project sponsored by MeitY, Govt. of India, the ConvRBM is used as a front-end for the ASR system in speech-based access to agricultural commodities in the Gujarati language.
Inspired by its success in the ASR task, we applied our model to three audio classification tasks, namely, Environmental Sound Classification (ESC), synthetic and replay Spoof Speech Detection (SSD) in the context of Automatic Speaker Verification (ASV), and Infant Cry Classification (ICC). We further propose a two-layer auditory model obtained by stacking two ConvRBMs. We refer to it as an Unsupervised Deep Auditory Model (UDAM); it performed better than the single-layer ConvRBM in the ASR task.
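The abstract mentions energy estimation with the Teager Energy Operator (TEO). As background, the discrete-time TEO is commonly defined as Psi[x(n)] = x(n)^2 - x(n-1)x(n+1); the sketch below is an illustrative implementation of that standard formula, not code from the thesis:

```python
import numpy as np

def teager_energy(x):
    """Discrete-time Teager Energy Operator.

    Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1),
    evaluated for n = 1 .. len(x) - 2 (the endpoints have no
    valid neighbours, so the output is two samples shorter).
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone x(n) = A * cos(omega * n), the TEO output is
# approximately the constant A^2 * sin(omega)^2, i.e. it tracks
# both amplitude and frequency of the signal.
```

For a subband signal (e.g. the output of one learned ConvRBM filter), applying the TEO followed by smoothing yields a noise-robust energy estimate of the kind the abstract refers to.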
URI: http://drsr.daiict.ac.in//handle/123456789/785
Appears in Collections:PhD Theses

Files in This Item:
File: 201321002_Hardik B. Sailor.pdf
Description: 201321002
Size: 14.63 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.