Please use this identifier to cite or link to this item:
http://drsr.daiict.ac.in//handle/123456789/1136
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Patil, Hemant A. | - |
dc.contributor.advisor | Sailor, Hardik B. | - |
dc.contributor.author | Chaturvedi, Shreya Sanjay | - |
dc.date.accessioned | 2024-08-22T05:21:08Z | - |
dc.date.available | 2024-08-22T05:21:08Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | Chaturvedi, Shreya Sanjay (2022). Self-Supervised Speech Representation for Speech Recognition. Dhirubhai Ambani Institute of Information and Communication Technology. xi, 81 p. (Acc. # T01056). | - |
dc.identifier.uri | http://drsr.daiict.ac.in//handle/123456789/1136 | - |
dc.description.abstract | Voice Assistants (VAs) are nowadays an integral part of human life. Low-resource applications of VAs, such as regional languages, children's speech, and medical conversations, are the key challenges faced during the development of these VAs. From a broader perspective, a VA consists of three parts, namely, Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and a Text-to-Speech (TTS) model. This thesis focuses on one of these parts, i.e., ASR. In particular, optimization of low-resource ASR is targeted for the application of children's speech. Initially, a data augmentation technique was proposed to improve the performance of an isolated hybrid DNN-HMM ASR system for children's speech. To this end, a CycleGAN-based augmentation technique was used, in which children-to-children voice conversion is performed. Here, for the conversion of characteristics, the speech signals were categorized into two classes based on a fundamental frequency threshold. In this work, a detailed experimental analysis of various augmentations, such as SpecAugment, speed perturbation, and volume perturbation, is carried out with respect to ASR. Further, to optimize low-resource ASR, self-supervised learning, i.e., wav2vec 2.0, has been explored. It is a semi-supervised approach, where pre-training is performed with unlabelled data and the model is then fine-tuned with labelled data. In addition, Noisy Student-Teacher (NST) learning is fused with self-supervised learning techniques. The key achievement of this work was the efficient use of unlabelled data; even though the process involves iterative training, redundant training was negligible. The pseudo-labelled data was filtered before utilizing it for fine-tuning. After Acoustic Model (AM) decoding, a Language Model (LM) was also used to optimize the performance. Additional work was also done in the direction of replay Spoofed Speech Detection (SSD). In this work, the significance of the Delay-and-Sum (DAS) beamformer was investigated over the state-of-the-art (SoTA) Minimum Variance Distortionless Response (MVDR) beamforming technique for replay SSD. | - |
dc.publisher | Dhirubhai Ambani Institute of Information and Communication Technology | - |
dc.subject | Automatic Speech Recognition | - |
dc.subject | Data Augmentation | - |
dc.subject | Self Supervised Learning | - |
dc.subject | Noisy Student Teacher Learning | - |
dc.subject | Replay Spoof Speech Detection | - |
dc.classification.ddc | 006.454 CHA | - |
dc.title | Self-Supervised Speech Representation for Speech Recognition | - |
dc.type | Dissertation | - |
dc.degree | M. Tech (EC) | - |
dc.student.id | 202015004 | - |
dc.accession.number | T01056 | - |
Appears in Collections: | M Tech (EC) Dissertations |
Files in This Item:
File | Size | Format |
---|---|---|
202015004.pdf | 1.94 MB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
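The abstract compares several augmentation techniques (SpecAugment, speed perturbation, and volume perturbation) for low-resource ASR. The snippet below is a minimal sketch of such augmentations using torchaudio; the perturbation factor, gain, mask widths, and Mel settings are illustrative assumptions and are not values reported in the thesis.

```python
import torch
import torchaudio

def speed_perturb(waveform, fs=16000, factor=1.1):
    """Speed perturbation by resampling: the signal is treated as if recorded
    at fs * factor, which changes both tempo and pitch (illustrative factor)."""
    return torchaudio.functional.resample(waveform, orig_freq=int(fs * factor), new_freq=fs)

def volume_perturb(waveform, gain=0.8):
    """Volume perturbation as simple amplitude scaling (illustrative gain)."""
    return torchaudio.transforms.Vol(gain=gain, gain_type="amplitude")(waveform)

def spec_augment(waveform, fs=16000):
    """SpecAugment-style frequency and time masking on a Mel spectrogram;
    mask widths are illustrative."""
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=fs, n_mels=80)(waveform)
    mel = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)(mel)
    mel = torchaudio.transforms.TimeMasking(time_mask_param=50)(mel)
    return mel

# Illustrative usage on one second of synthetic audio.
wave = torch.randn(1, 16000)
fast = speed_perturb(wave)
quiet = volume_perturb(wave)
features = spec_augment(wave)
```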
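The abstract also describes a Noisy Student-Teacher (NST) loop in which pseudo-labelled data is filtered before being used to fine-tune a wav2vec 2.0 model. Below is a minimal sketch of such a confidence-based filtering step, assuming the Hugging Face transformers implementation of wav2vec 2.0; the checkpoint name, the 16 kHz input, and the confidence threshold are illustrative assumptions, not details taken from the thesis.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Hypothetical checkpoint; the thesis does not specify which pre-trained model was used.
CHECKPOINT = "facebook/wav2vec2-base-960h"

processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT)
model = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT)
model.eval()

def pseudo_label(waveform, threshold=0.90):
    """Transcribe an unlabelled utterance with the teacher model and keep it only
    if the mean frame-level confidence exceeds a (hypothetical) threshold.

    waveform : 1-D float array or tensor sampled at 16 kHz.
    Returns (transcript, confidence), or None if the utterance is filtered out.
    """
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits            # (1, frames, vocab)
    probs = torch.softmax(logits, dim=-1)
    confidence = probs.max(dim=-1).values.mean().item()       # mean best-path frame confidence
    if confidence < threshold:
        return None                                            # discard low-confidence pseudo-label
    pred_ids = torch.argmax(logits, dim=-1)
    transcript = processor.batch_decode(pred_ids)[0]
    return transcript, confidence
```

Utterances that pass this filter would then be added to the labelled pool for the next fine-tuning iteration of the student model.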
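Finally, the abstract contrasts the Delay-and-Sum (DAS) beamformer with MVDR beamforming for replay spoof speech detection. A minimal NumPy sketch of a DAS beamformer is shown below; the array geometry, steering delays, and sampling rate are illustrative assumptions.

```python
import numpy as np

def delay_and_sum(signals, delays, fs):
    """Delay-and-Sum (DAS) beamformer.

    signals : (n_mics, n_samples) array of microphone signals.
    delays  : per-microphone steering delays in seconds toward the look direction.
    fs      : sampling rate in Hz.

    Each channel is shifted by its (rounded) sample delay so that the desired
    source adds coherently, then the channels are averaged.
    """
    n_mics, n_samples = signals.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        shift = int(round(delays[m] * fs))
        out += np.roll(signals[m], -shift)
    return out / n_mics

# Illustrative usage with a synthetic 4-microphone recording (hypothetical values).
fs = 16000
t = np.arange(fs) / fs
source = np.sin(2 * np.pi * 440 * t)
delays = np.array([0.0, 1.0, 2.0, 3.0]) / fs            # one-sample spacing between mics
mics = np.stack([np.roll(source, int(round(d * fs))) for d in delays])
enhanced = delay_and_sum(mics, delays, fs)
```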