Improving the Quality of Data to be used for Text Classification task

Mehta, Stuti

dc.contributor.advisor	Mitra, Suman
dc.contributor.author	Mehta, Stuti
dc.contributor.other	Roy, Anil
dc.date.accessioned	06/05/2022T05:05:41Z
dc.date.available	2023-02-18T05:05:41Z
dc.date.issued	2021
dc.identifier.citation	Mehta, Stuti (2021). Improving the Quality of Data to be used for Text Classification task. Dhirubhai Ambani Institute of Information and Communication Technology. viii, 43 p. (Acc.No: T00931)
dc.identifier.uri	http://drsr.daiict.ac.in//handle/123456789/992
dc.description.abstract	Text Classification is one of the most basic tasks in Natural Language Processing (NLP), and efforts to improve its performance have largely focused on changing the classifier model. For any NLP problem, the data used is a very important factor in reaching the best possible solution. However, not much work has been done to improve data quality. Our work aims at improving the quality of the data for the Text Classification task. This is done by removing some semantically difficult samples from the data, which changes the training set and thereby improves the data quality. Various methods have been considered for improving the quality of the data and removing the samples that negatively affect classifier performance. A novel method has been used to define difficulty: a penalty function represents the difficulty of a sample, and the difficulty of the sample is determined from the penalty value associated with it. The difficult samples are then removed, and the newly obtained training set, an improved version of the original training set, is used for classification. Training the classifier model on the newly obtained training set yields an observed improvement in classifier performance. Thus, our work mainly improves the quality of the data by removing the difficult samples. Various methods have been used for finding the difficult samples, and comparisons have been drawn from the results obtained.	en_US
dc.publisher	Dhirubhai Ambani Institute of Information and Communication Technology	en_US
dc.subject	Text Classification	en_US
dc.subject	Natural Language Processing	en_US
dc.classification.ddc	621.3821 MEH
dc.title	Improving the Quality of Data to be used for Text Classification task	en_US
dc.type	Dissertation
dc.degree	M. Tech
dc.student.id	201911001
dc.accession.number	T00931

Files in this item

Name: 201911001_Stuti_Mehta_MTech_Th ...
Size: 3.521Mb
Format: PDF

This item appears in the following Collection(s)

M Tech Dissertations [923]
