dc.contributor.advisor: Mitra, Suman
dc.contributor.author: Mehta, Stuti
dc.contributor.other: Roy, Anil
dc.date.accessioned: 06/05/2022T05:05:41Z
dc.date.available: 2023-02-18T05:05:41Z
dc.date.issued: 2021
dc.identifier.citation: Mehta, Stuti (2021). Improving the Quality of Data to be used for Text Classification task. Dhirubhai Ambani Institute of Information and Communication Technology. viii, 43 p. (Acc.No: T00931)
dc.identifier.uri: http://drsr.daiict.ac.in//handle/123456789/992
dc.description.abstract: Text Classification is one of the most basic tasks in Natural Language Processing (NLP), and most efforts to improve performance on it have focused on changing the classifier model. For any NLP problem, the data used is a very important factor in reaching the best possible solution. However, not much work has been done to improve data quality. Our work aims at improving the quality of the data for the Text Classification task. This is done by removing semantically difficult samples from the training set. To identify the samples that negatively affect classifier performance, various methods have been considered. A novel method is used to define difficulty: a penalty function represents the difficulty of a sample, and the penalty value associated with a sample determines whether it is difficult. These difficult samples are then removed, and the resulting training set, an improved version of the original, is used for classification. Training the classifier model on the new training set yields an observed improvement in classifier performance. Thus, our work mainly improves the quality of the data by removing the difficult samples. Various methods have been used for finding the difficult samples, and the results obtained with them are compared. [en_US]
dc.publisher: Dhirubhai Ambani Institute of Information and Communication Technology [en_US]
dc.subject: Text Classification [en_US]
dc.subject: Natural Language Processing [en_US]
dc.classification.ddc: 621.3821 MEH
dc.title: Improving the Quality of Data to be used for Text Classification task [en_US]
dc.type: Dissertation
dc.degree: M. Tech
dc.student.id: 201911001
dc.accession.number: T00931
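
Note on the abstract: the record does not state how the penalty function is defined, so the following is only a minimal sketch of the general idea (score each training sample, drop the ones whose penalty is too high, retrain on the remainder). The penalty used here, one minus the probability a model assigns to the true label, is an assumption chosen for illustration and is not claimed to be the dissertation's method. The names `penalty`, `drop_difficult`, and `threshold` are hypothetical.

```python
from typing import List, Sequence, Tuple


def penalty(true_label: int, probs: Sequence[float]) -> float:
    """Hypothetical penalty: probability mass placed on the wrong classes.

    This definition is an assumption made for illustration only.
    """
    return 1.0 - probs[true_label]


def drop_difficult(
    texts: Sequence[str],
    labels: Sequence[int],
    probs: Sequence[Sequence[float]],
    threshold: float = 0.7,
) -> List[Tuple[str, int]]:
    """Keep only the samples whose penalty does not exceed the threshold."""
    kept = []
    for text, label, p in zip(texts, labels, probs):
        if penalty(label, p) <= threshold:
            kept.append((text, label))
    return kept


# Toy data: class probabilities would normally come from a model
# evaluated on held-out folds; here they are made-up numbers.
texts = ["great film", "terrible plot", "not bad at all"]
labels = [1, 0, 1]
probs = [[0.1, 0.9], [0.8, 0.2], [0.85, 0.15]]

cleaned = drop_difficult(texts, labels, probs)
print(cleaned)
```

In this toy run the third sample receives a high penalty because the assumed model strongly favours the wrong class, so it is excluded from the training set that would be used for retraining.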

