Please use this identifier to cite or link to this item:
http://drsr.daiict.ac.in//handle/123456789/992
Title: | Improving the Quality of Data to be used for Text Classification task |
Authors: | Mitra, Suman; Mehta, Stuti; Roy, Anil |
Keywords: | Text Classification; Natural Language Processing |
Issue Date: | 2021 |
Publisher: | Dhirubhai Ambani Institute of Information and Communication Technology |
Citation: | Mehta, Stuti (2021). Improving the Quality of Data to be used for Text Classification task. Dhirubhai Ambani Institute of Information and Communication Technology. viii, 43 p. (Acc.No: T00931) |
Abstract: | Text classification is one of the most basic tasks in Natural Language Processing (NLP), and efforts have been made to improve its performance by changing the classifier model. For any NLP problem, the data used is critical to obtaining the best possible solution. However, not much work has been done to improve data quality. Our work aims to improve the quality of the data used for the text classification task by removing semantically difficult samples from the training set. Various methods have been considered for identifying the samples that have a negative effect on classifier performance. A novel method is used to define difficulty: a penalty function represents the difficulty of a sample, and the penalty value associated with a sample determines whether it is difficult. The difficult samples are then removed, and the resulting training set, an improved version of the original, is used for classification. Training the classifier model on this new training set improves the performance of the classifier. Thus, our work mainly improves the quality of the data by removing the difficult samples. Various methods have been used to find the difficult samples, and the results obtained are compared. |
URI: | http://drsr.daiict.ac.in//handle/123456789/992 |
Appears in Collections: | M Tech Dissertations |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
201911001_Stuti_Mehta_MTech_Thesis.pdf (Restricted Access) | | 3.61 MB | Adobe PDF |