Please use this identifier to cite or link to this item: http://drsr.daiict.ac.in//handle/123456789/41
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Maitra, Anutosh
dc.contributor.author | Nallagorla, V. S. R. Krishna
dc.date.accessioned | 2017-06-10T10:23:46Z
dc.date.available | 2017-06-10T10:23:46Z
dc.date.issued | 2004
dc.identifier.citation | Nallagorla, V. S. R. Krishna (2004). Automated categorization of structured documents. Dhirubhai Ambani Institute of Information and Communication Technology, x, 54 p. (Acc.No: T00004)
dc.identifier.uri | http://drsr.daiict.ac.in/handle/123456789/41
dc.description.abstract | Automatic text categorization is the problem of assigning text documents to pre-defined categories. This requires extraction of useful features. In most applications, text document features are commonly represented by the term frequency and the inverse document frequency. In the case of structured documents, dominant features are often characterized by a few sentences bearing additional importance. The features from more important sentences should be weighted more heavily than other features. Another issue in automated document categorization is the manageability and integrity of large volumes of text data, where the documents can be very large and certain parts of a document often misrepresent its primary focus. In this work, we study several document categorization techniques, mostly in light of categorizing structured text. Categorization on summarized text is also studied in detail for specific purposes. While summarizing, the importance of appropriate sentences has been considered. The approach is verified by conducting experiments on newsgroup data sets using three summarization methods. The sets of whole documents and of summaries were subjected to two classical categorization techniques, viz. the Naive Bayesian and TF-IDF (Term Frequency-Inverse Document Frequency) algorithms. A major observation was that the quality of the categorization is preserved even when it is based on the summary documents rather than the whole documents. The approach was evaluated on about 500 documents, and the test results are enclosed.
dc.publisher | Dhirubhai Ambani Institute of Information and Communication Technology
dc.subject | Categorization techniques
dc.subject | Document categorization
dc.subject | Naive bayesian
dc.subject | Text categorization
dc.subject | TF-IDF
dc.subject | Term frequency-inverse document frequency
dc.classification.ddc | 005.72 NAL
dc.title | Automated categorization of structured documents
dc.type | Dissertation
dc.degree | M.Tech
dc.student.id | 200211007
dc.accession.number | T00004
Appears in Collections: M Tech Dissertations
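
The abstract names term frequency and inverse document frequency as the usual document features. This record does not say which TF-IDF variant the dissertation uses, so the following is only a minimal sketch of one common textbook form (length-normalized term frequency times log(N / document frequency)). The function name `tf_idf` and the toy corpus are assumptions made for illustration and are not taken from the thesis.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical toy corpus, not the newsgroup data used in the thesis.
docs = [
    "the match ended in a draw".split(),
    "the election ended in a recount".split(),
    "the team won the match".split(),
]
weights = tf_idf(docs)
```

In this sketch a term that occurs in every document (here "the") receives weight zero, while a term confined to one document (here "draw") receives the largest weight for its frequency, which is the behaviour that makes TF-IDF useful as a categorization feature.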

Files in This Item:
File | Description | Size | Format
200211007.pdf (Restricted Access) | | 2.06 MB | Adobe PDF

