Part of speech tagging for Gujarati text

Dave, Mainak

dc.contributor.advisor	Pandya, Abhinay
dc.contributor.author	Dave, Mainak
dc.date.accessioned	2017-06-10T14:39:15Z
dc.date.available	2017-06-10T14:39:15Z
dc.date.issued	2011
dc.identifier.citation	Dave, Mainak (2011). Part of speech tagging for Gujarati text. Dhirubhai Ambani Institute of Information and Communication Technology, viii, 44 p. (Acc.No: T00319)
dc.identifier.uri	http://drsr.daiict.ac.in/handle/123456789/356
dc.description.abstract	Part-of-speech (POS) tagging is the process of assigning to each word in a natural language sentence the lexical category that best suits both the word's definition and the context in which it is used. POS tagging is an important part of Natural Language Processing (NLP) and is often a preliminary step for tasks such as chunking and parsing. Gujarati is the state language of Gujarat, a western state of India, and is spoken by 70 percent of the state's population. More than 46 million people worldwide consider Gujarati their first language. Apart from Gujarat, it is widely spoken in the states of Maharashtra, Rajasthan, Karnataka and Madhya Pradesh, and also around the world. Natural language processing of Gujarati is at an early stage. A Gujarati POS tagger is a core component for most NLP applications: information retrieval, machine translation, shallow parsing and word sense disambiguation can all work more effectively and efficiently when a POS tagger is available. The focus of this work is to develop an effective POS tagger for Gujarati text. The main task of the thesis is to build a system that automatically annotates Gujarati text with part-of-speech tags using machine learning algorithms. We use the tag sets defined by IIIT Hyderabad and two machine learning techniques: the Hidden Markov Model (HMM) and the Conditional Random Field (CRF). Since Gujarati is a morphologically rich language, a Morphological Analyzer (MA) can be used to restrict the set of possible tags for a given word. Gujarati grammar is based on the Paninian framework, so its rules of morphology are well defined; hence we have defined morphological rules for Gujarati.
While the MA helps restrict the possible choice of tags for a given word, prefix and suffix information (i.e., the sequence of the first or last few characters of a word) can also be used to further improve the models. The HMM uses suffix information during smoothing, while the CRF uses suffixes as features. (http://www.oclc.org/languagesets/educational/languages/india.htm)
dc.publisher	Dhirubhai Ambani Institute of Information and Communication Technology
dc.subject	Natural language processing
dc.subject	POS tagging
dc.subject	Statistical natural language processing
dc.subject	Word Segmentation
dc.subject	Part-of-Speech Tagging
dc.subject	Morphological Analysis
dc.subject	Computational linguistics
dc.subject	Linguistic analysis
dc.subject	Linguistics
dc.classification.ddc	006.35 DAV
dc.title	Part of speech tagging for Gujarati text
dc.type	Dissertation
dc.degree	M. Tech
dc.student.id	200911010
dc.accession.number	T00319

Files in this item

Name: 200911010.pdf
Size: 379.3Kb
Format: PDF

This item appears in the following Collection(s)

M Tech Dissertations [923]
