Document Language Classification Using Deep Learning Approaches

Shah, Sarathi Surendra

Please use this identifier to cite or link to this item: http://drsr.daiict.ac.in//handle/123456789/1011

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Joshi, Manjunath V.
dc.contributor.author	Shah, Sarathi Surendra
dc.date.accessioned	2022-05-06T17:08:20Z
dc.date.available	2023-02-24T17:08:20Z
dc.date.issued	2021
dc.identifier.citation	Shah, Sarathi Surendra (2021). Document Language Classification Using Deep Learning Approaches. Dhirubhai Ambani Institute of Information and Communication Technology. viii, 38 p. (Acc.No: T00946)
dc.identifier.uri	http://drsr.daiict.ac.in//handle/123456789/1011
dc.description.abstract	Optical character recognition (OCR) is the task of recognizing characters or text in digital document images. OCR has been widely researched for many years because of its applications in various fields: it supports natural language processing of documents, conversion of document text to speech, semantic analysis of the text, and searching within documents. Multilingual OCR works with documents containing more than one language. Different OCR models have been created, each optimized for a particular language. Therefore, when performing OCR on multilingual documents or translating them, one should first detect the language of the document and then pass the document to the OCR model optimized for that language. Most prior work in this area focuses on identifying scripts; because a Convolutional Neural Network (CNN) can learn appropriate features, our work instead focuses on language detection using learned features. We propose two CNN-based classification models: one classifies Gujarati and English at the word level, and the other classifies six Indian languages at the page level. For page-level classification, we use a hierarchical method in which a binary classification is followed by a multiclass classification to improve detection accuracy. Most current approaches do not use a hierarchy and hence fail to identify the language correctly. The proposed hierarchical approach uses the CNN to detect six Indian languages (Tamil, Telugu, Kannada, Hindi, Marathi, and Gujarati) in printed documents based on the text content of a page.
Experiments are performed on scanned government documents, and the results indicate that the proposed approach performs better than other similar methods. An advantage of our approach is that it is based on features extracted from the entire page rather than from words or characters, and it can also be applied to handwritten documents.
dc.subject	Optical Character Recognition
dc.subject	Document Language Classification
dc.subject	Convolutional Neural Network
dc.subject	Indian Languages
dc.classification.ddc	621.399 SHA
dc.title	Document Language Classification Using Deep Learning Approaches
dc.type	Dissertation
dc.degree	M. Tech
dc.accession.number	T00946
Appears in Collections:	M Tech Dissertations
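
The hierarchical page-level scheme described in the abstract (a binary classification followed by a multiclass classification) can be illustrated with a minimal routing sketch. This is not the dissertation's implementation. The abstract does not say how the binary stage splits the six languages, so the two groups below are a hypothetical choice, and the stub functions merely stand in for trained CNNs. All names (GROUPS, classify_page, stub_binary, make_stub_multiclass) are assumptions made for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical grouping: the abstract does not state how the binary
# stage partitions the six languages, so this split is an assumption.
GROUPS: Dict[str, Tuple[str, ...]] = {
    "group_a": ("Tamil", "Telugu", "Kannada"),
    "group_b": ("Hindi", "Marathi", "Gujarati"),
}

Scores = Dict[str, float]
Classifier = Callable[[Sequence[float]], Scores]


def argmax_label(scores: Scores) -> str:
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


def classify_page(
    page: Sequence[float],
    binary_stage: Classifier,
    group_stages: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Route a page through a binary stage, then a per-group multiclass stage."""
    group = argmax_label(binary_stage(page))
    language = argmax_label(group_stages[group](page))
    return group, language


# Stand-in classifiers (NOT trained models): they only mimic the
# interface a CNN would expose, mapping page features to label scores.
def stub_binary(page: Sequence[float]) -> Scores:
    p = 0.9 if page[0] > 0.5 else 0.1
    return {"group_a": p, "group_b": 1.0 - p}


def make_stub_multiclass(labels: Tuple[str, ...]) -> Classifier:
    def stage(page: Sequence[float]) -> Scores:
        # Pick one label from the second feature; purely illustrative.
        idx = min(int(page[1] * len(labels)), len(labels) - 1)
        return {lab: (1.0 if i == idx else 0.0) for i, lab in enumerate(labels)}
    return stage


stages = {name: make_stub_multiclass(labels) for name, labels in GROUPS.items()}
result_a = classify_page([0.9, 0.1], stub_binary, stages)
result_b = classify_page([0.2, 0.99], stub_binary, stages)
```

In this arrangement each second-stage classifier only has to separate the languages within its own group, which is the general motivation for a hierarchy; whether that matches the grouping used in the dissertation cannot be confirmed from this record.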

Files in This Item:

File	Description	Size	Format
201911019_Sarathi_Shah_MTech_Thesis_Dean Research.pdf Restricted Access		2.05 MB	Adobe PDF
