Document Language Classification Using Deep Learning Approaches

Shah, Sarathi Surendra

Please use this identifier to cite or link to this item: http://drsr.daiict.ac.in//handle/123456789/1011

Title:	Document Language Classification Using Deep Learning Approaches
Authors:	Joshi, Manjunath V.; Shah, Sarathi Surendra
Keywords:	Optical Character Recognition; Document Language Classification; Convolutional Neural Network; Indian Languages
Issue Date:	2021
Citation:	Shah, Sarathi Surendra (2021). Document Language Classification Using Deep Learning Approaches. Dhirubhai Ambani Institute of Information and Communication Technology. viii, 38 p. (Acc.No: T00946)
Abstract:	Optical character recognition (OCR) refers to the task of recognizing characters or text in digital document images. OCR has been widely researched for many years because of its applications in various fields. It supports natural language processing of documents, conversion of document text to speech, semantic analysis of text, and search within documents. Multilingual OCR works with documents containing more than one language. Different OCR models have been created, each optimized for a particular language. However, when dealing with multiple languages or with translation of documents, one first needs to detect the language of the document and then pass the document to the OCR model optimized for that language. Most prior work in this area focuses on identifying scripts; because a convolutional neural network (CNN) can learn appropriate features, our work focuses on language detection using learned features. We propose two CNN-based classification models: one classifies Gujarati and English at the word level, and the other classifies six Indian languages at the page level. For page-level classification, we use a hierarchical method in which binary classification is followed by multiclass classification to improve detection accuracy. Current approaches largely do not use a hierarchy and hence fail to identify the language correctly. The proposed hierarchical approach uses a CNN to detect six Indian languages (Tamil, Telugu, Kannada, Hindi, Marathi, and Gujarati) in printed documents based on the text content of a page. Experiments are performed on scanned government documents, and the results indicate that the proposed approach performs better than other similar methods. An advantage of our approach is that it is based on features extracted from the entire page rather than from words or characters, and it can also be applied to handwritten documents.
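The two-stage idea in the abstract (a binary decision followed by a multiclass decision) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the abstract does not say how the binary stage divides the six languages, so the grouping in `GROUPS` is purely hypothetical, and `toy_binary_stage` and `toy_multiclass_stage` are stand-in functions that take the place of trained CNNs.

```python
from typing import Callable, Dict, Sequence, Tuple

Page = Sequence[float]  # stand-in for a page image or its learned features

# HYPOTHETICAL grouping: the abstract does not state how the binary stage
# splits the six languages. Any two-way partition could be used here.
GROUPS: Dict[str, Tuple[str, ...]] = {
    "group_a": ("Tamil", "Telugu", "Kannada"),
    "group_b": ("Hindi", "Marathi", "Gujarati"),
}


def classify_hierarchical(
    page: Page,
    binary_stage: Callable[[Page], str],
    multiclass_stages: Dict[str, Callable[[Page], str]],
) -> str:
    """Run the binary stage first, then the multiclass stage for that group."""
    group = binary_stage(page)
    language = multiclass_stages[group](page)
    if language not in GROUPS[group]:
        raise ValueError(f"{language!r} is not a member of {group!r}")
    return language


def toy_binary_stage(page: Page) -> str:
    """Toy stand-in for a trained binary CNN: thresholds the first value."""
    return "group_a" if page[0] < 0.5 else "group_b"


def toy_multiclass_stage(languages: Tuple[str, ...]) -> Callable[[Page], str]:
    """Toy stand-in for a trained multiclass CNN: argmax over values 1..3."""
    def predict(page: Page) -> str:
        best = max(range(len(languages)), key=lambda i: page[1 + i])
        return languages[best]
    return predict


STAGES = {name: toy_multiclass_stage(langs) for name, langs in GROUPS.items()}
```

In this sketch, `classify_hierarchical(page, toy_binary_stage, STAGES)` returns a single language label per page; replacing the toy functions with trained models would leave the routing logic unchanged.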
URI:	http://drsr.daiict.ac.in//handle/123456789/1011
Appears in Collections:	M Tech Dissertations

Files in This Item:

File	Description	Size	Format
201911019_Sarathi_Shah_MTech_Thesis_Dean Research.pdf	Restricted Access	2.05 MB	Adobe PDF
