Document Language Classification Using Deep Learning Approaches
Abstract
Optical character recognition (OCR) is the task of recognizing characters or text in digital document images. It has been widely researched for many years because of its applications in various fields: it enables natural language processing of documents, text-to-speech conversion, semantic analysis of text, and searching within documents. Multilingual OCR deals with documents containing more than one language. Different OCR models have been created, each optimized for a particular language. Therefore, when dealing with multilingual documents or document translation, one must first detect the language of the document and then pass the document to the OCR model optimized for that language. Most existing work in this area focuses on identifying scripts; since a Convolutional Neural Network (CNN) can learn appropriate features, our work focuses on language detection using learned features. We propose two CNN-based classification models: one classifies Gujarati and English at word level, and the other classifies six Indian languages at page level. For page-level classification, we use a hierarchical method in which a binary classification is followed by a multiclass classification to improve detection accuracy. Current approaches largely do not use a hierarchy and hence fail to identify the language correctly. The proposed hierarchical approach uses a CNN to detect six Indian languages (Tamil, Telugu, Kannada, Hindi, Marathi, and Gujarati) in printed documents, based on the text content of a page. Experiments on scanned government documents indicate that the proposed approach performs better than other similar methods. An advantage of our approach is that it is based on features extracted from the entire page rather than from words or characters, and it can also be applied to handwritten documents.
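The hierarchical page-level scheme described in the abstract (a binary classification followed by a multiclass classification) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation. The abstract does not say how the binary stage partitions the six languages, so the two groups below are purely hypothetical, as are the names GROUPS, argmax, and classify_page. The models are passed in as plain callables that return one score per class; in practice each would be a trained CNN.

```python
from typing import Callable, Dict, Sequence

# ASSUMPTION: the abstract does not state how the binary stage splits the
# six languages. This partition is hypothetical and only for illustration.
GROUPS: Dict[str, Sequence[str]] = {
    "group_a": ["Hindi", "Marathi", "Gujarati"],
    "group_b": ["Tamil", "Telugu", "Kannada"],
}

# A model is any callable mapping a page image to one score per class.
Scorer = Callable[[object], Sequence[float]]


def argmax(scores: Sequence[float]) -> int:
    """Return the index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def classify_page(page: object,
                  binary_model: Scorer,
                  group_models: Dict[str, Scorer]) -> str:
    """Two-stage prediction: pick a group, then a language within it."""
    group_names = list(GROUPS)

    # Stage 1: binary classification selects one of the two groups.
    group_scores = binary_model(page)
    if len(group_scores) != len(group_names):
        raise ValueError("binary model must return one score per group")
    group = group_names[argmax(group_scores)]

    # Stage 2: multiclass classification among that group's languages only.
    languages = GROUPS[group]
    language_scores = group_models[group](page)
    if len(language_scores) != len(languages):
        raise ValueError("group model must return one score per language")
    return languages[argmax(language_scores)]
```

The point of the design is that each second-stage model only has to separate the languages within its own group, which is a smaller problem than a flat six-way classification. A mistake in the first stage cannot be recovered in the second, so the binary model's accuracy bounds the accuracy of the whole pipeline.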
Collections
- M Tech Dissertations