Binarizing degraded document image for text extraction

Patel, Radhika

Date

2015

Abstract

The current push toward digitization is expected to cover many important old documents that have degraded for various reasons. Binarizing a degraded document image for text extraction means converting a color document image into a binary image. Document images mostly contain two classes: background and text. Binarization can therefore also be viewed as a text retrieval procedure, since it extracts text from a degraded document. Degraded document image binarization faces many challenges, such as large variation in text intensity, variation in background contrast, bleed-through, variation in text size or stroke width within a single image, and heavily overlapping background and foreground intensity ranges. Many approaches to document image binarization are available, but none can handle all kinds of degradation at the same time. Most methods combine global and/or local thresholding with various preprocessing and postprocessing techniques to handle most of these challenges. The approach proposed in this thesis is divided into three stages: preprocessing, Text-Area detection, and postprocessing. Preprocessing uses PCA to convert the image from RGB to gray, followed by gamma correction to enhance the contrast of the image. The contrast-enhanced image is filtered with a DoG (Difference of Gaussians) filter to boost local features of the text, followed by equalization. The next stage identifies the Text-Area. A rough-set-based edge detection technique is used to find closed boundaries around text, which locates the Text-Area along with some non-text areas detected as text. Text is then detected by applying logical operators to the preprocessed image and the edge-detected image. The postprocessing technique handles false positives and false negatives based on the intensity values of the preprocessed and gray images. The algorithm is also expected to be independent of the script. To demonstrate this, the algorithm is tested on Gujarati degraded document images.
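
The first three preprocessing steps named above (PCA-based gray conversion, gamma correction, DoG filtering) could be sketched roughly as follows. This is an illustrative sketch in plain Python, not the thesis implementation: the function names, the gamma value of 0.8, and the Gaussian sigmas of 1.0 and 2.0 are all assumptions, and the equalization step is left out because the abstract does not say which kind is used.

```python
import math

def pca_to_gray(pixels):
    """Project RGB pixels onto their first principal component, scaled to [0, 1]."""
    n = len(pixels)
    mean = [sum(p[c] for p in pixels) / n for c in range(3)]
    centered = [[p[c] - mean[c] for c in range(3)] for p in pixels]
    # 3x3 covariance matrix of the colour channels.
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(3)]
           for i in range(3)]
    # Power iteration for the dominant eigenvector. The start vector (1, 1, 1)
    # is an assumption that suits brightness-dominated images.
    v = [1.0, 1.0, 1.0]
    for _ in range(100):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # perfectly flat image: no variance to project onto
            break
        v = [x / norm for x in w]
    if sum(v) < 0:  # orient the axis so brighter pixels map to larger values
        v = [-x for x in v]
    proj = [sum(r[c] * v[c] for c in range(3)) for r in centered]
    lo, hi = min(proj), max(proj)
    if hi == lo:
        return [0.0] * n
    return [(x - lo) / (hi - lo) for x in proj]

def gamma_correct(gray, gamma):
    """Apply a power-law curve to intensities in [0, 1]."""
    return [x ** gamma for x in gray]

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    k = [math.exp(-(i * i) / (2 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    total = sum(k)
    return [x / total for x in k]

def blur(img, sigma):
    """Separable Gaussian blur with edge clamping; img is a list of rows."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    h, w = len(img), len(img[0])

    def clamp(i, n):
        return min(max(i, 0), n - 1)

    tmp = [[sum(k[d + r] * img[y][clamp(x + d, w)] for d in range(-r, r + 1))
            for x in range(w)] for y in range(h)]
    return [[sum(k[d + r] * tmp[clamp(y + d, h)][x] for d in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def difference_of_gaussians(img, sigma_small=1.0, sigma_large=2.0):
    """Subtract a wide blur from a narrow one to emphasize local detail."""
    a, b = blur(img, sigma_small), blur(img, sigma_large)
    return [[a[y][x] - b[y][x] for x in range(len(img[0]))]
            for y in range(len(img))]

def preprocess(rgb_rows, gamma=0.8):
    """Chain the three steps; rgb_rows is a list of rows of (R, G, B) tuples."""
    w = len(rgb_rows[0])
    flat = [p for row in rgb_rows for p in row]
    gray = gamma_correct(pca_to_gray(flat), gamma)
    rows = [gray[y * w:(y + 1) * w] for y in range(len(rgb_rows))]
    return difference_of_gaussians(rows)
```

Projecting onto the first principal component keeps the color direction with the most variance, which in a document image is usually the contrast between ink and paper; a fixed-weight luminance formula does not adapt to the image in this way.
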
Performance is evaluated using quantitative measures such as Distance Reciprocal Distortion (DRD), Peak Signal-to-Noise Ratio (PSNR), F-measure, and pseudo F-measure, and is compared with state-of-the-art (SOTA) methods. The proposed approach performs close to the SOTA methods. It is able to binarize some very challenging images without losing text, where state-of-the-art methods lose the text.
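
Two of the measures listed above, F-measure and PSNR, can be sketched for binary images as follows. This is a minimal illustration using the common definitions of these measures, not the thesis's evaluation code: the function names and the pixel convention (1 for text, 0 for background) are assumptions, and DRD and pseudo F-measure are not covered.

```python
import math

def f_measure(pred, truth):
    """Harmonic mean of precision and recall over text pixels.

    pred and truth are flat lists of 0/1 values, where 1 marks a text pixel
    (an assumed convention for this sketch).
    """
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def psnr(pred, truth):
    """PSNR in dB for binary images, with a peak signal difference of 1."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(1.0 / mse)
```

Both measures compare a binarized output against a ground-truth binary image pixel by pixel. A higher value means closer agreement in each case.
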

URI

http://drsr.daiict.ac.in/handle/123456789/539

Collections

M Tech Dissertations [923]