Error Detection and Correction in Indic-OCR

Rohit Saluja, Devraj Adiga, Ganesh Ramakrishnan, Mark Carman, Parag Chaudhuri

November 2017

Abstract

Optical Character Recognition (OCR) is the process of converting document images into an editable electronic format. This has many advantages, such as data compression, enabling search and edit operations on the images/text, and creating databases for other applications like Machine Translation, Speech Recognition, and the enhancement of dictionaries and language models. OCR in Indian languages is challenging due to the numerous inflections present in them. In our experiments with open-source and commercial OCR systems, we observed Word Error Rates (WER) of around 20-50% on typewriter-printed documents. Moreover, even an OCR system with an accuracy as high as 90% is not useful unless it is aided by a mechanism to identify errors. Our work therefore addresses the problem of developing an end-to-end framework for Error Detection and Correction in Indic-OCR, with the goal of achieving 100% digitization accuracy with minimal human interaction. In our ICDAR-2017 conference paper, our models outperform state-of-the-art models for error detection in Indic-OCR for languages with varied inflections, and we solve the Out of Vocabulary problem for error correction in Indic-OCR. In addition, we have analyzed various models and worked with different domains in Sanskrit in our WSC-2018 conference paper.
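For readers unfamiliar with the metric, the following is a minimal sketch of how a Word Error Rate is commonly computed: the word-level edit distance between the reference text and the OCR output, divided by the number of reference words. The abstract does not state exactly how WER was calculated in the experiments, so this standard definition is an assumption, and the function name `word_error_rate` and the whitespace tokenization are purely illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace, and substitutions,
    insertions and deletions each cost 1.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr

    return prev[-1] / len(ref_words)
```

Under this definition, an OCR output that gets one word wrong in a four-word reference has a WER of 25%. Note that the value can exceed 100% when the output contains many spurious extra words.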

Type

Conference paper

Publication

Proceedings of International Conference on Document Analysis and Recognition (ICDAR) 2017

Other related papers are:

A Framework for Document Specific Error Detection and Corrections in Indic OCR
Rohit Saluja, Devaraj Adiga, Ganesh Ramakrishnan, Parag Chaudhuri and Mark Carman
International Workshop on Open Services and Tools for Document Analysis (ICDAR-OST) 2017, Kyoto, Japan

Improving the learnability of classifiers for Sanskrit OCR corrections
Devaraja Adiga, Rohit Saluja, Vaibhav Agrawal, Ganesh Ramakrishnan, Parag Chaudhuri, K. Ramasubramanian and Malhar Kulkarni
Proceedings of the 17th World Sanskrit Conference, Vancouver, 2018

indic ocr