Error Detection and Correction in Indic-OCR

Abstract

Optical Character Recognition (OCR) is the process of converting document images into an editable electronic format. This has many advantages, such as data compression, enabling search and edit operations on the text in images, and creating databases for other applications such as Machine Translation and Speech Recognition, as well as for enhancing dictionaries and language models. OCR in Indian languages is challenging because of the numerous inflections they contain. In our experiments with open-source and commercial OCR systems, we observed Word Error Rates (WER) of around 20-50% on typewriter-printed documents. Moreover, even an OCR system with an accuracy as high as 90% is not useful unless it is aided by a mechanism to identify errors. Our work therefore deals with the problem of developing an end-to-end framework for error detection and correction in Indic-OCR, with the aim of achieving 100% digitization accuracy with minimal human interaction. In our ICDAR-2017 conference paper, our models outperform state-of-the-art models for error detection in Indic-OCR for languages with varied inflections, and we solve the Out of Vocabulary problem for error correction in Indic-OCR. In our WSC-2018 conference paper, we further analyze various models and work with different domains in Sanskrit.
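As an illustration of the Word Error Rate figure quoted above, the sketch below computes WER under one common definition: the word-level edit distance between a ground-truth line and the OCR output, divided by the number of ground-truth words. This definition, the function name word_error_rate, and the sample strings are assumptions made for illustration; the papers may compute WER differently (for example, as the fraction of misrecognized words).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumed definition for illustration only; not taken from the papers.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                 # reference word missing in OCR output
                curr[j - 1] + 1,             # extra word in OCR output
                prev[j - 1] + substitution,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of four words is misrecognized.
ground_truth = "धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः"
ocr_output = "धर्मक्षेत्रे कुरुक्षत्रे समवेता युयुत्सवः"
print(word_error_rate(ground_truth, ocr_output))  # 0.25
```

Under this definition, a WER of 20-50% means that roughly one in five to one in two words needs correction, which is why a mechanism for flagging the erroneous words matters as much as raw recognition accuracy.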

Publication
Proceedings of International Conference on Document Analysis and Recognition (ICDAR) 2017

Other related papers are:

  • A Framework for Document Specific Error Detection and Corrections in Indic OCR
    Rohit Saluja, Devaraj Adiga, Ganesh Ramakrishnan, Parag Chaudhuri and Mark Carman
    International Workshop on Open Services and Tools for Document Analysis (ICDAR-OST) 2017, Kyoto, Japan
  • Improving the learnability of classifiers for Sanskrit OCR corrections
    Devaraja Adiga, Rohit Saluja, Vaibhav Agrawal, Ganesh Ramakrishnan, Parag Chaudhuri, K. Ramasubramanian and Malhar Kulkarni
    Proceedings of the 17th World Sanskrit Conference, Vancouver, 2018
