Optical Character Recognition (OCR) is the process of converting document images into an editable electronic format. This has many advantages, such as data compression, enabling search and edit operations on the images/text, and creating databases for other applications like Machine Translation and Speech Recognition, as well as for enhancing dictionaries and language models. OCR in Indian languages is challenging because of the numerous inflections present in them. In our experiments with open-source and commercial OCR systems, we observed Word Error Rates (WER) of around 20-50% on typewriter-printed documents. Moreover, even an OCR system with accuracy as high as 90% is not useful unless it is aided by a mechanism to identify errors. Our work therefore deals with the problem of developing an end-to-end framework for error detection and correction in Indic-OCR, with the goal of achieving 100% digitization accuracy with minimal human interaction. In our ICDAR-2017 conference paper, our models outperformed state-of-the-art models for error detection in Indic-OCR for languages with varied inflections, and we solved the out-of-vocabulary problem for error correction in Indic-OCR. In our WSC-2018 conference paper, we further analyzed various models and worked with different domains in Sanskrit.
Other related papers are: