Optical Character Recognition (OCR) and Post-editing system for the Sanskrit language.
Optical Character Recognition (OCR) is the process of converting document images into an editable electronic format. This has many advantages: data compression, search and edit options for the images and text, and the creation of databases for other applications such as Machine Translation and Speech Recognition, as well as for enhancing dictionaries and language models. OCR in Indian languages is quite challenging because of their rich inflection. We therefore started with the problem of developing "OpenOCRCorrect", an end-to-end framework for error detection and correction in Indic OCR. Our models outperform state-of-the-art results in error detection in Indic OCR for six Indic languages with varied inflections. Even when OCR accuracy is good, the recognized text still needs considerable improvement, so the second step in digitizing such texts is spelling correction and formatting of the OCR output. The end goal is to bring the generated OCR text into agreement with the scanned images of the 10000 books. This will also help preserve the rich Sanskrit artifacts and reference materials so that they can be consulted in the future.
Sanskrit is an ancient language and a proud part of our Indian heritage. It has been passed down the generations, first orally and later in print. As we witness an era of immense technological advances, it is our duty to carry our treasure trove of knowledge, encoded in Sanskrit books, forward into the digital era. This motivated us to undertake the digitization of our rich Sanskrit literature in the form of the Akshar Anveshini project. With this endeavour we aim to convert several thousand old Sanskrit books on various topics, such as mathematics and astronomy, to a digital format. We wish to make these books easily accessible and discoverable to a greater number of people than a physical print format allows, and to make sure that a proud part of our culture stands the test of time.
The in-house OCR (Optical Character Recognition) model for the digitization of Sanskrit books.
The Parinamika OCR system can produce OCR output from any of the following inputs:
1. PDF documents
2. Page Images (jpeg, png)
3. Line Images
It uses state-of-the-art technology that gives 95.14% character-level accuracy on Sanskrit text.
An end-to-end framework for error detection and correction in Indic OCR. It provides suggestions for words that were probably misrecognized during OCR, so the user can correct any mistakes from the OCR process.
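As an illustration of what a character-level accuracy figure can mean, here is a minimal sketch in Python. It assumes accuracy is defined as one minus the edit distance between the OCR output and the ground-truth text, divided by the ground-truth length, and it counts characters as Unicode code points. Both choices are assumptions made for illustration; the page does not state how the 95.14% figure was computed, and the function names are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))
```

For Devanagari text, counting code points treats a consonant and its vowel sign as separate characters; an evaluation based on grapheme clusters would give somewhat different numbers.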
Credentials to use the software can be provided upon acknowledgement/request.
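The idea of flagging probably misrecognized words and offering suggestions can be sketched with a simple dictionary lookup. This is not the project's error detection model: it is a stand-in that assumes a known vocabulary is available and uses the standard library's `difflib.get_close_matches` to rank candidates. The function name `suggest_corrections` and its parameters are hypothetical.

```python
from difflib import get_close_matches


def suggest_corrections(ocr_words, vocabulary, max_suggestions=3, cutoff=0.6):
    """Return {suspect_word: [suggestions]} for OCR words missing from the vocabulary.

    Assumption: a word is treated as a probable OCR error simply because it
    is absent from the vocabulary. A real system would use a learned model.
    """
    vocab_set = set(vocabulary)
    vocab_list = sorted(vocab_set)  # stable candidate order for matching
    suggestions = {}
    for word in ocr_words:
        if word in vocab_set:
            continue  # known word, nothing to suggest
        suggestions[word] = get_close_matches(
            word, vocab_list, n=max_suggestions, cutoff=cutoff
        )
    return suggestions


if __name__ == "__main__":
    vocabulary = ["dharma", "karma", "yoga"]
    ocr_words = ["dharma", "karna", "yoga"]
    print(suggest_corrections(ocr_words, vocabulary))
```

A plain dictionary lookup handles heavily inflected languages poorly, because valid inflected or compound forms are often missing from any fixed word list and would be flagged as errors.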
A tool for efficient management of the ongoing annotation work.
The Project Manager UI provides the following functionalities:
1. Book Management
2. Assigning sets
3. Corrector, Verifier, Formatter and Proofreader tracking
Publications related to the projects.
In Proceedings of The 35th AAAI Conference on Artificial Intelligence (AAAI 2021).
In Proceedings of The 15th International Conference on Document Analysis and Recognition (ICDAR 2019), Sydney, Australia.
In Proceedings of The 15th International Conference on Document Analysis and Recognition (ICDAR 2019), Sydney, Australia.
In Proceedings of The 2nd International Workshop on Open Services and Tools for Document Analysis, associated with the 15th International Conference on Document Analysis and Recognition (ICDAR-OST 2019), Sydney, Australia.
Accepted paper at the 7th IEEE Winter Conference on Applications of Computer Vision (WACV), 2019, Hawaii, USA.
International Conference on Document Analysis and Recognition (ICDAR) 2017, Kyoto, Japan.
1st International Workshop on Open Services and Tools for Document Analysis (ICDAR-OST) 2017, Kyoto, Japan.
Proceedings of the 17th World Sanskrit Conference, Vancouver, 2018.
Research and Innovation Symposium in Computing (RISC) 2017 (Most Admiring Poster Presentation Award), IIT-Bombay, India.
In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Beijing, China, July 2015.