This folder contains the transliteration resources used in the following paper:

Transliteration for Resource Scarce Languages
Manoj K. Chinnakotla, Om P. Damani, Avijit Satoskar
ACM Transactions on Asian Language Information Processing (TALIP),
Special issue on Indian Language IR,
September 2010.

Please note that the Persian-English rule base and transliteration dataset
were provided by Dr. Sarvnaz Karimi (skarimi@unimelb.edu.au).

S. Karimi, Machine Transliteration of Proper Names between English and
Persian, PhD Thesis, RMIT University, 2008.

The directory structure is as follows:

1. Hindi-Dataset:

* Dataset of 30K parallel Hindi-English transliteration pairs
* Word-origin tags (Indo-Arabic and Non Indo-Arabic) for
the above transliteration pairs

2. Hindi-RuleBases:

* Common rule base mapping Devanagari character sequences to
English character sequences
* Origin-specific rule bases for Indo-Arabic and Non Indo-Arabic words
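
As a rough illustration of how a rule base of this kind can be applied, the
sketch below segments a Devanagari word by greedy longest match and expands
every combination of the English alternatives into candidate transliterations.
The rule entries, the names TOY_RULES, segment and candidates, and the greedy
segmentation strategy are all assumptions made for this example; they are not
taken from the distributed rule base files, whose format is not described here.

```python
from itertools import product

# Hypothetical toy rules for illustration only; NOT copied from the
# rule base files shipped in this folder.
TOY_RULES = {
    "क": ["ka", "k"],
    "का": ["kaa", "ka"],
    "म": ["ma", "m"],
    "ल": ["l"],
}


def segment(word, rules):
    """Split a word into rule keys using greedy longest match."""
    max_len = max(len(key) for key in rules)
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible chunk first, then shorter ones.
        for length in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + length]
            if chunk in rules:
                pieces.append(chunk)
                i += length
                break
        else:
            raise ValueError(f"no rule covers {word[i]!r} at position {i}")
    return pieces


def candidates(word, rules):
    """Return every English candidate obtained by combining rule outputs."""
    options = [rules[piece] for piece in segment(word, rules)]
    return ["".join(combo) for combo in product(*options)]


if __name__ == "__main__":
    for candidate in candidates("कमल", TOY_RULES):
        print(candidate)
```

A candidate list produced this way would still need to be ranked, for example
by a model trained on the monolingual resources listed below.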

3. Monolingual-Resources:

* Unique words file from English Wikipedia used to train the
Character Sequence Model (CSM)
* Unique words file with log-count weights (from the weighting step)
* Indian origin words used for training the Indo-Arabic CSM
in origin-wise CSM training
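
For readers who want to build a similar weighted word list from their own
corpus, the sketch below counts word occurrences and attaches a logarithmic
weight to each unique word. The formula log(1 + count), the use of the natural
logarithm, and the function name logcount_weights are assumptions made for
illustration; the exact weighting used to produce the files in this folder is
not specified here.

```python
import math
from collections import Counter


def logcount_weights(tokens):
    """Map each unique word to log(1 + count); an assumed weighting."""
    counts = Counter(tokens)
    return {word: math.log(1 + count) for word, count in counts.items()}


if __name__ == "__main__":
    # Tiny made-up token list, only to show the shape of the output.
    sample = ["delhi", "mumbai", "delhi", "pune"]
    for word, weight in sorted(logcount_weights(sample).items()):
        print(f"{word}\t{weight:.4f}")
```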

4. Persian-Dataset:

* Persian-English transliteration dataset with 19,940 entries
* The Train, Dev and Test splits used for evaluation are also included

5. Persian-RuleBases:

* Persian-English rule base
