In India's multi-linguistic landscape, the need to facilitate
smooth communication between the Centre and the states is
vital for good governance. Machine Translation offers a great
solution to this problem. Srikanth RP & Dheeraj Kapoor
do a reality check on Machine Translation technology in India
and find out what's happening on the ground.
No one understands the meaning of 'Unity in
diversity' better than an Indian. Eighteen different
languages, plus countless dialects, and what you have
is one messy pottage which very few would like to put their
hands in. While this diversity has been India's distinguishing
mark on the global scene, it has created quite a few hiccups
in the day-to-day administration of the country.
In an effort to root out this problem, the founding fathers
of our country nominated Hindi as the official
language of the country and ordained that all government communication
should be made through this medium. However, the situation
on the ground seems to be far from ideal. Quite a chunk of
this communication is done in English. Couple this with the
fact that most state governments function in their own regional
languages and the situation becomes even more complex. This
predicament has given rise to an urgent need to translate
these documents into a language best understood by the target
audience, but with translators increasingly hard to find,
what could be the solution to this problem?
C-DAC's Darbari says the socio-political structure of a country
has a direct bearing on the development of MT technology
Machine translation, say experts, could offer a viable option to those
wishing to move on to an environment where thousands of verses
in English could be converted into regional languages on the
trot. In fact, Machine Translation (MT), thanks to its ability
to change the way communication is done, has emerged as one
of the most exciting technologies in recent times.
Says R M K Sinha, professor, Computer Science and Engineering,
IIT Kanpur, "India has 18 major regional languages written
in 10 different scripts. However, English, though spoken by
a minuscule 3 percent of the population, is still the de facto
link language for administration, business and control. All
grass root information of land, agriculture, health and education
needs to be disseminated in the respective regional languages
for effective communication and understanding. Hence, translation
is as important as basic and necessary infrastructure like
roads, water and transportation for a country like India."
Agrees Dr Hemant Darbari, group co-ordinator, Applied Artificial
Intelligence group, C-DAC, "The social or political importance
of MT arises from the socio-political importance of translation
in communities where more than one language is generally spoken.
The technology assumes even greater importance in non-English
speaking countries such as India."
Adds Durgesh Rao, research scientist, Knowledge based Computer
Systems Division, NCST, "India is a linguistically rich
area: we have Hindi, English, and fourteen other official
languages, each of which is spoken by millions of people.
Since most information is generated in English, Machine Translation
has emerged as a critical technology that can help communicate
and share information more effectively."
NCST's Rao says the lack of online lexical resources has hampered
the growth of MT technology in India
The MT revolution was kick-started by C-DAC when it started work
on Natural Language Processing (NLP) and developed a parser
which could parse Hindi, Sanskrit, Gujarati, English and German.
While developing this technology, the company was looking
at practical implementations of the same and suggested it
to various agencies. Realising the immense potential of MT,
the Department of Official Language (DOL), Government of India,
began actively funding such projects.
Today, the Ministry of Information Technology has realised
the importance of Machine Translation and has identified the
following domains for development of domain-specific translation
systems: government administrative procedures and formats,
parliamentary questions and answers, pharmaceutical information,
and legal terminology and judgements. The ministry also initiated
the 'Technology Development for Indian Languages' (TDIL)
project in the year 1990-91 to support and fund R&D efforts
in the area of Information processing in Indian languages
covering machine translation among others.
However, with 18 different languages, translation is no child's
play. As English and Hindi are a critical pair of languages
and constitute the bulk of the correspondence in government
offices, this pair has been identified as the priority area
for machine-aided translation. Accordingly, two specific areas
of research have been identified: MT systems for
translation between Indian languages, and MT systems for translation
from English to Hindi. Currently, three institutions in
the country, namely C-DAC, NCST and the IITs, have taken the lead
in developing applications using this cutting-edge technology.
Under the knowledge-based computer systems project of the
DOE, C-DAC developed VYAKARTA, which could parse English,
Hindi, Gujarati and Sanskrit. It used the same parser to develop
MANTRA (a machine-assisted translation tool for translating
official language sentences from English to Hindi). The tool
was demonstrated to the Department of Official Language, which
financed the project entitled 'English to Hindi Computer
assisted Translation System' for administrative purposes.
The aim of the project was to design, develop and implement
a computer assisted translation system for personnel administration.
The system is now able to translate letters and circulars
such as appointment letters and transfers and is also capable
of taking inputs from standard word processing and DTP packages.
After successful completion of English to Hindi translation
in the above-specified domain, C-DAC is now looking
to extend it to other domains and apply the developed techniques
to multi-lingual translation. This capability would also
enable it to achieve machine translation between any language
pair.
IIT Mumbai's Bhattacharya says interaction with the industry helps
them understand the technology better
Another organisation involved in the area of MT is Mumbai-based
NCST. NCST was one of the first institutes in India to work
on Machine Translation. Explains Rao, "In the late 80s we
developed an early prototype, ScreenTalk, to translate PTI
news stories of specific categories, using a script-like approach.
Since then, we have continued our work and have developed
MaTra, a general-purpose framework for translation between
English and Indian languages, starting with Hindi." The
focus in MaTra is on the innovative use of man-machine
synergy. Currently, the domain being explored is news, which
can later be extended to any domain. The system breaks an
English sentence into chunks, analyses the structure and displays
it allowing the user to verify and correct it. MaTra can be
used in two ways. In the automatic mode, the system gives
the best translation it can, which can be later post-edited
by the user. In the manual mode, the user can guide the system
towards the correct translation using an intuitive GUI. Adds
Rao, "We have an advanced prototype of this system that
works for simple sentences, and are extending it to cover
more complex sentences in an incremental fashion."
Talk about cutting-edge technology and you simply cannot keep
the IITs out of the picture. Explains Dr Pushpak Bhattacharya,
Department of Computer Science and Engineering, IIT Mumbai,
"IITs have long felt the need for investing in Machine
Translation. IIT Kanpur took the lead through projects such
as Anusaaraka, Anglabharati, Anubharati etc. Currently, a
very modern approach to this problem through the Universal
Networking Language is being pursued in IIT Bombay. The faculty
regularly interacts with industries on MT related problems.
Also numerous student projects at bachelors, masters and PhD
level are undertaken in IITs."
ANGLABHARATI is said to be a revolutionary system in the field of Machine
Translation. The system is a machine-aided translation system
for translation from English to Hindi, for the specific
domain of public health campaigns. Explains Sinha, "We
at IIT Kanpur have developed ANGLABHARATI (a rule-based system
for translation from English to all Indian languages) and
ANUBHARATI (an abstracted example-based approach). An alpha
version of a system for English to Hindi based on ANGLABHARATI
technology is ready and is being field-tested by ER&DCI
Noida."
Developing a machine translation system
is not simple. A good machine translation system cannot
be produced by merely replacing source language words with
target language words, because a word-for-word translation does not
produce a satisfying target language text. A
good machine translation system must incorporate not only
a good knowledge of the vocabulary of both the source and
target language, but also of their grammar. For example, C-DAC's
MANTRA translates not word to word or rule to rule
but lexical tree to lexical tree, so that transfer can be done
chunk by chunk. The system uses the Tree Adjoining
Grammar (TAG) formalism for both parsing of English sentences
and generation of Hindi sentences. Currently focussing on
the domain of personnel administration, C-DAC claims that
text related to appointments, transfers and office orders is
translated successfully with 90-95 percent accuracy.
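To see why word substitution alone falls short, consider a toy sketch in Python. It is not MANTRA's TAG-based method; the three-word lexicon (romanised Hindi), the function names and the fixed "subject verb object" input shape are all assumptions made for illustration. English places the verb before the object, while Hindi places it after, so the translation needs some knowledge of sentence structure.

```python
# Hypothetical toy lexicon, English -> romanised Hindi (illustration only).
LEXICON = {
    "Ram": "Ram",
    "reads": "padhta hai",
    "books": "kitaabein",
}


def word_for_word(sentence):
    """Replace each English word in place, keeping English word order."""
    return " ".join(LEXICON.get(word, word) for word in sentence.split())


def structural_transfer(sentence):
    """Treat the input as 'subject verb object' and emit 'subject object verb'.

    This stands in for the parse / transfer / generate steps of a real
    system; it handles only three-word sentences.
    """
    words = sentence.split()
    if len(words) != 3:
        raise ValueError("toy parser handles only 'subject verb object'")
    subject, verb, obj = words
    reordered = (subject, obj, verb)  # Hindi puts the verb last
    return " ".join(LEXICON.get(word, word) for word in reordered)


if __name__ == "__main__":
    print(word_for_word("Ram reads books"))        # English order, reads wrongly
    print(structural_transfer("Ram reads books"))  # Hindi order
```

The first call keeps the English order and yields "Ram padhta hai kitaabein", while the second yields the natural "Ram kitaabein padhta hai". A real system has to recover the structure by parsing, which is the hard part this sketch skips.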
Adds Ajai Jain, associate professor, Computer Science and
Engineering, IIT Kanpur, "The most common technique to
use machine translation is by coding the grammatical rules
of source and target languages in the software and get the
translation done using these rules and dictionaries specifically
created for this purpose. The other technique is to store
the source and target language pairs and try and match the
new sentences for similarities from the existing example base
and obtain a translation based on the best match. There can
be an amalgamation of the above techniques, wherein patterns
are stored in place of raw examples. In addition, statistical
methods can be deployed to increase efficiency of the translation."
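The second technique Jain describes, matching a new sentence against stored examples, can be sketched in a few lines of Python. This is a minimal illustration and not the code of any of the systems named here: the example base, the romanised Hindi strings and the function names are assumptions, and similarity is measured with the standard library's `difflib.SequenceMatcher` over word lists.

```python
from difflib import SequenceMatcher

# Hypothetical example base of (English, romanised Hindi) sentence pairs.
EXAMPLE_BASE = [
    ("What is your name", "Aapka naam kya hai"),
    ("Where is your house", "Aapka ghar kahan hai"),
    ("How are you", "Aap kaise hain"),
]


def tokens(sentence):
    """Lower-case the sentence and split it into words."""
    return sentence.lower().split()


def best_match(sentence):
    """Return (stored source, stored translation, similarity score)
    for the stored example most similar to the new sentence."""
    scored = [
        (SequenceMatcher(None, tokens(sentence), tokens(source)).ratio(),
         source, target)
        for source, target in EXAMPLE_BASE
    ]
    score, source, target = max(scored)
    return source, target, score


if __name__ == "__main__":
    source, target, score = best_match("Where is your school")
    print(source, "->", target, f"(similarity {score:.2f})")
```

For "Where is your school" the closest stored example is "Where is your house", with a similarity of 0.75. The sketch stops at retrieval; a working system would still have to replace the word that differs before the output is a usable translation.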
While all the current projects have focused their energies
on machine translation from English to Hindi, extending them
to other languages, the Anusaaraka project, which started at
IIT Kanpur and is now being continued at IIIT Hyderabad, is
innovative and was started with the explicit aim of translation
from one Indian language to another. Anusaaraka is
software capable of converting text from one Indian language
to another. It produces output which a reader can understand
but is not exactly grammatical. For example, a Bengali to
Hindi Anusaaraka can take a Bengali text and produce output
in Hindi which can be understood by the user but will not
be grammatically perfect. Likewise, a person visiting a site
in a language he does not know can run Anusaaraka and read
the text. Anusaarakas have been built from Telugu, Kannada,
Bengali, Marathi and Punjabi to Hindi. The system so developed
will be available as open source software.
Sceptics who doubt the efficiency of MT systems would be surprised
to know that there are several MT systems in use around the
world. Examples include the well-known Systran (used by the
AltaVista search engine) and METEO (used at the Canadian Meteorological
Centre, where it has been translating over 45,000 words in weather
bulletins since 1977).
JNU's Anvita Abbi believes India has made rapid progress in specific
domains in Machine Translation
C-DAC is making sure that MT is poised for even more exciting
times with the proposed development of a Mantra Translation
Server which can be accessed by anyone on the Internet using a browser.
All a user has to do is send the English text and the server
sends back the translated text in the language requested.
C-DAC is also working on a domain specific translated chat
application. Here, one can select the language and all the
communication will be done in the selected language. This
means that even if you select Hindi and the other person selects
English, you will receive all messages in Hindi although the
other person types in English.
Despite such innovative projects enjoying the complete support
of the government, the development of the technology has not
been as rapid as expected. What could have hampered this growth?
"Machine Translation is acknowledged as a major challenge
the world over. When you take languages that are quite diverse,
such as English and Hindi, the complexity is compounded. Since
there is lack of appreciation of the nature of the task, popular
perception of MT falls into two extreme categories. MT is
either viewed as a simple problem that is already solved,
or is dismissed as totally impossible. The truth is somewhere
in between. As you know, services like AltaVista and Google
are offering rough automatic translation among several languages.
This is mainly among European languages, and between English
and Far Eastern languages. Indian languages are yet to be
covered! The languages that are now covered represent more
than 30 years of hard work! Of course, we can learn from this
experience and have Indian MT systems off the ground faster,
but we obviously cannot underestimate the size of the task,"
says Rao.
Adds Jain, "Today nobody in the government has a roadmap
for development of technologies like MT. They try to sponsor
short term projects and generally have a wide gap between
two projects. This leads to only patchwork and losing
trained manpower in this area."
As seen with any technology, MT in India too has its share
of blemishes. For example, the much touted Anusaaraka project
is dismissed as a non-starter by R M K Sinha who launched
the ANGLABHARATI project. Explains Sinha, "The Anusaaraka
methodology works to some extent only for the specific pair
of languages for which it was designed. It heavily exploits
the fact that two Indian languages have the same word order
but this is not necessarily true in all situations. Technically,
Anusaaraka can be considered to be a specific case of ANGLABHARATI
technology. Anubharati is generic in nature whereas Anusaaraka
is language pair specific with limited growth capability and
no guarantee for grammatical forms. Incidentally, the Anusaaraka
project was sponsored by the Government of India with large
funding and nothing tangible has come out of it even after
a decade of work. In fact, it has been a great mistake on
the part of the government to have funded this project to
the extent that other more promising MT paradigms investigation
and development have been starved of funds. It is unfortunate
that the country has been pushed behind by almost five years
due to lopsided support to Anusaaraka."
So what is the solution to such a problem? Explains Rao, "Coming
to the issue of lexical resources, building MT systems without
basic lexical resources such as online corpora, lexicons and
thesauri is like trying to build cities without brick, mortar
and cement. There is an acute lack of online lexical resources
for Indian languages. Whatever little exists has been developed
for specific groups and is not easily shareable. This is a
massive task which cannot be done by any single group. Indian
groups have now begun to address this challenge jointly by
starting a collaborative open source initiative called LERIL (Lexical
Resources for Indian Languages), which includes several groups
such as IIIT Hyderabad, NCST, AU-KBC and Kendriya Hindi Sansthan."
Sharing of resources could be the key to helping MT projects
take off at a faster rate. Agrees Sinha, "What we lack
today is a rich lexical database and availability of trained
manpower to do R&D in this area. Given the unique multi-lingual
culture of the country and our leadership in this area, we
can become a global player in the field of MT if proper encouragement
and funding is provided for R&D."
Adds Bhattacharya, "The main reason is the absence of
lexical resources. Fortunately, the Ministry of Information
Technology is taking concrete steps towards creating these
resources. Using the funding from the Ministry, IIT Bombay
is building the Hindi Wordnet (to be followed by the Marathi
Wordnet). Once built, this will facilitate MT R&D in the
country."
In conclusion, it would be worthwhile to add that despite
all the issues involved, India has over the years made significant
progress in the field of MT. Currently, the Ministry of Information
Technology sponsors nearly 75 percent of these projects. Expressing
her views on the state of the technology in India, Anvita
Abbi, professor of linguistics, Jawaharlal Nehru University,
New Delhi, says, "The main problems faced in the area
of MT are syntactical. With various grammatical issues involved
in the languages, since each language has disparate structures, it
is difficult to capture these differences. However, MT has
over the years made notable progress and has been quite successful
in scientific and domain specific fields because of its objectivity."