The Internet today has to deal with multilinguality. People speak
different languages and the number of natural languages along with
their dialects is estimated to be close to 4,000. Of the top 100
languages in the world, English occupies the top position, with
Hindi coming fifth and Marathi fourteenth.
This is where UNL (Universal Networking Language) comes in. It is
a digital meta language for describing, summarizing, refining,
storing and disseminating information in a machine-independent and
human-language-neutral form. UNL represents information (i.e.,
meaning) sentence by sentence. Sentence information is represented
as a hyper-graph having concepts as nodes and relations as arcs.
This hyper-graph is also represented as a set of directed binary
relations, each between two of the concepts present in the sentence.
Concepts are represented as character-strings called UWs (Universal
Words).
The encoded UNL is used not only for machine translation, but
also for other document-processing activities. The encoding process
can be looked upon as the process of knowledge extraction. The
extracted knowledge is used for automatic hyperlinking, summarizing
and categorizing of documents.
UNL can describe and disseminate information over the net
irrespective of the language used by different people
The UNL vocabulary consists of the following.
UWs (Universal Words): Labels that represent word meaning
Relation Labels: Tags that represent the relationship between UWs
Attribute Labels: Express additional information about the UWs that appear in a sentence
A UNL expression can be seen as a UNL graph. For
example,
John, who is the chairman of the company, has arranged a meeting
at his residence.
The UNL for the sentence is
[S]
mod(chairman(icl>post), company)
aoj(chairman, John)
agt(arrange.@complete, John)
pos(residence, John)
obj(arrange, meeting)
plc(arrange, residence)
[/S]
You can see the UNL graph for the sentence in the accompanying
picture.
In the above, agt means the agent, obj the object, plc the place,
aoj the attributed object, pos the possessor and mod the modifier.
The detailed list of such relations can be found in the reference
cited in the box on the next page. Also, the icl construct helps
restrict the meaning of a word. The example above shows only one
such restriction, viz., chairman(icl>post).
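The view of a UNL expression as a set of directed binary relations can be sketched in a few lines of Python. This is only an illustration built on the example sentence above: the regular expression, the helper names (headword, parse_unl, to_graph) and the decision to strip restrictions and attributes down to a bare headword are assumptions made for this sketch, not part of any official UNL tool.

```python
import re
from collections import defaultdict

# The example expression from the text, one or more relations per line.
UNL_TEXT = """[S] mod(chairman(icl>post), company) aoj(chairman, John)
agt(arrange.@complete, John) pos(residence, John)
obj(arrange, meeting) plc(arrange, residence) [/S]"""

# label(arg1, arg2); each argument may carry one "(...)" restriction.
RELATION = re.compile(
    r"(\w+)\(\s*([^(),]+(?:\([^()]*\))?)\s*,\s*([^(),]+(?:\([^()]*\))?)\s*\)"
)

def headword(uw):
    """Drop restrictions like (icl>post) and attributes like .@complete."""
    return re.split(r"[(.]", uw.strip(), maxsplit=1)[0]

def parse_unl(text):
    """Return the expression as a list of (label, source, target) triples."""
    body = text.replace("[S]", "").replace("[/S]", "")
    return [(label, headword(a), headword(b))
            for label, a, b in RELATION.findall(body)]

def to_graph(relations):
    """Group outgoing arcs by source concept: {source: [(label, target), ...]}."""
    graph = defaultdict(list)
    for label, source, target in relations:
        graph[source].append((label, target))
    return dict(graph)

if __name__ == "__main__":
    for source, arcs in to_graph(parse_unl(UNL_TEXT)).items():
        print(source, "->", arcs)
```

Each triple is one arc of the graph, so the node "arrange" ends up with three outgoing arcs (agt, obj and plc), matching the relations listed for the sentence.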
Conversion to and from UNL expressions
Encoding into UNL is first of all a parsing problem. The analysis
process uses a framework that carries out morphological, syntactic
and semantic analysis synchronously. It analyses sentences by
accessing a knowledge-rich lexicon and interpreting the Analysis
Rules, which essentially capture the language phenomena. Formulating
the rules amounts to programming a sophisticated
symbol-processing machine. Thus, the process of converting
natural-language sentences into UNL involves constructing analysis
rules and building a knowledge-rich lexicon linking the language
words with UWs covering the extremely varied language phenomena and
concepts.
Some examples of dictionary entries for Hindi are given
below.
The attributes in the lexicon are collectively called Lexical
Attributes (both semantic and syntactic attributes). The syntactic
attributes include the word category (noun, verb, adjective, etc.)
and attributes like person and number for nouns and tense for
verbs.
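A lexicon entry of the kind described above can be pictured as a small record that links a language word to a UW and carries its lexical attributes. The sketch below is purely illustrative: the class name LexiconEntry, the field names, the attribute values and the English sample entries (including the restrictions written on the UWs) are hypothetical and do not reproduce the actual dictionary format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LexiconEntry:
    word: str                            # language word (headword)
    uw: str                              # Universal Word it is linked to
    category: str                        # syntactic category: noun, verb, ...
    syntactic: frozenset = frozenset()   # e.g. person, number, tense
    semantic: frozenset = frozenset()    # e.g. coarse meaning classes

# Hypothetical sample entries, for illustration only.
LEXICON = {
    "chairman": LexiconEntry("chairman", "chairman(icl>post)", "noun",
                             frozenset({"singular", "third_person"})),
    "arrange": LexiconEntry("arrange", "arrange(icl>do)", "verb",
                            frozenset({"transitive"}),
                            frozenset({"action"})),
}

def lookup(word):
    """Return the entry for a word, or None if the lexicon lacks it."""
    return LEXICON.get(word.lower())

if __name__ == "__main__":
    entry = lookup("Arrange")
    print(entry.uw, entry.category, sorted(entry.syntactic))
```

Keeping the attributes as sets lets an analysis rule test for a property with a simple membership check, for example "transitive" in entry.syntactic.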
Decoding the UNL expressions into a sentence of any target
language is done using the word dictionary and the generation rules
of the target language. Initially, syntax planning of the target words
is done, after which the morphology is generated to produce a
natural sentence.
Some statistics
We have constructed analysers for Hindi and English and the
generator for Hindi. Work on the generator for Marathi has also
been started. This needed linking English, Hindi and Marathi
language strings with the UWs. Also, the Analysis and Generation
rules for these languages had to be made. Below is
some quantitative information for the English and Hindi
languages.
Number of Entries in the Hindi-UW dictionary: 70,000
Number of Analysis Rules for English: ~5000
Number of Analysis Rules for Hindi: ~6000
Number of Generation Rules for Hindi: ~6500
Other applications
Since the UNL expressions can be looked upon as the extracted
knowledge of the documents, we have carried out research on how to
use these for various document-processing tasks. Notable among them
are automatic hyperlinking and text clustering. In the former, the
keywords, which serve as candidate anchors for outgoing links, are
obtained from the UNL graphs. Heavily linked words are possible
candidates for keywords.
Similarly, the linkage and relation label information in the UNL
graphs are used for constructing the document vectors in the
semantic dimension. These vectors are then processed with clustering
algorithms. The experimental results are promising.
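Both ideas can be illustrated on the relation triples of the example sentence. The following is a minimal sketch under stated assumptions, not the method actually used in the experiments: "heavily linked" is read here as having a high node degree, the document vector simply counts UWs and label:target features, and cosine similarity stands in for whatever measure the clustering algorithms use. All function names are hypothetical.

```python
import math
from collections import Counter

# (label, source, target) triples for the example sentence in the text.
RELATIONS = [
    ("mod", "chairman", "company"), ("aoj", "chairman", "John"),
    ("agt", "arrange", "John"), ("pos", "residence", "John"),
    ("obj", "arrange", "meeting"), ("plc", "arrange", "residence"),
]

def keyword_candidates(relations, top_n=3):
    """Rank concepts by the number of arcs touching them (node degree)."""
    degree = Counter()
    for _label, source, target in relations:
        degree[source] += 1
        degree[target] += 1
    ranked = sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _count in ranked[:top_n]]

def document_vector(relations):
    """Count UWs (linkage) and label:target pairs (relation labels)."""
    vector = Counter()
    for label, source, target in relations:
        vector[source] += 1
        vector[target] += 1
        vector[f"{label}:{target}"] += 1
    return vector

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    print(keyword_candidates(RELATIONS))  # ['John', 'arrange', 'chairman']
    other = document_vector([("agt", "arrange", "John")])
    print(round(cosine(document_vector(RELATIONS), other), 3))
```

Vectors built this way for a collection of documents can be handed to any standard clustering routine; documents that share concepts in the same roles end up close together.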
UNL in India
In India, UNL work is being
carried on at the Computer Science and Engineering
Department, IIT Bombay. Here, we do sentence-level
encoding of English, Hindi and Marathi into the UNL form
and decode this information into Hindi and Marathi, thus
creating a way of semi-automated translation from
English to Hindi and Marathi and also between Hindi and
Marathi. For more on UNL, visit www.unl.ias.unu.edu
Present and future
UNL has been found to be very useful
for various multilingual information tasks as well as document
processing applications. The UNL graph is looked upon as the
extracted knowledge from the documents.
The countries participating in this project are Japan, China,
Indonesia, India, Jordan, Russia, Italy, France, Spain and Brazil.
The United Nations Headquarters in Geneva is developing
multilingual information access systems using the UNL. At
IIT Bombay, the following high-impact projects are making use of the
UNL representation for various text-processing and language
technology tasks.
Multi-lingual Web
UNL can be a very effective
vehicle for developing multilingual Web-based
applications. The UNL expressions provide the meaning
content of the text and search can be carried out on
this meaning base instead of the text. This, of course,
means developing a novel kind of search-engine
technology. The merit of such a system is that the
information in one language need not be stored in
multiple languages.
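Searching on the meaning base rather than on the text can be pictured as an index keyed by UWs. The sketch below is a toy illustration only: the word-to-UW table, the document identifiers and the function names are hypothetical, and a real system would obtain the UWs from the analysers rather than from a hand-written table.

```python
from collections import defaultdict

# Hypothetical word-to-UW table covering an English and a Hindi word each.
WORD_TO_UW = {
    "meeting": "meeting", "बैठक": "meeting",
    "residence": "residence", "निवास": "residence",
}

# Each document is stored once, as the set of UWs found in its UNL graphs.
DOCUMENTS = {
    "doc_1": {"arrange", "meeting", "residence"},
    "doc_2": {"meeting", "company"},
}

def build_index(documents):
    """Map every UW to the identifiers of the documents containing it."""
    index = defaultdict(set)
    for doc_id, uws in documents.items():
        for uw in uws:
            index[uw].add(doc_id)
    return index

def search(index, query_word):
    """Translate the query word to a UW, then look it up in the index."""
    uw = WORD_TO_UW.get(query_word)
    return sorted(index.get(uw, set())) if uw else []

if __name__ == "__main__":
    index = build_index(DOCUMENTS)
    print(search(index, "meeting"))  # ['doc_1', 'doc_2']
    print(search(index, "बैठक"))      # ['doc_1', 'doc_2']
```

Because both query words map to the same UW, the English and the Hindi query return the same documents without the documents being stored in either language.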
The Center for Indian Language Technology Solutions
(www.cse.iitb.ac.in/tukaram), funded by the Ministry of Information
Technology, India.
The Center for Intelligent Internet Research
(www.cse.iitb.ac.in/laiir), funded by Tata Consultancy Services.
Media Lab Asia (www.ircc.iitb.ac.in/~MLAsia), funded by the
Ministry of Information Technology, India and with participation
from the Massachusetts Institute of Technology, USA.
Commercial-level exploitation of UNL technology for
Internet-scale multilingual access is expected within a couple
of years.
Pushpak Bhattacharyya,
Department of Computer Science and Engineering, Indian Institute of
Technology, Bombay