PC Quest, October 2002 issue



Many Languages on the Net

What is Universal Networking Language and Multilingual Information Processing


Thursday, September 12, 2002

The Internet today has to deal with multilinguality. People speak different languages and the number of natural languages along with their dialects is estimated to be close to 4,000. Of the top 100 languages in the world, English occupies the top position, with Hindi coming fifth and Marathi fourteenth. 

This is where UNL (Universal Networking Language) comes in. It is a digital meta-language for describing, summarizing, refining, storing and disseminating information in a machine-independent and human-language-neutral form. UNL represents information (i.e., meaning) sentence by sentence. The information in a sentence is represented as a hyper-graph with concepts as nodes and relations as arcs. This hyper-graph is also represented as a set of directed binary relations, each between two of the concepts present in the sentence. Concepts are represented as character strings called UWs (Universal Words).

The encoded UNL is used not only for machine translation, but also for other document-processing activities. The encoding process can be looked upon as the process of knowledge extraction. The extracted knowledge is used for automatic hyper linking, summarizing and categorizing of documents.

UNL can describe and disseminate information over the net irrespective of the language used by different people

The UNL vocabulary consists of the following.

UWs (Universal Words): Labels that represent word meaning

Relation Labels: Tags that represent the relationship between UWs

Attribute Labels: Express additional information about the UWs that appear in a sentence

A UNL expression can be seen as a UNL graph. For example, 

John, who is the chairman of the company, has arranged a meeting at his residence.

The UNL for the sentence is

[S]
mod(chairman(icl>post), company)
aoj(chairman, John)
agt(arrange.@complete, John)
pos(residence, John)
obj(arrange, meeting)
plc(arrange, residence)
[/S]

You can see the UNL graph for the sentence in the accompanying picture.

In the above, agt means the agent, obj the object, plc the place, pos the possessor, aoj the attributed object and mod the modifier. The detailed list of such relations can be found in the reference cited in the box on the next page. The icl construct helps restrict the meaning of a word; the example above shows only one such restriction, viz., chairman(icl>post).
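The set of binary relations above maps naturally onto a labelled directed graph. The following is a minimal Python sketch of reading such an expression into edges, assuming one relation per line in the form shown in the example. The function names (parse_unl, split_uw, headword) are illustrative and not part of any UNL tool, and collapsing each UW to its headword is a simplification made here so that chairman(icl>post) and chairman refer to the same node.

```python
import re
from collections import defaultdict

# The example expression from the article, one binary relation per line.
SAMPLE = """
[S]
mod(chairman(icl>post), company)
aoj(chairman, John)
agt(arrange.@complete, John)
pos(residence, John)
obj(arrange, meeting)
plc(arrange, residence)
[/S]
"""

# label(source, target); the source may itself contain parentheses.
RELATION = re.compile(r"^(\w+)\((.+),\s*([^,]+)\)$")


def split_uw(term):
    """Split 'arrange.@complete' into the UW and its attribute labels."""
    head, *attrs = term.strip().split(".@")
    return head, ["@" + a for a in attrs]


def headword(uw):
    """Drop a restriction such as '(icl>post)' to get a node name (simplification)."""
    return uw.split("(", 1)[0]


def parse_unl(text):
    """Return (edges, attributes, restrictions) for one [S]...[/S] block."""
    edges = []                      # (relation label, source node, target node)
    attributes = defaultdict(set)   # node -> attribute labels seen on it
    restrictions = {}               # node -> restricted UW, when one was given
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line in ("[S]", "[/S]"):
            continue
        match = RELATION.match(line)
        if match is None:
            raise ValueError(f"not a binary relation: {line!r}")
        label, source, target = match.groups()
        nodes = []
        for term in (source, target):
            uw, attrs = split_uw(term)
            node = headword(uw)
            if uw != node:
                restrictions[node] = uw
            attributes[node].update(attrs)
            nodes.append(node)
        edges.append((label, nodes[0], nodes[1]))
    return edges, dict(attributes), restrictions


edges, attributes, restrictions = parse_unl(SAMPLE)
```

For the example sentence this yields six edges, the attribute @complete on arrange, and the single restriction chairman(icl>post).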

Conversion to and from UNL expressions
Encoding into UNL is first of all a parsing problem. The analysis process uses a framework that carries out morphological, syntactic and semantic analysis simultaneously. It analyses sentences by accessing a knowledge-rich lexicon and interpreting the Analysis Rules, which essentially capture the language phenomena. Formulating the rules amounts to programming a sophisticated symbol-processing machine. Thus, converting natural-language sentences into UNL involves constructing analysis rules and building a knowledge-rich lexicon that links the words of the language with UWs, covering extremely varied language phenomena and concepts.

An example of a UNL graph

Some examples of dictionary entries for Hindi are given below.

The attributes in the lexicon, both semantic and syntactic, are collectively called Lexical Attributes. The syntactic attributes include the word category (noun, verb, adjective, etc.) and attributes such as person and number for nouns and tense for verbs.
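To make the idea of a lexicon entry concrete, here is a small Python sketch of a record that links a word to a UW and carries its lexical attributes. The field names and the two sample entries are hypothetical illustrations built from the example sentence earlier in the article; they do not reproduce the actual dictionary format or any real Hindi entry.

```python
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    headword: str        # the word string in the natural language
    universal_word: str  # the UW this word is linked to
    category: str        # syntactic attribute: noun, verb, adjective, ...
    syntactic: dict = field(default_factory=dict)  # e.g. person, number, tense
    semantic: tuple = ()                           # semantic attributes, if any


# Hypothetical entries, for illustration only.
LEXICON = {
    "chairman": [
        LexiconEntry(
            headword="chairman",
            universal_word="chairman(icl>post)",
            category="noun",
            syntactic={"person": "third", "number": "singular"},
        )
    ],
    "arrange": [
        LexiconEntry(
            headword="arrange",
            universal_word="arrange",
            category="verb",
        )
    ],
}


def lookup(lexicon, word):
    """Return every entry for a word; a word may map to more than one UW."""
    return lexicon.get(word, [])
```

Keeping a list of entries per headword leaves room for ambiguous words, which the analysis rules would then have to choose between.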

Decoding UNL expressions into a sentence of any target language is done using the word dictionary and the generation rules of the target language. First, syntax planning of the target words is done, after which the morphology is generated to produce a natural sentence.

Some statistics
We have constructed analysers for Hindi and English and a generator for Hindi. Work on a generator for Marathi has also been started. This required linking English, Hindi and Marathi language strings with the UWs, and writing the Analysis and Generation rules for these languages. Below is some quantitative information for the English and Hindi languages.

Number of Entries in the Hindi-UW dictionary: 70,000

Number of Analysis Rules for English: ~5,000

Number of Analysis Rules for Hindi: ~6,000

Number of Generation Rules for Hindi: ~6,500

Other applications 
Since UNL expressions can be looked upon as the extracted knowledge of the documents, we have carried out research on how to use them for various document-processing tasks. Notable among them are automatic hyperlinking and text clustering. In the former, the keywords, which are the candidates for setting up links from, are obtained from the UNL graphs: heavily linked words are possible candidates for keywords. Similarly, the linkage and relation-label information in the UNL graphs is used for constructing the document vectors in the semantic dimension. These vectors are then processed with clustering algorithms. The experimental results are promising.
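Both ideas can be sketched in a few lines of Python. The code below is only an illustration under stated assumptions, not the method used in the experiments: "heavily linked" is taken to mean the number of relations a concept takes part in, and a document vector is taken to be a count of (relation label, concept) pairs compared by cosine similarity. The function names are hypothetical.

```python
import math
from collections import Counter

# Relations of the example sentence as (label, source, target) triples.
EDGES = [
    ("mod", "chairman", "company"),
    ("aoj", "chairman", "John"),
    ("agt", "arrange", "John"),
    ("pos", "residence", "John"),
    ("obj", "arrange", "meeting"),
    ("plc", "arrange", "residence"),
]


def keyword_candidates(edges, top_n=3):
    """Rank concepts by the number of relations they take part in (assumed measure)."""
    degree = Counter()
    for _label, source, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Descending degree, then alphabetical so that ties are deterministic.
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


def document_vector(edges):
    """Count (relation label, concept) pairs as semantic features (assumed scheme)."""
    vector = Counter()
    for label, source, target in edges:
        vector[(label, source)] += 1
        vector[(label, target)] += 1
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


candidates = keyword_candidates(EDGES)
vector = document_vector(EDGES)
```

For the example sentence, John and arrange each take part in three relations and chairman in two, so they come out as the top candidates. Vectors built this way for several documents could then be passed to any standard clustering algorithm.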

UNL in India

In India, UNL work is being carried out at the Computer Science and Engineering Department, IIT Bombay. Here, we do sentence-level encoding of English, Hindi and Marathi into the UNL form and decode this information into Hindi and Marathi, thus creating a way of semi-automated translation from English to Hindi and Marathi and also between Hindi and Marathi. For more on UNL, visit www.unl.ias.unu.edu.

Present and future
UNL has been found to be very useful for various multilingual information tasks as well as document processing applications. The UNL graph is looked upon as the extracted knowledge from the documents. 

The countries participating in this project are Japan, China, Indonesia, India, Jordan, Russia, Italy, France, Spain and Brazil. The United Nations Headquarters in Geneva is developing multilingual information access systems using the UNL.

At IIT Bombay, the following high-impact projects are making use of the UNL representation for various text-processing and language-technology tasks.

The Center for Indian Language Technology Solutions (www.cse.iitb.ac.in/tukaram), funded by the Ministry of Information Technology, India.

The Center for Intelligent Internet Research (www.cse.iitb.ac.in/laiir), funded by Tata Consultancy Services.

Media Lab Asia (www.ircc.iitb.ac.in/~MLAsia), funded by the Ministry of Information Technology, India, with participation from the Massachusetts Institute of Technology, USA.

Multi-lingual Web

UNL can be a very effective vehicle for developing multilingual Web-based applications. UNL expressions provide the meaning content of the text, and search can be carried out on this meaning base instead of on the text itself. This, of course, means developing a novel kind of search-engine technology. The merit of such a system is that information in one language need not be stored in multiple languages.

Commercial-level exploitation of UNL technology for Internet-scale multilingual access is expected within a couple of years.

Pushpak Bhattacharyya, Department of Computer Science and Engineering, Indian Institute of Technology, Bombay   




Cyber Media India Ltd

Copyright © CMIL. All rights reserved.
Reproduction in whole or in part in any form or medium without express written permission is prohibited.