Tutorial Speakers

Wednesday, December 8, 2010 11.00 Hrs


Dr Sameep Mehta and Dr Vinayaka Pandit, IBM Research – India


Title: A Survey on Sampling Techniques and Applications



Dealing with massive data is a key challenge in data mining, and several approaches have been used to address it. Sampling is one of the most powerful and resource-effective of these approaches. The key idea is that a small subset of the data can be chosen so that it is representative of the statistical properties of the entire data. This enables us to employ more exhaustive data mining techniques on the sample and obtain accurate and useful information from the data in a resource-effective manner (resource here refers to both time and computational resources). This tutorial deals with sampling as a technique for data mining in the offline setting. Some of the important issues involved in sampling are (i) how to obtain the sample from the data, (ii) how to design a sample when it is to be obtained in the form of surveys, and (iii) what sample size is required to ensure that the result of data mining on the sample is close to that of data mining on the entire data. We shall introduce the basic concepts of sampling, present popular sampling techniques used in the literature, and present case studies.

Streaming computation is a closely related area where sampling has turned out to be a powerful technique. There it is used to maintain a small but representative sketch of the data. However, there are several important differences between the two models. The streaming model allows one to look at the entire data in a single pass (or a small number of passes), but only a small sketch can be retained at any point in time. The model we consider, on the other hand, does not allow us to look at the entire data, but does allow us to randomly access any element of it. This tutorial mainly focuses on the data mining model.
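As a rough illustration of the two models contrasted in the abstract, the Python sketch below places a single-pass reservoir sampler (streaming model) next to a sampler that reads only a few randomly chosen positions (random-access model). It is not taken from the tutorial material: the function names are hypothetical, and the sample-size helper assumes values bounded in [0, 1] and uses Hoeffding's inequality, which is only one of several possible answers to issue (iii).

```python
import math
import random
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Streaming model: one pass over the data, at most k items kept in memory."""
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def random_access_sample(data: Sequence[T], k: int, rng: random.Random) -> List[T]:
    """Random-access model: read only k randomly chosen positions of the data."""
    indices = rng.sample(range(len(data)), k)
    return [data[i] for i in indices]


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Sample size so that the sample mean of values in [0, 1] is within
    epsilon of the true mean with probability at least 1 - delta.
    Assumption: independent uniform draws and Hoeffding's inequality."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    rng = random.Random(2010)
    data = [rng.random() for _ in range(100_000)]  # synthetic values in [0, 1]
    n = hoeffding_sample_size(epsilon=0.05, delta=0.05)
    for name, sample in (
        ("streaming", reservoir_sample(iter(data), n, rng)),
        ("random access", random_access_sample(data, n, rng)),
    ):
        print(name, "estimate of the mean:", sum(sample) / len(sample))
    print("full-data mean:", sum(data) / len(data))
```

Both samplers return a uniform sample of the same size; the difference lies in how the data is accessed, which is the distinction the abstract draws between the streaming model and the model the tutorial focuses on.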



Sameep Mehta: Sameep has been a researcher at IBM Research - India since 2006. Prior to joining IBM, he completed his PhD at Ohio State University. His current research interests include Data Mining, Business Analytics and Services Science.

Vinayaka Pandit: Vinayaka is a researcher at IBM Research - India and holds a PhD from the Indian Institute of Technology, New Delhi. His research interests include Approximation Algorithms, Data Mining and Business Optimization.


Wednesday, December 8, 2010 16.20 Hrs


Dr. N. Jeyakumar Natarajan, Bharathiar University, India


Title: Information Retrieval and Text Mining Opportunities in Bioinformatics



The last decade has been marked by tremendous growth in both experimental and computational biomedical data. Further, a substantial amount of biomedical knowledge is recorded only in free-text form, in abstracts and full-text articles. For example, the biomedical literature database MEDLINE currently contains about 18 million abstracts, and on average about 40,000 to 50,000 new abstracts are added every month. Until only a few years ago, human reasoning was the primary method for extracting, synthesizing and interpreting the information contained in the biomedical literature and supporting biological databases. In recent years, however, the number of online documents and other biological information repositories has grown tremendously, and turning these data into biological insight remains a key challenge in modern biology. Given the millions of published documents, two IT fields, Information Retrieval (IR) and Text Mining (TM), have much to offer and can play a promising role.

The objective of this tutorial is to introduce various IR and TM opportunities in biomedical data analysis. Information retrieval is concerned with the automatic identification of relevant documents from large text collections. Text mining is the application of techniques from machine learning, in conjunction with natural language processing and statistical/mathematical approaches, to extract useful knowledge from text. IR and TM have been applied successfully to various biological problems such as intelligent information retrieval, biomedical text sub-classification and clustering (e.g. finding classes of bio-entities), biomedical concept identification (e.g. bio-entities relevant to a particular study), concept relation extraction (e.g. conceptual interactions) and the discovery of new knowledge (e.g. biomedical pathways and functions). The tutorial will introduce the basics and methodology of both IR and TM, followed by various application areas in the biomedical domain.
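To make the retrieval task described in the abstract concrete, here is a minimal Python sketch that ranks a toy collection against a query using TF-IDF weights. It is an illustration only, not the speaker's method: the three "documents" are hypothetical stand-ins for abstracts, the tokenizer is deliberately naive, and the smoothed IDF formula is one common variant among several.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query: str, docs: List[str]) -> List[Tuple[float, int]]:
    """Return (score, document index) pairs, best match first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    def idf(term: str) -> float:
        # Smoothed inverse document frequency (an assumed variant).
        return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

    query_terms = tokenize(query)
    scores = []
    for index, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        if tokens:
            for term in query_terms:
                score += term_freq[term] / len(tokens) * idf(term)
        scores.append((score, index))
    # Highest score first; ties are broken by document order.
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    # Hypothetical toy collection standing in for biomedical abstracts.
    docs = [
        "gene expression profiling with microarray data",
        "protein protein interaction extraction from abstracts",
        "clustering of biomedical abstracts",
    ]
    for score, index in rank("protein interaction", docs):
        print(round(score, 3), docs[index])
```

Real systems over a collection the size of MEDLINE would use an inverted index rather than scanning every document, but the scoring idea is the same.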



Jeyakumar Natarajan is a Reader at the Dept. of Bioinformatics, Bharathiar University, Coimbatore, India. He holds a Ph.D. in biomedical informatics from the University of Ulster, United Kingdom, where he worked on developing data mining and text mining systems for protein-protein interactions and robust analysis of microarray data. He also did post-doctoral work at Northwestern Medical School, Northwestern University, Chicago, US. His research area is the intersection of computer science, biology, and computational linguistics. His current research activities focus on information retrieval, data mining, and text mining using machine learning methods for biomedical data analysis and interpretation. His other research interests include web mining, bio-ontologies, and ontology mining in bioinformatics. Jeyakumar is a frequent invited speaker on the above topics at various universities and research institutions across India.



Thursday, December 9, 2010 13.30 Hrs


Prof. S Muthukrishnan, Rutgers University, USA.


Title: Data Mining Problems in Internet Ad Systems



We are inspired by systems that have emerged in the past decade that enable advertisements (ads) on the Internet. Such Internet ad systems handle billions of transactions every day involving millions of users, websites and advertisers, and are the basis of an industry worth billions of dollars. They rely crucially on real-time collection, management and analysis of data for their effectiveness. Further, they present unusual challenges for data analysis: nearly all parties in Internet ad systems, from marketers to publishers, use active, selfish strategies that both generate new data and distort the data that is produced. Mining such data while cognizant of the inherent game theory is a great research challenge. The tutorial will provide an overview of Internet ad systems and discuss in detail both the data management and the data mining tasks that arise.



The speaker, S. (Muthu) Muthukrishnan, is a Professor at Rutgers University and a Research Scientist at Google. His research interests are in databases and algorithms, most recently in data stream management and algorithms for Internet ad systems.


Friday, December 10, 2010 10.30 Hrs


Dr Prithviraj Sen, Yahoo! Labs, India


Title: Representing large-scale uncertainty through probabilistic databases



A number of real-world applications require large-scale uncertainty management. Examples include information integration, information extraction, sensor networks and data collected from scientific experiments, among others. Recent developments in the field of probabilistic databases have resulted in a confluence of ideas from various fields to achieve this goal, though there is still some distance to go. In this tutorial, I will start from the basics of possible worlds semantics and the semantics of a probabilistic database, go on to discuss query evaluation under possible worlds semantics, and then cover query evaluation in the presence of correlated data. Throughout this tutorial, I will point out connections with related concepts from the field of machine learning. I will also draw from my experience of being part of the group at the University of Maryland, College Park, involved in developing PrDB, a probabilistic database management system designed to run queries efficiently in the presence of large-scale, structured correlated data. Hopefully, this tutorial will function as a primer for researchers who are curious to know the basics of probabilistic databases, and as a means for researchers already involved in the field to keep abreast of the latest trends.
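For readers new to possible worlds semantics, the Python sketch below enumerates every world of a tiny probabilistic relation and sums the probabilities of the worlds in which a Boolean query holds. It is a simplified illustration under a stated assumption: tuples are treated as independent, which is exactly the restriction that the correlated-data part of the tutorial goes beyond. The tuple names and probabilities are hypothetical, and the code has no connection to PrDB.

```python
from itertools import product
from typing import Callable, Dict, FrozenSet, Iterator, Tuple

World = FrozenSet[str]


def possible_worlds(tuples: Dict[str, float]) -> Iterator[Tuple[World, float]]:
    """Yield every subset of the tuples together with its probability.
    Assumption: each tuple is present independently of the others."""
    names = list(tuples)
    for presence in product([False, True], repeat=len(names)):
        world = frozenset(n for n, present in zip(names, presence) if present)
        probability = 1.0
        for name, present in zip(names, presence):
            probability *= tuples[name] if present else 1.0 - tuples[name]
        yield world, probability


def query_probability(
    tuples: Dict[str, float], query: Callable[[World], bool]
) -> float:
    """Probability that a Boolean query is true: the total probability
    of the possible worlds in which it holds."""
    return sum(p for world, p in possible_worlds(tuples) if query(world))


if __name__ == "__main__":
    # Hypothetical uncertain tuples, e.g. two noisy sensor readings.
    readings = {"t1": 0.5, "t2": 0.8}
    at_least_one = query_probability(readings, lambda world: len(world) > 0)
    print("P(at least one reading is present) =", at_least_one)
```

Enumeration is exponential in the number of tuples, so this only conveys the semantics; evaluating queries efficiently without materializing all worlds is the problem that query evaluation techniques for probabilistic databases address.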



Prithviraj Sen is a researcher working at Yahoo! Labs, Bangalore. Prior to this he completed his PhD at the University of Maryland, College Park. His PhD thesis was devoted to designing more expressive probabilistic databases and making inference during query evaluation more efficient. More broadly, his areas of interest encompass machine learning, database systems and problems lying at the intersection of these two areas.