Tutorials

 
Prof. Indrajit Bhattacharya (IISc Bangalore)
 
Abstract: Over the last decade, probabilistic topic models have emerged as an extremely powerful and popular tool for analyzing large collections of unstructured data. While originally proposed for textual data, topic models have since been applied to various other types of data, such as images, videos, music, social networks and biological data. In this tutorial, I will discuss both the modeling and algorithmic aspects of topic models. I will review the fundamentals of probabilistic generative models, and explain how they can be applied to textual data, starting from simple unigram models and progressing to the Latent Dirichlet Allocation (LDA) model. Then I will look at the problem of learning and inference using topic models, explain why exact inference is intractable for them, review the principle of inference using sampling, and discuss Gibbs sampling strategies for inference in topic models. As applications of topic models, we will look at semantic search and sentiment analysis. Finally, I will discuss some shortcomings of LDA, and briefly touch upon more advanced topic models, such as syntactic, correlated, dynamic and supervised topic models.
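The Gibbs sampling strategy mentioned above can be made concrete with a short sketch. The snippet below is a minimal collapsed Gibbs sampler for LDA (illustrative only: the function name, hyperparameter defaults and toy-corpus conventions are our own, not taken from the tutorial):

```python
import random

def lda_gibbs(docs, K, V, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word ids in [0, V).
    Returns topic assignments z plus doc-topic and topic-word counts.
    """
    rng = random.Random(seed)
    ndk = [[0] * K for _ in docs]          # doc-topic counts
    nkw = [[0] * V for _ in range(K)]      # topic-word counts
    nk = [0] * K                           # topic totals
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove this token's current assignment from the counts
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # full conditional p(z_i = t | z_-i, w), up to a constant
                p = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                     for t in range(K)]
                r = rng.random() * sum(p)
                acc = 0.0
                for t, pt in enumerate(p):
                    acc += pt
                    if r <= acc:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    return z, ndk, nkw
```

Each sweep resamples every token's topic from its full conditional given all other assignments; the count matrices are the sufficient statistics from which the topic-word and document-topic distributions are read off.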
 
Biography:
Indrajit Bhattacharya is an Assistant Professor in the Computer Science and Automation Department at the Indian Institute of Science, Bangalore. His areas of research are Machine Learning and Data Mining, with a focus on Hierarchical Bayesian Models for non-iid data. He completed his PhD in Computer Science from the University of Maryland at College Park in 2006, and his BTech in Computer Science and Engineering from Indian Institute of Technology, Kharagpur in 1999. Prior to joining IISc, he worked as a Research Scientist at the IBM Research Lab, New Delhi.


 
Dr. Srivatsan Laxman (Microsoft Research India)

Abstract: As learning and data mining algorithms mature, we find ourselves increasingly surrounded by, and reliant on, applications like search, social networking and business intelligence. The data in such settings often contain sensitive information about individuals, corporations or governments, and this leads us to the important issue of data privacy. There is a growing concern that algorithms used for analyzing data may also (inadvertently) compromise privacy by revealing specific information about the parties involved.
Early work in data privacy established that mere removal or encryption of Personally Identifiable Information in user records is insufficient to guarantee privacy. This led to a sequence of works that tried to formalize a definition for privacy, starting with k-anonymity, and followed by notions like l-diversity, t-closeness and m-invariance. However, all these definitions were broken by a sequence of simple attacks, based either on responses to multiple queries or on suitable auxiliary information available to an adversary. In 2006, Dwork et al. proposed the idea of Differential Privacy (DP) where, by adding a calibrated amount of noise, it is possible to guarantee that an adversary will learn essentially the same thing about a user, whether or not the user's record was included in the data. The main benefits of DP are that the guarantees are agnostic to auxiliary information and that it is possible to precisely quantify the deterioration in the DP guarantee under multiple queries (or composition). DP has quickly gained popularity (especially among the theory community) as an important formal notion of privacy with significant potential. Despite its growing success, several drawbacks of DP have prevented its adoption in practice. Foremost among them is that DP adds very high levels of noise to the output, oftentimes leading to unusable query responses. This is because the DP framework assumes that the adversary knows almost all the entries of the database and disregards any possible probabilistic generation model for the data. This is contrary to what we see in the real world, where data often have strong statistical characterizations and the adversary's knowledge of specific data entries is often limited.
The statistics community has also explored techniques for disclosure control as a privacy-preservation mechanism, but so far the privacy community has not reached a broad consensus on the suitability of the statistical assumptions under which disclosure-control guarantees may be provided.
In this tutorial, I will introduce the area of data privacy and highlight the main challenges in this field of research. A wide range of privacy definitions will be covered, including k-anonymity (and its variants), Differential Privacy and statistical disclosure control. One goal of the tutorial is to bring out the fundamental difficulties in developing formal notions of privacy and the inherent contradictions that exist between privacy and data analysis. A second goal is to analyze the merits and demerits of the various privacy definitions that exist today, hopefully throwing light on what needs to be done in future to achieve formal, yet practical, frameworks for privacy preservation.
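The noise-addition idea at the heart of DP can be sketched in a few lines. The example below answers a counting query under epsilon-differential privacy via the standard Laplace mechanism (function names are illustrative; a counting query has sensitivity 1, so Laplace noise with scale 1/epsilon suffices):

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(records, predicate, epsilon, rng=None):
    """Release a counting query under epsilon-differential privacy.

    Adding or removing one record changes the count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices.
    """
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

Smaller epsilon means stronger privacy but noisier answers, which is exactly the privacy-utility tension the abstract describes.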
 
Biography:
Srivatsan Laxman is a Researcher at Microsoft Research India, Bangalore. He obtained his Ph.D. from the Dept. of Electrical Engineering, Indian Institute of Science, Bangalore, in 2006. His research interests are broadly in the areas of pattern recognition and data mining. In particular, his work has focused on various aspects of pattern discovery, efficient algorithms for discovering patterns, statistical analysis/significance of patterns in data and the learning/application of generative models based on frequent patterns in data. In the context of data privacy, his research interests revolve around foundational issues in data privacy, privacy definitions, as well as their practical implications.

 
Prof. Ashish Tendulkar (TIFR Mumbai)
 
Abstract: There is increasing interest in the development of biomedical text mining applications, not only to enable improved literature search, but also to automatically detect links between biologically relevant entities described in articles and their corresponding records in existing annotation databases. The rapid growth of natural language data in the biomedical sciences (including scientific articles, patents, patient records and textual descriptions in databases), together with the practical relevance of these resources for the design, interpretation and evaluation of bioinformatics and experimental research, has resulted in the implementation of a considerable number of new applications. Text-mining-assisted literature curation has been especially promising for the development and maintenance of manually annotated databases, as well as for the construction of gold standard datasets and gene lists in the context of Systems Biology and gene set enrichment. Attempts have also been made to integrate text mining with other bioinformatics data such as sequence, structural and gene expression information.

We plan to focus primarily on applications of text mining and on issues in building text mining systems. We will begin with a gentle introduction to text mining and its applications in various Biology- and Bioinformatics-related domains. Existing resources for building text mining applications will be presented in terms of (1) useful data collections, (2) lexical resources, (3) features of natural language data that can be exploited by text mining systems and (4) data mining and natural language processing systems. The main types of currently available text mining applications will also be discussed, including the retrieval and classification of articles, the identification of mentions of biological entities such as genes, proteins and cell types, and the extraction of functional descriptions or protein interactions. The use of the literature for knowledge discovery and hypothesis generation will be described. A crucial aspect of literature mining systems is evaluation and usability; these two aspects will be covered through recent community evaluation efforts such as the BioCreative challenge and the BioCreative MetaServer initiative. In order to show what kinds of queries and results are currently supported by text mining and information extraction systems, practical example cases will be illustrated in detail, complementing the previously introduced basic descriptions of the underlying methodology. Finally, a practical case study will show the step-by-step implementation of a text mining system, illustrating how such a system can be constructed for a particular information need.
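The entity-mention identification step described above can be illustrated with a deliberately naive dictionary-based tagger (the lexicon entries below are purely illustrative placeholders; real systems draw on curated resources and use far more sophisticated matching and disambiguation):

```python
import re

# Purely illustrative placeholder lexicon, not an actual curated gene list.
GENE_LEXICON = {"brca1", "tp53", "egfr"}

def tag_gene_mentions(text):
    """Naive dictionary lookup: tokenize, then match each token
    case-insensitively against the lexicon, returning (mention, start, end)."""
    mentions = []
    for m in re.finditer(r"\w+", text):
        if m.group().lower() in GENE_LEXICON:
            mentions.append((m.group(), m.start(), m.end()))
    return mentions
```

Even this toy lookup exposes the core difficulties such systems must address: ambiguous names, synonyms, and mentions that span several tokens.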

After the tutorial, the participants should be aware of the importance of the biomedical literature as a central data and information source for biology and bioinformatics. They should be able to understand how existing text mining systems work and on what features they rely. Participants would have an overview of currently available tools and how to construct such an application in practice.
 
Biography:
Ashish Tendulkar is a visiting fellow in the School of Technology and Computer Science at Tata Institute of Fundamental Research (TIFR) in Mumbai.

 
Clustering Data Streams
Prof. Vasudha Bhatnagar, Prof. Sharanjit Kaur (Dept. of Computer Science, University of Delhi)
 
Abstract: The ability to consolidate and capture natural structures from unlabeled data has made clustering a popular choice in stream mining. A single scan of the data, bounded memory usage, constant per-point processing time, and capturing data evolution are the key challenges in clustering streaming data.
The objective of this tutorial is to present a comprehensive overview of common approaches used for clustering streams, with emphasis on synopsis selection. We begin with a discussion of important issues related to stream clustering, followed by a critical analysis of three main approaches. Some contemporary and well-known algorithms for each approach are discussed. We close the tutorial with the formulation of a generic architectural framework for stream clustering algorithms, for a better understanding of the issues.
 
Outline of the Tutorial Topics:
  • Introduction to streaming data
  • Challenges in clustering streams
  • Three main approaches used for clustering streams, with their critical analysis
  • A generic architectural framework for algorithms for clustering streams
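The single-scan, bounded-memory and constant-per-point-time constraints listed above are commonly met by maintaining a compact synopsis per cluster. Below is a minimal sketch of such a cluster-feature synopsis, in the spirit of CluStream-style micro-clusters (the class and method names are our own, not from the tutorial):

```python
class MicroCluster:
    """Additive cluster-feature synopsis (N, LS, SS) for stream clustering.

    Each arriving point is absorbed in O(dim) time with O(dim) memory,
    matching the single-scan, bounded-memory constraints above.
    """

    def __init__(self, dim):
        self.n = 0                 # number of points absorbed
        self.ls = [0.0] * dim      # per-dimension linear sum
        self.ss = [0.0] * dim      # per-dimension squared sum

    def absorb(self, point):
        self.n += 1
        for j, x in enumerate(point):
            self.ls[j] += x
            self.ss[j] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # RMS deviation of the absorbed points from the centroid
        var = sum(self.ss[j] / self.n - (self.ls[j] / self.n) ** 2
                  for j in range(len(self.ls)))
        return max(var, 0.0) ** 0.5
```

Because N, LS and SS are additive, two micro-clusters can also be merged by summing their components, which is what supports an offline macro-clustering phase.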
Biography:
Vasudha Bhatnagar received her master's degree in Computer Applications from the University of Delhi, Delhi, India in 1985. She worked at the Centre for Development of Telematics from 1985 to 1989 as a software engineer in the Operating System and Traffic group. She completed her doctoral studies at Jamia Millia Islamia, New Delhi, India in 2001. She is currently an Associate Professor in the Department of Computer Science, University of Delhi, Delhi, India. Her broad area of interest is Intelligent Data Analysis. She is particularly interested in developing process models for Knowledge Discovery in Databases, and algorithms for classification and clustering. She offered a tutorial, “Modeling Changes in Evolving Datasets”, at the 10th IEEE Conference on Intelligent System Design and Applications, Cairo, Egypt (29 Nov - 1 Dec 2010).

Sharanjit Kaur received her master's degree in computer applications in 1994 from Thapar Institute of Engineering and Technology, Patiala, India. She received her Ph.D. from the Department of Computer Science, University of Delhi, Delhi, India in February 2011. She started teaching in 1995 at Acharya Narendra Dev College (University of Delhi), Delhi, where she is currently an Associate Professor. Her research interests span the areas of stream mining and databases.

 
Ranking Mechanisms for Interaction Networks
Dr. Sameep Mehta, Dr. Ramasuri Narayanam and Dr. Vinayaka Pandit (IBM Research)
 
Abstract: Interaction networks are prevalent in real world applications and manifest in several forms such as online social networks, collaboration networks, technological networks, and biological networks. In the analysis of interaction networks, an important task is to determine a set of key nodes, either with respect to positional power in the network or with respect to behavioral influence. This calls for designing ranking mechanisms to rank nodes/edges in the network, and there exist several well-known ranking mechanisms in the literature for this task, such as Google's PageRank and centrality measures from the social sciences. We note that these traditional ranking mechanisms are based on the structure of the underlying network. More recently, we have witnessed applications wherein ranking mechanisms should take into account not only the structure of the network but also other important aspects of it, such as the value created by the nodes and the marginal contribution of the nodes in the network. Motivated by this observation, the goal of this tutorial is to provide a conceptual understanding of recent advances in designing efficient and scalable ranking mechanisms for large interaction networks, along with applications to social network analysis.
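As a concrete instance of a purely structural ranking mechanism, here is a minimal power-iteration sketch of PageRank (illustrative only; the damping factor and convergence settings are conventional defaults, not prescriptions from the tutorial):

```python
def pagerank(adj, d=0.85, iters=100, tol=1e-10):
    """Power-iteration PageRank over adj: node -> list of out-neighbours."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # teleportation mass, shared uniformly
        new = {v: (1 - d) / n for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:
                share = d * rank[v] / len(out)
                for u in out:
                    new[u] += share
            else:
                # dangling node: spread its rank uniformly over all nodes
                for u in nodes:
                    new[u] += d * rank[v] / n
        done = max(abs(new[v] - rank[v]) for v in nodes) < tol
        rank = new
        if done:
            break
    return rank
```

Note that these scores depend only on link structure; the value-based and marginal-contribution-based mechanisms the tutorial covers would require inputs beyond the adjacency structure.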
 
Biographies:
Sameep Mehta has been a researcher at IBM Research India since 2006. Prior to joining IBM, he completed his PhD at Ohio State University. His current research interests include Data Mining, Business Analytics and Services Science.

Ramasuri Narayanam received his Ph.D. in Computer Science from Indian Institute of Science (IISc), Bangalore, India in 2011. He is a researcher at IBM Research - India. His research interests include game theory, social networks, mechanism design, and electronic commerce.

Vinayaka Pandit is a Researcher in the Analytics and Optimization department at IBM Research - India and is based in Bangalore. He obtained his PhD in Computer Science from IIT-Delhi. He is primarily interested in the design and analysis of algorithms. He is also interested in applying algorithmic insights to solve practical problems in domains like operations research, data mining, and databases.

 