Conference Officers
PC Committee
Conference Program
short papers

S1: Queries and Transactions

  1. Ranking Database Queries With User Feedback: A Neural Network Approach
    Nevedita Mallick, Ganesh Agarwal, Srinivasan Turuvekere (U. of Illinois, Urbana-Champaign, USA)

    Currently, websites on the Internet serving structured data allow users to perform search based on simple equality or range constraints on data attributes. However, to begin with, users may not know what is desirable to them precisely, to be able to express it accurately in terms of primitive equality or range constraints. Additionally, in most websites, the results provided to users can be sorted with respect to values of any one particular attribute at a time. For the user, this is like searching for a needle in a haystack because the user's notion of interesting objects is generally a function of multiple attributes.
    In this paper, we develop an approach to (i) support a family of functions involving multiple attributes to rank the tuples, and (ii) improve the ranking of results returned to the user by incorporating user feedback (to learn user's notion of interestingness) with the help of a neural network. The user feedback driven approach is effective in modeling a user's intuitive sense of desirability of a tuple, a notion that is otherwise near impossible to quantify mathematically. To prove the effectiveness of our approach, we have built a middleware for an application domain that implements and evaluates these ideas.

  2. Supporting Keyword Queries on Structured Databases with Limited Search Interfaces
    Nurcan Yuruk, Xiaowei Xu (U. of Arkansas, Little Rock, USA), Chen Li (U. of California, Irvine, USA), Jeffrey Xu Yu (Chinese U. of Hong Kong, China)

    Many Web sources provide forms to allow users to query their hidden data. For instance, online stores such as Amazon.com have search interfaces, using which users can query information about books by providing conditions on attributes of title, author, and publisher. We propose a novel system framework that supports keyword queries on structured data behind such limited search forms. It provides user-friendly query interfaces for users to type in IR-style keyword queries to find relevant records. We study research challenges in the framework and conduct extensive experiments on real datasets to show the practicality of our framework and evaluate different algorithms.

  3. Automated Data Discovery in Similarity Score Queries
    Fatih Altiparmak, Ali Tosun, Hakan Ferhatosmanoglu, Ahmet Sacan (Ohio State U., USA)

    A vast amount of information is being stored in scientific databases on the web. The dynamic nature of the scientific data, the cost of providing an up-to-date snapshot of the whole database, and proprietary considerations compel the database owners to hide the original data behind search interfaces. The information is often provided to researchers through similarity-search query interfaces, which limits a proper and focused analysis of the data. In this study, we present systematic methods of data discovery through similarity-score queries in such ``uncooperative'' databases. The methods are generalized to multi-dimensional data, and to L-p norm distance functions. The accuracy and performance of our methods are demonstrated on synthetic and real-life datasets. The methods developed in this study enable the scientists to obtain the data within the range of their research interests, overcoming the limitations of the similarity-search interface. The results of this study also present implications in data privacy and security areas, where the discovery of the original data is not desired.

  4. Efficient Algorithms for Node Disjoint Subgraph Homeomorphism Determination
    Yanghua Xiao, Wentao Wu, Wei Wang, Zhenying He (Fudan U., China)

    Recently, great efforts have been dedicated to researches on the management of large-scale graph-based data, where node disjoint subgraph homeomorphism relation between graphs has been shown to be more suitable than (sub)graph isomorphism in many cases, especially in those cases where node skipping and node mismatching are desired. However, no efficient algorithm for node disjoint subgraph homeomorphism determination (ndSHD) has been available. In this paper, we propose two computationally efficient ndSHD algorithms based on state spaces searching with backtracking, which employ many heuristics to prune the search spaces. Experimental results on synthetic data sets show that the proposed algorithms are efficient, require relatively little time in most of cases, can scale to large or dense graphs, and can accommodate to more complex fuzzy matching cases.

  5. The Chronon Based Model for Temporal Databases
    Anurag D, Anup Sen (IIM Calcutta, India)

    Among various models like TSQL2, XML and SQL:2003 for temporal databases, TSQL2 is perhaps the most comprehensive one and introduces new constructs for temporal query support. However, these constructs mask the user from the underlying conceptual model and essentially provides a “black box” flavour. Further, the temporal attributes are not directly accessible. In this paper, we propose a “chronon” based conceptual (relational) model, which works on a tuple timestamping paradigm. Here, the time attribute is treated in similar fashion as any other attribute and is available to the user for inclusion in SQL statements, making the model easy to understand and use. A corresponding efficient physical model based on the attribute versioning format akin to XML/SQL:2003 has also been proposed and the mapping between the proposed conceptual and physical models has been described. We evaluate the model by comparing with TSQL2 and SQL:2003.

  6. Main Memory Commit Processing: The Impact of Priorities
    Heine Kolltveit, Svein-Olaf Hvasshovd (NTNU, Norway)

    Distributed transaction systems require an atomic commitment protocol to preserve ACID properties. The overhead of commit processing is a significant part of the load on a distributed database. Here, we propose approaches where the overhead is reduced by prioritizing urgent messages and operations. This is done in the context of main memory primary-backup systems, and the proposed approaches is found to significantly reduce the response time as seen by the client. Also, by piggybacking messages on each other over the network, the throughput is increased. Simulation results show that performance can be significantly improved using this approach, especially for utilizations above 50%.

S2: XML Databases

  1. Exploiting ID References for Effective Keyword Search in XML Database
    Bo Chen (National U. of Singapore), Jiaheng Lu (U. of California, Irvine, USA), Tok Wang Ling (National U. of Singapore)

    In this paper, we study novel Tree + IDREF data model for keyword search in XML. In this model, we propose novel Lowest Referred Ancestor (LRA) pair, Extended LRA (ELRA) pair and ELRA group semantics for effective and efficient keyword search. We develop efficient algorithms to compute the search results based on our semantics. Experimental study shows the superiority of our approach.

  2. Storage techniques for multi-versioned XML documents
    Laura Irina Rusu, Wenny Rahayu (LaTrobe U., Australia), David Taniar (Monash U., Australia)

    Following the increased popularity of using XML documents for information exchange between applications and representing semi-structured data, the issue of warehousing collections of XML documents has become strongly imperative. The focus of this paper is on the storage of dynamic (multi-versioned) XML documents. In this paper we present four different storage methods for dynamic XML documents, and evaluate each of them using two performance indicators, namely I/O loading and versioning costs.

  3. Twig'n Join: Progressive Query Processing of Multiple XML Streams
    Wee Hyong Tok, Stephane Bressan, Mong Li Lee (National U. of Singapore)

    We propose a practical approach to the progressive processing of (FWR) XQuery queries on multiple XML streams, called Twig'n Join (or TnJ). The query is decomposed into a query plan combining several twig queries on the individual streams, followed by a multi-way join and a final twig query. The processing is itself accordingly decomposed into three pipelined stages progressively producing streams of XML fragments. Twig'n Join combines the advantages of the recently proposed TwigM algorithm and our previous work on relational result-rate based progressive joins. In addition, we introduce a novel dynamic probing technique, called Result-Oriented Probing (ROP), which determines an optimal probing sequence for the multi-way join. This significantly reduces the amount of redundant probing for results. We comparatively evaluate the performance of Twig'n Join using both synthetic and real-life data from standard XML query processing benchmarks. We show that Twig'n Join is indeed effective and efficient for processing multiple XML streams.

  4. TwigBuffer: Avoiding Useless Intermediate Solutions Completely in Twig Joins
    Jiang Li, Junhu Wang (Griffith U., Australia)

    Twig pattern matching plays a crucial role in XML data processing. TwigStack is a holistic twig join algorithm that solves the problem in two steps: (1) finding potentially useful intermediate path solutions, (2) merging the intermediate solutions. The algorithm is optimal when the twig pattern has only -edges, in the sense that no useless partial solutions are generated in the first step (thus expediting the second step and boosting the overall performance). However, when -edges are present, a large set of useless partial solutions may be produced, which directly downgrades the overall performance. Recently, some improved versions of the algorithm (e.g., TwigStackList and iTwigJoin) have been proposed in an attempt to reduce the number of useless partial solutions when -edges are involved. However, none of the algorithms can avoid useless partial solutions completely. In this paper, we propose a new algorithm, TwigBuffer, that is guaranteed to completely avoid the useless partial solutions. Our algorithm is based on an ingenious strategy to buffer and manipulate elements in stacks and lists. Experiments show that TwigBuffer significantly outperforms previous algorithms when arbitrary /-edges are present.

  5. An Approach for XML Similarity Join using Tree Serialization
    Lianzi Wen, Toshiyuki Amagasa, Hiroyuki Kitagawa (U. of Tsukuba, Japan)

    This paper proposes a scheme for similarity join over XML data based on XML data serialization and subsequent similarity matching over XML node subsequences. With the recent explosive diffusion of XML, great volumes of electronic data are now marked up with XML. As a consequence, a growing amount of XML data represents similar contents, but with dissimilar structures. To extract as much information as possible from this heterogeneous information, similarity join has been used. Our proposed similarity join for XML data can be summarized as follows: 1) we serialize XML data as XML node sequences; 2) we extract semantically/structurally coherent subsequences; 3) we filter out dissimilar subsequences using textual information; and 4) we extract pairs of subsequences as the final result by checking structural similarity. The above process is costly to execute. To make it scalable against large document sets, we use Bloom filter to speed up text similarity computation. We show the feasibility of the proposed scheme by experiments.

  6. A Holistic Algorithm for Efficiently Evaluating Xtwig Joins
    Bo Ning, Guoren Wang (Northeastern U., China), Jeffrey Xu Yu (Chinese U. of Hong Kong,China)

    More and more XML data have been generated and used in the data exchange. XML employs a tree-structure data model, but lots of queries submitted by users are not like the tree-structure. Those queries contain ancestor axis in predicates, and specify the pattern of selection predicates on multiple elements from descendants to ancestors. Efficiently finding all occurrences of such an xtwig pattern in an XML database is crucial for XML query processing. A straightforward method is to rewrite an xtwig pattern to equivalent reverse-axis-free one. However, this method needs scanning the element streams several times and is rather expensive to evaluate. In this paper, we study the xtwig pattern, and propose two basic decomposing methods, VertiDec and HoriDec, and a holistic processing method, XtwigStack, for processing xtwig queries. The experiments show that the holistic algorithm is much more efficient than the rewriting and decomposition approaches.

S3: Data Mining

  1. Association Rules Induced by Item and Quantity Purchased
    Animesh Adhikari (S P Chowgule College, Goa, India), P. R. Rao (Goa U., India)

    Most of the real market basket data are non-binary in the sense that an item could be purchased multiple times in the same transaction. In this case, there are two types of occurrences of an itemset in a database: the number of transactions in the database containing the itemset, and the number of occurrences of the itemset in the database. Traditional support-confidence framework might not be adequate for extracting association rules in such a database. In this paper, we introduce three categories of association rules. We introduce a framework based on traditional support-confidence framework for mining each category of association rules. We present experimental results based on two databases.

  2. An Indexed Trie Approach to Incremental Mining of Closed Frequent Itemsets based on a Galois Lattice Framework
    Kalpana Balasubrahmanyan (Avinashilingam U., India), R Nadarajan, Senthil Babu (PSG College of Technology, India)

    Incrementality is a major challenge in the mining of dynamic databases. In such databases, the maintenance of association rules can be directly mapped into the problem of maintaining closed frequent itemsets. A number of incremental strategies have been proposed earlier with several limitations. A serious limitation is the need to examine the entire family of closed itemsets, whenever there are insertions or deletions in the database. The proposed strategy relies on an efficient and selective update of the closed itemsets using an indexed trie structure. The framework emphasizes on certain fundamental and structural properties of Galois Lattice theory to overcome the limitations of the earlier approaches. Apart from facilitating a selective update, the indexed structure removes the necessity of working with a wholly memory resident trie.

  3. Efficiently Mining Recent Frequent Patterns over Online Data Streams
    Guohui Li, Hui Chen (Huazhong U. of Science and Technlogy, China), LihChyun Shu (National Cheng Kung U., Taiwan)

    This paper proposes a method for mining the frequent patterns in an arbitrary sliding window of data streams. As streams flow, the contents of which are captured with SWP-tree by scanning the stream only once, and the obsolete and infrequent patterns are deleted by periodically pruning the tree. To differentiate the patterns of recently generated transactions from those of historic transactions, a time decaying model is also applied. The experimental results show that the proposed method is efficient and scalable, and it is superior to other analogous algorithms.

  4. Index-Supported Similarity Search using Multiple Representations
    Johannes Assfalg, Hans-Peter Kriegel, Peter Kunath, Alexey Pryakhin, Michael Kats (Ludwig Maximilians U., Germany)

    Similarity search in complex databases is of utmost interest in a wide range of application domains. Often, complex objects are described by several representations. The combination of these different representations usually contains more information compared to only one representation. In our work, we introduce the use of an index structure in combination with a negotiation-theory-based approach for deriving a suitable subset of representations for a given query object. This most promising subset of representations is determined in an unsupervised way at query time. We experimentally show how this approach significantly increases the efficiency of the query processing step. At the same time the effectiveness, i.e. the quality of the search results, is equal or even higher compared to standard combination methods.

  5. Distance Based Feature Selection for Clustering Microarray Data
    Manoranjan Dash (Nanyang Technological U., Singapore)

    In microarray data, clustering is the fundamental task for separating genes into biologically functional groups or for classifying tissues and phenotypes. Recently, with innovative gene expression microarray data technologies, thousands of expression levels of genes (features) can be measured simultaneously in a single experiment. The large number of genes with a lot of noise causes high complexity for cluster analysis. This challenge has raised the demand for feature selection - an effective dimensionality reduction technique that removes noisy features. In this paper we propose a novel filter method for feature selection. The suggested method, called ClosestFS, is based on a distance measure. For each feature, the distance is evaluated by computing its impact on the histogram for the whole data. Our experimental results show that the quality of clustering results (evaluated by several widely used measures) of K-means algorithm using ClosestFS as the pre-processing step is significantly better than that of the pure K-means.

  6. Knowledge Transferring Based on Implicit Link Analysis
    Xiao Ling, Wenyuan Dai, Gui-Rong Xue, Yong Yu (Shanghai Jiaotong U., China)

    In this paper, we design a local classification algorithm using implicit link analysis, considering the situation that the labeled and unlabeled data are drawn from two different albeit related domains. In contrast to many global classifiers, e.g. Support Vector Machines, our local classifier only takes into account the neighborhood information around unlabeled data points, and is hardly based on the global distribution in the data set. Thus, the local classifier has good abilities to tackle the non-i.i.d. classification problem since its generalization will not degrade by the bias w.r.t. each unlabeled data point. We build a local neighborhood by connecting the similar data points. Based on these implicit links, the Relaxation Labeling technique is employed. In this work, we theoretically and empirically analyze our algorithm, and show how our algorithm improves the traditional classifiers. It turned out that our algorithm greatly outperforms the state-of-the-art supervised and semi-supervised algorithms when classifying documents across different domains.

S4: Data Warehouses and Industrial Applications

  1. Redundant Array of Inexpensive Nodes for DWS
    Jorge Vieira (Critical Software SA, Portugal), Marco Vieira (U. of Coimbra, Portugal), Marco Costa (Critical Software SA, Portugal), Henrique Madeira (U. of Coimbra, Portugal)

    The DWS (Data Warehouse Striping) technique is a round-robin data partitioning approach especially designed for distributed data warehousing en-vironments. In DWS the fact tables are distributed by an arbitrary number of low-cost computers and the queries are executed in parallel by all the com-puters, guarantying a nearly optimal speed up and scale up. However, the use of a large number of inexpensive nodes increases the risk of having node failures that impair the computation of queries. This paper proposes an approach that provides Data Warehouse Striping with the capability of answering to queries even in the presence of node failures. This approach is based on the selective replication of data over the cluster nodes, which guarantees full availability when one or more nodes fail. The proposal was evaluated using the newly TPC-DS benchmark and the results show that the approach is quite effective.

  2. Load-Balancing for WAN Data Warehouses
    Pedro Furtado (U. Coimbra, Portugal)

    Although the basic Data Warehouse schema concept is centralized, there are increasingly application domains in which there is the need to have several sites or computers input and analyze the data, therefore distributed data placement and processing is necessary. Given that sites may have different amounts of data and different processing capacities, how can we conform to the placement requirements of the context and balance such a system effectively? In WAN environments the network speed is a very relevant factor and there are application requirements concerning the place where each piece of data stays, based on who produced the data (ownership). We propose a new strategy that accepts the placement requirements of the desired context and uses an effective automatic approach to determine fixed-sized chunks and to balance and process those chunks efficiently. Our experimental results show the validity of the approach and how to minimize the context limitations.

  3. Enabling Privacy-Preserving e-Payment Processing
    Mafruz Ashrafi, See-Kiong Ng (Inst. for Infocomm Research, Singapore)

    The alarming increase in the number of data breaching incidents from high profile companies reflects that buying goods or services from online merchants can pose a serious risk of customers’ privacy and the merchants’ business reputation. The conventional approach of encrypting customer data at merchant side using the merchant’s secret key is no longer adequate for preserving customer privacy. An e-payment scheme that can guarantee customer authenticity while keeping the customer’s sensitive details secret from the various parties involved in the online transaction is needed. We propose here an online protocol for processing e-payments that minimizes the customer’s privacy as well as merchant business risks. Using a non-reusable password-based authentication approach, the proposed protocol allows consumers to purchase goods or services from an online merchant anonymously, thus achieving the ideal privacy environment in which to shop. The payment details sent to a merchant will become obsolete after the first use, thereby preventing any subsequent fraudulent transactions by a third party. Such protocol can be easily deployed in an e-commerce environment to strengthen the integrity of the electronic payment system.

  4. Managing and Correlating Historical Events using an Event Timeline Datatype
    Abhishek Biswas, Bhupalam Sagar (Amrita School of Engineering, India), Jagannathan Srinivasan (Oracle, USA)

    An event is occurrence of an incident at a particular instant of time with a set of attributes describing it. A collection of similar events ordered by time is called an event timeline. This paper presents the notion and design of an event timeline datatype, which can be used to store, maintain and operate on a collection of events as a single abstraction. The datatype represents event occurrences as bitmaps with methods for accessing and updating the timeline. Two additional pair-wise operators union (OR) and intersection (AND) allow correlating timelines to discover new information. Semi-structuring event attributes of a timeline as (name, value) pairs provides a basis for search and indexing. The datatype can be used either to develop simple applications requiring timelines, such as genealogy, patient history, etc. or to develop more involved applications tracing the evolution of a field based on timelines depicting developments in related areas.

  5. Online Collaborative Stock Control and Selling Among E-retailers
    Tao-Chung Lin, Toly Chen (Feng Chia U., Taiwan)

    Zwass claimed that in e-commerce there are five areas (commerce, collaboration, communication, connection, and computation) in which innova-tion opportunities could be found. This research is devoted to investigating one way of innovative collaboration among e-retailers and applies the concept of virtual warehousing under an online inter-organizational framework, so as to facilitate selling and to accelerate inventory depletion by resources sharing and mutual assistance. An online collaborative stock control and selling mecha-nism among e-retailers is therefore constructed. To evaluate the effectiveness of the constructed mechanism, two experimental B2C websites are also built for simulation analyses. According to experimental results, the constructed mechanism could indeed improve the efficiency of selling products for the par-ticipating e-retailers by shortening the average time to stock depletion up to 14%.

  6. AutoSeek: A Tool for Mining Automotive Warranty Claims Data for Effective Root Cause Analysis
    Ashish Sureka (Infosys, India)

    We present an application of text analytics in automotive industry and describe a research prototype for extracting named-entities in textual data recorded in automotive warranty claim forms. We describe an application for gaining useful insights about products defect reported to the dealer during the warranty period of vehicles. The prototype is developed for air-conditioning subsystem and consists of two main components: a text tagging and annotation engine a query engine. We present some real world examples with sample output and share our design and implementation experiences.

S5: Mobile and Distributed Data

  1. A Similarity Search of Trajectory Data using Textual Information Retrieval Techniques
    Yu Suzuki, Jun Ishiduka, Kyoji Kawagoe (Ritsumeikan U., Japan)

    In this paper, we propose a novel similarity measure between trajectory data and geometry data using textual information retrieval techniques. Currently, many trajectory data are generated and used for sightseeing. When users search trajectory data at sightseeing, if a user's current position is similar to a retrieval target trajectory datum, this trajectory datum should be useful. However, even if the euclidean distances between the moving points in trajectory data and the user's position have small values, these trajectory data are not always relevant to the user's interests. In this paper, we deal with textual information retrieval method to measure the similarity values between retrieval target trajectory data and user's current position. Using our proposed method, users can gain relevant sightseeing spots as retrieval results. In our experimental evaluation, we confirmed that our proposed method can retrieve intuitively relevant trajectory data.

  2. Constrained k-Nearest Neighbor Query Processing over Moving Object Trajectories
    Yunjun Gao, Gencai Chen (Zhejiang U., China), Qing Li (City U. of Hong Kong, China), Chun Li, Chun Chen (Zhejiang U., China)

    Given a set D of trajectories, a query object (point or trajectory) q, a time interval T, and a constrained region CR, a constrained k-nearest neighbor (CkNN) query over moving object trajectories retrieves from D within T, the k (>= 1) trajectories that lie closest to q and intersect (or are enclosed by) CR. In this paper, we propose several algorithms for efficiently processing CkNN search on moving object trajectories. In particular, we thoroughly investigate two types of CkNN queries, viz. CkNN_P and CkNN_T queries, which are de-fined w.r.t. stationary query points and moving query trajectories, respectively. The performance of our algorithms is evaluated with extensive experiments using both real and synthetic datasets.

  3. Location Update Strategies for Network-Constrained Moving Objects
    Zhiming Ding (ISCAS, Beijing, China), Xiaofang Zhou (U. of Queensland, Australia)

    Location update strategy is one of the most important factors that affect the performance of moving objects databases. In this paper, a new location update mechanism, Location Update Mechanism for Network-Constrained Moving Objects (Net-LUM), is proposed. Through active-motion-vector-based network-matching and special treatment with junctions, Net-LUM can achieve better performances in terms of communication costs and location tracking accuracy, which is confirmed by the experimental results.

  4. A P2P Meta-Index for Spatio-Temporal Moving Object Databases
    Cecilia Hernandez, Andrea Rodriguez (U. of Concepcion, Chile), Mauricio Marin (Yahoo! Research, Chile)

    In this paper we propose a distributed meta-index using a peer-to-peer protocol to allow spatio-temporal queries of moving objects on a large set of distributed database servers. We present distribution and fault tolerance strategies of a meta-index that combines partial trace of object movements with aggregated data about the number of objects in the database servers. The degrees of distribution and fault tolerance were compared by using discrete-event simulators with demanding spatio-temporal workloads. The results show that the distributed meta-index using Chord provides good performance, scalability and fault tolerance.

  5. An Update Propagation Strategy Considering Access Frequency in Peer-to-Peer Networks
    Toshiki Watanabe, Akimitsu Kanzaki, Takahiro Hara, Shojiro Nishio (Osaka U., Japan)

    In a P2P network, a data update occurred on a particular peer should be immediately propagated to other peers holding its replicas. In our previous work, we proposed a novel update propagation strategy using a tree structure for delay reduction and node failure tolerance. This strategy propagates the updated data according to the tree. In this paper, we extend our previous strategy to selectively propagate each updated data considering the data access frequency of each peer. The extended strategy propagates the updated data to peers which frequently access the data, whereas only a small message informing that the replica has become invalid is propagated to peers which rarely access the data.

  6. Towards Automated Analysis of Connections Network in Distributed Stream Processing System
    Marcin Gorawski, Pawel Marks (Silesian U. of Technology, Poland)

    Not so long ago data warehouses were used to process data sets loaded periodically during ETL process (Extraction, Transformation and Loading). We could distinguish two kinds of ETL processes: full and incremental. Now we often have to process real-time data and analyse them almost on-the-fly, so the analyses are always up to date. There are many possible applications for real-time data warehouses. In most cases two features are important: delivering data to the warehouse as quick as possible, and not losing any tuple in case of failures. In this paper we describe an architecture for gathering and processing data from geographically distributed data sources and we define a method for analysing properties of the connections structure, finding the weakest points in case of single and multiple node failures. At the end of the paper our future plans are described briefly.