- Ranking Database Queries With User Feedback: A Neural Network Approach
Nevedita Mallick, Ganesh Agarwal, Srinivasan Turuvekere (U. of Illinois, Urbana-Champaign, USA)
Currently, websites on the Internet serving structured data allow
users to search using simple equality or range constraints on data
attributes. However, users may not know precisely what is desirable
to them, and so cannot express it accurately in terms of primitive
equality or range constraints. Additionally, in most websites, the
results provided to users can be sorted only with respect to the
values of one attribute at a time. For the user, this is like
searching for a needle in a haystack, because the user's notion of
interesting objects is generally a function of multiple attributes.
In this paper, we develop an approach to (i) support a family of
functions involving multiple attributes to rank the tuples, and (ii)
improve the ranking of results returned to the user by incorporating
user feedback (to learn the user's notion of interestingness)
with the help of a neural network. The feedback-driven approach
is effective in modeling a user's intuitive sense of the desirability
of a tuple, a notion that is otherwise nearly impossible to quantify
mathematically. To demonstrate the effectiveness of our approach, we
have built a middleware for an application domain that implements and
evaluates these ideas.
- Supporting Keyword Queries on Structured Databases with Limited Search Interfaces
Nurcan Yuruk, Xiaowei Xu (U. of Arkansas, Little Rock, USA), Chen Li (U. of California, Irvine, USA), Jeffrey Xu Yu (Chinese U. of Hong Kong, China)
Many Web sources provide forms to allow users to query their hidden data. For instance, online stores such as Amazon.com have search interfaces through which users can query information about books by providing conditions on the title, author, and publisher attributes. We propose a novel system framework that supports keyword queries on structured data behind such limited search forms. It provides user-friendly query interfaces for users to type in IR-style keyword queries to find relevant records. We study the research challenges in the framework and conduct extensive experiments on real datasets to show the practicality of our framework and evaluate different algorithms.
- Automated Data Discovery in Similarity Score Queries
Fatih Altiparmak, Ali Tosun, Hakan Ferhatosmanoglu, Ahmet Sacan (Ohio State U., USA)
A vast amount of information is stored in scientific databases on
the web. The dynamic nature of scientific data, the cost of
providing an up-to-date snapshot of the whole database, and
proprietary considerations compel database owners to hide the
original data behind search interfaces. The information is often
provided to researchers through similarity-search query interfaces,
which limits a proper and focused analysis of the data. In this
study, we present systematic methods of data discovery through
similarity-score queries in such ``uncooperative'' databases. The
methods are generalized to multi-dimensional data and to L-p norm
distance functions. The accuracy and performance of our methods are
demonstrated on synthetic and real-life datasets. The methods
developed in this study enable scientists to obtain the data within
the range of their research interests, overcoming the limitations
of the similarity-search interface. The results of this study also
have implications for data privacy and security, where discovery of
the original data is not desired.
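The score-query discovery idea can be illustrated with a toy one-dimensional sketch, assuming an interface that returns only an L1 distance score to a probe point; the interface and function names are illustrative, not the authors' method:

```python
# Hypothetical sketch: recovering a hidden 1-D value from an
# "uncooperative" interface that returns only a similarity score
# (here: the L1 distance between the probe and the hidden value).
def make_interface(hidden):
    """Simulated search interface: reveals only the distance score."""
    return lambda probe: abs(probe - hidden)

def discover(score):
    """Recover the hidden value from two score queries.
    d1 = |p1 - x| leaves two candidates for x; the score from a
    second probe p2 != p1 disambiguates between them."""
    p1, p2 = 0.0, 1.0
    d1, d2 = score(p1), score(p2)
    for cand in (p1 + d1, p1 - d1):
        if abs(abs(p2 - cand) - d2) < 1e-9:
            return cand

iface = make_interface(7.25)
assert discover(iface) == 7.25
```

In higher dimensions and under general L-p norms, more probes are needed and each score constrains the hidden point to a norm sphere rather than two candidates, but the principle of intersecting score constraints is the same.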
- Efficient Algorithms for Node Disjoint Subgraph Homeomorphism Determination
Yanghua Xiao, Wentao Wu, Wei Wang, Zhenying He (Fudan U., China)
Recently, great effort has been dedicated to research on the
management of large-scale graph-based data, where the node disjoint
subgraph homeomorphism relation between graphs has been shown to be
more suitable than (sub)graph isomorphism in many cases, especially
those where node skipping and node mismatching are desired.
However, no efficient algorithm for node disjoint subgraph
homeomorphism determination (ndSHD) has been available. In this
paper, we propose two computationally efficient ndSHD algorithms
based on state-space search with backtracking, which employ many
heuristics to prune the search space. Experimental results on
synthetic data sets show that the proposed algorithms are efficient,
require relatively little time in most cases, can scale to large
or dense graphs, and can accommodate more complex fuzzy-matching
cases.
- The Chronon Based Model for Temporal Databases
Anurag D, Anup Sen (IIM Calcutta, India)
Among various models for temporal databases, such as TSQL2, XML, and SQL:2003, TSQL2 is perhaps the most comprehensive and introduces new constructs for temporal query support. However, these constructs mask the underlying conceptual model from the user and essentially provide a “black box” flavour. Further, the temporal attributes are not directly accessible. In this paper, we propose a “chronon”-based conceptual (relational) model, which works on a tuple-timestamping paradigm. Here, the time attribute is treated in the same fashion as any other attribute and is available to the user for inclusion in SQL statements, making the model easy to understand and use. A corresponding efficient physical model based on an attribute-versioning format akin to XML/SQL:2003 is also proposed, and the mapping between the proposed conceptual and physical models is described. We evaluate the model by comparing it with TSQL2 and SQL:2003.
- Main Memory Commit Processing: The Impact of Priorities
Heine Kolltveit, Svein-Olaf Hvasshovd (NTNU, Norway)
Distributed transaction systems require an atomic commitment protocol to preserve ACID properties. The overhead of commit processing is a significant part of the load on a distributed database. Here, we propose approaches in which this overhead is reduced by prioritizing urgent messages and operations. This is done in the context of main memory primary-backup systems, and the proposed approaches are found to significantly reduce the response time as seen by the client. In addition, by piggybacking messages on each other over the network, throughput is increased. Simulation results show that performance can be significantly improved using this approach, especially for utilizations above 50%.
- Exploiting ID References for Effective Keyword Search in XML Database
Bo Chen (National U. of Singapore), Jiaheng Lu (U. of California, Irvine, USA), Tok Wang Ling (National U. of Singapore)
In this paper, we study a novel Tree + IDREF data model
for keyword search in XML. In this model, we propose the novel
Lowest Referred Ancestor (LRA) pair, Extended LRA
(ELRA) pair, and ELRA group semantics
for effective and efficient keyword search. We develop efficient
algorithms to compute the search results based on our semantics.
An experimental study shows the superiority of our approach.
- Storage techniques for multi-versioned XML documents
Laura Irina Rusu, Wenny Rahayu (LaTrobe U., Australia), David Taniar (Monash U., Australia)
Following the increased popularity of XML documents for information exchange between applications and for representing semi-structured data, warehousing collections of XML documents has become increasingly important. The focus of this paper is on the storage of dynamic (multi-versioned) XML documents. We present four different storage methods for dynamic XML documents and evaluate each of them using two performance indicators, namely I/O loading and versioning costs.
- Twig'n Join: Progressive Query Processing of Multiple XML Streams
Wee Hyong Tok, Stephane Bressan, Mong Li Lee (National U. of Singapore)
We propose a practical approach to the progressive processing of (FWR)
XQuery queries on multiple XML streams, called Twig'n Join (or TnJ). The
query is decomposed into a query plan combining several twig queries on
the individual streams, followed by a multi-way join and a final twig query. The
processing itself is accordingly decomposed into three pipelined stages
progressively producing streams of XML fragments. Twig'n Join combines the
advantages of the recently proposed TwigM algorithm and our previous work on relational result-rate based progressive joins. In addition, we introduce a novel dynamic probing technique, called Result-Oriented Probing (ROP), which determines an optimal probing sequence for the multi-way join. This significantly reduces the amount of redundant probing for results.
We comparatively evaluate the performance of Twig'n Join using both synthetic and real-life data from standard XML query processing benchmarks. We show that Twig'n Join is indeed effective and efficient for processing multiple XML streams.
- TwigBuffer: Avoiding Useless Intermediate Solutions Completely in Twig Joins
Jiang Li, Junhu Wang (Griffith U., Australia)
Twig pattern matching plays a crucial role in XML data processing.
TwigStack is a holistic twig join
algorithm that solves the problem in two steps: (1) finding
potentially useful intermediate path solutions, and (2) merging the
intermediate solutions. The algorithm is optimal when the twig
pattern has only ancestor-descendant (//) edges, in the sense that no
useless partial solutions are
generated in the first step (thus expediting the second step and
boosting the overall performance). However, when parent-child (/)
edges are present, a large set of useless partial solutions may be
produced, which directly degrades the overall performance. Recently,
improved versions of the algorithm (e.g., TwigStackList and
iTwigJoin) have been proposed in an attempt to reduce the number
of useless partial solutions when parent-child edges are involved.
However, none of these algorithms can avoid useless partial solutions
completely. In this paper, we propose a new algorithm, TwigBuffer, that
is guaranteed to completely avoid useless partial solutions.
Our algorithm is based on an ingenious strategy to buffer and
manipulate elements in stacks and lists. Experiments show that TwigBuffer
significantly outperforms previous algorithms when arbitrary
combinations of // and / edges are present.
- An Approach for XML Similarity Join using Tree Serialization
Lianzi Wen, Toshiyuki Amagasa, Hiroyuki Kitagawa (U. of Tsukuba, Japan)
This paper proposes a scheme for similarity join over XML data based
on XML data serialization and subsequent similarity matching over
XML node subsequences. With the recent explosive diffusion of XML,
great volumes of electronic data are now marked up with XML. As a
consequence, a growing amount of XML data represents similar
contents, but with dissimilar structures. To extract as much
information as possible from such heterogeneous data,
similarity join has been used. Our proposed similarity
join for XML data can be summarized as follows: 1) we serialize XML
data as XML node sequences; 2) we extract semantically/structurally
coherent subsequences; 3) we filter out dissimilar subsequences
using textual information; and 4) we extract pairs of subsequences
as the final result by checking structural similarity. This
process is costly to execute. To make it scalable to
large document sets, we use Bloom filters to speed up the text-similarity computation. We show the feasibility of the proposed scheme through experiments.
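A Bloom-filter pre-filter of the kind mentioned above can be sketched as follows; this is an illustration of the general technique, not the authors' implementation, and all names are made up:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over text tokens (illustrative sketch)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _hashes(self, item):
        # derive k hash positions from a cryptographic hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # no false negatives; small chance of false positives
        return all(self.bits >> pos & 1 for pos in self._hashes(item))

def quick_overlap(bf, tokens):
    """Approximate fraction of tokens possibly shared with the indexed text."""
    return sum(t in bf for t in tokens) / len(tokens)

bf = BloomFilter()
for t in "xml similarity join over serialized node sequences".split():
    bf.add(t)
# cheap pre-filter: skip expensive structural matching when overlap is low
assert quick_overlap(bf, ["similarity", "join"]) == 1.0
```

The point of the filter is that a candidate subsequence pair with low token overlap can be discarded using only bitwise operations, before any costly structural comparison.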
- A Holistic Algorithm for Efficiently Evaluating Xtwig Joins
Bo Ning, Guoren Wang (Northeastern U., China), Jeffrey Xu Yu (Chinese U. of Hong Kong,China)
More and more XML data are generated and used in data
exchange. XML employs a tree-structured data model, but many
queries submitted by users do not follow this tree structure. Such
queries contain ancestor axes in predicates, and specify the
pattern of selection predicates on multiple elements from
descendants to ancestors. Efficiently finding all occurrences of
such an xtwig pattern in an XML database is crucial for XML query
processing. A straightforward method is to rewrite an xtwig
pattern into an equivalent reverse-axis-free one. However, this
method requires scanning the element streams several times and is
rather expensive. In this paper, we study the xtwig pattern
and propose two basic decomposition methods, VertiDec and
HoriDec, as well as a holistic processing method,
XtwigStack, for processing xtwig queries. Experiments
show that the holistic algorithm is much more efficient than the
rewriting and decomposition approaches.
- Association Rules Induced by Item and Quantity Purchased
Animesh Adhikari (S P Chowgule College, Goa, India), P. R. Rao (Goa U., India)
Most real market-basket data are non-binary, in the sense that an item can be purchased multiple times in the same transaction. In this case, there are two types of occurrence of an itemset in a database: the number of transactions in the database containing the itemset, and the number of occurrences of the itemset in the database. The traditional support-confidence framework might not be adequate for extracting association rules from such a database. In this paper, we introduce three categories of association rules, together with a framework based on the traditional support-confidence framework for mining each category. We present experimental results based on two databases.
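The two occurrence counts for a non-binary database can be sketched as below; counting itemset occurrences as the minimum item quantity per containing transaction is an assumption made for illustration, not necessarily the paper's definition:

```python
from collections import Counter

def supports(db, itemset):
    """Two counts for a quantity-aware market-basket database:
    (1) transactions containing the itemset,
    (2) total itemset occurrences, taken here as the minimum
        quantity of any member item per containing transaction
        (an illustrative definition)."""
    itemset = set(itemset)
    tx_count = sum(1 for tx in db if itemset <= set(tx))
    occ_count = sum(min(Counter(tx)[i] for i in itemset)
                    for tx in db if itemset <= set(tx))
    return tx_count, occ_count

# each transaction lists items with multiplicity (quantities purchased)
db = [["bread", "bread", "milk"], ["bread", "milk", "milk"], ["eggs"]]
assert supports(db, ["bread", "milk"]) == (2, 2)
assert supports(db, ["bread"]) == (2, 3)
```

The divergence between the two counts (here 2 transactions but 3 occurrences of "bread") is exactly what makes a single binary support measure inadequate for such data.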
- An Indexed Trie Approach to Incremental Mining of Closed Frequent Itemsets based on a Galois Lattice Framework
Kalpana Balasubrahmanyan (Avinashilingam U., India), R Nadarajan, Senthil Babu (PSG College of Technology, India)
Incrementality is a major challenge in mining dynamic databases. In such databases, the maintenance of association rules can be directly mapped into the problem of maintaining closed frequent itemsets. A number of incremental strategies have been proposed earlier, with several limitations. A serious limitation is the need to examine the entire family of closed itemsets whenever there are insertions or deletions in the database. The proposed strategy relies on an efficient and selective update of the closed itemsets using an indexed trie structure. The framework builds on certain fundamental and structural properties of Galois lattice theory to overcome the limitations of the earlier approaches. Apart from facilitating a selective update, the indexed structure removes the need to keep the trie wholly memory-resident.
- Efficiently Mining Recent Frequent Patterns over Online Data Streams
Guohui Li, Hui Chen (Huazhong U. of Science and Technlogy, China), LihChyun Shu (National Cheng Kung U., Taiwan)
This paper proposes a method for mining frequent patterns in an arbitrary
sliding window over data streams. As streams flow, their contents are
captured in an SWP-tree by scanning the stream only once, and
obsolete and infrequent patterns are deleted by periodically pruning the
tree. To differentiate the patterns of recently generated transactions from
those of historic transactions, a time-decaying model is also applied. The
experimental results show that the proposed method is efficient and scalable,
and that it is superior to other analogous algorithms.
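A time-decaying model of the kind applied above can be sketched as follows; the exponential weighting is an illustrative choice, not necessarily the paper's exact SWP-tree weighting:

```python
def decayed_support(timestamps, now, decay=0.9):
    """Time-decayed count of a pattern's occurrences: an occurrence
    at time t contributes decay**(now - t), so recent transactions
    weigh more than historic ones (illustrative model)."""
    return sum(decay ** (now - t) for t in timestamps)

# a pattern seen three times long ago vs. twice very recently,
# both evaluated at time 10
old_heavy = decayed_support([1, 2, 2], now=10)
recent = decayed_support([9, 9], now=10)
assert recent > old_heavy  # two recent occurrences outweigh three old ones
```

Under such a model, a pattern's decayed support can fall below the frequency threshold purely through the passage of time, which is what justifies periodically pruning obsolete patterns from the tree.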
- Index-Supported Similarity Search using Multiple Representations
Johannes Assfalg, Hans-Peter Kriegel, Peter Kunath, Alexey Pryakhin, Michael Kats (Ludwig Maximilians U., Germany)
Similarity search in complex databases is of utmost interest in a
wide range of application domains. Often, complex objects are
described by several representations. The combination of these
different representations usually contains more information than any
single representation alone. In our work, we introduce the use of an
index structure in combination with a negotiation-theory-based
approach for deriving a suitable subset of representations for a
given query object. This most promising subset of representations is
determined in an unsupervised way at query time. We experimentally
show how this approach significantly increases the efficiency of the
query processing step. At the same time, the effectiveness, i.e. the
quality of the search results, is equal to or even higher than that of
standard combination methods.
- Distance Based Feature Selection for Clustering Microarray Data
Manoranjan Dash (Nanyang Technological U., Singapore)
In microarray data, clustering is a fundamental task for separating genes into biologically functional groups or for classifying tissues and phenotypes. With innovative gene-expression microarray technologies, thousands of expression levels of genes (features) can now be measured simultaneously in a single experiment. The large number of genes, combined with substantial noise, makes cluster analysis highly complex. This challenge has raised the demand for feature selection - an effective dimensionality reduction technique that removes noisy features. In this paper we propose a novel filter method for feature selection. The suggested method, called ClosestFS, is based on a distance measure. For each feature, the distance is evaluated by computing its impact on the histogram of the whole data. Our experimental results show that the quality of the clustering results (evaluated by several widely used measures) of the K-means algorithm using ClosestFS as a pre-processing step is significantly better than that of pure K-means.
- Knowledge Transferring Based on Implicit Link Analysis
Xiao Ling, Wenyuan Dai, Gui-Rong Xue, Yong Yu (Shanghai Jiaotong U., China)
In this paper, we design a local classification algorithm
using implicit link analysis for the situation in which
the labeled and unlabeled data are drawn from two different albeit
related domains. In contrast to many global classifiers, e.g.
Support Vector Machines, our local classifier takes into
account only the neighborhood information around unlabeled data points,
rather than the global distribution of the data set.
Thus, the local classifier is well suited to the
non-i.i.d. classification problem, since its generalization
will not degrade due to bias w.r.t. each unlabeled data point. We
build a local neighborhood by connecting similar data points.
Based on these implicit links, the Relaxation Labeling
technique is employed. In this work, we theoretically and
empirically analyze our algorithm and show how it
improves upon traditional classifiers. Our
algorithm greatly outperforms state-of-the-art supervised and
semi-supervised algorithms when classifying documents across
different domains.
- Redundant Array of Inexpensive Nodes for DWS
Jorge Vieira (Critical Software SA, Portugal), Marco Vieira (U. of Coimbra, Portugal), Marco Costa (Critical Software SA, Portugal), Henrique Madeira (U. of Coimbra, Portugal)
The DWS (Data Warehouse Striping) technique is a round-robin data-partitioning approach especially designed for distributed data warehousing environments. In DWS, the fact tables are distributed over an arbitrary number of low-cost computers and queries are executed in parallel by all the computers, guaranteeing a nearly optimal speed-up and scale-up. However, the use of a large number of inexpensive nodes increases the risk of node failures that impair the computation of queries. This paper proposes an approach that provides Data Warehouse Striping with the capability of answering queries even in the presence of node failures. The approach is based on the selective replication of data over the cluster nodes, which guarantees full availability when one or more nodes fail. The proposal was evaluated using the new TPC-DS benchmark, and the results show that the approach is quite effective.
- Load-Balancing for WAN Data Warehouses
Pedro Furtado (U. Coimbra, Portugal)
Although the basic Data Warehouse schema concept is centralized, there are increasingly many application domains in which several sites or computers must input and analyze the data, so distributed data placement and processing are necessary. Given that sites may have different amounts of data and different processing capacities, how can we conform to the placement requirements of the context and balance such a system effectively? In WAN environments, network speed is a very relevant factor, and there are application requirements concerning where each piece of data resides, based on who produced the data (ownership). We propose a new strategy that accepts the placement requirements of the desired context and uses an effective automatic approach to determine fixed-size chunks and to balance and process those chunks efficiently. Our experimental results show the validity of the approach and how to minimize the context limitations.
- Enabling Privacy-Preserving e-Payment Processing
Mafruz Ashrafi, See-Kiong Ng (Inst. for Infocomm Research, Singapore)
The alarming increase in the number of data-breach incidents at high-profile companies shows that buying goods or services from online merchants can pose a serious risk to customers’ privacy and to the merchants’ business reputation. The conventional approach of encrypting customer data at the merchant side using the merchant’s secret key is no longer adequate for preserving customer privacy. An e-payment scheme is needed that can guarantee customer authenticity while keeping the customer’s sensitive details secret from the various parties involved in the online transaction. We propose here an online protocol for processing e-payments that minimizes the customer’s privacy risks as well as the merchant’s business risks. Using a non-reusable password-based authentication approach, the proposed protocol allows consumers to purchase goods or services from an online merchant anonymously, thus achieving an ideal privacy environment in which to shop. The payment details sent to a merchant become obsolete after first use, thereby preventing any subsequent fraudulent transactions by a third party. The protocol can be easily deployed in an e-commerce environment to strengthen the integrity of the electronic payment system.
- Managing and Correlating Historical Events using an Event Timeline Datatype
Abhishek Biswas, Bhupalam Sagar (Amrita School of Engineering, India), Jagannathan Srinivasan (Oracle, USA)
An event is the occurrence of an incident at a particular instant of time, with a set of attributes describing it. A collection of similar events ordered by time is called an event timeline. This paper presents the notion and design of an event timeline datatype, which can be used to store, maintain, and operate on a collection of events as a single abstraction. The datatype represents event occurrences as bitmaps, with methods for accessing and updating the timeline. Two additional pair-wise operators, union (OR) and intersection (AND), allow correlating timelines to discover new information. Semi-structuring the event attributes of a timeline as (name, value) pairs provides a basis for search and indexing. The datatype can be used either to develop simple applications requiring timelines, such as genealogy or patient history, or to develop more involved applications tracing the evolution of a field based on timelines depicting developments in related areas.
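A bitmap-backed timeline with union/intersection operators might look like the following minimal sketch; the class and method names are illustrative, not the proposed datatype's actual interface:

```python
class EventTimeline:
    """Sketch of a bitmap-backed event timeline: bit i is set iff
    an event occurred in discrete time slot i (illustrative only)."""
    def __init__(self, slots=()):
        self.bits = 0
        for s in slots:
            self.bits |= 1 << s

    def record(self, slot):
        self.bits |= 1 << slot

    def occurred(self, slot):
        return bool(self.bits >> slot & 1)

    def union(self, other):
        # OR: slots where an event occurred in either timeline
        t = EventTimeline()
        t.bits = self.bits | other.bits
        return t

    def intersection(self, other):
        # AND: slots where events co-occurred, i.e. correlation
        t = EventTimeline()
        t.bits = self.bits & other.bits
        return t

# e.g. correlating two patient-history timelines
fever = EventTimeline([2, 5, 7])
rash = EventTimeline([5, 9])
both = fever.intersection(rash)
assert both.occurred(5) and not both.occurred(2)
```

The bitmap representation makes the AND/OR correlation operators single machine-word operations per chunk of the timeline, which is what makes pair-wise correlation of long event histories cheap.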
- Online Collaborative Stock Control and Selling Among E-retailers
Tao-Chung Lin, Toly Chen (Feng Chia U., Taiwan)
Zwass claimed that in e-commerce there are five areas (commerce, collaboration, communication, connection, and computation) in which innovation opportunities can be found. This research investigates one form of innovative collaboration among e-retailers and applies the concept of virtual warehousing under an online inter-organizational framework, so as to facilitate selling and to accelerate inventory depletion through resource sharing and mutual assistance. An online collaborative stock control and selling mechanism among e-retailers is therefore constructed. To evaluate the effectiveness of the constructed mechanism, two experimental B2C websites are built for simulation analyses. According to the experimental results, the constructed mechanism can indeed improve the efficiency of selling products for the participating e-retailers, shortening the average time to stock depletion by up to 14%.
- AutoSeek: A Tool for Mining Automotive Warranty Claims Data for Effective Root Cause Analysis
Ashish Sureka (Infosys, India)
We present an application of text analytics in the automotive industry and describe a research prototype for extracting named entities from textual data recorded in automotive warranty claim forms. We describe an application for gaining useful insights about product defects reported to the dealer during the warranty period of vehicles. The prototype was developed for the air-conditioning subsystem and consists of two main components: a text tagging and annotation engine, and a query engine. We present some real-world examples with sample output and share our design and implementation experiences.
- A Similarity Search of Trajectory Data using Textual Information Retrieval Techniques
Yu Suzuki, Jun Ishiduka, Kyoji Kawagoe (Ritsumeikan U., Japan)
In this paper, we propose a novel similarity measure between trajectory
data and geometry data using textual information retrieval
techniques. Currently, many trajectory data are generated and used for
sightseeing. When users search trajectory data while sightseeing, a
retrieval-target trajectory datum should be useful if it is similar to
the user's current position. However, even if the Euclidean distances
between the moving points in trajectory data and the user's position
are small, these trajectory data are not always relevant to the
user's interests. In this paper, we apply a textual information
retrieval method to measure the similarity between retrieval-target
trajectory data and the user's current position. Using our proposed
method, users can obtain relevant sightseeing spots as retrieval
results. In our experimental evaluation, we confirmed that our
proposed method can retrieve intuitively relevant trajectory data.
- Constrained k-Nearest Neighbor Query Processing over Moving Object Trajectories
Yunjun Gao, Gencai Chen (Zhejiang U., China), Qing Li (City U. of Hong Kong, China), Chun Li, Chun Chen (Zhejiang U., China)
Given a set D of trajectories, a query object (point or trajectory) q, a time interval T, and a constrained region CR, a constrained k-nearest-neighbor (CkNN) query over moving object trajectories retrieves from D, within T, the k (>= 1) trajectories that lie closest to q and intersect (or are enclosed by) CR. In this paper, we propose several algorithms for efficiently processing CkNN search on moving object trajectories. In particular, we thoroughly investigate two types of CkNN queries, viz. CkNN_P and CkNN_T queries, which are defined w.r.t. stationary query points and moving query trajectories, respectively. The performance of our algorithms is evaluated with extensive experiments using both real and synthetic datasets.
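The CkNN predicate can be illustrated on static points, a deliberate simplification of the trajectory setting described above; the rectangular region and all names are assumptions for illustration:

```python
import math

def cknn(points, q, cr, k):
    """Toy constrained kNN over static 2-D points (a simplification
    of the trajectory setting): return the k points nearest to the
    query point q that also lie inside the constrained rectangular
    region cr = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = cr
    # constraint filter: keep only points enclosed by CR
    inside = [p for p in points if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]
    # then rank by Euclidean distance to q and keep the k closest
    return sorted(inside, key=lambda p: math.dist(p, q))[:k]

pts = [(0, 0), (1, 1), (2, 2), (5, 5)]
assert cknn(pts, q=(0, 0), cr=(0.5, 0.5, 3, 3), k=2) == [(1, 1), (2, 2)]
```

Note that the constraint changes the answer: (0, 0) is the globally nearest point to q, but it falls outside CR and is therefore excluded. Efficient algorithms avoid this brute-force scan by pruning with an index, but the filter-then-rank semantics is the same.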
- Location Update Strategies for Network-Constrained Moving Objects
Zhiming Ding (ISCAS, Beijing, China), Xiaofang Zhou (U. of Queensland, Australia)
The location update strategy is one of the most important factors affecting the performance of moving object databases. In this paper, a new location update mechanism, the Location Update Mechanism for Network-Constrained Moving Objects (Net-LUM), is proposed. Through active-motion-vector-based network matching and special treatment of junctions, Net-LUM achieves better performance in terms of communication cost and location-tracking accuracy, which is confirmed by the experimental results.
- A P2P Meta-Index for Spatio-Temporal Moving Object Databases
Cecilia Hernandez, Andrea Rodriguez (U. of Concepcion, Chile), Mauricio Marin (Yahoo! Research, Chile)
In this paper we propose a distributed meta-index using a peer-to-peer
protocol to allow spatio-temporal queries of moving objects on a large set of distributed database
servers. We present distribution and fault-tolerance strategies for a meta-index that combines partial traces of object movements with aggregated data about the number of objects in the database servers. The degrees of distribution and fault tolerance were compared using discrete-event simulators with demanding spatio-temporal workloads. The results show that the distributed meta-index using Chord provides good performance, scalability, and fault tolerance.
- An Update Propagation Strategy Considering Access Frequency in Peer-to-Peer Networks
Toshiki Watanabe, Akimitsu Kanzaki, Takahiro Hara, Shojiro Nishio (Osaka U., Japan)
In a P2P network, a data update that occurs on a particular peer should be immediately
propagated to the other peers holding its replicas.
In our previous work, we proposed a novel update propagation strategy using a tree structure for delay reduction and node-failure tolerance.
This strategy propagates updated data according to the tree.
In this paper, we extend our previous strategy to selectively propagate each updated data item according to the data access frequency of each peer.
The extended strategy propagates the updated data to
peers that frequently access the data, whereas only a small message indicating that
the replica has become invalid is propagated to peers that rarely access
the data.
- Towards Automated Analysis of Connections Network in Distributed Stream Processing System
Marcin Gorawski, Pawel Marks (Silesian U. of Technology, Poland)
Not so long ago, data warehouses were used to process data sets loaded periodically during an ETL (Extraction, Transformation and Loading) process. We can distinguish two kinds of ETL processes: full and incremental. Now we often have to process real-time data and analyse them almost on the fly, so that the analyses are always up to date. There are many possible applications for real-time data warehouses. In most cases two features are important: delivering data to the warehouse as quickly as possible, and not losing any tuple in case of failures. In this paper, we describe an architecture for gathering and processing data from geographically distributed data sources, and we define a method for analysing the properties of the connection structure, finding the weakest points in the case of single and multiple node failures. At the end of the paper, our future plans are described briefly.