- Holistic Constraint-Preserving Transformation from Relational Schema to XML Schema
Rui Zhou, Chengfei Liu, Jianxin Li (Swinburne U. of Technology, Australia)
In this paper, we propose a holistic scheme of transforming a
relational schema into an XML Schema with integrity constraints
preserved. This scheme facilitates constructing a schema for the
published XML views of relational data. With this schema, users are
able to issue qualified queries against XML views, and discover
update anomalies in advance, before propagating the view updates into
the relational database. Compared to previous work, which splits the
transformation process into two steps, we establish a holistic
solution to directly transform a relational schema into an XML Schema
without building a reference graph. We achieve this by classifying
the underlying relations in a more concise and effective way, and
applying the conversion rules judiciously. The rules are also devised
to be more compact and less complicated than those in our
previous work. Finally, we tackle a hard problem that has seldom
been addressed before, namely converting circularly referenced
relations into a recursive XML Schema.
- An Optimized Two-Step Solution for Updating XML Views
Ling Wang, Ming Jiang, Elke Rundensteiner, Murali Mani (Worcester Polytechnic Inst., USA)
View updating is a long-standing, difficult problem. Given a view
defined over base data sources and a view update, there are
several different updates over the base data sources, called
translations, that perform the update. A translation is said to
be correct if it performs the update and at the same time
does not update any portion of the view not specified in the
update (no view side-effects). The view update problem is to find
a correct translation for a given view update, if one exists. In
the relational scenario, previous research has attempted to study
the view update problem either by utilizing only the schema
knowledge, or by directly examining the base data. While utilizing
only the schema knowledge is very efficient, we are not guaranteed
to find a correct translation even if one exists. On the other
hand, examining the base data is guaranteed to find a correct
translation if one exists, but is very time-consuming. The view
update problem is even more complex in the XML context due to the
nested hierarchical structure of XML and the restructuring
capabilities of the XQuery view specification. In this paper, we
propose a schema-centric framework, named HUX, for efficiently
updating XML views specified over relational databases. HUX is
complete (always finds a correct translation if one exists) and is
efficient. The efficiency of HUX is achieved as follows. Given a
view update, HUX first exploits the schema to determine whether
there will never be a correct translation, or there will always be
a correct translation. Only if the update cannot be classified
using the schema, HUX will examine the base data to determine if
there is a correct translation. This data-level checking is
further optimized in HUX, by exploiting the schema knowledge
extracted in the first step to significantly prune the space of
potential translations that is explored. Experiments
illustrate the performance benefits of HUX over
previous solutions.
- Even an Ant Can Create an XSD
Ondrej Vosta, Irena Mlynkova, Jaroslav Pokorny (Charles U., Czech Republic)
XML has undoubtedly become a standard for data representation and manipulation. But most XML documents are still created without a description of their structure, i.e., an XML schema. Hence, in this paper we focus on the problem of automatically inferring an XML schema for a given sample set of XML documents. In particular, we focus on new features of the XML Schema language and propose an algorithm that improves on a combination of verified approaches, is sufficiently general, and can be further enhanced. Using a set of experiments, we illustrate the behavior of the algorithm on both real-world and artificial XML data.
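As a toy illustration of the inference task (not the authors' algorithm), the Python sketch below collects, for each element name in a set of sample documents, the child-name sequences it exhibits; a real inference engine would then generalize these sequences into XML Schema content models. All names here are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical first step of schema inference: record, for every
# element name seen in the samples, the sequences of child names it
# contains. A generalizer would turn these into content models.

def child_patterns(xml_samples):
    patterns = defaultdict(set)
    for sample in xml_samples:
        for elem in ET.fromstring(sample).iter():
            patterns[elem.tag].add(tuple(child.tag for child in elem))
    return dict(patterns)

docs = ["<book><title/><author/></book>",
        "<book><title/><author/><author/></book>"]
print(child_patterns(docs))
# 'book' maps to {('title','author'), ('title','author','author')};
# a generalizer might infer the content model: book -> title, author+
```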
- A User Driven Data Mining Process Model and Learning System
Esther Ge, Richi Nayak, Yue Xu, Yuefeng Li (Queensland U. of Technology, Australia)
This paper deals with the problem of using data mining models in a real-world situation where the user cannot provide all the inputs with which the predictive model was built. A learning system framework, Query Based Learning System (QBLS), is developed for improving the performance of predictive models in practice, where not all inputs are available for querying the system. An automatic feature selection algorithm called Query Based Feature Selection (QBFS) is developed for selecting features to balance the relative minimum subset of features against the relative maximum classification accuracy. The performance of the QBLS system and the QBFS algorithm is successfully demonstrated with a real-world application.
- Efficient Mining of Recurrent Rules from a Sequence Database
David Lo, Siau-Cheng Khoo (National U. of Singapore), Chao Liu (U. of Illinois, Urbana-Champaign, USA)
We study a novel problem
of mining significant recurrent rules from a sequence database.
Recurrent rules have the form "whenever a series of precedent
events occurs, eventually a series of consequent events occurs".
Recurrent rules are intuitive and characterize behaviors in many
domains. An example is in the domain of software specifications, in
which the rules capture a family of program properties beneficial to
program verification and bug detection. Recurrent rules generalize
existing work on sequential and episode rules by considering
repeated occurrences of premise and consequent events within a
sequence and across multiple sequences, and by removing the
"window" barrier. Bridging the gap between mined rules and program
specifications, we formalize our rules in linear temporal logic. We
introduce and apply a novel notion of rule redundancy to ensure
efficient mining of a compact representative set of rules.
Performance studies on benchmark datasets and a case study on an
industrial system have been performed to show the scalability and
utility of our approach.
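To make the rule semantics concrete, here is a minimal Python sketch (assumed, not taken from the paper) that counts how often a recurrent rule "whenever pre occurs, post eventually follows" holds, both within and across sequences and without a window constraint; the paper's LTL formalization and redundancy-based mining are considerably richer.

```python
# Hypothetical support/confidence counting for a recurrent rule.

def occurs_at(seq, pattern, start):
    """True if `pattern` occurs as a subsequence of seq[start:]."""
    i = start
    for event in pattern:
        try:
            i = seq.index(event, i) + 1
        except ValueError:
            return False
    return True

def rule_stats(database, pre, post):
    """Count premise occurrences and those eventually followed by post."""
    premises = followed = 0
    for seq in database:
        for i in range(len(seq)):
            if seq[i] == pre[0] and occurs_at(seq, pre, i):
                premises += 1
                if occurs_at(seq, post, i + len(pre)):
                    followed += 1
    confidence = followed / premises if premises else 0.0
    return premises, followed, confidence

db = [list("abcab"), list("abd")]
print(rule_stats(db, ["a", "b"], ["d"]))  # -> (3, 1, 0.33...)
```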
- Uniqueness Mining
Rohit Paravastu, Hanuma Kumar, Vikram Pudi (IIIT Hyderabad, India)
In this paper we consider the problem of extracting the
special properties of any given record in a dataset. We are
interested in determining what makes a given record unique
or different from the majority of the records in a dataset. In
the real world, records typically represent objects or people
and it is often worthwhile to know what special properties
are present in each object or person, so that we can make
the best use of them. This problem has not been considered
earlier in the research literature. We approach this problem
using ideas from clustering, attribute oriented induction
(AOI) and frequent itemset mining. Most of the time-consuming
work is done in a preprocessing stage, and the
online computation of the uniqueness of a given record is
instantaneous.
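A minimal sketch of the offline/online split described above, under the simplifying assumption that "uniqueness" means holding attribute values that are rare in the dataset; the paper's combination of clustering, AOI, and frequent itemsets is richer. Names and the threshold are illustrative.

```python
from collections import Counter

# Hypothetical sketch: precompute per-attribute value frequencies once
# (the offline stage), then instantly report which attribute values
# make a given record stand out (the online stage).

def build_frequencies(records):
    """Offline: one Counter of values per attribute column."""
    return [Counter(row[col] for row in records)
            for col in range(len(records[0]))]

def unique_properties(record, freqs, n_records, threshold=0.1):
    """Online: attributes whose value is shared by < threshold of records."""
    return [(col, value)
            for col, value in enumerate(record)
            if freqs[col][value] / n_records < threshold]

data = [("engineer", "chess"), ("engineer", "soccer"),
        ("engineer", "soccer"), ("artist", "soccer")]
freqs = build_frequencies(data)
print(unique_properties(data[0], freqs, len(data), threshold=0.3))
# -> [(1, 'chess')]: the hobby 'chess' is what sets this record apart
```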
- Discovering Spatial Interaction Patterns
Chang Sheng, Wynne Hsu, Mong Li Lee, Anthony Tung (National U. of Singapore)
Advances in sensing and satellite technologies and the growth of the
Internet have made vast amounts of spatial data easily accessible.
Extracting useful knowledge from these data is an important and
challenging task, in particular finding interactions among spatial
features. Existing works typically adopt a grid-like
approach to transform the continuous spatial space to a discrete
space. In this paper, we propose to model the spatial features in
a continuous space through the use of influence functions. For
each feature type, we build an influence map that captures the
distribution of the feature instances. Superimposing the influence
maps allows the interaction of the feature types to be quickly
determined. Experiments on both synthetic and real world datasets
indicate that the proposed approach is scalable and is able to
discover patterns that have been missed by existing methods.
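A rough Python sketch of the influence-function idea, assuming Gaussian kernels centered at feature instances; the kernel width and the interaction score below are illustrative choices, not the paper's exact model.

```python
import numpy as np

# Hypothetical sketch: a continuous influence function for a feature
# type as a sum of Gaussian kernels at its instances; the interaction
# of two types is estimated by evaluating one type's influence at the
# other type's instance locations.

def influence(points, x, sigma=1.0):
    """Total influence of feature instances `points` at location x."""
    d2 = np.sum((points - x) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

def interaction(points_a, points_b, sigma=1.0):
    """Mean influence of type A felt at the instances of type B."""
    return np.mean([influence(points_a, b, sigma) for b in points_b])

a = np.array([[0.0, 0.0], [1.0, 0.0]])   # instances of feature type A
b = np.array([[0.5, 0.1], [5.0, 5.0]])   # instances of feature type B
print(interaction(a, b))  # higher when A and B instances co-locate
```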
- Topological Relationships Between Map Geometries
Mark McKenney, Markus Schneider (U. of Florida, USA)
The importance of topological relationships between spatial objects is recognized in many disciplines. In the field of spatial databases, topological relationships have played an important role, providing mechanisms for spatial constraint specification and spatial indexing techniques. The use of spatial data in a map form has become popular, resulting in spatial data models such as topologies or, more generally, map geometries, which model collections of spatial objects that satisfy certain topological constraints. However, the topological relationships between map geometries remain unexplored. In this paper, we identify the set of valid topological relationships between map geometries, and then provide a mechanism by which they can be directly implemented on top of existing systems by using topological relationships between regions.
- MBR Models for Uncertainty Regions of Moving Objects
Shayma Alkobaisi, Wan D. Bae, Seon Ho Kim (U. of Denver, USA), Byunggu Yu (U. of District of Columbia, USA)
The increase in advanced location-based services such as
traffic coordination and management necessitates advanced models for
tracking the positions of Moving Objects (MOs) such as vehicles.
Computers cannot continuously update the locations of
MOs because of computational overhead, which limits the accuracy
of evaluating MOs' positions. Due to the uncertain nature of
such positions, efficiently managing and quantifying the
uncertainty regions of MOs are needed in order to improve query
response time. These regions can be rather irregular which makes
them unsuitable for indexing. This paper presents Minimum
Bounding Rectangle (MBR) approximations for three
uncertainty region models, namely, the Cylinder Model (CM),
the Funnel Model of Degree 1 (FMD1) and the Funnel
Model of Degree 2 (FMD2). We also propose an estimation of
the MBR of FMD2 that achieves a good balance between
computation time and selectivity (false-hits). Extensive
experiments on both synthetic and real spatio-temporal datasets
showed an order of magnitude improvement of the estimated model
over the other modeling methods in terms of the number of MBRs
retrieved during query processing, which directly corresponds to the
number of physical page accesses.
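For intuition, here is a sketch of the simplest of the three models under a common reading of the Cylinder Model: the uncertainty region is every point within distance r of the segment between two recorded 2D positions, so its MBR is just the segment's bounding box expanded by r. This is an assumed simplification, not the paper's full construction.

```python
# Hypothetical MBR for a Cylinder Model (CM) uncertainty region in 2D.

def cylinder_mbr(p1, p2, r):
    """(min_x, min_y, max_x, max_y) enclosing the cylinder region."""
    (x1, y1), (x2, y2) = p1, p2
    return (min(x1, x2) - r, min(y1, y2) - r,
            max(x1, x2) + r, max(y1, y2) + r)

# two consecutive GPS reports of a vehicle, uncertainty radius 0.5
print(cylinder_mbr((0.0, 0.0), (3.0, 4.0), 0.5))
# -> (-0.5, -0.5, 3.5, 4.5)
```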
- Summarization Graph Indexing: Beyond Frequent Structure-based Approach
Lei Zou (Huazhong U. of Science and Technology, China), Lei Chen (Hong Kong U. of Science and Technology, China), Huaming Zhang (U. of Alabama, Huntsville, USA), YanSheng Lu, Qiang Lou (Huazhong U. of Science and Technology, China)
The graph is an important data structure for modeling complex
structural data, such as chemical compounds, proteins, and XML
documents. Among many graph-based applications, sub-graph search is a
key problem: given a query Q, retrieve all graphs in the graph
database that contain Q as a sub-graph. Most existing sub-graph
search methods try to filter out as many false positives (graphs that
cannot appear in the results) as possible by indexing some frequent
sub-structures in the graph database. However, because they ignore
the relationships between sub-structures, these methods still admit a
high percentage of false positives. In
this paper, we propose a novel concept, Summarization Graph,
which is a complete graph and captures most topology information of
the original graph, such as sub-structures and their relationships.
Based on Summarization Graphs, we convert the filtering problem into
retrieving objects with set-valued attributes. Moreover, we build an
efficient signature file-based index to improve the filtering
process. We prove theoretically that the pruning power of our method
is larger than existing structure-based approaches. Finally, we show
by extensive experimental study on real and synthetic data sets that
the candidate set generated by the Summarization Graph-based
approach is only about 50% the size of that left by existing graph
indexing methods, and the total response time of our method is
reduced by a factor of 2-10.
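The filtering-as-set-containment idea can be sketched as follows (an assumed illustration, not the paper's design): each graph is summarized as a set of feature strings, such as sub-structures plus tokens encoding their relationships, and hashed into a fixed-width bit signature; a data graph is pruned whenever the query signature is not bitwise-contained in its signature.

```python
import hashlib

SIG_BITS = 64  # illustrative signature width

def signature(features):
    """Hash a set of feature strings into one bit signature."""
    sig = 0
    for f in features:
        digest = hashlib.sha1(f.encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:8], "big") % SIG_BITS)
    return sig

def may_contain(data_sig, query_sig):
    """Necessary condition for containment (never a false negative)."""
    return data_sig & query_sig == query_sig

data = signature({"C-C", "C-O", "C-C|C-O"})   # edges plus a pair token
query = signature({"C-C", "C-O"})
print(may_contain(data, query))                # True: kept as a candidate
print(may_contain(signature({"C-N"}), query))  # almost surely False: pruned
```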
- Best Paper: Bulk-Loading the ND-Tree in Non-ordered Discrete Data Spaces
Hyun-Jeong Seok (U. of Michigan, USA), Gang Qian, Qiang Zhu, Alexander Oswald (U. of Central Oklahoma, USA), Sakti Pramanik (Michigan State U., USA)
Applications demanding multidimensional index structures for performing
efficient similarity queries often involve a large amount of data. The
conventional tuple-loading approach to building such an index structure
for a large data set is
inefficient. To overcome the problem, a number of algorithms to bulk-load
the index structures, like the R-tree, from scratch for large data sets in
continuous data spaces have been proposed. However, many of them
cannot be directly applied to a non-ordered discrete data space (NDDS)
where data values on each dimension are discrete and have no natural
ordering. No bulk-loading algorithm has been developed specifically for
an index structure, such as the ND-tree, in an NDDS. In this paper, we present
a bulk-loading algorithm, called the NDTBL, for the ND-tree in NDDSs. It
adopts a special in-memory structure to efficiently
construct the target ND-tree. It utilizes and extends some operations in the
original ND-tree tuple-loading algorithm to exploit the properties
of an NDDS in choosing and splitting data sets/nodes during the
bulk-loading process.
It also employs some strategies such as multi-way splitting
and memory buffering to enhance efficiency. Our experimental
studies show that the presented algorithm is quite promising in bulk-loading
the ND-tree for large data sets in NDDSs.
- An Incremental Maintenance Scheme of Data Cubes
Dong Jin, Tatsuo Tsuji, Takayuki Tsuchida, Ken Higuchi (U. of Fukui, Japan)
Data cube construction is a commonly used operation in data warehouses. Because of the volume of data stored and analyzed in a data warehouse and the amount of computation involved in data cube construction, incremental maintenance of data cubes is highly effective. Previous methods for maintaining a data cube incrementally were designed mainly for relational databases. In this paper, we employ an extendible multidimensional array model to maintain data cubes. Such an array enables incremental cube maintenance without relocating any data dumped at an earlier time, while computing the data cube efficiently by utilizing the fast random access capability of arrays. Our data cube scheme and related maintenance methods are presented in this paper, and a cost analysis of our approach is shown and compared with existing methods.
- A Data Partition Based Near Optimal Scheduling Algorithm for Wireless Multi-Channel Data Broadcast
Ping Yu, Weiwei Sun, Yongrui Qin, Zhuoyao Zhang, Bole Shi (Fudan U., China)
Data broadcast is an efficient way to disseminate information to
large numbers of users in wireless environments. The Square Root
Rule (SRR) is the theoretical basis for single-channel broadcast
scheduling. In this paper, we extend the SRR and propose the
Multi-channel Square Root Rule (MSRR) for scheduling variable-length
data with skewed access probabilities on variable-bandwidth
channels. The theoretical optimal average access latency is also
provided. However, this optimal value cannot be achieved in
practice. Based on MSRR, we provide a two-phase scheduling algorithm
that achieves near-optimal access latency. First, data are
partitioned and allocated to different channels according to MSRR.
Second, different scheduling strategies are adopted on each channel
according to the skewness of the data subset allocated to that channel.
Experiments show that the difference of average access latency
between our results and the optimal value is below five percent in
most situations.
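For background, the classic single-channel Square Root Rule broadcasts item i with frequency proportional to sqrt(p_i / l_i), where p_i is its access probability and l_i its length; the paper's MSRR generalizes this to multiple variable-bandwidth channels. The sketch below only normalizes the classic single-channel frequencies.

```python
import math

# Square Root Rule: broadcast share of item i proportional to
# sqrt(p_i / l_i). A minimal sketch of the single-channel case.

def srr_frequencies(items):
    """items: list of (name, access_prob, length) -> broadcast shares."""
    weights = {name: math.sqrt(p / l) for name, p, l in items}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

items = [("hot", 0.6, 1.0), ("warm", 0.3, 1.0), ("cold", 0.1, 2.0)]
print(srr_frequencies(items))
# the hot item is broadcast most often; long, rarely read items least
```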
- A Test Paradigm for Detecting Changes in Transactional Data Streams
Willie Ng, Manoranjan Dash (Nanyang Technological U., Singapore)
A pattern is considered useful if it can be used to help a person
achieve a goal. Mining data streams for useful patterns is
important in many applications. However, data streams can change
their behavior over time, and when significant change occurs, much
harm is done to the mining result if it is not properly handled. In
the past, studies have focused mainly on adapting to changes
in data streams. We contend that adapting to changes is simply not
enough. The ability to detect and characterize change is also
essential in many applications, for example intrusion detection,
network traffic analysis, data streams from intensive care units
etc. Detecting changes is nontrivial. In this paper, an online
algorithm for change detection in utility mining is proposed. In
order to provide a mechanism for quantitatively describing the
detected change, we adopt statistical tests. We believe there
is an opportunity for an immensely rewarding synergy between data
mining and statistics. Different statistical significance tests are
evaluated and our study shows that the Chi-square test is the most
suitable for enumerated or count data (as is the case for high
utility itemsets). We demonstrate the effectiveness of the proposed
method by testing it on IBM QUEST market-basket data.
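A minimal sketch of Chi-square change detection on count data, along the lines the abstract describes: compare item counts in the current window against a reference window scaled to the same total, and flag a change when the p-value falls below a significance level. Window contents and the 0.05 level are illustrative.

```python
from scipy.stats import chisquare

# Hypothetical window-based change detector for count data.

def changed(reference, current, alpha=0.05):
    """Chi-square test of current counts against scaled reference counts."""
    scale = sum(current) / sum(reference)
    expected = [c * scale for c in reference]
    stat, p_value = chisquare(current, f_exp=expected)
    return p_value < alpha, p_value

ref = [50, 30, 20]        # item counts in the reference window
cur = [20, 30, 50]        # counts in the newest window
print(changed(ref, cur))  # (True, tiny p): the distribution has shifted
```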
- Teddies: Trained Eddies for Reactive Stream Processing
Kajal Claypool (Oracle Corporation, USA), Mark Claypool (Worcester Polytechnic Inst., USA)
In this paper, we present an adaptive stream query processor, Teddies,
that combines the key advantages of the Eddies system with the
scalability of the more traditional dataflow model. In particular, we
introduce the notion of adaptive packetization of tuples to overcome
the large memory requirements of the Eddies system. The Teddies
optimizer groups tuples with the same history into data packets which
are then scheduled on a per packet basis through the query tree.
Corresponding to the introduction of this second dimension -- the
packet granularity -- we propose an adaptive scheduler that can react
to not only the varying statistics of the input streams and the
selectivity of the operators, but also to the fluctuations in the
internal packet sizes. The scheduler degrades to the Eddies scheduler
in the worst-case scenario. We present experimental results that
compare both the reaction time as well as the scalability of the
Teddies system with the Eddies and the data flow systems, and classify
the conditions under which the Teddies' simple packet optimizer
strategy outperforms the per-tuple Eddies optimizer strategy.
- Flood Little, Cache More: Effective Result-reuse in P2P IR Systems
Christian Zimmer, Srikanta Bedathur, Gerhard Weikum (MPI for Informatics, Germany)
State-of-the-art Peer-to-Peer Information Retrieval (P2P IR) systems
suffer from a lack of response-time guarantees, especially at
scale. To address this issue, a number of techniques for caching
multi-term inverted list intersections and query results have been
proposed recently. Although these enable speedy query evaluations
with low network overheads, they fail to consider the potential
impact of caching on result quality improvements. In this paper, we
propose the use of a cache-aware query routing scheme that not only
reduces the response delays for a query, but also presents an
opportunity to improve the result quality while keeping the network
usage low. In this regard, we make three contributions in this
paper. First, we develop a cache-aware, multi-round query
routing strategy that balances query efficiency and
result-quality. Next, we propose to aggressively reuse the cached
results of even subsets of a query towards an approximate caching
technique that can drastically reduce the bandwidth overheads, and
study the conditions under which such a scheme can retain good
result-quality. Finally, we empirically evaluate these techniques
over a fully functional P2P IR system, using a large-scale Wikipedia
benchmark, and using both synthetic and real-world query workloads.
Our results show that our proposal to combine result caching with
multi-round, cache-aware query routing can reduce network traffic by
more than half while doubling the result quality.
- Load Balancing for Moving Object Management in a P2P Network
Mohammed Eunus Ali, Egemen Tanin, Rui Zhang, Lars Kulik (U. of Melbourne, Australia)
Online games and location-based services are now potential
application domains for the P2P paradigm. In P2P systems,
balancing the workload is essential for overall performance.
However, existing load balancing techniques for P2P systems were
designed for stationary data. They can produce undesirable
workload allocations for moving objects, which are continuously
updated. In this paper, we propose a novel load balancing
technique for moving object management using a P2P network. Our
technique considers the mobility of moving objects and uses an
accurate cost model to optimize the performance in the management
network, in particular for handling location updates in tandem
with query processing. In a comprehensive set of experiments, we
show that our load balancing technique gives consistently better
update and query performance results than existing load balancing
techniques.
- Serializable Executions with Snapshot Isolation: Modifying Application Code or Mixing Isolation Levels?
Mohammad Alomari, Michael Cahill, Alan Fekete, Uwe Roehm (U. of Sydney, Australia)
Snapshot Isolation concurrency control (SI) allows substantial
performance gains compared to holding commit-duration read locks,
while still avoiding many anomalies such as lost updates or
inconsistent reads. However, for some sets of application programs,
SI can result in non-serializable execution, in which database
integrity constraints can be violated. The literature reports two
different approaches to ensuring all executions are serializable
while still using SI for concurrency control. In one approach, the
application programs are modified (without changing their stand-alone
semantics) by introducing extra conflicts. In the other approach, the
application programs are not changed, but a small subset of them must
be run using standard two-phase locking rather than SI. We compare
the performance impact of these two approaches. Our conclusion is that the convenience of preserving the application code
(and adjusting only the isolation level for some transactions) leads
to a very substantial performance penalty against the best that can be done through application modification.
- SemanticTwig: A Semantic Approach to Optimize XML Query Processing
Zhifeng Bao,Tok Wang Ling (National U. of Singapore), Jiaheng Lu (U. of California, Irvine, USA), Bo Chen (National U. of Singapore)
Twig pattern matching (TPM) is the core operation of XML query processing. Existing approaches rely on either efficient data structures or novel labeling/indexing schemes to reduce the intermediate result size, but none of them takes into account the rich semantic information residing in the XML document and the query issued. Moreover, in order to fulfill the semantics of the XPath/XQuery query, most of them require costly post-processing to eliminate redundant matches and group matching results. In this paper, we propose an innovative semantics-aware query optimization approach to overcome these limitations. In particular, we exploit the functional dependencies derived from the given semantic information to stop query processing early; we distinguish the output and predicate nodes of a query, then propose a query breakup technique and build a query plan such that, for each distinct query output, we avoid finding the redundant matches having the same result as the first match in most cases. Both I/O and structural join costs are saved, and far fewer intermediate results are produced. Experiments show the effectiveness of our optimization.
- Approximate XML Query Answers in DHT-Based P2P Networks
Weimin He, Leonidas Fegaras (U. of Texas, Arlington, USA)
Due to the increasing number of independent data providers on the web, there is a growing number of web applications that require locating data sources distributed over the Internet. Most current proposals in the literature focus on developing effective routing data synopses to answer simple XPath queries in structured or unstructured P2P networks. In this paper, we present an effective framework to support XPath queries extended with full-text search predicates over schema-less XML data distributed in a DHT-based P2P network. We construct two concise routing data synopses, termed structural summary and peer-document synopsis, to route a user query to the most relevant peers, those owning documents that can satisfy the query. To evaluate the structural components in the query, a general query footprint derivation algorithm is developed to extract the query footprint from the query and match it against structural summaries. To improve search performance, we adopt a lazy query evaluation strategy for evaluating the full-text search predicates in the query. Finally, we develop effective strategies to balance the data load distribution in the system. We conduct extensive experiments to show the scalability of our system, validate the efficiency and accuracy of our routing data synopses, and demonstrate the effectiveness of our load balancing schemes.
- Efficient Top-k Search across Heterogeneous XML Data Sources
Jianxin Li, Chengfei Liu (Swinburne U. of Technology, Australia), Jeffrey Xu Yu (Chinese U. of Hong Kong, China), Rui Zhou (Swinburne U. of Technology, Australia)
An important issue arising from XML query relaxation is how to
efficiently search the top-k best answers from a large number of XML
data sources, while minimizing the searching cost, i.e., finding the
k matches with the highest computed scores by only traversing part
of the documents. This paper resolves this issue by proposing a
bound-threshold based scheduling strategy. It can answer a top-k XML
query as early as possible by dynamically scheduling the query over
XML documents. In this work, the total number of documents that need
to be visited can be greatly reduced by skipping those documents
that will not produce the desired results with the bound-threshold
strategy. Furthermore, most of the candidates in each visited
document can also be pruned based on the intermediate results. Most
importantly, partial results can be output immediately during
query execution, rather than only after all results
have been determined. Our experimental results show that our query
scheduling and processing strategies are both practical and
efficient.
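The skipping idea can be sketched with a generic bound-threshold loop (an assumed illustration, not the paper's scheduler): visit documents in decreasing order of a cheap upper bound on their best possible score, and stop once the bound cannot beat the current k-th best answer.

```python
import heapq

# Hypothetical bound-threshold top-k search. `score_matches` stands in
# for actual twig matching inside one document.

def top_k(documents, k, upper_bound, score_matches):
    heap = []  # min-heap holding the k best scores seen so far
    for doc in sorted(documents, key=upper_bound, reverse=True):
        if len(heap) == k and upper_bound(doc) <= heap[0]:
            break  # no remaining document can enter the top-k
        for s in score_matches(doc):
            if len(heap) < k:
                heapq.heappush(heap, s)
            elif s > heap[0]:
                heapq.heapreplace(heap, s)
    return sorted(heap, reverse=True)

docs = [[0.9, 0.4], [0.8], [0.3, 0.2]]  # per-document match scores
print(top_k(docs, 2, upper_bound=max, score_matches=iter))
# -> [0.9, 0.8]; the last document is never traversed
```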
- Example-Based Robust DB-Outlier Detection for High Dimensional Data
Yuan Li, Hiroyuki Kitagawa (U. of Tsukuba, Japan)
This paper presents an outlier detection method to identify exceptional objects that match user intentions in high-dimensional datasets. Outlier detection is a crucial element of many applications, such as financial analysis and fraud detection. Numerous investigations have been made, but the results show that current methods fail to directly discover outliers from high-dimensional datasets due to the curse of dimensionality. Beyond that, many algorithms require several decisive parameters to be predefined. Such vital parameters are considerably difficult to determine without examining the dataset beforehand. To address these problems, we take an example-based approach and examine the behaviors of projections of the outlier examples in a dataset. An example-based approach is promising, since users can probably provide a few outlier examples to suggest what they want to detect. An important point is that the method should be robust even if the user-provided examples include noise or inconsistencies. Our proposed method is based on the notion of DB- (Distance-Based) Outliers. Experiments demonstrate that our proposed method is effective and efficient on both synthetic and real datasets and can tolerate noisy examples.
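For reference, the classic DB(p, D)-outlier notion the paper builds on says: a point is an outlier if at least a fraction p of all points lie farther than distance D from it. A brute-force sketch (the paper instead learns suitable projections and parameters from user examples, and both p and D below are illustrative):

```python
import numpy as np

# Brute-force DB(p, D)-outlier check over a small 2D dataset.

def db_outliers(points, p=0.95, d=2.0):
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))      # pairwise distances
    far_fraction = (dist > d).mean(axis=1)         # per-point fraction
    return np.where(far_fraction >= p)[0]

rng = np.random.default_rng(0)
pts = np.vstack([rng.standard_normal((50, 2)), [[10.0, 10.0]]])
print(db_outliers(pts))  # index 50, the far-away point, is reported
```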
- A Novel Fingerprint Matching Method by Excluding Elastic Distortion
Keming Mao, Guoren Wang (Northeastern U., China)
Fingerprint matching is a key issue in fingerprint recognition
systems. Although there is already much research on fingerprint
matching algorithms, reliable person authentication remains a
challenging problem because of the complex distortions involved in
fingerprints. Non-linear deformation can change both the position
and the direction of minutiae and hence decrease their reliability. In
this paper, we propose a novel minutiae-based fingerprint matching
algorithm. First, a new structure, the Adjacent Feature Union (AFU), is
defined, which is invariant to rotation and translation. AFU
represents the local feature of a fingerprint and it is used to
align the fingerprint. Moreover, the information of the ridge
frequency and block orientation is utilized to construct a
distortion-tolerant model. Using this model, the position and
direction of the minutiae can be readjusted, thus enhancing
matching performance. The proposed method minimizes the
effect of elastic distortion in fingerprints. Experimental
results demonstrate the effectiveness of the matching method.
- Approximate Clustering of Time Series Using Compact Model-based Descriptions
Hans-Peter Kriegel, Peer Kroger, Alexey Pryakhin, Matthias Renz, Andrew Zherdin (Ludwig Maximilians U., Germany)
Clustering time series is usually limited by the fact that the
length of the time series has a significantly negative influence on
the runtime. On the other hand, approximate clustering applied to
existing compressed representations of time series (e.g., obtained
through dimensionality reduction) usually suffers from low
accuracy. We propose a method for the compression of time series based
on mathematical models that exploit dependencies between different
time series. In particular, each time series is represented by a
combination of a set of specific reference time series. The cost of
this representation depends only on the number of reference time series
rather than on the length of the time series. We show that using only
a small number of reference time series yields a rather accurate
representation while reducing the storage cost and runtime of
clustering algorithms significantly. Our experiments illustrate that
these representations can be used to produce an approximate clustering
with high accuracy and considerably reduced runtime.
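The representation idea can be sketched with an ordinary least-squares fit: approximate each time series as a linear combination of a few reference series, so only the coefficients feed the clustering. Selecting good reference series is the paper's contribution; here they are simply given.

```python
import numpy as np

# Hypothetical model-based compression: represent a length-n series by
# k least-squares coefficients over k reference series.

def compress(series, references):
    """Coefficients expressing `series` in terms of the rows of
    `references` (shape k x n), found by least squares."""
    coeffs, *_ = np.linalg.lstsq(references.T, series, rcond=None)
    return coeffs  # k values instead of n

t = np.linspace(0, 1, 200)
refs = np.vstack([np.sin(2 * np.pi * t), t])   # two reference series
x = 3.0 * np.sin(2 * np.pi * t) + 0.5 * t      # a series to compress
print(compress(x, refs))  # ~ [3.0, 0.5]: 2 numbers describe 200 points
```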
- An Approach for Extracting Bilingual Terminology from Wikipedia
Maike Erdmann, Kotaro Nakayama, Takahiro Hara, Shojiro Nishio (Osaka U., Japan)
With the demand for bilingual dictionaries covering domain-specific terminology, research in the field of automatic dictionary extraction has become popular. However, the accuracy and coverage of dictionaries created from bilingual text corpora are often insufficient for domain-specific terms. Therefore, we present an approach to extracting bilingual dictionaries from the link structure of Wikipedia, a huge-scale encyclopedia that contains a vast amount of links between articles in different languages. Our methods analyze not only these interlanguage links but also extract further translation candidates from redirect page and link text information. In an experiment, we demonstrate the advantages of our methods compared to a traditional approach of extracting bilingual terminology from parallel corpora.
- Cost-Effective Search Strategy Framework for Automatic Dictionary Construction
Hideki Kawai, Hironori Mizuguchi, Masaaki Tsuchida (NEC Research, Japan)
In this paper, we propose a cost-effective search strategy framework to extract keywords in the same semantic class from the Web. Constructing a dictionary based on the bootstrapping technique is one promising approach to harnessing knowledge scattered around the Web. Open web application programming interfaces (APIs) are powerful tools for the knowledge-gathering process. However, we have to consider the cost of API calls, because too many queries can overload search engines, which also limit the number of API calls. Our goal is to optimize a search strategy that can collect as many new words as possible with the fewest API calls. Our results show that the optimized search strategy can extract 64,642 words in five different domains with a precision of 0.94 using only 1,000 search API calls.
- Learning Bayesian Network Structure from Incomplete Data without any Assumption
Celine Fiot (U. Montpellier II, France), G. A. Putri Saptawati (Institut Teknologi Bandung, Indonesia), Anne Laurent, Maguelonne Teisseire (U. Montpellier II, France)
Since most real-life data contain missing values, reasoning and learning with incomplete data has become crucial in data mining and machine learning. In particular, Bayesian networks are one machine learning technique that allows reasoning with incomplete data, but training such networks on incomplete data may be a difficult task. Many methods have thus been proposed to learn Bayesian network structure from incomplete data, based on generating multiple structures and scoring their adequacy to the dataset. However, this kind of approach can be time-consuming. We therefore propose an efficient dependency analysis approach that uses a redefinition of probability calculation to take incomplete records into account while learning BN structure, without generating multiple possibilities. Experiments on well-known benchmarks are described to show the validity of our proposal.