Conference Officers
PC Committee
Conference Program
short papers

F1: XML Schemas

  1. Holistic Constraint-Preserving Transformation from Relational Schema to XML Schema
    Rui Zhou, Chengfei Liu, Jianxin Li (Swinburne U. of Technology, Australia)

    In this paper, we propose a holistic scheme of transforming a relational schema into an XML Schema with integrity constraints preserved. This scheme facilitates constructing a schema for the published XML views of relational data. With this schema, users are able to issue qualified queries against XML views, and discover update anomalies in advance before propagating the view updates into relational database. Compared to the previous work which splits the transformation process into two steps, we establish a holistic solution to directly transform a relational schema into an XML Schema without building a reference graph. We achieve this by classifying the underlying relations in a more concise and effective way, and applying the converting rules wisely. The rules are also devised to be more compact and less complicated in contrast to those in our previous work. Finally, we manage to crack another hard nut which was seldom touched before, i.e. converting circularly referenced relations into recursive XML Schema.

  2. An Optimized Two-Step Solution for Updating XML Views
    Ling Wang, Ming Jiang, Elke Rundensteiner, Murali Mani (Worcester Polytechnic Inst., USA)

    View updating is a long standing difficult problem. Given a view defined over base data sources and a view update, there are several different updates over the base data sources, called translations, that perform the update. A translation is said to be correct if it performs the update and at the same time does not update any portion of the view not specified in the update (no view side-effects). The view update problem finds a correct translation for a given view update if one exists. In the relational scenario, previous research has attempted to study the view update problem either by utilizing only the schema knowledge, or by directly examining the base data. While utilizing only the schema knowledge is very efficient, we are not guaranteed to find a correct translation even if one exists. On the other hand, examining the base data is guaranteed to find a correct translation if one exists, but is very time-consuming. The view update problem is even more complex in the XML context due to the nested hierarchical structure of XML and the restructuring capabilities of the XQUERY view specification. In this paper we propose a schema-centric framework, named HUX, for efficiently updating XML views specified over relational databases. HUX is complete (always finds a correct translation if one exists) and is efficient. The efficiency of HUX is achieved as follows. Given a view update, HUX first exploits the schema to determine whether there will never be a correct translation, or there will always be a correct translation. Only if the update cannot be classified using the schema, HUX will examine the base data to determine if there is a correct translation. This data-level checking is further optimized in HUX, by exploiting the schema knowledge extracted in the first step to significantly prune the space of potential translations that is explored. Experiments illustrate the performance benefits of HUX over previous solutions.

  3. Even an Ant Can Create an XSD
    Ondrej Vosta, Irena Mlynkova, Jaroslav Pokorny (Charles U., Czech Republic)

    The XML has undoubtedly become a standard for data representation and manipulation. But most of XML documents are still created without the respective description of its structure, i.e. an XML schema. Hence, in this paper we focus on the problem of automatic inferring of an XML schema for a given sample set of XML documents. In particular, we focus on new features of XML Schema language and we propose an algorithm which is an improvement of a combination of verified approaches that is, at the same time, enough general and can be further enhanced. Using a set of experiments we illustrate the behavior of the algorithm on both real-world and artificial XML data.

F2: Data Mining

  1. A User Driven Data Mining Process Model and Learning System
    Esther Ge, Richi Nayak, Yue Xu, Yuefeng Li (Queensland U. of Technology, Australia)

    This paper deals with the problem of using the data mining models in a real-world situation where the user can not provide all the inputs with which the predictive model is built. A learning system framework, Query Based Learning System (QBLS), is developed for improving the performance of the predictive models in practice where not all inputs are available for querying to the system. The automatic feature selection algorithm called Query Based Feature Selection (QBFS) is developed for selecting features to obtain a balance between the relative minimum subset of features and the relative maximum classification accuracy. Performance of the QBLS system and the QBFS algorithm is successfully demonstrated with a real-world application.

  2. Efficient Mining of Recurrent Rules from a Sequence Database
    David Lo, Siau-Cheng Khoo (National U. of Singapore), Chao Liu (U. of Illinois, Urbana-Champaign, USA)

    We study a novel problem of mining significant recurrent rules from a sequence database. Recurrent rules have the form ``whenever a series of precedent events occurs, eventually a series of consequent events occurs''. Recurrent rules are intuitive and characterize behaviors in many domains. An example is in the domain of software specifications, in which the rules capture a family of program properties beneficial to program verification and bug detection. Recurrent rules generalize existing work on sequential and episode rules by considering repeated occurrences of premise and consequent events within a sequence and across multiple sequences, and by removing the "window" barrier. Bridging the gap between mined rules and program specifications, we formalize our rules in linear temporal logic. We introduce and apply a novel notion of rule redundancy to ensure efficient mining of a compact representative set of rules. Performance studies on benchmark datasets and a case study on an industrial system have been performed to show the scalability and utility of our approach.

  3. Uniqueness Mining
    Rohit Paravastu, Hanuma Kumar, Vikram Pudi (IIIT Hyderabad, India)

    In this paper we consider the problem of extracting the special properties of any given record in a dataset. We are interested in determining what makes a given record unique or different from the majority of the records in a dataset. In the real world, records typically represent objects or people and it is often worthwhile to know what special properties are present in each object or person, so that we can make the best use of them. This problem has not been considered earlier in the research literature. We approach this problem using ideas from clustering, attribute oriented induction (AOI) and frequent itemset mining. Most of the time consuming work is done in a preprocessing stage and the online computation of the uniqueness of a given record is instantaneous.

F3: Spatial Data

  1. Discovering Spatial Interaction Patterns
    Chang Sheng, Wynne Hsu, Mong Li Lee, Anthony Tung (National U. of Singapore)

    Advances in sensing and satellite technologies and the growth of Internet have resulted in the easy accessibility of vast amount of spatial data. Extracting useful knowledge from these data is an important and challenging task, in particular, finding interaction among spatial features. Existing works typically adopt a grid-like approach to transform the continuous spatial space to a discrete space. In this paper, we propose to model the spatial features in a continuous space through the use of influence functions. For each feature type, we build an influence map that captures the distribution of the feature instances. Superimposing the influence maps allows the interaction of the feature types to be quickly determined. Experiments on both synthetic and real world datasets indicate that the proposed approach is scalable and is able to discover patterns that have been missed by existing methods.

  2. Topological Relationships Between Map Geometries
    Mark McKenney, Markus Schneider (U. of Florida, USA)

    The importance of topological relationships between spatial objects is recognized in many disciplines. In the field of spatial databases, topological relationships have played an important role, providing mechanisms for spatial constraint specification and spatial indexing techniques. The use of spatial data in a map form has become popular, resulting in spatial data models such as topologies or, more generally, map geometries, which model collections of spatial objects that satisfy certain topological constraints. However, the topological relationships between map geometries remain unexplored. In this paper, we identify the set of valid topological relationships between map geometries, and then provide a mechanism by which they can be directly implemented on top of existing systems by using topological relationships between regions.

  3. MBR Models for Uncertainty Regions of Moving Objects
    Shayma Alkobaisi, Wan D. Bae, Seon Ho Kim (U. of Denver, USA), Byunggu Yu (U. of District of Columbia, USA)

    The increase in the advanced location based services such as traffic coordination and management necessitates the need for advanced models tracking the positions of Moving Objects (MOs) like vehicles. Computers cannot continuously update locations of MOs because of computational overhead, which limits the accuracy of evaluating MOs' positions. Due to the uncertain nature of such positions, efficiently managing and quantifying the uncertainty regions of MOs are needed in order to improve query response time. These regions can be rather irregular which makes them unsuitable for indexing. This paper presents Minimum Bounding Rectangles (MBR approximations for three uncertainty region models, namely, the Cylinder Model (CM), the Funnel Model of Degree 1 (FMD1) and the Funnel Model of Degree 2 (FMD2). We also propose an estimation of the MBR of FMD2 that achieves a good balance between computation time and selectivity (false-hits). Extensive experiments on both synthetic and real spatio-temporal datasets showed an order of magnitude improvement of the estimated model over the other modeling methods in terms of the number of MBRs retrieved during query process, which directly corresponds to the number of physical page accesses.

F4: Indexes and Cubes

  1. Summarization Graph Indexing: Beyond Frequent Structure-based Approach
    Lei Zou (Huazhong U. of Science and Technology, China), Lei Chen (Hong Kong U. of Science and Technology, China), Huaming Zhang (U. of Alabama, Huntsville, USA), YanSheng Lu, Qiang Lou (Huazhong U. of Science and Technology, China)

    Graph is an important data structure to model complex structural data, such as chemical compounds, proteins, and XML documents. Among many graph data-based applications, sub-graph search is a key problem, which is defined as given a query Q, retrieving all graphs containing Q as a sub-graph in the graph database. Most existing sub-graph search methods try to filter out false positives (graphs that are not possible in the results) as many as possible by indexing some frequent sub-structures in graph database. However, due to ignoring the relationships between sub-structures, these methods still admit a high percentage of false positives. In this paper, we propose a novel concept, Summarization Graph, which is a complete graph and captures most topology information of the original graph, such as sub-structures and their relationships. Based on Summarization Graphs, we convert the filtering problem into retrieving objects with set-valued attributes. Moreover, we build an efficient signature file-based index to improve the filtering process. We prove theoretically that the pruning power of our method is larger than existing structure-based approaches. Finally, we show by extensive experimental study on real and synthetic data sets that the size of candidate set generated by Summarization Graph-based approach is only about 50% of that left by existing graph indexing methods, and the total response time of our method is reduced 2-10 times.

  2. Best Paper: Bulk-Loading the ND-Tree in Non-ordered Discrete Data Spaces
    Hyun-Jeong Seok (U. of Michigan, USA), Gang Qian, Qiang Zhu, Alexander Oswald (U. of Central Oklahoma, USA), Sakti Pramanik (Michigan State U., USA)

    Applications demanding multidimensional index structures for performing efficient similarity queries often involve a large amount of data. The conventional tuple-loading approach to building such an index structure for a large data set is inefficient. To overcome the problem, a number of algorithms to bulk-load the index structures, like the R-tree, from scratch for large data sets in continuous data spaces have been proposed. However, many of them cannot be directly applied to a non-ordered discrete data space (NDDS) where data values on each dimension are discrete and have no natural ordering. No bulk-loading algorithm has been developed specifically for an index structure, such as the ND-tree, in an NDDS. In this paper, we present a bulk-loading algorithm, called the NDTBL, for the ND-tree in NDDSs. It adopts a special in-memory structure to efficiently construct the target ND-tree. It utilizes and extends some operations in the original ND-tree tuple-loading algorithm to exploit the properties of an NDDS in choosing and splitting data sets/nodes during the bulk-loading process. It also employs some strategies such as multi-way splitting and memory buffering to enhance efficiency. Our experimental studies show that the presented algorithm is quite promising in bulk-loading the ND-tree for large data sets in NDDSs.

  3. An Incremental Maintenance Scheme of Data Cubes
    Dong Jin, Tatsuo Tsuji, Takayuki Tsuchida, Ken Higuchi (U. of Fukui, Japan)

    Data cube construction is a commonly used operation in data warehouses. Because of the volume of data stored and analyzed in a data warehouse and the amount of computation involved in data cube construction, incremental maintenance of data cube is really effective. To maintain a data cube incrementally, previous methods were mainly for relational databases. In this paper, we employ an extendible multidimensional array model to maintain data cubes. Such an array enables incremental cube maintenance without relo-cating any data dumped at an earlier time, while computing the data cube effi-ciently by utilizing the fast random accessing capability of arrays. Our data cube scheme and related maintenance methods are presented in this paper, and cost analysis on our approach is shown and compared with existing methods.

F5: Data Streams

  1. A Data Partition Based Near Optimal Scheduling Algorithm for Wireless Multi-Channel Data Broadcast
    Ping Yu, Weiwei Sun, Yongrui Qin, Zhuoyao Zhang, Bole Shi (Fudan U., China)

    Data broadcast is an efficient way to disseminate information to large numbers of users in wireless environments. The Square Root Rule (SRR) is the theoretical basis for the single channel broadcast scheduling. In this paper, we extend the SRR and propose the Multi-channel Square Root Rule (MSRR) for scheduling variable-length data with skewed access probabilities on variable-bandwidth channels. The theoretical optimal average access latency is also provided. However, this optimal value can not be achieved in reality. Based on MSRR, we provide a two-phase scheduling algorithm which achieves near optimal access latency. First data are partitioned and allocated to different channels according to MSRR. Second, different scheduling strategies are adopted on each channel according to the skewness of data subset allocated on that channel. Experiments show that the difference of average access latency between our results and the optimal value is below five percent in most situations.

  2. A Test Paradigm for Detecting Changes in Transactional Data Streams
    Willie Ng, Manoranjan Dash (Nanyang Technological U., Singapore)

    A pattern is considered useful if it can be used to help a person to achieve his goal. Mining data streams for useful patterns is important in many applications. However, data stream can change their behavior over time and, when significant change occurs, much harm is done to the mining result if it is not properly handled. In the past, there have been many studies mainly on adapting to changes in data streams. We contend that adapting to changes is simply not enough. The ability to detect and characterize change is also essential in many applications, for example intrusion detection, network traffic analysis, data streams from intensive care units etc. Detecting changes is nontrivial. In this paper, an online algorithm for change detection in utility mining is proposed. In order to provide a mechanism for making quantitative description of the detected change, we adopt the statistical test. We believe there is the opportunity for an immensely rewarding synergy between data mining and statistic. Different statistical significance tests are evaluated and our study shows that the Chi-square test is the most suitable for enumerated or count data (as is the case for high utility itemsets). We demonstrate the effectiveness of the proposed method by testing it on IBM QUEST market-basket data.

  3. Teddies: Trained Eddies for Reactive Stream Processing
    Kajal Claypool (Oracle Corporation, USA), Mark Claypool (Worcester Polytechnic Inst., USA)

    In this paper, we present an adaptive stream query processor, Teddies, that combines the key advantages of the Eddies system with the scalability of the more traditional dataflow model. In particular, we introduce the notion of adaptive packetization of tuples to overcome the large memory requirements of the Eddies system. The Teddies optimizer groups tuples with the same history into data packets which are then scheduled on a per packet basis through the query tree. Corresponding to the introduction of this second dimension -- the packet granularity -- we propose an adaptive scheduler that can react to not only the varying statistics of the input streams and the selectivity of the operators, but also to the fluctuations in the internal packet sizes. The scheduler degrades to the Eddies scheduler in the worst case scenario. We present experimental results that compare both the reaction time as well as the scalability of the Teddies system with the Eddies and the data flow systems, and classify the conditions under which the Teddies' simple packet optimizer strategy outperforms the per-tuple Eddies optimizer strategy.

F6: P2P and Transactions

  1. Flood Little, Cache More: Effective Result-reuse in P2P IR Systems
    Christian Zimmer, Srikanta Bedathur, Gerhard Weikum (MPI for Informatics, Germany)

    State-of-the-art Peer-to-Peer Information Retrieval (P2P IR) systems suffer from their lack of response time guarantee especially with scale. To address this issue, a number of techniques for caching of multi-term inverted list intersections and query results have been proposed recently. Although these enable speedy query evaluations with low network overheads, they fail to consider the potential impact of caching on result quality improvements. In this paper, we propose the use of a cache-aware query routing scheme, that not only reduces the response delays for a query, but also presents an opportunity to improve the result quality while keeping the network usage low. In this regard, we make three-fold contributions in this paper. First of all, we develop a cache-aware, multi-round query routing strategy that balances between query efficiency and result-quality. Next, we propose to aggressively reuse the cached results of even subsets of a query towards an approximate caching technique that can drastically reduce the bandwidth overheads, and study the conditions under which such a scheme can retain good result-quality. Finally, we empirically evaluate these techniques over a fully functional P2P IR system, using a large-scale Wikipedia benchmark, and using both synthetic and real-world query workloads. Our results show that our proposal to combine result caching with multi-round, cache-aware query routing can reduce network traffic by more than half while doubling the result quality.

  2. Load Balancing for Moving Object Management in a P2P Network
    Mohammed Eunus Ali, Egemen Tanin, Rui Zhang, Lars Kulik (U. of Melbourne, Australia)

    Online games and location-based services now form the potential application domains for the P2P paradigm. In P2P systems, balancing the workload is essential for overall performance. However, existing load balancing techniques for P2P systems were designed for stationary data. They can produce undesirable workload allocations for moving objects that is continuously updated. In this paper, we propose a novel load balancing technique for moving object management using a P2P network. Our technique considers the mobility of moving objects and uses an accurate cost model to optimize the performance in the management network, in particular for handling location updates in tandem with query processing. In a comprehensive set of experiments, we show that our load balancing technique gives constantly better update and query performance results than existing load balancing techniques.

  3. Serializable Executions with Snapshot Isolation: Modifying Application Code or Mixing Isolation Levels?
    Mohammad Alomari, Michael Cahill, Alan Fekete, Uwe Roehm (U. of Sydney, Australia)

    Snapshot Isolation concurrency control (SI) allows substantial performance gains compared to holding commit-duration readlocks, while still avoiding many anomalies such as lost updates or inconsistent reads. However, for some sets of application programs, SI can result in non-serializable execution, in which database integrity constraints can be violated. The literature reports two different approaches to ensuring all executions are serializable while still using SI for concurrency control. In one approach, the application programs are modified (without changing their stand-alone semantics) by introducing extra conflicts. In the other approach, the application programs are not changed, but a small subset of them must be run using standard two-phase locking rather than SI. We compare the performance impact of these two approaches. Our conclusion is that the convenience of preserving the application code (and adjusting only the isolation level for some transactions) leads to a very substantial performance penalty against the best that can be done through application modification.

F7: XML Processing

  1. SemanticTwig: A Semantic Approach to Optimize XML Query Processing
    Zhifeng Bao,Tok Wang Ling (National U. of Singapore), Jiaheng Lu (U. of California, Irvine, USA), Bo Chen (National U. of Singapore)

    Twig pattern matching (TPM) is the core operation of XML query processing. Existing approaches rely on either efficient data structures or novel labeling/indexing schemes to reduce the intermediate result size, but none of them takes into account the rich semantic information resided in XML document and the query issued. Moreover, in order to fulfill the semantics of the XPath/XQuery query, most of them require costly post processing to eliminate redundant matches and group matching results. In this paper, we propose an innovative semantics-aware query optimization approach to overcome these limitations. In particular, we exploit the functional dependency derived from the given semantic information to stop query processing early; we distinguish the output and predicate nodes of a query, then propose a query breakup technique and build a query plan, such that for each distinct query output, we avoid finding the redundant matches having the same results as the first match in most cases. Both I/O and structural join cost are saved, and much less intermediate results are produced. Experiments show the effectiveness of our optimization.

  2. Approximate XML Query Answers in DHT-Based P2P Networks
    Weimin He, Leonidas Fegaras (U. of Texas, Arlington, USA)

    Due to the increasing number of independent data providers on the web, there is a growing number of web applications that require locating data sources distributed over the internet. Most of the current proposals in the literature focus on developing effective routing data synopses to answer simple XPath queries in structured or unstructured P2P networks. In this paper, we present an effective framework to support XPath queries extended with full-text search predicates over schema-less XML data distributed in a DHT-based P2P network. We construct two concise routing data synopses, termed structural summary and peer-document synopsis, to route the user query to most relevant peers that own documents that can satisfy the query. To evaluate the structural components in the query, a general query footprint derivation algorithm is developed to extract the query footprint from the query and match it with structural summaries. To improve the search performance, we adopt a lazy query evaluation strategy for evaluating the full-text search predicates in the query. Finally, we develop effective strategies to balance the data load distribution in the system. We conduct extensive experiments to show the scalability of our system, validate the efficiency and accuracy of our routing data synopses, and demonstrate the effectiveness of our load balancing schemes.

  3. Efficient Top-k Search across Heterogeneous XML Data Sources
    Jianxin Li, Chengfei Liu (Swinburne U. of Technology, Australia), Jeffrey Xu Yu (Chinese U. of Hong Kong, China), Rui Zhou (Swinburne U. of Technology, Australia)

    An important issue arising from XML query relaxation is how to efficiently search the top-k best answers from a large number of XML data sources, while minimizing the searching cost, i.e., finding the k matches with the highest computed scores by only traversing part of the documents. This paper resolves this issue by proposing a bound-threshold based scheduling strategy. It can answer a top-k XML query as early as possible by dynamically scheduling the query over XML documents. In this work, the total amount of documents that need to be visited can be greatly reduced by skipping those documents that will not produce the desired results with the bound-threshold strategy. Furthermore, most of the candidates in each visited document can also be pruned based on the intermediate results. Most importantly, the partial results can be output immediately during the query execution, rather than waiting for the end of all results to be determined. Our experimental results show that our query scheduling and processing strategies are both practical and efficient.

F8: Complex Pattern Processing

  1. Example-Based Robust DB-Outlier Detection for High Dimensional Data
    Yuan Li, Hiroyuki Kitagawa (U. of Tsukuba, Japan)

    This paper presents a method of outlier detection to identify exceptional objects that match user intentions in high dimensional datasets. Outlier detection is a crucial element of many applications like financial analysis and fraud detection. Scholars have made numerous investigations, but the results show that current methods fail to directly discover outliers from high dimensional datasets due to the curse of dimensionality. Beyond that, many algorithms require several decisive parameters to be predefined. Such vital parameters are considerably difficult to determine without identifying datasets beforehand. To address these problems, we take an Example-Based approach and examine behaviors of projections of the outlier examples in a dataset. An example-based approach is promising, since users are probably able to provide a few outlier examples to suggest what they want to detect. An important point is that the method should be robust, even if user-provided examples include noises or inconsistencies. Our proposed method is based on the notion of DB- (Distance-Based) Outliers. Experiments demonstrate that our proposed method is effective and efficient on both synthetic and real datasets and can tolerate noise examples.

  2. A Novel Fingerprint Matching Method by Excluding Elastic Distortion
    Keming Mao, Guoren Wang (Northeastern U., China)

    Fingerprint matching is a key issue in fingerprint recognition systems. Although there already exist many researches about fingerprint matching algorithm, it is still a challenging problem for reliable person authentication because of the complex distortions involved in the fingerprint. The non-linear deformation can lead to the change of both position and direction of minutiae and hence decrease the reliability of the minutiae. In this paper, we propose a novel minutiae-based fingerprint matching algorithm. First a new structure Adjacent Feature Union(AFU) is defined, which is invariant to rotation and translation. AFU represents the local feature of a fingerprint and it is used to align the fingerprint. Moreover, the information of the ridge frequency and block orientation is utilized to construct a distortion-tolerant model. Using this model the position and direction of the minutiae can be readjusted and thus enhance the performance of the matching. The proposed method can minimize the effect of the elastic distortion in fingerprint. Experiment results demonstrate the effectiveness of the matching method.

  3. Approximate Clustering of Time Series Using Compact Model-based Descriptions
    Hans-Peter Kriegel, Peer Kroger, Alexey Pryakhin, Matthias Renz, Andrew Zherdin (Ludwig Maximilians U., Germany)

    Clustering time series is usually limited by the fact that the length of the time series has a significantly negative influence on the runtime. On the other hand, approximative clustering applied to existing compressed representations of time series (e.g. obtained through dimensionality reduction) usually suffers from low accuracy. We propose a method for the compression of time series based on mathematical models that explore dependencies between different time series. In particular, each time series is represented by a combination of a set of specific reference time series. The cost of this representation depend only on the number of reference time series rather than on the length of the time series. We show that using only a small number of reference time series yields a rather accurate representation while reducing the storage cost and runtime of clustering algorithms significantly. Our experiments illustrate that these representations can be used to produce an approximate clustering with high accuracy and considerably reduced runtime.

F9: IR Techniques

  1. An Approach for Extracting Bilingual Terminology from Wikipedia
    Maike Erdmann, Kotaro Nakayama, Takahiro Hara, Shojiro Nishio (Osaka U., Japan)

    With the demand of bilingual dictionaries covering domain-specific terminology, research in the field of automatic dictionary extraction has become popular. However, accuracy and coverage of dictionaries created based on bilingual text corpora are often not sufficient for domain-specific terms. Therefore, we present an approach to extracting bilingual dictionaries from the link structure of Wikipedia, a huge scale encyclopedia that contains a vast amount of links between articles in different languages. Our methods analyze not only these interlanguage links but extract even more translation candidates from redirect page and link text information. In an experiment, we proved the advantages of our methods compared to a traditional approach of extracting bilingual terminology from parallel corpora.

  2. Cost-Effective Search Strategy Framework for Automatic Dictionary Construction
    Hideki Kawai, Hironori Mizuguchi, Masaaki Tsuchida (NEC Research, Japan)

    In this paper, we propose a cost-effective search strategy framework to extract keywords in the same semantic class from the Web. Constructing a dictionary based on the bootstrapping technique is one promising approach to harnessing knowledge scattered around the Web. Open web application programming interfaces (APIs) are powerful tools for the knowledge-gathering process. However, we have to consider the cost of API calls because too many queries can overload the search engines, and they also limit the number of API calls. Our goal is to optimize a search strategy that can collect as many new words as possible with the least API calls. Our results show that the optimized search strategy can extract 64,642 words in five different domains with a precision of 0.94 with only 1,000 search API calls.

  3. Learning Bayesian Network Structure from Incomplete Data without any Assumption
    Celine Fiot (U. Montpellier II, France), G. A. Putri Saptawati (Institut Teknologi Bandung, Indonesia), Anne Laurent, Maguelonne Teisseire (U. Montpellier II, France)

    Since most real-life data contain missing values, reasoning and learning with incomplete data has become crucial in data mining and machine learning. In particular, Bayesian networks are one machine learning technique that allows for reasoning with incomplete data, but training such networks on incomplete data may be a difficult task. Many methods were thus proposed to learn Bayesian network structure from incomplete data, based on multiple structure generation and scoring of their adequacy to the dataset. However, this kind of approaches may be time-consuming. Therefore we propose an efficient dependency analysis approach that uses a redefinition of probability calculation to take incomplete records into account while learning BN structure, without generating multiple possibilities. Some experiments on well-known benchmarks are described to show the validity of our proposal.