Keynotes

WAIM 2010 has invited four prominent researchers in information management and KDD areas. Their contributions are world-widely recognized. Their talks will cover the most interesting research topics and the current progress in these research areas. All of them are well-known world-class presenters. Their talks will bring a highlight of the conference.


Web Information Credibility

Speaker: Katsumi Tanaka, Kyoto University, Japan

Abstract: World Wide Web is the biggest repository of information and knowledge. Such information gives people a framework for organizing their private and professional lives. Research aimed at evaluating the credibility of Web content has recently become increasingly crucial because the Web has started to influence our daily lives. The abundance of content on the Web, the lack of publishing barriers, and poor quality control of Web content raise credibility issues. If users are not aware of the credibility of Web information, they can be easily misled, and sometimes it is dangerous to users.
    For example, some researchers reported that there are more than twenty thousand health-related sites on the Web, but more than half of such sites have not been reviewed by medical specialists. Wikipedia has been more popular on the Web, but the risks of Wikipedia are also indicated from the viewpoint of credibility. There are a lot of exaggerated ads and fake images and movies. The importance of the image forensic research also becomes important.
    Many dimensions concerned with the information credibility are grouped into two key components: expertise and trustworthiness. Expertise is a factor about the writer's ability to produce correct or fair information and the degree to which the reader can perceive knowledge and skill from the information. The expertise factor is defined by the terms knowledgeable, experienced, competent, and so on. The trustworthiness is a factor about readers' perceptions that the information is true as they know it, and it is the degree to which readers can perceive the goodness or morality of the target information. The trustworthiness factor is defined by the terms well-intentioned, unbiased, reputable, and so on. In the areas of Web search and mining, however, most of conventional research have focused on ranking search results based on popularity by analyzing link structures or on mining useful rules from the Web. They have not focused on the analysis of the credibility of target information.
    Consequently, few users perform rigorous evaluations of the credibility of obtained information. Therefore, the exploration of a general framework and automatic tools for supporting users in the judgment of web content credibility are becoming increasingly necessary.
    In this talk, we describe a new framework and methods for evaluating the Web information credibility. These include: a bipartite-graph framework for evaluating the credibility of relations, and several methods for analyzing Web.


Analyzing Data Quality using Data Auditor

Speaker: Divesh Srivastava, AT&T Lab, USA

Abstract: Monitoring databases maintain configuration and measurement tables about computer systems, such as networks and computing clusters, and serve important business functions, such as troubleshooting customer problems, analyzing equipment failures, planning system upgrades, etc. These databases are prone to many data quality issues: configuration tables may be incorrect due to data entry errors, while measurement tables may be affected by incorrect, missing, duplicate and delayed polls.
    We describe Data Auditor, a system for analyzing data quality and exploring data semantics of monitoring databases. Given a user-supplied constraint, such as a boolean predicate expected to be satisfied by every tuple, a functional dependency, or an inclusion dependency, Data Auditor computes "pattern tableaux", which are concise summaries of subsets of the data that satisfy or fail the constraint. We discuss the architecture of Data Auditor, including the supported types of constraints and the tableau generation mechanism. We also show the utility of our approach on an operational network monitoring database.


Rebuilding the World from Views

Speaker: Xiaofang Zhou, Professor, School of Information Technology and Electrical Engineering, The University of Queensland

Abstract: With the ever-increasing growth of the internet, more and more data sets are being made available. Most of this data has its origin in the real world, often describing the same objects or events from different viewpoints. One can thus consider data sets obtained from different sources as different (and possibly inconsistent) views of our world, and it makes sense to try to integrate them in some form, e.g. to answer questions which involve data from multiple sources. While data integration is an old and well-investigated subject, the nature of the data sets to be integrated is changing. They increase in volume as well as complexity, are often undocumented, relationships between data sets are more fuzzy, and representations of the same real-word object differ. To address these challenges, new methods for rapid, semi-automatic, loose and virtual integration, exploration and querying of large families of data sets must be developed.
    In an ongoing project we are investigating a framework for sampling and matching data sets in an efficient manner. In particular, we consider the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a pre-condition for typical joins over such attributes. We deal with the issue of differing representation of objects, i.e., ¡®dirty¡¯ data, by employing new similarity measures between sets of strings, which not only consider set based similarity, but also similarity between strings instances. To make the measures effective, especially in light of data sets being large and distributed, we developed efficient algorithms for distributed sample creation and similarity computation. Central to this is that sampling is synchronized. For clean data this means that the same values are sampled for each set (if present). For dirty data one must ensure that similar values are sampled for each set (if present), and we manage to do so in a probabilistic manner. Test results show that for dirty data our measures are more accurate for measuring value overlap than existing sample-based methods, and sometimes dramatically so, but we also observe that there is a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples, which consist of original values and are thus suitable for human inspection as well.
    The next step of our research is to extend such a sampling and matching approach to multiple attributes and semi-structured data, and to construct search and query systems which make direct use of the matches discovered.


Approximate Query Processing in Sensor Networks

Speaker: Jianzhong Li, Profeesor, School of Computer Science and Technology, Harbin Institute of Technology

Abstract: Many emerging applications are collecting massive volumes of sensor data from networks of distributed devices, such as sensor networks and cyber-physical systems. These environments are commonly characterized by the intrinsic volatility and uncertainty of the sensor data, and the strict communication (energy) constraints of the distributed devices. Approximate query processing is an important methodology that exploits the tolerance of many applications to inaccuracies in reported data in order to reserve communication overhead. The research challenge is how to ensure communication efficiency without sacrificing result usefulness.
    Many prior work depends on users to impose preferences or constraints on approximate query processing, such as result inaccuracies, candidate set size, and response time. We argue that the pre-determined user preferences may turn out to be inappropriate and become a substantial source of i) jeopardized query results, ii) prohibitive response time, and iii) overwhelming communication overhead.
     Our work 'Probing Queries in Wireless Sensor Networks' (ICDCS 2008) studies a scenario where empty sets may be returned as accurate query results, yet users may benefit from approximate answer sets not exactly conforming the specified query predicates. The approximate answer sets can be used not only to answer the query approximately but also to guide users to modify their queries for further probing the monitored objects. The distance between sensing data and a query and the dominating relationship between sensing data are first defined. Then, three algorithms for processing probing queries are proposed, which compute the best approximate answer sets that consist of the sensing data with the smallest distance from given queries. All the algorithms utilize the dominating relationship to reduce the amount of data transmitted in sensor networks by filtering out the unnecessary data. Experimental results on real and synthetic data sets show that the proposed algorithms have high performance and energy efficiency.
    Our work 'Enabling ε-Approximate Querying in Sensor Networks' (VLDB 2009) studies the scenario where, due to the dynamic nature of sensor data, users are unable to determine in advance what error bounds can lead to affordable cost in approximate query processing. We propose a novel ε-approximate querying (EAQ) scheme to resolve the problem. EAQ is a uniform data access scheme underlying various queries in sensor networks. The core idea of EAQ is to introduce run-time iteration and refinement mechanisms to enable efficient, ε-approximate query processing in sensor networks. Thus it grants more flexibility to in-network query processing and minimizes energy consumption through communicating data up to a just-sufficient level. To ensure bounded overall cost for the iteration and refinement procedures of the EAQ scheme, we develop a novel data shuffling algorithm. The algorithm converts sensed datasets into special representations called MVA. From prefixes of MVA, we can recover approximate versions of the entire dataset, where all individual data items have guaranteed error bounds. The EAQ scheme supports efficient and flexible processing of various queries including spatial window query, value range query, and queries with QoS constraints. The effectiveness and efficiency of the EAQ scheme are evaluated in a real sensor network testbed.
    Even in case the users know exactly what result inaccuracies they can tolerate, many prior query processing techniques still cannot meet arbitrary precision requirements given by users. Most notably, many aggregational query processing methods can only support fixed error bounds.
    In 'Sampling based (ε, δ)-Approximate Aggregation Algorithm in Sensor Networks' (ICDCS 2009), we propose a uniform sampling based aggregation algorithm. We prove that for any ε (ε <= 0) and δ (0 <= δ <= 1), this algorithm returns the approximate aggregation result satisfying that the probability of the relative error of the results being larger than ε is less than δ. However, this algorithm is only suitable for static network. Considering the dynamic property of sensor networks, we further proposed a Bernoulli sampling based algorithm in ¡®Bernoulli Sampling based (ε, δ)-Approximate Aggregation in Large-Scale Sensor Networks' (INFOCOM 2010). We prove that this algorithm also can meet the requirement of any precision, and is suitable for both static and dynamic networks. Besides, two sample data adaptive algorithms are also provided. One is to adapt the sample with the varying of precision requirement. The other is to adapt the sample with the varying of the sensed data in networks. The theoretical analysis and experiments show that all proposed algorithms have high performance in terms of accuracy and energy cost.