<proceedings><paper><title>Conference Officers</title><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>External Reviewer</title><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>Language Models for Information Retrieval</title><author><AuthorName>W. Bruce Croft</AuthorName><institute><InstituteName>University of Massachusetts, Amherst</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>One of the major challenges in the field of information retrieval (IR) is to specify a formal framework that both describes the important processes involved in finding relevant information, and successfully predicts which techniques will provide good effectiveness in terms of accuracy. A recent approach that has shown considerable promise uses generative models of text (language models) to describe the IR processes. We briefly review the major variations of the language model approach and how they have been used to develop a range of retrieval-related language technologies, including cross-lingual IR and distributed search. We also discuss how this approach could be used with structured data extracted from text.</abstract></paper><paper><title>Out-of-the-Box Data Engineering - Events in Heterogeneous Data Environments</title><author><AuthorName>Ramesh Jain</AuthorName><institute><InstituteName>Georgia Institute of Technology, Atlanta</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Data has changed significantly over the last few decades. Computing systems that initially dealt with data and computation rapidly moved to information and communication. The next step on the evolutionary scale is insight and experience. Applications are demanding the use of live, spatio-temporal, heterogeneous data. Data engineering must keep pace by designing experiential environments that let users apply their senses to observe data and information about an event and to interact with aspects of the event that are of particular interest. I call this out-of-the-box data engineering because it means we must think beyond many of our timeworn perspectives and technical approaches.</abstract></paper><paper><title>Flux: An Adaptive Partitioning Operator for Continuous Query Systems</title><author><AuthorName>Mehul A. Shah</AuthorName><institute><InstituteName>U.C. Berkeley</InstituteName><country></country></institute></author><author><AuthorName>Joseph M. Hellerstein</AuthorName><institute><InstituteName>U.C. Berkeley</InstituteName><country></country></institute></author><author><AuthorName>Sirish Chandrasekaran</AuthorName><institute><InstituteName>U.C. Berkeley</InstituteName><country></country></institute></author><author><AuthorName>Michael J. Franklin</AuthorName><institute><InstituteName>U.C. Berkeley</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The long-running nature of continuous queries poses new scalability challenges for dataflow processing. CQ systems execute pipelined dataflows that may be shared across multiple queries. 
The scalability of these dataflows is limited by their constituent, stateful operators - e.g. windowed joins or grouping operators. To scale such operators, a natural solution is to partition them across a shared-nothing platform. But in the CQ context, traditional, static techniques for partitioned parallelism can exhibit detrimental imbalances as workload and runtime conditions evolve. Long-running CQ dataflows must continue to function robustly in the face of these imbalances. To address this challenge, we introduce a dataflow operator called Flux that encapsulates adaptive state partitioning and dataflow routing. Flux is placed between producer-consumer stages in a dataflow pipeline to repartition stateful operators while the pipeline is still executing. We present the Flux architecture, along with repartitioning policies that can be used for CQ operators under shifting processing and memory loads. We show that the Flux mechanism and these policies can provide a severalfold improvement in throughput and orders-of-magnitude improvement in average latency over the static case.</abstract></paper><paper><title>Coordinated Management of Cascaded Caches for Efficient Content Distribution</title><author><AuthorName>Xueyan Tang</AuthorName><institute><InstituteName>Hong Kong University of Science and Technology</InstituteName><country></country></institute></author><author><AuthorName>Samuel T. Chanson</AuthorName><institute><InstituteName>Hong Kong University of Science and Technology</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Large-scale content delivery systems such as the web often deploy multiple caches at different locations to reduce access latency and network traffic. These caches are usually organized in a cascaded fashion where requests not hitting a lower level cache are forwarded to a higher level cache. The performance of cascaded caching depends on how the cache contents are managed, including object placement and replacement schemes. This paper presents a general analytical framework for coordinated management of cascaded caches. The object placement problem is formulated as an optimization problem and the optimal locations for caching objects are computed by a dynamic programming algorithm. Based on the framework, we propose a novel caching scheme that incorporates both object placement and replacement strategies. The proposed scheme makes caching decisions for the set of caches lying on the delivery path of a request in a coordinated fashion. Simulation experiments based on real traces from web caches have been conducted under two different cascaded caching architectures: en-route caching and hierarchical caching. 
The results show that for both architectures, the proposed scheme significantly outperforms existing schemes that consider object placement or replacement at individual caches only.</abstract></paper><paper><title>Designing a Super-Peer Network</title><author><AuthorName>Beverly Yang</AuthorName><institute><InstituteName>Stanford University</InstituteName><country></country></institute></author><author><AuthorName>Hector Garcia-Molina</AuthorName><institute><InstituteName>Stanford University</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>A super-peer is a node in a peer-to-peer network that operates both as a server to a set of clients, and as an equal in a network of super-peers. Super-peer networks strike a balance between the efficiency of centralized search, and the autonomy, load balancing and robustness to attacks provided by distributed search. Furthermore, they take advantage of the heterogeneity of capabilities (e.g., bandwidth, processing power) across peers, which recent studies have shown to be enormous. Hence, new and old P2P systems like KaZaA and Gnutella are adopting super-peers in their design. Despite their growing popularity, the behavior of super-peer networks is not well understood. For example, what are the potential drawbacks of super-peer networks? How can super-peers be made more reliable? How many clients should a super-peer take on to maximize efficiency? In this paper we examine super-peer networks in detail, gaining an understanding of their fundamental characteristics and performance tradeoffs. We also present practical guidelines and a general procedure for the design of an efficient super-peer network.</abstract></paper><paper><title>Indexing Weighted-Sequences in Large Databases</title><author><AuthorName>Haixun Wang</AuthorName><institute><InstituteName>IBM T.J. Watson Research</InstituteName><country></country></institute></author><author><AuthorName>Chang-Shing Perng</AuthorName><institute><InstituteName>IBM T.J. Watson Research</InstituteName><country></country></institute></author><author><AuthorName>Wei Fan</AuthorName><institute><InstituteName>IBM T.J. Watson Research</InstituteName><country></country></institute></author><author><AuthorName>Sanghyun Park</AuthorName><institute><InstituteName>POSTECH</InstituteName><country></country></institute></author><author><AuthorName>Philip S. Yu</AuthorName><institute><InstituteName>IBM T.J. Watson Research</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We present an index structure for managing weighted-sequences in large databases. A weighted-sequence is defined as a two-dimensional structure where each element in the sequence is associated with a weight. A series of network events, for instance, is a weighted-sequence in that each event has a timestamp. Querying a large sequence database by events' occurrence patterns is a first step towards understanding the temporal causal relationships among the events. The index structure proposed in this paper enables us to efficiently retrieve from the database all subsequences, possibly non-contiguous, that match a given query sequence both by events and by weights. The index method also takes into consideration the non-uniform frequency distribution of events in the sequence data. 
In addition, our method finds a broad range of applications in indexing scientific data consisting of multiple numerical columns for discovery of correlations among these columns. For instance, indexing a DNA micro-array that records expression levels of genes under different conditions enables us to search for genes whose responses to various experimental perturbations follow a given pattern. We demonstrate, using real-world data sets, that our method is effective and efficient.</abstract></paper><paper><title>Similarity Search in Sets and Categorical Data Using the Signature Tree</title><author><AuthorName>Nikos Mamoulis</AuthorName><institute><InstituteName>University of Hong Kong, Pokfulam Road</InstituteName><country></country></institute></author><author><AuthorName>David W. Cheung</AuthorName><institute><InstituteName>University of Hong Kong, Pokfulam Road</InstituteName><country></country></institute></author><author><AuthorName>Wang Lian</AuthorName><institute><InstituteName>University of Hong Kong, Pokfulam Road</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Data mining applications analyze large collections of set data and high-dimensional categorical data. Search on these data types is not restricted to the classic problems of mining association rules and classification, but similarity search is also a frequently applied operation. Access methods for multidimensional numerical data are inappropriate for this problem and specialized indexes are needed. We propose a method that represents set data as bitmaps (signatures) and organizes them into a hierarchical index, suitable for similarity search and other related query types. In contrast to a previous technique, the signature tree is dynamic and does not rely on hardwired constants. Experiments with synthetic and real datasets show that it is robust to different data characteristics, scalable to the database size and efficient for various queries.</abstract></paper><paper><title>An Adaptive and Efficient Dimensionality Reduction Algorithm for High-Dimensional Indexing</title><author><AuthorName>Hui Jin</AuthorName><institute><InstituteName>The University of Michigan, Ann Arbor</InstituteName><country></country></institute></author><author><AuthorName>Beng Chin Ooi</AuthorName><institute><InstituteName>National University of Singapore</InstituteName><country></country></institute></author><author><AuthorName>Heng Tao Shen</AuthorName><institute><InstituteName>National University of Singapore</InstituteName><country></country></institute></author><author><AuthorName>Cui Yu</AuthorName><institute><InstituteName>Monmouth University, NJ</InstituteName><country></country></institute></author><author><AuthorName>Ao Ying Zhou</AuthorName><institute><InstituteName>Fudan University, China</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The notorious &quot;dimensionality curse&quot; is a well-known phenomenon for any multi-dimensional index attempting to scale up to high dimensions. One well known approach to overcoming degradation in performance with respect to increasing dimensions is to reduce the dimensionality of the original dataset before constructing the index. However, identifying the correlation among the dimensions and effectively reducing them is a challenging task. 
In this paper, we present an adaptive Multi-level Mahalanobis-based Dimensionality Reduction (MMDR) technique for high-dimensional indexing. Our MMDR technique has three notable features compared to existing methods. First, it discovers elliptical clusters using only the low-dimensional subspaces. Second, data points in the different axis systems are indexed using a single B+-tree. Third, our technique is highly scalable in terms of data size and dimensionality. An extensive performance study using both real and synthetic datasets was conducted, and the results show that our technique not only achieves higher precision, but also enables queries to be processed efficiently.</abstract></paper><paper><title>CLUSEQ: Efficient and Effective Sequence Clustering</title><author><AuthorName>Jiong Yang</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaign</InstituteName><country></country></institute></author><author><AuthorName>Wei Wang</AuthorName><institute><InstituteName>University of North Carolina at Chapel Hill</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Analyzing sequence data has recently become increasingly important in areas such as biological sequences, text documents, and web access logs. In this paper, we investigate the problem of clustering sequences based on their sequential features. As a widely recognized technique, clustering has proven to be very useful in detecting unknown object categories and revealing hidden correlations among objects. One difficulty that prevents clustering from being performed extensively on sequence data (in the categorical domain) is the lack of an effective yet efficient similarity measure. Therefore, we propose a novel model (CLUSEQ) for sequence clustering by exploring significant statistical properties possessed by the sequences. The conditional probability distribution (CPD) of the next symbol given a preceding segment is derived and used to characterize sequence behavior and to support the similarity measure. A variation of the suffix tree, namely the probabilistic suffix tree, is employed to organize (the significant portion of) the CPD in a concise way. A novel algorithm is devised to efficiently discover clusters with high quality and is able to automatically adjust the number of clusters to its optimal range via a unique combination of successive new cluster generation and cluster consolidation. The performance of CLUSEQ has been demonstrated via extensive experiments on several real and synthetic sequence databases.</abstract></paper><paper><title>Querying Text Databases for Efficient Information Extraction</title><author><AuthorName>Eugene Agichtein</AuthorName><institute><InstituteName>Columbia University</InstituteName><country></country></institute></author><author><AuthorName>Luis Gravano</AuthorName><institute><InstituteName>Columbia University</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>A wealth of information is hidden within unstructured text. This information is often best exploited in structured or relational form, which is suited for sophisticated query processing, for integration with relational databases, and for data mining. 
Current information extraction techniques extract relations from a text database by examining every document in the database, or use filters to select promising documents for extraction. The exhaustive scanning approach is not practical or even feasible for large databases, and the current filtering techniques require human involvement to maintain and to adapt to new databases and domains. In this paper, we develop an automatic query-based technique to retrieve documents useful for the extraction of user-defined relations from large text databases, which can be adapted to new domains, databases, or target relations with minimal human effort. We report a thorough experimental evaluation over a large newspaper archive that shows that we significantly improve the efficiency of the extraction process by focusing only on promising documents.</abstract></paper><paper><title>Distance Based Indexing for String Proximity Search</title><author><AuthorName>S. Cenk Sahinalp</AuthorName><institute><InstituteName>Case Western Reserve University</InstituteName><country></country></institute></author><author><AuthorName>Murat Tasan</AuthorName><institute><InstituteName>Case Western Reserve University</InstituteName><country></country></institute></author><author><AuthorName>Jai Macker</AuthorName><institute><InstituteName>Case Western Reserve University</InstituteName><country></country></institute></author><author><AuthorName>Z. Meral Ozsoyoglu</AuthorName><institute><InstituteName>Case Western Reserve University</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In many database applications involving string data, it is common to have near neighbor queries (asking for strings that are similar to a query string) or nearest neighbor queries (asking for strings that are most similar to a query string). The similarity between strings is defined in terms of a distance function determined by the application domain. The most popular string distance measures are based on (a weighted) count of (i) character edit or (ii) block edit operations to transform one string into the other. Examples include the Levenshtein edit distance and the recently introduced compression distance. The main goal in this paper is to develop efficient near(est) neighbor search tools that work for both character and block edit distances. Our premise is that distance-based indexing methods, which are originally designed for metric distances, can be modified for string distance measures, provided that they form almost metrics. We show that several distance measures, such as the compression distance and weighted character edit distance, are almost metrics. In order to analyze the performance of distance based indexing methods (in particular VP trees) for strings, we then develop a model based on the distribution of pairwise distances. Based on this model we show how to modify VP trees to improve their performance on string data, providing tradeoffs between search time and space. We test our theoretical results on synthetic data sets and protein strings.</abstract></paper><paper><title>Navigation- vs. 
Index-Based XML Multi-Query Processing</title><author><AuthorName>Nicolas Bruno</AuthorName><institute><InstituteName>Columbia University</InstituteName><country></country></institute></author><author><AuthorName>Luis Gravano</AuthorName><institute><InstituteName>Columbia University</InstituteName><country></country></institute></author><author><AuthorName>Nick Koudas</AuthorName><institute><InstituteName>AT&T Labs-Research</InstituteName><country></country></institute></author><author><AuthorName>Divesh Srivastava</AuthorName><institute><InstituteName>AT&T Labs-Research</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>XML path queries form the basis of complex filtering of XML data. Most current XML path query processing techniques can be divided into two groups. Navigation-based algorithms compute results by analyzing an input document one tag at a time. In contrast, index-based algorithms take advantage of precomputed numbering schemes over the input XML document. In this paper we introduce a new index-based technique, Index-Filter, to answer multiple XML path queries. Index-Filter uses indexes built over the document tags to avoid processing large portions of the input document that are guaranteed not to be part of any match. We analyze Index-Filter and compare it against Y-Filter, a state-of-the-art navigation-based technique. We show that both techniques have their advantages, and we discuss the scenarios under which each technique is superior to the other one. In particular, we show that while most XML path query processing techniques work off SAX events, in some cases it pays off to preprocess the input document, augmenting it with auxiliary information that can be used to evaluate the queries faster. We present experimental results over real and synthetic XML documents that validate our claims.</abstract></paper><paper><title>Supporting Ancillary Values From User Defined Functions in Oracle</title><author><AuthorName>Ravi Murthy</AuthorName><institute><InstituteName>Oracle Corporation, CA</InstituteName><country></country></institute></author><author><AuthorName>Ying Hu</AuthorName><institute><InstituteName>Oracle Corporation, CA</InstituteName><country></country></institute></author><author><AuthorName>Seema Sundara</AuthorName><institute><InstituteName>Oracle Corporation, CA</InstituteName><country></country></institute></author><author><AuthorName>Timothy Chorma</AuthorName><institute><InstituteName>Oracle Corporation, CA</InstituteName><country></country></institute></author><author><AuthorName>Nipun Agarwal</AuthorName><institute><InstituteName>Oracle Corporation, CA</InstituteName><country></country></institute></author><author><AuthorName>Jagannathan Srinivasan</AuthorName><institute><InstituteName>Oracle Corporation, CA</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Most commercial SQL database systems support user-defined functions that can be used in WHERE clause filters, SELECT list items, or in sorting/grouping clauses. Often, user-defined functions are used as inexact search filters and then the filtered rows are sorted by a relevance measure. This is commonplace in web search engines, multimedia, and personalization applications. 
In this paper, we refer to the values, such as the relevance measure, associated with the filtered rows as ancillary values, and address the problem of efficiently and expressively supporting queries involving them in Oracle. In our approach, the filtering operator is designated as the primary operator, and the associated ancillary values are modeled by additional operators that are declared to be ancillary to the primary operator. An ancillary operator can represent any auxiliary value for the filtered rows, including relevance values (e.g. a score which describes how well a document matches the text search query) and additional properties (e.g. the nature of the spatial relationship for objects that overlap a given region). The query execution is optimized by allowing the primary and ancillary operator invocations to share computations via a shared context. Also, queries involving ancillary values can exploit user-defined indexes and their capability to return results in the order of ancillary values. The paper presents the key concepts, describes our implementation scheme and optimization techniques, and discusses alternative approaches for supporting ancillary values. Finally, we provide an experimental study that illustrates the scalability and effectiveness of our approach.</abstract></paper><paper><title>Efficient Computation of Subqueries in Complex OLAP</title><author><AuthorName>Michael O. Akinde</AuthorName><institute><InstituteName>Swedish Meteorological & Hydrological Institute</InstituteName><country></country></institute></author><author><AuthorName>Michael H. Böhlen</AuthorName><institute><InstituteName>Aalborg University</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Expressing complex OLAP queries involving nested expressions using normal group-by, aggregation, and joins can be extremely difficult. This paper proposes a technique that translates nested query expressions into an algebra extended with a complex OLAP operator. The GMDJ is an operator with a simple and easy-to-optimize implementation that is particularly useful for OLAP computations because the size of intermediate results is bounded by the size of the base-value argument relation. We show that all SQL subqueries can be expressed in the algebra using GMDJs. This not only makes it easy to integrate subqueries into any query engine that supports GMDJs, but also gives access to a broad range of OLAP optimization strategies for evaluating subqueries. We discuss the coalescing of GMDJs and the completion of tuples, two GMDJ optimizations that are particularly relevant to subquery processing. Our experimental results demonstrate the validity and efficiency of our approach for computing subquery expressions.</abstract></paper><paper><title>A Comparison of Three Methods for Join View Maintenance in Parallel RDBMS</title><author><AuthorName>Gang Luo</AuthorName><institute><InstituteName>University of Wisconsin-Madison</InstituteName><country></country></institute></author><author><AuthorName>Jeffrey F. Naughton</AuthorName><institute><InstituteName>University of Wisconsin-Madison</InstituteName><country></country></institute></author><author><AuthorName>Curt J. Ellmann</AuthorName><institute><InstituteName>NCR Advance Development Lab</InstituteName><country></country></institute></author><author><AuthorName>Michael W. 
Watzke</AuthorName><institute><InstituteName>NCR Advance Development Lab</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In a typical data warehouse, materialized views are used to speed up query execution. Upon updates to the base relations in the warehouse, these materialized views must also be maintained. The need to maintain these materialized views can have a negative impact on performance that is exacerbated in parallel RDBMSs, since simple single-node updates to base relations can give rise to expensive all-node operations for materialized view maintenance. In this paper, we present a comparison of three materialized join view maintenance methods in a parallel RDBMS, which we refer to as the naive, auxiliary relation, and global index methods. The last two methods improve performance at the cost of using more space. The results of this study show that the method of choice depends on the environment, in particular, the update activity on base relations and the amount of available storage space.</abstract></paper><paper><title>Efficient Maintenance of Materialized Top-k Views</title><author><AuthorName>Ke Yi</AuthorName><institute><InstituteName>Duke University</InstituteName><country></country></institute></author><author><AuthorName>Hai Yu</AuthorName><institute><InstituteName>Duke University</InstituteName><country></country></institute></author><author><AuthorName>Jun Yang</AuthorName><institute><InstituteName>Duke University</InstituteName><country></country></institute></author><author><AuthorName>Gangqiang Xia</AuthorName><institute><InstituteName>Duke University</InstituteName><country></country></institute></author><author><AuthorName>Yuguo Chen</AuthorName><institute><InstituteName>Duke University</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We tackle the problem of maintaining materialized top-k views in this paper. Top-k queries, including MIN and MAX as important special cases, occur frequently in common database workloads. A top-k view can be materialized to improve query performance, but in general it is not self-maintainable unless it contains all tuples in the base table. Deletions and updates on the base table may cause tuples to leave the top-k view, resulting in expensive queries over the base table to &quot;refill&quot; the view. In this paper, we propose an algorithm that reduces the frequency of refills by maintaining a top-k' view instead of a top-k view, where k' changes at runtime between k and some kmax &#8805; k. We show that in most practical cases, our algorithm can reduce the expected amortized cost of refill queries to O(1) while still keeping the view small. The optimal value of kmax depends on the update pattern and the costs of querying the base table and updating the view. Compared with the simple approach of maintaining either the top-k view itself or a copy of the base table, our algorithm can provide orders-of-magnitude improvements in performance with appropriate kmax values. 
We show how to choose kmax dynamically to adapt to the actual system workload and performance at runtime, without requiring accurate prior knowledge.</abstract></paper><paper><title>Efficient Creation of Statistics over Query Expressions</title><author><AuthorName>Nicolas Bruno</AuthorName><institute><InstituteName>Columbia University</InstituteName><country></country></institute></author><author><AuthorName>Surajit Chaudhuri</AuthorName><institute><InstituteName>Microsoft Research</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Query optimizers use base-table statistics to derive statistics on the sub-plans that are enumerated during optimization. In practice, traditional optimizers rely on a number of simplifying assumptions, which can compromise the accuracy of cardinality estimates. To address this limitation, we had earlier introduced SITs, which are statistics built over query expressions, and we explained how a traditional optimizer can judiciously use SITs to sidestep the problem of inaccurate estimates. A significant challenge that was not addressed was how to build SITs efficiently in a database system. In this paper we present a family of techniques to create SITs. These techniques differ from each other in the trade-off they present between accuracy and efficiency of creation. We also present techniques to efficiently create multiple SITs by taking advantage of the commonalities among their generating query expressions.</abstract></paper><paper><title>Multicasting a Changing Repository</title><author><AuthorName>Wang Lam</AuthorName><institute><InstituteName>Stanford University</InstituteName><country></country></institute></author><author><AuthorName>Hector Garcia-Molina</AuthorName><institute><InstituteName>Stanford University</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Web crawlers generate significant loads on Web servers, and are difficult to operate. Instead of repeatedly running crawlers at many &quot;client&quot; sites, we propose a central crawler and Web repository that multicasts appropriate subsets of the central repository, and their subsequent changes, to subscribing clients. Loads at Web servers are reduced because a single crawler visits the servers, as opposed to all the client crawlers. In this paper we model and evaluate such a central Web multicast facility for subscriber clients, and for mixes of subscriber and one-time downloader clients. We consider different performance metrics and multicast algorithms for such a multicast facility, and develop guidelines for its design under various conditions.</abstract></paper><paper><title>Data Integration by Bi-Directional Schema Transformation Rules</title><author><AuthorName>Peter McBrien</AuthorName><institute><InstituteName>Imperial College, London</InstituteName><country></country></institute></author><author><AuthorName>Alexandra Poulovassilis</AuthorName><institute><InstituteName>Univ. of London</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In this paper we describe a new approach to data integration which subsumes the previous approaches of local-as-view (LAV) and global-as-view (GAV). 
Our method, which we term both-as-view (BAV), is based on the use of reversible schema transformation sequences. We show how LAV and GAV view definitions can be fully derived from BAV schema transformation sequences, and how BAV transformation sequences may be partially derived from LAV or GAV view definitions. We also show how BAV supports the evolution of both global and local schemas, and we discuss the ongoing implementation of the BAV approach within the AutoMed project.</abstract></paper><paper><title>Energy Efficient Index for Querying Location-Dependent Data in Mobile Broadcast Environments</title><author><AuthorName>Jianliang Xu</AuthorName><institute><InstituteName>HK Univ. of Sci. & Tech.</InstituteName><country></country></institute></author><author><AuthorName>Baihua Zheng</AuthorName><institute><InstituteName>HK Univ. of Sci. & Tech.</InstituteName><country></country></institute></author><author><AuthorName>Wang-Chien Lee</AuthorName><institute><InstituteName>Penn State University</InstituteName><country></country></institute></author><author><AuthorName>Dik Lun Lee</AuthorName><institute><InstituteName>HK Univ. of Sci. & Tech.</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Recent years have witnessed growing interest in location-dependent information services among mobile users. This paper examines the issue of processing location-dependent queries in a mobile broadcast environment. Different from a traditional environment, mobile users are concerned not only with access latencies but also with power conservation. Planar point location algorithms and conventional spatial index structures are shown to be inefficient. In this paper, we propose a new index data structure, called the D-tree, for querying location-dependent data in mobile broadcast environments. The basic idea is to index data regions based on the divisions between them. We describe how to construct the binary D-tree index, how to process location-dependent queries based on this index structure, and how to page the D-tree to fit the packet capacity. The performance of the D-tree is evaluated using both synthetic and real datasets. Experimental results show that the proposed D-tree provides a much better overall performance than well-known existing schemes such as the R*-tree.</abstract></paper><paper><title>XR-Tree: Indexing XML Data for Efficient Structural Joins</title><author><AuthorName>Haifeng Jiang</AuthorName><institute><InstituteName>Hong Kong University of Science and Technology</InstituteName><country></country></institute></author><author><AuthorName>Hongjun Lu</AuthorName><institute><InstituteName>Hong Kong University of Science and Technology</InstituteName><country></country></institute></author><author><AuthorName>Wei Wang</AuthorName><institute><InstituteName>Hong Kong University of Science and Technology</InstituteName><country></country></institute></author><author><AuthorName>Beng Chin Ooi</AuthorName><institute><InstituteName>National University of Singapore</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>XML documents are typically queried with a combination of value search and structure search. 
While querying by values can leverage traditional database technologies, evaluating structural relationships, specifically parent-child or ancestor-descendant relationships, between XML element sets has imposed a great challenge on efficient XML query processing. This paper proposes the XR-tree, namely the XML Region Tree, a dynamic external-memory index structure specially designed for strictly nested XML data. The unique feature of the XR-tree is that, for a given element, all its ancestors (or descendants) in an element set indexed by an XR-tree can be identified with optimal worst-case I/O cost. We then propose a new structural join algorithm that can evaluate the structural relationship between two XR-tree indexed element sets by effectively skipping ancestors and descendants that do not participate in the join. Our extensive performance study shows that the XR-tree based join algorithm significantly outperforms previous algorithms.</abstract></paper><paper><title>Joining Massive High-Dimensional Datasets</title><author><AuthorName>Tamer Kahveci</AuthorName><institute><InstituteName>University of California, Santa Barbara</InstituteName><country></country></institute></author><author><AuthorName>Christian A. Lang</AuthorName><institute><InstituteName>University of California, Santa Barbara</InstituteName><country></country></institute></author><author><AuthorName>Ambuj K. Singh</AuthorName><institute><InstituteName>University of California, Santa Barbara</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We consider the problem of joining massive datasets. We propose two techniques for minimizing the disk I/O cost of join operations for both spatial and sequence data. Our techniques optimize the available buffer space using a global view of the datasets. We build a boolean matrix on the pages of the given datasets using a lower bounding distance predictor. The marked entries of this matrix represent candidate page pairs to be joined. Our first technique joins the marked pages iteratively. Our second technique clusters the marked entries using rectangular dense regions that have minimal perimeter and fit into the buffer. These clusters are then ordered so that the total number of common pages between consecutive clusters is maximal. The clusters are then read from disk and joined. 
Our experimental results on various real datasets show that our techniques are 2 to 86 times faster than the competing techniques for spatial datasets, and 13 to 133 times faster than the competing techniques for sequence datasets.</abstract></paper><paper><title>Ranked Join Indices</title><author><AuthorName>Panayiotis Tsaparas</AuthorName><institute><InstituteName>University of Toronto</InstituteName><country></country></institute></author><author><AuthorName>Themistoklis Palpanas</AuthorName><institute><InstituteName>University of Toronto</InstituteName><country></country></institute></author><author><AuthorName>Yannis Kotidis</AuthorName><institute><InstituteName>AT&T Labs-Research</InstituteName><country></country></institute></author><author><AuthorName>Nick Koudas</AuthorName><institute><InstituteName>AT&T Labs-Research</InstituteName><country></country></institute></author><author><AuthorName>Divesh Srivastava</AuthorName><institute><InstituteName>AT&T Labs-Research</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>A plethora of data sources contain data entities that could be ordered according to a variety of attributes associated with the entities. Such orderings effectively result in a ranking of the entities according to the values in the attribute domain. Commonly, users correlate such sources for query processing purposes through join operations. In query processing, it is desirable to incorporate user preferences towards specific attributes or their values. A way to incorporate such preferences is by utilizing scoring functions that combine user preferences and attribute values and return a numerical score for each tuple in the join result. Then, a target query, which we refer to as a top-k join query, seeks to identify the k tuples in the join result with the highest scores. In this paper, we propose a novel technique, which we refer to as a ranked join index, to efficiently answer top-k join queries for arbitrary, user-specified preferences and a large class of scoring functions. Our ranked join index requires small space (compared to the entire join result) and provides guarantees for its performance. Moreover, our proposal provides a graceful tradeoff between its space requirements and worst-case search performance. 
We supplement our analytical results with a thorough experimental evaluation using a variety of real and synthetic data sets, demonstrating that, in comparison to other viable approaches, our technique offers significant performance benefits.</abstract></paper><paper><title>Pushing Aggregate Constraints by Divide-and-Approximate</title><author><AuthorName>Ke Wang</AuthorName><institute><InstituteName>Simon Fraser University</InstituteName><country></country></institute></author><author><AuthorName>Yuelong Jiang</AuthorName><institute><InstituteName>Simon Fraser University</InstituteName><country></country></institute></author><author><AuthorName>Jeffrey Xu Yu</AuthorName><institute><InstituteName>Chinese University of Hong Kong</InstituteName><country></country></institute></author><author><AuthorName>Guozhu Dong</AuthorName><institute><InstituteName>Wright State University</InstituteName><country></country></institute></author><author><AuthorName>Jiawei Han</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaign</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Iceberg-cube mining computes the GROUP BY partitions, for all GROUP BY dimension lists, that satisfy a given aggregate constraint. Previous works have pushed anti-monotone constraints into iceberg-cube mining. However, many useful constraints are not anti-monotone. In this paper, we propose a novel strategy for pushing general aggregate constraints, called Divide-and-Approximate. This strategy divides the search space and approximates the constraint in subspaces by a pushable constraint. As the strategy is recursively applied, the approximation approaches the given constraint and the pruning tightens. We show that all constraints defined by SQL aggregates, arithmetic operators and comparison operators can be pushed by Divide-and-Approximate. We present an efficient implementation for an important subclass and evaluate it on both synthetic and real-life databases.</abstract></paper><paper><title>SWAT: Hierarchical Stream Summarization in Large Networks</title><author><AuthorName>Ahmet Bulut</AuthorName><institute><InstituteName>University of California Santa Barbara, CA</InstituteName><country></country></institute></author><author><AuthorName>Ambuj K. Singh</AuthorName><institute><InstituteName>University of California Santa Barbara, CA</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The problem of statistics and aggregate maintenance over data streams has gained popularity in recent years, especially in telecommunications network monitoring, trend-related analysis, web-click streams, stock tickers, and other time-variant data. The amount of data generated in such applications can become too large to store or, if stored, too large to scan multiple times. We consider queries over data streams that are biased towards the more recent values. We develop a technique that summarizes a dynamic stream incrementally at multiple resolutions. This approximation can be used to answer point queries, range queries, and inner product queries. Moreover, the precision of answers can be changed adaptively by a client. 
Later, we extend the above technique to work in a distributed setting, specifically in a large network where a central site summarizes the stream and clients ask queries. We minimize the message overhead by deciding what and where to replicate, using an adaptive replication scheme. We maintain a hierarchy of approximations that change adaptively based on the query and update rates. We show experimentally that our technique performs better than existing techniques: up to 50 times better in terms of approximation quality, up to four orders of magnitude better in response time, and up to five times better in terms of message complexity.</abstract></paper><paper><title>LOCI: Fast Outlier Detection Using the Local Correlation Integral</title><author><AuthorName>Spiros Papadimitriou</AuthorName><institute><InstituteName>Carnegie Mellon University</InstituteName><country></country></institute></author><author><AuthorName>Hiroyuki Kitagawa</AuthorName><institute><InstituteName>University of Tsukuba</InstituteName><country></country></institute></author><author><AuthorName>Phillip B. Gibbons</AuthorName><institute><InstituteName>Intel Research Pittsburgh</InstituteName><country></country></institute></author><author><AuthorName>Christos Faloutsos</AuthorName><institute><InstituteName>Carnegie Mellon University</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Outlier detection is an integral part of data mining and has attracted much attention recently [8, 15, 19]. In this paper, we propose a new method for evaluating outlier-ness, which we call the Local Correlation Integral (LOCI). As with the best previous methods, LOCI is highly effective for detecting outliers and groups of outliers (a.k.a. micro-clusters). In addition, it offers the following advantages and novelties: (a) It provides an automatic, data-dictated cut-off to determine whether a point is an outlier; in contrast, previous methods force users to pick cut-offs, without any hints as to what cut-off value is best for a given dataset. (b) It can provide a LOCI plot for each point; this plot summarizes a wealth of information about the data in the vicinity of the point, determining clusters, micro-clusters, their diameters and their inter-cluster distances. None of the existing outlier-detection methods can match this feature, because they output only a single number for each point: its outlier-ness score. (c) Our LOCI method can be computed as quickly as the best previous methods. (d) Moreover, LOCI leads to a practically linear approximate method, aLOCI (for approximate LOCI), which provides fast, highly accurate outlier detection. 
To the best of our knowledge, this is the first work to use approximate computations to speed up outlier detection.</abstract></paper><paper><title>Combining Hierarchy Encoding and Pre-Grouping: Intelligent Grouping in Star Join Processing</title><author><AuthorName>Roland Pieringer</AuthorName><institute><InstituteName>Transaction Software GmbH, Germany</InstituteName><country></country></institute></author><author><AuthorName>Klaus Elhardt</AuthorName><institute><InstituteName>Transaction Software GmbH, Germany</InstituteName><country></country></institute></author><author><AuthorName>Frank Ramsak</AuthorName><institute><InstituteName>Bayerisches Forschungszentrum für Wissensbasierte Systeme, Germany</InstituteName><country></country></institute></author><author><AuthorName>Volker Markl</AuthorName><institute><InstituteName>IBM Almaden Research Center, San Jose, CA</InstituteName><country></country></institute></author><author><AuthorName>Robert Fenk</AuthorName><institute><InstituteName>Bayerisches Forschungszentrum für Wissensbasierte Systeme, Germany</InstituteName><country></country></institute></author><author><AuthorName>Rudolf Bayer</AuthorName><institute><InstituteName>TU-München, Germany</InstituteName><country></country></institute></author><author><AuthorName>Nikos Karayannidis</AuthorName><institute><InstituteName>National Technical University of Athens, Hellas</InstituteName><country></country></institute></author><author><AuthorName>Aris Tsois</AuthorName><institute><InstituteName>National Technical University of Athens, Hellas</InstituteName><country></country></institute></author><author><AuthorName>Timos Sellis</AuthorName><institute><InstituteName>National Technical University of Athens, Hellas</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Efficient star query processing is crucial for a performant data warehouse (DW) implementation, and much work is available on physical optimization (e.g., indexing and schema design) and logical optimization (e.g., pre-aggregated materialized views with query rewriting). One important step in the query processing phase is, however, still a bottleneck: the residual join of results from the fact table with the dimension tables in combination with grouping and aggregation. This phase typically consumes between 50% and 80% of the overall processing time. In typical DW scenarios, pre-grouping methods have only a limited effect as the grouping is usually specified on the hierarchy levels of the dimension tables and not on the fact table itself. In this paper, we suggest a combination of hierarchical clustering and pre-grouping as we have implemented it in the relational DBMS Transbase. Exploiting hierarchy semantics for the pre-grouping of fact table result tuples is several times faster than conventional query processing. The reason for this is that hierarchical pre-grouping reduces the number of join operations significantly. With this method even queries covering a large part of the fact table can be executed within a time span acceptable for interactive query processing.</abstract></paper><paper><title>Evaluating Window Joins over Unbounded Streams</title><author><AuthorName>Jaewoo Kang</AuthorName><institute><InstituteName>University of Wisconsin-Madison</InstituteName><country></country></institute></author><author><AuthorName>Jeffrey F. 
Naughton</AuthorName><institute><InstituteName>University of Wisconsin-Madison</InstituteName><country></country></institute></author><author><AuthorName>Stratis D. Viglas</AuthorName><institute><InstituteName>University of Wisconsin-Madison</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We investigate algorithms for evaluating sliding window joins over pairs of unbounded streams. We introduce a unit-time-basis cost model to analyze the expected performance of these algorithms. Using this cost model, we propose strategies for maximizing the efficiency of processing joins in three scenarios. First, we consider the case where one stream is much faster than the other. We show that asymmetric combinations of join algorithms (e.g., hash join on one input, nested-loops join on the other) can outperform symmetric join algorithm implementations. Second, we investigate the case where system resources are insufficient to keep up with the input streams. We show that we can maximize the number of join result tuples produced in this case by properly allocating computing resources across the two input streams. Finally, we investigate strategies for maximizing the number of result tuples produced when memory is limited, and show that proper memory allocation across the two input streams can result in significantly lower resource usage and/or more result tuples produced.</abstract></paper><paper><title>Using State Modules for Adaptive Query Processing</title><author><AuthorName>Vijayshankar Raman</AuthorName><institute><InstituteName>IBM Almaden Research Center</InstituteName><country></country></institute></author><author><AuthorName>Amol Deshpande</AuthorName><institute><InstituteName>University of California, Berkeley</InstituteName><country></country></institute></author><author><AuthorName>Joseph M. Hellerstein</AuthorName><institute><InstituteName>University of California, Berkeley</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We present a query architecture in which join operators are decomposed into their constituent data structures (State Modules, or SteMs), and dataflow among these SteMs is managed adaptively by an eddy routing operator [2]. Breaking the encapsulation of joins serves two purposes. First, it allows the eddy to observe multiple physical operations embedded in a join algorithm, allowing for better calibration and control of these operations. Second, the SteM on a relation serves as a shared materialization point, enabling multiple competing access methods to share results, which can be leveraged by multiple competing join algorithms. Our architecture extends prior work significantly, allowing continuously adaptive decisions for most major aspects of traditional query optimization: choice of access methods and join algorithms, ordering of operators, and choice of a query spanning tree. SteMs introduce significant routing flexibility to the eddy, enabling more opportunities for adaptation, but also introducing the possibility of incorrect query results. We present constraints on eddy routing through SteMs that ensure correctness while preserving a great deal of flexibility. We also demonstrate the benefits of our architecture via experiments in the Telegraph dataflow system [26]. 
We show that even a simple routing policy allows significant flexibility in adaptation, including novel effects like automatic &quot;hybridization&quot; of multiple algorithms for a single join.</abstract></paper><paper><title>Keyword Proximity Search on XML Graphs</title><author><AuthorName>Vagelis Hristidis</AuthorName><institute><InstituteName>University of California, San Diego</InstituteName><country></country></institute></author><author><AuthorName>Yannis Papakonstantinou</AuthorName><institute><InstituteName>University of California, San Diego</InstituteName><country></country></institute></author><author><AuthorName>Andrey Balmin</AuthorName><institute><InstituteName>University of California, San Diego</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>XKeyword provides efficient keyword proximity queries on large XML graph databases. A query is simply a list of keywords and does not require any schema or query language knowledge for its formulation. XKeyword is built on a relational database and, hence, can accommodate very large graphs. Query evaluation is optimized by using the graph's schema. In particular, XKeyword consists of two stages. In the preprocessing stage, a set of keyword indices is built along with indexed path relations that describe particular patterns of paths in the graph. In the query processing stage, plans are developed that use a near-optimal set of path relations to efficiently locate the keyword query results. The results are presented graphically using the novel idea of interactive result graphs, which are populated on demand according to the user's navigation and allow efficient information discovery. We provide theoretical and experimental points for the selection of the appropriate set of precomputed path relations. We also propose and experimentally evaluate algorithms to minimize the number of queries sent to the database to output the top-K results.</abstract></paper><paper><title>XPath Query Evaluation: Improving Time and Space Efficiency</title><author><AuthorName>Georg Gottlob</AuthorName><institute><InstituteName>Technische Universität Wien, Vienna, Austria</InstituteName><country></country></institute></author><author><AuthorName>Christoph Koch</AuthorName><institute><InstituteName>Technische Universität Wien, Vienna, Austria</InstituteName><country></country></institute></author><author><AuthorName>Reinhard Pichler</AuthorName><institute><InstituteName>Technische Universität Wien, Vienna, Austria</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Contemporary XPath query engines evaluate queries in time exponential in the sizes of input queries, a fact that has gone unnoticed for a long time. Recently, the first main-memory evaluation algorithm for XPath 1.0 with polynomial-time combined complexity, i.e., which runs in polynomial time both with respect to the size of the data and the queries, has been published (cf. [11]). In this paper, we present several important improvements and extensions of that work, including new XPath processing algorithms with improved time and space efficiency. Moreover, we define a very large and practically relevant fragment of XPath for which a further optimized form of query evaluation is possible. 
Apart from its immediate relevance for XPath query processing, our work also sheds new light on those features of XPath 1.0 which are most costly relative to their practical usefulness.</abstract></paper><paper><title>PBiTree Coding and Efficient Processing of Containment Joins</title><author><AuthorName>Wei Wang</AuthorName><institute><InstituteName>The Hong Kong University of Sci. &amp; Tech.</InstituteName><country></country></institute></author><author><AuthorName>Haifeng Jiang</AuthorName><institute><InstituteName>The Hong Kong University of Sci. &amp; Tech.</InstituteName><country></country></institute></author><author><AuthorName>Hongjun Lu</AuthorName><institute><InstituteName>The Hong Kong University of Sci. &amp; Tech.</InstituteName><country></country></institute></author><author><AuthorName>Jeffrey Xu Yu</AuthorName><institute><InstituteName>The Chinese University of Hong Kong</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>This paper addresses issues related to containment join processing in tree-structured data such as XML documents. A containment join takes two sets of XML node elements as input and returns pairs of elements such that the containment relationship holds between them. While there are previous algorithms for processing containment joins, they require both element sets to be either sorted or indexed. This paper proposes a novel and complete containment query processing framework based on a new coding scheme, PBiTree code. The PBiTree code allows us to determine the ancestor-descendant relationship between two elements from their PBiTree-based codes efficiently. We present algorithms in the framework that are optimized for various combinations of settings. In particular, the newly proposed partitioning based algorithms can process containment joins efficiently without sorting or indexes. Experimental results indicate that the containment join processing algorithms based on the proposed coding scheme outperform existing algorithms significantly.</abstract></paper><paper><title>Representing Web Graphs</title><author><AuthorName>Sriram Raghavan</AuthorName><institute><InstituteName>Stanford University</InstituteName><country></country></institute></author><author><AuthorName>Hector Garcia-Molina</AuthorName><institute><InstituteName>Stanford University</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>A Web repository is a large special-purpose collection of Web pages and associated indexes. Many useful queries and computations over such repositories involve traversal and navigation of the Web graph. However, efficient traversal of huge Web graphs containing several hundred million vertices and a few billion edges is a challenging problem. An additional complication is the lack of a schema to describe the structure of Web graphs. As a result, naive graph representation schemes can significantly increase query execution time and limit the usefulness of Web repositories. In this paper, we propose a novel representation for Web graphs, called an S-Node representation. We demonstrate that S-Node representations are highly space-efficient, enabling in-memory processing of very large Web graphs.
In addition, we present detailed experiments that show that S-Node representations can significantly reduce query execution times when compared with other schemes for representing Web graphs.</abstract></paper><paper><title>Selectivity Estimation for Predictive Spatio-Temporal Queries</title><author><AuthorName>Yufei Tao</AuthorName><institute><InstituteName>Carnegie Mellon University</InstituteName><country></country></institute></author><author><AuthorName>Jimeng Sun</AuthorName><institute><InstituteName>Hong Kong University of Science and Technology</InstituteName><country></country></institute></author><author><AuthorName>Dimitris Papadias</AuthorName><institute><InstituteName>Hong Kong University of Science and Technology</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>This paper proposes a cost model for selectivity estimation of predictive spatio-temporal window queries. Initially, we focus on uniform data, proposing formulae that capture both points and rectangles, and any type of object/query mobility combination (i.e., dynamic objects, dynamic queries or both). Then, we apply the model to non-uniform datasets by introducing spatio-temporal histograms, which, in addition to the spatial distributions, also consider the velocity distributions during partitioning. The advantages of our techniques are (i) high accuracy (1-2 orders of magnitude lower error than previous techniques), (ii) ability to handle all query types, and (iii) efficient handling of updates.</abstract></paper><paper><title>Capturing Sensor-Generated Time Series with Quality Guarantees</title><author><AuthorName>Iosif Lazaridis</AuthorName><institute><InstituteName>University of California, Irvine</InstituteName><country></country></institute></author><author><AuthorName>Sharad Mehrotra</AuthorName><institute><InstituteName>University of California, Irvine</InstituteName><country></country></institute></author><year>2004</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We are interested in capturing time series generated by small wireless electronic sensors. Battery-operated sensors must avoid heavy use of their wireless radio, which is a key cause of energy dissipation. When many sensors transmit, the resources of the recipient of the data are taxed; hence, limiting communication will benefit the recipient as well. In our paper we show how time series generated by sensors can be captured and stored in a database system (archive). Sensors compress time series instead of sending them in raw form. We propose an optimal on-line algorithm for constructing a piecewise constant approximation (PCA) of a time series which guarantees that the compressed representation satisfies an error bound on the L∞ distance. In addition to the capture task, we often want to estimate the values of a time series ahead of time, e.g., to answer real-time queries. To achieve this, sensors may fit predictive models on observed data, sending parameters of these models to the archive. We exploit the interplay between prediction and compression in a unified framework that avoids duplicating effort and leads to reduced communication.</abstract></paper><paper><title>Structural Join Order Selection for XML Query Optimization</title><author><AuthorName>Yuqing Wu</AuthorName><institute><InstituteName>Univ. of Michigan</InstituteName>
<country></country></institute></author><author><AuthorName>Jignesh M. Patel</AuthorName><institute><InstituteName>Univ. of Michigan</InstituteName><country></country></institute></author>