<proceedings><paper><title>Welcome from the Program Chairs</title><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>Program Committee</title><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>External Reviewers</title><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>&quot;One Size Fits All&quot;: An Idea Whose Time Has Come and Gone</title><author><AuthorName>Michael Stonebraker</AuthorName><institute><InstituteName>StreamBase Systems, Inc</InstituteName><country></country></institute></author><author><AuthorName>U&#287;ur &#199;etintemel</AuthorName><institute><InstituteName>Brown University and StreamBase Systems, Inc</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The last 25 years of commercial DBMS development&#10;can be summed up in a single phrase: &quot;One size fits all&quot;.&#10;This phrase refers to the fact that the traditional DBMS&#10;architecture (originally designed and optimized for&#10;business data processing) has been used to support many&#10;data-centric applications with widely varying&#10;characteristics and requirements. In this paper, we argue that this concept is no longer&#10;applicable to the database market, and that the&#10;commercial world will fracture into a collection of&#10;independent database engines, some of which may be&#10;unified by a common front-end parser. We use examples&#10;from the stream-processing market and the data-warehouse&#10;market to bolster our claims. We also briefly&#10;discuss other markets for which the traditional&#10;architecture is a poor fit and argue for a critical&#10;rethinking of the current factoring of systems services&#10;into products.</abstract></paper><paper><title>IC Tag Based Traceability: System and Solutions</title><author><AuthorName>Yoji Taniguchi</AuthorName><institute><InstituteName>Hitachi, Ltd</InstituteName><country></country></institute></author><author><AuthorName>Nobutoshi Sagawa</AuthorName><institute><InstituteName>Hitachi, Ltd</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>An increasing number of companies want to improve&#10;product traceability for several reasons: to meet stricter&#10;government regulations about food and medical safety, to&#10;cope with ever-stronger consumer demands to know&#10;exactly what they are buying, and to improve and protect&#10;the company's brand value through more transparent&#10;business operations. Two aspects of traceability are&#10;technically important: (1) techniques for tracing the&#10;events associated with the goods a company handles at all&#10;necessary points of the business operation, possibly&#10;through the use of IC tags and tag readers; and (2) ways&#10;to store, manage, and use the collected logs of events&#10;either to cope with problems or to improve business&#10;processes. In this paper, we first review currently available&#10;traceability systems by considering examples from real-world&#10;situations. 
After that, we discuss the likely&#10;directions and possibilities of next-generation traceability&#10;systems.</abstract></paper><paper><title>Effective Computation of Biased Quantiles over Data Streams</title><author><AuthorName>Graham Cormode</AuthorName><institute><InstituteName>Bell Labs, Lucent Technologie</InstituteName><country></country></institute></author><author><AuthorName>Flip Korn</AuthorName><institute><InstituteName>AT&T Labs-Researc</InstituteName><country></country></institute></author><author><AuthorName>S. Muthukrishnan</AuthorName><institute><InstituteName>Rutgers Universit</InstituteName><country></country></institute></author><author><AuthorName>Divesh Srivastava</AuthorName><institute><InstituteName>AT&T Labs-Researc</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Skew is prevalent in many data sources such as IP traffic&#10;streams. To continually summarize the distribution of such data, a&#10;high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles)&#10;with finer error guarantees at higher ranks (e.g., errors of 5,&#10;1 and 0.1 percent, respectively) is more useful than uniformly&#10;distributed quantiles (e.g., 25th, 50th and 75th percentiles) with&#10;uniform error guarantees. In this paper, we address the following&#10;two problems. First, can we compute quantiles with finer&#10;error guarantees for the higher ranks of the data distribution&#10;effectively, using less space and computation time than computing&#10;all quantiles uniformly at the finest error? Second, if specific&#10;quantiles and their error bounds are requested a priori, can the&#10;necessary space usage and computation time be reduced?We answer both questions in the affirmative by formalizing&#10;them as the &quot;high-biased&quot; and the &quot;targeted&quot; quantiles problems,&#10;respectively, and presenting algorithms with provable guarantees,&#10;that perform significantly better than previously known&#10;solutions for these problems. We implemented our algorithms in&#10;the Gigascope data stream management system, and evaluated alternate&#10;approaches for maintaining the relevant summary structures.&#10;Our experimental results on real and synthetic IP data&#10;streams complement our theoretical analyses, and highlight the&#10;importance of lightweight, non-blocking implementations when&#10;maintaining summary structures over high-speed data streams.</abstract></paper><paper><title>Range-Efficient Computation of F&#8320; over Massive Data Streams</title><author><AuthorName>A. Pavan</AuthorName><institute><InstituteName>Iowa State Universit</InstituteName><country></country></institute></author><author><AuthorName>Srikanta Tirthapura</AuthorName><institute><InstituteName>Iowa State Universit</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Efficient one-pass computation of F&#8320;, the number of distinct&#10;elements in a data stream, is a fundamental problem&#10;arising in various contexts in databases and networking. 
We&#10;consider the problem of efficiently estimating F&#8320; of a data&#10;stream where each element of the stream is an interval of integers. We present a randomized algorithm which gives an (\varepsilon, \delta)-approximation of F&#8320;, with the following time complexity (n&#10;is the size of the universe of the items): (1) The amortized&#10;processing time per interval is O(\log\frac{1}{\delta}\log\frac{n}{\varepsilon}). (2) The time&#10;to answer a query for F&#8320; is O(\log\frac{1}{\delta}). The workspace used&#10;is O(\frac{1}{\varepsilon^2}\log\frac{1}{\delta}\log n) bits. Our algorithm improves upon a previous algorithm&#10;by Bar-Yossef, Kumar and Sivakumar [5], which requires&#10;O(\frac{1}{\varepsilon^5}\log\frac{1}{\delta}\log^5 n) processing time per item. Our algorithm can be used to compute the max-dominance norm&#10;of a stream of multiple signals, and significantly improves&#10;upon the current best bounds due to Cormode and&#10;Muthukrishnan [11]. This also provides efficient and novel&#10;solutions for data aggregation problems in sensor networks&#10;studied by Nath and Gibbons [22] and Considine et al. [8].</abstract></paper><paper><title>A Unified Framework for Monitoring Data Streams in Real Time</title><author><AuthorName>Ahmet Bulut</AuthorName><institute><InstituteName>University of California at Santa Barbara</InstituteName><country></country></institute></author><author><AuthorName>Ambuj K. Singh</AuthorName><institute><InstituteName>University of California at Santa Barbara</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Online monitoring of data streams poses a challenge in&#10;many data-centric applications, such as telecommunications&#10;networks, traffic management, trend-related analysis, web-click&#10;streams, intrusion detection, and sensor networks. Mining&#10;techniques employed in these applications have to be &#10;efficient in terms of space usage and per-item processing time&#10;while providing a high quality of answers to (1) aggregate&#10;monitoring queries, such as finding surprising levels of a&#10;data stream, detecting bursts, and to (2) similarity queries,&#10;such as detecting correlations and finding interesting patterns.&#10;The most important aspect of these tasks is their need&#10;for flexible query lengths, i.e., it is difficult to set the appropriate&#10;lengths a priori. For example, bursts of events can&#10;occur at variable temporal modalities from hours to days&#10;to weeks. Correlated trends can occur at various temporal&#10;scales. The system has to discover &quot;interesting&quot; behavior&#10;online and monitor over flexible window sizes. In this paper,&#10;we propose a multi-resolution indexing scheme, which&#10;handles variable length queries efficiently. We demonstrate&#10;the effectiveness of our framework over existing techniques&#10;through an extensive set of experiments.</abstract></paper><paper><title>Corpus-Based Schema Matching</title><author><AuthorName>Jayant Madhavan</AuthorName><institute><InstituteName>University of Washington</InstituteName><country></country></institute></author><author><AuthorName>Philip A. 
Bernstein</AuthorName><institute><InstituteName>Microsoft Research</InstituteName><country></country></institute></author><author><AuthorName>AnHai Doan</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaign</InstituteName><country></country></institute></author><author><AuthorName>Alon Halevy</AuthorName><institute><InstituteName>University of Washington</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Schema Matching is the problem of identifying corresponding&#10;elements in different schemas. Discovering these&#10;correspondences or matches is inherently difficult to automate.&#10;Past solutions have proposed a principled combination&#10;of multiple algorithms. However, these solutions&#10;sometimes perform rather poorly due to the lack of &#10;sufficient evidence in the schemas being matched. In this paper&#10;we show how a corpus of schemas and mappings can&#10;be used to augment the evidence about the schemas being&#10;matched, so they can be matched better. Such a corpus typically&#10;contains multiple schemas that model similar concepts&#10;and hence enables us to learn variations in the elements&#10;and their properties. We exploit such a corpus in two&#10;ways. First, we increase the evidence about each element&#10;being matched by including evidence from similar elements&#10;in the corpus. Second, we learn statistics about elements&#10;and their relationships and use them to infer constraints&#10;that we use to prune candidate mappings. We also describe&#10;how to use known mappings to learn the importance of domain&#10;and generic constraints. We present experimental results&#10;that demonstrate corpus-based matching outperforms&#10;direct matching (without the benefit of a corpus) in multiple&#10;domains.</abstract></paper><paper><title>Schema Matching Using Duplicates</title><author><AuthorName>Alexander Bilke</AuthorName><institute><InstituteName>Technische Universit&#228;t Berlin</InstituteName><country></country></institute></author><author><AuthorName>Felix Naumann</AuthorName><institute><InstituteName>Humboldt-Universit&#228;t zu Berlin</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Most data integration applications require a matching&#10;between the schemas of the respective data sets. We show&#10;how the existence of duplicates within these data sets can be&#10;exploited to automatically identify matching attributes. We&#10;describe an algorithm that first discovers duplicates among&#10;data sets with unaligned schemas and then uses these duplicates&#10;to perform schema matching between schemas with&#10;opaque column names. Discovering duplicates among data sets with unaligned&#10;schemas is more difficult than in the usual setting, because&#10;it is not clear which fields in one object should be compared&#10;with which fields in the other. We have developed a new&#10;algorithm that efficiently finds the most likely duplicates in&#10;such a setting. Now, our schema matching algorithm is able&#10;to identify corresponding attributes by comparing data values&#10;within those duplicate records. 
An experimental study&#10;on real-world data shows the effectiveness of this approach.</abstract></paper><paper><title>Representing and Querying Data Transformations</title><author><AuthorName>Yannis Velegrakis</AuthorName><institute><InstituteName>AT&T Labs - Research</InstituteName><country></country></institute></author><author><AuthorName>Ren&#233;e J. Miller</AuthorName><institute><InstituteName>University of Toronto</InstituteName><country></country></institute></author><author><AuthorName>John Mylopoulos</AuthorName><institute><InstituteName>University of Toronto</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Modern information systems often store data that has&#10;been transformed and integrated from a variety of sources.&#10;This integration may obscure the original source semantics&#10;of data items. For many tasks, it is important to be&#10;able to determine not only where data items originated,&#10;but also why they appear in the integration as they do and&#10;through what transformation they were derived. This problem&#10;is known as data provenance. In this work, we consider&#10;data provenance at the schema and mapping level. In particular,&#10;we consider how to answer questions such as &quot;what&#10;schema elements in the source(s) contributed to this value&quot;,&#10;or &quot;through what transformations or mappings was this&#10;value derived?&quot; Towards this end, we elevate schemas and&#10;mappings to first-class citizens that are stored in a repository&#10;and are associated with the actual data values. An&#10;extended query language, called MXQL, is also developed&#10;that allows meta-data to be queried as regular data and we&#10;describe its implementation.</abstract></paper><paper><title>Bypass Caching: Making Scientific Databases Good Network Citizens</title><author><AuthorName>Tanu Malik</AuthorName><institute><InstituteName>Johns Hopkins University</InstituteName><country></country></institute></author><author><AuthorName>Randal Burns</AuthorName><institute><InstituteName>Johns Hopkins University</InstituteName><country></country></institute></author><author><AuthorName>Amitabh Chaudhary</AuthorName><institute><InstituteName>University of Notre Dame</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Scientific database federations are geographically distributed&#10;and network bound. Thus, they could benefit&#10;from proxy caching. However, existing caching techniques&#10;are not suitable for their workloads, which compare&#10;and join large data sets. Existing techniques reduce parallelism&#10;by conducting distributed queries in a single cache&#10;and lose the data reduction benefits of performing selections&#10;at each database. We develop the bypass-yield formulation&#10;of caching, which reduces network traffic in&#10;wide-area database federations, while preserving parallelism&#10;and data reduction. Bypass-yield caching is altruistic;&#10;caches minimize the overall network traffic generated&#10;by the federation, rather than focusing on local performance.&#10;We present an adaptive, workload-driven algorithm&#10;for managing a bypass-yield cache. 
We also develop&#10;on-line algorithms that make no assumptions about workload:&#10;a k-competitive deterministic algorithm and a randomized&#10;algorithm with minimal space complexity. We&#10;verify the efficacy of bypass-yield caching by running workload&#10;traces collected from the Sloan Digital Sky Survey&#10;through a prototype implementation.</abstract></paper><paper><title>Asymmetric Batch Incremental View Maintenance</title><author><AuthorName>Hao He</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Junyi Xie</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Jun Yang</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Hai Yu</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Incremental view maintenance has found a growing&#10;number of applications recently, including data warehousing,&#10;continuous query processing, publish/subscribe systems,&#10;etc. Batch processing of base table modifications,&#10;when applicable, can be much more efficient than processing&#10;individual modifications one at a time. In this paper, we&#10;tackle the problem of finding the most efficient batch incremental&#10;maintenance strategy under a refresh response time&#10;constraint; that is, at any point in time, the system, upon&#10;request, must be able to bring the view up to date within&#10;a specified amount of time. The traditional approach is&#10;to process all batched modifications relevant to the view&#10;whenever the constraint is violated. However, we observe&#10;that there often exists natural asymmetry among different&#10;components of the maintenance cost; for example, &#10;modifications on one base table might be cheaper to process than&#10;those on another base table because of some index. We exploit&#10;such asymmetries using an unconventional strategy&#10;that selectively processes modifications on some base tables&#10;while keeping batching others. We present a series&#10;of analytical results leading to the development of practical&#10;algorithms that approximate an &quot;oracle algorithm&quot;&#10;with perfect knowledge of the future. 
With experiments on&#10;a TPC-R database, we demonstrate that our strategy offers&#10;substantial performance gains over traditional deferred&#10;view maintenance techniques.</abstract></paper><paper><title>Adaptive Caching for Continuous Queries</title><author><AuthorName>Shivnath Babu</AuthorName><institute><InstituteName>Stanford Universit</InstituteName><country></country></institute></author><author><AuthorName>Kamesh Munagala</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Jennifer Widom</AuthorName><institute><InstituteName>Stanford Universit</InstituteName><country></country></institute></author><author><AuthorName>Rajeev Motwani</AuthorName><institute><InstituteName>Stanford Universit</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We address the problem of executing continuous multiway&#10;join queries in unpredictable and volatile environments.&#10;Our query class captures windowed join queries in&#10;data stream systems as well as conventional maintenance&#10;of materialized join views. Our adaptive approach handles&#10;streams of updates whose rates and data characteristics&#10;may change over time, as well as changes in system&#10;conditions such as memory availability. In this paper&#10;we focus specifically on the problem of adaptive placement&#10;and removal of caches to optimize join performance.&#10;Our approach automatically considers conventional tree-shaped&#10;join plans with materialized subresults at every intermediate&#10;node, subresult-free MJoins, and the entire spectrum&#10;between them. We provide algorithms for selecting&#10;caches, monitoring their cost and benefits in current conditions,&#10;allocating memory to caches, and adapting as conditions&#10;change. All of our algorithms are implemented in the&#10;STREAM prototype Data Stream Management System and&#10;a thorough experimental evaluation is included.</abstract></paper><paper><title>Snapshot Queries: Towards Data-Centric Sensor Networks</title><author><AuthorName>Yannis Kotidis</AuthorName><institute><InstituteName>AT&T Labs-Researc</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In this paper we introduce the idea of snapshot queries&#10;for energy efficient data acquisition in sensor networks.&#10;Network nodes generate models of their surrounding environment&#10;that are used for electing, using a localized algorithm,&#10;a small set of representative nodes in the network.&#10;These representative nodes constitute a network snapshot&#10;and can be used to provide quick approximate answers to&#10;user queries while reducing substantially the energy consumption&#10;in the network. We present a detailed experimental&#10;study of our framework and algorithms, varying multiple&#10;parameters like the available memory of the sensor nodes,&#10;their transmission range, the network message loss etc. 
Depending&#10;on the configuration, snapshot queries provide a&#10;reduction of up to 90% in the number of nodes that need to&#10;participate in a user query.</abstract></paper><paper><title>Exploiting Correlated Attributes in Acquisitional Query Processing</title><author><AuthorName>Amol Deshpande</AuthorName><institute><InstituteName>University of Marylan</InstituteName><country></country></institute></author><author><AuthorName>Carlos Guestrin</AuthorName><institute><InstituteName>Carnegi Mellon Universit</InstituteName><country></country></institute></author><author><AuthorName>Wei Hong</AuthorName><institute><InstituteName>Intel Research Berkele</InstituteName><country></country></institute></author><author><AuthorName>Samuel Madden</AuthorName><institute><InstituteName>Massachussets Institute of Technolog</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Sensor networks and other distributed information systems&#10;(such as the Web) must frequently access data that has a high&#10;per-attribute acquisition cost, in terms of energy, latency, or&#10;computational resources. When executing queries that contain&#10;several predicates over such expensive attributes, we observe&#10;that it can be beneficial to use correlations to automatically introduce&#10;low-cost attributes whose observation will allow the query&#10;processor to better estimate the selectivity of these expensive&#10;predicates. In particular, we show how to build conditional&#10;plans that branch into one or more sub-plans, each with a different&#10;ordering for the expensive query predicates, based on the&#10;runtime observation of low-cost attributes. We frame the problem&#10;of constructing the optimal conditional plan for a given user&#10;query and set of candidate low-cost attributes as an optimization&#10;problem. We describe an exponential time algorithm for finding&#10;such optimal plans, and describe a polynomial-time heuristic&#10;for identifying conditional plans that perform well in practice.&#10;We also show how to compactly model conditional probability&#10;distributions needed to identify correlations and build these&#10;plans. We evaluate our algorithms against several real-world&#10;sensor-network data sets, showing several-times performance&#10;increases for a variety of queries versus traditional optimization&#10;techniques.</abstract></paper><paper><title>Data Triage: An Adaptive Architecture for Load Shedding in TelegraphCQ</title><author><AuthorName>Frederick Reiss</AuthorName><institute><InstituteName>University of California at Berkele</InstituteName><country></country></institute></author><author><AuthorName>Joseph M. Hellerstein</AuthorName><institute><InstituteName>Intel Research Berkele</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Many of the data sources used in stream query processing&#10;are known to exhibit bursty behavior. Data in a burst often&#10;has different characteristics than steady-state data, and&#10;therefore may be of particular interest. 
In this paper, we&#10;describe the Data Triage architecture that we are adding to&#10;TelegraphCQ to provide low latency results with good accuracy&#10;under such bursts.</abstract></paper><paper><title>AutoLag: Automatic Discovery of Lag Correlations in Stream Data</title><author><AuthorName>Yasushi Sakurai</AuthorName><institute><InstituteName>NTT Cyber Space Laboratories</InstituteName><country></country></institute></author><author><AuthorName>Spiros Papadimitriou</AuthorName><institute><InstituteName>Carnegie Mellon University</InstituteName><country></country></institute></author><author><AuthorName>Christos Faloutsos</AuthorName><institute><InstituteName>Carnegie Mellon University</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>Adaptive Processing of Top-k Queries in XML</title><author><AuthorName>Am&#233;lie Marian</AuthorName><institute><InstituteName>Columbia University</InstituteName><country></country></institute></author><author><AuthorName>Sihem Amer-Yahia</AuthorName><institute><InstituteName>AT&T Labs-Research</InstituteName><country></country></institute></author><author><AuthorName>Nick Koudas</AuthorName><institute><InstituteName>AT&T Labs-Research</InstituteName><country></country></institute></author><author><AuthorName>Divesh Srivastava</AuthorName><institute><InstituteName>AT&T Labs-Research</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The ability to compute top-k matches to XML queries is&#10;gaining importance due to the increasing number of large&#10;XML repositories. The efficiency of top-k query evaluation&#10;relies on using scores to prune irrelevant answers as early&#10;as possible in the evaluation process. In this context, evaluating&#10;the same query plan for all answers might be too rigid&#10;because, at any time in the evaluation, answers have gone&#10;through the same number and sequence of operations, which&#10;limits the speed at which scores grow. Therefore, adaptive&#10;query processing that permits different plans for different&#10;partial matches and maximizes the best scores is more appropriate.&#10;In this paper, we propose an architecture and&#10;adaptive algorithms for efficiently computing top-k matches&#10;to XML queries. Our techniques can be used to evaluate&#10;both exact and approximate matches where approximation&#10;is defined by relaxing XPath axes. In order to compute the&#10;scores of query answers, we extend the traditional tf*idf&#10;measure to account for document structure. 
We conduct extensive&#10;experiments on a variety of benchmark data and&#10;queries, and demonstrate the usefulness of the adaptive approach&#10;for computing top-k queries in XML.</abstract></paper><paper><title>Progressive Distributed Top-k Retrieval in Peer-to-Peer Networks</title><author><AuthorName>Wolf-Tilo Balke</AuthorName><institute><InstituteName>University of California at Berkele</InstituteName><country></country></institute></author><author><AuthorName>Wolfgang Nejdl</AuthorName><institute><InstituteName>University of Hannove</InstituteName><country></country></institute></author><author><AuthorName>Wolf Siberski</AuthorName><institute><InstituteName>University of Hannove</InstituteName><country></country></institute></author><author><AuthorName>Uwe Thaden</AuthorName><institute><InstituteName>University of Hannove</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Query processing in traditional information management&#10;systems has moved from an exact match model to&#10;more flexible paradigms allowing cooperative retrieval&#10;by aggregating the database objects&#8217; degree of match for&#10;each different query predicate and returning the best&#10;matching objects only. In peer-to-peer systems such&#10;strategies are even more important, given the potentially&#10;large number of peers, which may contribute to the results.&#10;Yet current peer-to-peer research has barely started&#10;to investigate such approaches. In this paper we will discuss&#10;the benefits of best match/top-k queries in the context&#10;of distributed peer-to-peer information infrastructures&#10;and show how to extend the limited query processing in&#10;current peer-to-peer networks by allowing the distributed&#10;processing of top-k queries, while maintaining a minimum&#10;of data traffic. Relying on a super-peer backbone&#10;organized in the HyperCuP topology we will show how to&#10;use local indexes for optimizing the necessary query routing&#10;and how to process intermediate results in inner network&#10;nodes at the earliest possible point in time cutting&#10;down the necessary data traffic within the network. Our&#10;algorithm is based on dynamically collected query statistics&#10;only, no continuous index update processes are necessary,&#10;allowing it to scale easily to large numbers of&#10;peers, as well as dynamic additions/deletions of peers.&#10;We will show our approach to always deliver correct result&#10;sets and to be optimal in terms of necessary object&#10;accesses and data traffic. 
Finally, we present simulation&#10;results for both static and dynamic network environments.</abstract></paper><paper><title>Optimizing Access Cost for Top-k Queries over Web Sources: A Unified Cost-Based Approach</title><author><AuthorName>Seung-won Hwang</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaig</InstituteName><country></country></institute></author><author><AuthorName>Kevin Chen-Chuan Chang</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaig</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>A Framework for High-Accuracy Privacy-Preserving Mining</title><author><AuthorName>Shipra Agrawal</AuthorName><institute><InstituteName>Indian Institute of Science - Bangalor</InstituteName><country></country></institute></author><author><AuthorName>Jayant R. Haritsa</AuthorName><institute><InstituteName>Indian Institute of Science - Bangalor</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>To preserve client privacy in the data mining process, a&#10;variety of techniques based on random perturbation of individual&#10;data records have been proposed recently. In this&#10;paper, we present FRAPP, a generalized matrix-theoretic&#10;framework of random perturbation, which facilitates a systematic&#10;approach to the design of perturbation mechanisms&#10;for privacy-preserving mining. Specifically, FRAPP is used&#10;to demonstrate that (a) the prior techniques differ only in&#10;their choices for the perturbation matrix elements, and (b)&#10;a symmetric perturbation matrix with minimal condition&#10;number can be identified, maximizing the accuracy even&#10;under strict privacy guarantees. We also propose a novel&#10;perturbation mechanism wherein the matrix elements are&#10;themselves characterized as random variables, and demonstrate&#10;that this feature provides significant improvements in&#10;privacy at only a marginal cost in accuracy.The quantitative utility of FRAPP, which applies to&#10;random-perturbation-based privacy-preserving mining in&#10;general, is evaluated specifically with regard to frequent-itemset&#10;mining on a variety of real datasets. Our experimental&#10;results indicate that, for a given privacy requirement,&#10;substantially lower errors are incurred, with respect&#10;to both itemset identity and itemset support, as compared to&#10;the prior techniques.</abstract></paper><paper><title>Top-Down Specialization for Information and Privacy Preservation</title><author><AuthorName>Benjamin C. M. Fung</AuthorName><institute><InstituteName>Simon Fraser Universit</InstituteName><country></country></institute></author><author><AuthorName>Ke Wang</AuthorName><institute><InstituteName>Simon Fraser Universit</InstituteName><country></country></institute></author><author><AuthorName>Philip S. Yu</AuthorName><institute><InstituteName>IBMT. J. Watson Research Cente</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Releasing person-specific data in its most specific state&#10;poses a threat to individual privacy. 
This paper presents a&#10;practical and efficient algorithm for determining a generalized&#10;version of data that masks sensitive information and&#10;remains useful for modelling classification. The generalization&#10;of data is implemented by specializing or detailing the&#10;level of information in a top-down manner until a minimum&#10;privacy requirement is violated. This top-down specialization&#10;is natural and efficient for handling both categorical&#10;and continuous attributes. Our approach exploits the fact&#10;that data usually contains redundant structures for &#10;classification. While generalization may eliminate some structures,&#10;other structures emerge to help. Our results show that&#10;quality of classification can be preserved even for highly restrictive&#10;privacy requirements. This work has great applicability&#10;to both public and private sectors that share information&#10;for mutual benefits and productivity.</abstract></paper><paper><title>Data Privacy through Optimal k-Anonymization</title><author><AuthorName>Roberto J. Bayardo</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><author><AuthorName>Rakesh Agrawal</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Data de-identification reconciles the demand for release of&#10;data for research purposes and the demand for privacy&#10;from individuals. This paper proposes and evaluates an&#10;optimization algorithm for the powerful de-identification&#10;procedure known as k-anonymization. A k-anonymized&#10;dataset has the property that each record is indistinguishable&#10;from at least k - 1 others. Even simple restrictions of&#10;optimized k-anonymity are NP-hard, leading to significant&#10;computational challenges. We present a new approach to&#10;exploring the space of possible anonymizations that tames&#10;the combinatorics of the problem, and develop data-management&#10;strategies to reduce reliance on expensive operations&#10;such as sorting. Through experiments on real census&#10;data, we show the resulting algorithm can find optimal &#10;k-anonymizations under two representative cost measures&#10;and a wide range of k. We also show that the algorithm&#10;can produce good anonymizations in circumstances where&#10;the input data or input parameters preclude finding an&#10;optimal solution in reasonable time. Finally, we use the&#10;algorithm to explore the effects of different coding&#10;approaches and problem variations on anonymization&#10;quality and performance. To our knowledge, this is the first&#10;result demonstrating optimal k-anonymization of a nontrivial&#10;dataset under a general model of the problem.&#10;</abstract></paper><paper><title>A Comparative Evaluation of Transparent Scaling Techniques for Dynamic Content Servers</title><author><AuthorName>C. Amza</AuthorName><institute><InstituteName>University of Toront</InstituteName><country></country></institute></author><author><AuthorName>A. L. Cox</AuthorName><institute><InstituteName>Rice Universit</InstituteName><country></country></institute></author><author><AuthorName>W. 
Zwaenepoel</AuthorName><institute><InstituteName>EPFL</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We study several transparent techniques for scaling dynamic&#10;content web sites, and we evaluate their relative impact&#10;when used in combination. Full transparency implies&#10;strong data consistency as perceived by the user, no &#10;modifications to existing dynamic content site tiers and no additional&#10;programming effort from the user or site administrator&#10;upon deployment. We study strategies for scheduling and load balancing&#10;queries on a cluster of replicated database back-ends. We&#10;also investigate transparent query caching as a means of&#10;enhancing database replication. Our work shows that, on an experimental platform with&#10;up to 8 database replicas, the various techniques work in&#10;synergy to improve overall scaling for the e-commerce TPC-W&#10;benchmark. We rank the techniques necessary for high&#10;performance in order of impact as follows. Key among the&#10;strategies are scheduling strategies, such as conflict-aware&#10;scheduling, that minimize consistency maintenance overheads.&#10;The choice of load balancing strategy is less important.&#10;Transparent query result caching increases performance&#10;significantly at any given cluster size for a mostly-read&#10;workload. Its benefits are limited for write-intensive&#10;workloads, where content-aware scheduling is the only&#10;scaling option.</abstract></paper><paper><title>SemCast: Semantic Multicast for Content-Based Data Dissemination</title><author><AuthorName>Olga Papaemmanouil</AuthorName><institute><InstituteName>Brown University</InstituteName><country></country></institute></author><author><AuthorName>U&#287;ur &#199;etintemel</AuthorName><institute><InstituteName>Brown University</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We address the problem of content-based&#10;dissemination of highly-distributed, high-volume data&#10;streams for stream-based monitoring applications and&#10;large-scale data delivery. Existing content-based&#10;dissemination approaches commonly rely on distributed&#10;filtering trees that require filtering at all brokers on the&#10;tree. We present a new semantic multicast approach that&#10;eliminates the need for content-based filtering at interior&#10;brokers and facilitates fine-grained control over the&#10;construction of efficient dissemination trees. The central&#10;idea is to split the incoming data streams (based on their&#10;contents, rates, and destinations) and then spread the&#10;pieces across multiple channels, each of which is&#10;implemented as an independent dissemination tree. We&#10;present the basic design and evaluation of SemCast, an&#10;overlay-network based system that implements this&#10;semantic multicast approach. 
Through a detailed&#10;simulation study and realistic network topologies, we&#10;demonstrate that SemCast significantly improves the&#10;efficiency of dissemination compared to traditional&#10;approaches.</abstract></paper><paper><title>A Distributed Quadtree Index for Peer-to-Peer Settings</title><author><AuthorName>Egemen Tanin</AuthorName><institute><InstituteName>University of Melbourn</InstituteName><country></country></institute></author><author><AuthorName>Aaron Harwood</AuthorName><institute><InstituteName>University of Melbourn</InstituteName><country></country></institute></author><author><AuthorName>Hanan Samet</AuthorName><institute><InstituteName>University of Maryland at College Par</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>DUP: Dynamic-Tree Based Update Propagation in Peer-to-Peer Networks</title><author><AuthorName>Liangzhong Yin</AuthorName><institute><InstituteName>Pennsylvania State Universit</InstituteName><country></country></institute></author><author><AuthorName>Guohong Cao</AuthorName><institute><InstituteName>Pennsylvania State Universit</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In peer-to-peer networks, indices are used to map data id&#10;to nodes that host the data. The performance of data access&#10;can be improved by actively pushing indices to interested&#10;nodes. This paper proposes the Dynamic-tree based Update&#10;Propagation (DUP) scheme, which builds the update propagation&#10;tree to facilitate the propagation of indices. Because&#10;the update propagation tree only involves nodes that are essential&#10;for update propagation, the overhead of DUP is very&#10;small and the query latency is significantly reduced.</abstract></paper><paper><title>Vectorizing and Querying Large XML Repositories</title><author><AuthorName>Peter Buneman</AuthorName><institute><InstituteName>University of Edinburg</InstituteName><country></country></institute></author><author><AuthorName>Byron Choi</AuthorName><institute><InstituteName>University of Edinburg</InstituteName><country></country></institute></author><author><AuthorName>Wenfei Fan</AuthorName><institute><InstituteName>University of Edinburg</InstituteName><country></country></institute></author><author><AuthorName>Robert Hutchison</AuthorName><institute><InstituteName>University of Edinburg</InstituteName><country></country></institute></author><author><AuthorName>Robert Mann</AuthorName><institute><InstituteName>University of Edinburg</InstituteName><country></country></institute></author><author><AuthorName>Stratis D. Viglas</AuthorName><institute><InstituteName>University of Edinburg</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Vertical partitioning is a well-known technique for optimizing&#10;query performance in relational databases. An extreme form of&#10;this technique, which we call vectorization, is to store each column&#10;separately. We use a generalization of vectorization as the&#10;basis for a native XML store. The idea is to decompose an XML&#10;document into a set of vectors that contain the data values and a&#10;compressed skeleton that describes the structure. 
In order to query&#10;this representation and produce results in the same vectorized format,&#10;we consider a practical fragment of XQuery and introduce&#10;the notion of query graphs and a novel graph reduction algorithm&#10;that allows us to leverage relational optimization techniques as&#10;well as to reduce the unnecessary loading of data vectors and decompression&#10;of skeletons. A preliminary experimental study based&#10;on some scientific and synthetic XML data repositories in the order&#10;of gigabytes supports the claim that these techniques are scalable&#10;and have the potential to provide performance comparable with&#10;established relational database technology.</abstract></paper><paper><title>IMAX: Incremental Maintenance of Schema-Based XML Statistics</title><author><AuthorName>Maya Ramanath</AuthorName><institute><InstituteName>Indian Institute of Scienc</InstituteName><country></country></institute></author><author><AuthorName>Lingzhi Zhang</AuthorName><institute><InstituteName>OGI/OHS</InstituteName><country></country></institute></author><author><AuthorName>Juliana Freire</AuthorName><institute><InstituteName>OGI/OHS</InstituteName><country></country></institute></author><author><AuthorName>Jayant R. Haritsa</AuthorName><institute><InstituteName>Indian Institute of Scienc</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Current approaches for estimating the cardinality of&#10;XML queries are applicable to a static scenario wherein&#10;the underlying XML data does not change subsequent to&#10;the collection of statistics on the repository. However, in&#10;practice, many XML-based applications are dynamic and&#10;involve frequent updates to the data. In this paper, we investigate&#10;efficient strategies for incrementally maintaining statistical&#10;summaries as and when updates are applied to the&#10;data. Specifically, we propose algorithms that handle both&#10;the addition of new documents as well as random insertions&#10;in the existing document trees. We also show, through a detailed&#10;performance evaluation, that our incremental techniques&#10;are significantly faster than the naive recomputation&#10;approach; and that estimation accuracy can be maintained&#10;even with a fixed memory budget.</abstract></paper><paper><title>BOXes: Efficient Maintenance of Order-Based Labeling for Dynamic XML Data</title><author><AuthorName>Adam Silberstein</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Hao He</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Ke Yi</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Jun Yang</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Order-based element labeling for tree-structured XML&#10;data is an important technique in XML processing. It lies at&#10;the core of many fundamental XML operations such as containment&#10;join and twig matching. 
While labeling for static&#10;XML documents is well understood, less is known about&#10;how to maintain accurate labeling for dynamic XML documents,&#10;when elements and subtrees are inserted and deleted.&#10;Most existing approaches do not work well for arbitrary update&#10;patterns; they either produce unacceptably long labels&#10;or incur enormous relabeling costs. We present two novel&#10;I/O-efficient data structures, W-BOX and B-BOX, that &#10;efficiently maintain labeling for large, dynamic XML documents.&#10;We show analytically and experimentally that both, despite&#10;consuming minimal amounts of storage, gracefully handle&#10;arbitrary update patterns without sacrificing lookup &#10;efficiency. The two structures together provide a nice tradeoff&#10;between update and lookup costs: W-BOX has logarithmic&#10;amortized update cost and constant worst-case lookup&#10;cost, while B-BOX has constant amortized update cost and&#10;logarithmic worst-case lookup cost. We further propose techniques&#10;to eliminate the lookup cost for read-heavy workloads.</abstract></paper><paper><title>Efficient Inverted Lists and Query Algorithms for Structured Value Ranking in Update-Intensive Relational Databases</title><author><AuthorName>Lin Guo</AuthorName><institute><InstituteName>Cornell University</InstituteName><country></country></institute></author><author><AuthorName>Jayavel Shanmugasundaram</AuthorName><institute><InstituteName>Cornell University</InstituteName><country></country></institute></author><author><AuthorName>Kevin Beyer</AuthorName><institute><InstituteName>IBM Almaden Research Center</InstituteName><country></country></institute></author><author><AuthorName>Eugene Shekita</AuthorName><institute><InstituteName>IBM Almaden Research Center</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We propose a new ranking paradigm for relational&#10;databases called Structured Value Ranking (SVR). SVR uses&#10;structured data values to score (rank) the results of keyword&#10;search queries over text columns. Our main contribution&#10;is a new family of inverted list indices and associated&#10;query algorithms that can support SVR efficiently in&#10;update-intensive databases, where the structured data values&#10;(and hence the scores of documents) change frequently.&#10;Our experimental results on real and synthetic data sets using&#10;BerkeleyDB show that we can support SVR efficiently in&#10;relational databases.</abstract></paper><paper><title>Compressing Bitmap Indices by Data Reorganization</title><author><AuthorName>Ali P&#305;nar</AuthorName><institute><InstituteName>Lawrence Berkeley National Laboratory</InstituteName><country></country></institute></author><author><AuthorName>Tao Tao</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaign</InstituteName><country></country></institute></author><author><AuthorName>Hakan Ferhatosmanoglu</AuthorName><institute><InstituteName>Ohio State University</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Many scientific applications generate massive volumes&#10;of data through observations or computer simulations,&#10;bringing up the need for effective indexing&#10;methods for efficient storage and retrieval of scientific&#10;data. 
Unlike conventional databases, scientific data is&#10;mostly read-only and its volume can reach to the order&#10;of petabytes, making a compact index structure vital.&#10;Bitmap indexing has been successfully applied to scientific&#10;databases by exploiting the fact that scientific data&#10;are enumerated or numerical. Bitmap indices can be&#10;compressed with variants of run length encoding for a&#10;compact index structure. However even this may not&#10;be enough for the enormous data generated in some&#10;applications such as high energy physics. In this paper,&#10;we study how to reorganize bitmap tables for improved&#10;compression rates. Our algorithms are used just as a&#10;preprocessing step, thus there is no need to revise the&#10;current indexing techniques and the query processing&#10;algorithms. We introduce the tuple reordering problem,&#10;which aims to reorganize database tuples for optimal&#10;compression rates. We propose Gray code ordering algorithm&#10;for this NP-Complete problem, which is an inplace&#10;algorithm, and runs in linear time in the order&#10;of the size of the database. We also discuss how the&#10;tuple reordering problem can be reduced to the traveling&#10;salesperson problem. Our experimental results on&#10;real data sets show that the compression ratio can be&#10;improved by a factor of 2 to 10.</abstract></paper><paper><title>Practical Data Management Techniques for Vehicle Tracking Data</title><author><AuthorName>Sotiris Brakatsoulas</AuthorName><institute><InstituteName>Research Academic Computer Technology Institut</InstituteName><country></country></institute></author><author><AuthorName>Dieter Pfoser</AuthorName><institute><InstituteName>Research Academic Computer Technology Institut</InstituteName><country></country></institute></author><author><AuthorName>Nectaria Tryfona</AuthorName><institute><InstituteName>Research Academic Computer Technology Institut</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>A novel data source for assessing traffic conditions is&#10;floating car data (FCD) in the form of vehicle tracking&#10;data, or, in database terms, trajectory data. This work&#10;proposes practical data management techniques including&#10;data pre-processing, data modeling and indexing to&#10;support the analysis and the data mining of vehicle&#10;tracking data</abstract></paper><paper><title>On Discovery of Extremely Low-Dimensional Clusters Using Semi-Supervised Projected Clustering</title><author><AuthorName>Kevin Y. Yip</AuthorName><institute><InstituteName>University of Hong Kon</InstituteName><country></country></institute></author><author><AuthorName>David W. Cheung</AuthorName><institute><InstituteName>University of Hong Kon</InstituteName><country></country></institute></author><author><AuthorName>Michael K. Ng</AuthorName><institute><InstituteName>University of Hong Kon</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Recent studies suggest that projected clusters with extremely&#10;low dimensionality exist in many real datasets. 
A&#10;number of projected clustering algorithms have been proposed&#10;in the past several years, but few can identify clusters&#10;with dimensionality lower than 10% of the total number&#10;of dimensions, which are commonly found in some real&#10;datasets such as gene expression profiles. In this paper we&#10;propose a new algorithm that can accurately identify projected&#10;clusters with relevant dimensions as few as 5% of the&#10;total number of dimensions. It makes use of a robust objective&#10;function that combines object clustering and dimension&#10;selection into a single optimization problem. The algorithm&#10;can also utilize domain knowledge in the form of labeled&#10;objects and labeled dimensions to improve its clustering accuracy.&#10;We believe this is the first semi-supervised projected&#10;clustering algorithm. Both theoretical analysis and experimental&#10;results show that by using a small amount of input&#10;knowledge, possibly covering only a portion of the underlying&#10;classes, the new algorithm can be further improved to&#10;accurately detect clusters with only 1% of the dimensions&#10;being relevant. The algorithm is also useful in getting a target&#10;set of clusters when there are multiple possible groupings&#10;of the objects.</abstract></paper><paper><title>Clustering Aggregation</title><author><AuthorName>Aristides Gionis</AuthorName><institute><InstituteName>University of Helsink</InstituteName><country></country></institute></author><author><AuthorName>Heikki Mannila</AuthorName><institute><InstituteName>University of Helsink</InstituteName><country></country></institute></author><author><AuthorName>Panayiotis Tsaparas</AuthorName><institute><InstituteName>University of Helsink</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We consider the following problem: given a set of clusterings,&#10;find a clustering that agrees as much as possible&#10;with the given clusterings. This problem, clustering aggregation,&#10;appears naturally in various contexts. For example,&#10;clustering categorical data is an instance of the problem:&#10;each categorical variable can be viewed as a clustering of&#10;the input rows. Moreover, clustering aggregation can be&#10;used as a meta-clustering method to improve the robustness&#10;of clusterings. The problem formulation does not require a-priori&#10;information about the number of clusters, and it gives&#10;a naturalway for handlingmissing values. We give a formal&#10;statement of the clustering-aggregation problem, we discuss&#10;related work, and we suggest a number of algorithms. For&#10;several of the methods we provide theoretical guarantees on&#10;the quality of the solutions. We also show how sampling can&#10;be used to scale the algorithms for large data sets. 
We give&#10;an extensive empirical evaluation demonstrating the usefulness&#10;of the problem and of the solutions.</abstract></paper><paper><title>Mining Cross-Graph Quasi-Cliques in Gene Expression and Protein Interaction Data</title><author><AuthorName>Jian Pei</AuthorName><institute><InstituteName>Simon Fraser University</InstituteName><country></country></institute></author><author><AuthorName>Daxin Jiang</AuthorName><institute><InstituteName>State University of New York at Buffalo</InstituteName><country></country></institute></author><author><AuthorName>Aidong Zhang</AuthorName><institute><InstituteName>State University of New York at Buffalo</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>Mining Closed Relational Graphs with Connectivity Constraints</title><author><AuthorName>Xifeng Yan</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaign</InstituteName><country></country></institute></author><author><AuthorName>X. Jasmine Zhou</AuthorName><institute><InstituteName>University of Southern California</InstituteName><country></country></institute></author><author><AuthorName>Jiawei Han</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaign</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>Efficient Creation and Incremental Maintenance of the HOPI Index for Complex XML Document Collections</title><author><AuthorName>Ralf Schenkel</AuthorName><institute><InstituteName>Max-Planck-Institut f&#252;r Informatik</InstituteName><country></country></institute></author><author><AuthorName>Anja Theobald</AuthorName><institute><InstituteName>Max-Planck-Institut f&#252;r Informatik</InstituteName><country></country></institute></author><author><AuthorName>Gerhard Weikum</AuthorName><institute><InstituteName>Max-Planck-Institut f&#252;r Informatik</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The HOPI index, a connection index for XML documents&#10;based on the concept of a 2-hop cover, provides space- and&#10;time-efficient reachability tests along the ancestor, descendant,&#10;and link axes to support path expressions with wildcards&#10;in XML search engines. This paper presents enhanced algorithms for building&#10;HOPI, shows how to augment the index with distance information,&#10;and discusses incremental index maintenance. Our&#10;experiments show substantial improvements over the existing&#10;divide-and-conquer algorithm for index creation, low&#10;space overhead for including distance information in the&#10;index, and efficient updates.</abstract></paper><paper><title>On the Sequencing of Tree Structures for XML Indexing</title><author><AuthorName>Haixun Wang</AuthorName><institute><InstituteName>IBM T. J. 
Watson Research Cente</InstituteName><country></country></institute></author><author><AuthorName>Xiaofeng Meng</AuthorName><institute><InstituteName>Renmin University of Chin</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Sequence-based XML indexing aims at avoiding expensive&#10;join operations in query processing. It transforms structured&#10;XML data into sequences so that a structured query can be&#10;answered holistically through subsequence matching. In this&#10;paper, we address the problem of query equivalence with respect&#10;to this transformation, and we introduce a performance-oriented&#10;principle for sequencing tree structures. With query&#10;equivalence, XML queries can be performed through subsequence&#10;matching without join operations, post-processing, or&#10;other special handling for problems such as false alarms. We&#10;identify a class of sequencing methods for this purpose, and we&#10;present a novel subsequence matching algorithm that observe&#10;query equivalence. Still, query equivalence is just a prerequisite&#10;for sequence-based XML indexing. Our goal is to find&#10;the best sequencing strategy with regard to the time and space&#10;complexity in indexing and querying XML data. To this end,&#10;we introduce a performance-oriented principle to guide the sequencing&#10;of tree structures. For any given XML dataset, the&#10;principle finds an optimal sequencing strategy according to its&#10;schema and its data distribution. We present a novel method&#10;that realizes this principle. In our experiments, we show the&#10;advantages of sequence-based indexing over traditional XML&#10;indexing methods, and we compare several sequencing strategies&#10;and demonstrate the benefit of the performance-oriented&#10;sequencing principle.</abstract></paper><paper><title>VLEI Code: An Efficient Labeling Method for Handling XML Documents in an RDB</title><author><AuthorName>Kazuhito Kobayashi</AuthorName><institute><InstituteName>Tokyo Institute of Technolog</InstituteName><country></country></institute></author><author><AuthorName>Wenxin Wenxin</AuthorName><institute><InstituteName>Tokyo Institute of Technolog</InstituteName><country></country></institute></author><author><AuthorName>Dai Kobayashi</AuthorName><institute><InstituteName>Tokyo Institute of Technolog</InstituteName><country></country></institute></author><aut