
icde_2005_elementary.txt

Generated with lwp::get
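The note above says this file was generated with lwp::get. A minimal sketch of that kind of fetch in Perl, using LWP::Simple's get; the URL and output filename below are placeholders, since the actual crawl target and script are not recorded in this file:

#!/usr/bin/perl
# Sketch only: the real crawl script is not included with this file.
use strict;
use warnings;
use LWP::Simple qw(get);

# Placeholder URL; the actual source of the proceedings XML is unknown.
my $url = 'http://example.org/icde2005/proceedings.xml';
my $xml = get($url);    # LWP::Simple::get returns undef on failure
die "fetch failed: $url\n" unless defined $xml;

# Placeholder output name matching this file.
open my $out, '>', 'icde_2005_elementary.txt' or die "cannot open output: $!";
print {$out} $xml;
close $out;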
Page 1 of 5
<proceedings><paper><title>Welcome from the Program Chairs</title><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>Program Committee</title><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>External Reviewers</title><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>&amp;quot;One Size Fits All&amp;quot;: An Idea Whose Time Has Come and Gone</title><author><AuthorName>Michael Stonebraker</AuthorName><institute><InstituteName>StreamBase Systems, Inc</InstituteName><country></country></institute></author><author><AuthorName>U&amp;#287;ur 莈tintemel</AuthorName><institute><InstituteName>Brown University and StreamBase Systems, Inc</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The last 25 years of commercial DBMS development&amp;#10;can be summed up in a single phrase: &amp;quot;One size fits all&amp;quot;.&amp;#10;This phrase refers to the fact that the traditional DBMS&amp;#10;architecture (originally designed and optimized for&amp;#10;business data processing) has been used to support many&amp;#10;data-centric applications with widely varying&amp;#10;characteristics and requirements.In this paper, we argue that this concept is no longer&amp;#10;applicable to the database market, and that the&amp;#10;commercial world will fracture into a collection of&amp;#10;independent database engines, some of which may be&amp;#10;unified by a common front-end parser. We use examples&amp;#10;from the stream-processing market and the data-warehouse&amp;#10;market to bolster our claims. We also briefly&amp;#10;discuss other markets for which the traditional&amp;#10;architecture is a poor fit and argue for a critical&amp;#10;rethinking of the current factoring of systems services&amp;#10;into products.</abstract></paper><paper><title>IC Tag Based Traceability: System and Solutions</title><author><AuthorName>Yoji Taniguchi</AuthorName><institute><InstituteName>Hitachi, Ltd</InstituteName><country></country></institute></author><author><AuthorName>Nobutoshi Sagawa</AuthorName><institute><InstituteName>Hitachi, Ltd</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>An increasing number of companies want to improve&amp;#10;product traceability for several reasons: to meet stricter&amp;#10;government regulations about food and medical safety, to&amp;#10;cope with ever-stronger consumer demands to know&amp;#10;exactly what they are buying, and to improve and protect&amp;#10;the company's brand value through more transparent&amp;#10;business operations. 
Two aspects of traceability are&amp;#10;technically important: (1) techniques for tracing the&amp;#10;events associated with the goods a company handles at all&amp;#10;necessary points of the business operation, possibly&amp;#10;through the use of IC tags and tag readers; and (2) ways&amp;#10;to store, manage, and use the collected logs of events&amp;#10;either to cope with problems or to improve business&amp;#10;processes.In this paper, we first review currently available&amp;#10;traceability systems by considering examples from real-world&amp;#10;situations. After that, we discuss the likely&amp;#10;directions and possibilities of next-generation traceability&amp;#10;systems.</abstract></paper><paper><title>Effective Computation of Biased Quantiles over Data Streams</title><author><AuthorName>Graham Cormode</AuthorName><institute><InstituteName>Bell Labs, Lucent Technologie</InstituteName><country></country></institute></author><author><AuthorName>Flip Korn</AuthorName><institute><InstituteName>AT&amp;T Labs-Researc</InstituteName><country></country></institute></author><author><AuthorName>S. Muthukrishnan</AuthorName><institute><InstituteName>Rutgers Universit</InstituteName><country></country></institute></author><author><AuthorName>Divesh Srivastava</AuthorName><institute><InstituteName>AT&amp;T Labs-Researc</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Skew is prevalent in many data sources such as IP traffic&amp;#10;streams. To continually summarize the distribution of such data, a&amp;#10;high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles)&amp;#10;with finer error guarantees at higher ranks (e.g., errors of 5,&amp;#10;1 and 0.1 percent, respectively) is more useful than uniformly&amp;#10;distributed quantiles (e.g., 25th, 50th and 75th percentiles) with&amp;#10;uniform error guarantees. In this paper, we address the following&amp;#10;two problems. First, can we compute quantiles with finer&amp;#10;error guarantees for the higher ranks of the data distribution&amp;#10;effectively, using less space and computation time than computing&amp;#10;all quantiles uniformly at the finest error? Second, if specific&amp;#10;quantiles and their error bounds are requested a priori, can the&amp;#10;necessary space usage and computation time be reduced?We answer both questions in the affirmative by formalizing&amp;#10;them as the &amp;quot;high-biased&amp;quot; and the &amp;quot;targeted&amp;quot; quantiles problems,&amp;#10;respectively, and presenting algorithms with provable guarantees,&amp;#10;that perform significantly better than previously known&amp;#10;solutions for these problems. We implemented our algorithms in&amp;#10;the Gigascope data stream management system, and evaluated alternate&amp;#10;approaches for maintaining the relevant summary structures.&amp;#10;Our experimental results on real and synthetic IP data&amp;#10;streams complement our theoretical analyses, and highlight the&amp;#10;importance of lightweight, non-blocking implementations when&amp;#10;maintaining summary structures over high-speed data streams.</abstract></paper><paper><title>Range-Efficient Computation of F&amp;#8320; over Massive Data Streams</title><author><AuthorName>A. 
Pavan</AuthorName><institute><InstituteName>Iowa State Universit</InstituteName><country></country></institute></author><author><AuthorName>Srikanta Tirthapura</AuthorName><institute><InstituteName>Iowa State Universit</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Efficient one-pass computation of F&amp;#8320;, the number of distinct&amp;#10;elements in a data stream, is a fundamental problem&amp;#10;arising in various contexts in databases and networking. We&amp;#10;consider the problem of efficiently estimating F&amp;#8320; of a data&amp;#10;stream where each element of the stream is an interval of integers.We present a randomized algorithm which gives an (\varepsilon ,\delta )&amp;#10;approximation of F&amp;#8320;, with the following time complexity (n&amp;#10;is the size of the universe of the items): (1) The amortized&amp;#10;processing time per interval is 0(\log \frac{1}{\delta }Log\frac{n}{\varepsilon }). (2) The time&amp;#10;to answer a query for F&amp;#8320; is 0(\log {1 \mathord{\left/ {\vphantom {1 {\delta )}}} \right. \kern-\nulldelimiterspace} {\delta )}}. The workspace used&amp;#10;is 0(\frac{1}{{\varepsilon ^2 }}\log \frac{1}{\delta }\log n) bits.Our algorithm improves upon a previous algorithm&amp;#10;by Bar-Yossef, Kumar and Sivakumar [5], which requires&amp;#10;0(\frac{1}{{\varepsilon ^5 }}\log \frac{1}{\delta }\log ^5 n) processing time per item. Our algorithm can be used to compute the max-dominance norm&amp;#10;of a stream of multiple signals, and significantly improves&amp;#10;upon the current best bounds due to Cormode and&amp;#10;Muthukrishnan [11]. This also provides efficient and novel&amp;#10;solutions for data aggregation problems in sensor networks&amp;#10;studied by Nath and Gibbons [22] and Considine et.&amp;#10;al. [8].</abstract></paper><paper><title>A Unified Framework for Monitoring Data Streams in Real Time</title><author><AuthorName>Ahmet Bulut</AuthorName><institute><InstituteName>University of California at Santa Barbar</InstituteName><country></country></institute></author><author><AuthorName>Ambuj K. Singh</AuthorName><institute><InstituteName>University of California at Santa Barbar</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Online monitoring of data streams poses a challenge in&amp;#10;many data-centric applications, such as telecommunications&amp;#10;networks, traffic management, trend-related analysis, web-click&amp;#10;streams, intrusion detection, and sensor networks. Mining&amp;#10;techniques employed in these applications have to be &amp;#10;efficient in terms of space usage and per-item processing time&amp;#10;while providing a high quality of answers to (1) aggregate&amp;#10;monitoring queries, such as finding surprising levels of a&amp;#10;data stream, detecting bursts, and to (2) similarity queries,&amp;#10;such as detecting correlations and finding interesting patterns.&amp;#10;The most important aspect of these tasks is their need&amp;#10;for flexible query lengths, i.e., it is difficult to set the appropriate&amp;#10;lengths a priori. For example, bursts of events can&amp;#10;occur at variable temporal modalities from hours to days&amp;#10;to weeks. Correlated trends can occur at various temporal&amp;#10;scales. 
The system has to discover &amp;quot;interesting&amp;quot; behavior&amp;#10;online and monitor over flexible window sizes. In this paper,&amp;#10;we propose a multi-resolution indexing scheme, which&amp;#10;handles variable length queries efficiently. We demonstrate&amp;#10;the effectiveness of our framework over existing techniques&amp;#10;through an extensive set of experiments.</abstract></paper><paper><title>Corpus-Based Schema Matching</title><author><AuthorName>Jayant Madhavan</AuthorName><institute><InstituteName>University of Washingto</InstituteName><country></country></institute></author><author><AuthorName>Philip A. Bernstein</AuthorName><institute><InstituteName>Microsoft Researc</InstituteName><country></country></institute></author><author><AuthorName>AnHai Doan</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaig</InstituteName><country></country></institute></author><author><AuthorName>Alon Halevy</AuthorName><institute><InstituteName>University of Washingto</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Schema Matching is the problem of identifying corresponding&amp;#10;elements in different schemas. Discovering these&amp;#10;correspondences or matches is inherently difficult to automate.&amp;#10;Past solutions have proposed a principled combination&amp;#10;of multiple algorithms. However, these solutions&amp;#10;sometimes perform rather poorly due to the lack of &amp;#10;sufficient evidence in the schemas being matched. In this paper&amp;#10;we show how a corpus of schemas and mappings can&amp;#10;be used to augment the evidence about the schemas being&amp;#10;matched, so they can be matched better. Such a corpus typically&amp;#10;contains multiple schemas that model similar concepts&amp;#10;and hence enables us to learn variations in the elements&amp;#10;and their properties. We exploit such a corpus in two&amp;#10;ways. First, we increase the evidence about each element&amp;#10;being matched by including evidence from similar elements&amp;#10;in the corpus. Second, we learn statistics about elements&amp;#10;and their relationships and use them to infer constraints&amp;#10;that we use to prune candidate mappings. We also describe&amp;#10;how to use known mappings to learn the importance of domain&amp;#10;and generic constraints. We present experimental results&amp;#10;that demonstrate corpus-based matching outperforms&amp;#10;direct matching (without the benefit of a corpus) in multiple&amp;#10;domains.</abstract></paper><paper><title>Schema Matching Using Duplicates</title><author><AuthorName>Alexander Bilke</AuthorName><institute><InstituteName>Technische Universit鋞 Berli</InstituteName><country></country></institute></author><author><AuthorName>Felix Naumann</AuthorName><institute><InstituteName>Humboldt-Universit鋞 zu Berli</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Most data integration applications require a matching&amp;#10;between the schemas of the respective data sets. We show&amp;#10;how the existence of duplicates within these data sets can be&amp;#10;exploited to automatically identify matching attributes. 
We&amp;#10;describe an algorithm that first discovers duplicates among&amp;#10;data sets with unaligned schemas and then uses these duplicates&amp;#10;to perform schema matching between schemas with&amp;#10;opaque column names.Discovering duplicates among data sets with unaligned&amp;#10;schemas is more difficult than in the usual setting, because&amp;#10;it is not clear which fields in one object should be compared&amp;#10;with which fields in the other. We have developed a new&amp;#10;algorithm that efficiently finds the most likely duplicates in&amp;#10;such a setting. Now, our schema matching algorithm is able&amp;#10;to identify corresponding attributes by comparing data values&amp;#10;within those duplicate records. An experimental study&amp;#10;on real-world data shows the effectiveness of this approach.</abstract></paper><paper><title>Representing and Querying Data Transformations</title><author><AuthorName>Yannis Velegrakis</AuthorName><institute><InstituteName>AT&amp;T Labs - Researc</InstituteName><country></country></institute></author><author><AuthorName>Ren閑 J. Miller</AuthorName><institute><InstituteName>University of Toront</InstituteName><country></country></institute></author><author><AuthorName>John Mylopoulos</AuthorName><institute><InstituteName>University of Toront</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Modern information systems often store data that has&amp;#10;been transformed and integrated from a variety of sources.&amp;#10;This integration may obscure the original source semantics&amp;#10;of data items. For many tasks, it is important to be&amp;#10;able to determine not only where data items originated,&amp;#10;but also why they appear in the integration as they do and&amp;#10;through what transformation they were derived. This problem&amp;#10;is known as data provenance. In this work, we consider&amp;#10;data provenance at the schema and mapping level. In particular,&amp;#10;we consider how to answer questions such as &amp;quot;what&amp;#10;schema elements in the source(s) contributed to this value&amp;quot;,&amp;#10;or &amp;quot;through what transformations or mappings was this&amp;#10;value derived?&amp;quot; Towards this end, we elevate schemas and&amp;#10;mappings to first-class citizens that are stored in a repository&amp;#10;and are associated with the actual data values. An&amp;#10;extended query language, called MXQL, is also developed&amp;#10;that allows meta-data to be queried as regular data and we&amp;#10;describe its implementation. scenario.</abstract></paper><paper><title>Bypass Caching: Making Scientific Databases Good Network Citizens</title><author><AuthorName>Tanu Malik</AuthorName><institute><InstituteName>Johns Hopkins Universit</InstituteName><country></country></institute></author><author><AuthorName>Randal Burns</AuthorName><institute><InstituteName>Johns Hopkins Universit</InstituteName><country></country></institute></author><author><AuthorName>Amitabh Chaudhary</AuthorName><institute><InstituteName>University of Notre Dam</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Scientific database federations are geographically distributed&amp;#10;and network bound. Thus, they could benefit&amp;#10;from proxy caching. 
However, existing caching techniques&amp;#10;are not suitable for their workloads, which compare&amp;#10;and join large data sets. Existing techniques reduce parallelism&amp;#10;by conducting distributed queries in a single cache&amp;#10;and lose the data reduction benefits of performing selections&amp;#10;at each database. We develop the bypass-yield formulation&amp;#10;of caching, which reduces network traffic in&amp;#10;wide-area database federations, while preserving parallelism&amp;#10;and data reduction. Bypass-yield caching is altruistic;&amp;#10;caches minimize the overall network traffic generated&amp;#10;by the federation, rather than focusing on local performance.&amp;#10;We present an adaptive, workload-driven algorithm&amp;#10;for managing a bypass-yield cache. We also develop&amp;#10;on-line algorithms that make no assumptions about workload:&amp;#10;a k-competitive deterministic algorithm and a randomized&amp;#10;algorithm with minimal space complexity. We&amp;#10;verify the efficacy of bypass-yield caching by running workload&amp;#10;traces collected from the Sloan Digital Sky Survey&amp;#10;through a prototype implementation.</abstract></paper><paper><title>Asymmetric Batch Incremental View Maintenance</title><author><AuthorName>Hao He</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Junyi Xie</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Jun Yang</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Hai Yu</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Incremental view maintenance has found a growing&amp;#10;number of applications recently, including data warehousing,&amp;#10;continuous query processing, publish/subscribe systems,&amp;#10;etc. Batch processing of base table modifications,&amp;#10;when applicable, can be much more efficient than processing&amp;#10;individual modifications one at a time. In this paper, we&amp;#10;tackle the problem of finding the most efficient batch incremental&amp;#10;maintenance strategy under a refresh response time&amp;#10;constraint; that is, at any point in time, the system, upon&amp;#10;request, must be able to bring the view up to date within&amp;#10;a specified amount of time. The traditional approach is&amp;#10;to process all batched modifications relevant to the view&amp;#10;whenever the constraint is violated. However, we observe&amp;#10;that there often exists natural asymmetry among different&amp;#10;components of the maintenance cost; for example, &amp;#10;modifications on one base table might be cheaper to process than&amp;#10;those on another base table because of some index. We exploit&amp;#10;such asymmetries using an unconventional strategy&amp;#10;that selectively processes modifications on some base tables&amp;#10;while keeping batching others. We present a series&amp;#10;of analytical results leading to the development of practical&amp;#10;algorithms that approximate an &amp;quot;oracle algorithm&amp;quot;&amp;#10;with perfect knowledge of the future. 
With experiments on&amp;#10;a TPC-R database, we demonstrate that our strategy offers&amp;#10;substantial performance gains over traditional deferred&amp;#10;view maintenance techniques.</abstract></paper><paper><title>Adaptive Caching for Continuous Queries</title><author><AuthorName>Shivnath Babu</AuthorName><institute><InstituteName>Stanford Universit</InstituteName><country></country></institute></author><author><AuthorName>Kamesh Munagala</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Jennifer Widom</AuthorName><institute><InstituteName>Stanford Universit</InstituteName><country></country></institute></author><author><AuthorName>Rajeev Motwani</AuthorName><institute><InstituteName>Stanford Universit</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We address the problem of executing continuous multiway&amp;#10;join queries in unpredictable and volatile environments.&amp;#10;Our query class captures windowed join queries in&amp;#10;data stream systems as well as conventional maintenance&amp;#10;of materialized join views. Our adaptive approach handles&amp;#10;streams of updates whose rates and data characteristics&amp;#10;may change over time, as well as changes in system&amp;#10;conditions such as memory availability. In this paper&amp;#10;we focus specifically on the problem of adaptive placement&amp;#10;and removal of caches to optimize join performance.&amp;#10;Our approach automatically considers conventional tree-shaped&amp;#10;join plans with materialized subresults at every intermediate&amp;#10;node, subresult-free MJoins, and the entire spectrum&amp;#10;between them. We provide algorithms for selecting&amp;#10;caches, monitoring their cost and benefits in current conditions,&amp;#10;allocating memory to caches, and adapting as conditions&amp;#10;change. All of our algorithms are implemented in the&amp;#10;STREAM prototype Data Stream Management System and&amp;#10;a thorough experimental evaluation is included.</abstract></paper><paper><title>Snapshot Queries: Towards Data-Centric Sensor Networks</title><author><AuthorName>Yannis Kotidis</AuthorName><institute><InstituteName>AT&amp;T Labs-Researc</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In this paper we introduce the idea of snapshot queries&amp;#10;for energy efficient data acquisition in sensor networks.&amp;#10;Network nodes generate models of their surrounding environment&amp;#10;that are used for electing, using a localized algorithm,&amp;#10;a small set of representative nodes in the network.&amp;#10;These representative nodes constitute a network snapshot&amp;#10;and can be used to provide quick approximate answers to&amp;#10;user queries while reducing substantially the energy consumption&amp;#10;in the network. We present a detailed experimental&amp;#10;study of our framework and algorithms, varying multiple&amp;#10;parameters like the available memory of the sensor nodes,&amp;#10;their transmission range, the network message loss etc. 
Depending&amp;#10;on the configuration, snapshot queries provide a&amp;#10;reduction of up to 90% in the number of nodes that need to&amp;#10;participate in a user query.</abstract></paper><paper><title>Exploiting Correlated Attributes in Acquisitional Query Processing</title><author><AuthorName>Amol Deshpande</AuthorName><institute><InstituteName>University of Marylan</InstituteName><country></country></institute></author><author><AuthorName>Carlos Guestrin</AuthorName><institute><InstituteName>Carnegi Mellon Universit</InstituteName><country></country></institute></author><author><AuthorName>Wei Hong</AuthorName><institute><InstituteName>Intel Research Berkele</InstituteName><country></country></institute></author><author><AuthorName>Samuel Madden</AuthorName><institute><InstituteName>Massachussets Institute of Technolog</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Sensor networks and other distributed information systems&amp;#10;(such as the Web) must frequently access data that has a high&amp;#10;per-attribute acquisition cost, in terms of energy, latency, or&amp;#10;computational resources. When executing queries that contain&amp;#10;several predicates over such expensive attributes, we observe&amp;#10;that it can be beneficial to use correlations to automatically introduce&amp;#10;low-cost attributes whose observation will allow the query&amp;#10;processor to better estimate the selectivity of these expensive&amp;#10;predicates. In particular, we show how to build conditional&amp;#10;plans that branch into one or more sub-plans, each with a different&amp;#10;ordering for the expensive query predicates, based on the&amp;#10;runtime observation of low-cost attributes. We frame the problem&amp;#10;of constructing the optimal conditional plan for a given user&amp;#10;query and set of candidate low-cost attributes as an optimization&amp;#10;problem. We describe an exponential time algorithm for finding&amp;#10;such optimal plans, and describe a polynomial-time heuristic&amp;#10;for identifying conditional plans that perform well in practice.&amp;#10;We also show how to compactly model conditional probability&amp;#10;distributions needed to identify correlations and build these&amp;#10;plans. We evaluate our algorithms against several real-world&amp;#10;sensor-network data sets, showing several-times performance&amp;#10;increases for a variety of queries versus traditional optimization&amp;#10;techniques.</abstract></paper><paper><title>Data Triage: An Adaptive Architecture for Load Shedding in TelegraphCQ</title><author><AuthorName>Frederick Reiss</AuthorName><institute><InstituteName>University of California at Berkele</InstituteName><country></country></institute></author><author><AuthorName>Joseph M. Hellerstein</AuthorName><institute><InstituteName>Intel Research Berkele</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Many of the data sources used in stream query processing&amp;#10;are known to exhibit bursty behavior. Data in a burst often&amp;#10;has different characteristics than steady-state data, and&amp;#10;therefore may be of particular interest. 
In this paper, we&amp;#10;describe the Data Triage architecture that we are adding to&amp;#10;TelegraphCQ to provide low latency results with good accuracy&amp;#10;under such bursts.</abstract></paper><paper><title>AutoLag: Automatic Discovery of Lag Correlations in Stream Data</title><author><AuthorName>Yasushi Sakurai</AuthorName><institute><InstituteName>NTT Cyber Space Laboratorie</InstituteName><country></country></institute></author><author><AuthorName>Spiros Papadimitriou</AuthorName><institute><InstituteName>Carnegie Mellon Universit</InstituteName><country></country></institute></author><author><AuthorName>Christos Faloutsos</AuthorName><institute><InstituteName>Carnegie Mellon Universit</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>Adaptive Processing of Top-k Queries in XML</title><author><AuthorName>Am閘ie Marian</AuthorName><institute><InstituteName>Columbia Universit</InstituteName><country></country></institute></author><author><AuthorName>Sihem Amer-Yahia</AuthorName><institute><InstituteName>AT&amp;T Labs-Researc</InstituteName><country></country></institute></author><author><AuthorName>Nick Koudas</AuthorName><institute><InstituteName>AT&amp;T Labs-Researc</InstituteName><country></country></institute></author><author><AuthorName>Divesh Srivastava</AuthorName><institute><InstituteName>AT&amp;T Labs-Researc</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The ability to compute top-k matches to XML queries is&amp;#10;gaining importance due to the increasing number of large&amp;#10;XML repositories. The efficiency of top-k query evaluation&amp;#10;relies on using scores to prune irrelevant answers as early&amp;#10;as possible in the evaluation process. In this context, evaluating&amp;#10;the same query plan for all answers might be too rigid&amp;#10;because, at any time in the evaluation, answers have gone&amp;#10;through the same number and sequence of operations, which&amp;#10;limits the speed at which scores grow. Therefore, adaptive&amp;#10;query processing that permits different plans for different&amp;#10;partial matches and maximizes the best scores is more appropriate.&amp;#10;In this paper, we propose an architecture and&amp;#10;adaptive algorithms for efficiently computing top-k matches&amp;#10;to XML queries. Our techniques can be used to evaluate&amp;#10;both exact and approximate matches where approximation&amp;#10;is defined by relaxing XPath axes. In order to compute the&amp;#10;scores of query answers, we extend the traditional tf*idf&amp;#10;measure to account for document structure. 
We conduct extensive&amp;#10;experiments on a variety of benchmark data and&amp;#10;queries, and demonstrate the usefulness of the adaptive approach&amp;#10;for computing top-k queries in XML.</abstract></paper><paper><title>Progressive Distributed Top-k Retrieval in Peer-to-Peer Networks</title><author><AuthorName>Wolf-Tilo Balke</AuthorName><institute><InstituteName>University of California at Berkele</InstituteName><country></country></institute></author><author><AuthorName>Wolfgang Nejdl</AuthorName><institute><InstituteName>University of Hannove</InstituteName><country></country></institute></author><author><AuthorName>Wolf Siberski</AuthorName><institute><InstituteName>University of Hannove</InstituteName><country></country></institute></author><author><AuthorName>Uwe Thaden</AuthorName><institute><InstituteName>University of Hannove</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Query processing in traditional information management&amp;#10;systems has moved from an exact match model to&amp;#10;more flexible paradigms allowing cooperative retrieval&amp;#10;by aggregating the database objects&amp;#8217; degree of match for&amp;#10;each different query predicate and returning the best&amp;#10;matching objects only. In peer-to-peer systems such&amp;#10;strategies are even more important, given the potentially&amp;#10;large number of peers, which may contribute to the results.&amp;#10;Yet current peer-to-peer research has barely started&amp;#10;to investigate such approaches. In this paper we will discuss&amp;#10;the benefits of best match/top-k queries in the context&amp;#10;of distributed peer-to-peer information infrastructures&amp;#10;and show how to extend the limited query processing in&amp;#10;current peer-to-peer networks by allowing the distributed&amp;#10;processing of top-k queries, while maintaining a minimum&amp;#10;of data traffic. Relying on a super-peer backbone&amp;#10;organized in the HyperCuP topology we will show how to&amp;#10;use local indexes for optimizing the necessary query routing&amp;#10;and how to process intermediate results in inner network&amp;#10;nodes at the earliest possible point in time cutting&amp;#10;down the necessary data traffic within the network. Our&amp;#10;algorithm is based on dynamically collected query statistics&amp;#10;only, no continuous index update processes are necessary,&amp;#10;allowing it to scale easily to large numbers of&amp;#10;peers, as well as dynamic additions/deletions of peers.&amp;#10;We will show our approach to always deliver correct result&amp;#10;sets and to be optimal in terms of necessary object&amp;#10;accesses and data traffic. 
Finally, we present simulation&amp;#10;results for both static and dynamic network environments.</abstract></paper><paper><title>Optimizing Access Cost for Top-k Queries over Web Sources: A Unified Cost-Based Approach</title><author><AuthorName>Seung-won Hwang</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaig</InstituteName><country></country></institute></author><author><AuthorName>Kevin Chen-Chuan Chang</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaig</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>A Framework for High-Accuracy Privacy-Preserving Mining</title><author><AuthorName>Shipra Agrawal</AuthorName><institute><InstituteName>Indian Institute of Science - Bangalor</InstituteName><country></country></institute></author><author><AuthorName>Jayant R. Haritsa</AuthorName><institute><InstituteName>Indian Institute of Science - Bangalor</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>To preserve client privacy in the data mining process, a&amp;#10;variety of techniques based on random perturbation of individual&amp;#10;data records have been proposed recently. In this&amp;#10;paper, we present FRAPP, a generalized matrix-theoretic&amp;#10;framework of random perturbation, which facilitates a systematic&amp;#10;approach to the design of perturbation mechanisms&amp;#10;for privacy-preserving mining. Specifically, FRAPP is used&amp;#10;to demonstrate that (a) the prior techniques differ only in&amp;#10;their choices for the perturbation matrix elements, and (b)&amp;#10;a symmetric perturbation matrix with minimal condition&amp;#10;number can be identified, maximizing the accuracy even&amp;#10;under strict privacy guarantees. We also propose a novel&amp;#10;perturbation mechanism wherein the matrix elements are&amp;#10;themselves characterized as random variables, and demonstrate&amp;#10;that this feature provides significant improvements in&amp;#10;privacy at only a marginal cost in accuracy.The quantitative utility of FRAPP, which applies to&amp;#10;random-perturbation-based privacy-preserving mining in&amp;#10;general, is evaluated specifically with regard to frequent-itemset&amp;#10;mining on a variety of real datasets. Our experimental&amp;#10;results indicate that, for a given privacy requirement,&amp;#10;substantially lower errors are incurred, with respect&amp;#10;to both itemset identity and itemset support, as compared to&amp;#10;the prior techniques.</abstract></paper><paper><title>Top-Down Specialization for Information and Privacy Preservation</title><author><AuthorName>Benjamin C. M. Fung</AuthorName><institute><InstituteName>Simon Fraser Universit</InstituteName><country></country></institute></author><author><AuthorName>Ke Wang</AuthorName><institute><InstituteName>Simon Fraser Universit</InstituteName><country></country></institute></author><author><AuthorName>Philip S. Yu</AuthorName><institute><InstituteName>IBMT. J. Watson Research Cente</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Releasing person-specific data in its most specific state&amp;#10;poses a threat to individual privacy. 
This paper presents a&amp;#10;practical and efficient algorithm for determining a generalized&amp;#10;version of data that masks sensitive information and&amp;#10;remains useful for modelling classification. The generalization&amp;#10;of data is implemented by specializing or detailing the&amp;#10;level of information in a top-down manner until a minimum&amp;#10;privacy requirement is violated. This top-down specialization&amp;#10;is natural and efficient for handling both categorical&amp;#10;and continuous attributes. Our approach exploits the fact&amp;#10;that data usually contains redundant structures for &amp;#10;classification. While generalization may eliminate some structures,&amp;#10;other structures emerge to help. Our results show that&amp;#10;quality of classification can be preserved even for highly restrictive&amp;#10;privacy requirements. This work has great applicability&amp;#10;to both public and private sectors that share information&amp;#10;for mutual benefits and productivity.</abstract></paper><paper><title>Data Privacy through Optimal k-Anonymization</title><author><AuthorName>Roberto J. Bayardo</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><author><AuthorName>Rakesh Agrawal</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Data de-identification reconciles the demand for release of&amp;#10;data for research purposes and the demand for privacy&amp;#10;from individuals. This paper proposes and evaluates an&amp;#10;optimization algorithm for the powerful de-identification&amp;#10;procedure known as k-anonymization. A k-anonymized&amp;#10;dataset has the property that each record is indistinguishable&amp;#10;from at least k - 1 others. Even simple restrictions of&amp;#10;optimized k-anonymity are NP-hard, leading to significant&amp;#10;computational challenges. We present a new approach to&amp;#10;exploring the space of possible anonymizations that tames&amp;#10;the combinatorics of the problem, and develop data-management&amp;#10;strategies to reduce reliance on expensive operations&amp;#10;such as sorting. Through experiments on real census&amp;#10;data, we show the resulting algorithm can find optimal &amp;#10;k-anonymizations under two representative cost measures&amp;#10;and a wide range of k. We also show that the algorithm&amp;#10;can produce good anonymizations in circumstances where&amp;#10;the input data or input parameters preclude finding an&amp;#10;optimal solution in reasonable time. Finally, we use the&amp;#10;algorithm to explore the effects of different coding&amp;#10;approaches and problem variations on anonymization&amp;#10;quality and performance. To our knowledge, this is the first&amp;#10;result demonstrating optimal k-anonymization of a nontrivial&amp;#10;dataset under a general model of the problem.&amp;#10;</abstract></paper><paper><title>A Comparative Evaluation of Transparent Scaling Techniques for Dynamic Content Servers</title><author><AuthorName>C. Amza</AuthorName><institute><InstituteName>University of Toront</InstituteName><country></country></institute></author><author><AuthorName>A. L. Cox</AuthorName><institute><InstituteName>Rice Universit</InstituteName><country></country></institute></author><author><AuthorName>W. 
Zwaenepoel</AuthorName><institute><InstituteName>EPF</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We study several transparent techniques for scaling dynamic&amp;#10;content web sites, and we evaluate their relative impact&amp;#10;when used in combination. Full transparency implies&amp;#10;strong data consistency as perceived by the user, no &amp;#10;modifications to existing dynamic content site tiers and no additional&amp;#10;programming effort from the user or site administrator&amp;#10;upon deployment.We study strategies for scheduling and load balancing&amp;#10;queries on a cluster of replicated database back-ends. We&amp;#10;also investigate transparent query caching as a means of&amp;#10;enhancing database replication.Our work shows that, on an experimental platform with&amp;#10;up to 8 database replicas, the various techniques work in&amp;#10;synergy to improve overall scaling for the e-commerce TPCW&amp;#10;benchmark. We rank the techniques necessary for high&amp;#10;performance in order of impact as follows. Key among the&amp;#10;strategies are scheduling strategies, such as conflict-aware&amp;#10;scheduling, that minimize consistency maintainance over-heads.&amp;#10;The choice of load balancing strategy is less important.&amp;#10;Transparent query result caching increases performance&amp;#10;significantly at any given cluster size for a mostly-read&amp;#10;workload. Its benefits are limited for write-intensive&amp;#10;workloads, where content-aware scheduling is the only&amp;#10;scaling option.</abstract></paper><paper><title>SemCast: Semantic Multicast for Content-Based Data Dissemination</title><author><AuthorName>Olga Papaemmanouil</AuthorName><institute><InstituteName>Brown Universit</InstituteName><country></country></institute></author><author><AuthorName>U&amp;#287;ur 莈tintemel</AuthorName><institute><InstituteName>Brown Universit</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We address the problem of content-based&amp;#10;dissemination of highly-distributed, high-volume data&amp;#10;streams for stream-based monitoring applications and&amp;#10;large-scale data delivery. Existing content-based&amp;#10;dissemination approaches commonly rely on distributed&amp;#10;filtering trees that require filtering at all brokers on the&amp;#10;tree. We present a new semantic multicast approach that&amp;#10;eliminates the need for content-based filtering at interior&amp;#10;brokers and facilitates fine-grained control over the&amp;#10;construction of efficient dissemination trees. The central&amp;#10;idea is to split the incoming data streams (based on their&amp;#10;contents, rates, and destinations) and then spread the&amp;#10;pieces across multiple channels, each of which is&amp;#10;implemented as an independent dissemination tree. We&amp;#10;present the basic design and evaluation of SemCast, an&amp;#10;overlay-network based system that implements this&amp;#10;semantic multicast approach. 
Through a detailed&amp;#10;simulation study and realistic network topologies, we&amp;#10;demonstrate that SemCast significantly improves the&amp;#10;efficiency of dissemination compared to traditional&amp;#10;approaches.</abstract></paper><paper><title>A Distributed Quadtree Index for Peer-to-Peer Settings</title><author><AuthorName>Egemen Tanin</AuthorName><institute><InstituteName>University of Melbourn</InstituteName><country></country></institute></author><author><AuthorName>Aaron Harwood</AuthorName><institute><InstituteName>University of Melbourn</InstituteName><country></country></institute></author><author><AuthorName>Hanan Samet</AuthorName><institute><InstituteName>University of Maryland at College Par</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>DUP: Dynamic-Tree Based Update Propagation in Peer-to-Peer Networks</title><author><AuthorName>Liangzhong Yin</AuthorName><institute><InstituteName>Pennsylvania State Universit</InstituteName><country></country></institute></author><author><AuthorName>Guohong Cao</AuthorName><institute><InstituteName>Pennsylvania State Universit</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In peer-to-peer networks, indices are used to map data id&amp;#10;to nodes that host the data. The performance of data access&amp;#10;can be improved by actively pushing indices to interested&amp;#10;nodes. This paper proposes the Dynamic-tree based Update&amp;#10;Propagation (DUP) scheme, which builds the update propagation&amp;#10;tree to facilitate the propagation of indices. Because&amp;#10;the update propagation tree only involves nodes that are essential&amp;#10;for update propagation, the overhead of DUP is very&amp;#10;small and the query latency is significantly reduced.</abstract></paper><paper><title>Vectorizing and Querying Large XML Repositories</title><author><AuthorName>Peter Buneman</AuthorName><institute><InstituteName>University of Edinburg</InstituteName><country></country></institute></author><author><AuthorName>Byron Choi</AuthorName><institute><InstituteName>University of Edinburg</InstituteName><country></country></institute></author><author><AuthorName>Wenfei Fan</AuthorName><institute><InstituteName>University of Edinburg</InstituteName><country></country></institute></author><author><AuthorName>Robert Hutchison</AuthorName><institute><InstituteName>University of Edinburg</InstituteName><country></country></institute></author><author><AuthorName>Robert Mann</AuthorName><institute><InstituteName>University of Edinburg</InstituteName><country></country></institute></author><author><AuthorName>Stratis D. Viglas</AuthorName><institute><InstituteName>University of Edinburg</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Vertical partitioning is a well-known technique for optimizing&amp;#10;query performance in relational databases. An extreme form of&amp;#10;this technique, which we call vectorization, is to store each column&amp;#10;separately. We use a generalization of vectorization as the&amp;#10;basis for a native XML store. 
The idea is to decompose an XML&amp;#10;document into a set of vectors that contain the data values and a&amp;#10;compressed skeleton that describes the structure. In order to query&amp;#10;this representation and produce results in the same vectorized format,&amp;#10;we consider a practical fragment of XQuery and introduce&amp;#10;the notion of query graphs and a novel graph reduction algorithm&amp;#10;that allows us to leverage relational optimization techniques as&amp;#10;well as to reduce the unnecessary loading of data vectors and decompression&amp;#10;of skeletons. A preliminary experimental study based&amp;#10;on some scientific and synthetic XML data repositories in the order&amp;#10;of gigabytes supports the claim that these techniques are scalable&amp;#10;and have the potential to provide performance comparable with&amp;#10;established relational database technology.</abstract></paper><paper><title>IMAX: Incremental Maintenance of Schema-Based XML Statistics</title><author><AuthorName>Maya Ramanath</AuthorName><institute><InstituteName>Indian Institute of Scienc</InstituteName><country></country></institute></author><author><AuthorName>Lingzhi Zhang</AuthorName><institute><InstituteName>OGI/OHS</InstituteName><country></country></institute></author><author><AuthorName>Juliana Freire</AuthorName><institute><InstituteName>OGI/OHS</InstituteName><country></country></institute></author><author><AuthorName>Jayant R. Haritsa</AuthorName><institute><InstituteName>Indian Institute of Scienc</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Current approaches for estimating the cardinality of&amp;#10;XML queries are applicable to a static scenario wherein&amp;#10;the underlying XML data does not change subsequent to&amp;#10;the collection of statistics on the repository. However, in&amp;#10;practice, many XML-based applications are dynamic and&amp;#10;involve frequent updates to the data. In this paper, we investigate&amp;#10;efficient strategies for incrementally maintaining statistical&amp;#10;summaries as and when updates are applied to the&amp;#10;data. Specifically, we propose algorithms that handle both&amp;#10;the addition of new documents as well as random insertions&amp;#10;in the existing document trees. We also show, through a detailed&amp;#10;performance evaluation, that our incremental techniques&amp;#10;are significantly faster than the naive recomputation&amp;#10;approach; and that estimation accuracy can be maintained&amp;#10;even with a fixed memory budget.</abstract></paper><paper><title>BOXes: Efficient Maintenance of Order-Based Labeling for Dynamic XML Data</title><author><AuthorName>Adam Silberstein</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Hao He</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Ke Yi</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><author><AuthorName>Jun Yang</AuthorName><institute><InstituteName>Duke Universit</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Order-based element labeling for tree-structured XML&amp;#10;data is an important technique in XML processing. 
It lies at&amp;#10;the core of many fundamental XML operations such as containment&amp;#10;join and twig matching. While labeling for static&amp;#10;XML documents is well understood, less is known about&amp;#10;how to maintain accurate labeling for dynamic XML documents,&amp;#10;when elements and subtrees are inserted and deleted.&amp;#10;Most existing approaches do not work well for arbitrary update&amp;#10;patterns; they either produce unacceptably long labels&amp;#10;or incur enormous relabeling costs. We present two novel&amp;#10;I/O-efficient data structures, W-BOX and B-BOX, that &amp;#10;efficiently maintain labeling for large, dynamic XML documents.&amp;#10;We show analytically and experimentally that both, despite&amp;#10;consuming minimal amounts of storage, gracefully handle&amp;#10;arbitrary update patterns without sacrificing lookup &amp;#10;efficiency. The two structures together provide a nice tradeoff&amp;#10;between update and lookup costs: W-BOX has logarithmic&amp;#10;amortized update cost and constant worst-case lookup&amp;#10;cost, while B-BOX has constant amortized update cost and&amp;#10;logarithmic worst-case lookup cost. We further propose techniques&amp;#10;to eliminate the lookup cost for read-heavy workloads.</abstract></paper><paper><title>Efficient Inverted Lists and Query Algorithms for Structured Value Ranking in Update-Intensive Relational Databases</title><author><AuthorName>Lin Guo</AuthorName><institute><InstituteName>Cornell Universit</InstituteName><country></country></institute></author><author><AuthorName>Jayavel Shanmugasundaram</AuthorName><institute><InstituteName>Cornell Universit</InstituteName><country></country></institute></author><author><AuthorName>Kevin Beyer</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><author><AuthorName>Eugene Shekita</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We propose a new ranking paradigm for relational&amp;#10;databases called Structured Value Ranking (SVR). SVR uses&amp;#10;structured data values to score (rank) the results of keyword&amp;#10;search queries over text columns. 
Our main contribution&amp;#10;is a new family of inverted list indices and associated&amp;#10;query algorithms that can support SVR efficiently in&amp;#10;update-intensive databases, where the structured data values&amp;#10;(and hence the scores of documents) change frequently.&amp;#10;Our experimental results on real and synthetic data sets using&amp;#10;BerkeleyDB show that we can support SVR efficiently in&amp;#10;relational databases.</abstract></paper><paper><title>Compressing Bitmap Indices by Data Reorganization</title><author><AuthorName>Ali P韓ar</AuthorName><institute><InstituteName>Lawrence Berkeley National Laborator</InstituteName><country></country></institute></author><author><AuthorName>Tao Tao</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaig</InstituteName><country></country></institute></author><author><AuthorName>Hakan Ferhatosmanoglu</AuthorName><institute><InstituteName>Ohio State Universit</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Many scientific applications generate massive volumes&amp;#10;of data through observations or computer simulations,&amp;#10;bringing up the need for effective indexing&amp;#10;methods for efficient storage and retrieval of scientific&amp;#10;data. Unlike conventional databases, scientific data is&amp;#10;mostly read-only and its volume can reach to the order&amp;#10;of petabytes, making a compact index structure vital.&amp;#10;Bitmap indexing has been successfully applied to scientific&amp;#10;databases by exploiting the fact that scientific data&amp;#10;are enumerated or numerical. Bitmap indices can be&amp;#10;compressed with variants of run length encoding for a&amp;#10;compact index structure. However even this may not&amp;#10;be enough for the enormous data generated in some&amp;#10;applications such as high energy physics. In this paper,&amp;#10;we study how to reorganize bitmap tables for improved&amp;#10;compression rates. Our algorithms are used just as a&amp;#10;preprocessing step, thus there is no need to revise the&amp;#10;current indexing techniques and the query processing&amp;#10;algorithms. We introduce the tuple reordering problem,&amp;#10;which aims to reorganize database tuples for optimal&amp;#10;compression rates. We propose Gray code ordering algorithm&amp;#10;for this NP-Complete problem, which is an inplace&amp;#10;algorithm, and runs in linear time in the order&amp;#10;of the size of the database. We also discuss how the&amp;#10;tuple reordering problem can be reduced to the traveling&amp;#10;salesperson problem. 
Our experimental results on&amp;#10;real data sets show that the compression ratio can be&amp;#10;improved by a factor of 2 to 10.</abstract></paper><paper><title>Practical Data Management Techniques for Vehicle Tracking Data</title><author><AuthorName>Sotiris Brakatsoulas</AuthorName><institute><InstituteName>Research Academic Computer Technology Institut</InstituteName><country></country></institute></author><author><AuthorName>Dieter Pfoser</AuthorName><institute><InstituteName>Research Academic Computer Technology Institut</InstituteName><country></country></institute></author><author><AuthorName>Nectaria Tryfona</AuthorName><institute><InstituteName>Research Academic Computer Technology Institut</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>A novel data source for assessing traffic conditions is&amp;#10;floating car data (FCD) in the form of vehicle tracking&amp;#10;data, or, in database terms, trajectory data. This work&amp;#10;proposes practical data management techniques including&amp;#10;data pre-processing, data modeling and indexing to&amp;#10;support the analysis and the data mining of vehicle&amp;#10;tracking data</abstract></paper><paper><title>On Discovery of Extremely Low-Dimensional Clusters Using Semi-Supervised Projected Clustering</title><author><AuthorName>Kevin Y. Yip</AuthorName><institute><InstituteName>University of Hong Kon</InstituteName><country></country></institute></author><author><AuthorName>David W. Cheung</AuthorName><institute><InstituteName>University of Hong Kon</InstituteName><country></country></institute></author><author><AuthorName>Michael K. Ng</AuthorName><institute><InstituteName>University of Hong Kon</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Recent studies suggest that projected clusters with extremely&amp;#10;low dimensionality exist in many real datasets. A&amp;#10;number of projected clustering algorithms have been proposed&amp;#10;in the past several years, but few can identify clusters&amp;#10;with dimensionality lower than 10% of the total number&amp;#10;of dimensions, which are commonly found in some real&amp;#10;datasets such as gene expression profiles. In this paper we&amp;#10;propose a new algorithm that can accurately identify projected&amp;#10;clusters with relevant dimensions as few as 5% of the&amp;#10;total number of dimensions. It makes use of a robust objective&amp;#10;function that combines object clustering and dimension&amp;#10;selection into a single optimization problem. The algorithm&amp;#10;can also utilize domain knowledge in the form of labeled&amp;#10;objects and labeled dimensions to improve its clustering accuracy.&amp;#10;We believe this is the first semi-supervised projected&amp;#10;clustering algorithm. Both theoretical analysis and experimental&amp;#10;results show that by using a small amount of input&amp;#10;knowledge, possibly covering only a portion of the underlying&amp;#10;classes, the new algorithm can be further improved to&amp;#10;accurately detect clusters with only 1% of the dimensions&amp;#10;being relevant. 
<paper><title>Clustering Aggregation</title><author><AuthorName>Aristides Gionis</AuthorName><institute><InstituteName>University of Helsinki</InstituteName><country></country></institute></author><author><AuthorName>Heikki Mannila</AuthorName><institute><InstituteName>University of Helsinki</InstituteName><country></country></institute></author><author><AuthorName>Panayiotis Tsaparas</AuthorName><institute><InstituteName>University of Helsinki</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We consider the following problem: given a set of clusterings, find a clustering that agrees as much as possible with the given clusterings. This problem, clustering aggregation, appears naturally in various contexts. For example, clustering categorical data is an instance of the problem: each categorical variable can be viewed as a clustering of the input rows. Moreover, clustering aggregation can be used as a meta-clustering method to improve the robustness of clusterings. The problem formulation does not require a priori information about the number of clusters, and it gives a natural way of handling missing values. We give a formal statement of the clustering-aggregation problem, we discuss related work, and we suggest a number of algorithms. For several of the methods we provide theoretical guarantees on the quality of the solutions. We also show how sampling can be used to scale the algorithms for large data sets. We give an extensive empirical evaluation demonstrating the usefulness of the problem and of the solutions.</abstract></paper>
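To make the problem statement concrete, here is a minimal Python sketch of the setting (ours, not code from the paper): the disagreement distance between two clusterings counts the object pairs that one clustering puts in the same cluster and the other separates, and a simple baseline aggregator returns whichever input clustering minimizes total disagreement with the rest. The paper's algorithms and guarantees go beyond this baseline.

from itertools import combinations

def disagreements(c1, c2):
    """c1 and c2 map object -> cluster label over the same objects;
    count the pairs on which the two clusterings disagree."""
    objs = list(c1)
    return sum((c1[a] == c1[b]) != (c2[a] == c2[b])
               for a, b in combinations(objs, 2))

def best_of_k(clusterings):
    """Baseline aggregation: return the input clustering with the least
    total disagreement against all inputs."""
    return min(clusterings,
               key=lambda c: sum(disagreements(c, other) for other in clusterings))

c1 = {"a": 0, "b": 0, "c": 1, "d": 1}
c2 = {"a": 0, "b": 1, "c": 1, "d": 1}
c3 = {"a": 0, "b": 0, "c": 1, "d": 2}
print(best_of_k([c1, c2, c3]))  # c1 minimizes total disagreement here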
<paper><title>Mining Cross-Graph Quasi-Cliques in Gene Expression and Protein Interaction Data</title><author><AuthorName>Jian Pei</AuthorName><institute><InstituteName>Simon Fraser University</InstituteName><country></country></institute></author><author><AuthorName>Daxin Jiang</AuthorName><institute><InstituteName>State University of New York at Buffalo</InstituteName><country></country></institute></author><author><AuthorName>Aidong Zhang</AuthorName><institute><InstituteName>State University of New York at Buffalo</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper>
<paper><title>Mining Closed Relational Graphs with Connectivity Constraints</title><author><AuthorName>Xifeng Yan</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaign</InstituteName><country></country></institute></author><author><AuthorName>X. Jasmine Zhou</AuthorName><institute><InstituteName>University of Southern California</InstituteName><country></country></institute></author><author><AuthorName>Jiawei Han</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaign</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper>
<paper><title>Efficient Creation and Incremental Maintenance of the HOPI Index for Complex XML Document Collections</title><author><AuthorName>Ralf Schenkel</AuthorName><institute><InstituteName>Max-Planck-Institut für Informatik</InstituteName><country></country></institute></author><author><AuthorName>Anja Theobald</AuthorName><institute><InstituteName>Max-Planck-Institut für Informatik</InstituteName><country></country></institute></author><author><AuthorName>Gerhard Weikum</AuthorName><institute><InstituteName>Max-Planck-Institut für Informatik</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The HOPI index, a connection index for XML documents based on the concept of a 2-hop cover, provides space- and time-efficient reachability tests along the ancestor, descendant, and link axes to support path expressions with wildcards in XML search engines. This paper presents enhanced algorithms for building HOPI, shows how to augment the index with distance information, and discusses incremental index maintenance. Our experiments show substantial improvements over the existing divide-and-conquer algorithm for index creation, low space overhead for including distance information in the index, and efficient updates.</abstract></paper>
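The 2-hop cover concept on which HOPI is built admits a one-line reachability test: each node u stores a set Lout(u) of hubs reachable from u and a set Lin(u) of hubs that reach u, chosen so that u reaches v exactly when Lout(u) and Lin(v) intersect. Keeping these label sets small is the hard part that the paper's construction and maintenance algorithms address; the hand-made labels below, for the toy graph a -> h -> c with b -> h, are only an illustration.

# Hand-made 2-hop labels for the toy graph a -> h -> c, b -> h.
# Each node appears in its own Lout and Lin so that self-reachability holds.
Lout = {"a": {"a", "h"}, "b": {"b", "h"}, "c": {"c"}, "h": {"h"}}
Lin = {"a": {"a"}, "b": {"b"}, "c": {"c", "h"}, "h": {"h"}}

def reaches(u, v):
    """u can reach v iff some hub is both reachable from u and reaches v."""
    return bool(Lout[u] & Lin[v])

print(reaches("a", "c"), reaches("c", "a"))  # True False (a reaches c via hub h)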
<paper><title>On the Sequencing of Tree Structures for XML Indexing</title><author><AuthorName>Haixun Wang</AuthorName><institute><InstituteName>IBM T. J. Watson Research Center</InstituteName><country></country></institute></author><author><AuthorName>Xiaofeng Meng</AuthorName><institute><InstituteName>Renmin University of China</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Sequence-based XML indexing aims at avoiding expensive join operations in query processing. It transforms structured XML data into sequences so that a structured query can be answered holistically through subsequence matching. In this paper, we address the problem of query equivalence with respect to this transformation, and we introduce a performance-oriented principle for sequencing tree structures. With query equivalence, XML queries can be performed through subsequence matching without join operations, post-processing, or other special handling for problems such as false alarms. We identify a class of sequencing methods for this purpose, and we present a novel subsequence matching algorithm that observes query equivalence. Still, query equivalence is just a prerequisite for sequence-based XML indexing. Our goal is to find the best sequencing strategy with regard to the time and space complexity of indexing and querying XML data. To this end, we introduce a performance-oriented principle to guide the sequencing of tree structures. For any given XML dataset, the principle finds an optimal sequencing strategy according to its schema and its data distribution. We present a novel method that realizes this principle. In our experiments, we show the advantages of sequence-based indexing over traditional XML indexing methods, and we compare several sequencing strategies and demonstrate the benefit of the performance-oriented sequencing principle.</abstract></paper>
<paper><title>VLEI Code: An Efficient Labeling Method for Handling XML Documents in an RDB</title><author><AuthorName>Kazuhito Kobayashi</AuthorName><institute><InstituteName>Tokyo Institute of Technology</InstituteName><country></country></institute></author><author><AuthorName>Wenxin Liang</AuthorName><institute><InstituteName>Tokyo Institute of Technology</InstituteName><country></country></institute></author><author><AuthorName>Dai Kobayashi</AuthorName><institute><InstituteName>Tokyo Institute of Technology</InstituteName><country></country></institute></author><aut
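The transformation that the tree-sequencing abstract above rests on can be sketched briefly. The Python toy below shows only the generic structure-encoded sequencing that earlier sequence-based indexes build on, not the paper's performance-oriented strategy: a tree is flattened in preorder into (label, root-path) pairs, and a twig query, encoded the same way, is answered by ordered subsequence matching with prefix tests on the paths. A naive version of this scheme can need post-processing for false alarms, which is exactly what the paper's query-equivalence treatment is designed to avoid.

def encode(node, path=""):
    """Preorder-flatten a tree given as (label, [children]) into
    (label, path-of-ancestor-labels) pairs."""
    label, children = node
    seq = [(label, path)]
    for child in children:
        seq.extend(encode(child, path + "/" + label))
    return seq

def matches(query, data):
    """Ordered (non-contiguous) subsequence test; a query pair matches a data
    pair when labels agree and the query path is a prefix of the data path."""
    i = 0
    for label, path in data:
        if i < len(query) and label == query[i][0] and path.startswith(query[i][1]):
            i += 1
    return i == len(query)

doc = ("P", [("S", [("L", [])]), ("L", [])])  # P holding S (which holds L) and L
query = ("P", [("S", [("L", [])])])           # twig: P containing S containing L
print(matches(encode(query), encode(doc)))    # True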
