
📄 icde_2006_elementary.txt

📁 Written using lwp::get
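The note above says this file was produced with lwp::get. A minimal sketch of such a fetch, assuming Perl's LWP::Simple and a placeholder URL (the actual source address of the proceedings data is not given here):

    # Hypothetical sketch: fetch a page with LWP::Simple's get() and save it.
    # The URL and output filename are placeholders, not the file's real source.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $url     = 'http://example.org/icde2006/proceedings.xml';  # placeholder
    my $content = get($url);                                      # undef on failure
    die "Fetch failed for $url\n" unless defined $content;

    open my $fh, '>', 'icde_2006_elementary.txt' or die "Cannot write: $!\n";
    print {$fh} $content;
    close $fh;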
💻 TXT
📖 Page 1 of 5
<proceedings><paper><title>General Chairs</title><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>External Reviewers</title><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>MiniCount: Efficient Rewriting of COUNT-Queries Using Views</title><author><AuthorName>Vaclav Lin</AuthorName><institute><InstituteName>University of Economics, Czech Republi</InstituteName><country></country></institute></author><author><AuthorName>Vasilis Vassalos</AuthorName><institute><InstituteName>Athens University of Economics and Business, Greec</InstituteName><country></country></institute></author><author><AuthorName>Prodromos Malakasiotis</AuthorName><institute><InstituteName>Athens University of Economics and Business, Greec</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We present MiniCount, the first efficient sound and&amp;#10;complete algorithm for finding maximally contained&amp;#10;rewritings of conjunctive queries with count, using conjunctive&amp;#10;views with count and conjunctive views without&amp;#10;aggregation. An efficient and scalable solution to this&amp;#10;problem yields significant benefits for data warehousing&amp;#10;and decision support systems, as well as for powerful&amp;#10;data integration systems.We first present a naive rewriting&amp;#10;algorithm implicit in the recent theoretical results by&amp;#10;Cohen et al. [5] and identify three independent sources of&amp;#10;exponential complexity in the naive algorithm, including&amp;#10;an expensive containment check. Then we present&amp;#10;and discuss MiniCount and prove it sound and complete.&amp;#10;We also present an experimental study that shows Mini-&amp;#10;Count to be orders of magnitude faster than the naive&amp;#10;algorithm, and to be able to scale to large numbers of&amp;#10;views</abstract></paper><paper><title>Updates Through Views: A New Hope</title><author><AuthorName>Yannis Kotidis</AuthorName><institute><InstituteName>AT&amp;T Lab</InstituteName><country></country></institute></author><author><AuthorName>Divesh Srivastava</AuthorName><institute><InstituteName>AT&amp;T Lab</InstituteName><country></country></institute></author><author><AuthorName>Yannis Velegrakis</AuthorName><institute><InstituteName>AT&amp;T Lab</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Database views are extensively used to represent unmaterialized&amp;#10;tables. Applications rarely distinguish between a&amp;#10;materialized base table and a virtual view, thus, they may&amp;#10;issue update requests on the views. Since views are virtual,&amp;#10;update requests on them need to be translated to updates on&amp;#10;the base tables. Existing literature has shown the difficulty&amp;#10;of translating view updates in a side-effect free manner. To&amp;#10;address this problem, we propose a novel approach for separating&amp;#10;the data instance into a logical and a physical level.&amp;#10;This separation allows us to achieve side-effect free translations&amp;#10;of any kind of update on the view. Furthermore,&amp;#10;deletes on a view can be translated without affecting the&amp;#10;base tables. 
We describe the implementation of the framework&amp;#10;and present our experimental results</abstract></paper><paper><title>Learning from Aggregate Views</title><author><AuthorName>Bee-Chung Chen</AuthorName><institute><InstituteName>University of Wisconsi</InstituteName><country></country></institute></author><author><AuthorName>Lei Chen</AuthorName><institute><InstituteName>University of Wisconsi</InstituteName><country></country></institute></author><author><AuthorName>Raghu Ramakrishnan</AuthorName><institute><InstituteName>University of Wisconsi</InstituteName><country></country></institute></author><author><AuthorName>David R. Musicant</AuthorName><institute><InstituteName>Carleton Colleg</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In this paper, we introduce a new class of data mining&amp;#10;problems called learning from aggregate views. In&amp;#10;contrast to the traditional problem of learning from a&amp;#10;single table of training examples, the new goal is to learn&amp;#10;from multiple aggregate views of the underlying data,&amp;#10;without access to the un-aggregated data. We motivate&amp;#10;this new problem, present a general problem framework,&amp;#10;develop learning methods for RFA (Restriction-Free&amp;#10;Aggregate) views defined using COUNT, SUM, AVG and&amp;#10;STDEV, and offer theoretical and experimental results that&amp;#10;characterize the proposed methods.</abstract></paper><paper><title>C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking</title><author><AuthorName>Dong Xin</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaig</InstituteName><country></country></institute></author><author><AuthorName>Zheng Shao</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaig</InstituteName><country></country></institute></author><author><AuthorName>Jiawei Han</AuthorName><institute><InstituteName>University of Illinois at Urbana-Champaig</InstituteName><country></country></institute></author><author><AuthorName>Hongyan Liu</AuthorName><institute><InstituteName>Tsinghua University, Chin</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>It is well recognized that data cubing often produces&amp;#10;huge outputs. Two popular efforts devoted to this problem&amp;#10;are (1) iceberg cube, where only significant cells&amp;#10;are kept, and (2) closed cube, where a group of cells&amp;#10;which preserve roll-up/drill-down semantics are losslessly&amp;#10;compressed to one cell. Due to its usability and&amp;#10;importance, efficient computation of closed cubes still&amp;#10;warrants a thorough study.&amp;#10;In this paper, we propose a new measure, called&amp;#10;closedness, for efficient closed data cubing. We show&amp;#10;that closedness is an algebraic measure and can be computed&amp;#10;efficiently and incrementally. Based on closedness&amp;#10;measure, we develop an an aggregation-based approach,&amp;#10;called C-Cubing (i.e., Closed-Cubing), and integrate&amp;#10;it into two successful iceberg cubing algorithms:&amp;#10;MM-Cubing and Star-Cubing. Our performance study&amp;#10;shows that C-Cubing runs almost one order of magnitude&amp;#10;faster than the previous approaches. 
We further&amp;#10;study how the performance of the alternative algorithms&amp;#10;of C-Cubing varies w.r.t the properties of the data sets.</abstract></paper><paper><title>A Primitive Operator for Similarity Joins in Data Cleaning</title><author><AuthorName>Surajit Chaudhuri</AuthorName><institute><InstituteName>Microsoft Researc</InstituteName><country></country></institute></author><author><AuthorName>Venkatesh Ganti</AuthorName><institute><InstituteName>Microsoft Researc</InstituteName><country></country></institute></author><author><AuthorName>Raghav Kaushik</AuthorName><institute><InstituteName>Microsoft Researc</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Data cleaning based on similarities involves identification&amp;#10;of &amp;quot;close&amp;quot; tuples, where closeness is evaluated using a&amp;#10;variety of similarity functions chosen to suit the domain and&amp;#10;application. Current approaches for efficiently implementing&amp;#10;such similarity joins are tightly tied to the chosen similarity&amp;#10;function. In this paper, we propose a new primitive&amp;#10;operator which can be used as a foundation to implement&amp;#10;similarity joins according to a variety of popular string similarity&amp;#10;functions, and notions of similarity which go beyond&amp;#10;textual similarity. We then propose efficient implementations&amp;#10;for this operator. In an experimental evaluation using real&amp;#10;datasets, we show that the implementation of similarity joins&amp;#10;using our operator is comparable to, and often substantially&amp;#10;better than, previous customized implementations for particular&amp;#10;similarity functions.</abstract></paper><paper><title>Techniques for Warehousing of Sample Data</title><author><AuthorName>Paul G. Brown</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><author><AuthorName>Peter J. Haas</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We consider the problem of maintaining a warehouse of&amp;#10;sampled data that &amp;quot;shadows&amp;quot; a full-scale data warehouse,&amp;#10;in order to support quick approximate analytics and metadata&amp;#10;discovery. The full-scale warehouse comprises many&amp;#10;&amp;quot;data sets,&amp;quot; where a data set is a bag of values; the data sets&amp;#10;can vary enormously in size. The values constituting a data&amp;#10;set can arrive in batch or stream form. We provide and compare&amp;#10;several new algorithms for independent and parallel&amp;#10;uniform random sampling of data-set partitions, where the&amp;#10;partitions are created by dividing the batch or splitting the&amp;#10;stream. We also provide novel methods for merging samples&amp;#10;to create a uniform sample from an arbitrary union&amp;#10;of data-set partitions. Our sampling/merge methods are the&amp;#10;first to simultaneously support statistical uniformity, a priori&amp;#10;bounds on the sample footprint, and concise sample storage.&amp;#10;As partitions are rolled in and out of the warehouse, the&amp;#10;corresponding samples are rolled in and out of the sample&amp;#10;warehouse. 
In this manner our sampling methods approximate&amp;#10;the behavior of more sophisticated stream-sampling&amp;#10;methods, while also supporting parallel processing. Experiments&amp;#10;indicate that our methods are efficient and scalable,&amp;#10;and provide guidance for their application.</abstract></paper><paper><title>Working Models for Uncertain Data</title><author><AuthorName>Anish Das Sarma</AuthorName><institute><InstituteName>Stanford Universit</InstituteName><country></country></institute></author><author><AuthorName>Omar Benjelloun</AuthorName><institute><InstituteName>Stanford Universit</InstituteName><country></country></institute></author><author><AuthorName>Alon Halevy</AuthorName><institute><InstituteName>University of Washingto</InstituteName><country></country></institute></author><author><AuthorName>Jennifer Widom</AuthorName><institute><InstituteName>Stanford Universit</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>This paper explores an inherent tension in modeling and&amp;#10;querying uncertain data: simple, intuitive representations of&amp;#10;uncertain data capture many application requirements, but&amp;#10;these representations are generally incomplete&amp;#8213;standard operations&amp;#10;over the data may result in unrepresentable types of&amp;#10;uncertainty. Complete models are theoretically attractive, but&amp;#10;they can be nonintuitive and more complex than necessary&amp;#10;for many applications. To address this tension, we propose a&amp;#10;two-layer approach to managing uncertain data: an underlying&amp;#10;logical model that is complete, and one or more working&amp;#10;models that are easier to understand, visualize, and query,&amp;#10;but may lose some information. We explore the space of incomplete&amp;#10;working models, place several of them in a strict&amp;#10;hierarchy based on expressive power, and study their closure&amp;#10;properties. We describe how the two-layer approach is being&amp;#10;used in our prototype DBMS for uncertain data, and we identify&amp;#10;a number of interesting open problems to fully realize the&amp;#10;approach.</abstract></paper><paper><title>Reasoning About Approximate Match Query Results</title><author><AuthorName>Sudipto Guha</AuthorName><institute><InstituteName>U of Pennsylvani</InstituteName><country></country></institute></author><author><AuthorName>Nick Koudas</AuthorName><institute><InstituteName>U of Toront</InstituteName><country></country></institute></author><author><AuthorName>Divesh Srivastava</AuthorName><institute><InstituteName>AT&amp;T Labs Researc</InstituteName><country></country></institute></author><author><AuthorName>Xiaohui Yu</AuthorName><institute><InstituteName>U of Toront</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Join techniques deploying approximate match predicates&amp;#10;are fundamental data cleaning operations. A variety of predicates&amp;#10;have been utilized to quantify approximate match in such&amp;#10;operations and some have been embedded in a declarative data&amp;#10;cleaning framework. 
These techniques return pairs of tuples&amp;#10;from both relations, tagged with a score, signifying the degree&amp;#10;of similarity between the tuples in the pair according to the&amp;#10;specific approximate match predicate.&amp;#10;In this paper, we consider the problem of estimating various&amp;#10;parameters on the output of declarative approximate join&amp;#10;algorithms for planning purposes. Such algorithms are highly&amp;#10;time consuming, so precise knowledge of the result size as well&amp;#10;as its score distribution is a pressing concern. This knowledge&amp;#10;aids decisions as to which operations are more promising for&amp;#10;identifying highly similar tuples, which is a key operation for&amp;#10;data cleaning. We propose solution strategies that fully comply&amp;#10;with a declarative framework and analytically reason about&amp;#10;the quality of the estimates we obtain as well as the performance&amp;#10;of our strategies.We present the results of a detailed performance evaluation&amp;#10;of all strategies proposed. Our experimental results validate&amp;#10;our analytical expectations and shed additional light on the&amp;#10;quality and performance of our estimation framework. Our&amp;#10;study offers a set of simple, fully declarative techniques for&amp;#10;this problem, which can be readily deployed in data cleaning&amp;#10;systems.</abstract></paper><paper><title>The Gauss-Tree: Efficient Object Identification in Databases of Probabilistic Feature Vectors</title><author><AuthorName>Christian Bohm</AuthorName><institute><InstituteName>University of Munich, German</InstituteName><country></country></institute></author><author><AuthorName>Alexey Pryakhin</AuthorName><institute><InstituteName>University of Munich, German</InstituteName><country></country></institute></author><author><AuthorName>Matthias Schubert</AuthorName><institute><InstituteName>University of Munich, German</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In applications of biometric databases the typical task&amp;#10;is to identify individuals according to features which are&amp;#10;not exactly known. Reasons for this inexactness are varying&amp;#10;measuring techniques or environmental circumstances.&amp;#10;Since these circumstances are not necessarily the same&amp;#10;when determining the features for different individuals, the&amp;#10;exactness might strongly vary between the individuals as&amp;#10;well as between the features. To identify individuals, similarity&amp;#10;search on feature vectors is applicable, but even the&amp;#10;use of adaptable distance measures is not capable to handle&amp;#10;objects having an individual level of exactness. Therefore,&amp;#10;we develop a comprehensive probabilistic theory in&amp;#10;which uncertain observations are modeled by probabilistic&amp;#10;feature vectors (pfv), i.e. feature vectors where the conventional&amp;#10;feature values are replaced by Gaussian probability&amp;#10;distribution functions. Each feature value of each object&amp;#10;is complemented by a variance value indicating its uncertainty.&amp;#10;We define two types of identification queries, k-mostlikely&amp;#10;identification and threshold identification. For efficient&amp;#10;query processing, we propose a novel index structure,&amp;#10;the Gauss-tree. 
Our experimental evaluation demonstrates&amp;#10;that pfv stored in a Gauss-tree significantly improve the result&amp;#10;quality compared to traditional feature vectors. Additionally,&amp;#10;we show that the Gauss-tree significantly speeds&amp;#10;up query times compared to competitive methods.</abstract></paper><paper><title>Finding Fastest Paths on A Road Network with Speed Patterns</title><author><AuthorName>Evangelos Kanoulas</AuthorName><institute><InstituteName>Northeastern Universit</InstituteName><country></country></institute></author><author><AuthorName>Yang Du</AuthorName><institute><InstituteName>Northeastern Universit</InstituteName><country></country></institute></author><author><AuthorName>Tian Xia</AuthorName><institute><InstituteName>Northeastern Universit</InstituteName><country></country></institute></author><author><AuthorName>Donghui Zhang</AuthorName><institute><InstituteName>Northeastern Universit</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>This paper proposes and solves the Time-Interval All&amp;#10;Fastest Path (allFP) query. Given a user-defined leaving or&amp;#10;arrival time interval I, a source node s and an end node e,&amp;#10;allFP asks for a set of all fastest paths from s to e, one for&amp;#10;each sub-interval of I. Note that the query algorithm should&amp;#10;find a partitioning of I into sub-intervals. Existing methods&amp;#10;can only be used to solve a very special case of the problem,&amp;#10;when the leaving time is a single time instant. A straightforward&amp;#10;solution to the allFP query is to run existing methods&amp;#10;many times, once for every time instant in I. This paper&amp;#10;proposes a solution based on novel extensions to the A* algorithm.&amp;#10;Instead of expanding the network many times, we&amp;#10;expand once. The travel time on a path is kept as a function&amp;#10;of leaving time. Methods to combine travel-time functions&amp;#10;are provided to expand a path. A novel lower-bound estimator&amp;#10;for travel time is proposed. Performance results reveal&amp;#10;that our method is more efficient and more accurate than&amp;#10;the discrete-time approach.</abstract></paper><paper><title>Approximation Techniques for Indexing the Earth Mover&amp;#8217;s Distance in Multimedia Databases</title><author><AuthorName>Ira Assent</AuthorName><institute><InstituteName>RWTH Aachen University, German</InstituteName><country></country></institute></author><author><AuthorName>Andrea Wenning</AuthorName><institute><InstituteName>EXAConsult GmbH, German</InstituteName><country></country></institute></author><author><AuthorName>Thomas Seidl</AuthorName><institute><InstituteName>RWTH Aachen University, German</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Todays abundance of storage coupled with digital technologies&amp;#10;in virtually any scientific or commercial application&amp;#10;such as medical and biological imaging or music&amp;#10;archives deal with tremendous quantities of images, videos&amp;#10;or audio files stored in large multimedia databases. For&amp;#10;content-based data mining and retrieval purposes suitable&amp;#10;similarity models are crucial. The Earth Mover&amp;#8217;s Distance&amp;#10;was introduced in Computer Vision to better approach human&amp;#10;perceptual similarities. 
Its computation, however, is&amp;#10;too complex for usage in interactive multimedia database&amp;#10;scenarios. In order to enable efficient query processing&amp;#10;in large databases, we propose an index-supported multistep&amp;#10;algorithm. We therefore develop new lower bounding&amp;#10;approximation techniques for the Earth Mover&amp;#8217;s Distance&amp;#10;which satisfy high quality criteria including completeness&amp;#10;(no false drops), index-suitability and fast computation. We&amp;#10;demonstrate the efficiency of our approach in extensive experiments&amp;#10;on large image databases</abstract></paper><paper><title>Indexing for Dynamic Abstract Regions</title><author><AuthorName>Joxan Jaffar</AuthorName><institute><InstituteName>National University of Singapor</InstituteName><country></country></institute></author><author><AuthorName>Roland H.C. Yap</AuthorName><institute><InstituteName>National University of Singapor</InstituteName><country></country></institute></author><author><AuthorName>Kenny Q. Zhu</AuthorName><institute><InstituteName>Microsoft Corporatio</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We propose a new main memory index structure for abstract&amp;#10;regions (objects) which may heavily overlap, the RCtree.&amp;#10;These objects are &amp;quot;dynamic&amp;quot; and may have short life&amp;#10;spans. The novelty is that rather than representing an object&amp;#10;by its minimum bounding rectangle (MBR), possibly&amp;#10;with pre-processed segmentation into many small MBRs,&amp;#10;we use the actual shape of the object to maintain the index.&amp;#10;This saves significant space for objects with large spatial&amp;#10;extents since pre-segmentation is not needed. We show&amp;#10;that the query performance of RC-tree is much better than&amp;#10;many indexing schemes on synthetic overlapping data sets.&amp;#10;The performance is also competitive on real-life GIS nonoverlapping&amp;#10;data sets.</abstract></paper><paper><title>Efficient Processing of Updates in Dynamic XML Data</title><author><AuthorName>Changqing Li</AuthorName><institute><InstituteName>National University of Singapor</InstituteName><country></country></institute></author><author><AuthorName>Tok Wang Ling</AuthorName><institute><InstituteName>National University of Singapor</InstituteName><country></country></institute></author><author><AuthorName>Min Hu</AuthorName><institute><InstituteName>National University of Singapor</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>It is important to process the updates when nodes are&amp;#10;inserted into or deleted from the XML tree. All the&amp;#10;existing labeling schemes have high update cost, thus in&amp;#10;this paper we propose a novel Compact Dynamic Binary&amp;#10;String (CDBS) encoding to efficiently process the&amp;#10;updates. CDBS has two important properties which form&amp;#10;the foundations of this paper: (1) CDBS supports that&amp;#10;codes can be inserted between any two consecutive CDBS&amp;#10;codes with the orders kept and without re-encoding the&amp;#10;existing codes; (2) CDBS is orthogonal to specific&amp;#10;labeling schemes, thus it can be applied broadly to&amp;#10;different labeling schemes or other applications to&amp;#10;efficiently process the updates. 
We report our&amp;#10;experimental results to show that our CDBS is superior to&amp;#10;previous approaches to process updates in terms of the&amp;#10;number of nodes to re-label and the time for updating.</abstract></paper><paper><title>A Complete and Efficient Algebraic Compiler for XQuery</title><author><AuthorName>Christopher Re</AuthorName><institute><InstituteName>University of Washingto</InstituteName><country></country></institute></author><author><AuthorName>Jerome Simeon</AuthorName><institute><InstituteName>IBM T.J. Watson Research Cente</InstituteName><country></country></institute></author><author><AuthorName>Mary Fernandez</AuthorName><institute><InstituteName>AT&amp;T Lab</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>As XQuery nears standardization, more sophisticated&amp;#10;XQuery applications are emerging, which often exploit the&amp;#10;entire language and are applied to non-trivial XML sources.&amp;#10;We propose an algebra and optimization techniques that are&amp;#10;suitable for building an XQuery compiler that is complete,&amp;#10;correct, and efficient. We describe the compilation rules for&amp;#10;the complete language into that algebra and present novel&amp;#10;optimization techniques that address the needs of complex&amp;#10;queries. These techniques include new query unnesting&amp;#10;rewritings and specialized join algorithms that account for&amp;#10;XQuery&amp;#8217;s complex predicate semantics. The algebra and&amp;#10;optimizations are implemented in the Galax XQuery engine,&amp;#10;and yield execution plans that are up to three orders of magnitude&amp;#10;faster than earlier versions of Galax.</abstract></paper><paper><title>Making Designer Schemas with Colors</title><author><AuthorName>Nuwee Wiwatwattana</AuthorName><institute><InstituteName>U of Michiga</InstituteName><country></country></institute></author><author><AuthorName>H. V. Jagadish</AuthorName><institute><InstituteName>U of Michiga</InstituteName><country></country></institute></author><author><AuthorName>Laks V. S. Lakshmanan</AuthorName><institute><InstituteName>U of British Columbi</InstituteName><country></country></institute></author><author><AuthorName>Divesh Srivastava</AuthorName><institute><InstituteName>AT&amp;T Lab</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>XML schema design has two opposing goals: elimination&amp;#10;of update anomalies requires that the schema be as&amp;#10;normalized as possible; yet higher query performance and&amp;#10;simpler query expression are often obtained through the&amp;#10;use of schemas that permit redundancy. In this paper, we&amp;#10;show that the recently proposed MCT data model, which extends&amp;#10;XML by adding colors, can be used to address this dichotomy&amp;#10;effectively. Specifically, we formalize the intuition&amp;#10;of anomaly avoidance in MCT using notions of node normal&amp;#10;and edge normal forms, and the goal of efficient query&amp;#10;processing using notions of association recoverability and&amp;#10;direct recoverability. We develop algorithms for transforming&amp;#10;design specifications given as ER diagrams into MCT&amp;#10;schemas that are in a node or edge normal form and satisfy&amp;#10;association or direct recoverability. 
Experimental results&amp;#10;using a wide variety of ER diagrams validate the benefits of&amp;#10;our design methodology.</abstract></paper><paper><title>Mining Actionable Patterns by Role Models</title><author><AuthorName>Ke Wang</AuthorName><institute><InstituteName>Simon Fraser Universit</InstituteName><country></country></institute></author><author><AuthorName>Yuelong Jiang</AuthorName><institute><InstituteName>Simon Fraser Universit</InstituteName><country></country></institute></author><author><AuthorName>Alexander Tuzhilin</AuthorName><institute><InstituteName>New York Universit</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Data mining promises to discover valid and potentially&amp;#10;useful patterns in data. Often, discovered patterns are not&amp;#10;useful to the user.&amp;quot;Actionability&amp;quot; addresses this problem in&amp;#10;that a pattern is deemed actionable if the user can act upon&amp;#10;it in her favor. We introduce the notion of &amp;quot;action&amp;quot; as a&amp;#10;domain-independent way to model the domain knowledge.&amp;#10;Given a data set about actionable features and an utility&amp;#10;measure, a pattern is actionable if it summarizes a population&amp;#10;that can be acted upon towards a more promising&amp;#10;population observed with a higher utility. We present several&amp;#10;pruning strategies taking into account the actionability&amp;#10;requirement to reduce the search space, and algorithms for&amp;#10;mining all actionable patterns as well as mining the top k&amp;#10;actionable patterns. We evaluate the usefulness of patterns&amp;#10;and the focus of search on a real-world application domain.</abstract></paper><paper><title>Systematic Approach for Optimizing Complex Mining Tasks on Multiple Databases</title><author><AuthorName>Ruoming Jin</AuthorName><institute><InstituteName>Kent State Universit</InstituteName><country></country></institute></author><author><AuthorName>Gagan Agrawal</AuthorName><institute><InstituteName>The Ohio State Universit</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Many real world applications involve not just a single&amp;#10;dataset, but a view of multiple datasets. These datasets may&amp;#10;be collected from different sources and/or at different time instances.&amp;#10;In such scenarios, comparing patterns or features&amp;#10;from different datasets and understanding their relationships&amp;#10;can be an extremely important part of the KDD process. This&amp;#10;paper considers the problem of optimizing a mining task over&amp;#10;multiple datasets, when it has been expressed using a highlevel&amp;#10;interface. Specifically, we make the following contributions:&amp;#10;1) We present an SQL-based mechanism for querying&amp;#10;frequent patterns across multiple datasets, and establish&amp;#10;an algebra for these queries. 2) We develop a systematic&amp;#10;method for enumerating query plans and present several algorithms&amp;#10;for finding optimized query plan which reduce execution&amp;#10;costs. 
3) We evaluate our algorithms on real and synthetic&amp;#10;datasets, and show up to an order of magnitude performance&amp;#10;improvement</abstract></paper><paper><title>New Sampling-Based Estimators for OLAP Queries</title><author><AuthorName>Ruoming Jin</AuthorName><institute><InstituteName>Kent State Universit</InstituteName><country></country></institute></author><author><AuthorName>Leo Glimcher</AuthorName><institute><InstituteName>Ohio State Universit</InstituteName><country></country></institute></author><author><AuthorName>Chris Jermaine</AuthorName><institute><InstituteName>University of Florid</InstituteName><country></country></institute></author><author><AuthorName>Gagan Agrawal</AuthorName><institute><InstituteName>Ohio State Universit</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>One important way in which sampling for approximate query&amp;#10;processing in a database environment differs from traditional&amp;#10;applications of sampling is that in a database, it is feasible to&amp;#10;collect accurate summary statistics from the data in addition&amp;#10;to the sample. This paper describes a set of sampling-based&amp;#10;estimators for approximate query processing that make use&amp;#10;of simple summary statistics to to greatly increase the accuracy&amp;#10;of sampling-based estimators. Our estimators are&amp;#10;able to give tight probabilistic guarantees on estimation accuracy.&amp;#10;They are suitable for low or high dimensional data,&amp;#10;and work with categorical or numerical attributes. Furthermore,&amp;#10;the information used by our estimators can easily be&amp;#10;gathered in a single pass, making them suitable for use in a&amp;#10;streaming environment.</abstract></paper><paper><title>Efficiently Evaluating Order Preserving Similarity Queries over Historical Market-Basket Data</title><author><AuthorName>Reza Sherkat</AuthorName><institute><InstituteName>University of Albert</InstituteName><country></country></institute></author><author><AuthorName>Davood Rafiei</AuthorName><institute><InstituteName>University of Albert</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We introduce a new domain-independent framework for&amp;#10;formulating and efficiently evaluating similarity queries&amp;#10;over historical data, where given a history as a sequence&amp;#10;of timestamped observations and the pair-wise similarity&amp;#10;of observations, we want to find similar histories. For instance,&amp;#10;given a database of customer transactions and a&amp;#10;time period, we can find customers with similar purchasing&amp;#10;behaviors over this period. Our work is different from&amp;#10;the work on retrieving similar time series; it addresses the&amp;#10;general problem in which a history cannot be modeled as&amp;#10;a time series, hence the relevant conventional approaches&amp;#10;are not applicable. We derive a similarity measure for histories,&amp;#10;based on an aggregation of the similarities between&amp;#10;the observations of the two histories, and propose efficient&amp;#10;algorithms for finding an optimal alignment between two&amp;#10;histories. 
Given the non-metric nature of our measure, we&amp;#10;develop some upper bounds and an algorithm that makes&amp;#10;use of those bounds to prune histories that are guaranteed&amp;#10;not to be in the answer set. Our experimental results on&amp;#10;real and synthetic data confirm the effectiveness and efficiency&amp;#10;of our approach. For instance, when the minimum&amp;#10;length of a match is provided, our algorithm achieves up to&amp;#10;an order of magnitude speed-up over alternative methods.</abstract></paper><paper><title>End-biased Samples for Join Cardinality Estimation</title><author><AuthorName>Cristian Estan</AuthorName><institute><InstituteName>University of Wisconsin-Madiso</InstituteName><country></country></institute></author><author><AuthorName>Jeffrey F. Naughton</AuthorName><institute><InstituteName>University of Wisconsin-Madiso</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We present a new technique for using samples to estimate&amp;#10;join cardinalities. This technique, which we term&amp;#10;&amp;quot;end-biased samples,&amp;quot; is inspired by recent work in network&amp;#10;traffic measurement. It improves on random samples&amp;#10;by using coordinated pseudo-random samples and retaining&amp;#10;the sampled values in proportion to their frequency. We&amp;#10;show that end-biased samples always provide more accurate&amp;#10;estimates than random samples with the same sample&amp;#10;size. The comparison with histograms is more interesting&amp;#10;&amp;#8213; while end-biased histograms are somewhat better than&amp;#10;end-biased samples for uncorrelated data sets, end-biased&amp;#10;samples dominate by a large margin when the data is correlated.&amp;#10;Finally, we compare end-biased samples to the recently&amp;#10;proposed &amp;quot;skimmed sketches&amp;quot; and show that neither&amp;#10;dominates the other, that each has different and compelling&amp;#10;strengths and weaknesses. These results suggest that endbiased&amp;#10;samples may be a useful addition to the repertoire of&amp;#10;techniques used for data summarization.</abstract></paper><paper><title>Laws for Rewriting Queries Containing Division Operators</title><author><AuthorName>Ralf Rantzau</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><author><AuthorName>Christoph Mangold</AuthorName><institute><InstituteName>Universitat Stuttgart, German</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Relational division, also known as small divide, is&amp;#10;derived operator of the relational algebra that realizes&amp;#10;many-to-one set containment test, where a set is represented&amp;#10;as a group of tuples: Small divide discovers which&amp;#10;sets in a dividend relation contain all elements of the set&amp;#10;stored in a divisor relation. The great divide operator extends&amp;#10;small divide by realizing many-to-many set containment&amp;#10;tests. It is also similar to the set containment join&amp;#10;operator for schemas that are not in first normal form.&amp;#10;Neither small nor great divide has been implemented in&amp;#10;commercial relational database systems although the operators&amp;#10;solve important problems and many efficient algorithms&amp;#10;for them exist. 
We present algebraic laws that allow&amp;#10;rewriting expressions containing small or great divide,&amp;#10;illustrate their importance for query optimization, and discuss&amp;#10;the use of great divide for frequent itemset discovery,&amp;#10;an important data mining primitive.A recent theoretic result shows that small divide must be&amp;#10;implemented by special purpose algorithms and not be simulated&amp;#10;by pure relational algebra expressions to achieve efficiency.&amp;#10;Consequently, an efficient implementation requires&amp;#10;that the optimizer treats small divide as a first-class operator&amp;#10;and possesses powerful algebraic laws for query rewriting.</abstract></paper><paper><title>R-trees with Update Memos</title><author><AuthorName>Xiaopeng Xiong</AuthorName><institute><InstituteName>Purdue Universit</InstituteName><country></country></institute></author><author><AuthorName>Walid G. Aref</AuthorName><institute><InstituteName>Purdue Universit</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The problem of frequently updating multi-dimensional&amp;#10;indexes arises in many location-dependent applications.&amp;#10;While the R-tree and its variants are one of the dominant&amp;#10;choices for indexing multi-dimensional objects, the R-tree&amp;#10;exhibits inferior performance in the presence of frequent updates.&amp;#10;In this paper, we present an R-tree variant, termed&amp;#10;the RUM-tree (stands for R-tree with Update Memo) that&amp;#10;minimizes the cost of object updates. The RUM-tree processes&amp;#10;updates in a memo-based approach that avoids disk&amp;#10;accesses for purging old entries during an update process.&amp;#10;Therefore, the cost of an update operation in the RUM-tree&amp;#10;reduces to the cost of only an insert operation. The removal&amp;#10;of old object entries is carried out by a garbage cleaner inside&amp;#10;the RUM-tree. In this paper, we present the details of&amp;#10;the RUM-tree and study its properties. Theoretical analysis&amp;#10;and experimental evaluation demonstrate that the RUMtree&amp;#10;outperforms other R-tree variants by up to a factor of&amp;#10;eight in scenarios with frequent updates.</abstract></paper><paper><title>Compiled Query Execution Engine using JVM</title><author><AuthorName>Jun Rao</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><author><AuthorName>Hamid Pirahesh</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><author><AuthorName>C. Mohan</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><author><AuthorName>Guy Lohman</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>A conventional query execution engine in a&amp;#10;database system essentially uses a SQL virtual&amp;#10;machine (SVM) to interpret a dataflow tree in&amp;#10;which each node is associated with a relational&amp;#10;operator. During query evaluation, a single tuple&amp;#10;at a time is processed and passed among the&amp;#10;operators. Such a model is popular because of its&amp;#10;efficiency for pipelined processing. 
However,&amp;#10;since each operator is implemented statically, it&amp;#10;has to be very generic in order to deal with all&amp;#10;possible queries. Such generality tends to&amp;#10;introduce significant runtime inefficiency,&amp;#10;especially in the context of memory-resident&amp;#10;systems, because the granularity of data&amp;#10;commercial system, using SVM.&amp;#10;processing (a tuple) is too small compared with&amp;#10;the associated overhead. Another disadvantage in&amp;#10;such an engine is that each operator code is&amp;#10;compiled statically, so query-specific&amp;#10;optimization cannot be applied.&amp;#10;To improve runtime efficiency, we propose a&amp;#10;compiled execution engine, which, for a given&amp;#10;query, generates new query-specific code on the&amp;#10;fly, and then dynamically compiles and executes&amp;#10;the code. The Java platform makes our approach&amp;#10;particularly interesting for several reasons: (1)&amp;#10;modern Java Virtual Machines (JVM) have Just-&amp;#10;In-Time (JIT) compilers that optimize code at&amp;#10;runtime based on the execution pattern, a key&amp;#10;feature that SVMs lack; (2) because of Java&amp;#8217;s&amp;#10;continued popularity, JVMs keep improving at a&amp;#10;faster pace than SVMs, allowing us to exploit&amp;#10;new advances in the Java runtime in the future;&amp;#10;(3) Java is a dynamic language, which makes it&amp;#10;convenient to load a piece of new code on the&amp;#10;fly. In this paper, we develop both an interpreted&amp;#10;and a compiled query execution engine in a&amp;#10;relational, Java-based, in-memory database&amp;#10;prototype, and perform an experimental study.&amp;#10;Our experimental results on the TPC-H data set&amp;#10;show that, despite both engines benefiting from&amp;#10;JIT, the compiled engine runs on average about&amp;#10;twice as fast as the interpreted one, and&amp;#10;significantly faster than an in-memory</abstract></paper><paper><title>\ell -Diversity: Privacy Beyond \kappa -Anonymity</title><author><AuthorName>Ashwin Machanavajjhala</AuthorName><institute><InstituteName>Cornell Universit</InstituteName><country></country></institute></author><author><AuthorName>Johannes Gehrke</AuthorName><institute><InstituteName>Cornell Universit</InstituteName><country></country></institute></author><author><AuthorName>Daniel Kifer</AuthorName><institute><InstituteName>Cornell Universit</InstituteName><country></country></institute></author><author><AuthorName>Muthuramakrishnan Venkitasubramaniam</AuthorName><institute><InstituteName>Cornell Universit</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Publishing data about individuals without revealing sensitive&amp;#10;information about them is an important problem. In&amp;#10;recent years, a new definition of privacy called \kappa-anonymity&amp;#10;has gained popularity. In a \kappa-anonymized dataset, each&amp;#10;record is indistinguishable from at least k-1 other records&amp;#10;with respect to certain &amp;quot;identifying&amp;quot; attributes.&amp;#10;In this paper we show with two simple attacks that a&amp;#10;\kappa-anonymized dataset has some subtle, but severe privacy&amp;#10;problems. First, we show that an attacker can discover the&amp;#10;values of sensitive attributes when there is little diversity&amp;#10;in those sensitive attributes. 
Second, attackers often have&amp;#10;background knowledge, and we show that \kappa-anonymity does&amp;#10;not guarantee privacy against attackers using background&amp;#10;knowledge. We give a detailed analysis of these two attacks&amp;#10;and we propose a novel and powerful privacy definition&amp;#10;called \ell-diversity. In addition to building a formal&amp;#10;foundation for \ell-diversity, we show in an experimental evaluation&amp;#10;that \ell-diversity is practical and can be implemented&amp;#10;efficiently.&amp;#10;&amp;#10;&amp;#10;&amp;#10;&amp;#10;&amp;#10;&amp;#10;&amp;#10; &amp;#10;&amp;#10;</abstract></paper><paper><title>Mondrian Multidimensional K-Anonymity</title><author><AuthorName>Kristen LeFevre</AuthorName><institute><InstituteName>University of Wisconsin, Madiso</InstituteName><country></country></institute></author><author><AuthorName>David J. DeWitt</AuthorName><institute><InstituteName>University of Wisconsin, Madiso</InstituteName><country></country></institute></author><author><AuthorName>Raghu Ramakrishnan</AuthorName><institute><InstituteName>University of Wisconsin, Madiso</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>K-Anonymity has been proposed as a mechanism for protecting&amp;#10;privacy in microdata publishing, and numerous recoding&amp;#10;&amp;quot;models&amp;quot; have been considered for achieving anonymity.&amp;#10;This paper proposes a new multidimensional&amp;#10;model, which provides an additional degree of flexibility not&amp;#10;seen in previous (single-dimensional) approaches. Often&amp;#10;this flexibility leads to higher-quality anonymizations, as&amp;#10;measured both by general-purpose metrics and more specific&amp;#10;notions of query answerability.&amp;#10;Optimal multidimensional anonymization is NP-hard&amp;#10;(like previous optimal anonymity problems). However,&amp;#10;we introduce a simple greedy approximation algorithm,&amp;#10;and experimental results show that this greedy algorithm&amp;#10;frequently leads to more desirable anonymizations than&amp;#10;exhaustive optimal algorithms for two single-dimensional&amp;#10;models.</abstract></paper><paper><title>Sovereign Joins</title><author><AuthorName>Rakesh Agrawal</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><author><AuthorName>Dmitri Asonov</AuthorName><institute><InstituteName>IBM Almaden Research Cente</InstituteName><country></country></institute></author><author><AuthorName>Murat Kantarcioglu</AuthorName><institute><InstituteName>The University of Texas at Dalla</InstituteName><country></country></institute></author><author><AuthorName>Yaping Li</AuthorName><institute><InstituteName>University of California, Berkele</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We present a secure network service for sovereign information&amp;#10;sharing whose only trusted component is an off-theshelf&amp;#10;secure coprocessor. The participating data providers&amp;#10;send encrypted relations to the service that sends the encrypted&amp;#10;results to the recipients. 
The technical challenge in&amp;#10;implementing such a service arises from the limited capability&amp;#10;of the secure coprocessors: they have small memory,&amp;#10;no attached disk, and no facility for communicating directly&amp;#10;with other machines in the network. The internal state of an&amp;#10;ongoing computation within the secure coprocessor cannot&amp;#10;be seen from outside, but its interactions with the server can&amp;#10;be exploited by an adversary.&amp;#10;We formulate the problem of computing join in this&amp;#10;setting where the goal is to prevent information leakage&amp;#10;through patterns in I/O while maximizing performance. We&amp;#10;specify criteria for proving the security of a join algorithm&amp;#10;and provide provably safe algorithms. These algorithms&amp;#10;can be used to compute general joins involving arbitrary&amp;#10;predicates and multiple sovereign databases. We thus enable&amp;#10;a new class of applications requiring query processing&amp;#10;across sovereign entities such that nothing apart from the&amp;#10;result is revealed to the recipients.</abstract></paper><paper><title>Privacy Preserving Query Processing Using Third Parties</title><author><AuthorName>Fatih Emekci</AuthorName><institute><InstituteName>University of California Santa Barbar</InstituteName><country></country></institute></author><author><AuthorName>Divyakant Agrawal</AuthorName><institute><InstituteName>University of California Santa Barbar</InstituteName><country></country></institute></author><author><AuthorName>Amr El Abbadi</AuthorName><institute><InstituteName>University of Califonia Santa Barbar</InstituteName><country></country></institute></author><author><AuthorName>Aziz Gulbeden</AuthorName><institute><InstituteName>University of California Santa Barbar</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Data integration from multiple autonomous data sources&amp;#10;has emerged as an important practical problem. The key requirement&amp;#10;for such data integration is that owners of such&amp;#10;data need to cooperate in a competitive landscape in most&amp;#10;of the cases. The research challenge in developing a query&amp;#10;processing solution is that the answers to the queries need&amp;#10;to be provided while preserving the privacy of the data&amp;#10;sources. In general, allowing unrestricted read access to&amp;#10;the whole data may give rise to potential vulnerabilities as&amp;#10;well as may have legal implications. Therefore, there is a&amp;#10;need for privacy preserving database operations for querying&amp;#10;data residing at different parties. In this paper, we propose&amp;#10;a new query processing technique using third parties in&amp;#10;a peer-to-peer system. We propose and evaluate two different&amp;#10;protocols for various database operations. Our scheme&amp;#10;is able to answer queries without revealing any useful information&amp;#10;to the data sources or to the third parties. 
Analytical&amp;#10;comparison of the proposed approach with other recent proposals&amp;#10;for privacy-preserving data integration establishes&amp;#10;the superiority of the proposed approach in terms of query&amp;#10;response time</abstract></paper><paper><title>Efficient Batch Top-k Search for Dictionary-based Entity Recognition</title><author><AuthorName>Amit Chandel</AuthorName><institute><InstituteName>IIT Bomba</InstituteName><country></country></institute></author><author><AuthorName>P. C. Nagesh</AuthorName><institute><InstituteName>IIT Bomba</InstituteName><country></country></institute></author><author><AuthorName>Sunita Sarawagi</AuthorName><institute><InstituteName>IIT Bomba</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We consider the problem of speeding up Entity Recognition&amp;#10;systems that exploit existing large databases of structured&amp;#10;entities to improve extraction accuracy. These systems&amp;#10;require the computation of the maximum similarity scores of&amp;#10;several overlapping segments of the input text with the entity&amp;#10;database. We formulate a Batch-Top-K problem with&amp;#10;the goal of sharing computations across overlapping segments.&amp;#10;Our proposed algorithm performs a factor of three&amp;#10;faster than independent Top-K queries and only a factor of&amp;#10;two slower than an unachievable lower bound on total cost.&amp;#10;We then propose a novel modification of the popular Viterbi&amp;#10;algorithm for recognizing entities so as to work with easily&amp;#10;computable bounds on match scores, thereby reducing the&amp;#10;total inference time by a factor of eight compared to stateof-&amp;#10;the-art methods.</abstract></paper><paper><title>Integrating Unstructured Data into Relational Databases</title><author><AuthorName>Imran R. Mansuri</AuthorName><institute><InstituteName>IIT Bomba</InstituteName><country></country></institute></author><author><AuthorName>Sunita Sarawagi</AuthorName><institute><InstituteName>IIT Bomba</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In this paper we present a system for automatically integrating&amp;#10;unstructured text into a multi-relational database&amp;#10;using state-of-the-art statistical models for structure extraction&amp;#10;and matching. We show how to extend current highperforming&amp;#10;models, Conditional Random Fields and their&amp;#10;semi-markov counterparts, to effectively exploit a variety of&amp;#10;recognition clues available in a database of entities, thereby&amp;#10;significantly reducing the dependence on manually labeled&amp;#10;training data. Our system is designed to load unstructured&amp;#10;records into columns spread across multiple tables in the&amp;#10;database while resolving the relationship of the extracted&amp;#10;text with existing column values, and preserving the cardinality&amp;#10;and link constraints of the database. 
We show how to&amp;#10;combine the inference algorithms of statistical models with&amp;#10;the database imposed constraints for optimal data integration.</abstract></paper><paper><title>Clean Answers over Dirty Databases: A Probabilistic Approach</title><author><AuthorName>Periklis Andritsos</AuthorName><institute><InstituteName>Univesity of Trent</InstituteName><country></country></institute></author><author><AuthorName>Ariel Fuxman</AuthorName><institute><InstituteName>University of Toront</InstituteName><country></country></institute></author><author><AuthorName>Renee J. Miller</AuthorName><institute><InstituteName>University of Toront</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The detection of duplicate tuples, corresponding to the&amp;#10;same real-world entity, is an important task in data integration&amp;#10;and cleaning. While many techniques exist to identify&amp;#10;such tuples, the merging or elimination of duplicates can be&amp;#10;a difficult task that relies on ad-hoc and often manual solutions.&amp;#10;We propose a complementary approach that permits&amp;#10;declarative query answering over duplicated data, where&amp;#10;each duplicate is associated with a probability of being in&amp;#10;the clean database. We rewrite queries over a database containing&amp;#10;duplicates to return each answer with the probability&amp;#10;that the answer is in the clean database. Our rewritten&amp;#10;queries are sensitive to the semantics of duplication and help&amp;#10;a user understand which query answers are most likely to be&amp;#10;present in the clean database.&amp;#10;The semantics that we adopt is independent of the way&amp;#10;the probabilities are produced, but is able to effectively exploit&amp;#10;them during query answering. In the absence of external&amp;#10;knowledge that associates each database tuple with a&amp;#10;probability, we offer a technique, based on tuple summaries,&amp;#10;that automates this task. We experimentally study the performance&amp;#10;of our rewritten queries. Our studies show that&amp;#10;the rewriting does not introduce a significant overhead in&amp;#10;query execution time. This work is done in the context of&amp;#10;the ConQuer project at the University of Toronto, which focuses&amp;#10;on the efficient management of inconsistent and dirty&amp;#10;databases.</abstract></paper><paper><title>Syntactic Rule Based Approach toWeb Service Composition</title><author><AuthorName>Ken Pu</AuthorName><institute><InstituteName>University of Toront</InstituteName><country></country></institute></author><author><AuthorName>Vagelis Hristidis</AuthorName><institute><InstituteName>Florida National  International Universit</InstituteName><country></country></institute></author><author><AuthorName>Nick Koudas</AuthorName><institute><InstituteName>University of Toront</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>This paper studies a problem of web service composition&amp;#10;from a syntactic approach. 
In contrast with other approaches based on enriched semantic descriptions, such as state-transition descriptions of web services, our focus is on the case when only the input-output type information from the WSDL specifications is available. The web service composition problem is formally formulated as deriving a given desired type from a collection of available types and web services using a prescribed set of rules with costs. We show that solving the minimal cost composition is NP-complete in general, and present a practical solution based on dynamic programming. Experiments using a mixture of synthetic and real data sets show that our approach is viable and produces good results.</abstract></paper>
<paper><title>Hilda: A High-Level Language for Data-Driven Web Applications</title><author><AuthorName>Fan Yang</AuthorName><institute><InstituteName>Cornell University</InstituteName><country></country></institute></author><author><AuthorName>Jayavel Shanmugasundaram</AuthorName><institute><InstituteName>Cornell University</InstituteName><country></country></institute></author><author><AuthorName>Mirek Riedewald</AuthorName><institute><InstituteName>Cornell University</InstituteName><country></country></institute></author><author><AuthorName>Johannes Gehrke</AuthorName><institute><InstituteName>Cornell University</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We propose Hilda, a high-level language for developing data-driven web applications. The primary benefits of Hilda over existing development platforms are: (a) it uses a unified data model for all layers of the application, (b) it is declarative, (c) it models both application queries and updates, (d) it supports structured programming for web sites, and (e) it enables conflict detection for concurrent updates. We also describe the implementation of a simple proof-of-concept Hilda compiler, which translates a Hilda application program into Java Servlet code.</abstract></paper>
<paper><title>UNIT: User-centric Transaction Management in Web-Database Systems</title><author><AuthorName>Huiming Qu</AuthorName><institute><InstituteName>University of Pittsburgh</InstituteName><country></country></institute></author><author><AuthorName>Alexandros Labrinidis</AuthorName><institute><InstituteName>University of Pittsburgh</InstituteName><country></country></institute></author><author><AuthorName>Daniel Mosse</AuthorName><institute><InstituteName>University of Pittsburgh</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Web-database systems are nowadays an integral part of everybody's life, with applications ranging from monitoring/trading stock portfolios, to personalized blog aggregation and news services, to personalized weather tracking services. For most of these services to be successful (and their users to be kept satisfied), two criteria need to be met: user requests must be answered in a timely fashion and using fresh data. This paper presents a framework to balance both requirements from the users' perspective.
Toward this, we propose a user satisfaction metric to measure the overall effectiveness of the Web-database system. We also provide a set of algorithms to dynamically optimize this metric, through query admission control and update frequency modulation. Finally, we present extensive experimental results which compare our proposed algorithms to the current state of the art and show that we outperform competitors under various workloads (generated based on real traces) and user requirements.</abstract></paper>
<paper><title>VBI-Tree: A Peer-to-Peer Framework for Supporting Multi-Dimensional Indexing Schemes</title><author><AuthorName>H. V. Jagadish</AuthorName><institute><InstituteName>University of Michigan</InstituteName><country></country></institute></author><author><AuthorName>Beng Chin Ooi</AuthorName><institute><InstituteName>NUS, Singapore</InstituteName><country></country></institute></author><author><AuthorName>Quang Hieu Vu</AuthorName><institute><InstituteName>NUS, Singapore</InstituteName><country></country></institute></author><author><AuthorName>Rong Zhang</AuthorName><institute><InstituteName>Fudan University, China</InstituteName><country></country></institute></author><author><AuthorName>Aoying Zhou</AuthorName><institute><InstituteName>Fudan University, China</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Multi-dimensional data indexing has received much attention in centralized databases. However, not so much work has been done on this topic in the context of Peer-to-Peer systems. In this paper, we propose a new Peer-to-Peer framework based on a balanced tree structure overlay, which can support extensible centralized mapping methods and query processing based on a variety of multidimensional tree structures, including R-Tree, X-Tree, SS-Tree, and M-Tree. Specifically, in a network with N nodes, our framework guarantees that point queries and range queries can be answered within O(log N) hops. We also provide an effective load balancing strategy to allow nodes to balance their workload efficiently. An experimental assessment validates the practicality of our proposal.</abstract></paper>
<paper><title>Transaction Time Support Inside a Database Engine</title><author><AuthorName>David Lomet</AuthorName><institute><InstituteName>Microsoft Research</InstituteName><country></country></institute></author><author><AuthorName>Roger Barga</AuthorName><institute><InstituteName>Microsoft Research</InstituteName><country></country></institute></author><author><AuthorName>Mohamed F. Mokbel</AuthorName><institute><InstituteName>University of Minnesota</InstituteName><country></country></institute></author><author><AuthorName>German Shegalov</AuthorName><institute><InstituteName>Max Planck Institute, Germany</InstituteName><country></country></institute></author><year>2006</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Transaction time databases retain and provide access to prior states of a database. An update "inserts" a new record while preserving the old version. Immortal DB builds transaction time database support into a database engine, not in middleware.
It supports "as of" queries returning records current at the specified time. It also supports snapshot isolation concurrency control. Versions are stamped with the "clock times" of their updating transactions. The timestamp order agrees with transaction serialization order. Lazy timestamping propagates timestamps to transaction updates after commit. Versions are kept in an integrated storage structure, with historical versions initially stored with current data. Time-splits of pages permit large histories to be maintained, and enable time-based indexing, which is essential for high-performance historical
