icde_2000_elementary.txt
<proceedings><paper><title>Message from the Program Committee Co-Chairs</title><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>Program Vice-Chairs and Award Committee Members</title><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>External Reviewers</title><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract></abstract></paper><paper><title>Rules of Thumb in Data Engineering</title><author><AuthorName>Jim Gray</AuthorName><institute><InstituteName>Microsoft Researc</InstituteName><country></country></institute></author><author><AuthorName>Prashant Shenoy</AuthorName><institute><InstituteName>University of Massachusetts at Amhers</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>This paper reexamines the rules of thumb for the design of data storage systems. Briefly, it looks at storage, processing, and networking costs, ratios, and trends with a particular focus on performance and price/performance. Amdahl's ratio laws for system design need only slight revision after 35 years-the major change being the increased use of RAM. An analysis also indicates storage should be used to cache both database and web data to save disk bandwidth, network bandwidth, and people's time. Surprisingly, the 5-minute rule for disk caching becomes a cache-everything rule for web caching.</abstract></paper><paper><title>Online Data Mining for Co-Evolving Time Sequences</title><author><AuthorName>Byoung-Kee Yi</AuthorName><institute><InstituteName>University of Maryland at College Par</InstituteName><country></country></institute></author><author><AuthorName>N.D. Sidiropoulos</AuthorName><institute><InstituteName>University of Virgini</InstituteName><country></country></institute></author><author><AuthorName>Theodore Johnson</AuthorName><institute><InstituteName>AT&T Lab</InstituteName><country></country></institute></author><author><AuthorName>Alexandros Biliris</AuthorName><institute><InstituteName>AT&T Lab</InstituteName><country></country></institute></author><author><AuthorName>H.V. Jagadish</AuthorName><institute><InstituteName>University of Michiga</InstituteName><country></country></institute></author><author><AuthorName>Christos Faloutsos</AuthorName><institute><InstituteName>Carnegie Mellon Universit</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In many applications, the data of interest comprises multiple sequences that evolve over time. Examples include currency exchange rates, network traffic data. We develop a fast method to analyze such co-evolving time sequences jointly to allow (a) estimation/forecasting of missing/delayed/future values, (b) quantitative data mining, and (c) outlier detection.Our method, MUSCLES, adapts to changing correlations among time sequences. It can handle indefinitely long sequences efficiently using an incremental algorithm and requires only small amount of storage and less I/O operations. 
To make it scale for a large number of sequences, we present a variation, the Selective MUSCLES method, and propose an efficient algorithm to reduce the problem size. Experiments on real datasets show that MUSCLES outperforms popular competitors in prediction accuracy by up to 10 times, and discovers interesting correlations. Moreover, Selective MUSCLES scales up very well for large numbers of sequences, reducing response time by up to 110 times over MUSCLES, and sometimes even improves the prediction quality.</abstract></paper><paper><title>Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases</title><author><AuthorName>Sanghyun Park</AuthorName><institute><InstituteName>University of California at Los Angeles</InstituteName><country></country></institute></author><author><AuthorName>Wesley W. Chu</AuthorName><institute><InstituteName>University of California at Los Angeles</InstituteName><country></country></institute></author><author><AuthorName>Jeehee Yoon</AuthorName><institute><InstituteName>Hallym University</InstituteName><country></country></institute></author><author><AuthorName>Chihcheng Hsu</AuthorName><institute><InstituteName>IBM</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We propose an indexing technique for fast retrieval of similar subsequences using time warping distances. A time warping distance is a more suitable similarity measure than the Euclidean distance in many applications, where sequences may be of different lengths or different sampling rates. Our indexing technique uses a disk-based suffix tree as an index structure and employs lower-bound distance functions to filter out dissimilar subsequences without false dismissals. To make the index structure compact and thus accelerate the query processing, we convert sequences of continuous values to sequences of discrete values via a categorization method and store only a subset of suffixes whose first values are different from their preceding values. The experimental results reveal that our proposed technique can be a few orders of magnitude faster than sequential scanning.</abstract></paper><paper><title>Landmarks: A New Model for Similarity-Based Pattern Querying in Time Series Databases</title><author><AuthorName>Chang-Shing Perng</AuthorName><institute><InstituteName>University of California at Los Angeles</InstituteName><country></country></institute></author><author><AuthorName>Haixun Wang</AuthorName><institute><InstituteName>University of California at Los Angeles</InstituteName><country></country></institute></author><author><AuthorName>Sylvia R. Zhang</AuthorName><institute><InstituteName>University of California at Los Angeles</InstituteName><country></country></institute></author><author><AuthorName>D. Stott Parker</AuthorName><institute><InstituteName>University of California at Los Angeles</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In this paper we present the Landmark Model, a model for time series that yields new techniques for similarity-based time series pattern querying. The Landmark Model does not follow traditional similarity models that rely on point-wise Euclidean distance. 
Instead, it leads to Landmark Similarity, a general model of similarity that is consistent with human intuition and episodic memory.By tracking different specific subsets of features of landmarks, we can efficiently compute different Landmark Similarity measures that are invariant under corresponding subsets of six transformations; namely, Shifting, Uniform Amplitude Scaling, Uniform Time Scaling, Uniform Bi-scaling, Time Warping and Non-uniform Amplitude Scaling.A method of identifying features that are invariant under these transformations is proposed. We also discuss a generalized approach for removing noise from raw time series without smoothing out the peaks and bottoms. Beside these new capabilities, our experiments show that Landmark Indexing is considerably fast.</abstract></paper><paper><title>Managing Escalation of Collaboration Processes in Crisis Mitigation Situations</title><author><AuthorName>Dimitrios Georgakopoulos</AuthorName><institute><InstituteName>MC</InstituteName><country></country></institute></author><author><AuthorName>Hans Schuster</AuthorName><institute><InstituteName>MC</InstituteName><country></country></institute></author><author><AuthorName>Donald Baker</AuthorName><institute><InstituteName>MC</InstituteName><country></country></institute></author><author><AuthorName>Andrzej Cichocki</AuthorName><institute><InstituteName>MC</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Processes for crisis mitigation must permit coordination flexibility and dynamic change to empower crisis mitigation coordinators and experts to deal with the unexpected situations. However, such mitigation processes must also provide enough structure to prevent chaotic response and increase mitigation effectiveness. Such combination of structure and flexibility cannot be effectively supported by existing workflow or groupware technologies.In this paper, we introduce the Collaboration Management Infrastructure (CMI) and describe its capabilities for supporting crisis mitigation processes. CMI provides a comprehensive Collaboration Management Model (CMM) and a corresponding federated system. CMM supports process templates that provide the initial activities, control and data flow structure, and resources needed to start mitigating a variety of crisis situations.In the event of a crisis, the appropriate process template is selected and instantiated. Crisis mitigation is achieved by escalating the instantiated process template. Escalation involves selecting and adding new process templates, creating new activities, roles, and task forces as needed to deal with the current demands in the crisis, and delegating responsibilities to process participants and task forces. CMM provides advanced composable primitives that empower crisis mitigation coordinators and experts to escalate the process. We provide an overview of the implementation of a federated CMI system and discuss our initial experience with various applications in the area of crisis management.</abstract></paper><paper><title>Semantic Conditions for Correctness at Different Isolation Levels</title><author><AuthorName>Arthur J. Bernstein</AuthorName><institute><InstituteName>State University of New York at Stony Broo</InstituteName><country></country></institute></author><author><AuthorName>Philip M. 
Lewis</AuthorName><institute><InstituteName>State University of New York at Stony Broo</InstituteName><country></country></institute></author><author><AuthorName>Shiyong Lu</AuthorName><institute><InstituteName>State University of New York at Stony Broo</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Many transaction processing applications execute at isolation levels lower than SERIALIZABLE in order to increase throughput and reduce response time. The problem is that non-serializable schedules are not guaranteed to be correct for all applications. The semantics of a particular application determines whether that application will run correctly at a lower isolation level, and in practice it appears that many applications do. Unfortunately, we know of no analysis technique that has been developed to test an application for its correctness at a particular level. Apparently decisions of this nature are made on an informal basis. In this paper we describe such a technique in a formal way.We use a new definition of correctness, semantic correctness, which is weaker than serializability, to investigate the correctness of such executions. For each isolation level, we prove a condition under which transactions that execute at that level will be semantically correct. In addition to the ANSI/ISO isolation levels of READ UNCOMMITTED, READ COMMITTED, and REPEATABLE READ, we also prove a condition for correct execution at the READ COMMITTED with first-committer-wins (a variation of READ COMMITTED) and at the SNAPSHOT isolation level. We assume that different transactions can be executing at different isolation levels, but that each transaction is executing at least at the READ UNCOMMITTED level.</abstract></paper><paper><title>Generalized Isolation Level Definitions</title><author><AuthorName>Atul Adya</AuthorName><institute><InstituteName>Microsoft Researc</InstituteName><country></country></institute></author><author><AuthorName>Barbara Liskov</AuthorName><institute><InstituteName>Massachusetts Institute of Technolog</InstituteName><country></country></institute></author><author><AuthorName>Patrick O'Neil</AuthorName><institute><InstituteName>University of Massachusetts at Bosto</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Commercial databases support different isolation levels to allow programmers to trade off consistency for a potential gain in performance. The isolation levels are defined in the current ANSI standard, but the definitions are ambiguous and revised definitions proposed to correct the problem are too constrained since they allow only pessimistic (locking) implementations. This paper presents new specifications for the ANSI levels. Our specifications are portable; they apply not only to locking implementations, but also to optimistic and multi-version concurrency control schemes. 
Furthermore, unlike earlier definitions, our new specifications handle predicates in a correct and flexible manner at all levels.</abstract></paper><paper><title>Creating a Customized Access Method for Blobworld</title><author><AuthorName>Megan Thomas</AuthorName><institute><InstituteName>University of California at Berkele</InstituteName><country></country></institute></author><author><AuthorName>Chad Carson</AuthorName><institute><InstituteName>University of California at Berkele</InstituteName><country></country></institute></author><author><AuthorName>Joseph M. Hellerstein</AuthorName><institute><InstituteName>University of California at Berkele</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We present the design and analysis of a customized access method for the content-based image retrieval system, Blobworld. We analyzed several traditional access methods specifically in the context of the Blobworld data and queries on it. Based on how the traditional access methods performed in executing a Blobworld query workload, we developed several new access methods tailored to Blobworld, two of which performed better than any of the traditional access methods in the Blobworld context.</abstract></paper><paper><title>Distributed Query Processing on the Web</title><author><AuthorName>Nalin Gupta</AuthorName><institute><InstituteName>Indian Institute of Scienc</InstituteName><country></country></institute></author><author><AuthorName>Jayant R. Haritsa</AuthorName><institute><InstituteName>Indian Institute of Scienc</InstituteName><country></country></institute></author><author><AuthorName>Maya Ramanath</AuthorName><institute><InstituteName>Indian Institute of Scienc</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Current Web querying systems are based on a &quot;data shipping&quot; mode wherein data is downloaded from remote sites to the user-site, queries are processed locally against these documents, and then further data is downloaded from the network based on these results. A data shipping approach suffers from several disadvantages, including the transfer of large amounts of unnecessary data resulting in network congestion and poor bandwidth utilization, the client-site becoming a processing bottleneck, and extended user response times due to sequential processing.In this paper, we present an alternative &quot;query shipping&quot; approach wherein queries emanating from the user-site are forwarded from one site to another on the Web, the query is processed at each recipient site, and the associated results are returned to the user. Our design does not require co-ordination from any &quot;master site&quot;, making it a truly distributed scheme. 
It has been implemented as part of DIASPORA (DIstributed Answering System for Processing of Remote Agents), a new Java-based Web database system that is currently operational and is undergoing field trials on our campus network.</abstract></paper><paper><title>Dynamic Histograms: Capturing Evolving Data Sets</title><author><AuthorName>Donko Donjerkovic</AuthorName><institute><InstituteName>University of Wisconsin at Madiso</InstituteName><country></country></institute></author><author><AuthorName>Raghu Ramakrishnan</AuthorName><institute><InstituteName>University of Wisconsin at Madiso</InstituteName><country></country></institute></author><author><AuthorName>Yannis Ioannidis</AuthorName><institute><InstituteName>University of Athen</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Conventional histograms are `static' since they cannot be updated but only recalculated. In this paper, we introduce a `dynamic' version of V-optimal histograms, which is constructed and maintained incrementally. Our experimental results indicate that a variation of Dynamic V-optimal histograms has comparable precision to recalculation methods but is much cheaper to maintain.</abstract></paper><paper><title>Extensible Indexing: a Framework for Integrating Domain-Specific Indexing Schemes into Oracle8i</title><author><AuthorName>Jagannathan Srinivasan</AuthorName><institute><InstituteName>Oracle Corporatio</InstituteName><country></country></institute></author><author><AuthorName>Ravi Murthy</AuthorName><institute><InstituteName>Oracle Corporatio</InstituteName><country></country></institute></author><author><AuthorName>Seema Sundara</AuthorName><institute><InstituteName>Oracle Corporatio</InstituteName><country></country></institute></author><author><AuthorName>Nipun Agarwal</AuthorName><institute><InstituteName>Oracle Corporatio</InstituteName><country></country></institute></author><author><AuthorName>Samuel DeFazio</AuthorName><institute><InstituteName>Oracle Corporatio</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Extensible Indexing is a SQL-based framework that allows users to define domain-specific indexing schemes, and integrate them into the Oracle8i server. Users register a new indexing scheme, the set of related operators, and additional properties through SQL data definition language extensions. The implementation for an indexing scheme is provided as a set of Oracle Data Cartridge Interface (ODCIIndex) routines for index-definition, index-maintenance, and index-scan operations. An index created using the new indexing scheme, referred to as domain index, behaves and performs analogous to those built natively by the database system. Oracle8i server implicitly invokes user-supplied index implementation code when domain index operations are performed, and executes user-supplied index scan routines for efficient evaluation of domain-specific operators.This paper provides an overview of the framework and describes the steps needed to implement an indexing scheme. 
The paper also presents a case study of Oracle Cartridges (InterMedia Text, Spatial, and Visual Information Retrieval), and Daylight (Chemical compound searching) Cartridge, which have implemented new indexing schemes using this framework, and discusses the benefits and limitations.</abstract></paper><paper><title>DB2 Advisor: An Optimizer Smart Enough to Recommend its own Indexes</title><author><AuthorName>Gary Valentin</AuthorName><institute><InstituteName>IBM Toronto Lab</InstituteName><country></country></institute></author><author><AuthorName>Michael Zuliani</AuthorName><institute><InstituteName>IBM Toronto Lab</InstituteName><country></country></institute></author><author><AuthorName>Daniel C. Zilio</AuthorName><institute><InstituteName>IBM Toronto Lab</InstituteName><country></country></institute></author><author><AuthorName>Guy Lohman</AuthorName><institute><InstituteName>IBM Almaden Research Center</InstituteName><country></country></institute></author><author><AuthorName>Alan Skelley</AuthorName><institute><InstituteName>University of Toronto</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>This paper introduces the concept of letting an RDBMS Optimizer optimize its own environment. In our project, we have used the DB2 Optimizer to tackle the index selection problem, a variation of the knapsack problem. This paper will discuss our implementation of index recommendation, the user interface, and provide measurements on the quality of the recommended indexes.</abstract></paper><paper><title>Taming the Downtime: High Availability in Sybase ASE 12</title><author><AuthorName>S. Raghuram</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><author><AuthorName>Sheshadri Ranganath</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><author><AuthorName>Steve Olson</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><author><AuthorName>Subrata Nandi</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The new Companion Architecture in Sybase Adaptive Server Enterprise (ASE) 12 for high availability is supported on a 2-node cluster with each node running a separate ASE 12 server in companion configuration. This architecture is designed to withstand a single point of failure for unplanned outages, and allow both nodes to be used for productive workload during normal operation. It enables fast failover and data recovery, supports automatic client migration during failure, and integrates seamlessly with adjoining layers in a multi-tier architecture. It supports single system presentation of data for applications, and presents a rich set of features/infrastructure to reduce the planned downtime. During failover and failback, only the persistent data component is moved between the companion ASEs, making it fast and efficient. 
By introducing proxy databases, this architecture enables user databases to be visible and accessible from either of the companions by shipping the queries to the appropriate node and returning the results to the client.</abstract></paper><paper><title>Accurate Estimation of the Cost of Spatial Selections</title><author><AuthorName>Ashraf Aboulnaga</AuthorName><institute><InstituteName>University of Wisconsin at Madison</InstituteName><country></country></institute></author><author><AuthorName>Jeffrey F. Naughton</AuthorName><institute><InstituteName>University of Wisconsin at Madison</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Optimizing queries that involve operations on spatial data requires estimating the selectivity and cost of these operations. In this paper, we focus on estimating the cost of spatial selections, or window queries, where the query windows and data objects are general polygons. Cost estimation techniques previously proposed in the literature only handle rectangular query windows over rectangular data objects, thus ignoring the very significant cost of exact geometry comparison (the refinement step in a ``filter and refine'' query processing strategy). The cost of the exact geometry comparison depends on the selectivity of the filtering step and the average number of vertices in the candidate objects identified by this step. In this paper, we introduce a new type of histogram for spatial data that captures the complexity and size of the spatial objects as well as their location. Capturing these attributes makes this type of histogram useful for accurate estimation, as we experimentally demonstrate. We also investigate sampling-based estimation approaches. Sampling can yield better selectivity estimates than histograms for polygon data, but at the high cost of performing exact geometry comparisons for all the sampled objects.</abstract></paper><paper><title>User Defined Aggregates in Object-Relational Systems</title><author><AuthorName>Haixun Wang</AuthorName><institute><InstituteName>University of California at Los Angeles</InstituteName><country></country></institute></author><author><AuthorName>Carlo Zaniolo</AuthorName><institute><InstituteName>University of California at Los Angeles</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>User-defined aggregates are essential in many advanced database applications, particularly in expressing data mining functions, but they find little support in current systems including Object-Relational databases. Three serious limitations of current systems are (i) the inability of introducing new aggregates (e.g., by coding them in a procedural language as originally proposed in SQL3), (ii) the inability of returning partial results during the computation (e.g., to support online aggregation), and (iii) the inability of using aggregates in recursive queries (e.g., to express Bill of Materials and optimized graph searches). In this paper, we present a unified solution to these problems which realizes the original SQL3 proposal for user-defined aggregates (UDAs), and adds significant improvements in terms of expressive power and ease of use: in fact our SQL-AG system also supports online aggregation, monotonic aggregation, and a high-level aggregate definition language named SADL. 
We focus on applications of UDAs and SADL.</abstract></paper><paper><title>Scalable Algorithms for Large Temporal Aggregation</title><author><AuthorName>Bongki Moon</AuthorName><institute><InstituteName>University of Arizon</InstituteName><country></country></institute></author><author><AuthorName>Ines Fernando Vega Lopez</AuthorName><institute><InstituteName>University of Arizon</InstituteName><country></country></institute></author><author><AuthorName>Vijaykumar Immanuel</AuthorName><institute><InstituteName>Compaq Computer Corporatio</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The ability to model time-varying natures is essential to many database applications such as data warehousing and mining. However, the temporal aspects provide many unique characteristics and challenges for query processing and optimization. Among the challenges is computing temporal aggregates, which is complicated by having to compute temporal grouping.In this paper, we introduce a variety of temporal aggregation algorithms that overcome major drawbacks of previous work. First, for small-scale aggregations, both the worst-case and average-case processing time have been improved significantly. Second, for large-scale aggregations, the proposed algorithms can deal with a database that is substantially larger than the size of available memory.</abstract></paper><paper><title>Power Conservative Multi-Attribute Queries on Data Broadcast</title><author><AuthorName>Qinglong Hu</AuthorName><institute><InstituteName>Aleph Computer System, Inc</InstituteName><country></country></institute></author><author><AuthorName>Wang-Chien Lee</AuthorName><institute><InstituteName>GTE Laboratories In</InstituteName><country></country></institute></author><author><AuthorName>Dik Lun Lee</AuthorName><institute><InstituteName>University of Science and Technolog</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In this paper, we study power conservation techniques for multi-attribute queries on wireless data broadcast channels. Indexing data on broadcast channels can improve client filtering capability, while clustering and scheduling can reduce both access time and tune-in time. Thus, indexing techniques should be coupled with clustering and scheduling methods to reduce the battery power consumption of mobile computers. In this study, three indexing schemes for multi-attribute queries, namely, index tree, signature, and hybrid index, are discussed. We develop cost models for these three indexing schemes and evaluate their performance based on multi-attribute queries on wireless data broadcast channels.</abstract></paper><paper><title>Multi-Level Multi-Channel Air Cache Designs for Broadcasting in a Mobile Environment</title><author><AuthorName>Kiran Prabhakara</AuthorName><institute><InstituteName>University of Central Florid</InstituteName><country></country></institute></author><author><AuthorName>Kien A. 
Hua</AuthorName><institute><InstituteName>University of Central Florid</InstituteName><country></country></institute></author><author><AuthorName>JungHwan Oh</AuthorName><institute><InstituteName>University of Central Florid</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In this paper, we investigate efficient ways of broadcasting data to mobile users over multiple physical channels, which cannot be coalesced into a lesser number of high-bandwidth channels. We propose the use of MLMC (Multi-Level Multi-Channel ) Air-Cache which can provide mobile users with data based on their popularity factor. We provide a wide range of design considerations for the server which broadcasts over the MLMC cache. We also investigate some novel techniques for the mobile user to access data from the MLMC cache and show the advantages of designing the broadcast strategy in tandem with the access behavior of the mobile users . Finally, we provide experimental results to compare the techniques we introduce.</abstract></paper><paper><title>An Algebraic Compression Framework for Query Results</title><author><AuthorName>Zhiyuan Chen</AuthorName><institute><InstituteName>Cornell Universit</InstituteName><country></country></institute></author><author><AuthorName>Praveen Seshadri</AuthorName><institute><InstituteName>Cornell Universit</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Decision-support applications in emerging environments require that SQL query results or intermediate results be shipped to clients for further analysis and presentation. These clients may use low bandwidth connections or have severe storage restrictions. Consequently, there is a need to compress the results of a query for efficient transfer and client-side access.This paper explores a variety of techniques that address this issue. Instead of using a fixed method, we choose a combination of compression methods that use statistical and semantic information of the query results to enhance the effect of compression. To represent such a combination, we present a framework of &quot;compression plans&quot; formed by composing primitive compression operators.We also present optimization algorithms that enumerate valid compression plans and choose an optimal plan. Our experiments show that our techniques achieve significant performance improvement over standard compression tools like WinZip.</abstract></paper><paper><title>ACQ: An Automatic Clustering and Querying Approach for Large Image Databases</title><author><AuthorName>Dantong Yu</AuthorName><institute><InstituteName>State University of New York at Buffal</InstituteName><country></country></institute></author><author><AuthorName>Aidong Zhang</AuthorName><institute><InstituteName>State University of New York at Buffal</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Large image collections such as web-based image databases are being built in various locations. Because of the diversity of such image data collections, clustering images becomes an important and non-trivial problem. Such clustering tries to find the densely populated regions in the feature space to be used for efficient image retrieval. 
In this paper, we present an automatic clustering and querying (ACQ) approach for large image databases.Our approach can efficiently detect clusters of arbitrary shape. It does not require the number of clusters to be known a priori and is insensitive to the noise (outliers) and the order of input data. Based on this clustering approach, efficient image querying is supported. Experiments demonstrate the effectiveness and efficiency of the approach.</abstract></paper><paper><title>The MARIFlow Workflow Management System</title><author><AuthorName>A. Dogac</AuthorName><institute><InstituteName>Software R&D Cente</InstituteName><country></country></institute></author><author><AuthorName>M. Ezbiderli</AuthorName><institute><InstituteName>Software R&D Cente</InstituteName><country></country></institute></author><author><AuthorName>Y. Tambag</AuthorName><institute><InstituteName>Software R&D Cente</InstituteName><country></country></institute></author><author><AuthorName>C. Icdem</AuthorName><institute><InstituteName>Software R&D Cente</InstituteName><country></country></institute></author><author><AuthorName>A. Tumer</AuthorName><institute><InstituteName>Software R&D Cente</InstituteName><country></country></institute></author><author><AuthorName>N. Tatbul</AuthorName><institute><InstituteName>Software R&D Cente</InstituteName><country></country></institute></author><author><AuthorName>N. Hamali</AuthorName><institute><InstituteName>Software R&D Cente</InstituteName><country></country></institute></author><author><AuthorName>C. Beeri</AuthorName><institute><InstituteName>Hebrew Universit</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>MARIFlow System provides for automating and monitoring the flow of control and data over the Internet among different organizations, thereby creating a platform necessary to describe higher order processes involving several organizations and companies.The architecture is general enough to be applied to any business practice where data flow among different industries and co operations and the invocation of activities follow a pattern that can be described through a process definition. The example application provided within the scope of this project is on maritime industry.A MARIFlow process is executed through cooperating agents, called MARCAs (MARIFlow Cooperating Agents) that are automatically initialized at each site that the process executes. MARCAs handle the activities at their site, provide for coordination with other MARCAs in the system by routing the documents in electronic form according to the process description, keeping track of process information, and providing for the security and authentication of documents as well as comprehensive monitoring facilities.More specifically, the functionality provided by the system is as follows: A declarative means to specify the control of document flow over the Internet where it is possible to define the source of data, its control flow and the activities that make use of this data. Fully distributed execution architecture achieved through cooperating agents over the Internet. The agents know about other agents that they need to communicate with and preserve their state during communication. They also manage local information for monitoring purposes and for recovering from failures. Communicating with inside firewall applications. A MARCA can activate in-house activities automatically. 
However, it should be noted that most organizations may be reluctant to grant access inside the corporate firewall. In such cases, the MARCA passes the documents to an in-house system by properly acknowledging the in-house system on further processing that may be necessary on the documents. A MARCA is also responsible for getting the documents from the in-house system and forwarding them to the related agents as specified in the process definition. There is a coordinating MARCA in the system through which it is possible to define processes graphically from a Web interface. The coordinating MARCA is also responsible for initializing the MARCAs at each site for a new process definition and acting as a facilitator among MARCAs in the sense that for a new workflow definition it decides which new MARCAs are necessary. Note that only one MARCA exists at each site and handles all the activities of all workflow definitions related to that site. Therefore, a new MARCA is generated only for a site participating in a workflow definition for the first time. The coordinating MARCA also acts as a data warehouse for monitoring purposes. Authentication and authorization of documents and the process related information. A monitoring mechanism for keeping track of the documents and for providing a detailed account of the current status of a process instance within the system. Ability to recover the system from various types of failures.</abstract></paper><paper><title>Distance Exponent: A New Concept for Selectivity Estimation in Metric Trees</title><author><AuthorName>Caetano Traina Jr.</AuthorName><institute><InstituteName>University of Sao Paulo at Sao Carlos</InstituteName><country></country></institute></author><author><AuthorName>Agma J.M. Traina</AuthorName><institute><InstituteName>University of Sao Paulo at Sao Carlos</InstituteName><country></country></institute></author><author><AuthorName>Christos Faloutsos</AuthorName><institute><InstituteName>Carnegie Mellon University</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>This paper discusses the problem of selectivity estimation for range queries in metric datasets, which include vector, or dimensional, datasets as a special case. The main contribution of this paper is that, surprisingly, many different real datasets follow a &quot;power law&quot;. From this observation we derive an analysis for the distance distribution of metric datasets. This is the first analysis of distance distributions for real metric datasets. We call the exponent of our power law the &quot;distance exponent&quot;. We show that it plays a relevant role for the analysis of real, metric datasets. Specifically, we show (a) how to exploit the distance exponent to derive formulas for selectivity estimation of range queries and (b) how to compute it quickly from a metric index tree. We performed several experiments on many real datasets (road intersections of U.S. counties, vector characteristics extracted from face matching systems, sets of words, distance matrices) and synthetic datasets (Sierpinski triangle, a 2-dimensional uniform distribution and a 2-dimensional line). Our selectivity estimation formulas are accurate, within a relative error of 4% to 17%, and always within one standard deviation from the analytical results. 
Moreover, we also present a quick algorithm to estimate the &quot;distance exponent&quot;, which gives good accuracy and saves orders of magnitude in computation time.</abstract></paper><paper><title>Interactive-Time Similarity Search for Large Image Collections Using Parallel VA-Files</title><author><AuthorName>Roger Weber</AuthorName><institute><InstituteName>ETH Zentrum</InstituteName><country></country></institute></author><author><AuthorName>Klemens Boehm</AuthorName><institute><InstituteName>ETH Zentrum</InstituteName><country></country></institute></author><author><AuthorName>Hans-J. Schek</AuthorName><institute><InstituteName>ETH Zentrum</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Nearest-neighbor search (NN-search) plays a key role in content-based retrieval. But NN-search over high-dimensional features is of linear complexity, and query response times are not satisfactory for large collections of images. We have investigated parallel NN-search in a Network of Workstations (NOW) based on the VA-File. We have identified various design alternatives for such a search engine and have evaluated them. Because of the scan-based nature of the VA-File, one might expect an improvement almost linear in the number of components. But the best speedup we have observed is by 30 with only three components. The effect is due to the elimination of the IO-bottleneck. From another perspective, our solution provides interactive-time similarity search, i.e. a search through 1 GB of feature data lasts about one second in a NOW with three components.</abstract></paper><paper><title>A Data-Warehouse/OLAP Framework for Scalable Telecommunication Tandem Traffic Analysis</title><author><AuthorName>Qiming Chen</AuthorName><institute><InstituteName>Hewlett Packard Labs</InstituteName><country></country></institute></author><author><AuthorName>Meichun Hsu</AuthorName><institute><InstituteName>Hewlett Packard Labs</InstituteName><country></country></institute></author><author><AuthorName>Umesh Dayal</AuthorName><institute><InstituteName>Hewlett Packard Labs</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In a telecommunication network, hundreds of millions of call detail records (CDRs) are generated daily. Applications such as tandem traffic analysis require the collection and mining of CDRs on a continuous basis. The data volumes and data flow rates pose serious scalability and performance challenges. This has motivated us to develop a scalable data-warehouse/OLAP framework, and based on this framework, tackle the issue of scaling the whole operation chain, including data cleansing, loading, maintenance, access and analysis. We introduce the notion of dynamic data warehousing for managing information at different aggregation levels with different life spans. We use OLAP servers, together with the associated multidimensional databases, as a computation platform for data caching, reduction and aggregation, in addition to data analysis. The framework supports parallel computation for scaling up data mining, and supports incremental OLAP for providing continuous data mining. 
A tandem traffic analysis engine is implemented on the proposed framework.In addition to the parallel and incremental computation architecture, we provide a set of application-specific optimization mechanisms for scaling performance. These mechanisms fit well into the above framework. Our experience demonstrates the practical value of the above framework in supporting an important class of telecommunication business intelligence applications.</abstract></paper><paper><title>MetaComm: A Meta-Directory for Telecommunications</title><author><AuthorName>J. Freire</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><author><AuthorName>D. Lieuwen</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><author><AuthorName>J. Ordille</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><author><AuthorName>L. Garg</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><author><AuthorName>M. Holder</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><author><AuthorName>H. Urroz</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><author><AuthorName>G. Michael</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><author><AuthorName>J. Orbach</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><author><AuthorName>L. Tucker</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><author><AuthorName>Q. Ye</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><author><AuthorName>R. Arlein</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>A great deal of corporate data is buried in network devices --- such as PBX messaging/email platforms, and data networking equipment --- where it is difficult to access and modify. Typically, the data is only available to the device itself for its internal purposes and it must be administered using either a proprietary interface or a standard protocol against a proprietary schema. This leads to many problems, most notably: the need for data replication and difficult interoperation with other devices and applications. MetaComm addresses these problems by providing a framework to integrate data from multiple devices into a meta-directory. The system allows user information to be modified through a directory using the LDAP protocol as well as directly through two legacy devices: a Definity (R) PBX and a voice messaging system. In order to prevent data inconsistencies, updates to any system must be reflected appropriately in all systems. 
We also discuss implementation details and experiences.</abstract></paper><paper><title>Extracting Delta for Incremental Data Warehouse Maintenance</title><author><AuthorName>Prabhu Ram</AuthorName><institute><InstituteName>The Boeing Compan</InstituteName><country></country></institute></author><author><AuthorName>Lyman Do</AuthorName><institute><InstituteName>The Boeing Compan</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>This paper seeks to highlight an area important to commercial data warehouse deployments that has received limited research attention, namely, the extraction of changes to the data at the source systems. We refer to these changes as deltas. Extracting deltas from source systems is the first step in the incremental maintenance of data warehouses. A common assumption among current incremental maintenance methods is that deltas are somehow made available - normally in the form of differential files. Extraction of deltas from source systems is often not a straight forward process nor an efficient one.In this paper, we analyze how deltas can be extracted from large systems. We analyze delta extraction methods that are currently available, namely, time stamps, differential snapshots, triggers, and archive logs. We point out the strengths and weaknesses of each method through analysis and when appropriate through experimentation. We have been investigating the method called Op-Delta at Boeing that better suits delta extraction from large integrated systems. We discuss the benefits of Op-Delta, discuss how it could be implemented, and present comparative results from our experimentation.</abstract></paper><paper><title>Image Database Retrieval with Multiple-Instance Learning Techniques</title><author><AuthorName>Cheng Yang</AuthorName><institute><InstituteName>Massachusetts Institute of Technolog</InstituteName><country></country></institute></author><author><AuthorName>Tomas Lozano-Perez</AuthorName><institute><InstituteName>Massachusetts Institute of Technolog</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In this paper, we develop and test an approach to retrieving images from an image database based on content similarity. First, each picture is divided into many overlapping regions. For each region, the sub-picture is filtered and converted into a feature vector. In this way, each picture is represented by a number of different feature vectors. The user selects positive and negative image examples to train the system. During the training, a multiple-instance learning method known as the Diverse Density algorithm is employed to determine which feature vector in each image best represents the user's concept, and which dimensions of the feature vectors are important. The system tries to retrieve images with similar feature vectors from the remainder of the database. A variation of the weighted correlation statistic is used to determine image similarity. 
The approach is tested on a medium-sized database of natural scenes as well as single- and multiple-object images.</abstract></paper><paper><title>PAC Nearest Neighbor Queries: Approximate and Controlled Search in High-Dimensional and Metric Spaces</title><author><AuthorName>Paolo Ciaccia</AuthorName><institute><InstituteName>University of Bologna and CSITE-CNR</InstituteName><country></country></institute></author><author><AuthorName>Marco Patella</AuthorName><institute><InstituteName>University of Bologna and CSITE-CNR</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In high-dimensional and complex metric spaces, determining the nearest neighbor (NN) of a query object q can be a very expensive task, because of the poor partitioning operated by index structures - the so-called &quot;curse of dimensionality&quot;. This also affects approximately correct (AC) algorithms, which return as a result a point whose distance from q is less than (1+ε) times the distance between q and its true NN. In this paper we introduce a new approach to approximate similarity search, called PAC-NN queries, where the error bound ε can be exceeded with probability δ, and both the ε and δ parameters can be tuned at query time to trade the quality of the result for the cost of the search. We describe sequential and index-based PAC-NN algorithms that exploit the distance distribution of the query object in order to determine a stopping condition that respects the error bound. Analysis and experimental evaluation of the sequential algorithm confirm that, for moderately large data sets and suitable ε and δ values, PAC-NN queries can be efficiently solved and the error controlled. Then, we provide experimental evidence that indexing can further speed-up the retrieval process by up to 1-2 orders of magnitude without giving up the accuracy of the result.</abstract></paper><paper><title>Efficiently Supporting Multiple Similarity Queries for Mining in Metric Databases</title><author><AuthorName>Bernhard Braunmuller</AuthorName><institute><InstituteName>University of Munich</InstituteName><country></country></institute></author><author><AuthorName>Martin Ester</AuthorName><institute><InstituteName>University of Munich</InstituteName><country></country></institute></author><author><AuthorName>Hans-Peter Kriegel</AuthorName><institute><InstituteName>University of Munich</InstituteName><country></country></institute></author><author><AuthorName>Jorg Sander</AuthorName><institute><InstituteName>University of Munich</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>Metric databases are databases where a metric distance function is defined for pairs of database objects. In such databases, similarity queries in the form of range queries or k-nearest neighbor queries are the most important queries. In traditional query processing, single queries are issued independently by different users. In many data mining applications, however, the database is typically explored by iteratively asking similarity queries for answers of previous similarity queries. In this paper, we introduce a generic scheme for such data mining algorithms and we investigate two orthogonal approaches, reducing I/O cost as well as CPU cost, to speed-up the processing of multiple similarity queries. 
The proposed techniques apply to any type of similarity query and to an implementation based on an index or using a sequential scan. Parallelization yields an additional impressive speed-up. An extensive performance evaluation confirms the efficiency of our approach.</abstract></paper><paper><title>Declustering Using Golden Ratio Sequences</title><author><AuthorName>Randeep Bhatia</AuthorName><institute><InstituteName>Bell Laboratorie</InstituteName><country></country></institute></author><author><AuthorName>Rakesh K. Sinha</AuthorName><institute><InstituteName>Bell Laboratorie</InstituteName><country></country></institute></author><author><AuthorName>Chung-Min Chen</AuthorName><institute><InstituteName>Telcordia Technologies, Inc</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>We propose a new data declustering scheme for range queries. Our scheme is based on Golden Ratio Sequences (GRS), which have found applications in broadcast disks, hashing, packet routing, etc.We show by analysis and simulation that GRS is nearly the best possible scheme for 2-dimensional range queries. Specifically, it is the best possible scheme when the number of disks (M) is at most 22; has response time at most one more than that of the best possible scheme for M less than or equal to 94; and has response time at most three more than that of the best possible scheme for M less than or equal to 550. We also show that it outperforms the cyclic declustering scheme -- a recently proposed scheme that was shown to have better performance than previously known schemes for this problem. We give some analytical results to suggest that the average performance of our scheme is within 14 percent of the optimal scheme. Our analytical results also suggest a worst case response time within a factor 3 of the optimal for any query, and within a factor 1.5 of the optimal for large queries. We also give a multidimensional extension of our scheme, which has better performance than the multidimensional generalization of the cyclic declustering scheme.</abstract></paper><paper><title>Optimization Techniques for Data-Intensive Decision Flows</title><author><AuthorName>Richard Hull</AuthorName><institute><InstituteName>Bell Laboratories, Lucent Technologie</InstituteName><country></country></institute></author><author><AuthorName>Bharat Kumar</AuthorName><institute><InstituteName>Bell Laboratories, Lucent Technologie</InstituteName><country></country></institute></author><author><AuthorName>Gang Zhou</AuthorName><institute><InstituteName>Bell Laboratories, Lucent Technologie</InstituteName><country></country></institute></author><author><AuthorName>Francois Llirbat</AuthorName><institute><InstituteName>Domaine de Voluceau-ROCQUENCOUR</InstituteName><country></country></institute></author><author><AuthorName>Guozhu Dong</AuthorName><institute><InstituteName>Wright State Universit</InstituteName><country></country></institute></author><author><AuthorName>Jianwen Su</AuthorName><institute><InstituteName>University of California at Santa Barbar</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>For an enterprise to take advantage of the opportunities afforded by electronic commerce it must be able to make decisions about business transactions in near-realtime. 
In the coming era of segment-of-one marketing, these decisions will be quite intricate, so that customer treatments can be highly personalized, reflecting customer preferences, the customer's history with the enterprise, and targeted business objectives. This paper describes a paradigm called &quot;decision flows&quot; for specifying a form of incremental decision-making that can combine diverse business factors in near-realtime. This paper introduces and empirically analyzes a variety of optimization strategies for decision flows that are &quot;data-intensive&quot;, i.e., that involve many database queries. A primary focus is on the use of parallelism and eagerness (a.k.a. speculative execution) to minimize work and/or reduce response time. A family of optimization techniques is developed, including algorithms and heuristics for scheduling tasks of the decision flow. Using a prototype execution engine, the techniques are compared and analyzed in connection with decision-making applications having differing characteristics.</abstract></paper><paper><title>Optimal Index and Data Allocation in Multiple Broadcast Channels</title><author><AuthorName>Shou-Chih Lo</AuthorName><institute><InstituteName>National Tsing Hua University</InstituteName><country></country></institute></author><author><AuthorName>Arbee L.P. Chen</AuthorName><institute><InstituteName>National Tsing Hua University</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>The issue of data broadcast has received much attention in mobile computing. A periodic broadcast of frequently requested data can reduce the workload of the up-link channel and facilitate data access for the mobile user. Since the mobile units usually have limited battery capacity, the minimization of the access latency for the broadcast data is an important problem. The indexing and scheduling techniques on the broadcast data should be considered. In this paper we propose a solution to find the optimal index and data allocation, which minimizes the access latency for any number of broadcast channels. We represent all the possible allocations as a tree in which the optimal one is searched, and propose a pruning strategy based on some properties to greatly reduce the search space. Experiments are performed to show the effectiveness of the pruning strategy. Moreover, we propose two heuristics to solve the same problem when the size of the broadcast data is large.</abstract></paper><paper><title>Clustering Categorical Data</title><author><AuthorName>Yi Zhang</AuthorName><institute><InstituteName>Chinese University of Hong Kong</InstituteName><country></country></institute></author><author><AuthorName>Ada Wai-chee Fu</AuthorName><institute><InstituteName>Chinese University of Hong Kong</InstituteName><country></country></institute></author><author><AuthorName>Chun Hing Cai</AuthorName><institute><InstituteName>Chinese University of Hong Kong</InstituteName><country></country></institute></author><author><AuthorName>Pheng Ann Heng</AuthorName><institute><InstituteName>Chinese University of Hong Kong</InstituteName><country></country></institute></author><year>2000</year><conference>International Conference on Data Engineering</conference><citation></citation><abstract>In this paper we propose two methods to study the problem of clustering categorical data. The first method is based on a dynamical system approach. 
The second method is based on the graph partitioning approach.</abstract></paper><paper><title>Mining Bases for Association Rules Using Closed Sets</title><author><AuthorName>Nicolas Pasquier</AuthorName><institute><InstituteName>Universit