
📄 sigmod_2005_elementary.txt

<proceedings><paper><title>Sampling algorithms in a stream operator</title><author><AuthorName>Theodore Johnson</AuthorName><institute><InstituteName>AT&amp;T Labs Research</InstituteName><country></country></institute></author><author><AuthorName>S. Muthukrishnan</AuthorName><institute><InstituteName>Rutgers University</InstituteName><country></country></institute></author><author><AuthorName>Irina Rozenbaum</AuthorName><institute><InstituteName>Rutgers University</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Management of Data</conference><citation><name>Jeffrey S. Vitter, Random sampling with a reservoir, ACM Transactions on Mathematical Software (TOMS), v.11 n.1, p.37-57, March 1985</name><name>N. Duffield, C. Lund, M. Thorup. Learn more, sample less: control of volume and variance in network measurement. SIGCOMM 2001 Measurement workshop.</name><name>G. Manku and R. Motwani. Approximate frequency counts over data streams. Proc. VLDB, 2002, 346--357.</name><name>Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin</name><name>S. Muthukrishnan. Data stream algorithms and applications. http://www.cs,rutgers.edu/~stream-1-1.ps.</name><name>A. Singh. http://www.cs.ucsb.edu/~ambuj/Courses/multimediaDB/sampling.pdf</name><name>Amitabha Bagchi , Amitabh Chaudhary , David Eppstein , Michael T. 
Goodrich, Deterministic sampling and range counting in geometric data streams, Proceedings of the twentieth annual symposium on Computational geometry, June 08-11, 2004, Brooklyn, New York, USA</name><name>John Hershberger , Subhash Suri, Adaptive sampling for geometric problems over data streams, Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 14-16, 2004, Paris, France</name><name>Peter J. Haas , Jeffrey F. Naughton , S. Seshadri , Lynne Stokes, Sampling-Based Estimation of the Number of Distinct Values of an Attribute, Proceedings of the 21th International Conference on Very Large Data Bases, p.311-322, September 11-15, 1995</name><name>Mayur Datar , S. Muthukrishnan, Estimating Rarity and Similarity over Data Stream Windows, Proceedings of the 10th Annual European Symposium on Algorithms, p.323-334, September 17-21, 2002</name><name>SQL Server 2005. http://www.microsoft.com/technet/prodtechnol/sql/2005/evaluate/dwsqlsy.mspx</name><name>P. Gulutzan and T. Pelzer, SQL-99 Complete, Really, CMP Books, 1999.</name><name>D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul and S. Zdonik. Monitoring Streams - A New Class of Data Management Applications. VLDB 2002.</name><name>Graham Cormode , Theodore Johnson , Flip Korn , S. Muthukrishnan , Oliver Spatscheck , Divesh Srivastava, Holistic UDAFs at streaming speeds, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France</name><name>Marc Gyssens , Laks V. S. Lakshmanan , Iyer N. Subramanian, Tables as a paradigm for querying and restructuring (extended abstract), Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.93-103, June 04-06, 1996, Montreal, Quebec, Canada</name><name>Jianjun Chen , David J. 
DeWitt , Feng Tian , Yuan Wang, NiagaraCQ: a scalable continuous query system for Internet databases, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.379-390, May 15-18, 2000, Dallas, Texas, United States</name><name>S. Chandrasekaran et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. Proc. CIDR 2003.</name><name>Michael Greenwald , Sanjeev Khanna, Space-efficient online computation of quantile summaries, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.58-66, May 21-24, 2001, Santa Barbara, California, United States</name><name>Phillip B. Gibbons, Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports, Proceedings of the 27th International Conference on Very Large Data Bases, p.541-550, September 11-14, 2001</name><name>R. Motwani, J. Widom, A. Arasu. B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, R. Varma. Query Processing, Resource Management, and Approximation in a Data Stream management System. In CIDR, pages 245--256, Jan 2003.</name><name>Chuck Cranor , Theodore Johnson , Oliver Spataschek , Vladislav Shkapenyuk, Gigascope: a stream database for network applications, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California</name><name>H. Wang, C. Zaniolo and C. Luo, ATLAS: A Small but Complete SQL Extension for Data Mining and Data Streams, Proc. VLDB 2003 pg 5--20.</name><name>Y.-N. Law, H. Wang and C. Zaniolo, Query Languages and Data Models for Database Sequences and Data Streams, Proc. VLDB 2004 pg 492--503.</name><name>A. Broder, On the Resemblance and Containment of Documents, Proceedings of the Compression and Complexity of Sequences 1997, p.21, June 11-13, 1997</name></citation><abstract>Complex queries over high speed data streams often need to rely on approximations to keep up with their input. 
The research community has developed a rich literature on approximate streaming algorithms for this application. Many of these algorithms produce samples of the input stream, providing better properties than conventional random sampling. In this paper, we abstract the stream sampling process and design a new stream sample operator. We show how it can be used to implement a wide variety of algorithms that perform sampling and sampling-based aggregations. Also, we show how to implement the operator in Gigascope - a high speed stream database specialized for IP network monitoring applications. As an example study, we apply the operator within such an enhanced Gigascope to perform subset-sum sampling which is of great interest for IP network management. We evaluate this implementation on a live, high speed internet traffic data stream and find that (a) the operator is a flexible, versatile addition to Gigascope suitable for tuning and algorithm engineering, and (b) the operator imposes only a small evaluation overhead. 
This is the first operational implementation we know of, for a wide variety of stream sampling algorithms at line speed within a data stream management system.</abstract></paper><paper><title>Fault-tolerance in the Borealis distributed stream processing system</title><author><AuthorName>Magdalena Balazinska</AuthorName><institute><InstituteName>MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA</InstituteName><country></country></institute></author><author><AuthorName>Hari Balakrishnan</AuthorName><institute><InstituteName>MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA</InstituteName><country></country></institute></author><author><AuthorName>Samuel Madden</AuthorName><institute><InstituteName>MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA</InstituteName><country></country></institute></author><author><AuthorName>Michael Stonebraker</AuthorName><institute><InstituteName>MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Management of Data</conference><citation><name>Daniel J. Abadi , Don Carney , Ugur &amp;#199;etintemel , Mitch Cherniack , Christian Convey , Sangdon Lee , Michael Stonebraker , Nesime Tatbul , Stan Zdonik, Aurora: a new model and architecture for data stream management, The VLDB Journal &amp;mdash; The International Journal on Very Large Data Bases, v.12 n.2, p.120-139, August 2003</name><name>Abadi et al. The design of the Borealis stream processing engine. In CIDR, Jan. 2005.</name><name>Abadi et al. The design of the Borealis stream processing engine. Technical Report CS-04-08, Department of Computer Science, Brown University, Jan. 2005.</name><name>G. Alonso and C. Mohan. WFMS: The next generation of distributed processing tools. In S. Jajodia and L. Kerschberg, editors, Advanced Transaction Models and Architectures. 
Kluwer, 1997.</name><name>Alonso et al. Exotica/FMQM: A persistent message-based architecture for distributed workflow management. In Proc. of IFIP WG8.1 Working Conf. on Information Systems for Decentralized Organizations, Aug. 1995.</name><name>A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: Semantic foundations and query execution. Technical Report 2003-67, Stanford University, Oct. 2003.</name><name>Ron Avnur , Joseph M. Hellerstein, Eddies: continuously adaptive query processing, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.261-272, May 15-18, 2000, Dallas, Texas, United States</name><name>Brian Babcock , Shivnath Babu , Rajeev Motwani , Mayur Datar, Chain: operator scheduling for memory minimization in data stream systems, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California</name><name>Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin</name><name>Philip A. Bernstein , Meichun Hsu , Bruce Mann, Implementing recoverable requests using queues, Proceedings of the 1990 ACM SIGMOD international conference on Management of data, p.112-122, May 23-26, 1990, Atlantic City, New Jersey, United States</name><name>Eric A. Brewer, Lessons from Giant-Scale Services, IEEE Internet Computing, v.5 n.4, p.46-55, July 2001</name><name>D. Carney, U. &amp;#199;etintemel, A. Rasin, S. Zdonik, M. Cherniack, and M. Stonebraker. Operator scheduling in a data stream manager. In 29th VLDB, Sept. 2003.</name><name>S. Chandrasekaran and M. J. Franklin. Remembrance of streams past: Overload-sensitive management of archived streams. In 30th VLDB, Sept. 2004.</name><name>Chandrasekaran et al. 
TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR, Jan. 2003.</name><name>Cherniack et al. Scalable distributed stream processing. In CIDR, Jan. 2003.</name><name>Chuck Cranor , Theodore Johnson , Oliver Spataschek , Vladislav Shkapenyuk, Gigascope: a stream database for network applications, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California</name><name>Abhinandan Das , Johannes Gehrke , Mirek Riedewald, Approximate join processing over data streams, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California</name><name>E. N. (Mootaz) Elnozahy , Lorenzo Alvisi , Yi-Min Wang , David B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys (CSUR), v.34 n.3, p.375-408, September 2002</name><name>Nick Feamster , David G. Andersen , Hari Balakrishnan , M. Frans Kaashoek, Measuring the effects of internet path faults on reactive routing, Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, June 11-14, 2003, San Diego, CA, USA</name><name>Hector Garcia-Molina , Daniel Barbara, How to assign votes in a distributed system, Journal of the ACM (JACM), v.32 n.4, p.841-860, Oct. 1985</name><name>David K. Gifford, Weighted voting for replicated data, Proceedings of the seventh ACM symposium on Operating systems principles, p.150-162, December 10-12, 1979, Pacific Grove, California, United States</name><name>Jim Gray , Pat Helland , Patrick O'Neil , Dennis Shasha, The dangers of replication and a solution, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.173-182, June 04-06, 1996, Montreal, Quebec, Canada</name><name>Jim Gray , Andreas Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1992</name><name>M. Hsu. 
Special issue on workflow systems. IEEE Data Eng. Bulletin, 18(1), Mar. 1995.</name><name>Jeong-Hyon Hwang , Magdalena Balazinska , Alexander Rasin , Ugur Cetintemel , Michael Stonebraker , Stan Zdonik, High-Availability Algorithms for Distributed Stream Processing, Proceedings of the 21st International Conference on Data Engineering (ICDE'05), p.779-790, April 05-08, 2005</name><name>Mohan Kamath , Gustavo Alonso , Roger G&amp;#252;nth&amp;#246;r , C. Mohan, Providing High Availability in Very Large Workflow Management Systems, Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology, p.427-442, March 25-29, 1996</name><name>Leonard Kawell, Jr. , Steven Beckhardt , Timothy Halvorsen , Raymond Ozzie , Irene Greif, Replicated document management in a group communication system, Proceedings of the 1988 ACM conference on Computer-supported cooperative work, September 26-28, 1988, Portland, Oregon, United States</name><name>Y.-N. Law, H. Wang, and C. Zaniolo. Query languages and data models for database sequences and data streams. In 30th VLDB, Sept. 2004.</name><name>David Lomet , Mark Tuttle, A theory of redo recovery, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California</name><name>Motwani et al. Query processing, approximation, and resource management in a data stream management system. In CIDR, Jan. 2003.</name><name>Naughton et al. The Niagara Internet query system. IEEE Data Eng. Bulletin, 24(2), June 2001.</name><name>Christopher Alden Remi Olston , Jennifer Widom, Approximate replication, 2003</name><name>Chris Olston , Jing Jiang , Jennifer Widom, Adaptive filters for continuous queries over distributed data streams, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California</name><name>Vijayshankar Raman , Joseph M. 
Hellerstein, Partial results for online query processing, Proceedings of the 2002 ACM SIGMOD international conference on Management of data, June 03-06, 2002, Madison, Wisconsin</name><name>Mehul A. Shah , Joseph M. Hellerstein , Eric Brewer, Highly available, fault-tolerant, parallel dataflows, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France</name><name>Utkarsh Srivastava , Jennifer Widom, Flexible time management in data stream systems, Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 14-16, 2004, Paris, France</name><name>R. E. Strom. Fault-tolerance in the SMILE stateful publish-subscribe system. In DEBS, May 2004.</name><name>N. Tatbul, U. &amp;#199;etintemel, S. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In 29th VLDB, Sept. 2003.</name><name>D. B. Terry , M. M. Theimer , Karin Petersen , A. J. Demers , M. J. Spreitzer , C. H. Hauser, Managing update conflicts in Bayou, a weakly connected replicated storage system, Proceedings of the fifteenth ACM symposium on Operating systems principles, p.172-182, December 03-06, 1995, Copper Mountain, Colorado, United States</name><name>The NTP Project. NTP: The Network Time Protocol. http://www.ntp.org/.</name><name>P. A. Tucker and D. Maier. Dealing with disorder. In MPDS, June 2003.</name><name>R. Urbano. Oracle Streams Replication Administrator's Guide, 10g Release 1 (10.1). Oracle Corporation, Dec. 2003.</name></citation><abstract>We present a replication-based approach to fault-tolerant distributed stream processing in the face of node failures, network failures, and network partitions. Our approach aims to reduce the degree of inconsistency in the system while guaranteeing that available inputs capable of being processed are processed within a specified time threshold. 
This threshold allows a user to trade availability for consistency: a larger time threshold decreases availability but limits inconsistency, while a smaller threshold increases availability but produces more inconsistent results based on partial data. In addition, when failures heal, our scheme corrects previously produced results, ensuring eventual consistency. Our scheme uses a data-serializing operator to ensure that all replicas process data in the same order, and thus remain consistent in the absence of failures. To regain consistency after a failure heals, we experimentally compare approaches based on checkpoint/redo and undo/redo techniques and illustrate the performance trade-offs between these schemes.</abstract></paper><paper><title>Holistic aggregates in a networked world: distributed tracking of approximate quantiles</title><author><AuthorName>Graham Cormode</AuthorName><institute><InstituteName>Bell Laboratories</InstituteName><country></country></institute></author><author><AuthorName>Minos Garofalakis</AuthorName><institute><InstituteName>Bell Laboratories</InstituteName><country></country></institute></author><author><AuthorName>S. Muthukrishnan</AuthorName><institute><InstituteName>Rutgers and AT&amp;T</InstituteName><country></country></institute></author><author><AuthorName>Rajeev Rastogi</AuthorName><institute><InstituteName>Bell Laboratories</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Management of Data</conference><citation><name>Noga Alon , Phillip B. 
Gibbons , Yossi Matias , Mario Szegedy, Tracking join and self-join sizes in limited storage, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.10-20, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States</name><name>Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, p.20-29, May 22-24, 1996, Philadelphia, Pennsylvania, United States</name><name>Brian Babcock , Chris Olston, Distributed top-k monitoring, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California</name><name>G. Cormode and S. Muthukrishnan. "An improved data stream summary: The count-min sketch and its applications". In Proceedings of Latin American Informatics, 2004.</name><name>Graham Cormode , S. Muthukrishnan, What's hot and what's not: tracking most frequent items dynamically, Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.296-306, June 09-11, 2003, San Diego, California</name><name>Chuck Cranor , Theodore Johnson , Oliver Spataschek , Vladislav Shkapenyuk, Gigascope: a stream database for network applications, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California</name><name>Dartmouth wireless traces. (http://cmc.cs.dartmouth.-edu/data/dartmouth.html).</name><name>A. Das, S. Ganguly, M. Garofalakis, and R. Rastogi. "Distributed Set-Expression Cardinality Estimation". In Proceedings of VLDB, 2004.</name><name>A. Deshpande, C. Guestrin, S. R. Madden, J. M. Hellerstein, and W. Hong. "Model-Driven Data Acquisition in Sensor Networks". 
In Proceedings of VLDB, 2004.</name><name>Sumit Ganguly , Minos Garofalakis , Rajeev Rastogi, Processing set expressions over continuous update streams, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California</name><name>Phillip B. Gibbons, Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports, Proceedings of the 27th International Conference on Very Large Data Bases, p.541-550, September 11-14, 2001</name><name>Anna C. Gilbert , Yannis Kotidis , S. Muthukrishnan , Martin Strauss, Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries, Proceedings of the 27th International Conference on Very Large Data Bases, p.79-88, September 11-14, 2001</name><name>A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. "How to Summarize the Universe: Dynamic Maintenance of Quantiles". In Proceedings of VLDB, 2002.</name><name>Michael Greenwald , Sanjeev Khanna, Space-efficient online computation of quantile summaries, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.58-66, May 21-24, 2001, Santa Barbara, California, United States</name><name>Michael B. Greenwald , Sanjeev Khanna, Power-conserving computation of order-statistics over sensor networks, Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 14-16, 2004, Paris, France</name><name>G. H. Hardy, J. E. Littlewood, and G. P&amp;#243;lya. "Inequalities". Cambridge University Press, 1988. (Second Edition).</name><name>Internet traffic archive. (http://ita.ee.lbl.gov/).</name><name>Samuel Madden , Michael J. Franklin , Joseph M. 
Hellerstein , Wei Hong, The design of an acquisitional query processor for sensor networks, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California</name><name>Amit Manjhi , Vladislav Shkapenyuk , Kedar Dhamdhere , Christopher Olston, Finding (Recently) Frequent Items in Distributed Data Streams, Proceedings of the 21st International Conference on Data Engineering (ICDE'05), p.767-778, April 05-08, 2005</name><name>Gurmeet Singh Manku , Sridhar Rajagopalan , Bruce G. Lindsay, Approximate medians and other quantiles in one pass and with limited memory, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.426-435, June 01-04, 1998, Seattle, Washington, United States</name><name>G. S. Manku and R. Motwani. "Approximate Frequency Counts over Data Streams". In Proceedings of VLDB, 2002.</name><name>Chris Olston , Jing Jiang , Jennifer Widom, Adaptive filters for continuous queries over distributed data streams, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California</name><name>Nisheeth Shrivastava , Chiranjeeb Buragohain , Divyakant Agrawal , Subhash Suri, Medians and beyond: new aggregation techniques for sensor networks, Proceedings of the 2nd international conference on Embedded networked sensor systems, November 03-05, 2004, Baltimore, MD, USA</name></citation><abstract>While traditional database systems optimize for performance on one-shot queries, emerging large-scale monitoring applications require continuous tracking of complex aggregates and data-distribution summaries over collections of physically-distributed streams. Thus, effective solutions have to be simultaneously space efficient (at each remote site), communication efficient (across the underlying communication network), and provide continuous, guaranteed-quality estimates. 
In this paper, we propose novel algorithmic solutions for the problem of continuously tracking complex holistic aggregates in such a distributed-streams setting --- our primary focus is on approximate quantile summaries, but our approach is more broadly applicable and can handle other holistic-aggregate functions (e.g., "heavy-hitters" queries). We present the first known distributed-tracking schemes for maintaining accurate quantile estimates with provable approximation guarantees, while simultaneously optimizing the storage space at each remote site as well as the communication cost across the network. In a nutshell, our algorithms employ a combination of local tracking at remote sites and simple prediction models for local site behavior in order to produce highly communication- and space-efficient solutions. We perform extensive experiments with real and synthetic data to explore the various tradeoffs and understand the role of prediction models in our schemes. The results clearly validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees.</abstract></paper><paper><title>Deriving private information from randomized data</title><author><AuthorName>Zhengli Huang</AuthorName><institute><InstituteName>Syracuse University, Syracuse, NY</InstituteName><country></country></institute></author><author><AuthorName>Wenliang Du</AuthorName><institute><InstituteName>Syracuse University, Syracuse, NY</InstituteName><country></country></institute></author><author><AuthorName>Biao Chen</AuthorName><institute><InstituteName>Syracuse University, Syracuse, NY</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Management of Data</conference><citation><name>Dakshi Agrawal , Charu C. 
Aggarwal, On the design and quantification of privacy preserving data mining algorithms, Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.247-255, May 2001, Santa Barbara, California, United States</name><name>Rakesh Agrawal , Ramakrishnan Srikant, Privacy-preserving data mining, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.439-450, May 15-18, 2000, Dallas, Texas, United States</name><name>R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. V. der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.</name><name>R. Bronson. Linear Algebra, An Introduction. Academic Press, 1991.</name><name>Chris Clifton , Murat Kantarcioglu , Jaideep Vaidya , Xiaodong Lin , Michael Y. Zhu, Tools for privacy preserving distributed data mining, ACM SIGKDD Explorations Newsletter, v.4 n.2, p.28-34, December 2002</name><name>W. Du, Y. S. Han, and S. Chen. Privacy-preserving multivariate statistical analysis: Linear regression and classification. 
In Proceedings of the 4th SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, April 2004.</name><name>Wenliang Du , Zhijun Zhan, Using randomized response techniques for privacy-preserving data mining, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2003, Washington, D.C.</name><name>Alexandre Evfimievski , Johannes Gehrke , Ramakrishnan Srikant, Limiting privacy breaches in privacy preserving data mining, Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.211-222, June 09-11, 2003, San Diego, California</name><name>Alexandre Evfimievski , Ramakrishnan Srikant , Rakesh Agrawal , Johannes Gehrke, Privacy preserving mining of association rules, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 23-26, 2002, Edmonton, Alberta, Canada</name><name>Bobi Gilburd , Assaf Schuster , Ran Wolff, k-TTP: a new privacy model for large-scale distributed environments, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA</name><name>Shafi Goldwasser, Multi party computations: past and present, Proceedings of the sixteenth annual ACM symposium on Principles of distributed computing, p.1-6, August 21-24, 1997, Santa Barbara, California, United States</name><name>Richard W Hamming, Numerical methods for scientists and engineers (2nd ed.), Dover Publications, Inc., New York, NY, 1986</name><name>W. Hardle and L. Simar. Applied Multivariate Statistical Analysis. Springer-Verlag, 2003.</name><name>I. T. Jolliffe. Principal Component Analysis. 
Springer-Verlag, 1986.</name><name>Murat Kantarcio&amp;#487;lu , Jiashun Jin , Chris Clifton, When do data mining results violate privacy?, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA</name><name>Hillol Kargupta , Souptik Datta , Qi Wang , Krishnamoorthy Sivakumar, On the Privacy Preserving Properties of Random Data Perturbation Techniques, Proceedings of the Third IEEE International Conference on Data Mining, p.99, November 19-22, 2003</name><name>Yehuda Lindell , Benny Pinkas, Privacy Preserving Data Mining, Proceedings of the 20th Annual International Cryptology Conference on Advances in Cryptology, p.36-54, August 20-24, 2000</name><name>D. Meng , K. Sivakumar , H. Kargupta, Privacy-Sensitive Bayesian Network Parameter Learning, Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), p.487-490, November 01-04, 2004</name><name>Benny Pinkas, Cryptographic techniques for privacy-preserving data mining, ACM SIGKDD Explorations Newsletter, v.4 n.2, p.12-19, December 2002</name><name>H. Vincent Poor, An introduction to signal detection and estimation (2nd ed.), Springer-Verlag New York, Inc., New York, NY, 1994</name><name>S. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.</name><name>Ashish P. Sanil , Alan F. Karr , Xiaodong Lin , Jerome P. 
Reiter, Privacy preserving regression modelling via distributed computation, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA</name><name>Jaideep Vaidya , Chris Clifton, Privacy preserving association rule mining in vertically partitioned data, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 23-26, 2002, Edmonton, Alberta, Canada</name><name>Jaideep Vaidya , Chris Clifton, Privacy-preserving k-means clustering over vertically partitioned data, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2003, Washington, D.C.</name><name>Ke Wang , Philip S. Yu , Sourav Chakraborty, Bottom-Up Generalization: A Data Mining Solution to Privacy Protection, Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), p.249-256, November 01-04, 2004</name><name>S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. The American Statistical Association, 60(309):63--69, March 1965.</name><name>Rebecca Wright , Zhiqiang Yang, Privacy-preserving Bayesian network structure computation on distributed heterogeneous data, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA</name></citation><abstract>Randomization has emerged as a useful technique for data disguising in privacy-preserving data mining. Its privacy properties have been studied in a number of papers. Kargupta et al. challenged the randomization schemes, and they pointed out that randomization might not be able to preserve privacy. 
However, it is still unclear what factors cause such a security breach, how they affect the privacy preserving property of the randomization, and what kinds of data have higher risk of disclosing their private contents even though they are randomized. We believe that the key factor is the correlations among attributes. We propose two data reconstruction methods that are based on data correlations. One method uses the Principal Component Analysis (PCA) technique, and the other method uses the Bayes Estimate (BE) technique. We have conducted theoretical and experimental analysis on the relationship between data correlations and the amount of private information that can be disclosed based on our proposed data reconstruction schemes. Our studies have shown that when the correlations are high, the original data can be reconstructed more accurately, i.e., more private information can be disclosed. To improve privacy, we propose a modified randomization scheme, in which we make the correlations of the random noise "similar" to those of the original data. Our results have shown that the reconstruction accuracy of both PCA-based and BE-based schemes becomes worse as the similarity increases.</abstract></paper><paper><title>Incognito: efficient full-domain K-anonymity</title><author><AuthorName>Kristen LeFevre</AuthorName><institute><InstituteName>University of Wisconsin - Madison, Madison, WI</InstituteName><country></country></institute></author><author><AuthorName>David J. DeWitt</AuthorName><institute><InstituteName>University of Wisconsin - Madison, Madison, WI</InstituteName><country></country></institute></author><author><AuthorName>Raghu Ramakrishnan</AuthorName><institute><InstituteName>University of Wisconsin - Madison, Madison, WI</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Management of Data</conference><citation><name>G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu. 
Anonymizing tables. In Proc. of the 10th Int'l Conference on Database Theory, January 2005.</name><name>Rakesh Agrawal , Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules in Large Databases, Proceedings of the 20th International Conference on Very Large Data Bases, p.487-499, September 12-15, 1994</name><name>Roberto J. Bayardo , Rakesh Agrawal, Data Privacy through Optimal k-Anonymization, Proceedings of the 21st International Conference on Data Engineering (ICDE'05), p.217-228, April 05-08, 2005</name><name>Richard Ernest Bellman, Dynamic Programming, Dover Publications, Incorporated, 2003</name><name>C. Blake and C. Merz. UCI repository of machine learning databases, 1998.</name><name>Surajit Chaudhuri , Umeshwar Dayal, An overview of data warehousing and OLAP technology, ACM SIGMOD Record, v.26 n.1, p.65-74, March 1997</name><name>Benjamin C.  M. Fung , Ke Wang , Philip S. Yu, Top-Down Specialization for Information and Privacy Preservation, Proceedings of the 21st International Conference on Data Engineering (ICDE'05), p.205-216, April 05-08, 2005</name><name>Jim Gray , Surajit Chaudhuri , Adam Bosworth , Andrew Layman , Don Reichart , Murali Venkatrao , Frank Pellow , Hamid Pirahesh, Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, Data Mining and Knowledge Discovery, v.1 n.1, p.29-53, 1997</name><name>Venky Harinarayan , Anand Rajaraman , Jeffrey D. Ullman, Implementing data cubes efficiently, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.205-216, June 04-06, 1996, Montreal, Quebec, Canada</name><name>A. Hundepool and L. Willenborg. &amp;mu;- and &amp;mu;-ARGUS: Software for statistical disclosure control. In Proc. of the Third Int'l Seminar on Statistical Confidentiality, 1996.</name><name>Vijay S. 
Iyengar, Transforming data to satisfy privacy constraints, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 23-26, 2002, Edmonton, Alberta, Canada</name><name>K. LeFevre, D. DeWitt, and R. Ramakrishnan. Multidimensional k-anonymity. Technical Report 1521, University of Wisconsin, 2005.</name><name>Adam Meyerson , Ryan Williams, On the complexity of optimal K-anonymity, Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 14-16, 2004, Paris, France</name><name>P. Samarati, Protecting Respondents' Identities in Microdata Release, IEEE Transactions on Knowledge and Data Engineering, v.13 n.6, p.1010-1027, November 2001</name><name>P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical Report SRI-CSL-98-04, SRI Computer Science Laboratory, 1998.</name><name>Ramakrishnan Srikant , Rakesh Agrawal, Mining Generalized Association Rules, Proceedings of the 21th International Conference on Very Large Data Bases, p.407-419, September 11-15, 1995</name><name>Latanya Sweeney, Achieving k-anonymity privacy protection using generalization and suppression, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, v.10 n.5, p.571-588, October 2002</name><name>Latanya Sweeney, k-anonymity: a model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, v.10 n.5, p.557-570, October 2002</name><name>Ke Wang , Philip S. Yu , Sourav Chakraborty, Bottom-Up Generalization: A Data Mining Solution to Privacy Protection, Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), p.249-256, November 01-04, 2004</name><name>L. Willenborg and T. deWaal. Elements of Statistical Disclosure Control. Springer Verlag Lecture Notes in Statistics, 2000.</name><name>W. Winkler. 
Using simulated annealing for k-anonymity. Research Report 2002-07, US Census Bureau Statistical Research Division, November 2002.</name></citation><abstract>A number of organizations publish microdata for purposes such as public health and demographic research. Although attributes that clearly identify individuals, such as Name and Social Security Number, are generally removed, these databases can sometimes be joined with other public databases on attributes such as Zipcode, Sex, and Birthdate to re-identify individuals who were supposed to remain anonymous. "Joining" attacks are made easier by the availability of other, complementary, databases over the Internet. K-anonymization is a technique that prevents joining attacks by generalizing and/or suppressing portions of the released microdata so that no individual can be uniquely distinguished from a group of size k. In this paper, we provide a practical framework for implementing one model of k-anonymization, called full-domain generalization. We introduce a set of algorithms for producing minimal full-domain generalizations, and show that these algorithms perform up to an order of magnitude faster than previous algorithms on two real-life databases. Besides full-domain generalization, numerous other models have also been proposed for k-anonymization. The second contribution in this paper is a single taxonomy that categorizes previous models and introduces some promising new alternatives.</abstract></paper><paper><title>To do or not to do: the dilemma of disclosing anonymized data</title><author><AuthorName>Laks V. S. Lakshmanan</AuthorName><institute><InstituteName>University of British Columbia</InstituteName><country></country></institute></author><author><AuthorName>Raymond T. 
Ng</AuthorName><institute><InstituteName>University of British Columbia</InstituteName><country></country></institute></author><author><AuthorName>Ganesh Ramesh</AuthorName><institute><InstituteName>University of British Columbia</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Management of Data</conference><citation><name>Nabil R. Adam , John C. Worthmann, Security-control methods for statistical databases: a comparative study, ACM Computing Surveys (CSUR), v.21 n.4, p.515-556, Dec. 1989</name><name>Aggarwal C. C and Yu P. S. A Condensation Approach to Privacy Preserving Data Mining. EDBT, 2004.</name><name>Aggarwal. G et. al. Anonymizing Tables. ICDT Conference, 2005.</name><name>Rakesh Agrawal , Ramakrishnan Srikant, Privacy-preserving data mining, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.439-450, May 15-18, 2000, Dallas, Texas, United States</name><name>Dakshi Agrawal , Charu C. Aggarwal, On the design and quantification of privacy preserving data mining algorithms, Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.247-255, May 2001, Santa Barbara, California, United States</name><name>Rakesh Agrawal , Tomasz Imieli&amp;#324;ski , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States</name><name>Chris Clifton, Using sample size to limit exposure to data mining, Journal of Computer Security, v.8 n.4, p.281-307, Dec. 
2000</name><name>Irit Dinur , Kobbi Nissim, Revealing information while preserving privacy, Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.202-210, June 09-11, 2003, San Diego, California</name><name>Josep Domingo-Ferrer , Anna Oganian , Vicen&amp;#231; Torra, Information-Theoretic Disclosure Risk Measures in Statistical Disclosure Control of Tabular Data, Proceedings of the 14th International Conference on Scientific and Statistical Database Management, p.227-231, July 24-26, 2002</name><name>Alexandre Evfimievski , Ramakrishnan Srikant , Rakesh Agarwal , Johannes Gehrke, Privacy preserving mining of association rules, Information Systems, v.29 n.4, p.343-364, June 2004</name><name>Fienberg S. E. et. al. Disclosure Limitation using perturbation and related methods for Categorical Data. Journal of Office Statistics, 14, 1998.</name><name>Vijay S. Iyengar, Transforming data to satisfy privacy constraints, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 23-26, 2002, Edmonton, Alberta, Canada</name><name>Mark Jerrum , Alistair Sinclair , Eric Vigoda, A polynomial-time approximation algorithm for the permanent of a matrix with  non-negative entries, Proceedings of the thirty-third annual ACM symposium on Theory of computing, p.712-721, July 2001, Hersonissos, Greece</name><name>M. Jerrum and U. Vazirani. A mildly exponential approximation algorithm for the permanent. Algorithmica 16, 1996.</name><name>Murat Kantarcioglu , Chris Clifton, Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data, IEEE Transactions on Knowledge and Data Engineering, v.16 n.9, p.1026-1037, September 2004</name><name>Yehuda Lindell , Benny Pinkas, Privacy Preserving Data Mining, Proceedings of the 20th Annual International Cryptology Conference on Advances in Cryptology, p.36-54, August 20-24, 2000</name><name>Moore Jr. R. A. 
Controlled Data-Swapping Techniques for Masking Public Use Microdata Sets. Statistical Research Division Report Series, RR 96-04, US Bureau of Census, Washington D. C., 1996.</name><name>Krishnamurty Muralidhar , Rathindra Sarathy, Security of random data perturbation methods, ACM Transactions on Database Systems (TODS), v.24 n.4, p.487-493, Dec. 1999</name><name>Adam Meyerson , Ryan Williams, On the complexity of optimal K-anonymity, Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 14-16, 2004, Paris, France</name><name>Benny Pinkas, Cryptographic techniques for privacy-preserving data mining, ACM SIGKDD Explorations Newsletter, v.4 n.2, p.12-19, December 2002</name><name>Rasmussen L. E. Approximating the permanent: A simple approach. Random Structures and Algorithms, 5, 1994.</name><name>Samarati P. and Sweeney L. Protecting Privacy when Disclosing Information: k-anonymity and its Enforcement through Generalization and Suppression. IEEE Symposium on Research in Security and Privacy, 1998.</name><name>Latanya Sweeney, k-anonymity: a model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, v.10 n.5, p.557-570, October 2002</name><name>J. F. Traub , Y. Yemini , H. Wo&amp;#378;niakowski, The statistical security of a statistical database, ACM Transactions on Database Systems (TODS), v.9 n.4, p.672-679, Dec. 1984</name><name>Valiant L. G. The Complexity of Computing the Permanent. Theoretical Computer Science, 8, 1979.</name><name>Vassilios S. Verykios , Ahmed K. Elmagarmid , Elisa Bertino , Yucel Saygin , Elena Dasseni, Association Rule Hiding, IEEE Transactions on Knowledge and Data Engineering, v.16 n.4, p.434-447, April 2004</name><name>Yang X. and Li C. Secure XML Publishing without Information Leakage in the Presence of Data Inference. 
VLDB Conference, 2004.</name></citation><abstract>Decision makers of companies often face the dilemma of whether to release data for knowledge discovery, vis a vis the risk of disclosing proprietary or sensitive information. While there are various "sanitization" methods, in this paper we focus on anonymization, given its widespread use in practice. We give due diligence to the question of "just how safe the anonymized data is", in terms of protecting the true identities of the data objects. We consider both the scenarios when the hacker has no information, and more realistically, when the hacker may have partial information about items in the domain. We conduct our analyses in the context of frequent set mining. We propose to capture the prior knowledge of the hacker by means of a belief function, where an educated guess of the frequency of each item is assumed. For various classes of belief functions, which correspond to different degrees of prior knowledge, we derive formulas for computing the expected number of "cracks". While obtaining the exact values for the more general situations is computationally hard, we propose a heuristic called the O-estimate. It is easy to compute, and is shown to be accurate empirically with real benchmark datasets. 
Finally, based on the O-estimates, we propose a recipe for the decision makers to resolve their dilemma.</abstract></paper><paper><title>Constrained optimalities in query personalization</title><author><AuthorName>Georgia Koutrika</AuthorName><institute><InstituteName>University of Athens, Hellas</InstituteName><country></country></institute></author><author><AuthorName>Yannis Ioannidis</AuthorName><institute><InstituteName>University of Athens, Hellas</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Management of Data</conference><citation><name>Nicolas Bruno , Surajit Chaudhuri , Luis Gravano, Top-k selection queries over relational databases: Mapping strategies and performance evaluation, ACM Transactions on Database Systems (TODS), v.27 n.2, p.153-187, June 2002</name><name>Surajit Chaudhuri , Luis Gravano, Evaluating Top-k Selection Queries, Proceedings of the 25th International Conference on Very Large Data Bases, p.397-410, September 07-10, 1999</name><name>Sumit Ganguly , Waqar Hasan , Ravi Krishnamurthy, Query optimization for parallel execution, Proceedings of the 1992 ACM SIGMOD international conference on Management of data, p.9-18, June 02-05, 1992, San Diego, California, United States</name><name>Glover F. Tabu Search - Part I. ORSA Journal on Computing, Vol. 1, pp. 190--206, 1989.</name><name>David E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1989</name><name>F. Ilyas , G. Aref , K. Elmagarmid, Supporting top-k join queries in relational databases, The VLDB Journal &amp;mdash; The International Journal on Very Large Data Bases, v.13 n.3, p.207-221, September 2004</name><name>Internet Movies Database. 
Available at www.imdb.com.</name><name>George Karypis, Evaluation of Item-Based Top-N Recommendation Algorithms, Proceedings of the tenth international conference on Information and knowledge management, October 05-10, 2001, Atlanta, Georgia, USA</name><name>Kellerer, H., Pferschy, U., Pisinger, D. Knapsack Problems. Springer-Verlag, 2003.</name><name>Kirkpatrick, Gelatt, C. D., Vecchi, M. P. Optimization by Simulated Annealing, Science 220, 4598, 671--680, 1983.</name><name>Donald Kossmann , Konrad Stocker, Iterative dynamic programming: a new class of query optimization algorithms, ACM Transactions on Database Systems (TODS), v.25 n.1, p.43-82, March 2000</name><name>Georgia Koutrika , Yannis Ioannidis, Personalization of Queries in Database Systems, Proceedings of the 20th International Conference on Data Engineering, p.597, March 30-April 02, 2004</name><name>Fang Liu , Clement Yu , Weiyi Meng, Personalized web search by mapping user queries to categories, Proceedings of the eleventh international conference on Information and knowledge management, November 04-09, 2002, McLean, Virginia, USA</name><name>James Pitkow , Hinrich Sch&amp;#252;tze , Todd Cass , Rob Cooley , Don Turnbull , Andy Edmonds , Eytan Adar , Thomas Breuel, Personalized search, Communications of the ACM, v.45 n.9, September 2002</name><name>P. Griffiths Selinger , M. M. Astrahan , D. D. Chamberlin , R. A. Lorie , T. G. Price, Access path selection in a relational database management system, Proceedings of the 1979 ACM SIGMOD international conference on Management of data, May 30-June 01, 1979, Boston, Massachusetts</name></citation><abstract>Personalization is a powerful mechanism that helps users to cope with the abundance of information on the Web. Database query personalization achieves this by dynamically constructing queries that return results of high interest to the user. 
This, however, may conflict with other constraints on the query execution time and/or result size that may be imposed by the search context, such as the device used, the network connection, etc. For example, if the user is accessing information using a mobile phone, then it is desirable to construct a personalized query that executes quickly and returns a handful of answers. Constrained Query Personalization (CQP) is an integrated approach to database query answering that dynamically takes into account the queries issued, the user's interest in the results, response time, and result size in order to build personalized queries. In this paper, we introduce CQP as a family of constrained optimization problems, where each time one of the parameters of concern is optimized while the others remain within the bounds of range constraints. Taking into account some key (exact or approximate) properties of these parameters, we map CQP to a state search problem and provide several algorithms for the discovery of optimal solutions. Experimental results demonstrate the effectiveness of the proposed techniques and the appropriateness of the overall approach.</abstract></paper><paper><title>Reference reconciliation in complex information spaces</title><author><AuthorName>Xin Dong</AuthorName><institute><InstituteName>University of Washington, Seattle, WA</InstituteName><country></country></institute></author><author><AuthorName>Alon Halevy</AuthorName><institute><InstituteName>University of Washington, Seattle, WA</InstituteName><country></country></institute></author><author><AuthorName>Jayant Madhavan</AuthorName><institute><InstituteName>University of Washington, Seattle, WA</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Management of Data</conference><citation><name>R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. In Proc. 
of VLDB, 2002.</name><name>Indrajit Bhattacharya , Lise Getoor, Iterative record linkage for cleaning and integration, Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, June 13, 2004, Paris, France</name><name>Mikhail Bilenko , Raymond J. Mooney, Adaptive duplicate detection using learnable string similarity measures, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2003, Washington, D.C.</name><name>M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems Special Issue on Information Integration on the Web, September 2003.</name><name>V. Bush. As we may think. The Atlantic Monthly, 1945.</name><name>Surajit Chaudhuri , Kris Ganjam , Venkatesh Ganti , Rajeev Motwani, Robust and efficient fuzzy match for online data cleaning, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California</name><name>Computer and information science papers citeseer publications researchindex. http://citeseer.ist.psu.edu/.</name><name>W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration, 2002.</name><name>William W. Cohen , Henry Kautz , David McAllester, Hardening soft information sources, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.255-259, August 20-23, 2000, Boston, Massachusetts, United States</name><name>W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWEB, pages 73--78, 2003.</name><name>http://www.cs.umass.edu/~mccallum/data/cora-refs.tar.gz.</name><name>A. Doan, Y. Lu, Y. Lee, and J. Han. Object matching for information integration: a profiler-based approach. In IIWeb, 2003.</name><name>X. Dong and A. Halevy. 
A Platform for Personal Information Management and Integration. In Proc. of CIDR, 2005.</name><name>X. Dong, A. Halevy, and J. Madhavan. Reference Reconciliation in Complex Information Spaces. Technical Report 2005-03-04, Univ. of Washington, 2005.</name><name>X. Dong, A. Halevy, E. Nemes, S. Sigurdsson, and P. Domingos. Semex: Toward on-the-fly personal information integration. In IIWeb, 2004.</name><name>Susan Dumais , Edward Cutrell , JJ Cadiz , Gavin Jancke , Raman Sarin , Daniel C. Robbins, Stuff I've seen: a system for personal information retrieval and re-use, Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, July 28-August 01, 2003, Toronto, Canada</name><name>I. P. Fellegi and A. B. Sunter. A theory for record linkage. In Journal of the American Statistical Association, 1969.</name><name>Helena Galhardas , Daniela Florescu , Dennis Shasha , Eric Simon , Cristian-Augustin Saita, Declarative Data Cleaning: Language, Model, and Algorithms, Proceedings of the 27th International Conference on Very Large Data Bases, p.371-380, September 11-14, 2001</name><name>Google. http://desktop.google.com/, 2004.</name><name>L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: current practice and future directions. http://www.act.cmis.csiro.au/rohanb/PAPERS/record.linkage.pdf.</name><name>Mauricio A. Hern&amp;#225;ndez , Salvatore J. Stolfo, The merge/purge problem for large databases, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.127-138, May 22-25, 1995, San Jose, California, United States</name><name>Liang Jin , Chen Li , Sharad Mehrotra, Efficient Record Linkage in Large Data Sets, Proceedings of the Eighth International Conference on Database Systems for Advanced Applications, p.137, March 26-28, 2003</name><name>D. V. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships for domain-independent data cleaning. 
In SIAM Data Mining (SDM), 2005.</name><name>Mong Li Lee , Tok Wang Ling , Wai Lup Low, IntelliClean: a knowledge-based intelligent data cleaner, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.290-294, August 20-23, 2000, Boston, Massachusetts, United States</name><name>Andrew Kachites McCallum , Kamal Nigam , Jason Rennie , Kristie Seymore, Automating the Construction of Internet Portals with Machine Learning, Information Retrieval, v.3 n.2, p.127-163, July 2000</name><name>A. McCallum and B. Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In IIWEB, 2003.</name><name>Andrew McCallum , Kamal Nigam , Lyle H. Ungar, Efficient clustering of high-dimensional data sets with application to reference matching, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.169-178, August 20-23, 2000, Boston, Massachusetts, United States</name><name>M. Michalowski, S. Thakkar, and C. A. Knoblock. Exploiting secondary sources for unsupervised record linkage. In IIWeb, 2004.</name><name>H. Newcombe, J. Kennedy, S. Axford, and A. James. Automatic linkage of vital records. In Science 130 (1959), no. 3381, pages 954--959, 1959.</name><name>Parag and P. Domingos. Multi-relational record linkage. In MRDM, 2004.</name><name>H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In NIPS, 2002.</name><name>J. C. Pinheiro and D. X. Sun. Methods for linking and mining massive heterogeneous databases. In SIGKDD, 1998.</name><name>D. Quan, D. Huynh, and D. R. Karger. Haystack: A platform for authoring end user semantic web applications. 
In ISWC, 2003.</name><name>Sunita Sarawagi , Anuradha Bhamidipaty, Interactive deduplication using active learning, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 23-26, 2002, Edmonton, Alberta, Canada</name><name>Sheila Tejada , Craig A. Knoblock , Steven Minton, Learning domain-independent string transformation weights for high accuracy object identification, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 23-26, 2002, Edmonton, Alberta, Canada</name><name>W. E. Winkler. Using the em algorithm for weight computation in the fellegi-sunter model of record linkage. In Section on Survey Research Methods, 1988.</name><name>W. E. Winkler. The state of record linkage and current research problems. Technical report, U. S. Bureau of the Census, Washington, DC, 1999.</name></citation><abstract>Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed references to a single class that had a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on one's desktop. Our reconciliation algorithm has three principal features. First, we exploit the associations between references to design new methods for reference comparison. Second, we propagate information between reconciliation decisions to accumulate positive and negative evidence. Third, we gradually enrich references by merging attribute values. 
Our experiments show that (1) we considerably improve precision and recall over standard methods on a diverse set of personal information datasets, and (2) there are advantages to using our algorithm even on a standard citation dataset benchmark.</abstract></paper><paper><title>Magnet: supporting navigation in semistructured data environments</title><author><AuthorName>Vineet Sinha</AuthorName><institute><InstituteName>MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), Cambridge, MA</InstituteName><country></country></institute></author><author><AuthorName>David R. Karger</AuthorName><institute><InstituteName>MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), Cambridge, MA</InstituteName><country></country></institute></author><year>2005</year><conference>International Conference on Management of Data</conference><citation><name>Marcia J. Bates. Information search tactics. Journal of the American Society for Information Science, 30(4):205--214, 1979.</name><name>Marcia J. Bates. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5):407--424, October 1989.</name><name>Marcia J. Bates, Where should the person stop and the information search interface start?, Information Processing and Management: an International Journal, v.26 n.5, p.575-591, 1990</name><name>David Carmel , Yoelle S. Maarek , Matan Mandelbrod , Yosi Mass , Aya Soffer, Searching XML documents via XML fragments, Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, July 28-August 01, 2003, Toronto, Canada</name><name>Tiziana Catarci, Maria Francesca Costabile, Stefano Levialdi, and Carlo Batini. Visual query systems for databases: A survey. Journal of Visual Languages and Computing, 8(2):215--260, 1997.</name><name>Douglass R. Cutting , David R. Karger , Jan O. Pedersen , John W. 
Tukey, Scatter/Gather: a cluster-based approach to browsing large document collections, Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, p.318-329, June 21-24, 1992, Copenhagen, Denmark</name><name>Daniel Egnor and Robert Lord. Structured information retrieval using XML. In Working Notes of the ACM SIGIR Workshop on XML and Information Retrieval, 2000.</name><name>User Interface Engineering. Users don't learn to search better. http://www.uie.com/Articles/not_learn_search.htm.</name><name>Norbert Fuhr, Mounia Lalmas, and Saadia Malik, editors. INitiative for the Evaluation of XML Retrieval (INEX). Proceedings of the Second INEX Workshop, December 2003.</name><name>George W. Furnas , Samuel J. Rauch, Considerations for information environments and the NaviQue workspace, Proceedings of the third ACM conference on Digital libraries, p.79-88, June 23-26, 1998, Pittsburgh, Pennsylvania, United States</name><name>Roy Goldman , Jennifer Widom, Interactive Query and Search in Semistructured Databases, Selected papers from the International Workshop on The World Wide Web and Databases, p.52-62, March 27-28, 1998</name><name>Donna Harman, Ranking algorithms, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992</name><name>Donna Harman, Relevance feedback and other query modification techniques, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992</name><name>Haystack. The universal information client. http://haystack.lcs.mit.edu/.</name><name>The Apache Software Foundation Jakarta Project. The Lucene search engine. http://www.lucene.com/.</name><name>Susanne Jul , George W. Furnas, Navigation in electronic worlds: a CHI 97 workshop, ACM SIGCHI Bulletin, v.29 n.4, p.44-49, Oct. 
1997</name><name>Jaap Kamps , Maarten Marx , Maarten de Rijke , B&amp;#246;rkur Sigurbj&amp;#246;rnsson, XML retrieval: what to retrieve?, Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, July 28-August 01, 2003, Toronto, Canada</name><name>O. Lassila and R. Swick. Resource description framework (RDF): Model and syntax specification. http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, February 1999. W3C Recommendation.</name><name>Jason McHugh , Serge Abiteboul , Roy Goldman , Dallas Quass , Jennifer Widom, Lore: a database management system for semistructured data, ACM SIGMOD Record, v.26 n.3, p.54-66, Sept. 1997</name><name>Penny Nii, The blackboard model of problem solving, AI Magazine, v.7 n.2, p.38-53, Summer 1986</name><name>Donald A. Norman, Design rules based on analyses of human error, Communications of the ACM, v.26 n.4, p.254-258, April 1983</name><name>Peter Pirolli , Stuart Card, Information foraging in information access environments, Proceedings of the SIGCHI conference on Human factors in computing systems, p.51-58, May 07-11, 1995, Denver, Colorado, United States</name><name>The Simile Project. Longwell suite of web-based RDF browsers. http://simile.mit.edu/longwell/.</name><name>Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger. Tackling the poor assumptions of naive bayes text classifiers. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.</name><name>Jaime Teevan , Christine Alvarado , Mark S. Ackerman , David R. 
Karger, The perfect search engine is not enough: a study of orienteering behavior in directed search, Proceedings of the SIGCHI conference on Human factors in computing systems, p.415-422, April 24-29, 2004, Vienna, Austria</name><name>Agathoniki Trigoni, Interactive Query Formulation in Semistructured Databases, Proceedings of the 5th International Conference on Flexible Query Answering Systems, p.356-369, October 27-29, 2002</name><name>Bienvenido V&amp;#233;lez , Ron Weiss , Mark A. Sheldon , David K. Gifford, Fast and effective query refinement, Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, p.6-15, July 27-31, 1997, Philadelphia, Pennsylvania, United States</name><name>Ka-Ping Yee , Kirsten Swearingen , Kevin Li , Marti Hearst, Faceted metadata for image search and browsing, Proceedings of the SIGCHI conference on Human factors in computing systems, April 05-10, 2003, Ft. Lauderdale, Florida, USA</name><name>Fran&amp;#231;ois Yergeau, Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, and Eve Maler. Extensible markup language (XML). http://www.w3.org/TR/2004/REC-xml-20040204/, February 2004. W3C Recommendation.</name></citation><abstract>With the growing importance of systems containing arbitrary semi-structured relationships, the need for supporting users searching in such repositories has grown. Currently support for users' search needs 
