
📄 sigmod_2001_elementary.txt

📁 Written using lwp::get
<proceedings><paper><title>Efficient computation of Iceberg cubes with complex measures</title><author><AuthorName>Jiawei Han</AuthorName><institute><InstituteName>School of Computing Science, Simon Fraser University, B.C., Canada</InstituteName><country></country></institute></author><author><AuthorName>Jian Pei</AuthorName><institute><InstituteName>School of Computing Science, Simon Fraser University, B.C., Canada</InstituteName><country></country></institute></author><author><AuthorName>Guozhu Dong</AuthorName><institute><InstituteName>Department of Computer Science, Wright State University, Dayton, OH</InstituteName><country></country></institute></author><author><AuthorName>Ke Wang</AuthorName><institute><InstituteName>School of Computing Science, Simon Fraser University, B.C., Canada</InstituteName><country></country></institute></author><year>2001</year><conference>International Conference on Management of Data</conference><citation><name>Sameet Agarwal , Rakesh Agrawal , Prasad Deshpande , Ashish Gupta , Jeffrey F. Naughton , Raghu Ramakrishnan , Sunita Sarawagi, On the Computation of Multidimensional Aggregates, Proceedings of the 22th International Conference on Very Large Data Bases, p.506-521, September 03-06, 1996</name><name>Rakesh Agrawal , Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules in Large Databases, Proceedings of the 20th International Conference on Very Large Data Bases, p.487-499, September 12-15, 1994</name><name>R. J. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining on large, dense data sets. 
ICDE'99.</name><name>Kevin Beyer , Raghu Ramakrishnan, Bottom-up computation of sparse and Iceberg CUBE, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.359-370, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States</name><name>Surajit Chaudhuri , Umeshwar Dayal, An overview of data warehousing and OLAP technology, ACM SIGMOD Record, v.26 n.1, p.65-74, March 1997</name><name>Min Fang , Narayanan Shivakumar , Hector Garcia-Molina , Rajeev Motwani , Jeffrey D. Ullman, Computing Iceberg Queries Efficiently, Proceedings of the 24rd International Conference on Very Large Data Bases, p.299-310, August 24-27, 1998</name><name>G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00.</name><name>Jim Gray , Surajit Chaudhuri , Adam Bosworth , Andrew Layman , Don Reichart , Murali Venkatrao , Frank Pellow , Hamid Pirahesh, Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, Data Mining and Knowledge Discovery, v.1 n.1, p.29-53, 1997</name><name>Jiawei Han , Jian Pei , Yiwen Yin, Mining frequent patterns without candidate generation, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.1-12, May 15-18, 2000, Dallas, Texas, United States</name><name>Venky Harinarayan , Anand Rajaraman , Jeffrey D. Ullman, Implementing data cubes efficiently, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.205-216, June 04-06, 1996, Montreal, Quebec, Canada</name><name>Laks V. S. Lakshmanan , Raymond Ng , Jiawei Han , Alex Pang, Optimization of constrained frequent set queries with 2-variable constraints, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.157-168, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States</name><name>Raymond T. Ng , Laks V. S. 
Lakshmanan , Jiawei Han , Alex Pang, Exploratory mining and pruning optimizations of constrained associations rules, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.13-24, June 01-04, 1998, Seattle, Washington, United States</name><name>Jian Pei , Jiawei Han, Can we push more constraints into frequent pattern mining?, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.350-354, August 20-23, 2000, Boston, Massachusetts, United States</name><name>Kenneth A. Ross , Divesh Srivastava, Fast Computation of Sparse Datacubes, Proceedings of the 23rd International Conference on Very Large Data Bases, p.116-125, August 25-29, 1997</name><name>R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97.</name><name>Yihong Zhao , Prasad M. Deshpande , Jeffrey F. Naughton, An array-based algorithm for simultaneous multidimensional aggregates, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.159-170, May 11-15, 1997, Tucson, Arizona, United States</name></citation><abstract>It is often too expensive to compute and materialize a complete high-dimensional data cube. 
Computing an iceberg cube, which contains only aggregates above certain thresholds, is an effective way to derive nontrivial multi-dimensional aggregations for OLAP and data mining.</abstract></paper><paper><title>On computing correlated aggregates over continual data streams</title><author><AuthorName>Johannes Gehrke</AuthorName><institute><InstituteName>Cornell University</InstituteName><country></country></institute></author><author><AuthorName>Flip Korn</AuthorName><institute><InstituteName>AT&amp;T Labs-Research</InstituteName><country></country></institute></author><author><AuthorName>Divesh Srivastava</AuthorName><institute><InstituteName>AT&amp;T Labs-Research</InstituteName><country></country></institute></author><year>2001</year><conference>International Conference on Management of Data</conference><citation><name>Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Journal of Computer and System Sciences, v.58 n.1, p.137-147, Feb. 1999</name><name>Khaled Alsabti , Sanjay Ranka , Vineet Singh, A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data, Proceedings of the 23rd International Conference on Very Large Data Bases, p.346-355, August 25-29, 1997</name><name>Ron Avnur , Joseph M. Hellerstein, Eddies: continuously adaptive query processing, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.261-272, May 15-18, 2000, Dallas, Texas, United States</name><name>D. Chatziantoniou. Ad hoc OLAP: Expression and evaluation. In Proceedings of the IEEE International Conference on Data Engineering, 1999.</name><name>Damianos Chatziantoniou , Michael O. Akinde , Theodore Johnson , Samuel Kim, The MD-join: An Operator for Complex OLAP, Proceedings of the 17th International Conference on Data Engineering, p.524-533, April 02-06, 2001</name><name>Damianos Chatziantoniou , Kenneth A. 
Ross, Querying Multiple Features of Groups in Relational Databases, Proceedings of the 22th International Conference on Very Large Data Bases, p.295-306, September 03-06, 1996</name><name>A. Delis, C. Faloutsos, and S. Ghandeharizadeh, editors. SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania, USA. ACM Press, 1999.</name><name>Pedro Domingos , Geoff Hulten, Mining high-speed data streams, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.71-80, August 20-23, 2000, Boston, Massachusetts, United States</name><name>J. Feigenbaum , S. Kannan , M. Strauss , M. Viswanathan, An Approximate L1-Difference Algorithm for Massive Data Streams, Proceedings of the 40th Annual Symposium on Foundations of Computer Science, p.501, October 17-18, 1999</name><name>J. Feigenbaum , S. Kannan , M. Strauss , M. Viswanathan, Testing and spot-checking of data streams (extended abstract), Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, p.165-174, January 09-11, 2000, San Francisco, California, United States</name><name>A. Feldmann , A. C. Gilbert , W. Willinger, Data networks as cascades: investigating the multifractal nature of Internet WAN traffic, Proceedings of the ACM SIGCOMM '98 conference on Applications, technologies, architectures, and protocols for computer communication, p.42-55, August 31-September 04, 1998, Vancouver, British Columbia, Canada</name><name>Jessica H. 
Fong , Martin Strauss, An Approximate Lp-Difference Algorithm for Massive Data Streams, Proceedings of the 17th Annual Symposium on Theoretical Aspects of Computer Science, p.193-204, February 01, 2000</name><name>Johannes Gehrke , Venkatesh Ganti , Raghu Ramakrishnan , Wei-Yin Loh, BOAT&amp;mdash;optimistic decision tree construction, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.169-180, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States</name><name>Phillip B. Gibbons , Yossi Matias , Viswanath Poosala, Fast Incremental Maintenance of Approximate Histograms, Proceedings of the 23rd International Conference on Very Large Data Bases, p.466-475, August 25-29, 1997</name><name>Phillip B. Gibbons , Yossi Matias, Synopsis data structures for massive data sets, Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms, p.909-910, January 17-19, 1999, Baltimore, Maryland, United States</name><name>S. Guha , N. Mishra , R. Motwani , L. O'Callaghan, Clustering data streams, Proceedings of the 41st Annual Symposium on Foundations of Computer Science, p.359, November 12-14, 2000</name><name>Peter J. Haas , Joseph M. Hellerstein, Ripple joins for online aggregation, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.287-298, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States</name><name>Joseph M. Hellerstein , Peter J. Haas , Helen J. Wang, Online aggregation, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.171-182, May 11-15, 1997, Tucson, Arizona, United States</name><name>M.R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report 1998-011, Digital Equipment Corporation, Systems Research Center, May, 1998.</name><name>Gurmeet Singh Manku , Sridhar Rajagopalan , Bruce G. 
Lindsay, Approximate medians and other quantiles in one pass and with limited memory, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.426-435, June 01-04, 1998, Seattle, Washington, United States</name><name>Gurmeet Singh Manku , Sridhar Rajagopalan , Bruce G. Lindsay, Random sampling techniques for space efficient online computation of order statistics of large datasets, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.251-262, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States</name><name>Chris Olston , Jennifer Widom, Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data, Proceedings of the 26th International Conference on Very Large Data Bases, p.144-155, September 10-14, 2000</name><name>Vijayshankar Raman , Bhaskaran Raman , Joseph M. Hellerstein, Online Dynamic Reordering for Interactive Data Processing, Proceedings of the 25th International Conference on Very Large Data Bases, p.709-720, September 07-10, 1999</name></citation><abstract>In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such applications maintain basic aggregates such as running extrema values (MIN, MAX), averages, standard deviations, etc., that can be computed over data streams with limited space in a straightforward way. However, many applications require knowledge of more complex aggregates relating different attributes, so-called correlated aggregates. As an example, one might be interested in computing the percentage of international phone calls that are longer than the average duration of a domestic phone call. 
Exact computation of this aggregate requires multiple passes over the data stream, which is infeasible.</abstract></paper><paper><title>Iceberg-cube computation with PC clusters</title><author><AuthorName>Raymond T. Ng</AuthorName><institute><InstituteName>Univ British Columbia, 2366 Main Mall, UBC, Vancouver, BC</InstituteName><country></country></institute></author><author><AuthorName>Alan Wagner</AuthorName><institute><InstituteName>Univ British Columbia, 2366 Main Mall, UBC, Vancouver, BC</InstituteName><country></country></institute></author><author><AuthorName>Yu Yin</AuthorName><institute><InstituteName>Univ British Columbia, 2366 Main Mall, UBC, Vancouver, BC</InstituteName><country></country></institute></author><year>2001</year><conference>International Conference on Management of Data</conference><citation><name>Sameet Agarwal , Rakesh Agrawal , Prasad Deshpande , Ashish Gupta , Jeffrey F. Naughton , Raghu Ramakrishnan , Sunita Sarawagi, On the Computation of Multidimensional Aggregates, Proceedings of the 22th International Conference on Very Large Data Bases, p.506-521, September 03-06, 1996</name><name>Elena Baralis , Stefano Paraboschi , Ernest Teniente, Materialized Views Selection in a Multidimensional Database, Proceedings of the 23rd International Conference on Very Large Data Bases, p.156-165, August 25-29, 1997</name><name>Kevin Beyer , Raghu Ramakrishnan, Bottom-up computation of sparse and Iceberg CUBE, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.359-370, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States</name><name>Michael Eberl , Wolfgang Karl , Carsten Trinitis , Andreas Blaszczyk, Parallel Computing on PC Clusters - An Alternative to Supercomputers for Industrial Applications, Proceedings of the 6th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, p.493-498, September 26-29, 1999</name><name>Min Fang , Narayanan 
Shivakumar , Hector Garcia-Molina , Rajeev Motwani , Jeffrey D. Ullman, Computing Iceberg Queries Efficiently, Proceedings of the 24rd International Conference on Very Large Data Bases, p.299-310, August 24-27, 1998</name><name>Sanjay Goil , Alok Choudhary, High Performance OLAP and Data Mining on Parallel Computers, Data Mining and Knowledge Discovery, v.1 n.4, p.391-417, December 1997</name><name>Jim Gray , Adam Bosworth , Andrew Layman , Hamid Pirahesh, Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total, Proceedings of the Twelfth International Conference on Data Engineering, p.152-159, February 26-March 01, 1996</name><name>Himanshu Gupta , Venky Harinarayan , Anand Rajaraman , Jeffrey D. Ullman, Index Selection for OLAP, Proceedings of the Thirteenth International Conference on Data Engineering, p.208-219, April 07-11, 1997</name><name>Venky Harinarayan , Anand Rajaraman , Jeffrey D. Ullman, Implementing data cubes efficiently, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.205-216, June 04-06, 1996, Montreal, Quebec, Canada</name><name>Joseph M. Hellerstein , Peter J. Haas , Helen J. Wang, Online aggregation, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.171-182, May 11-15, 1997, Tucson, Arizona, United States</name><name>M. Kamber, J. Han and J. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. In Proc. 1997 KDD, pp. 207-210.</name><name>Raymond T. Ng , Laks V. S. Lakshmanan , Jiawei Han , Alex Pang, Exploratory mining and pruning optimizations of constrained associations rules, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.13-24, June 01-04, 1998, Seattle, Washington, United States</name><name>Kenneth A. 
Ross , Divesh Srivastava, Fast Computation of Sparse Datacubes, Proceedings of the 23rd International Conference on Very Large Data Bases, p.116-125, August 25-29, 1997</name><name>Sunita Sarawagi, Explaining Differences in Multidimensional Aggregates, Proceedings of the 25th International Conference on Very Large Data Bases, p.42-53, September 07-10, 1999</name><name>Amit Shukla , Prasad Deshpande , Jeffrey F. Naughton, Materialized View Selection for Multidimensional Datasets, Proceedings of the 24rd International Conference on Very Large Data Bases, p.488-499, August 24-27, 1998</name><name>Anurag Srivastava , Eui-Hong Han , Vipin Kumar , Vineet Singh, Parallel Formulations of Decision-Tree Classification Algorithms, Data Mining and Knowledge Discovery, v.3 n.3, p.237-261, September 1999</name><name>Masahisa Tamura , Masaru Kitsuregawa, Dynamic Load Balancing for Parallel Association Rule Mining on Heterogenous PC Cluster Systems, Proceedings of the 25th International Conference on Very Large Data Bases, p.162-173, September 07-10, 1999</name><name>Mohammed J. Zaki, Parallel and Distributed Association Mining: A Survey, IEEE Concurrency, v.7 n.4, p.14-25, October 1999</name><name>Yihong Zhao , Prasad M. Deshpande , Jeffrey F. Naughton, An array-based algorithm for simultaneous multidimensional aggregates, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.159-170, May 11-15, 1997, Tucson, Arizona, United States</name></citation><abstract>In this paper, we investigate the approach of using a low-cost PC cluster to parallelize the computation of iceberg-cube queries. We concentrate on techniques directed towards online querying of large, high-dimensional datasets where it is assumed that the total cube has not been precomputed. The algorithmic space we explore considers trade-offs between parallelism, computation and I/O. Our main contribution is the development and a comprehensive evaluation of various novel, parallel algorithms. 
Specifically: (1) Algorithm RP is a straightforward parallel version of BUC [BR99]; (2) Algorithm BPP attempts to reduce I/O by outputting results in a more efficient way; (3) Algorithm ASL, which maintains cells in a cuboid in a skiplist, is designed to put the utmost priority on load balancing; and (4) alternatively, Algorithm PT load-balances by using binary partitioning to divide the cube lattice as evenly as possible.</abstract></paper><paper><title>Outlier detection for high dimensional data</title><author><AuthorName>Charu C. Aggarwal</AuthorName><institute><InstituteName>IBM T. J. Watson Research Center, Yorktown Heights, NY</InstituteName><country></country></institute></author><author><AuthorName>Philip S. Yu</AuthorName><institute><InstituteName>IBM T. J. Watson Research Center, Yorktown Heights, NY</InstituteName><country></country></institute></author><year>2001</year><conference>International Conference on Management of Data</conference><citation><name>Charu C. Aggarwal, Re-designing distance functions and distance-based applications for high dimensional data, ACM SIGMOD Record, v.30 n.1, p.13-18, March 2001</name><name>Charu C. Aggarwal , Joel L. Wolf , Philip S. Yu , Cecilia Procopiuc , Jong Soo Park, Fast algorithms for projected clustering, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.61-72, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States</name><name>Charu C. Aggarwal , Philip S. Yu, Finding generalized projected clusters in high dimensional spaces, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.70-81, May 15-18, 2000, Dallas, Texas, United States</name><name>C. C. Aggarwal, J. B. Orlin, R. P. Tai. Optimized Crossover for the Independent Set Problem. 
Operations Research 45(2), March 1997.</name><name>Rakesh Agrawal , Johannes Gehrke , Dimitrios Gunopulos , Prabhakar Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.94-105, June 01-04, 1998, Seattle, Washington, United States</name><name>Rakesh Agrawal , Tomasz Imieli&amp;#324;ski , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States</name><name>A. Arning, R. Agrawal, P. Raghavan. A Linear Method for Deviation Detection in Large Databases. KDD Conference Proceedings, 1995.</name><name>V. Barnett, T. Lewis. Outliers in Statistical Data. John Wiley and Sons, NY 1994.</name><name>Kevin S. Beyer , Jonathan Goldstein , Raghu Ramakrishnan , Uri Shaft, When Is ''Nearest Neighbor'' Meaningful?, Proceeding of the 7th International Conference on Database Theory, p.217-235, January 10-12, 1999</name><name>Markus M. Breunig , Hans-Peter Kriegel , Raymond T. Ng , J&amp;#246;rg Sander, LOF: identifying density-based local outliers, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.93-104, May 15-18, 2000, Dallas, Texas, United States</name><name>Kaushik Chakrabarti , Sharad Mehrotra, Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces, Proceedings of the 26th International Conference on Very Large Data Bases, p.89-100, September 10-14, 2000</name><name>C. Darwin. The Origin of the Species by Natural Selection. Published, 1859.</name><name>D. Hawkins. Identification of Outliers, Chapman and Hall, London, 1980.</name><name>Kenneth Alan De Jong, An analysis of the behavior of a class of genetic adaptive systems., 1975</name><name>M. Ester, H.-P. Kriegel, J. Sander, X. Xu. 
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD Conference Proceedings, 1996.</name><name>J. J. Grefenstette. Genesis Software Version 5.0. Available at http://www.santafe.edu.</name><name>David E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1989</name><name>Sudipto Guha , Rajeev Rastogi , Kyuseok Shim, CURE: an efficient clustering algorithm for large databases, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.73-84, June 01-04, 1998, Seattle, Washington, United States</name><name>Alexander Hinneburg , Charu C. Aggarwal , Daniel A. Keim, What Is the Nearest Neighbor in High Dimensional Spaces?, Proceedings of the 26th International Conference on Very Large Data Bases, p.506-515, September 10-14, 2000</name><name>John H. Holland, Adaptation in natural and artificial systems, MIT Press, Cambridge, MA, 1992</name><name>S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi. Optimization by Simulated Annealing. Science (220) (4589): pages 671-680, 1983.</name><name>Edwin M. Knorr , Raymond T. Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets, Proceedings of the 24rd International Conference on Very Large Data Bases, p.392-403, August 24-27, 1998</name><name>Edwin M. Knorr , Raymond T. Ng, Finding Intensional Knowledge of Distance-Based Outliers, Proceedings of the 25th International Conference on Very Large Data Bases, p.211-222, September 07-10, 1999</name><name>Raymond T. 
Ng , Jiawei Han, Efficient and Effective Clustering Methods for Spatial Data Mining, Proceedings of the 20th International Conference on Very Large Data Bases, p.144-155, September 12-15, 1994</name><name>Sridhar Ramaswamy , Rajeev Rastogi , Kyuseok Shim, Efficient algorithms for mining outliers from large data sets, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.427-438, May 15-18, 2000, Dallas, Texas, United States</name><name>Sunita Sarawagi , Rakesh Agrawal , Nimrod Megiddo, Discovery-Driven Exploration of OLAP Data Cubes, Proceedings of the 6th International Conference on Extending Database Technology: Advances in Database Technology, p.168-182, March 23-27, 1998</name><name>Tian Zhang , Raghu Ramakrishnan , Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada</name></citation><abstract>The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are high dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms use concepts of proximity in order to find outliers based on their relationship to the rest of the data. However, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions. Consequently, for high dimensional data, the notion of finding meaningful outliers becomes substantially more complex and non-obvious. 
In this paper, we discuss new techniques for outlier detection which find the outliers by studying the behavior of projections from the data set.</abstract></paper><paper><title>Bit-sliced index arithmetic</title><author><AuthorName>Denis Rinfret</AuthorName><institute><InstituteName>UMass/Boston, Dept. of CS, UMass/Boston, Boston, MA</InstituteName><country></country></institute></author><author><AuthorName>Patrick O'Neil</AuthorName><institute><InstituteName>UMass/Boston &amp; Microsoft Research, Dept. of CS, UMass/Boston, Boston, MA</InstituteName><country></country></institute></author><author><AuthorName>Elizabeth O'Neil</AuthorName><institute><InstituteName></InstituteName><country></country></institute></author><year>2001</year><conference>International Conference on Management of Data</conference><citation><name>Hal Berenson , Phil Bernstein , Jim Gray , Jim Melton , Elizabeth O'Neil , Patrick O'Neil, A critique of ANSI SQL isolation levels, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.1-10, May 22-25, 1995, San Jose, California, United States</name><name>Timothy C. Bell , Alistair Moffat , Ian H. Witten , Justin Zobel, The MG retrieval system: compressing for space and speed, Communications of the ACM, v.38 n.4, p.41-42, April 1995</name><name>Byte Magazine, Ann O'Leary. Managing Mission- Critical Text [with ORACLE ConText Cartridge]. http://www.byte.com/art/9709/sec4/art1.htm</name><name>Chee-Yong Chan , Yannis E. Ioannidis, Bitmap index design and evaluation, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.355-366, June 01-04, 1998, Seattle, Washington, United States</name><name>Chee-Yong Chan , Yannis E. 
Ioannidis, An efficient bitmap encoding scheme for selection queries, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.215-226, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States</name><name>Full Select syntax in DB2 SQL Reference Text, http://www.csa.ru/dblab/DB2/db2s0/fullslt.htm</name><name>D. K. Harmon, Ed. Proceedings of TREC for Text Retrieval Conference (Washington D.C., Nov. 1992) National Institute of Standards Special Publication 500-207. NIST, Washington, D.C.</name><name>K. Jacobs, with contributors: R. Bamford, G. Doherty, K. Haas, M. Holt, F. Putzolu, B. Quigley. Concurrency Control: Transaction Isolation and Serializability in SQL92 and Oracle7. Oracle White Paper, Part No. A33745, July, 1995.</name><name>Marcin Kaszkiel , Justin Zobel , Ron Sacks-Davis, Efficient passage ranking for document databases, ACM Transactions on Information Systems (TOIS), v.17 n.4, p.406-439, Oct. 1999</name><name>M. Morris Mano, Digital design (2nd ed.), Prentice-Hall, Inc., Upper Saddle River, NJ, 1991</name><name>Fionn Murtagh. A Very Fast, Exact Nearest Neighbour Algorithm for use in Information Retrieval. Information Technology: Research and Development 1982, Vol. 1, Pages 275-283.</name><name>Fionn Murtagh, Clustering in massive data sets, Handbook of massive data sets, Kluwer Academic Publishers, Norwell, MA, 2002</name><name>Alistair Moffat , Justin Zobel, Self-indexing inverted files for fast text retrieval, ACM Transactions on Information Systems (TOIS), v.14 n.4, p.349-379, Oct. 1996</name><name>Patrick E. 
O'Neil, Model 204 Architecture and Performance, Proceedings of the 2nd International Workshop on High Performance Transaction Systems, p.40-59, September 28-30, 1987</name><name>Patrick O'Neil , Dallan Quass, Improved query performance with variant indexes, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.38-49, May 11-15, 1997, Tucson, Arizona, United States</name><name>Patrick O'Neil , Elizabeth O'Neil, Database (2nd ed.): principles, programming, and performance, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2000</name><name>PCWEEK ONLINE, Timothy Dyck. ConText Gets faster and friendlier. [Also discusses DB2 Text Extender.] http://www8.zdnet.com/eweek/reviews/0414/14cont.html</name><name>Shirley A. Perry and Peter Willett. A Review of the use of Inverted Files for Best Match Searching in Information Retrieval Systems. J. of Information Science 6 (1983) 59-66.</name><name>Gerard Salton, Automatic text processing: the transformation, analysis, and retrieval of information by computer, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1989</name><name>E. Voorhees and D. Harmon. Overview of the Fifth Text REtrieval Conference (TREC-5). Proceedings of the 5th Text Retrieval Conference, Nov. 1996, NIST, http://trec.nist.gov/pubs.html</name><name>Ian H. Witten , Alistair Moffat , Timothy C. Bell, Managing gigabytes (2nd ed.): compressing and indexing documents and images, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1999</name><name>Ming-Chuan Wu , Alejandro P. 
Buchmann, Encoded Bitmap Indexing for Data Warehouses, Proceedings of the Fourteenth International Conference on Data Engineering, p.220-230, February 23-27, 1998</name><name>Ming-Chuan Wu, Query optimization for selections using bitmaps, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.227-238, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States</name><name>Justin Zobel , Alistair Moffat, Exploring the similarity space, ACM SIGIR Forum, v.32 n.1, p.18-34, Spring 1998</name></citation><abstract>The bit-sliced index (BSI) was originally defined in [ONQ97]. The current paper introduces the concept of BSI arithmetic. For any two BSI's X and Y on a table T, we show how to efficiently generate new BSI's Z, V, and W, such that Z = X + Y, V = X - Y, and W = MIN(X, Y); this means that if a row r in T has a value x represented in BSI X and a value y in BSI Y, the value for r in BSI Z will be x + y, the value in V will be x - y and the value in W will be MIN(x, y). Since a bitmap representing a set of rows is the simplest bit-sliced index, BSI arithmetic is the most straightforward way to determine multisets of rows (with duplicates) resulting from the SQL clauses UNION ALL (addition), EXCEPT ALL (subtraction), and INTERSECT ALL (min) (see [OO00, DB2SQL] for definitions of these clauses). Another contribution of the current paper is to generalize BSI range restrictions from [ONQ97] to a new non-Boolean form: to determine the top k BSI-valued rows, for any meaningful value k between one and the total number of rows in T. Together with bit-sliced addition, this permits us to solve a common basic problem of text retrieval: given an object-relational table T of rows representing documents, with a collection type column K representing keyword terms, we demonstrate an efficient algorithm to find k documents that share the largest number of terms with some query list Q of terms. 
A great deal of published work on such problems exists in the Information Retrieval (IR) field. The algorithm we introduce, which we call Bit-Sliced Term-Matching, or BSTM, uses an approach comparable in performance to the most efficient known IR algorithm, a major improvement on current DBMS text searching algorithms, with the advantage that it uses only indexing we propose for native database operations.</abstract></paper><paper><title>Space-efficient online computation of quantile summaries</title><author><AuthorName>Michael Greenwald</AuthorName><institute><InstituteName>Computer &amp; Information Science Department, University of Pennsylvania, 200 South 33rd Street, Philadelphia, PA</InstituteName><country></country></institute></author><author><AuthorName>Sanjeev Khanna</AuthorName><institute><InstituteName>Computer &amp; Information Science Department, University of Pennsylvania, 200 South 33rd Street, Philadelphia, PA</InstituteName><country></country></institute></author><year>2001</year><conference>International Conference on Management of Data</conference><citation><name>Rakesh Agrawal and Arun Swami. A one-pass space-efficient algorithm for finding quantiles. Proc. 7th Int. Conf. Management of Data, COMAD, 28-30 December 1995.</name><name>Khaled Alsabti , Sanjay Ranka , Vineet Singh, A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data, Proceedings of the 23rd International Conference on Very Large Data Bases, p.346-355, August 25-29, 1997</name><name>Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, Random sampling for histogram construction: how much is enough?, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.436-447, June 01-04, 1998, Seattle, Washington, United States</name><name>Phillip B. 
Gibbons , Yossi Matias , Viswanath Poosala, Fast Incremental Maintenance of Approximate Histograms, Proceedings of the 23rd International Conference on Very Large Data Bases, p.466-475, August 25-29, 1997</name><name>Michael Greenwald, Practical algorithms for self scaling histograms or better than average data collection, Performance Evaluation, 27-28, p.19-40, Oct. 1996</name><name>Raj Jain , Imrich Chlamtac, The P2 algorithm for dynamic calculation of quantiles and histograms without storing observations, Communications of the ACM, v.28 n.10, p.1076-1085, Oct. 1985</name><name>I. Pohl. A minimum storage algorithm for computing the median. IBM Research Report RC 2701, November 1969.</name><name>Gurmeet Singh Manku , Sridhar Rajagopalan , Bruce G. Lindsay, Approximate medians and other quantiles in one pass and with limited memory, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.426-435, June 01-04, 1998, Seattle, Washington, United States</name><name>Gurmeet Singh Manku , Sridhar Rajagopalan , Bruce G. Lindsay, Random sampling techniques for space efficient online computation of order statistics of large datasets, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.251-262, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States</name><name>J. I. Munro and M.S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, vol. 12: 315-323; 1980.</name><name>M.S. Paterson. Progress in selection. Technical Report, University of Warwick, Coventry, UK, 1997.</name><name>Viswanath Poosala, Venkatesh Ganti, and Yannis E. Ioannidis. Approximate query answering using histograms. Bulletin of the IEEE Technical Committee on Data Engineering, 22(4):6-15, December 1999.</name><name>Viswanath Poosala , Peter J. Haas , Yannis E. Ioannidis , Eugene J. 
Shekita, Improved histograms for selectivity estimation of range predicates, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.294-305, June 04-06, 1996, Montreal, Quebec, Canada</name></citation><abstract>An ε-approximate quantile summary of a sequence of N elements is a data structure that can answer quantile queries about the sequence to within a precision of εN.</abstract></paper><paper><title>Probe, count, and classify: categorizing hidden web databases</title><author><AuthorName>Panagiotis G. Ipeirotis</AuthorName><institute><InstituteName>Computer Science Dept., Columbia University</InstituteName><country></country></institute></author><author><AuthorName>Luis Gravano</AuthorName><institute><InstituteName>Computer Science Dept., Columbia University</InstituteName><country></country></institute></author><author><AuthorName>Mehran Sahami</AuthorName><institute><InstituteName>E.piphany, Inc.</InstituteName><country></country></institute></author><year>2001</year><conference>International Conference on Management of Data</conference><citation><name>Chidanand Apté , Fred Damerau , Sholom M. Weiss, Automated learning of decision rules for text categorization, ACM Transactions on Information Systems (TOIS), v.12 n.3, p.233-251, July 1994</name><name>Jamie Callan , Margaret Connell , Aiqun Du, Automatic discovery of language models for text databases, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.479-490, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States</name><name>C. W. Cleverdon and J. Mills. The testing of index language devices. Aslib Proceedings, 15(4):106-130, 1963.</name><name>W. W. Cohen. Learning trees and rules with set-valued features. In Proceedings of AAAI'96, IAAI'96, volume 1, pages 709-716. 
AAAI, 1996.</name><name>Nick Craswell , Peter Bailey , David Hawking, Server selection on the World Wide Web, Proceedings of the fifth ACM conference on Digital libraries, p.37-46, June 02-07, 2000, San Antonio, Texas, United States</name><name>The Deep Web: Surfacing Hidden Value. Accessible at http://www.completeplanet.com/Tutorials/DeepWeb/index.asp.</name><name>Susan Dumais , John Platt , David Heckerman , Mehran Sahami, Inductive learning algorithms and representations for text categorization, Proceedings of the seventh international conference on Information and knowledge management, p.148-155, November 02-07, 1998, Bethesda, Maryland, United States</name><name>S. Gauch, G. Wang, and M. Gomez. Profusion*: Intelligent fusion from multiple, distributed search engines. The Journal of Universal Computer Science, 2(9):637-649, Sept. 1996.</name><name>Luis Gravano , Héctor García-Molina , Anthony Tomasic, GlOSS: text-source discovery over the Internet, ACM Transactions on Database Systems (TODS), v.24 n.2, p.229-264, June 1999</name><name>G. Grefenstette and J. Nioche. Estimation of English and non-English language use on the WWW. In RIAO 2000, 2000.</name><name>David Hawking , Paul Thistlewaite, Methods for information server selection, ACM Transactions on Information Systems (TOIS), v.17 n.1, p.40-76, Jan. 1999</name><name>Panagiotis G. Ipeirotis , Luis Gravano , Mehran Sahami, Automatic Classification of Text Databases Through Query Probing, Selected papers from the Third International Workshop WebDB 2000 on The World Wide Web and Databases, p.245-255, May 18-19, 2000</name><name>Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proceedings of the 10th European Conference on Machine Learning, p.137-142, April 21-23, 1998</name><name>R. L. Johnston. Gershgorin theorems for partitioned matrices. Linear Algebra and its Applications, 4:205-220, 1971.</name><name>D. Koller and M. Sahami. 
Toward optimal feature selection. In Machine Learning, Proceedings of the Thirteenth International Conference (ICML '96), pages 284-292, 1996.</name><name>Daphne Koller , Mehran Sahami, Hierarchically Classifying Documents Using Very Few Words, Proceedings of the Fourteenth International Conference on Machine Learning, p.170-178, July 08-12, 1997</name><name>David D. Lewis , Robert E. Schapire , James P. Callan , Ron Papka, Training algorithms for linear text classifiers, Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, p.298-306, August 18-22, 1996, Zurich, Switzerland</name><name>A. McCallum and K. Nigam. A comparison of event models for naive bayes text classification. In Learning for Text Categorization: Papers from the 1998 AAAI Workshop, 1998.</name><name>Weiyi Meng , King-Lup Liu , Clement T. Yu , Xiaodong Wang , Yuhsi Chang , Naphtali Rishe, Determining Text Databases to Search in the Internet, Proceedings of the 24th International Conference on Very Large Data Bases, p.14-25, August 24-27, 1998</name><name>Weiyi Meng , Clement Yu , King-Lup Liu, Detection of Heterogeneities in a Multiple Text Database Environment, Proceedings of the Fourth IFCIS International Conference on Cooperative Information Systems, p.22, September 02-04, 1999</name><name>Thomas M. Mitchell, Machine Learning, McGraw-Hill Higher Education, 1997</name><name>Mike Perkowitz , Robert B. Doorenbos , Oren Etzioni , Daniel S. Weld, Learning to Understand Information on the Internet: An
