<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0057)http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm -->
<HTML><HEAD><TITLE>Analysis of Data Mining Algorithms</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2800.1106" name=GENERATOR></HEAD>
<BODY>
<CENTER><FONT size=+2>Analysis of Data Mining Algorithms</FONT></CENTER>
<CENTER>by</CENTER>
<CENTER>Karuna Pande Joshi</CENTER><PRE> Copyright Karuna Pande Joshi 1997.</PRE>
<HR>
<P><B><FONT size=+1>Table of Contents</FONT></B>
<OL>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#Introduction">Introduction</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#class_algo">Classification
Algorithms</A>
<OL>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#data_class_mtd">Data
Classification Methods</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#data_abst">Data
Abstraction</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#class_rule_learn">Classification-rule
learning.</A>
<OL>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#ID3 algorithm">ID3
algorithm</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#c4.5_algo">C4.5
algorithm</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#sliq_algo">SLIQ
algorithm</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#other_algos">Other
Algorithms</A> </LI></OL>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#parallel_algos">Parallel
Algorithms</A>
<OL>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#basic_idea">Basic
Idea</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#synch_tree">Synchronous
Tree Construction Approach</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#partition_tree">Partitioned
Tree Construction Approach</A> </LI></OL></LI></OL>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#assoc_rules">Association
Rule Algorithms</A>
<OL>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#apriori">Apriori
Algorithm</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#dist_parallel">Distributed/Parallel
Algorithms</A> </LI></OL>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#seq_analysis">Sequential
Analysis</A>
<OL>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#seq_patterns">Sequential
Patterns</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#algo_seq_pattern">Algorithms
for Finding Sequential Patterns</A>
<OL>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#algorithm">Algorithm</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#aprioriall">Algorithm
AprioriAll</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#apriorisome">Algorithm
AprioriSome</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#relative_perform">Relative
Performance of the two Algorithms</A> </LI></OL></LI></OL>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#Conclusion">Conclusion</A>
<OL>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#compare_algos">Comparing
Algorithms</A>
<LI><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#Drawbacks">Drawbacks
of Existing Algorithms</A> </LI></OL></LI></OL><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#References">REFERENCES</A>
<BR><A
href="http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm#Appendix">APPENDIX
A : URL Listing</A>
<P>
<HR width="100%">
<CENTER><A name=Introduction></A><B><U><FONT color=#000000><FONT
size=+2>Introduction</FONT></FONT></U></B></CENTER>
<P>With an enormous amount of data stored in databases and data warehouses, it
is increasingly important to develop powerful tools for analysis of such data
and mining interesting knowledge from it. Data mining is a process of inferring
knowledge from such huge data. Data Mining has three major components Clustering
or <I>Classification</I>, <I>Association Rules</I> and <I>Sequence Analysis</I>.
<P>By simple definition, in classification/clustering we analyze a set of data
and generate a set of grouping rules which can be used to classify future data.
For example, one may classify diseases and provide the symptoms which describe
each class or subclass. This has much in common with traditional work in
statistics and machine learning. However, there are important new issues which
arise because of the sheer size of the data. One of the important problems in
data mining is classification-rule learning, which involves finding rules
that partition given data into predefined classes. In the data mining domain
where millions of records and a large number of attributes are involved, the
execution time of existing algorithms can become prohibitive, particularly in
interactive applications. This is discussed in detail in <I>Chapter 2</I>.
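<P>The idea of learning a grouping rule from labeled data and applying it to
future records can be sketched as follows. This is only an illustration: the
disease attribute, threshold heuristic, and data are hypothetical, not taken
from the report.

```python
# Minimal sketch of classification: learn a grouping rule from labeled
# training data, then apply it to classify a future record. The "fever"
# attribute and the midpoint-threshold heuristic are hypothetical.

def learn_threshold_rule(records, attribute, label):
    """Split on the midpoint between the two class means of one attribute."""
    pos = [r[attribute] for r in records if r["class"] == label]
    neg = [r[attribute] for r in records if r["class"] != label]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda r: label if r[attribute] >= threshold else "other"

training = [
    {"fever": 39.5, "class": "flu"},
    {"fever": 39.0, "class": "flu"},
    {"fever": 36.8, "class": "healthy"},
    {"fever": 37.0, "class": "healthy"},
]
rule = learn_threshold_rule(training, "fever", "flu")
print(rule({"fever": 38.9}))  # classifies a future, unlabeled record
```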
<P>An association rule is a rule which implies certain association relationships
among a set of objects in a database. In this<I> </I>process we discover a set
of association rules at multiple levels of abstraction from the relevant set(s)
of data in a database. For example, one may discover a set of symptoms often
occurring together with certain kinds of diseases and further study the reasons
behind them. Since finding interesting association rules in databases may
disclose some useful patterns for decision support, selective marketing,
financial forecast, medical diagnosis, and many other applications, it has
attracted a lot of attention in recent data mining research. Mining association
rules may require iterative scanning of large transaction or relational
databases which is quite costly in processing. Therefore, efficient mining of
association rules in transaction and/or relational databases has been studied
substantially. This is discussed in detail in <I>Chapter 3</I>.
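<P>The core of association-rule mining is finding itemsets that occur together
often enough, which is what the Apriori family of algorithms (Chapter 3) does.
A minimal level-wise sketch, with hypothetical transactions, is:

```python
# Hedged sketch of level-wise frequent-itemset discovery: candidate
# k-itemsets are grown from frequent (k-1)-itemsets and pruned by a
# minimum-support count, so the large database is scanned once per level.

def frequent_itemsets(transactions, min_support):
    items = {i for t in transactions for i in t}
    frequent, k = {}, 1
    current = [frozenset([i]) for i in sorted(items)]
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        k += 1
        # join step: union pairs of frequent sets to form k-item candidates
        current = list({a | b for a in survivors for b in survivors
                        if len(a | b) == k})
    return frequent

txns = [frozenset(t) for t in (["A", "B"], ["A", "B", "C"],
                               ["A", "C"], ["B", "C"])]
print(frequent_itemsets(txns, 2))  # each pair is frequent; {A,B,C} is not
```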
<P>In <I>sequential Analysis</I>, we seek to discover patterns that occur in
sequence. This deals with data that appear in separate transactions (as opposed
to data that appear in the same transaction in the case of association).For e.g.
: If a shopper buys item A in the first week of the month, then s/he buys item B
in the second week etc. This is discussed in detail in <I>Chapter 4</I>.
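<P>The ordered-occurrence check behind sequential patterns can be sketched as
follows: unlike an association rule, the items may come from different
transactions, as long as they occur in the stated order. The week-by-week
purchase data are hypothetical.

```python
# Sketch: does one customer's ordered transaction sequence support a
# sequential pattern (a list of itemsets that must occur in order)?

def supports(sequence, pattern):
    """True if pattern's itemsets occur, in order, across the transactions."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # later pattern elements must come from later transactions
    return True

# week-by-week purchases of one shopper (hypothetical data)
weeks = [frozenset(["A"]), frozenset(["D"]), frozenset(["B", "E"])]
print(supports(weeks, [frozenset(["A"]), frozenset(["B"])]))  # True: A then B
```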
<P>There are many algorithms proposed that try to address the above aspects of
data mining. Compiling a list of all algorithms suggested/used for these
problems is an arduous task. I have thus limited the focus of this report to
list only some of the algorithms that have had better success than the others. I
have included a list of URLs in Appendix A which can be referred to for more
information on data mining algorithms.
<P>
<HR>
<HR>
<CENTER><A name=class_algo></A><B><U><FONT size=+2>Classification
Algorithms</FONT></U></B></CENTER>
<P>In Data classification one develops a description or model for each class in
a database, based on the features present in a set of class-labeled training
data. There have been many data classification methods studied, including
decision-tree methods such as C4.5, statistical methods, neural networks, rough
sets, and database-oriented methods.
<P><A name=data_class_mtd></A><FONT size=+1>Data Classification Methods</FONT>
<P>In this paper, I have discussed in detail some of the <I>machine-learning</I>
algorithms that have been successfully applied in the initial stages of this
field. The other methods, listed below, are only beginning to be applied to
data mining and have not yet been as successful. This section briefly
describes these other
methods. Appendix A lists the URLs which can be referred to for more information
on these various methods.
<UL>
<LI><B>Statistical Algorithms</B> Statistical analysis systems such as SAS and
SPSS have been used by analysts to detect unusual patterns and explain
patterns using statistical models such as linear models. Such systems have
their place and will continue to be used.
<LI><B>Neural Networks </B>Artificial neural networks mimic the
pattern-finding capacity of the human brain and hence some researchers have
suggested applying Neural Network algorithms to pattern-mapping. Neural
networks have been applied successfully in a few applications that involve
classification.
<LI><B>Genetic algorithms</B> Optimization techniques that use processes such
as genetic combination, mutation, and natural selection in a design based on
the concepts of natural evolution.
<LI><B>Nearest neighbor method </B>A technique that classifies each record in
a dataset based on a combination of the classes of the k record(s) most
similar to it in a historical dataset. Sometimes called the k-nearest neighbor
technique.
<LI><B>Rule induction</B> The extraction of useful <I>if-then</I> rules from
data based on statistical significance.
<LI><B>Data visualization</B> The visual interpretation of complex
relationships in multidimensional data. </LI></UL><A name=data_abst></A><FONT
size=+1>Data Abstraction</FONT>
<P>Many existing algorithms suggest abstracting the test data before classifying
it into various classes. There are several alternatives for doing abstraction
before classification: A data set can be generalized to either a minimally
generalized abstraction level, an intermediate abstraction level, or a rather
high abstraction level. Too low an abstraction level may result in scattered
classes, bushy classification trees, and difficulty in concise semantic
interpretation, whereas too high a level may result in the loss of
classification accuracy. The generalization-based multi-level classification
process has been implemented in the DB-Miner system.[4] <BR>
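<P>Generalizing an attribute to a higher abstraction level before
classification can be sketched as a simple concept-hierarchy lookup. The
city-to-region hierarchy below is a hypothetical example; systems such as the
one cited above derive hierarchies from the data or a supplied taxonomy.

```python
# Sketch of data abstraction: replace low-level attribute values with a
# higher-level concept before classification. The hierarchy is hypothetical.

CITY_TO_REGION = {"Baltimore": "East Coast", "Boston": "East Coast",
                  "Seattle": "West Coast"}

def generalize(records, attribute, hierarchy):
    """Map each record's attribute value to its higher-level concept."""
    return [{**r, attribute: hierarchy.get(r[attribute], r[attribute])}
            for r in records]

data = [{"city": "Baltimore", "class": "buyer"},
        {"city": "Boston", "class": "buyer"}]
print(generalize(data, "city", CITY_TO_REGION))
# both records now fall into one coarser "East Coast" group
```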
<P><A name=class_rule_learn></A><FONT size=+1>Classification-rule
learning</FONT><FONT size=+1></FONT>
<P>Classification-rule learning involves finding rules or decision trees that
partition given data into predefined classes. For any realistic problem domain
of the classification-rule learning, the set of possible decision trees is too
large to be searched exhaustively. In fact, the computational complexity of
finding an optimal classification decision tree is NP-hard.
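<P>Because exhaustive search is infeasible, decision-tree learners such as ID3
and C4.5 (covered later in this report) grow the tree greedily, choosing at
each node the attribute whose split yields the largest information gain. A
sketch of that split criterion, on a hypothetical toy dataset, is:

```python
# Sketch of greedy split selection by information gain, the criterion ID3
# uses to choose which attribute to test at each node. Data are hypothetical.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute):
    """Entropy reduction achieved by partitioning records on attribute."""
    labels = [r["class"] for r in records]
    split = {}
    for r in records:
        split.setdefault(r[attribute], []).append(r["class"])
    remainder = sum(len(s) / len(records) * entropy(s)
                    for s in split.values())
    return entropy(labels) - remainder

toy = [{"outlook": "sunny", "class": "no"},
       {"outlook": "sunny", "class": "no"},
       {"outlook": "rain", "class": "yes"},
       {"outlook": "rain", "class": "yes"}]
print(information_gain(toy, "outlook"))  # 1.0: this split separates perfectly
```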
<P>Most of the existing induction-based algorithms use <B>Hunt</B>'s method as
the basic algorithm.[2] Here is a recursive description of Hunt's method for
constructing a decision tree from a set T of training cases with classes denoted
{C<SUB>1</SUB>, C<SUB>2</SUB>,