From: strawman (只作鸳鸯不羡仙), Board: DataMining
Title: ID3 & C4.5 Tutorial
Posted at: Nanjing University Lily BBS (Mon Aug 19 11:20:59 2002), on-site mail
Building Classification Models: ID3 and C4.5
Introduction
Basic Definitions
The ID3 Algorithm
Using Gain Ratios
C4.5 Extensions
Pruning Decision Trees and Deriving Rule Sets
Classification Models in the undergraduate AI Course
References
Introduction
ID3 and C4.5 are algorithms introduced by Quinlan for inducing
Classification Models, also called Decision Trees, from data.
We are given a set of records. Each record has the same structure,
consisting of a number of attribute/value pairs. One of these attributes
represents the category of the record. The problem is to determine a
decision tree that on the basis of answers to questions about the
non-category attributes predicts correctly the value of the category
attribute. Usually the category attribute takes only the values {true,
false}, or {success, failure}, or something equivalent. In any case, one
of its values will mean failure.
For example, we may have the results of measurements taken by experts on
some widgets. For each widget we know the value of each measurement and
the decision that was made: pass, scrap, or repair. That is, each record
has the measurements as its non-categorical attributes and the
disposition of the widget as its categorical attribute.
Here is a more detailed example. We are dealing with records reporting
on weather conditions for playing golf. The categorical attribute
specifies whether or not to Play. The non-categorical attributes are:
ATTRIBUTE   | POSSIBLE VALUES
============+=======================
outlook     | sunny, overcast, rain
------------+-----------------------
temperature | continuous
------------+-----------------------
humidity    | continuous
------------+-----------------------
windy       | true, false
============+=======================
and the training data is:
OUTLOOK  | TEMPERATURE | HUMIDITY | WINDY | PLAY
=====================================================
sunny    |     85      |    85    | false | Don't Play
sunny    |     80      |    90    | true  | Don't Play
overcast |     83      |    78    | false | Play
rain     |     70      |    96    | false | Play
rain     |     68      |    80    | false | Play
rain     |     65      |    70    | true  | Don't Play
overcast |     64      |    65    | true  | Play
sunny    |     72      |    95    | false | Don't Play
sunny    |     69      |    70    | false | Play
rain     |     75      |    80    | false | Play
sunny    |     75      |    70    | true  | Play
overcast |     72      |    90    | true  | Play
overcast |     81      |    75    | false | Play
rain     |     71      |    80    | true  | Don't Play
Notice that in this example two of the attributes, Temperature and
Humidity, have continuous ranges. ID3 does not directly deal with such
cases, though below we examine how it can be extended to do so. A
decision tree is important not because it summarizes what we know, i.e.
the training set, but because we hope it will classify new cases
correctly. Thus when building classification models one should have both
training data to build the model and test data to verify how well it
actually works.
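For use in the Python sketches later in this tutorial, the golf training
set above could be encoded as a list of records. This is only an
illustrative choice of representation; the field names (outlook,
temperature, humidity, windy, play) are not part of the original data files.

# Hypothetical encoding of the golf training set as a list of dicts.
golf_data = [
    {"outlook": "sunny",    "temperature": 85, "humidity": 85, "windy": False, "play": "Don't Play"},
    {"outlook": "sunny",    "temperature": 80, "humidity": 90, "windy": True,  "play": "Don't Play"},
    {"outlook": "overcast", "temperature": 83, "humidity": 78, "windy": False, "play": "Play"},
    {"outlook": "rain",     "temperature": 70, "humidity": 96, "windy": False, "play": "Play"},
    {"outlook": "rain",     "temperature": 68, "humidity": 80, "windy": False, "play": "Play"},
    {"outlook": "rain",     "temperature": 65, "humidity": 70, "windy": True,  "play": "Don't Play"},
    {"outlook": "overcast", "temperature": 64, "humidity": 65, "windy": True,  "play": "Play"},
    {"outlook": "sunny",    "temperature": 72, "humidity": 95, "windy": False, "play": "Don't Play"},
    {"outlook": "sunny",    "temperature": 69, "humidity": 70, "windy": False, "play": "Play"},
    {"outlook": "rain",     "temperature": 75, "humidity": 80, "windy": False, "play": "Play"},
    {"outlook": "sunny",    "temperature": 75, "humidity": 70, "windy": True,  "play": "Play"},
    {"outlook": "overcast", "temperature": 72, "humidity": 90, "windy": True,  "play": "Play"},
    {"outlook": "overcast", "temperature": 81, "humidity": 75, "windy": False, "play": "Play"},
    {"outlook": "rain",     "temperature": 71, "humidity": 80, "windy": True,  "play": "Don't Play"},
]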
A simpler example from the stock market, involving only discrete ranges,
has Profit as its categorical attribute, with values {up, down}. Its
non-categorical attributes are:
ATTRIBUTE   | POSSIBLE VALUES
============+=======================
age         | old, midlife, new
------------+-----------------------
competition | no, yes
------------+-----------------------
type        | software, hardware
============+=======================
and the training data is:
AGE | COMPETITION | TYPE | PROFIT
====+=============+======+=======
old | yes         | swr  | down
old | no          | swr  | down
old | no          | hwr  | down
mid | yes         | swr  | down
mid | yes         | hwr  | down
mid | no          | hwr  | up
mid | no          | swr  | up
new | yes         | swr  | up
new | no          | hwr  | up
new | no          | swr  | up
For a more complex example, here are files that provide records for a
series of votes in Congress. The first file describes the structure of
the records. The second file provides the Training Set, and the third
the Test Set.
The basic ideas behind ID3 are:
In the decision tree, each node corresponds to a non-categorical
attribute and each arc to a possible value of that attribute. A leaf
of the tree specifies the expected value of the categorical attribute
for the records described by the path from the root to that leaf.
[This defines what a Decision Tree is; a small sketch of one way to
represent such a tree follows these three points.]
Each node of the decision tree should be associated with the
non-categorical attribute which is most informative among the attributes
not yet considered in the path from the root. [This establishes what a
"Good" decision tree is.]
Entropy is used to measure how informative a node is. [This defines what
we mean by "Good". By the way, this notion was introduced by Claude
Shannon in Information Theory.]
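As a concrete illustration (not part of the original tutorial), here is a
minimal sketch of how such a decision tree might be represented and
traversed in Python, using the stock market example above. The nested
tuple/dict layout and the classify function are illustrative choices only;
the tree shown is one consistent with the stock training data, with age at
the root and competition tested for the midlife companies.

# One possible encoding: an internal node is an (attribute, {value: subtree})
# pair, and a leaf is just the predicted value of the categorical attribute.
stock_tree = ("age", {
    "old": "down",
    "mid": ("competition", {"yes": "down", "no": "up"}),
    "new": "up",
})

def classify(tree, record):
    """Follow the arc matching the record's attribute value until a leaf is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree

print(classify(stock_tree, {"age": "mid", "competition": "no", "type": "swr"}))  # up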
C4.5 is an extension of ID3 that accounts for unavailable values,
continuous attribute value ranges, pruning of decision trees, rule
derivation, and so on.
Basic Definitions
If there are n equally probable possible messages, then the
probability p of each is 1/n and the information conveyed by a message
is -log(p) = log(n). [In what follows all logarithms are in base 2.]
That is, if there are 16 messages, then log(16) = 4 and we need 4 bits
to identify each message.
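As a quick check of this arithmetic (a trivial sketch, not from the
original text):

import math

messages = 16
p = 1 / messages
print(-math.log2(p), math.log2(messages))  # both 4.0: four bits identify one of 16 messages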
In general, if we are given a probability distribution P = (p1, p2, ..,
pn) then the Information conveyed by this distribution, also called the
Entropy of P, is:
I(P) = -(p1*log(p1) + p2*log(p2) + .. + pn*log(pn))
For example, if P is (0.5, 0.5) then I(P) is 1; if P is (0.67, 0.33)
then I(P) is 0.92; if P is (1, 0) then I(P) is 0. [Note that the more
uniform the probability distribution, the greater its information.]
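A minimal Python sketch of this computation (the function name info is an
illustrative choice); terms with probability 0 are skipped, following the
usual convention that 0*log(0) = 0:

import math

def info(dist):
    """Entropy I(P) of a probability distribution, in bits (logarithms in base 2)."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(info([0.5, 0.5]))   # 1.0
print(info([2/3, 1/3]))   # ~0.918, the 0.92 quoted above for (0.67, 0.33)
print(info([1.0, 0.0]))   # 0.0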
If a set T of records is partitioned into disjoint exhaustive classes
C1, C2, .., Ck on the basis of the value of the categorical attribute,
then the information needed to identify the class of an element of T
is Info(T) = I(P), where P is the probability distribution of the
partition (C1, C2, .., Ck):
P = (|C1|/|T|, |C2|/|T|, ..., |Ck|/|T|)
In our golfing example, we have Info(T) = I(9/14, 5/14) = 0.94,
and in our stock market example we have Info(T) = I(5/10,5/10) = 1.0.
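Reusing the info function from the sketch above, Info(T) can be computed
directly from the class counts; the helper name info_T is again just an
illustrative choice:

def info_T(class_counts):
    """Info(T) for a partition into classes C1..Ck: entropy of (|Ci|/|T|)."""
    total = sum(class_counts)
    return info([c / total for c in class_counts])

print(round(info_T([9, 5]), 2))  # golfing example: 0.94
print(round(info_T([5, 5]), 2))  # stock market example: 1.0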
If we first partition T on the basis of the value of a non-categorical
attribute X into sets T1, T2, .., Tn then the information needed to
identify the class of an element of T becomes the weighted average of
the information needed to identify the class of an element of Ti, i.e.
the weighted average of Info(Ti):
Info(X,T) = Sum over i from 1 to n of (|Ti|/|T|) * Info(Ti)
In the case of our golfing example, for the attribute Outlook we have
Info(Outlook,T) = 5/14*I(2/5,3/5) + 4/14*I(4/4,0) + 5/14*I(3/5,2/5)
= 0.694
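A sketch of this weighted average in Python, reusing info_T from above;
the nested lists are the class counts of the golf data within each value
of Outlook, i.e. (Play, Don't Play) counts for sunny, overcast, and rain:

def info_X(partition_counts):
    """Info(X,T): weighted average of Info(Ti) over the subsets Ti induced by attribute X."""
    total = sum(sum(counts) for counts in partition_counts)
    return sum(sum(counts) / total * info_T(counts) for counts in partition_counts)

# Outlook splits the 14 golf records into sunny (2 Play, 3 Don't),
# overcast (4, 0), and rain (3, 2).
outlook_counts = [[2, 3], [4, 0], [3, 2]]
print(round(info_X(outlook_counts), 3))  # 0.694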
Consider the quantity Gain(X,T) defined as
Gain(X,T) = Info(T) - Info(X,T)
This represents the difference between the information needed to
identify an element of T and the information needed to identify an
element of T after the value of attribute X has been obtained, that is,
this is the gain in information due to attribute X.
In our golfing example, for the Outlook attribute the gain is:
Gain(Outlook,T) = Info(T) - Info(Outlook,T) = 0.94 - 0.694 = 0.246.
If we instead consider the attribute Windy, we find that Info(Windy,T)
is 0.892 and Gain(Windy,T) is 0.048. Thus Outlook offers a greater
informational gain than Windy.
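Continuing the running sketch (with the info_T and info_X helpers defined
above), the two gains quoted here can be reproduced from class counts
alone; the count pairs below are (Play, Don't Play) within each attribute
value:

def gain(total_counts, partition_counts):
    """Gain(X,T) = Info(T) - Info(X,T)."""
    return info_T(total_counts) - info_X(partition_counts)

golf_counts = [9, 5]                       # 9 Play, 5 Don't Play overall
outlook_counts = [[2, 3], [4, 0], [3, 2]]  # sunny, overcast, rain
windy_counts = [[3, 3], [6, 2]]            # windy = true, windy = false

print(round(gain(golf_counts, outlook_counts), 3))  # 0.247 (quoted as 0.246 above, from rounding 0.94 - 0.694)
print(round(gain(golf_counts, windy_counts), 3))    # 0.048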
We can use this notion of gain to rank attributes and to build decision
trees in which each node holds the attribute with the greatest gain among
the attributes not yet considered in the path from the root.
The intent of this ordering is twofold:
To create small decision trees, so that records can be identified after
only a few questions.
To match a hoped-for minimality of the process represented by the
records being considered (Occam's Razor).
The ID3 Algorithm
The ID3 algorithm is used to build a decision tree, given a set of
non-categorical attributes C1, C2, .., Cn, the categorical attribute C,
and a training set T of records.
function ID3 (R: a set of non-categorical attributes,
              C: the categorical attribute,
              S: a training set) returns a decision tree;
begin
   If S is empty, return a single node with value Failure;
   If all records in S have the same value for the categorical attribute,
      return a single node with that value;
   If R is empty, return a single node with the most frequent categorical
      value found in S [some records will then be misclassified];
   Let D be the attribute in R with largest Gain(D,S), with values d1, .., dm,
      and let Sj be the subset of S whose records have value dj for D;
   Return a tree with root labeled D and arcs labeled d1, .., dm going
      respectively to the subtrees ID3(R-{D}, C, S1), .., ID3(R-{D}, C, Sm);
end ID3;
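For readers who prefer running code, here is a compact Python sketch of
the same recursion, written for discrete-valued attributes only (so it
applies to the stock market data as given, and to the golf data only once
the continuous attributes are discretized). The record format matches the
list-of-dicts encoding shown earlier, and all names are illustrative.

import math
from collections import Counter

def entropy(records, target):
    """Info(T): entropy of the categorical attribute over a set of records."""
    counts = Counter(r[target] for r in records)
    total = len(records)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def gain(records, attr, target):
    """Gain(X,T) = Info(T) - Info(X,T) for a discrete attribute X."""
    total = len(records)
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(records, target) - remainder

def id3(records, attributes, target):
    """Return a leaf value, or an (attribute, {value: subtree}) node."""
    if not records:
        return "Failure"
    classes = {r[target] for r in records}
    if len(classes) == 1:              # all records agree on the class
        return classes.pop()
    if not attributes:                 # no attributes left: majority class
        return Counter(r[target] for r in records).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(records, a, target))
    branches = {}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        rest = [a for a in attributes if a != best]
        branches[value] = id3(subset, rest, target)
    return (best, branches)

stock_data = [
    {"age": "old", "competition": "yes", "type": "swr", "profit": "down"},
    {"age": "old", "competition": "no",  "type": "swr", "profit": "down"},
    {"age": "old", "competition": "no",  "type": "hwr", "profit": "down"},
    {"age": "mid", "competition": "yes", "type": "swr", "profit": "down"},
    {"age": "mid", "competition": "yes", "type": "hwr", "profit": "down"},
    {"age": "mid", "competition": "no",  "type": "hwr", "profit": "up"},
    {"age": "mid", "competition": "no",  "type": "swr", "profit": "up"},
    {"age": "new", "competition": "yes", "type": "swr", "profit": "up"},
    {"age": "new", "competition": "no",  "type": "hwr", "profit": "up"},
    {"age": "new", "competition": "no",  "type": "swr", "profit": "up"},
]
print(id3(stock_data, ["age", "competition", "type"], "profit"))
# ('age', {'old': 'down', 'mid': ('competition', {'yes': 'down', 'no': 'up'}), 'new': 'up'})
# (branch order may vary from run to run)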