c4.5文档说明.txt

Don't forget the commas between values! If you leave them out, See5 will not be able to process your data. 
Notice that `?' is used to denote a value that is missing or unknown. Similarly, `N/A' denotes a value that is not applicable for a particular case. Also note that the cases do not contain values for the attribute FTI since its values are computed from other attribute values. 
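The comma-separated format and the special `?' and `N/A' tokens can be sketched with a small parser. This is a minimal sketch in Python; the sample line and its attribute values are hypothetical, not taken from the actual hypothyroid files.

```python
# A minimal sketch, assuming a See5-style comma-separated case line.
# The sample values below are hypothetical, not from hypothyroid.data.

MISSING = object()          # marker for `?' (unknown value)
NOT_APPLICABLE = object()   # marker for `N/A' (value not applicable)

def parse_case(line):
    """Split one data-file line on commas, mapping the special tokens."""
    values = []
    for token in line.strip().split(','):
        token = token.strip()
        if token == '?':
            values.append(MISSING)
        elif token == 'N/A':
            values.append(NOT_APPLICABLE)
        else:
            values.append(token)
    return values

case = parse_case("41, F, f, t, ?, 1.3, N/A, negative")
```

Forgetting a comma would merge two fields into a single token, which is why See5 cannot recover from missing separators.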

Test and cases files (optional)
Of course, the value of predictive patterns lies in their ability to make accurate predictions! It is difficult to judge the accuracy of a classifier by measuring how well it does on the cases used in its construction; the performance of the classifier on new cases is much more informative. (For instance, any number of gurus tell us about patterns that `explain' the rise/fall behavior of the stock market in the past. Even though these patterns may appear plausible, they are only valuable to the extent that they make useful predictions about future rises and falls.) 
The third kind of file used by See5 consists of new test cases (e.g. hypothyroid.test) on which the classifier can be evaluated. This file is optional and, if used, has exactly the same format as the data file. 

Another optional file, the cases file (e.g. hypothyroid.cases), differs from a test file only in allowing the cases' classes to be unknown (`?'). The cases file is used primarily with the cross-referencing procedure and public source code, both of which are described later on. 

Costs file (optional)
The last kind of file, the costs file (e.g. hypothyroid.costs), is also optional and sets out differential misclassification costs. In some applications there is a much higher penalty for certain types of mistakes. In this application, a prediction that hypothyroidism is not present could be very costly if in fact it is. On the other hand, predicting incorrectly that a patient is hypothyroid may be a less serious error. See5 allows different misclassification costs to be associated with each combination of real class and predicted class. We will return to this topic near the end of the tutorial. 
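To illustrate what differential costs mean, here is a minimal sketch of cost-sensitive prediction. The cost matrix and class probabilities are invented for illustration; the actual hypothyroid.costs values are not shown here, and See5's internal procedure may differ.

```python
# A hypothetical cost matrix: cost[(real, predicted)] is the penalty for
# predicting `predicted' when the true class is `real'.  Unlisted wrong
# predictions cost 1 and correct predictions cost 0.  The numbers below
# are invented; the actual hypothyroid.costs values are not shown in
# this tutorial.

def expected_cost(class_probs, predicted, cost):
    """Expected misclassification cost of a prediction."""
    return sum(p * cost.get((real, predicted),
                            0.0 if real == predicted else 1.0)
               for real, p in class_probs.items())

cost = {('primary', 'negative'): 10.0}   # missing hypothyroidism is costly
probs = {'primary': 0.2, 'negative': 0.8}

# With equal costs, negative (probability 0.8) would win; with the
# penalty above, predicting negative carries the higher expected cost.
risky = expected_cost(probs, 'negative', cost)
safe = expected_cost(probs, 'primary', cost)
```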

User Interface
It is difficult to see what is going on in an interface without actually using it. As a simple illustration, here is the main window of See5 after the hypothyroid application has been selected. 

[Screenshot: the main window of See5 with the hypothyroid application selected]

The main window of See5 has six buttons on its toolbar. From left to right, they are 

Locate Data 
invokes a browser to find the files for your application, or to change the current application; 
Construct Classifier 
selects the type of classifier to be constructed and sets other options; 
Stop 
interrupts the classifier-generating process; 
Review Output 
re-displays the output from the last classifier construction (if any); 
Use Classifier 
interactively applies the current classifier to one or more cases; and 
Cross-Reference 
shows how cases in training or test data relate to (parts of) a classifier and vice versa. 
These functions can also be initiated from the File menu. 
The Edit menu facilitates changes to the names and costs files after an application's files have been located. On-line help is available through the Help menu.

Constructing Classifiers
Once the names, data, and optional files have been set up, everything is ready to use See5. 
The first step is to locate the data using the Locate Data button on the toolbar (or the corresponding selection from the File menu). We will assume that the hypothyroid data above has been located in this manner.

There are several options that affect the type of classifier that See5 produces and the way that it is constructed. The Construct Classifier button on the toolbar (or selection from the File menu) displays a dialog box that sets out these classifier construction options: 

[Screenshot: the classifier construction options dialog]

Many of the options have default values that should be satisfactory for most applications. 

Decision trees
When See5 is invoked with the default values of all options, it constructs a decision tree and generates output like this: 
	See5 [Release 1.20a]	Wed Sep  1 11:01:05 2004

	Class specified by attribute `diagnosis'
	
	Read 2772 cases (24 attributes) from hypothyroid.data
	
	Decision tree:
	
	TSH <= 6: negative (2472/2)
	TSH > 6:
	:...FTI <= 65:
	    :...thyroid surgery = t:
	    :   :...FTI <= 36.1: negative (2.1)
	    :   :   FTI > 36.1: primary (2.1/0.1)
	    :   thyroid surgery = f:
	    :   :...TT4 <= 61: primary (51/3.7)
	    :       TT4 > 61:
	    :       :...referral source in {WEST,SVHD}: primary (0)
	    :           referral source = STMW: primary (0.1)
	    :           referral source = SVHC: primary (1)
	    :           referral source = SVI: primary (3.8/0.8)
	    :           referral source = other:
	    :           :...TSH <= 22: negative (6.4/2.7)
	    :               TSH > 22: primary (5.8/0.8)
	    FTI > 65:
	    :...on thyroxine = t: negative (37.7)
	        on thyroxine = f:
	        :...thyroid surgery = t: negative (6.8)
	            thyroid surgery = f:
	            :...TT4 > 153: negative (6/0.1)
	                TT4 <= 153:
	                :...TT4 <= 37: primary (2.5/0.2)
	                    TT4 > 37: compensated (174.6/24.8)
	
	
	Evaluation on training data (2772 cases):
	
		    Decision Tree   
		  ----------------  
		  Size      Errors  
	
		    14    7( 0.3%)   <<
	
	
		   (a)   (b)   (c)   (d)    <-classified as
		  ----  ----  ----  ----
		    60     3                (a): class primary
		         153           1    (b): class compensated
		                       2    (c): class secondary
		           1        2552    (d): class negative
	
	
	Evaluation on test data (1000 cases):
	
		    Decision Tree   
		  ----------------  
		  Size      Errors  
	
		    14    4( 0.4%)   <<
	
	
		   (a)   (b)   (c)   (d)    <-classified as
		  ----  ----  ----  ----
		    31                 1    (a): class primary
		     1    39                (b): class compensated
		                            (c): class secondary
		           2         926    (d): class negative
	
	
	Time: 0.0 secs

(Since hardware platforms can differ in floating point precision and rounding, the output that you see might not be exactly the same as the above.) 
The first line identifies the version of See5 and the run date. See5 constructs a decision tree from the 2772 training cases in the file hypothyroid.data, and this appears next. Although it may not look much like a tree, this output can be paraphrased as: 


	if TSH is less than or equal to 6 then negative
	else
	if TSH is greater than 6 then
	    if FTI is less than or equal to 65 then
		if thyroid surgery equals t then
		    if FTI is less than or equal to 36.1 then negative
		    else
		    if FTI is greater than 36.1 then primary
		else
		if thyroid surgery equals f then
		    if TT4 is less than or equal to 61 then primary
		    else
		    if TT4 is greater than 61 then
		    . . . .

and so on. 
The tree employs a case's attribute values to map it to a leaf designating one of the classes. Every leaf of the tree is followed by a cryptic (n) or (n/m). For instance, the last leaf of the decision tree is compensated (174.6/24.8), for which n is 174.6 and m is 24.8. The value of n is the number of cases in the file hypothyroid.data that are mapped to this leaf, and m (if it appears) is the number of them that are classified incorrectly by the leaf. (A non-integral number of cases can arise because, when the value of an attribute in the tree is not known, See5 splits the case and sends a fraction down each branch.) 
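The fractional-case mechanism can be sketched as follows. This is a minimal illustration; the branch proportions are invented, not taken from the actual tree.

```python
# A minimal sketch of the fractional-case mechanism: when the tested
# attribute's value is unknown, the case's weight is divided among the
# branches in proportion to the training cases that followed each one.
# The branch proportions below are invented for illustration.

def split_weight(weight, branch_fractions):
    """Distribute a case's weight across branches proportionally."""
    total = sum(branch_fractions.values())
    return {branch: weight * f / total
            for branch, f in branch_fractions.items()}

# A case of weight 1.0 with `thyroid surgery' unknown, where the
# training data followed the t and f branches in proportion 10 : 90:
parts = split_weight(1.0, {'t': 10, 'f': 90})
```

Summing the weights of all fractions that reach each leaf is how the non-integral n and m values arise.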

The last section of the See5 output concerns the evaluation of the decision tree, first on the cases in hypothyroid.data from which it was constructed, and then on the new cases in hypothyroid.test. The size of the tree is its number of leaves and the column headed Errors shows the number and percentage of cases misclassified. The tree, with 14 leaves, misclassifies 7 of the 2772 given cases, an error rate of 0.3%. (This might seem inconsistent with the errors recorded at the leaves -- the leaf mentioned above shows 24.8 errors! The discrepancy arises because parts of a case split as a result of unknown attribute values can be misclassified and yet, when the votes from all the parts are aggregated, the correct class can still be chosen.) 

If the number of classes is twenty or less, performance on the training cases is further analyzed in a confusion matrix that pinpoints the kinds of errors made. In this example, the decision tree misclassifies 

three of the primary cases as compensated, 
one of the compensated cases as negative, 
both secondary cases as negative, and 
one negative case as compensated. 
A similar report of performance is given for the optional test cases. A very simple majority classifier predicts that every new case belongs to the most common class in the training data. In this example, 2553 of the 2772 training cases belong to class negative so that a majority classifier would always opt for negative. The 1000 test cases from file hypothyroid.test include 928 belonging to class negative, so a simple majority classifier would have an error rate of 7.2%. The decision tree has a lower error rate of 0.4% on the new cases, but notice that this is higher than its error rate on the training cases. If there are not more than twenty classes, the confusion matrix for the test cases again shows the detailed breakdown of correct and incorrect classifications. 
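The baseline arithmetic above can be checked directly, with the class counts read off the two confusion matrices:

```python
# Checking the majority-classifier baseline.  The class counts are read
# off the two confusion matrices: 2553 negative training cases out of
# 2772, and 928 negative test cases out of 1000.

def majority_baseline(train_counts, test_counts):
    """Return (majority class, its error rate on the test counts)."""
    majority = max(train_counts, key=train_counts.get)
    test_total = sum(test_counts.values())
    errors = test_total - test_counts.get(majority, 0)
    return majority, errors / test_total

cls, err = majority_baseline(
    {'negative': 2553, 'compensated': 154, 'primary': 63, 'secondary': 2},
    {'negative': 928, 'compensated': 40, 'primary': 32, 'secondary': 0},
)
# cls == 'negative' and err == 0.072, the 7.2% quoted above
```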


Discrete value subsets
By default, a test on a discrete attribute has a separate branch for each of its values that is present in the data. Tests with a high fan-out can have the undesirable side-effect of fragmenting the data during construction of the decision tree. See5 has a Subset option that can mitigate this fragmentation to some extent: attribute values are grouped into subsets and each subtree is associated with a subset rather than with a single value.

In the hypothyroid example, invoking this option merely simplifies part of the tree as 


	referral source in {WEST,STMW,SVHC,SVI,SVHD}: primary (4.9/0.8)

with no effect on classification performance on either the training or test data. 
Although it does not help much for this application, the Subset option is recommended when a dataset has important discrete attributes with more than four or five values. 


Rulesets
Decision trees can sometimes be quite difficult to understand. An important feature of See5 is its ability to generate classifiers called rulesets that consist of unordered collections of (relatively) simple if-then rules. 

The Rulesets option causes classifiers to be expressed as rulesets rather than decision trees, here giving the following rules: 


	Rule 1: (31, lift 42.7)
		thyroid surgery = f
		TSH > 6
		TT4 <= 37
		->  class primary  [0.970]
	
	Rule 2: (63/6, lift 39.3)
		TSH > 6
		FTI <= 65
		->  class primary  [0.892]
	
	Rule 3: (270/116, lift 10.3)
		TSH > 6
		->  class compensated  [0.570]
	
	Rule 4: (2225/2, lift 1.1)
		TSH <= 6
		->  class negative  [0.999]
	
	Rule 5: (296, lift 1.1)
		on thyroxine = t
		FTI > 65
		->  class negative  [0.997]
	
	Rule 6: (240, lift 1.1)
		TT4 > 153
		->  class negative  [0.996]
	
	Rule 7: (29, lift 1.1)
		thyroid surgery = t
		FTI > 65
		->  class negative  [0.968]
	
	Default class: negative

Each rule consists of: 

A rule number -- this is quite arbitrary and serves only to identify the rule. 
Statistics (n, lift x) or (n/m, lift x) that summarize the performance of the rule. As with a leaf, n is the number of training cases covered by the rule and m, if it appears, shows how many of them do not belong to the class predicted by the rule. The rule's accuracy is estimated by the Laplace ratio (n-m+1)/(n+2). The lift x is the result of dividing the rule's estimated accuracy by the relative frequency of the predicted class in the training set. 
One or more conditions that must all be satisfied if the rule is to be applicable. 
A class predicted by the rule. 
A value between 0 and 1 that indicates the confidence with which this prediction is made. (Note: If boosting is used, this confidence is measured using an artificial weighting of the training cases and so does not reflect the accuracy of the rule.) 
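The Laplace ratio and lift formulas above can be checked against Rule 2, for which n = 63, m = 6, and class primary accounts for 63 of the 2772 training cases:

```python
# Checking the rule statistics against Rule 2, where n = 63, m = 6,
# and class primary accounts for 63 of the 2772 training cases.

def laplace_accuracy(n, m=0):
    """Laplace ratio (n - m + 1) / (n + 2)."""
    return (n - m + 1) / (n + 2)

def lift(n, m, class_count, total_cases):
    """Estimated accuracy divided by the class's relative frequency."""
    return laplace_accuracy(n, m) / (class_count / total_cases)

acc = laplace_accuracy(63, 6)   # 58/65 = 0.892, as shown for Rule 2
x = lift(63, 6, 63, 2772)       # 39.3, Rule 2's reported lift
```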
When a ruleset like this is used to classify a case, it may happen that several of the rules are applicable (that is, all their conditions are satisfied). If the applicable rules predict different classes, there is an implicit conflict that could be resolved in two ways: we could believe the rule with the highest confidence, or we could attempt to aggregate the rules' predictions to reach a verdict. See5 adopts the latter strategy -- each applicable rule votes for its predicted class with a voting weight equal to its confidence value, the votes are totted up, and the class with the highest total vote is chosen as the final prediction. There is also a default class, here negative, that is used when none of the rules apply. 
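The voting scheme just described can be sketched as follows, with rules reduced to (class, confidence) pairs from the example ruleset:

```python
# A minimal sketch of the voting scheme: each applicable rule votes for
# its class with weight equal to its confidence, and the class with the
# highest total wins; the default class is used when no rule applies.

def classify_by_voting(applicable_rules, default_class):
    """applicable_rules: list of (predicted_class, confidence) pairs."""
    if not applicable_rules:
        return default_class
    votes = {}
    for cls, confidence in applicable_rules:
        votes[cls] = votes.get(cls, 0.0) + confidence
    return max(votes, key=votes.get)

# A case with TSH > 6 and FTI <= 65 satisfies at least Rules 2 and 3,
# which disagree; primary's 0.892 outvotes compensated's 0.570:
winner = classify_by_voting([('primary', 0.892), ('compensated', 0.570)],
                            'negative')
```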
Rulesets are generally easier to understand than trees since each rule describes a specific context associated with a class. Furthermore, a ruleset generated from a tree usually has fewer rules than the tree has leaves, another plus for comprehensibility. (In this example, the first decision tree with 14 leaves is reduced to seven rules.) Finally, rules are often more accurate predictors than decision trees -- a point not illustrated here, since this ruleset's error rate of 0.5% on the test cases is slightly higher than the tree's 0.4%. For very large datasets, however, generating rules with the Ruleset option can require considerably more computer time. 

In the example above, rules are ordered by class and sub-ordered by confidence. An alternative ordering by estimated contribution to predictive accuracy can be selected using the Sort by utility option. Under this option, the rule that most reduces the error rate appears first and the rule that contributes least appears last. Furthermore, results are reported in a selected number of bands so that the predictive accuracies of the more important subsets of rules are also estimated. For example, if the Sort by utility option with four bands is selected, the hypothyroid rules are reordered as 

	Rule 1: (2225/2, lift 1.1)
		TSH <= 6
		->  class negative  [0.999]
	
	Rule 2: (270/116, lift 10.3)
		TSH > 6
		->  class compensated  [0.570]
	
	Rule 3: (63/6, lift 39.3)
		TSH > 6
		FTI <= 65
		->  class primary  [0.892]
	
