In the hypothyroid example, using a sample of 60% would cause a classifier to be constructed from a randomly-selected 1663 of the 2772 cases in hypothyroid.data, then tested on the remaining 1109 cases.
By default, the random sample changes every time that a classifier is constructed, so that successive runs of See5 with sampling will usually produce different results. This re-sampling can be avoided by selecting the Lock sample option that uses the current sample for constructing subsequent classifiers. If this option is selected, the sample will change only when another application is loaded, the sample percentage is altered, the option is unselected, or See5 is restarted.
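The effect of sampling and the Lock sample option can be pictured with a short sketch (illustrative Python, not part of See5): drawing a sample with a fixed seed reproduces the same train/test split every time, which is what locking the sample achieves.

```python
import random

def sample_split(n_cases, pct, seed):
    """Return (training, test) index lists for a pct% random sample.

    A fixed seed plays the role of See5's Lock sample option: the
    same seed always yields the same split; a new seed re-samples.
    """
    rng = random.Random(seed)
    indices = list(range(n_cases))
    rng.shuffle(indices)
    cut = round(n_cases * pct / 100)
    return indices[:cut], indices[cut:]

# A 60% sample of the 2772 hypothyroid cases: 1663 training, 1109 test.
train, test = sample_split(2772, 60, seed=1)
print(len(train), len(test))  # 1663 1109
```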
Cross-validation trials
As we saw earlier, the performance of a classifier on the training cases from which it was constructed gives a poor estimate of its accuracy on new cases. The true predictive accuracy of the classifier can be estimated by sampling, as above, or by using a separate test file; either way, the classifier is evaluated on cases that were not used to build it. However, this estimate can be unreliable unless the numbers of cases used to build and evaluate the classifier are both large. If the cases in hypothyroid.data and hypothyroid.test were to be shuffled and divided into a new 2772-case training set and a 1000-case test set, See5 might construct a different classifier with a lower or higher error rate on the test cases.
One way to get a more reliable estimate of predictive accuracy is by f-fold cross-validation. The cases in the data file are divided into f blocks of roughly the same size and class distribution. For each block in turn, a classifier is constructed from the cases in the remaining blocks and tested on the cases in the hold-out block. In this way, each case is used just once as a test case. The error rate of a classifier produced from all the cases is estimated as the ratio of the total number of errors on the hold-out cases to the total number of cases.
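The partitioning scheme can be sketched as follows (an illustrative Python fragment, not See5 code; the `evaluate` function is a hypothetical stand-in for building a classifier and counting its errors on the hold-out block, and the sketch omits the class-distribution balancing that See5 also performs):

```python
import random

def cross_validate(cases, f, evaluate, seed=0):
    """Estimate error rate by f-fold cross-validation.

    `cases` is a list of (attributes, class) pairs; `evaluate` is a
    hypothetical function that builds a classifier on the training
    cases and returns the number of errors on the hold-out cases.
    """
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    blocks = [shuffled[i::f] for i in range(f)]  # f blocks of roughly equal size
    total_errors = 0
    for i, hold_out in enumerate(blocks):
        training = [c for j, b in enumerate(blocks) if j != i for c in b]
        total_errors += evaluate(training, hold_out)
    return total_errors / len(cases)             # each case is tested exactly once
```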
The Cross-validation option with f folds runs such an f-fold cross-validation. Suppose now that we select the Cross-validation option with 10 folds together with the Rulesets option. After giving details of the individual rulesets, the output shows a summary like this:
Fold        Rules
----    ----------------
          No    Errors

   0       7     0.8%
   1       7     0.3%
   2       7     0.5%
   3       7     0.3%
   4       8     0.8%
   5       7     0.5%
   6       7     0.3%
   7       6     0.5%
   8       7     0.5%
   9       7     0.8%

Mean     7.0     0.5%
  SE     0.1     0.1%
This estimates the error rate of the rulesets produced from the 2772 cases in hypothyroid.data at 0.5%. The SE figures (the standard errors of the means) provide an estimate of the variability of these results.
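The Mean and SE rows can be reproduced from the fold error rates above. A quick Python check follows; it assumes SE is computed as the sample standard deviation of the fold error rates divided by the square root of the number of folds, which matches the figures shown, though See5's exact formula is not documented here.

```python
from math import sqrt

fold_errors = [0.8, 0.3, 0.5, 0.3, 0.8, 0.5, 0.3, 0.5, 0.5, 0.8]  # % per fold

mean = sum(fold_errors) / len(fold_errors)
# Sample variance (divide by f - 1), then standard error of the mean.
var = sum((e - mean) ** 2 for e in fold_errors) / (len(fold_errors) - 1)
se = sqrt(var) / sqrt(len(fold_errors))

print(f"Mean {mean:.1f}%  SE {se:.1f}%")  # Mean 0.5%  SE 0.1%
```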
The cross-validation procedure can be repeated for different random partitions of the cases into blocks. The average error rate from these distinct cross-validations is then an even more reliable estimate of the error rate of the single classifier produced from all the cases.
Since every cross-validation fold uses only part of the application's data, running a cross-validation does not cause a classifier to be saved. To save a classifier for later use, simply run See5 without employing cross-validation.
Differential misclassification costs
Up to this point, all errors have been treated as equal -- we have simply counted the number of errors made by a classifier to summarize its performance. Let us now turn to the situation in which the `cost' associated with a classification error depends on the predicted and true class of the misclassified case.
See5 allows costs to be assigned to any combination of predicted and true class via entries in the optional file filestem.costs. Each entry has the form
predicted class, true class: cost
where cost is any non-negative value. The file may contain any number of entries; if a particular combination is not specified explicitly, its cost is taken to be 0 if the predicted class is correct and 1 otherwise.
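A reader for this file format can be sketched in a few lines of Python (illustrative only; See5 reads the file itself). Entries not listed default to 0 when the predicted class is correct and 1 otherwise:

```python
def read_costs(lines, classes):
    """Parse `predicted class, true class: cost` entries into a dict.

    `lines` are the text lines of a filestem.costs file; `classes`
    is the list of class names. Unlisted combinations default to
    0 on the diagonal (correct prediction) and 1 off it.
    """
    cost = {(p, t): (0.0 if p == t else 1.0)
            for p in classes for t in classes}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        pair, value = line.rsplit(":", 1)
        predicted, true = (s.strip() for s in pair.split(","))
        cost[(predicted, true)] = float(value)
    return cost

classes = ["primary", "compensated", "secondary", "negative"]
costs = read_costs(["negative, primary: 5",
                    "negative, secondary: 5",
                    "negative, compensated: 5"], classes)
print(costs[("negative", "primary")])  # 5.0
print(costs[("primary", "negative")])  # 1.0
```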
To illustrate the idea, suppose that it was a much more serious error to classify a hypothyroid patient as negative than the converse. A hypothetical costs file hypothyroid.costs might look like this:
negative, primary: 5
negative, secondary: 5
negative, compensated: 5
This specifies that the cost of misclassifying any primary, secondary, or compensated patient as negative is 5 units. Since they are not given explicitly, all other errors have cost 1 unit. In other words, the first kind of error is five times more costly.
A costs file is automatically read by See5 unless the system is told to ignore it. The output from the system using default options now looks like this:
See5 [Release 1.20a] Wed Sep 1 11:04:52 2004
Class specified by attribute `diagnosis'
Read 2772 cases (24 attributes) from hypothyroid.data
Read misclassification costs from hypothyroid.costs
Decision tree:
TSH <= 6:
:...TT4 > 54: negative (2444.3)
: TT4 <= 54:
: :...referral source in {WEST,SVHC,SVHD}: negative (0)
: referral source = STMW: negative (1)
: referral source = SVI: negative (18)
: referral source = other:
: :...T4U > 0.88: secondary (3.5/2.1)
: T4U <= 0.88:
: :...query hypothyroid = f: negative (3.6)
: query hypothyroid = t: secondary (1.7/1.1)
TSH > 6:
:...FTI <= 65:
:...TT4 <= 63: primary (59.4/8)
: TT4 > 63:
: :...T4U <= 1.1: compensated (2.9/1.3)
: T4U > 1.1:
: :...TT4 <= 90: primary (8.8/1.7)
: TT4 > 90: compensated (1.3/0.2)
FTI > 65:
:...on thyroxine = t: negative (37.7)
on thyroxine = f:
:...thyroid surgery = t: negative (6.8)
thyroid surgery = f:
:...TT4 <= 61:
:...TT4 <= 37: primary (2.5/0.2)
: TT4 > 37: compensated (3.4/0.4)
TT4 > 61:
:...age > 8:
:...TT4 <= 144: compensated (163.7/22.7)
: TT4 > 144:
: :...TT4 <= 153: compensated (2.3/0.3)
: TT4 > 153: negative (6/0.1)
age <= 8:
:...TSH > 29: primary (0.7)
TSH <= 29:
:...referral source in {WEST,SVHC,SVI,
: SVHD}: compensated (0)
referral source = other: compensated (2.8)
referral source = STMW:
:...age <= 1: compensated (1)
age > 1: primary (0.7)
Evaluation on training data (2772 cases):
    Decision Tree
  -----------------------
  Size      Errors   Cost

    21   11( 0.4%)   0.00   <<


   (a)   (b)   (c)   (d)    <-classified as
  ----  ----  ----  ----
    62     1                (a): class primary
         154                (b): class compensated
                 2          (c): class secondary
     6     2     2  2543    (d): class negative
Evaluation on test data (1000 cases):
    Decision Tree
  -----------------------
  Size      Errors   Cost

    21    8( 0.8%)   0.01   <<


   (a)   (b)   (c)   (d)    <-classified as
  ----  ----  ----  ----
    31     1                (a): class primary
     1    39                (b): class compensated
                            (c): class secondary
     2     2     2   922    (d): class negative
Time: 0.0 secs
This new decision tree has a higher error rate than the first decision tree for both the training and test cases, and might therefore appear entirely inferior to it. The real difference comes when we compare the total cost of misclassified training cases for the two trees. The first decision tree, which was derived without reference to the differential costs, has a total cost of 19 (4x1 + 3x5) for the misclassified training cases in hypothyroid.data. The corresponding value for the new tree is 11 (11x1). That is, the total misclassification cost over the training cases is lower than that of the original tree. The total misclassification cost on the test data is 8 (3x1 + 1x5) for the original tree and also 8 (8x1) for the new tree.
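The cost totals quoted above follow directly from the error counts and the costs file. A quick Python check (the error counts are taken from the text; only misclassifying a hypothyroid case as negative costs 5, every other error costs 1):

```python
def total_cost(errors_as_negative, other_errors):
    """Total misclassification cost under hypothyroid.costs:
    5 units per hypothyroid case predicted negative, 1 unit otherwise."""
    return 5 * errors_as_negative + 1 * other_errors

# First tree, training data: 7 errors, 3 of them hypothyroid -> negative.
print(total_cost(3, 4))   # 19
# Cost-sensitive tree, training data: 11 errors, none -> negative.
print(total_cost(0, 11))  # 11
# Test data: both trees total 8 units.
print(total_cost(1, 3), total_cost(0, 8))  # 8 8
```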
Using Classifiers
Once a classifier has been constructed, an interactive interpreter can be used to predict the classes to which new cases belong. The Use Classifier button invokes the interpreter, using the most recent classifier for the current application, and prompts for information about the case to be classified. Since not every attribute value may be needed, the values requested depend on the case itself. When all the relevant information has been entered, the most likely class (or classes) is shown, each with a confidence value. For example, this is the result of analyzing a case using the first decision tree above:
Classifiers can also be used in batch mode. The sample application provided in the public source code reads cases from a cases file and shows the predicted class and the confidence for each.
Cross-Referencing Classifiers and Data
See5 incorporates a unique facility that links data and the relevant sections of (possibly boosted) classifiers. We will illustrate this facility using the first decision tree for the hypothyroid application and the cases in hypothyroid.data from which it was constructed.
The Cross-Reference button brings up a window showing the most recent classifier for the current application and how it relates to the cases in the data, test or cases file. (If more than one of these is present, a menu will prompt you to select the file.)
The window is divided into two panes, with the classifier on the left and a list of cases on the right. The Reset button can be used at any time to restore the window to this initial state.
Each case has a [?] tag (shown in red if the case is misclassified), an identifying number or label, and the class predicted for the case (also shown in red when incorrect). Clicking on the [?] tag in front of a case number or label displays that case:
The values of label attributes and attributes excluded or ignored are displayed in a lighter tone to indicate that they play no part in classifying the case.
Clicking on a case's label or number shows the part(s) of the classifier(s) relevant to that case. For instance, clicking on case 3169 shows the leaf to which this case is mapped:
If a case has missing values for one or more attributes, if it is covered by several rules, or if boosted classifiers are used, more than one leaf or rule may be relevant to a case. In such situations, all relevant classifier parts are shown.
Click on any leaf or rule, and all the cases that map to the leaf or rule are shown. For instance, clicking on Reset and then the leaf indicated shows all cases that are covered by that leaf:
This last pane may be puzzling for two reasons:
The case pane shows nine cases but the count shown at the leaf is 3.8. This happens because some of these nine cases have unknown values for the attributes tested on the path to this leaf (TSH, FTI, thyroid surgery, TT4, referral source). Cases like this are split into partial cases associated with each outcome of the test.
This leaf predicts class primary, yet some cases belonging to other classes are not highlighted in red as errors. As noted above, parts of a case that has been split as a result of unknown attribute values can be misclassified individually and yet, when the votes from all the parts are aggregated, the correct class can still be chosen. Cases 3469, 3640, 2266, 311, and 3607 are classified correctly by the decision tree as a whole.
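The fractional-case mechanism behind both observations can be sketched abstractly (Python, illustrating the C4.5/See5 idea rather than its exact implementation; the single-test function and its arguments are hypothetical simplifications of the general recursive scheme): a case whose tested attribute value is unknown is split across the branches in proportion to their observed frequencies, and the resulting class votes are summed before the final prediction is made.

```python
def classify_with_unknown(weights_by_branch, branch_predictions, weight=1.0):
    """Aggregate votes for a case whose tested attribute is unknown.

    `weights_by_branch` maps each outcome of the test to the fraction
    of training cases taking that outcome; `branch_predictions` maps
    each outcome to the class distribution its subtree predicts.
    """
    votes = {}
    for branch, frac in weights_by_branch.items():
        for cls, p in branch_predictions[branch].items():
            votes[cls] = votes.get(cls, 0.0) + weight * frac * p
    return max(votes, key=votes.get), votes

# 70% of training cases took the left branch, 30% the right; the 0.3
# part of the case is "misclassified" at the right leaf, but the
# aggregated vote still picks the majority outcome.
cls, votes = classify_with_unknown(
    {"left": 0.7, "right": 0.3},
    {"left": {"negative": 1.0}, "right": {"primary": 1.0}})
print(cls)  # negative
```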
The Save button preserves the details of the displayed classifier and case list as an ASCII file selected through a dialog box.
Generating Classifiers in Batch Mode
The See5 distribution includes a program See5X that can be used to produce classifiers non-interactively. This console application resides in the same folder as See5 (usually C:\Program Files\See5 for single-computer licences or the See5 folder on your desktop for network licences) and is invoked from an MS-DOS Prompt window. The command to run the program is:
start See5X -f filestem parameters
where the parameters enable one or more options discussed above to be selected:
-s          use the Subset option
-r          use the Ruleset option
-u bands    sort rules by their utility into bands
-b          use the Boosting option with 10 trials
-t trials   ditto, with the specified number of trials
-w          winnow attributes before constructing a classifier
-S x        use the Sampling option with x%
-I seed     set the sampling seed value
-X folds    carry out a cross-validation
-g          turn off the global tree pruning stage
-c CF       set the Pruning CF value
-m cases    set the Minimum cases
-p          use the Fuzzy thresholds option
-e          ignore any costs file
If desired, output from See5X can be redirected to a file in the usual way.
As an example (for a single-computer licensee), typing the commands
cd "C:\Program Files\See5"
start See5X -f Samples\anneal -r -b >save.txt
in an MS-DOS Prompt window will generate a boosted ruleset classifier for the anneal application in the Samples directory, leaving the output in file save.txt.
Linking to Other Programs
The classifiers generated by See5 are retained in files filestem.tree (for decision trees) and filestem.rules (for rulesets). Free C source code is available to read these classifier files and to make predictions with them, enabling you to use See5 classifiers in other programs. As an example, the source includes a program to read cases from a cases file, and to show how each is classified by boosted or single trees or rulesets.
Click here to download a zip archive containing the public source code.
© RULEQUEST RESEARCH 2004    Last updated September 2004