See5: An Informal Tutorial
Welcome to See5, a system that extracts informative patterns from data. The following sections show how to prepare data files for See5 and illustrate the options for using the system.
In this tutorial, file names and See5 input or output appear in blue fixed-width font while file extensions and other general forms are shown highlighted in green. Buttons and options on the Windows GUI are in maroon.
Preparing Data for See5
    Application files
    Names file
        What's in a name?
        Specifying the classes
        Explicitly-defined attributes
        Attributes defined by formulas
        Dates, times, and timestamps
        Selecting the attributes that can appear in classifiers
    Data file
    Test and cases files (optional)
    Costs file (optional)
User Interface
Constructing Classifiers
    Decision trees
    Discrete value subsets
    Rulesets
    Boosting
    Winnowing attributes
    Softening thresholds
    Advanced pruning options
    Sampling from large datasets
    Cross-validation trials
    Differential misclassification costs
Using Classifiers
Cross-Referencing Classifiers and Data
Generating Classifiers in Batch Mode
Linking to Other Programs
--------------------------------------------------------------------------------
Preparing Data for See5
We will illustrate See5 using a medical application -- mining a database of thyroid assays from the Garvan Institute of Medical Research, Sydney, to construct diagnostic rules for hypothyroidism. Each case concerns a single referral and contains information on the source of the referral, assays requested, patient data, and referring physician's comments. Here are three examples:
Attribute                    Case 1    Case 2    Case 3            .....
age                          41        23        46
sex                          F         F         M
on thyroxine                 f         f         f
query on thyroxine           f         f         f
on antithyroid medication    f         f         f
sick                         f         f         f
pregnant                     f         f         not applicable
thyroid surgery              f         f         f
I131 treatment               f         f         f
query hypothyroid            f         f         f
query hyperthyroid           f         f         f
lithium                      f         f         f
tumor                        f         f         f
goitre                       f         f         f
hypopituitary                f         f         f
psych                        f         f         f
TSH                          1.3       4.1       0.98
T3                           2.5       2         unknown
TT4                          125       102       109
T4U                          1.14      unknown   0.91
FTI                          109       unknown   unknown
referral source              SVHC      other     other
diagnosis                    negative  negative  negative
ID                           3733      1442      2965
This is exactly the sort of task for which See5 was designed. Each case belongs to one of a small number of mutually exclusive classes (negative, primary, secondary, compensated). Properties of every case that may be relevant to its class are provided, although some cases may have unknown or non-applicable values for some attributes. There are 24 attributes in this example, but See5 can deal with any number of attributes.
See5's job is to find how to predict a case's class from the values of the other attributes. See5 does this by constructing a classifier that makes this prediction. As we will see, See5 can construct classifiers expressed as decision trees or as sets of rules.
Application files
Every See5 application has a short name called a filestem; we will use the filestem hypothyroid for this illustration. All files read or written by See5 for an application have names of the form filestem.extension, where filestem identifies the application and extension describes the contents of the file.
Here is a summary table of the extensions used by See5 (to be described in later sections):
names    description of the application's attributes       [required]
data     cases used to generate a classifier               [required]
test     unseen cases used to test a classifier            [optional]
cases    cases to be classified subsequently               [optional]
costs    differential misclassification costs              [optional]
tree     decision tree classifier produced by See5         [output]
rules    ruleset classifier produced by See5               [output]
out      report produced when a classifier is generated    [output]
set      settings used for the last classifier             [output]
The case of letters in both the filestem and extension is important -- file names APP.DATA, app.data, and App.Data, are all different. The extensions must be written in lower case as shown above, otherwise See5 will not recognize the files for your application.
If See5 cannot seem to find your files even though the filestem and extensions are correct, please check that file extensions are not hidden on your computer. (If extensions are hidden and you write a text file from Wordpad, it automatically adds an extension .txt that makes the file invisible to See5.) Here's what to do:
Double click "My Computer", select "Tools" (or "View" for Windows 98), then "Folder Options" and then the "View" tab. The box "Hide file extensions for known file types" should not be checked. If it is, uncheck it and click "Apply".
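As a quick check of the naming rules above, the following Python sketch reports which application files are present for a filestem. The helper find_application_files is hypothetical and not part of See5; only the extensions are taken from the summary table. It compares names exactly, so a file saved as hypothyroid.data.txt or HYPOTHYROID.DATA is not counted.

```python
from pathlib import Path

# Extensions from the summary table; names and data are required.
REQUIRED = ("names", "data")
OPTIONAL = ("test", "cases", "costs")


def find_application_files(directory, filestem):
    """Return (found, missing_required) for an application.

    Hypothetical helper, not part of See5. File names are compared
    exactly, so APP.DATA and app.data count as different files.
    """
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    found = [ext for ext in REQUIRED + OPTIONAL
             if f"{filestem}.{ext}" in present]
    missing = [ext for ext in REQUIRED if ext not in found]
    return found, missing
```

Calling find_application_files(".", "hypothyroid") in the application's folder returns the extensions that were found and any required ones that are missing.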
Names file
Two files are essential for all See5 applications and there are three further optional files, each identified by its extension. The first essential file is the names file (e.g. hypothyroid.names) that describes the attributes and classes. There are two important subgroups of attributes:
The value of an explicitly-defined attribute is given directly in the data. A discrete attribute has a value drawn from a set of nominal values, a continuous attribute has a numeric value, a date attribute holds a calendar date, a time attribute holds a clock time, a timestamp attribute holds a date and time, and a label attribute serves only to identify a particular case.
The value of an implicitly-defined attribute is specified by a formula. (Most attributes are explicitly defined, so you may never need implicitly-defined attributes.)
The file hypothyroid.names looks like this:
diagnosis. | the target attribute
age: continuous.
sex: M, F.
on thyroxine: f, t.
query on thyroxine: f, t.
on antithyroid medication: f, t.
sick: f, t.
pregnant: f, t.
thyroid surgery: f, t.
I131 treatment: f, t.
query hypothyroid: f, t.
query hyperthyroid: f, t.
lithium: f, t.
tumor: f, t.
goitre: f, t.
hypopituitary: f, t.
psych: f, t.
TSH: continuous.
T3: continuous.
TT4: continuous.
T4U: continuous.
FTI:= TT4 / T4U.
referral source: WEST, STMW, SVHC, SVI, SVHD, other.
diagnosis: primary, compensated, secondary, negative.
ID: label.
What's in a name?
Names, labels, classes, and discrete values are represented by arbitrary strings of characters, with some fine print:
Tabs and spaces are permitted inside a name or value, but See5 collapses every sequence of these characters to a single space.
Special characters (comma, colon, period, vertical bar `|') can appear in names and values, but must be prefixed by the escape character `\'. For example, the name "Filch, Grabbit, and Co." would be written as `Filch\, Grabbit\, and Co\.'. (Colons in times and periods in numbers do not need to be escaped.)
Whitespace (blank lines, spaces, and tab characters) is ignored except inside a name or value and can be used to improve legibility. Unless it is escaped as above, the vertical bar `|' causes the remainder of the line to be ignored and is handy for including comments. This use of `|' should not occur inside a value.
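When names or values are generated by a program, the fine print above can be applied mechanically. The Python sketch below is one possible way to do it; escape_name is a hypothetical helper, and it escapes every special character, so it should be used only for names and discrete values, not for numbers or times.

```python
import re

# Characters that must be prefixed by the escape character.
SPECIAL = ",:.|"


def escape_name(raw):
    """Collapse runs of tabs and spaces, then escape special characters.

    Hypothetical helper for names and discrete values only.
    """
    collapsed = re.sub(r"[ \t]+", " ", raw.strip())
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in collapsed)


print(escape_name("Filch, Grabbit, and Co."))
# prints: Filch\, Grabbit\, and Co\.
```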
Specifying the classes
The first entry in the names file specifies the classes in one of three formats:
A list of class names separated by commas, e.g.
primary, compensated, secondary, negative.
The name of a discrete attribute (the target attribute) that contains the class value, e.g.:
diagnosis.
The name of a continuous target attribute followed by a colon and one or more thresholds in increasing order and separated by commas. If there are t thresholds X1, X2, ..., Xt then the values of the attribute are divided into t+1 ranges:
less than or equal to X1
greater than X1 and less than or equal to X2
. . .
greater than Xt.
Each range defines a class, so there are t+1 classes. For example, a hypothetical entry
age: 12, 19.
would define three classes: age <= 12, 12 < age <= 19, and age > 19.
This first entry defining the classes is followed by definitions of the attributes in the order that they will be given for each case.
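The way thresholds divide a continuous target into classes can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not See5's own code; threshold_class and the labels list are names made up for the example.

```python
from bisect import bisect_left


def threshold_class(value, thresholds):
    """Return the 0-based index of the range that contains value.

    thresholds must be in increasing order. Index 0 means value <= X1
    and index len(thresholds) means value > Xt.
    """
    return bisect_left(thresholds, value)


labels = ["age <= 12", "12 < age <= 19", "age > 19"]
for age in (12, 13, 19, 20):
    print(age, labels[threshold_class(age, [12, 19])])
```

A value equal to a threshold falls in the lower range, which is why bisect_left is used rather than bisect_right.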
Explicitly-defined attributes
The name of each explicitly-defined attribute is followed by a colon `:' and a description of the values taken by the attribute. There are eight possibilities:
continuous
The attribute takes numeric values.
date
The attribute's values are dates in the form YYYY/MM/DD or YYYY-MM-DD, e.g. 1999/09/30 or 1999-09-30.
time
The attribute's values are times in the form HH:MM:SS with values between 00:00:00 and 23:59:59.
timestamp
The attribute's values are combined dates and times in the form YYYY/MM/DD HH:MM:SS or YYYY-MM-DD HH:MM:SS, e.g. 1999-09-30 15:04:00. (Note that there is a space separating the date and time.)
a comma-separated list of names
The attribute takes discrete values, and these are the allowable values. The values may be prefaced by [ordered] to indicate that they are given in a meaningful ordering, otherwise they will be taken as unordered. For instance, the values low, medium, high are ordered, while meat, poultry, fish, vegetables are not. The former might be declared as
grade: [ordered] low, medium, high.
If the attribute values have a natural order, it is better to declare them as such so that See5 can exploit the ordering. (NB: The target attribute should not be declared as ordered.)
discrete N for some integer N
The attribute has discrete, unordered values, but the values are assembled from the data itself; N is the maximum number of such values. This form can be handy for unordered discrete attributes with many values, but its use means that the data values cannot be checked. (NB: This form cannot be used for the target attribute.)
ignore
The values of the attribute should be ignored.
label
This attribute contains an identifying label for each case, such as an account number or an order code. The value of the attribute is ignored when classifiers are constructed, but is used when referring to individual cases. A label attribute can make it easier to locate errors in the data and to cross-reference results to individual cases. If there are two or more label attributes, only the last is used.
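Before handing data to See5 it can be useful to check that date, time, and timestamp values follow the forms listed above. The Python sketch below does this with the standard datetime module; parse_value is a hypothetical helper, and strptime is slightly more lenient than the forms shown (it accepts 1999/9/30, for example).

```python
from datetime import datetime

DATE_FORMATS = ("%Y/%m/%d", "%Y-%m-%d")


def parse_value(kind, text):
    """Parse a date, time, or timestamp value; raise ValueError if malformed.

    Hypothetical helper: kind is "date", "time", or "timestamp".
    """
    if kind == "time":
        formats = ("%H:%M:%S",)
    elif kind == "date":
        formats = DATE_FORMATS
    elif kind == "timestamp":
        formats = tuple(f + " %H:%M:%S" for f in DATE_FORMATS)
    else:
        raise ValueError(f"unsupported kind: {kind}")
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next accepted form
    raise ValueError(f"{text!r} is not a valid {kind}")
```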
Attributes defined by formulas
The name of each implicitly-defined attribute is followed by `:=' and then a formula defining the attribute value. The formula is written in the usual way, using parentheses where needed, and may refer to any attribute defined before this one. Constants in the formula can be numbers (written in decimal notation), dates, times, and discrete attribute values (enclosed in string quotes `"'). The operators and functions that can be used in the formula are
+, -, *, /, % (mod), ^ (meaning `raised to the power')
>, >=, <, <=, =, <> or != (both meaning `not equal')
and, or
sin(...), cos(...), tan(...), log(...), exp(...), int(...) (meaning `integer part of')
The value of such an attribute is either continuous or true/false depending on the formula. For example, the attribute FTI above is continuous, since its value is obtained by dividing one number by another. The value of a hypothetical attribute such as
strange := referral source = "WEST" or age > 40.
would be either t or f since the value given by the formula is either true or false.
If the value of the formula cannot be determined for a particular case because one or more of the attributes appearing in the formula have unknown or non-applicable values, the value of the implicitly-defined attribute is unknown.
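The two example formulas, together with the rule for unknown values, might be mimicked in Python as follows. This is a sketch for illustration only: None stands for an unknown or non-applicable value, the function names are invented for the example, and division by zero is left unhandled because the text does not say how See5 treats it.

```python
def fti(tt4, t4u):
    """FTI := TT4 / T4U, with None standing for an unknown value."""
    if tt4 is None or t4u is None:
        return None  # unknown input gives an unknown result
    return tt4 / t4u


def strange(referral_source, age):
    """strange := referral source = "WEST" or age > 40."""
    if referral_source is None or age is None:
        return None  # unknown input gives an unknown result
    return "t" if referral_source == "WEST" or age > 40 else "f"


print(fti(102, None))        # unknown T4U, so FTI is unknown
print(strange("other", 23))  # neither condition holds
```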
Dates, times, and timestamps
Dates are stored by See5 as the number of days since a particular starting point so some operations on dates make sense. Thus, if we have attributes
d1: date.
d2: date.
we could define
interval := d2 - d1.
gap := d1 <= d2 - 7.
d1-day-of-week := (d1 + 1) % 7 + 1.
interval then represents the number of days from d1 to d2 (non-inclusive) and gap would have a true/false value signaling whether d1 is at least a week before d2. The last definition is a slightly non-obvious way of determining the day of the week on which d1 falls, with values ranging from 1 (Monday) to 7 (Sunday).
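The three definitions can be tried out in Python. The starting point See5 uses for day numbers is not stated here, so the sketch assumes a hypothetical day numbering whose day 0 falls on a Tuesday; that assumption is what makes the day-of-week formula give 1 for Monday. OFFSET and day_number are names invented for the example.

```python
from datetime import date

# Assumption: day 0 is a Tuesday. date.toordinal() returns 1 for a
# Monday (0001-01-01), so adding 5 produces such a numbering.
OFFSET = 5


def day_number(d):
    """Day number of a date under the assumed starting point."""
    return d.toordinal() + OFFSET


d1 = day_number(date(1999, 9, 23))
d2 = day_number(date(1999, 9, 30))

interval = d2 - d1                 # days from d1 to d2
gap = d1 <= d2 - 7                 # True when d1 is at least a week earlier
d1_day_of_week = (d1 + 1) % 7 + 1  # 1 (Monday) to 7 (Sunday)

print(interval, gap, d1_day_of_week)
```

The values of interval and gap do not depend on the starting point, since both involve only differences between dates.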
Similarly, times are stored as the number of seconds since midnight. If the names file includes
start: time.
finish: time.
elapsed := finish - start.
the value of elapsed is the number of seconds from start to finish.
Timestamps are a little more complex. A timestamp is rounded to the nearest minute, but limitations on the precision of floating-point numbers mean that the values stored for timestamps from more than thirty years ago are approximate. If the names file includes
departure: timestamp.
arrival: timestamp.
flight time := arrival - departure.
the value of flight time is the number of minutes from departure to arrival.
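The units involved (seconds for times, minutes for timestamps) can be illustrated with a short Python sketch. It is a simplification rather than a description of See5's internals: the function names are invented, and the timestamp difference is rounded once at the end instead of rounding each timestamp to the nearest minute first.

```python
from datetime import datetime


def seconds_since_midnight(text):
    """Convert HH:MM:SS to the number of seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def minutes_between(departure, arrival):
    """Minutes from departure to arrival (YYYY-MM-DD HH:MM:SS values)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(departure, fmt)
    end = datetime.strptime(arrival, fmt)
    return round((end - start).total_seconds() / 60)


start = seconds_since_midnight("09:30:00")
finish = seconds_since_midnight("15:04:00")
elapsed = finish - start  # seconds from start to finish

flight_time = minutes_between("1999-09-30 15:04:00", "1999-09-30 17:34:00")
print(elapsed, flight_time)
```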
Selecting the attributes that can appear in classifiers
An optional final entry in the names file affects the way that See5 constructs classifiers. This entry takes one of the forms
attributes included:
attributes excluded:
followed by a comma-separated list of attribute names. The first form restricts the attributes used in classifiers to those specifically named; the second form specifies that classifiers must not use any of the named attributes.
Excluding an attribute from classifiers is not the same as ignoring the attribute (see `ignore' above). As an example, suppose that numeric attributes A and B are defined in the data, but background knowledge suggests that only their difference is important. The names file might then contain the following entries:
. . .
A: continuous.
B: continuous.
Diff := A - B.
. . .
attributes excluded: A, B.
In this example the attributes A and B could not be defined as ignore because the definition of Diff would then be invalid.
Data file
The second essential file, the application's data file (e.g. hypothyroid.data), provides information on the training cases from which See5 will extract patterns. The entry for each case consists of one or more lines that give the values for all explicitly-defined attributes. If the classes are listed in the first entry of the names file, the attribute values are followed by the case's class value. Values are separated by commas and the entry is optionally terminated by a period. Once again, anything on a line after a vertical bar `|' is ignored. (If the information for a case occupies more than one line, make sure that the line breaks occur after commas.)
For example, the first three cases from file hypothyroid.data are:
41,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,1.3,2.5,125,1.14,SVHC,negative,3733
23,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,4.1,2,102,?,other,negative,1442
46,M,f,f,f,f,N/A,f,f,f,f,f,f,f,f,f,0.98,?,109,0.91,other,negative,2965
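To inspect such entries outside See5, a single-line entry can be split along the rules just given. The Python sketch below is a simplified, hypothetical reader and not See5's own parser: it drops anything after an unescaped vertical bar, removes an optional terminating period, and splits on unescaped commas. Entries that span several lines are not handled. The entries `?' and `N/A', which appear where the table of cases shows unknown and not applicable values, are returned unchanged.

```python
import re


def parse_case(entry):
    """Split one single-line data entry into a list of value strings.

    Hypothetical, simplified reader; escaped characters are kept as
    written, including their backslashes.
    """
    # Drop a trailing comment introduced by an unescaped vertical bar.
    entry = re.split(r"(?<!\\)\|", entry, maxsplit=1)[0].strip()
    # Remove the optional period that terminates the entry.
    if entry.endswith(".") and not entry.endswith("\\."):
        entry = entry[:-1]
    # Split on commas that are not escaped.
    return [value.strip() for value in re.split(r"(?<!\\),", entry)]


line = "23,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,4.1,2,102,?,other,negative,1442"
values = parse_case(line)
print(len(values), values[19])
# prints: 23 ?
```

The entry has 23 values rather than 24 because FTI is defined by a formula and so does not appear in the data.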