See5: An Informal Tutorial 
Welcome to See5, a system that extracts informative patterns from data. The following sections show how to prepare data files for See5 and illustrate the options for using the system. 

In this tutorial, file names and See5 input or output appear in blue fixed-width font while file extensions and other general forms are shown highlighted in green. Buttons and options on the Windows GUI are in maroon. 

Preparing Data for See5
    Application files
    Names file
        What's in a name?
        Specifying the classes
        Explicitly-defined attributes
        Attributes defined by formulas
        Dates, times, and timestamps
        Selecting the attributes that can appear in classifiers
    Data file
    Test and cases files (optional)
    Costs file (optional)
User Interface
Constructing Classifiers
    Decision trees
    Discrete value subsets
    Rulesets
    Boosting
    Winnowing attributes
    Softening thresholds
    Advanced pruning options
    Sampling from large datasets
    Cross-validation trials
    Differential misclassification costs
Using Classifiers
Cross-Referencing Classifiers and Data
Generating Classifiers in Batch Mode
Linking to Other Programs

--------------------------------------------------------------------------------


Preparing Data for See5
We will illustrate See5 using a medical application -- mining a database of thyroid assays from the Garvan Institute of Medical Research, Sydney, to construct diagnostic rules for hypothyroidism. Each case concerns a single referral and contains information on the source of the referral, assays requested, patient data, and referring physician's comments. Here are three examples: 

        Attribute                 Case 1    Case 2    Case 3    .....

	age                       41        23        46
	sex                       F         F         M
	on thyroxine              f         f         f
	query on thyroxine        f         f         f
	on antithyroid medication f         f         f
	sick                      f         f         f
	pregnant                  f         f         not applicable
	thyroid surgery           f         f         f
	I131 treatment            f         f         f
	query hypothyroid         f         f         f
	query hyperthyroid        f         f         f
	lithium                   f         f         f
	tumor                     f         f         f
	goitre                    f         f         f
	hypopituitary             f         f         f
	psych                     f         f         f
	TSH                       1.3       4.1       0.98
	T3                        2.5       2         unknown
	TT4                       125       102       109
	T4U                       1.14      unknown   0.91
	FTI                       109       unknown   unknown
	referral source           SVHC      other     other
	diagnosis                 negative  negative  negative
	ID                        3733      1442      2965

This is exactly the sort of task for which See5 was designed. Each case belongs to one of a small number of mutually exclusive classes (negative, primary, secondary, compensated). Properties of every case that may be relevant to its class are provided, although some cases may have unknown or non-applicable values for some attributes. There are 24 attributes in this example, but See5 can deal with any number of attributes. 

See5's job is to find how to predict a case's class from the values of the other attributes. See5 does this by constructing a classifier that makes this prediction. As we will see, See5 can construct classifiers expressed as decision trees or as sets of rules. 

Application files
Every See5 application has a short name called a filestem; we will use the filestem hypothyroid for this illustration. All files read or written by See5 for an application have names of the form filestem.extension, where filestem identifies the application and extension describes the contents of the file. 
Here is a summary table of the extensions used by See5 (to be described in later sections): 


	Extension  Contents                                        Status
	names      description of the application's attributes     [required]
	data       cases used to generate a classifier             [required]
	test       unseen cases used to test a classifier          [optional]
	cases      cases to be classified subsequently             [optional]
	costs      differential misclassification costs            [optional]
	tree       decision tree classifier produced by See5       [output]
	rules      ruleset classifier produced by See5             [output]
	out        report produced when a classifier is generated  [output]
	set        settings used for the last classifier           [output]

The case of letters in both the filestem and extension is important -- file names APP.DATA, app.data, and App.Data are all different. The extensions must be written in lower case as shown above; otherwise See5 will not recognize the files for your application. 
If See5 cannot find your files even though the filestem and extensions are correct, please check that file extensions are not hidden on your computer. (If extensions are hidden and you write a text file from WordPad, it automatically adds the extension .txt, which makes the file invisible to See5.) Here's what to do: 

Double click "My Computer", select "Tools" (or "View" for Windows 98), then "Folder Options" and then the "View" tab. The box "Hide file extensions for known file types" should not be checked. If it is, uncheck it and click "Apply".
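The naming scheme above can be checked with a short script before See5 is started. The following Python sketch is not part of See5; the function names application_files and missing_required are hypothetical. It compares names exactly as they are stored in the folder, so a file called hypothyroid.DATA or hypothyroid.data.txt is reported as missing.

```python
from pathlib import Path

REQUIRED = ("names", "data")           # extensions every application needs
OPTIONAL = ("test", "cases", "costs")  # extensions See5 reads if present


def application_files(directory, filestem):
    """Map each input extension to True if filestem.extension exists."""
    # Compare against the directory listing so that the match is
    # case-sensitive even on a case-insensitive file system.
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    return {ext: f"{filestem}.{ext}" in present for ext in REQUIRED + OPTIONAL}


def missing_required(directory, filestem):
    """List the required extensions that were not found."""
    found = application_files(directory, filestem)
    return [ext for ext in REQUIRED if not found[ext]]
```

For example, missing_required(".", "hypothyroid") returns an empty list when both hypothyroid.names and hypothyroid.data are in the current folder.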
Names file
Two files are essential for all See5 applications and there are three further optional files, each identified by its extension. The first essential file is the names file (e.g. hypothyroid.names) that describes the attributes and classes. There are two important subgroups of attributes: 
The value of an explicitly-defined attribute is given directly in the data. A discrete attribute has a value drawn from a set of nominal values, a continuous attribute has a numeric value, a date attribute holds a calendar date, a time attribute holds a clock time, a timestamp attribute holds a date and time, and a label attribute serves only to identify a particular case. 
The value of an implicitly-defined attribute is specified by a formula. (Most attributes are explicitly defined, so you may never need implicitly-defined attributes.) 
The file hypothyroid.names looks like this: 

	diagnosis.                     | the target attribute

	age:                           continuous.
	sex:                           M, F.
	on thyroxine:                  f, t.
	query on thyroxine:            f, t.
	on antithyroid medication:     f, t.
	sick:                          f, t.
	pregnant:                      f, t.
	thyroid surgery:               f, t.
	I131 treatment:                f, t.
	query hypothyroid:             f, t.
	query hyperthyroid:            f, t.
	lithium:                       f, t.
	tumor:                         f, t.
	goitre:                        f, t.
	hypopituitary:                 f, t.
	psych:                         f, t.
	TSH:                           continuous.
	T3:                            continuous.
	TT4:                           continuous.
	T4U:                           continuous.
	FTI:=                          TT4 / T4U.
	referral source:               WEST, STMW, SVHC, SVI, SVHD, other.
	
	diagnosis:                     primary, compensated, secondary, negative.
	
	ID:                            label.

What's in a name?
Names, labels, classes, and discrete values are represented by arbitrary strings of characters, with some fine print: 
Tabs and spaces are permitted inside a name or value, but See5 collapses every sequence of these characters to a single space. 
Special characters (comma, colon, period, vertical bar `|') can appear in names and values, but must be prefixed by the escape character `\'. For example, the name "Filch, Grabbit, and Co." would be written as `Filch\, Grabbit\, and Co\.'. (Colons in times and periods in numbers do not need to be escaped.) 
Whitespace (blank lines, spaces, and tab characters) is ignored except inside a name or value and can be used to improve legibility. Unless it is escaped as above, the vertical bar `|' causes the remainder of the line to be ignored and is handy for including comments. This use of `|' should not occur inside a value. 
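If names or discrete values are generated by a program, the two rules above (collapsing whitespace and escaping special characters) can be applied automatically. This is a minimal Python sketch, not part of See5; escape_name is a hypothetical helper, and it is meant for names and discrete values only, since numbers and times do not need escaping. How See5 treats a backslash that is itself part of a name is not covered here.

```python
import re

SPECIAL = ",:.|"  # comma, colon, period, vertical bar


def escape_name(text):
    """Collapse tabs/spaces to one space and escape special characters."""
    collapsed = re.sub(r"[ \t]+", " ", text.strip())
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in collapsed)


print(escape_name("Filch, Grabbit, and Co."))
# prints: Filch\, Grabbit\, and Co\.
```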
Specifying the classes
The first entry in the names file specifies the classes in one of three formats: 
A list of class names separated by commas, e.g.
	primary, compensated, secondary, negative.
The name of a discrete attribute (the target attribute) that contains the class value, e.g.
	diagnosis.
The name of a continuous target attribute followed by a colon and one or more thresholds in increasing order and separated by commas. If there are t thresholds X1, X2, ..., Xt then the values of the attribute are divided into t+1 ranges:
	less than or equal to X1
	greater than X1 and less than or equal to X2
	. . .
	greater than Xt.
Each range defines a class, so there are t+1 classes. For example, a hypothetical entry

	age: 12, 19.

would define three classes: age <= 12, 12 < age <= 19, and age > 19. 
This first entry defining the classes is followed by definitions of the attributes in the order that they will be given for each case. 
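The way thresholds divide a continuous target into classes can be sketched in a few lines of Python. This only illustrates the ranges described above and is not See5 code; threshold_class and class_labels are hypothetical names.

```python
from bisect import bisect_left


def threshold_class(value, thresholds):
    """Return the 0-based index of the range that contains value.

    thresholds must be in increasing order. Index 0 is "<= X1" and
    index len(thresholds) is "> Xt".
    """
    return bisect_left(thresholds, value)


def class_labels(name, thresholds):
    """Describe the t+1 classes defined by t thresholds."""
    labels = [f"{name} <= {thresholds[0]}"]
    for lo, hi in zip(thresholds, thresholds[1:]):
        labels.append(f"{lo} < {name} <= {hi}")
    labels.append(f"{name} > {thresholds[-1]}")
    return labels


print(class_labels("age", [12, 19]))
# prints: ['age <= 12', '12 < age <= 19', 'age > 19']
```

With the entry age: 12, 19. an age of 12 falls in the first class, 13 and 19 in the second, and 20 in the third.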

Explicitly-defined attributes
The name of each explicitly-defined attribute is followed by a colon `:' and a description of the values taken by the attribute. There are eight possibilities: 
continuous
	The attribute takes numeric values.
date
	The attribute's values are dates in the form YYYY/MM/DD or YYYY-MM-DD, e.g. 1999/09/30 or 1999-09-30.
time
	The attribute's values are times in the form HH:MM:SS with values between 00:00:00 and 23:59:59.
timestamp
	The attribute's values are dates and times in the form YYYY/MM/DD HH:MM:SS or YYYY-MM-DD HH:MM:SS, e.g. 1999-09-30 15:04:00. (Note that there is a space separating the date and time.)
a comma-separated list of names
	The attribute takes discrete values, and these are the allowable values. The values may be prefaced by [ordered] to indicate that they are given in a meaningful ordering; otherwise they will be taken as unordered. For instance, the values low, medium, high are ordered, while meat, poultry, fish, vegetables are not. The former might be declared as
		grade: [ordered] low, medium, high.
	If the attribute values have a natural order, it is better to declare them as such so that See5 can exploit the ordering. (NB: The target attribute should not be declared as ordered.)
discrete N for some integer N
	The attribute has discrete, unordered values, but the values are assembled from the data itself; N is the maximum number of such values. This form can be handy for unordered discrete attributes with many values, but its use means that the data values cannot be checked. (NB: This form cannot be used for the target attribute.)
ignore
	The values of the attribute should be ignored.
label
	This attribute contains an identifying label for each case, such as an account number or an order code. The value of the attribute is ignored when classifiers are constructed, but is used when referring to individual cases. A label attribute can make it easier to locate errors in the data and to cross-reference results to individual cases. If there are two or more label attributes, only the last is used.
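Values for date, time, and timestamp attributes can be checked against the forms listed above before a data file is written. The following Python sketch is an illustration only; parse_value is a hypothetical helper, and Python's strptime is slightly more lenient than the forms above (it accepts a month written as 9 rather than 09), so See5 may reject some values that this check accepts.

```python
from datetime import datetime

# Accepted layouts for each attribute type, taken from the list above.
FORMATS = {
    "date": ("%Y/%m/%d", "%Y-%m-%d"),
    "time": ("%H:%M:%S",),
    "timestamp": ("%Y/%m/%d %H:%M:%S", "%Y-%m-%d %H:%M:%S"),
}


def parse_value(kind, text):
    """Parse text as a date, time, or timestamp, or raise ValueError."""
    for fmt in FORMATS[kind]:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"{text!r} is not a valid {kind}")
```

For instance, parse_value("timestamp", "1999-09-30 15:04:00") succeeds, while parse_value("time", "24:00:00") raises ValueError.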
Attributes defined by formulas
The name of each implicitly-defined attribute is followed by `:=' and then a formula defining the attribute value. The formula is written in the usual way, using parentheses where needed, and may refer to any attribute defined before this one. Constants in the formula can be numbers (written in decimal notation), dates, times, and discrete attribute values (enclosed in string quotes `"'). The operators and functions that can be used in the formula are 
	+, -, *, /, % (mod), ^ (meaning `raised to the power')
	>, >=, <, <=, =, <> or != (both meaning `not equal')
	and, or
	sin(...), cos(...), tan(...), log(...), exp(...), int(...) (meaning `integer part of')
The value of such an attribute is either continuous or true/false depending on the formula. For example, the attribute FTI above is continuous, since its value is obtained by dividing one number by another. The value of a hypothetical attribute such as 
	strange := referral source = "WEST" or age > 40.

would be either t or f since the value given by the formula is either true or false. 
If the value of the formula cannot be determined for a particular case because one or more of the attributes appearing in the formula have unknown or non-applicable values, the value of the implicitly-defined attribute is unknown. 
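The behaviour of the two example formulas, including the rule for unknown values, can be mimicked in Python. This is a sketch for illustration, not a description of See5's internals: cases are represented as dictionaries, None stands for an unknown or non-applicable value, and the function names are hypothetical.

```python
UNKNOWN = None  # stand-in for an unknown or non-applicable value


def fti(case):
    """FTI := TT4 / T4U, unknown if either input is unknown."""
    tt4, t4u = case.get("TT4"), case.get("T4U")
    if tt4 is UNKNOWN or t4u is UNKNOWN:
        return UNKNOWN
    return tt4 / t4u


def strange(case):
    """strange := referral source = "WEST" or age > 40, giving t or f."""
    source, age = case.get("referral source"), case.get("age")
    if source is UNKNOWN or age is UNKNOWN:
        return UNKNOWN
    return "t" if (source == "WEST" or age > 40) else "f"


case1 = {"age": 41, "TT4": 125, "T4U": 1.14, "referral source": "SVHC"}
case2 = {"age": 23, "TT4": 102, "T4U": UNKNOWN, "referral source": "other"}
print(fti(case1), strange(case1))  # a number close to 109.65, then t
print(fti(case2), strange(case2))  # None (unknown), then f
```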

Dates, times, and timestamps
Dates are stored by See5 as the number of days since a particular starting point so some operations on dates make sense. Thus, if we have attributes 
	d1: date.
	d2: date.

we could define 
	interval := d2 - d1.
	gap := d1 <= d2 - 7.
	d1-day-of-week := (d1 + 1) % 7 + 1.

interval then represents the number of days from d1 to d2 (non-inclusive) and gap would have a true/false value signaling whether d1 is at least a week before d2. The last definition is a slightly non-obvious way of determining the day of the week on which d1 falls, with values ranging from 1 (Monday) to 7 (Sunday). 
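The same three quantities can be reproduced in Python to check expected values for a few cases. This is only a sketch: the starting point that See5 uses to number days is not stated here, so the day of the week is obtained from Python's isoweekday(), which uses the same 1 (Monday) to 7 (Sunday) convention, rather than from the modulo formula. The function names are hypothetical.

```python
from datetime import date, timedelta


def interval(d1, d2):
    """Number of days from d1 to d2."""
    return (d2 - d1).days


def gap(d1, d2):
    """True if d1 is at least a week before d2."""
    return d1 <= d2 - timedelta(days=7)


def day_of_week(d):
    """1 (Monday) to 7 (Sunday)."""
    return d.isoweekday()


d1, d2 = date(1999, 9, 30), date(1999, 10, 10)
print(interval(d1, d2), gap(d1, d2), day_of_week(d1))
# prints: 10 True 4
```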
Similarly, times are stored as the number of seconds since midnight. If the names file includes 

	start: time.
	finish: time.
	elapsed := finish - start.

the value of elapsed is the number of seconds from start to finish. 
Timestamps are a little more complex. A timestamp is rounded to the nearest minute, but limitations on the precision of floating-point numbers mean that the values stored for timestamps from more than thirty years ago are approximate. If the names file includes 

	departure: timestamp.
	arrival: timestamp.
	flight time := arrival - departure.

the value of flight time is the number of minutes from departure to arrival. 
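The units involved (seconds for times, minutes for timestamps) are easy to confuse, so a small Python sketch may help. It is an illustration under assumptions: the reference point for timestamps is chosen arbitrarily here because only differences matter, Python's round() is used for the rounding to the nearest minute, and all function names are hypothetical.

```python
from datetime import datetime

REFERENCE = datetime(1970, 1, 1)  # arbitrary reference point (assumption)


def seconds_since_midnight(text):
    """Convert HH:MM:SS to the number of seconds since midnight."""
    t = datetime.strptime(text, "%H:%M:%S")
    return t.hour * 3600 + t.minute * 60 + t.second


def elapsed(start, finish):
    """elapsed := finish - start, in seconds."""
    return seconds_since_midnight(finish) - seconds_since_midnight(start)


def minutes_since_reference(text):
    """Convert YYYY-MM-DD HH:MM:SS to whole minutes since REFERENCE."""
    ts = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return round((ts - REFERENCE).total_seconds() / 60)


def flight_time(departure, arrival):
    """flight time := arrival - departure, in minutes."""
    return minutes_since_reference(arrival) - minutes_since_reference(departure)


print(elapsed("09:00:00", "17:30:00"))                            # prints: 30600
print(flight_time("1999-09-30 15:04:00", "1999-09-30 17:34:00"))  # prints: 150
```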
Selecting the attributes that can appear in classifiers
An optional final entry in the names file affects the way that See5 constructs classifiers. This entry takes one of the forms 
	attributes included:
	attributes excluded:

followed by a comma-separated list of attribute names. The first form restricts the attributes used in classifiers to those specifically named; the second form specifies that classifiers must not use any of the named attributes. 
Excluding an attribute from classifiers is not the same as ignoring the attribute (see `ignore' above). As an example, suppose that numeric attributes A and B are defined in the data, but background knowledge suggests that only their difference is important. The names file might then contain the following entries: 

	. . .
	A: continuous.
	B: continuous.
	Diff := A - B.
	   . . .
	attributes excluded: A, B.

In this example the attributes A and B could not be defined as ignore because the definition of Diff would then be invalid. 
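The effect of the two forms on the set of attributes available to a classifier can be sketched as follows. This Python fragment is illustrative only; classifier_attributes is a hypothetical name, and the sketch does not model how See5 treats the target or label attributes.

```python
def classifier_attributes(defined, included=None, excluded=None):
    """Return the attributes a classifier may use, in names-file order."""
    if included is not None and excluded is not None:
        raise ValueError("use either 'included' or 'excluded', not both")
    if included is not None:
        keep = set(included)
        return [a for a in defined if a in keep]
    drop = set(excluded or ())
    return [a for a in defined if a not in drop]


print(classifier_attributes(["A", "B", "Diff"], excluded=["A", "B"]))
# prints: ['Diff']
```

A and B are still read from the data, so Diff can be computed; they are simply unavailable when the classifier is built.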
Data file
The second essential file, the application's data file (e.g. hypothyroid.data), provides information on the training cases from which See5 will extract patterns. The entry for each case consists of one or more lines that give the values for all explicitly-defined attributes. If the classes are listed in the first line of the names file, the attribute values are followed by the case's class value. Values are separated by commas and the entry is optionally terminated by a period. Once again, anything on a line after a vertical bar `|' is ignored. (If the information for a case occupies more than one line, make sure that the line breaks occur after commas.) 
For example, the first three cases from file hypothyroid.data are: 

	41,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,1.3,2.5,125,1.14,SVHC,negative,3733
	23,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,4.1,2,102,?,other,negative,1442
	46,M,f,f,f,f,N/A,f,f,f,f,f,f,f,f,f,0.98,?,109,0.91,other,negative,2965
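To see how these entries line up with the names file, the following Python sketch splits a single-line entry into attribute/value pairs. It is not See5's own reader: parse_case is a hypothetical helper, it handles only entries that fit on one line, and it leaves values as raw strings. Note that FTI has no column because it is defined by a formula, and that ? and N/A appear where the table of cases shows unknown and not applicable.

```python
import re

# Explicitly-defined attributes of hypothyroid.names, in file order.
ATTRIBUTES = [
    "age", "sex", "on thyroxine", "query on thyroxine",
    "on antithyroid medication", "sick", "pregnant", "thyroid surgery",
    "I131 treatment", "query hypothyroid", "query hyperthyroid", "lithium",
    "tumor", "goitre", "hypopituitary", "psych", "TSH", "T3", "TT4", "T4U",
    "referral source", "diagnosis", "ID",
]


def parse_case(line):
    """Split one single-line data entry into an attribute -> value dict."""
    # Drop a comment introduced by an unescaped vertical bar.
    line = re.split(r"(?<!\\)\|", line, maxsplit=1)[0].strip()
    # Drop the optional terminating period (but not an escaped one).
    if line.endswith(".") and not line.endswith("\\."):
        line = line[:-1]
    # Split on commas that are not escaped with a backslash.
    values = [v.strip() for v in re.split(r"(?<!\\),", line)]
    if len(values) != len(ATTRIBUTES):
        raise ValueError(f"expected {len(ATTRIBUTES)} values, got {len(values)}")
    return dict(zip(ATTRIBUTES, values))
```

Applied to the second case above, parse_case gives "23" for age, "?" for T4U, and "1442" for ID.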