// TDDTInducer.java
package id3;
import java.util.*;
import shared.*;
import shared.Error;
/** Top-down decision-tree (TDDT) inducer induces decision trees
* top-down by building smaller training sets and inducing trees
* for them recursively. The decision tree built has categorizers
* at each node, and these determine how to branch, i.e., to
* which child to branch, or whether to classify. The common
* cases are: AttrCategorizers, which simply return a given value
* for an attribute in the node, and ThresholdCategorizers, which
* return 0 or 1 based on whether an attribute value is less than or
* greater than a given threshold (valid only for real attributes).
* The leaves are usually constant categorizers, i.e., they just
* return a constant value independent of the instance. <P>
* The induction algorithm calls best_split, an abstract
* method, to determine the best root split. Once the split has
* been chosen, the data in the node is split according to the
* categorizer best_split returns. A node is formed, and the
* algorithm is called recursively with each of the children.
* Once each child returns with a subtree, we connect them to the
* root we split. ID3Inducer, for example, implements the
* best_split using information gain, but other methods are
* possible. best_split() can return any categorizer, thus opening
* the possibility for oblique trees with perceptrons at nodes,
* recursive trees, etc. The leaves can also be of any
* classifier, thus perceptron-trees (Utgoff) can be created,
* or a nearest-neighbor within a leaf, etc. <P>
* Complexity : <P>
* The complexity of train() is proportional to the number of
* nodes in the resulting tree times the time for deciding on
* the split() categorizer (done by the derived classes).
* predict() takes time proportional to the sum of the
* categorizers time over the path from the root to a leaf node. <P>
* Enhancements : <P>
* We may speed things up by having an option to test only
* splits where the class label changes. For some measures
* (e.g., entropy), it can be shown that a split will never be
* made between two instances with the same class label
* (Fayyad IJCAI 93 page 1022, Machine Learning journal Vol 8,
* no 1, page 87, 1992). We may wish to discretize the real values
* first. By making them linear discrete, we can use the regular
* counters and things will be faster (note however that the split
* will usually remain special since it's a binary threshold split,
* not a multi-way split). <P>
* Another problem is with attributes that have many values, for
* example social-security-number. Computing all cut points can
* be very expensive. We may want to skip such attributes by
* claiming that each value must have at least some number of
* instances. Utgoff in ML94 (page 322) mentions that ID slows
* his system down considerably. The problem of course is that if
* you threshold, it sometimes makes sense to split on such
* attributes. Taken to an extreme, if we had a real "real-value,"
* all values would be different with probability 1, and hence we
* would skip such an attribute. <P>
* To speed things up, we may want to have an Inducer that
* accepts a decision tree and builds stuff in it (vs. getting
* a graph). Other options allow for doing the recursion by
* calling a function instead of creating the actual class.
* The advantage of the current method is that it allows a
* subclass to keep track of the number of levels (useful for
* lookahead or something). Yet another option is to "recycle"
* inducers by using our "this" and just changing the training set. <P>
* We currently split instances but keep the original structure,
* that is, we don't actually delete the attribute tested on. It
* may be faster in some cases to actually create a new List
* without the attribute. The disadvantage is that for multi-valued
* attributes we may wish to branch again, so we can't always delete.
* The same goes for tests which are not attributes (e.g.,
* conjunctions).
* @author James Louis 12/07/2000 Ported to Java.
* @author Steve Gustafson 12/07/2000 Ported to Java.
* @author Chia-Hsin Li 1/03/95 Added Options.
* @author Ronny Kohavi 9/06/93 Initial revision (.h,.c)
*/
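/* The recursive scheme described above, as commented pseudocode. This is an
 * illustrative sketch only: induce, leaf, node, partition, isPure, and
 * majorityCategory are hypothetical stand-ins, not the actual methods or
 * signatures of this library.
 *
 *   Node induce(InstanceList data, int level) {
 *       if (isPure(data) || level == maxLevel)
 *           return leaf(majorityCategory(data));      // constant categorizer
 *       Categorizer split = best_split(data);         // e.g. info gain in ID3Inducer
 *       Node root = node(split);
 *       for (InstanceList subset : partition(split, data))
 *           root.addChild(induce(subset, level + 1)); // recurse on each child
 *       return root;                                  // connect subtrees to the root
 *   }
 */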
abstract public class TDDTInducer extends Inducer
{
//ENUMS
/** Pruning method value.
*/
public static final byte none = 0; /* PruningMethod enum */
/** Pruning method value.
*/
public static final byte confidence = 1; /* */
/** Pruning method value.
*/
public static final byte penalty = 2; /* */
/** Pruning method value.
*/
public static final byte linear = 3; /* */
/** Pruning method value.
*/
public static final byte KLdistance = 4; /* */
/** Pruning method value.
*/
public static final byte lossConfidence = 5; /* */
/** Pruning method value.
*/
public static final byte lossLaplace = 6; /* */
/** LeafDistType value.
*/
public static final byte allOrNothing = 0; /* LeafDistType enum */
/** LeafDistType value.
*/
public static final byte frequencyCounts = 1; /* */
/** LeafDistType value.
*/
public static final byte laplaceCorrection = 2; /* */
/** LeafDistType value.
*/
public static final byte evidenceProjection = 3;/* */
/** Evaluation metric value.
*/
public static final byte error = 0; /* EvalMetric enum */
/** Evaluation metric value.
*/
public static final byte MSE = 1; /* */
/** Evaluation metric value.
*/
public static final byte logLoss = 2; /* */
//END ENUMS
private static int MIN_INSTLIST_DRIBBLE = 5000;
private static String MAX_LEVEL_HELP = "The maximum number of levels to grow. 0 "
+"implies no limit.";
private static int DEFAULT_MAX_LEVEL = 0;
private static String LB_MSW_HELP = "This option specifies the value of lower bound "
+"of the weight while calculating the minimum split "
+"(overrides weight option). Set to 0 to have the value determined "
+"automatically depending on the total weight of the training set.";
private static double DEFAULT_LB_MSW = 1;
private static String UB_MSW_HELP = "This option specifies the value of upper bound "
+"of the weight while calculating the minimum split (overrides lower bound).";
private static double DEFAULT_UB_MSW = 25;
private static String MS_WP_HELP = "This option chooses the value of "
+"the weight percent while calculating the minimum split.";
private static double DEFAULT_MS_WP = 0;
private static String NOM_LBO_HELP = "This option specifies if only the lower bound "
+"will be used for calculating the minimum split for nominal-valued "
+"attributes.";
private static boolean DEFAULT_NOM_LBO = false;
private static String DEBUG_HELP = "This option specifies whether to display the "
+"debug information while displaying the graph.";
private static boolean DEFAULT_DEBUG = false;
/** Indicates whether edges representing unknown values should be processed. TRUE
indicates unknown edges should be processed, FALSE otherwise. **/
private static String UNKNOWN_EDGES_HELP = "This option specifies whether or not to "
+"allow outgoing UNKNOWN edges from each node. ";
private static boolean DEFAULT_UNKNOWN_EDGES = true;
// This option is currently not enabled.
//private static String EMPTY_NODE_PARENT_DIST_HELP = "Should empty nodes get the "
//+"distribution of the parent or zeros";
private static boolean DEFAULT_EMPTY_NODE_PARENT_DIST = false;
private static String PARENT_TIE_BREAKING_HELP = "Should ties be broken in favor of "
+"the majority category of the parent";
private static boolean DEFAULT_PARENT_TIE_BREAKING = true;
// The following option is currently not enabled.
//const MString PRUNING_BRANCH_REPLACEMENT_HELP = "Should replacing a node "
//"with its largest subtree be allowed during pruning";
private static boolean DEFAULT_PRUNING_BRANCH_REPLACEMENT = false;
private static String ADJUST_THRESHOLDS_HELP = "Should threshold values be adjusted "
+"to the values of instances";
private static boolean DEFAULT_ADJUST_THRESHOLDS = false;
private static String PRUNING_METHOD_HELP = "Which algorithm should be used for "
+"pruning. (If not NONE and PRUNING_FACTOR is 0, a node will be made a leaf "
+"if its potential children would not improve the error count)";
private static byte DEFAULT_PRUNING_METHOD = confidence;
private static String PRUNING_FACTOR_HELP = "Pruning factor in standard deviations. "
+"(high value -> more pruning), zero is no pruning, 2.5 is heavy pruning";
private static double DEFAULT_PRUNING_FACTOR = 0.0; //change this to .6925
private static String CONT_MDL_ADJUST_HELP = "When TRUE, mutual information for "
+"real attributes is lowered based on MDL";
private static boolean DEFAULT_CONT_MDL_ADJUST = false;
private static String SMOOTH_INST_HELP = "Set the number of values on each side "
+"of a given entropy value to use for smoothing (0 turns off smoothing).";
private static int DEFAULT_SMOOTH_INST = 0;
private static String SMOOTH_FACTOR_HELP = "Set the constant for the exponential "
+"distribution used to smooth.";
private static double DEFAULT_SMOOTH_FACTOR = 0.75;
private static String LEAF_DIST_TYPE_HELP = "This option selects the type of "
+"distribution to create at leaf nodes. All-or-nothing picks the majority "
+"category and places all weight there. This is the default. "
+"Frequency-counts compute the distribution as normalized counts of the "
+"occurrence of each class at this leaf. Laplace-correction uses the "
+"frequency counts but applies a laplace correction of M_ESTIMATE_FACTOR. "
+"Evidence-projection uses the evidence projection algorithm to correct "
+"the frequency counts. EVIDENCE_FACTOR is the evidence scaling factor "
+"for the correction.";
private static byte defaultLeafDistType = allOrNothing;
private static String M_ESTIMATE_FACTOR_HELP = "This option determines the "
+"m-estimate factor applied by the Laplace correction when computing "
+"leaf distributions";
private static double DEFAULT_LEAF_M_ESTIMATE_FACTOR = 0.0;
private static String EVIDENCE_FACTOR_HELP = "This option determines the factor "
+"by which to scale the log number of instances when computing an "
+"evidence projection";
private static double DEFAULT_LEAF_EVIDENCE_FACTOR = 1.0;
private static String EVAL_METRIC_HELP = "The measure by which the induced tree will "
+"be evaluated. Changing this may affect induction and pruning.";
private static byte DEFAULT_EVALUATION_METRIC = error;
private static int totalNodesNum = 0;
private static int callCount = 0;
private static int totalAttr = 0;
private int level;
private CGraph cgraph;
private DTCategorizer decisionTreeCat;
private double totalInstWeight;
private boolean haveContinuousAttributes, errorprune = false;
private TDDTOptions tddtOptions;
/** Constructor.
* @param dscr The description of the inducer.
*/
public TDDTInducer(String dscr)
{
super(dscr);
tddtOptions = new TDDTOptions();
CGraph aCgraph = null;
level = 0;
cgraph = aCgraph;
decisionTreeCat = null;
totalInstWeight = -1; //illegal value
//this is arbitrary; no schema yet
haveContinuousAttributes = false;
tddtOptions.maxLevel = DEFAULT_MAX_LEVEL;
tddtOptions.lowerBoundMinSplitWeight = DEFAULT_LB_MSW;
tddtOptions.upperBoundMinSplitWeight = DEFAULT_UB_MSW;
tddtOptions.minSplitWeightPercent = DEFAULT_MS_WP;
tddtOptions.nominalLBoundOnly = DEFAULT_NOM_LBO;
tddtOptions.debug = DEFAULT_DEBUG;
tddtOptions.unknownEdges = DEFAULT_UNKNOWN_EDGES;
tddtOptions.splitScoreCriterion = SplitScore.defaultSplitScoreCriterion;
tddtOptions.emptyNodeParentDist = DEFAULT_EMPTY_NODE_PARENT_DIST;
tddtOptions.parentTieBreaking = DEFAULT_PARENT_TIE_BREAKING;
tddtOptions.pruningMethod = DEFAULT_PRUNING_METHOD;
tddtOptions.pruningBranchReplacement = DEFAULT_PRUNING_BRANCH_REPLACEMENT;
tddtOptions.adjustThresholds = DEFAULT_ADJUST_THRESHOLDS;
tddtOptions.pruningFactor = DEFAULT_PRUNING_FACTOR;
tddtOptions.contMDLAdjust = DEFAULT_CONT_MDL_ADJUST;
tddtOptions.smoothInst = DEFAULT_SMOOTH_INST;
tddtOptions.smoothFactor = DEFAULT_SMOOTH_FACTOR;
tddtOptions.leafDistType = defaultLeafDistType;
tddtOptions.MEstimateFactor = DEFAULT_LEAF_M_ESTIMATE_FACTOR;
tddtOptions.evidenceFactor = DEFAULT_LEAF_EVIDENCE_FACTOR;
tddtOptions.evaluationMetric = DEFAULT_EVALUATION_METRIC;
}
/** Constructor.
* @param descr The description of this inducer.
* @param aCgraph The CGraph that will hold the decision tree.
*/
public TDDTInducer(String descr, CGraph aCgraph)
{
super(descr);
level = 0;
tddtOptions = new TDDTOptions();
cgraph = aCgraph;
decisionTreeCat = null;
totalInstWeight = -1; //illegal value
haveContinuousAttributes = false; //arbitrary; no schema yet
// Option defaults mirror the first constructor.
tddtOptions.maxLevel = DEFAULT_MAX_LEVEL;
tddtOptions.lowerBoundMinSplitWeight = DEFAULT_LB_MSW;
tddtOptions.upperBoundMinSplitWeight = DEFAULT_UB_MSW;
tddtOptions.minSplitWeightPercent = DEFAULT_MS_WP;
tddtOptions.nominalLBoundOnly = DEFAULT_NOM_LBO;
tddtOptions.debug = DEFAULT_DEBUG;
tddtOptions.unknownEdges = DEFAULT_UNKNOWN_EDGES;
tddtOptions.splitScoreCriterion = SplitScore.defaultSplitScoreCriterion;
tddtOptions.emptyNodeParentDist = DEFAULT_EMPTY_NODE_PARENT_DIST;
tddtOptions.parentTieBreaking = DEFAULT_PARENT_TIE_BREAKING;
tddtOptions.pruningMethod = DEFAULT_PRUNING_METHOD;
tddtOptions.pruningBranchReplacement = DEFAULT_PRUNING_BRANCH_REPLACEMENT;
tddtOptions.adjustThresholds = DEFAULT_ADJUST_THRESHOLDS;
tddtOptions.pruningFactor = DEFAULT_PRUNING_FACTOR;
tddtOptions.contMDLAdjust = DEFAULT_CONT_MDL_ADJUST;
tddtOptions.smoothInst = DEFAULT_SMOOTH_INST;
tddtOptions.smoothFactor = DEFAULT_SMOOTH_FACTOR;
tddtOptions.leafDistType = defaultLeafDistType;
tddtOptions.MEstimateFactor = DEFAULT_LEAF_M_ESTIMATE_FACTOR;
tddtOptions.evidenceFactor = DEFAULT_LEAF_EVIDENCE_FACTOR;
tddtOptions.evaluationMetric = DEFAULT_EVALUATION_METRIC;
}
}
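/* How a subclass plugs in: it overrides best_split to choose the root
 * categorizer. An illustrative sketch only (class and method names other
 * than ID3Inducer are hypothetical, and the real best_split signature
 * may differ):
 *
 *   class GainInducer extends TDDTInducer {
 *       GainInducer(String descr) { super(descr); }
 *       Categorizer best_split(InstanceList data) {
 *           // like ID3Inducer: branch on the attribute with maximal
 *           // information gain; returning a ThresholdCategorizer or a
 *           // perceptron here instead would give oblique/perceptron trees
 *           return new AttrCategorizer(bestGainAttribute(data));
 *       }
 *   }
 */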