<dt>split_point<dd>Used internally by the training algorithm.
</dl>
<hr><h3><a name="decl_CvDTreeNode">CvDTreeNode</a></h3>
<p class="Blurb">Decision tree node</p>
<pre>
struct CvDTreeNode
{
    int class_idx;
    int Tn;
    double value;

    CvDTreeNode* parent;
    CvDTreeNode* left;
    CvDTreeNode* right;

    CvDTreeSplit* split;

    int sample_count;
    int depth;
    ...
};
</pre>
<p><dl>
<dt>value<dd>The value assigned to the tree node. It is either a class label,
or the estimated function value.
<dt>class_idx<dd>The class index assigned to the node, normalized
to the 0..class_count-1 range. It is used internally in classification trees
and tree ensembles.
<dt>Tn<dd>The tree index in an ordered sequence of trees. The indices are used during and
after the pruning procedure. The root node has the maximum value <code>Tn</code>
of the whole tree, child nodes have <code>Tn</code> less than or equal to
the parent's <code>Tn</code>,
and the nodes with <code>Tn≤<a href="#decl_CvDTree">CvDTree</a>::pruned_tree_idx</code> are not taken
into consideration at the prediction stage (the corresponding branches are
considered cut off), even
if they have not been physically deleted from the tree at the pruning stage.
<dt>parent, left, right<dd>Pointers to the parent node, left and right child nodes.
<dt>split<dd>Pointer to the first (primary) split.
<dt>sample_count<dd>The number of samples that fall into the node at the training stage.
It is used to resolve the difficult case when the variable of the primary split
is missing and the variables of all the surrogate splits are missing too:
the sample is directed to the left if <code>left->sample_count>right->sample_count</code> and
to the right otherwise (as illustrated in the traversal sketch below).
<dt>depth<dd>The node depth. The root node depth is 0, and a child node's depth is its parent's depth plus 1.
</dl>
<p>The numerous other fields of <code>CvDTreeNode</code> are used internally at the training stage.</p>
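<p>
For illustration, below is a minimal sketch of how the prediction stage descends
from the root to a leaf using these fields. It assumes the <code>CvDTreeSplit</code>
layout from the previous subsection (fields <code>var_idx</code>, <code>ord.c</code>,
<code>inversed</code> and <code>next</code>), handles ordered variables only, and takes
<code>pruned_tree_idx</code> and a per-sample missing mask as explicit arguments.
The helper name <code>descend</code> is hypothetical; in practice one simply calls
<code>CvDTree::predict</code>.</p>
<pre>
// Sketch only: walk from the root to a leaf the way prediction does.
// Categorical splits (the subset[] branch of CvDTreeSplit) are omitted.
const CvDTreeNode* descend( const CvDTreeNode* node, const float* sample,
                            const uchar* missing, int pruned_tree_idx )
{
    // nodes with Tn <= pruned_tree_idx are treated as cut off,
    // so such a node acts as a leaf even if it has children
    while( node->left && node->Tn > pruned_tree_idx )
    {
        int dir = 0;
        // try the primary split first, then the surrogate splits
        for( const CvDTreeSplit* split = node->split; split && !dir;
             split = split->next )
        {
            int vi = split->var_idx;
            if( missing && missing[vi] )
                continue; // this variable is missing, try the next surrogate
            dir = sample[vi] <= split->ord.c ? -1 : 1;
            if( split->inversed )
                dir = -dir;
        }
        if( !dir ) // all the split variables are missing: sample_count rule
            dir = node->left->sample_count > node->right->sample_count ? -1 : 1;
        node = dir < 0 ? node->left : node->right;
    }
    return node; // node->value is the class label or the function value
}
</pre>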
<hr><h3><a name="decl_CvDTreeParams">CvDTreeParams</a></h3>
<p class="Blurb">Decision tree training parameters</p>
<pre>
struct CvDTreeParams
{
    int max_categories;
    int max_depth;
    int min_sample_count;
    int cv_folds;
    bool use_surrogates;
    bool use_1se_rule;
    bool truncate_pruned_tree;
    float regression_accuracy;
    const float* priors;

    CvDTreeParams() : max_categories(10), max_depth(INT_MAX), min_sample_count(10),
        cv_folds(10), use_surrogates(true), use_1se_rule(true),
        truncate_pruned_tree(true), regression_accuracy(0.01f), priors(0)
    {}

    CvDTreeParams( int _max_depth, int _min_sample_count,
                   float _regression_accuracy, bool _use_surrogates,
                   int _max_categories, int _cv_folds,
                   bool _use_1se_rule, bool _truncate_pruned_tree,
                   const float* _priors );
};
</pre>
<p><dl>
<dt>max_depth<dd>This parameter specifies the maximum possible depth of the
tree. That is, the training algorithm attempts to split a node while its depth
is less than <code>max_depth</code>. The actual depth may be smaller
if the other termination criteria are met
(see the outline of the training procedure at the beginning of the section),
and/or if the tree is pruned.
<dt>min_sample_count<dd>A node is not split if the number of samples directed to the node
is less than the parameter value.
<dt>regression_accuracy<dd>Another stopping criterion, used only for regression trees. As soon as
the estimated node value differs from the responses of the node's training samples
by less than the parameter value, the node is not split further.
<dt>use_surrogates<dd>If <code>true</code>, surrogate splits are built. Surrogate splits are
needed to handle missing measurements and for variable importance estimation.
<dt>max_categories<dd>If a discrete variable, on which the training procedure tries to make a split,
takes more than <code>max_categories</code> values, finding the precise best subset
may take a very long time (the exact algorithm is exponential in the number of values).
Instead, many decision tree engines (including ML) find a sub-optimal split
in this case by clustering all the samples into <code>max_categories</code> clusters
(i.e. some categories are merged together).<br>
Note that this technique is used only in <code>N(>2)</code>-class classification problems.
In case of regression and 2-class classification the optimal split can be found efficiently
without employing clustering, so the parameter is not used in these cases.
<dt>cv_folds<dd>If this parameter is >1, the tree is pruned using <code>cv_folds</code>-fold
cross validation.
<dt>use_1se_rule<dd>If <code>true</code>, the tree is truncated a bit more by the pruning procedure.
This makes the tree more compact and more resistant to the training data noise,
but a bit less accurate.
<dt>truncate_pruned_tree<dd>If <code>true</code>, the cut-off nodes
(with <code>Tn</code>≤<code>CvDTree::pruned_tree_idx</code>) are physically
removed from the tree. Otherwise they are kept, and by decreasing
<code>CvDTree::pruned_tree_idx</code> (e.g. setting it to -1)
it is still possible to get the results from the original unpruned
(or less aggressively pruned) tree.
<dt>priors<dd>The array of a priori class probabilities, sorted by the class label value.
The parameter can be used to tune the decision tree preferences toward a certain class.
For example, if users want to detect some rare anomaly occurrence, the training
set will likely contain many more normal cases than anomalies, so
a very good classification performance can be achieved just by classifying
every case as normal. To avoid this, the priors can be specified, with
the anomaly probability artificially increased (up to 0.5 or even greater),
so that the weight of a misclassified anomaly becomes much bigger
and the tree is adjusted properly.
<p>A note about memory management: the field <code>priors</code>
is a pointer to an array of floats. The array should be allocated by the user and
can be released right after the <code>CvDTreeParams</code> structure is passed to the
<a href="#decl_CvDTreeTrainData">CvDTreeTrainData</a> or
<a href="#decl_CvDTree">CvDTree</a> constructors/methods (the methods
make a copy of the array).
</dl>
<p>
The structure contains all the decision tree training parameters.
There is a default constructor that initializes all the parameters with the default values
tuned for a standalone classification tree. Any of the parameters can then be overridden,
or the structure may be fully initialized using the advanced variant of the constructor,
as in the sketch below.</p>
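<p>
For example (a sketch for the 2-class anomaly-detection scenario described above;
the variable names are illustrative), the parameters could be prepared as follows.
Note that the <code>priors</code> array may live on the caller's stack because,
as noted above, the constructors/methods copy it:</p>
<pre>
// a priori class probabilities, sorted by the class label value:
// class 0 = "normal", class 1 = "anomaly"; the anomaly prior is
// artificially raised, as described above
float priors[] = { 0.3f, 0.7f };

CvDTreeParams params( 8,      // max_depth
                      20,     // min_sample_count
                      0.01f,  // regression_accuracy (ignored for classification)
                      true,   // use_surrogates (needed for missing data)
                      10,     // max_categories
                      10,     // cv_folds (prune with 10-fold cross-validation)
                      true,   // use_1se_rule
                      false,  // truncate_pruned_tree (keep the cut-off nodes)
                      priors );

// alternatively, start from the defaults and override individual fields
CvDTreeParams params2;
params2.max_depth = 8;
params2.priors = priors;
</pre>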
<hr><h3><a name="decl_CvDTreeTrainData">CvDTreeTrainData</a></h3>
<p class="Blurb">Decision tree training data and shared data for tree ensembles</p>
<pre>
struct CvDTreeTrainData
{
    CvDTreeTrainData();
    CvDTreeTrainData( const CvMat* _train_data, int _tflag,
                      const CvMat* _responses, const CvMat* _var_idx=0,
                      const CvMat* _sample_idx=0, const CvMat* _var_type=0,
                      const CvMat* _missing_mask=0,
                      const CvDTreeParams& _params=CvDTreeParams(),
                      bool _shared=false, bool _add_labels=false );
    virtual ~CvDTreeTrainData();

    virtual void set_data( const CvMat* _train_data, int _tflag,
                           const CvMat* _responses, const CvMat* _var_idx=0,
                           const CvMat* _sample_idx=0, const CvMat* _var_type=0,
                           const CvMat* _missing_mask=0,
                           const CvDTreeParams& _params=CvDTreeParams(),
                           bool _shared=false, bool _add_labels=false,
                           bool _update_data=false );

    virtual void get_vectors( const CvMat* _subsample_idx,
                              float* values, uchar* missing, float* responses,
                              bool get_class_idx=false );

    virtual CvDTreeNode* subsample_data( const CvMat* _subsample_idx );

    virtual void write_params( CvFileStorage* fs );
    virtual void read_params( CvFileStorage* fs, CvFileNode* node );

    // release all the data
    virtual void clear();

    int get_num_classes() const;
    int get_var_type(int vi) const;
    int get_work_var_count() const;

    virtual int* get_class_labels( CvDTreeNode* n );
    virtual float* get_ord_responses( CvDTreeNode* n );
    virtual int* get_labels( CvDTreeNode* n );
    virtual int* get_cat_var_data( CvDTreeNode* n, int vi );
    virtual CvPair32s32f* get_ord_var_data( CvDTreeNode* n, int vi );
    virtual int get_child_buf_idx( CvDTreeNode* n );

    ////////////////////////////////////

    virtual bool set_params( const CvDTreeParams& params );
    virtual CvDTreeNode* new_node( CvDTreeNode* parent, int count,
                                   int storage_idx, int offset );

    virtual CvDTreeSplit* new_split_ord( int vi, float cmp_val,
                                         int split_point, int inversed, float quality );
    virtual CvDTreeSplit* new_split_cat( int vi, float quality );
    virtual void free_node_data( CvDTreeNode* node );
    virtual void free_train_data();
    virtual void free_node( CvDTreeNode* node );

    int sample_count, var_all, var_count, max_c_count;
    int ord_var_count, cat_var_count;
    bool have_labels, have_priors;
    bool is_classifier;

    int buf_count, buf_size;
    bool shared;

    CvMat* cat_count;
    CvMat* cat_ofs;
    CvMat* cat_map;

    CvMat* counts;
    CvMat* buf;
    CvMat* direction;
    CvMat* split_buf;

    CvMat* var_idx;
    CvMat* var_type; // i-th element =
                     //   k<0  - ordered
                     //   k>=0 - categorical, see k-th element of cat_* arrays
    CvMat* priors;

    CvDTreeParams params;

    CvMemStorage* tree_storage;
    CvMemStorage* temp_storage;

    CvDTreeNode* data_root;

    CvSet* node_heap;
    CvSet* split_heap;
    CvSet* cv_heap;
    CvSet* nv_heap;

    CvRNG rng;
};
</pre>
<p>
This structure is mostly used internally for storing both standalone trees and tree ensembles
efficiently. Basically, it contains three types of information:
<ol>
<li>The training parameters, <a href="#decl_CvDTreeParams">CvDTreeParams</a> instance.
<li>The training data, preprocessed in order to find the best splits more efficiently.
For tree ensembles this preprocessed data is reused by all the trees.
Additionally, the training data characteristics that are shared by
all trees in the ensemble are stored here: variable types,
the number of classes, class label compression map etc.
<li>Buffers, memory storages for tree nodes, splits and other elements of the trees constructed.
</ol>
<p>
There are two ways of using this structure.
In simple cases (e.g. a standalone tree,
or a ready-to-use "black box" tree ensemble from ML, like <a href=#ch_randomforest>Random Trees</a>
or <a href=#ch_boosting>Boosting</a>) there is no need to care, or even to know, about the structure -
just construct the needed statistical model, train it and use it. The <code>CvDTreeTrainData</code>
structure is constructed and used internally. However, for custom tree algorithms
or other sophisticated cases, the structure may be constructed and used explicitly.
The scheme is the following (a sketch is given after the list):
<ol>
<li>The structure is initialized using the default constructor, followed by
<code>set_data</code> (or it is built using the full form of the constructor).
The parameter <code>_shared</code> must be set to <code>true</code>.
<li>One or more trees are trained using this data, see the special form of the method
<a href="#decl_CvDTree_train">CvDTree::train</a>.
<li>Finally, the structure can be released only after all the trees that use it are released.
</ol>
</p>
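<p>
Here is a sketch of the explicit scheme; <code>train_data</code>, <code>responses</code>
and <code>subsample_idx</code> are placeholder <code>CvMat</code> pointers assumed to be
prepared by the caller, and error handling is omitted:</p>
<pre>
// 1. build the shared training data (_shared must be true)
CvDTreeTrainData* shared_data = new CvDTreeTrainData(
    train_data, CV_ROW_SAMPLE, responses,
    0 /* var_idx */, 0 /* sample_idx */, 0 /* var_type */, 0 /* missing_mask */,
    CvDTreeParams(), true /* _shared */ );

// 2. train one or more trees on (subsamples of) the same data,
// using the special form of CvDTree::train
CvDTree tree1, tree2;
tree1.train( shared_data, 0 );             // 0: use all the samples
tree2.train( shared_data, subsample_idx ); // use a subsample

// ... use the trees ...

// 3. release the trees first, only then the shared data
tree1.clear();
tree2.clear();
delete shared_data;
</pre>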
<hr><h3><a name="decl_CvDTree">CvDTree</a></h3>