Posted by: ashun (阿顺), Board: DataMining
Subject: A Brief Introduction to Data Mining Terms (3)
Posted from: Nanjing University Lily BBS (Wed Aug 29 14:44:23 2001)
CART
Classification And Regression Trees. CART recursively splits the data on the
independent variables into small groups and fits a constant function to each
group. In categorical (classification) trees, the constant takes one of a
small, finite set of values (e.g., Y or N; low, medium, or high). In
regression trees, the constant is the mean value of the response within each
small connected data set.
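As a rough illustration of the regression case, the sketch below searches for
the single split on one variable that minimizes squared error and fits the mean
on each side. The function name best_split and the sample data are made up for
this example; a real CART implementation repeats the search recursively over
all variables and adds stopping and pruning rules.

```python
def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) for the best single split."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal x values cannot be separated by a threshold
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Squared error when each side is fit by its own mean (a constant).
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]


# Hypothetical data: the response jumps between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, left_mean, right_mean = best_split(xs, ys)
```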
categorical data
Categorical data fits into a small number of discrete categories (as opposed
to continuous). Categorical data is either non-ordered (nominal), such as
gender or city, or ordered (ordinal), such as high, medium, or low
temperatures.
CHAID
An algorithm for fitting categorical trees. It relies on the chi-squared
statistic to split the data into small connected data sets.
chi-squared
A statistic that assesses how well a model fits the data. In data mining, it
is most commonly used to find homogeneous subsets for fitting categorical
trees, as in CHAID.
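For a concrete sense of the statistic, here is a minimal sketch that computes
Pearson's chi-squared for a contingency table of counts, comparing each
observed count with the count expected if rows and columns were independent.
The function name and the table are illustrative assumptions, and the sketch
assumes no row or column total is zero.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are groups, columns are outcomes (Y, N).
dependent = chi_squared([[30, 10], [10, 30]])
independent = chi_squared([[10, 10], [10, 10]])
```

A larger value means the observed counts depart further from independence, so
a split that produces a large statistic separates the outcome classes well.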
classification
Refers to the data mining problem of attempting to predict the category of
categorical data by building a model based on some predictor variables.
classification tree
A decision tree that places categorical variables into classes.
cleaning (cleansing)
Refers to a step in preparing data for a data mining activity. Obvious data
errors are detected and corrected (e.g., improbable dates) and missing data is
replaced.
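A minimal sketch of such a step, under stated assumptions: records are dicts
with hypothetical fields birth_date and income, an improbable date (in the
future or before 1880, an arbitrary cutoff) is reset to missing rather than
guessed, and missing incomes are replaced by the mean of the known ones. Real
cleaning rules depend on the data set.

```python
from datetime import date


def clean(records, today):
    """Blank out improbable birth dates and fill missing incomes with the mean."""
    known = [r["income"] for r in records if r["income"] is not None]
    mean_income = sum(known) / len(known)  # assumes at least one known income
    cleaned = []
    for r in records:
        birth = r["birth_date"]
        # Assumed rule: a future date or a year before 1880 is an obvious error.
        if birth is not None and (birth > today or birth.year < 1880):
            birth = None
        income = r["income"] if r["income"] is not None else mean_income
        cleaned.append({"birth_date": birth, "income": income})
    return cleaned


# Hypothetical records: one impossible date, one missing income.
records = [
    {"birth_date": date(1970, 1, 1), "income": 100.0},
    {"birth_date": date(2090, 1, 1), "income": None},
    {"birth_date": None, "income": 300.0},
]
cleaned = clean(records, today=date(2001, 8, 29))
```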
clustering
Clustering algorithms find groups of items that are similar. For example,
clustering could be used by an insurance company to group customers according
to income, age, types of policies purchased, and prior claims experience.
Clustering divides a data set so that records with similar content are in the
same group, and groups are as different as possible from each other. Since
the categories are unspecified, this is sometimes referred to as unsupervised
learning.
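The entry does not name a particular algorithm. As one possible illustration,
the sketch below uses k-means (an assumption chosen for this example, not the
only clustering method) on made-up (age, income) pairs, starting from two
hand-picked initial centers.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then re-center."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
            )
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[k]
            for k, g in enumerate(groups)
        ]
    return centers, groups


# Hypothetical customers as (age, income in thousands).
points = [(25, 30), (27, 32), (26, 31), (60, 90), (62, 95), (61, 92)]
centers, groups = kmeans(points, centers=[(25, 30), (60, 90)])
```

No category labels are supplied; the two groups emerge only from the
distances between records.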
confidence
Confidence of rule "B given A" is a measure of how likely it is that B occurs
when A has occurred. It is expressed as a percentage, with 100% meaning B
always occurs if A has occurred. Statisticians refer to this as the
conditional probability of B given A. When used with association rules, the
term confidence is observational rather than predictive. (Statisticians also
use this term in an unrelated way. There are ways to estimate an interval, and
the probability that the interval contains the true value of a parameter is
called the interval confidence. So a 95% confidence interval for the mean has
a probability of 0.95 of covering the true value of the mean.)
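In the association-rule sense, confidence can be computed by counting. The
sketch below assumes transactions are represented as Python sets of item
names; the function name and the sample baskets are made up for illustration.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain
    the consequent, i.e. the observed conditional frequency of B given A."""
    with_a = [t for t in transactions if antecedent <= t]
    if not with_a:
        raise ValueError("antecedent never occurs")
    with_both = [t for t in with_a if consequent <= t]
    return len(with_both) / len(with_a)


# Hypothetical baskets.
transactions = [
    {"pick", "shovel"},
    {"pick"},
    {"pick", "shovel", "pan"},
    {"shovel"},
    {"pick", "pan"},
]
conf = confidence(transactions, {"pick"}, {"shovel"})
```

Here four baskets contain a pick and two of those also contain a shovel, so
the rule "shovel given pick" has a confidence of 50%.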
confusion matrix
A confusion matrix shows the counts of the actual versus predicted class
values. It shows not only how well the model predicts, but also presents the
details needed to see exactly where things may have gone wrong.
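A small sketch of how such a matrix might be tallied, with rows for actual
classes and columns for predicted classes (the orientation and the sample
labels are assumptions made for this example):

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (labels, matrix) where matrix[i][j] counts records whose actual
    class is labels[i] and whose predicted class is labels[j]."""
    counts = Counter(zip(actual, predicted))
    labels = sorted(set(actual) | set(predicted))
    return labels, [[counts[(a, p)] for p in labels] for a in labels]


# Hypothetical results from a two-class model.
actual = ["Y", "Y", "N", "N", "Y", "N"]
predicted = ["Y", "N", "N", "Y", "Y", "N"]
labels, matrix = confusion_matrix(actual, predicted)
```

The diagonal holds the correct predictions; each off-diagonal cell shows which
class was mistaken for which.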
consequent
When an association between two variables is defined, the second item (or
right-hand side) is called the consequent. For example, in the relationship
"When a prospector buys a pick, he buys a shovel 14% of the time," "buys a
shovel" is the consequent.
continuous
Continuous data can have any value in an interval of real numbers. That is,
the value does not have to be an integer. Continuous is the opposite of
discrete or categorical.
cross validation
A method of estimating the accuracy of a classification or regression model.
The data set is divided into several parts, with each part in turn used to
test a model fitted to the remaining parts.
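The procedure can be sketched as follows. The helper names, the interleaved
way of forming the parts, and the trivial "predict the mean" model are all
assumptions for illustration; in practice the data is usually shuffled first
and a real model is plugged in for fit and score.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists; each part is the test set exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(xs, ys, k, fit, score):
    """Average the test score of models fitted on the remaining parts."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores)


# Hypothetical model: always predict the mean of the training responses.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def mean_abs_error(model, xs, ys):
    return sum(abs(y - model) for y in ys) / len(ys)


splits = list(k_fold_indices(6, 3))
error = cross_validate([1, 2, 3, 4, 5, 6], [2.0] * 6, 3, fit_mean, mean_abs_error)
```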
--
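Mastery comes from diligence and is lost through idleness; deeds succeed through thought and fail through carelessness. (Han Yu)
Better to go home and weave a net than to stand by the pond longing for fish. (Ban Gu)
Do no evil because it seems small; leave no good undone because it seems small. (Liu Bei)
※ Source: Nanjing University Lily BBS http://bbs.nju.edu.cn [FROM: 202.119.80.20]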