From: yaomc (白头翁&山东大汉), Board: DataMining
Title: [Digest] Handling large datasets?
Posted at: Nanjing University Lily BBS (Wed Jan 16 17:05:42 2002), local post
tyqqre (tyqqre) wrote on Wed Dec 26 09:15:28 2001:
I'd like to know how people here handle fairly large datasets. Do you first draw a random sample to use as training examples? Some papers I've seen split the dataset evenly into a training set and a test set.
Although I've read some material, data preprocessing is still a bit fuzzy to me. What should be considered during preprocessing? And can preprocessing only be done with respect to the particular dataset I have?
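On the random-sampling question, here is a minimal sketch of a random train/test split using only the Python standard library. The 70/30 fraction and the fixed seed are my own arbitrary choices for illustration:

```python
import random

def train_test_split(records, train_frac=0.7, seed=42):
    """Randomly shuffle the records, then split them into a
    training set and a test set (no record appears in both)."""
    shuffled = list(records)
    rng = random.Random(seed)        # fixed seed for reproducibility
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))              # toy "dataset" of 100 records
train, test = train_test_split(data)
print(len(train), len(test))         # -> 70 30
```

For a dataset too large to shuffle in memory, one would instead draw the sample while scanning the file (e.g. reservoir sampling), but the disjoint train/test idea is the same.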
fervvac (高远) wrote on Wed Dec 26 14:48:48 2001:
I know some papers use the following approach:
divide the data into 10 parts, train your classifier (etc.) on one part, and test it against the remaining parts. Repeat this 10 times.
helloboy (hello) wrote on Wed Dec 26 15:48:20 2001:
But the result may not be reliable, since a small part of the data can't represent the whole dataset.
Perhaps a disk-resident method would do, but I am not sure about it.
roamingo (漫步鸥) wrote on Wed Dec 26 21:36:43 2001:
Alternatively, one can use 9 parts as the training set and 1 for testing,
a.k.a. n-fold cross-validation.
However, an ideal data mining algorithm should scan the database
sequentially, starting from the first record, ONLY once (one scan).
It could be interrupted in the middle to save a partial result,
and at a later time be continued over the remaining records to refine
the result (incremental). With these nice properties, large datasets
are not a problem.
(Argued in: Paul S. Bradley, U. Fayyad, C. Reina. Scaling EM
(Expectation Maximization) Clustering to Large Databases.
Revised October 1999. Microsoft Research tech. report No. MSR-TR-98-35.
http://citeseer.nj.nec.com/bradley99scaling.html )
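The 9-parts-train / 1-part-test scheme described above can be sketched in plain Python; this toy version only produces the index splits (a real run would then train and score a classifier on each split):

```python
def kfold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    each of the k folds serves as the test set exactly once, while the
    other k-1 folds form the training set."""
    # distribute n records over k folds as evenly as possible
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

splits = list(kfold_indices(100, k=10))
print(len(splits))   # -> 10: every record is tested exactly once overall
```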
fervvac (高远) wrote on Thu Dec 27 14:12:22 2001:
I see. I am not sure which part is used for training, but I do remember the
10-fold term you mentioned.
I think there are two different questions here: one is the data preprocessing
method used in the classification process, the other is the one used in other
processes, say, AR (association rule) mining. The n-fold method should be
common practice for testing classifiers. For AR mining on large datasets,
there are some approximate mining algorithms, for example, approaches that
use sampling.
However, I wonder if there is any AR mining algorithm that needs only
one scan over the dataset. Even for FP-Tree-based methods (which are believed
to be the fastest), several scans over the data (perhaps not the original
dataset, but the projected DB) are necessary if memory is not enough.
And for incremental mining, you cannot stop at any arbitrary moment, can you?
I am sorry I didn't read that paper before asking these possibly silly
questions, but I am really busy with other papers :-( Hope you don't mind...
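The sampling idea for AR mining mentioned above can be illustrated with a toy support estimator: count an itemset's frequency on a random sample of transactions instead of scanning the whole database. (Real sampling algorithms, such as Toivonen's, additionally lower the support threshold on the sample and verify candidates against the full data; none of that is shown here.)

```python
import random

def sample_support(transactions, itemset, sample_size, seed=0):
    """Estimate the support (relative frequency) of an itemset from a
    random sample of transactions, avoiding a full database scan."""
    rng = random.Random(seed)
    sample = rng.sample(transactions, sample_size)  # without replacement
    itemset = set(itemset)
    hits = sum(1 for t in sample if itemset <= set(t))
    return hits / sample_size

# Toy transaction database: {a, b} appears in 3 of 6 transactions.
db = [{'a', 'b', 'c'}, {'a', 'b'}, {'b', 'c'},
      {'a', 'b', 'd'}, {'c'}, {'a'}]
est = sample_support(db, {'a', 'b'}, sample_size=6)
print(est)   # sample_size == len(db) here, so this is the exact support, 0.5
```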
roamingo (漫步鸥) wrote on Fri Dec 28 22:15:20 2001:
I agree that this is true in the AR sub-field.
The word "incremental" in that paper probably has a slightly different meaning.
It is used to describe algorithms that need exactly one scan. If an algorithm
can output a result at any point during the scan, with only the quality
differing, we can say it has the incremental property. Such algorithms exist,
especially in the clustering sub-field.
Absolutely not. I can't understand that paper very well. The only benefit I
got from reading it is the list of desired features for a DM algorithm, just
as I rephrased here. Moreover, I believe I am just a newbie in DM. You and the
others all have very insightful ideas, and I have benefited a lot from your
articles. Let's try our best to make this an invaluable place for DMers.
fervvac (高远) wrote on Sat Dec 29 15:00:17 2001:
I see. I remember Bradley is a very theoretical guy; his papers must be
difficult to read :-)
For clustering algorithms, I think it can be done in a "progressive" manner.
I do feel that property is important, as most analysis sessions are
interactive (as well as explorative) in nature. It is thus very important to
provide a preliminary result to the analyst as soon as possible.
In fact, I am a newbie to the DM field. My research interests are Data
Warehousing, OLAP, and XML. But I recently read something on sequence pattern
mining and gained some basic knowledge of the DM field :-)
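The one-scan, progressive behaviour discussed in this thread can be illustrated with a toy online k-means on 1-D data: each record is read exactly once, assigned to the nearest centroid, and that centroid is nudged toward it with a running-mean update. A snapshot of the centroids is kept after every record, so a preliminary result is available at any point mid-scan. This is my own toy sketch, not the Bradley/Fayyad/Reina scalable-EM algorithm from the cited paper.

```python
def online_kmeans(stream, k=2):
    """One-scan, interruptible k-means sketch (toy, 1-D data).
    Returns the final centroids plus a per-record snapshot list,
    i.e. a usable preliminary result after every record read."""
    centroids, counts, snapshots = [], [], []
    for x in stream:
        if len(centroids) < k:
            centroids.append(float(x))   # first k records seed the centroids
            counts.append(1)
        else:
            j = min(range(k), key=lambda c: abs(x - centroids[c]))
            counts[j] += 1
            centroids[j] += (x - centroids[j]) / counts[j]  # running mean
        snapshots.append(list(centroids))  # partial result, any time
    return centroids, snapshots

final, snaps = online_kmeans([1.0, 1.2, 9.8, 10.1, 0.9, 10.0], k=2)
print(len(snaps))   # -> 6: one preliminary result per record read
```

Interrupting the scan and resuming later would amount to checkpointing `centroids` and `counts`, which is exactly the "save a partial result, continue with the remaining records" property roamingo described.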