From: yaomc (白头翁&山东大汉), Board: DataMining
Title: [Digest] Handling large datasets?
Posted at: Nanjing University Lily BBS (Wed Jan 16 17:05:42 2002), local post
tyqqre (tyqqre) wrote on Wed Dec 26 09:15:28 2001:
I'd like to know how people here handle fairly large datasets. Do you first draw a random sample to use as training examples? Some papers I've seen split the dataset evenly into a training set and a test set.
Although I've read some material, data preprocessing is still a bit fuzzy to me. What should be considered during preprocessing? And can preprocessing only be done with respect to the particular dataset I have?
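On the random-sampling question, here is a minimal sketch of a random train/test split using only the Python standard library. The 70/30 fraction and the fixed seed are my own arbitrary choices for illustration:

```python
import random

def train_test_split(records, train_frac=0.7, seed=42):
    """Randomly shuffle the records, then split them into a
    training set and a test set (no record appears in both)."""
    shuffled = list(records)
    rng = random.Random(seed)        # fixed seed for reproducibility
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))              # toy "dataset" of 100 records
train, test = train_test_split(data)
print(len(train), len(test))         # -> 70 30
```

For a dataset too large to shuffle in memory, one would instead draw the sample while scanning the file (e.g. reservoir sampling), but the disjoint train/test idea is the same.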
fervvac (高远) wrote on Wed Dec 26 14:48:48 2001:
I know some papers use the following approach:
divide the data into 10 parts, train your classifier (etc.) on one part, and test it against the remaining parts. Repeat this 10 times.
helloboy (hello) wrote on Wed Dec 26 15:48:20 2001:
But the result may not be reliable, since a small part of the data can't represent the whole dataset.
Perhaps a disk-resident method would do, but I am not sure about it.
roamingo (漫步鸥) wrote on Wed Dec 26 21:36:43 2001:
Alternatively, one can use 9 parts as the training set and 1 for testing,
a.k.a. n-fold cross-validation.
However, an ideal data mining algorithm should scan the database
sequentially, starting from the first record, ONLY once (one scan).
It could be interrupted in the middle to save a partial result,
and at a later time be continued over the remaining records to refine
the result (incremental). With these nice properties, large datasets
are not a problem.
(Argued in: Paul S. Bradley, U. Fayyad, C. Reina. Scaling EM
(Expectation Maximization) Clustering to Large Databases.
Revised October 1999. Microsoft Research tech. report No. MSR-TR-98-35.
http://citeseer.nj.nec.com/bradley99scaling.html )
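The 9-parts-train / 1-part-test scheme described above can be sketched in plain Python; this toy version only produces the index splits (a real run would then train and score a classifier on each split):

```python
def kfold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    each of the k folds serves as the test set exactly once, while the
    other k-1 folds form the training set."""
    # distribute n records over k folds as evenly as possible
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

splits = list(kfold_indices(100, k=10))
print(len(splits))   # -> 10: every record is tested exactly once overall
```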
fervvac (高远) wrote on Thu Dec 27 14:12:22 2001:
I see. I am not sure which part is used for training, but I do remember the
10-fold term you mentioned.
I think there are two different questions here: one is the data preprocessing
method used in the classification process, the other is the one used in other
processes, say, AR (association rule) mining. The n-fold method should be
common practice for testing classifiers. For AR mining on large datasets,
there are some approximate mining algorithms, for example, approaches that
use sampling.
However, I wonder if there is any AR mining algorithm that needs only
one scan over the dataset. Even for FP-Tree-based methods (which are believed
to be the fastest), several scans over the data (perhaps not the original
dataset, but the projected DB) are necessary if memory is not enough.
And for incremental mining, you cannot stop at any arbitrary moment, can you?
I am sorry I didn't read that paper before asking these possibly silly
questions, but I am really busy with other papers :-( Hope you don't mind...
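The sampling idea for AR mining mentioned above can be illustrated with a toy support estimator: count an itemset's frequency on a random sample of transactions instead of scanning the whole database. (Real sampling algorithms, such as Toivonen's, additionally lower the support threshold on the sample and verify candidates against the full data; none of that is shown here.)

```python
import random

def sample_support(transactions, itemset, sample_size, seed=0):
    """Estimate the support (relative frequency) of an itemset from a
    random sample of transactions, avoiding a full database scan."""
    rng = random.Random(seed)
    sample = rng.sample(transactions, sample_size)  # without replacement
    itemset = set(itemset)
    hits = sum(1 for t in sample if itemset <= set(t))
    return hits / sample_size

# Toy transaction database: {a, b} appears in 3 of 6 transactions.
db = [{'a', 'b', 'c'}, {'a', 'b'}, {'b', 'c'},
      {'a', 'b', 'd'}, {'c'}, {'a'}]
est = sample_support(db, {'a', 'b'}, sample_size=6)
print(est)   # sample_size == len(db) here, so this is the exact support, 0.5
```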
roamingo (漫步鸥) wrote on Fri Dec 28 22:15:20 2001:
I agree that this is true in the AR sub-field.
The word "incremental" in that paper probably has a slightly different meaning.
It is used to describe algorithms that need exactly one scan. If an algorithm
can output a result at any point during the scan, with only the quality
differing, we can say it has the incremental property. Such algorithms exist,
especially in the clustering sub-field.
Absolutely not. I can't understand that paper very well. The only benefit I
got from reading it is the list of desired features for a DM algorithm, just
as I rephrased here. Moreover, I believe I am just a newbie in DM. You and the
others all have very insightful ideas, and I have benefited a lot from your
articles. Let's try our best to make this an invaluable place for DMers.
fervvac (高远) wrote on Sat Dec 29 15:00:17 2001:
I see. I remember Bradley is a very theoretical guy; his papers must be
difficult to read :-)
For clustering algorithms, I think it can be done in a "progressive" manner.
I do feel that property is important, as most analysis sessions are
interactive (as well as explorative) in nature. It is thus very important to
provide a preliminary result to the analyst as soon as possible.
In fact, I am a newbie to the DM field. My research interests are Data
Warehousing, OLAP, and XML. But I recently read something on sequence pattern
mining and gained some basic knowledge of the DM field :-)
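The one-scan, progressive behaviour discussed in this thread can be illustrated with a toy online k-means on 1-D data: each record is read exactly once, assigned to the nearest centroid, and that centroid is nudged toward it with a running-mean update. A snapshot of the centroids is kept after every record, so a preliminary result is available at any point mid-scan. This is my own toy sketch, not the Bradley/Fayyad/Reina scalable-EM algorithm from the cited paper.

```python
def online_kmeans(stream, k=2):
    """One-scan, interruptible k-means sketch (toy, 1-D data).
    Returns the final centroids plus a per-record snapshot list,
    i.e. a usable preliminary result after every record read."""
    centroids, counts, snapshots = [], [], []
    for x in stream:
        if len(centroids) < k:
            centroids.append(float(x))   # first k records seed the centroids
            counts.append(1)
        else:
            j = min(range(k), key=lambda c: abs(x - centroids[c]))
            counts[j] += 1
            centroids[j] += (x - centroids[j]) / counts[j]  # running mean
        snapshots.append(list(centroids))  # partial result, any time
    return centroids, snapshots

final, snaps = online_kmeans([1.0, 1.2, 9.8, 10.1, 0.9, 10.0], k=2)
print(len(snaps))   # -> 6: one preliminary result per record read
```

Interrupting the scan and resuming later would amount to checkpointing `centroids` and `counts`, which is exactly the "save a partial result, continue with the remaining records" property roamingo described.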