Posted by: yaomc (白头翁&山东大汉), Board: DataMining
Title: Some discussion of Han's book.
Posted at: Nanjing University Lily BBS (Fri Jan 18 15:03:54 2002), on-site post
(1) BY: wnqian
First of all, it is a well-organized, well-written book. However, I
think this book is not suitable for an introductory course or for
research. You can see that the contents of the book were mostly
contributed by researchers in recent years. MOST of them have not been
applied in REAL, LARGE, IMPORTANT (MISSION-CRITICAL) applications. So
this book only introduces a bunch of concepts and techniques (as its
title suggests). Whether these concepts or techniques will survive is
unknown.
Therefore, it MAY mislead beginners into thinking that these techniques
are classic ones, if they use this book as an introductory textbook. I
have read some posts here, and found that some people really have been
misled by it. On the other hand, this book is a good list of recent
research work on data mining. But it is not suitable for research. The
book does not examine any concept or technique in detail, so you will
learn very little from it if you want to do research on that topic. If
you want to do research on any topic, there will be hundreds of papers
waiting for you to read. From this viewpoint, the only useful part is
the references. Then, you may ask: "Why did they write this book?" I
think the answer is that this book is for people who come to data mining
from an application domain. They know the data, they know the
requirements, but they don't know the techniques. So they need to find
something out, and they can't read papers. So they MAY use this book to
choose the right techniques.
Another thing I should mention is that Han's background is English and
databases. So he can write good papers/books as far as the writing goes.
BUT much of his work is not solid! His standpoint is databases, which
may not be the mainstream in data mining/KDD. The viewpoint of people
from machine learning or statistics may be totally different from his.
I do some research on data mining. My homepage is:
http://www.cs.fudan.edu.cn/ch/third_web/WebDB/wnqian_English.htm
Any discussion is welcome!
(2) BY: ALEX
That's a good point. I recently joined this group to find out whether
there is anything/anybody helpful to me. I'm now working for Lucent R&D
in Shenzhen, and personally I find that data mining/KDD will be a trend
in the Telecom Industry (TMN), and it may be one of my goals in the
future. However, most of the technical discussions here are somewhat too
theoretical, and too far away from real application/implementation.
We can keep the discussion going.
Best Regards,
Alex
(3) BY: huj3
The viewpoint makes sense. However, I still think Han's book gives a good
introduction to the data mining field. Why? Because it covers the major
topics, important algorithms and ideas in a very systematic way.
Although for each topic there might be thousands of relevant papers,
most of them share some important ideas, which come from the milestone
papers. Just think how many variations of BIRCH, a hierarchical
clustering technique, emerged after BIRCH was proposed. Additionally,
it provides a huge number of references, which are valuable for
beginners.
A good introductory book is not necessarily an encyclopedia that
covers everything. Actually every book, every article has its own
point of view. It focuses on something, and mentions other things a
little bit. That's it! Think about survey papers: is there any paper of
the kind that successfully covers everything, such as all the
methodologies and algorithms in the associated field? If so, the field
must be too narrow.
Data mining is a multidisciplinary field. Researchers in databases,
machine learning, pattern recognition, and statistics make their own
contributions and, meanwhile, benefit from each other. Thinking about it
carefully, it is not difficult to find the differences between their
research paradigms. That's another long story.
As to theory and application, I don't think we pay too much
attention to the theoretical stuff. Actually we have only touched the
surface of the theory so far. Take the telecommunication industry as an
example. How to analyze a data stream within an acceptable time limit is
really a theoretical problem, one that comes out of the application
field. My suggestion is that we shouldn't treat the data mining field
from an engineering point of view, which has dominated research for a
long time in China.
(4) BY: Wnqian
Yes, as I said, Han's book does give a brief introduction. But it is
still not suitable for beginners.
Its organization of the whole field of data mining is quite good. But
each chapter, such as Cluster Analysis (Chapter 8), is far from good
enough. Han tries to organize each part in a hierarchical way, which
does help at times, but fails to bring out the relationships between
the contents. You can see that in his book each subsection usually
corresponds to one paper, but cross-references between sections are
lacking.
:)
BIRCH is important since it is the first clustering algorithm that can
handle VLDB well. But I think Han's book doesn't describe it well in
that context. Similarly, DBSCAN and OPTICS can work because they have
indexing support, but still, Han et al. don't emphasize this, which may
greatly mislead readers.
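To make the indexing point concrete, here is a minimal sketch (not taken
from Han's book or from the DBSCAN paper) of the eps-neighborhood query
that DBSCAN issues once per point. A uniform grid stands in for a real
spatial index such as an R*-tree, and the function names are made up for
illustration. With the grid, a query inspects only the 3x3 block of cells
around the query point; without it, every query scans the whole data
set, which is where the quadratic cost comes from.

```python
from collections import defaultdict
from math import floor


def build_grid(points, eps):
    """Bucket 2-D points into square cells of side eps (a toy spatial index)."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(floor(x / eps), floor(y / eps))].append(i)
    return grid


def region_query(points, grid, eps, q):
    """Indices of points within eps of points[q], checking only nearby cells."""
    qx, qy = points[q]
    cx, cy = floor(qx / eps), floor(qy / eps)
    hits = []
    # Any point within eps of q must lie in one of the 9 surrounding cells.
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for i in grid.get((cx + dx, cy + dy), ()):
                px, py = points[i]
                if (px - qx) ** 2 + (py - qy) ** 2 <= eps * eps:
                    hits.append(i)
    return sorted(hits)


def region_query_scan(points, eps, q):
    """Same answer via a full scan: the per-query cost when there is no index."""
    qx, qy = points[q]
    return [i for i, (px, py) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps * eps]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.0), (1.2, 0.0), (3.0, 3.0), (0.9, 0.9)]
    g = build_grid(pts, 1.0)
    print(region_query(pts, g, 1.0, 0))
```

Both functions return the same neighbors; only the number of points
touched per query differs, and that difference is what decides whether
the algorithm is usable on a very large database.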
A good introductory book should make it clear that application and
research are different, and say what has been applied in real
applications and what is still being developed. From this point of view,
Han's book does not satisfy the requirement.
As for the comments on theory, I agree with you. In fact, I am
also doing research work. And I like it. But I think I should
remind people that when reading Han's book, we should think more.
Furthermore, people with different backgrounds have different viewpoints.
You and I have academic backgrounds, so of course we take the position
that data mining is quite a good research field. But for those people
from companies, how can you get them to think about the algorithms of
BIRCH, CURE, ROCK, DBSCAN, and the related complexity issues?
It seems that you have thought quite a lot about clustering and
data streams. It's really interesting. Discussion is welcome!
(5) BY: Alex
Hi Weining and Huj,
I have to say that I only know data mining on the surface, and not its
theory and algorithms in detail.
Actually, I came up with this idea from my work. The system we design
is something very much like a data warehouse, or is a data warehouse: we
collect all types of telecom data from different management systems and
store it in a database with a unified model.
The problem here is how to use the data. The current status can be called
GIGO (garbage in, garbage out). Currently, most of the management systems
in Telecom focus on the Equipment level or Network level, but how to use
the data at the Service level is the leading edge and trend of telecom
management systems.
I think I should introduce here some concepts of the Telecom Network,
which is totally different from the Data Network (Internet or TCP/IP
network) most of us are familiar with. The Telecom Network is basically
made up of the Telephone/Data Network (connecting to end users, public
phone users, leased line users), the Access Network (the interface
between the core transport network and the end user network, e.g. ATM,
GB router), and the Transport Network (SDH/DWDM, core network,
backbone). E.g. China Telecom builds the core network and most of the
access network, while some of the ISPs lease bandwidth and provide the
access network to Internet users.
The Service Level systems are closely tied to the ISPs. What the ISPs
care about when they lease 100M of bandwidth from China Telecom is the
QoS or, in technical terms, the FM and PM info (Fault Management and
Performance Management). All that data is already available in different
systems provided by different vendors.
As I have said, this service level requirement has become hot recently.
In China this is really at the beginning stage. Just think about it:
years ago, when there was only China Telecom, how could ISPs request such
services? But now, as the separation of the Telecom Industry in China
becomes a reality, this is getting hotter.
This is how I turned to data mining: when there are millions of circuits
in the data warehouse, how do we find the most problematic circuits
(low performance and a lot of service-affecting alarms) and locate
where the problem is? (What is a circuit? You can think of it as, e.g.,
a 2M leased line from Shanghai to Shenzhen, which provides the co-located
company with a circuit to build its private network.)
At first it looked to me like a statistical issue, but when thinking
deeper at the implementation/application level, it becomes data
mining.
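As a rough illustration of the "statistical" first cut described above,
the sketch below ranks circuits by combining their rank on a performance
measure with their rank on service-affecting alarm count. Everything in
it is an assumption made for illustration: the field names
(`availability`, `sa_alarms`), the rank-sum score, and the sample
records are hypothetical and do not come from any real FM/PM system. It
only shows where the simple approach starts; locating the cause of a
problem is the part that needs more than this.

```python
from dataclasses import dataclass


@dataclass
class CircuitStats:
    circuit_id: str       # hypothetical circuit identifier
    availability: float   # hypothetical PM measure in [0, 1]; lower is worse
    sa_alarms: int        # hypothetical FM measure: service-affecting alarms


def rank_worst(circuits, top_n=3):
    """Return the ids of the top_n worst circuits by summed rank positions."""
    by_perf = sorted(circuits, key=lambda c: c.availability)   # worst first
    by_alarm = sorted(circuits, key=lambda c: -c.sa_alarms)    # worst first
    score = {c.circuit_id: 0 for c in circuits}
    for order in (by_perf, by_alarm):
        for pos, c in enumerate(order):
            score[c.circuit_id] += pos   # smaller total means more problematic
    ranked = sorted(circuits, key=lambda c: (score[c.circuit_id], c.circuit_id))
    return [c.circuit_id for c in ranked[:top_n]]


if __name__ == "__main__":
    sample = [
        CircuitStats("A", 0.999, 2),
        CircuitStats("B", 0.90, 40),
        CircuitStats("C", 0.95, 55),
        CircuitStats("D", 0.97, 1),
    ]
    print(rank_worst(sample, top_n=2))
```

A rank-sum is used here only because it needs no tuning of weights
between the two measures; any other way of combining them would fit the
same skeleton.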
I'm quite busy with my assignment and can only study it in my spare
time. I hope I can get some clues from you guys, as most of you should
know more about DM than I do.
Best Regards,
Alex
--
Welcome to http://datamining.bbs.lilybbs.net.
※ Modified: yaomc edited this post on Jan 18 16:02:13. [FROM: 202.204.36.15]
※ Source: Nanjing University Lily BBS bbs.nju.edu.cn. [FROM: 202.204.36.15]