From: yaomc (白头翁&山东大汉), Board: DataMining
Subject: [Collection] What should I read next
Posted at: Nanjing University Lily BBS (Tue Jan 15 14:53:45 2002), local post
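iamgufeng (古风) wrote on Fri Dec 14 09:26:20 2001:
I have finished Han's book and, just as the title promises, I finally understand some of the concepts.
I installed SAS and started practicing on the web side, but without much real knowledge behind me I don't feel on solid ground. What should my next step be, and what should I read?
This is a spare-time interest for me, so I don't have the energy or the means to read through several years of literature.
What are the current obstacles to DM's development, and what are the hot topics?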
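yaomc (白头翁&山东大汉) wrote on Fri Dec 14 10:41:08 2001:
I think you first need to be clear about what you want to do right now.
If you are not doing data mining as part of a collaboration, you need domain knowledge of your own in the area you study: you have to know
where that field has problems, what needs to be solved, and what data mining can contribute there.
Then pick suitable methods and build up skill step by step through repeated practice.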
aniky (童话中的红房子) wrote on Fri Dec 14 12:40:11 2001:
I'm reading this book right now. Any advice you can share?
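iamgufeng (古风) wrote on Sun Dec 16 12:29:45 2001:
After finishing Han's book I roughly understand what DM is, what its main techniques are, and which kinds of practical work it can
help improve. But to go deep into a particular direction, a particular algorithm, or project implementation experience in some area, you still have to
look for other literature; fortunately there is a lot of good material online. Even so, it is hard for one person working alone to implement everything
from scratch. It is better to learn with the help of mature tools; implementing a project in code once you have really mastered the field
is probably more realistic.
I may well be talking nonsense, so experts please correct me, and please point the way.
Specifically on web usage mining, I have two questions:
1. What is a reasonable time threshold for splitting sessions?
2. Which algorithms are better suited to analyzing user browsing patterns in logs?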
hwe (xiaohui) wrote on Sun Dec 16 21:02:40 2001:
It seems to be 30 minutes.
roamingo (漫步鸥) wrote on Mon Dec 17 16:51:21 2001:
I think it's better to use a configurable variable when you are writing
a sessionizer. Then you can find a reasonable interval by yourself.
In my experiments, 10 to 30 minutes all work fine. (This is the inactive
period of a session, not its total time span.)
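
A minimal sketch of such a sessionizer in Python, assuming the log has already been reduced to (user_id, timestamp, url) tuples. The function name `sessionize` and the 30-minute default are illustrative choices, not something prescribed in this thread.

```python
from datetime import datetime, timedelta

def sessionize(requests, inactivity=timedelta(minutes=30)):
    """Group (user_id, timestamp, url) records into sessions.

    A new session starts when the user changes or when the gap since the
    user's previous request exceeds `inactivity` (the inactive period,
    not the total span of the session).
    """
    sessions = []                      # list of (user_id, [(timestamp, url), ...])
    last_user, last_ts = None, None
    for user, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        if user != last_user or ts - last_ts > inactivity:
            sessions.append((user, []))
        sessions[-1][1].append((ts, url))
        last_user, last_ts = user, ts
    return sessions

# Hypothetical sample data for illustration only.
t0 = datetime(2001, 12, 17, 10, 0)
sample = [
    ("u1", t0, "/a"),
    ("u1", t0 + timedelta(minutes=5), "/b"),
    ("u1", t0 + timedelta(minutes=50), "/c"),
    ("u2", t0, "/a"),
]
for user, hits in sessionize(sample):
    print(user, [url for _, url in hits])
```

Keeping the threshold as a parameter makes it easy to rerun the same log with 10, 20, or 30 minutes and compare the resulting session counts.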
For the second question, there are many, depending on what kind of patterns
you are going to find. For example:
* Association: find pages that tend to be accessed together.
* Sequential analysis: find the frequent paths.
* Markov chain model: predict the next access, often used for prefetching.
* Clustering (usually a categorical-value-oriented method, like ROCK
mentioned in Han's textbook):
- session clustering
- page clustering
* And some combinations of the above.
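
To make the Markov chain item concrete, here is a small first-order sketch in Python. It assumes sessions are already available as ordered lists of page URLs; the function names and the sample sessions are hypothetical, and a real prefetcher would likely need smoothing and higher-order context.

```python
from collections import Counter, defaultdict

def fit_transitions(sessions):
    """Count page -> next-page transitions over all sessions."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for cur, nxt in zip(pages, pages[1:]):
            counts[cur][nxt] += 1
    return counts

def predict_next(counts, page):
    """Return (most likely next page, estimated probability), or None if unseen."""
    followers = counts.get(page)
    if not followers:
        return None
    nxt, n = followers.most_common(1)[0]
    return nxt, n / sum(followers.values())

# Hypothetical sessions for illustration only.
sessions = [
    ["/", "/products", "/cart"],
    ["/", "/products", "/help"],
    ["/", "/products", "/cart"],
]
model = fit_transitions(sessions)
print(predict_next(model, "/products"))
```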
fervvac (高远) wrote on Mon Dec 17 19:40:12 2001:
There is an EDBT paper (this year) on boundary finding for transactions from
proxy logs, but you will probably have to wait until the electronic edition is
available.
iamgufeng (古风) wrote on Tue Dec 18 09:12:48 2001:
What is EDBT? Please tell me what it means, thanks.
fervvac (高远) wrote on Tue Dec 18 13:28:41 2001:
http://sunsite.informatik.rwth-aachen.de/dblp//db/conf/edbt/index.html
A second-tier top DB conference.
iamgufeng (古风) wrote on Tue Dec 18 17:18:10 2001:
The session period varies with the implementation and
configuration.
Take ASP for example: the session.timeout property sets the session
period, and the default value is 20 minutes. When you open a new browser or
access another application, a new session (identified by a session ID) is
created.
The session information (the session ID cookie) is included in each HTTP request
header, and the server checks it to tell sessions apart. So if we can save this
session info, we can sessionize correctly. If we ignore those unnecessary
details, then, as you said, the inactive period of a session is a good
choice. How do most DM software packages, such as SAS, define "session" and
"cookie"?
roamingo, which methods did you use most often in your experiments, such as
SWLMS?
fervvac (高远) wrote on Wed Dec 19 14:13:51 2001:
One technical question: is that session information available in the log?
If so, as you have put it, the exact session boundaries could be obtained,
and this would nullify a lot of existing work. :-)
iamgufeng (古风) wrote on Wed Dec 19 16:04:59 2001:
If you open the IIS MMC to configure your application, you will find that many
extended properties can be selected in the W3C Extended Log File Format, such as
cs(Cookie), cs(Referer), etc. These may make a lot of this work easier. I don't know
whether other HTTP servers like Apache can log information such as cookies. Since
so many dynamic page languages use cookies to label and manage user session
information, perhaps there are many other methods as well.
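
As a rough illustration of how such a log could be read, the sketch below parses W3C extended format lines in Python by following the `#Fields:` directive. It assumes values are whitespace-separated with no embedded spaces and that a missing value is written as `-`. The sample log lines and the function name are made up for illustration.

```python
def parse_w3c_extended(lines):
    """Yield one dict per data line, keyed by the names in the #Fields directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directives such as #Version or #Date are skipped; #Fields names the columns.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed lines rather than guess at their layout
        yield dict(zip(fields, values))

# Hypothetical log excerpt for illustration only.
sample_log = """\
#Version: 1.0
#Date: 2001-12-19 08:00:00
#Fields: date time c-ip cs-method cs-uri-stem cs(Cookie) cs(Referer)
2001-12-19 08:00:01 10.0.0.1 GET /default.asp ASPSESSIONIDABC=XYZ -
2001-12-19 08:00:09 10.0.0.1 GET /news.asp ASPSESSIONIDABC=XYZ http://example.com/default.asp
""".splitlines()

for record in parse_w3c_extended(sample_log):
    print(record["c-ip"], record["cs-uri-stem"], record["cs(Cookie)"])
```

With the cookie column available, requests carrying the same session cookie can be grouped directly instead of being inferred from timestamps alone.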
roamingo (漫步鸥) wrote on Fri Dec 21 13:38:52 2001:
I happen to be running Apache under Linux, and it is also very easy to
customize the log file format under Apache. I have put the cookie
field in it, and it makes unique user identification very accurate.
Of course, you still have to use the timestamp field to perform session
identification. However, the cookie field will always change for users
who turn off cookie acceptance in their browsers. Those
log entries have to be discarded or treated differently.
If no cookie information is available, the IP field, optionally combined with
information such as "referer" and "user agent", can be used to do
session identification; this is more adaptive but not as accurate. Mobasher
and Cooley's work around 1998 has a detailed discussion of this topic.
This approach is also useful for carrying out experiments on external
web log data, such as the freely available HTTP log from the Berkeley CS server:
http://www.cs.berkeley.edu/logs/
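
The two identification strategies above can be sketched in Python as follows. This assumes a log format equal to Apache's "combined" format with the Cookie request header appended as a final quoted field; the format nickname `combinedcookie`, the function name, and the sample lines are illustrative assumptions, not the configuration actually used in the experiments described here.

```python
import re

# Assumed Apache configuration (illustrative):
#   LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\"" combinedcookie
# Simplification: quotes escaped inside field values are not handled.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" '
    r'"(?P<agent>[^"]*)" "(?P<cookie>[^"]*)"'
)

def user_key(line):
    """Return a key identifying the user behind one log line, or None if unparsable.

    Prefer the cookie when one was sent; otherwise fall back to the less
    accurate (IP, user agent) pair.
    """
    m = LINE_RE.match(line)
    if m is None:
        return None
    if m["cookie"] not in ("", "-"):
        return ("cookie", m["cookie"])
    return ("ip+agent", m["ip"], m["agent"])

# Hypothetical log lines for illustration only.
with_cookie = ('10.0.0.1 - - [21/Dec/2001:13:38:52 +0800] "GET /index.html HTTP/1.0" '
               '200 1043 "-" "Mozilla/4.0 (compatible; MSIE 5.5)" "uid=42"')
no_cookie = ('10.0.0.2 - - [21/Dec/2001:13:39:10 +0800] "GET /index.html HTTP/1.0" '
             '200 1043 "-" "Lynx/2.8" "-"')
print(user_key(with_cookie))
print(user_key(no_cookie))
```

Requests from browsers that reject cookies arrive without a stable cookie value, so they either fall through to the IP-based key as here or have to be filtered out separately.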
Ronny Kohavi has proposed some new insights on the level of information
at which web usage mining algorithms should be carried out. He thinks it is
better to do it at the e-commerce application level, not the raw log file
level.
// Sorry for my absence. I just finished my thesis draft yesterday.
iamgufeng (古风) wrote on Sun Dec 23 17:16:04 2001:
After reading some papers on data preparation, frequent traversal paths,
etc. (some perhaps written by roamingo :)), many ideas about web usage mining
are gradually becoming clearer. However, the methods mentioned previously are better suited
to static web sites. For my company's site, which is based on php+asp/sql,
there are at least the following obstacles:
1. My registered users are not the users that appear in the log. How can I extract these registered
users' sessions correctly?
2. A dynamic page covers many categories of information, so the URL alone cannot identify the detailed
content. Classifying by the query string parameters after "?" may be an alternative,
but what about requests that use the POST method?
3. Much of the users' dynamic activity cannot be fully logged.
4. ...
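
For item 2, one possible way to classify dynamic pages by their query string is sketched below in Python. The parameter names `cat` and `action` are hypothetical; which parameters actually determine the content depends on the site. Parameters sent in a POST body do not appear in the request URL, so they would not be covered by this approach.

```python
from urllib.parse import urlsplit, parse_qs

def page_category(url, key_params=("cat", "action")):
    """Reduce a dynamic URL to its path plus the parameters that select content.

    Parameters not listed in `key_params` (page numbers, session IDs, ...)
    are dropped so that equivalent pages map to the same category.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    picked = [f"{name}={query[name][0]}" for name in key_params if name in query]
    return parts.path + ("?" + "&".join(picked) if picked else "")

# Hypothetical URLs for illustration only.
print(page_category("/shop/list.php?cat=books&page=3&sid=abc123"))
print(page_category("/index.php"))
```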
Perhaps we need to build mining considerations into the site design: we write some code
that generates the log information we need when events occur, and, in conjunction
with the original server log, valuable data can then be extracted for subsequent mining.
But how does this affect performance?
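
A minimal sketch of what such application-level event logging might look like, written in Python for illustration (the site discussed here uses php+asp). The function name, the record fields, and the one-JSON-object-per-line layout are all assumptions, not an existing API.

```python
import io
import json
from datetime import datetime, timezone

def log_event(stream, user_id, session_id, event, **details):
    """Append one application-level event as a JSON line and return the record."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,        # the registered user, known to the application
        "session_id": session_id,  # the application's own session identifier
        "event": event,
        "details": details,        # e.g. POST parameters worth keeping
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record

# Hypothetical usage; a real site would write to a file or a database.
buffer = io.StringIO()
log_event(buffer, "user42", "sess-1", "add_to_cart", product_id=1001, quantity=2)
print(buffer.getvalue(), end="")
```

Because the application already knows the registered user and the session, records like these sidestep most of the user and session identification needed for raw server logs.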
roamingo (漫步鸥) wrote on Sun Dec 23 19:18:44 2001:
I agree, and I think that is exactly the approach used by Blue Martini's
(Ronny Kohavi's) JSP-based e-commerce + CRM solutions.
(www.bluemartini.com)
Regarding performance, as long as the information to be logged is carefully
selected, the amount logged will be much less than the common
log data, and performance will probably not be a bottleneck.
In addition, I'd like to suggest the following paper:
Juhnyoung Lee, Mark Podlaseck, Edith Schonberg, and Robert Hoch.
Visualization and Analysis of Clickstream Data of Online Stores for
Understanding Web Merchandising. In: Applications of Data Mining to
Electronic Commerce, Special issue of the International Journal of Data
Mining and Knowledge Discovery, January 2001.
Guest editors: Ronny Kohavi and Foster Provost
URL: http://robotics.stanford.edu/~ronnyk/ecommerce-dm/