Posted by: soullion (river), Board: DataMining
Subject: Automatic Personalization Based on Web Usage Mini
Posted from: Nanjing University Lily BBS (Fri Mar  1 12:57:56 2002)


http://maya.cs.depaul.edu/~mobasher/personalization/

  

Bamshad Mobasher 

Dept. of Computer Science, DePaul University, Chicago, IL 

mobasher@cs.depaul.edu 


Robert Cooley, Jaideep Srivastava 

Dept. of Computer Science, University of Minnesota, Minneapolis, MN 

cooley@cs.umn.edu, srivasta@cs.umn.edu







Introduction 


The ease and speed with which business transactions can be carried out over the Web have been a key driving force in the rapid growth of electronic commerce. Business-to-business e-commerce is the focus of much attention today, mainly due to its huge volume. While there are certainly gains to be made in this arena, most of them come from implementing much more efficient supply management, payments, etc. On the other hand, e-commerce activity that involves the end user is undergoing a significant revolution. The ability to track users' browsing behavior down to individual mouse clicks has brought the vendor and end customer closer than ever before. It is now possible for a vendor to personalize a product message for individual customers at a massive scale, a phenomenon that is being referred to as mass customization.


Though the scenario outlined here is from e-commerce, the type of personalization described is applicable to any Web browsing activity. Web personalization can be described as any action that tailors the Web experience to a particular user's taste. The experience can be something as casual as browsing the Web or as (economically) significant as trading stocks or purchasing a car. The actions can range from simply making the presentation more pleasing to an individual to anticipating the needs of the user and providing the right information, as well as performing a set of routine book-keeping functions automatically.


Principal elements of Web personalization include modeling of Web objects (pages, etc.) and subjects (users), categorization of objects and subjects, matching between and across objects and/or subjects, and determination of the set of actions to be recommended for personalization. Existing approaches used by many Web-based companies, as well as approaches based on collaborative filtering, e.g. GroupLens [KMM+97, HKBR99] and Firefly [SM95], rely heavily on getting human input, e.g. a user profile, for determining the personalization actions. The drawbacks of this are that (a) the input is often a subjective description of the users by the users themselves, and thus prone to biases, and (b) the profile is static: it is good for personalization for some time after it is collected, but its performance degrades over time as the profile ages.


Recently, a number of approaches have been developed dealing with specific aspects of Web usage mining for the purpose of automatically discovering user profiles. For example, Perkowitz and Etzioni [PE98] proposed the idea of optimizing the structure of Web sites based on co-occurrence patterns of pages within usage data for the site. Schechter et al. [SKS98] have developed techniques for using path profiles of users to predict future HTTP requests, which can be used for network and proxy caching. Spiliopoulou et al. [SF99], Cooley et al. [CMS99], and Buchner and Mulvenna [BM99] have applied data mining techniques to extract usage patterns from Web logs, for the purpose of deriving marketing intelligence. Shahabi et al. [SZA97], Yan et al. [YJGD96], and Nasraoui et al. [NFJK99] have proposed clustering of user sessions to predict future user behavior.


In this paper we describe an approach to usage-based Web personalization taking into account the full spectrum of Web mining techniques and activities. Our approach is described by the architecture shown in Figure 1, which relies heavily on data mining techniques, thus making the personalization process both automatic and dynamic, and hence up-to-date. Specifically, we have developed techniques for preprocessing of Web usage logs and grouping URL references into sets called user transactions [CMS99]. A user transaction is a unit of semantic activity, and performing data mining on such transactions is more meaningful. We describe and compare three different Web usage mining techniques, based on transaction clustering, usage clustering, and association rule discovery, to extract usage knowledge for the purpose of Web personalization. We also propose techniques for combining this knowledge with the current status of an ongoing Web activity to perform real-time personalization. Finally, we provide an experimental evaluation of the proposed techniques using real Web usage data.

  


Architecture for Usage-based Web Personalization 


The overall process of usage-based Web personalization can be divided into two components. The offline component is comprised of the data preparation tasks resulting in a user transaction file, and the specific usage mining tasks, which in our case involve the discovery of association rules and the derivation of URL clusters based on two types of clustering techniques.


Once the mining tasks are accomplished, the frequent itemsets and the URL clusters are used by the online component of the architecture to provide dynamic recommendations to users based on their current navigational activity. The online component is comprised of a recommendation engine and the HTTP server. The Web server keeps track of the active user session as the user's browser makes HTTP requests. This can be accomplished by a variety of methods such as URL rewriting, or by temporarily caching the Web server access logs. The recommendation engine considers the active user session in conjunction with the URL clusters and the discovered association rules to compute a set of recommended URLs. The recommendation set is then added to the last requested page as a set of links before the page is sent to the client browser.
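
As a rough illustration of the online step, the sketch below matches an active session against a list of URL clusters and returns the unvisited URLs of sufficiently matching clusters. The text above does not specify the matching rule at this point, so the overlap score, the `min_overlap` threshold, and the function name `recommend` are all assumptions made for illustration only.

```python
def recommend(active_session, url_clusters, min_overlap=0.5):
    """Suggest unvisited URLs from clusters that overlap the active session.

    The overlap score (fraction of a cluster already visited) and the
    threshold are illustrative assumptions, not the method of the paper.
    """
    visited = set(active_session)
    scored = {}
    for cluster in url_clusters:
        cluster = set(cluster)
        if not cluster:
            continue
        overlap = len(visited & cluster) / len(cluster)
        if overlap >= min_overlap:
            # Keep the best score seen for each candidate URL.
            for url in cluster - visited:
                scored[url] = max(scored.get(url, 0.0), overlap)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda u: (-scored[u], u))
```

The returned list would then be rendered as links and appended to the last requested page before it is sent to the browser.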


A generalized architecture for the system is depicted in Figure 1. We now discuss the details of each of the architectural components.


 


Figure 1. General Architecture for A Usage-Based Web Personalization System





Mining Usage Data for Web Personalization 


The offline component of usage-based Web personalization can be divided into two separate stages. The first stage is that of preprocessing and data preparation, including data cleaning, filtering, and transaction identification. The second is the mining stage in which usage patterns are discovered via methods such as association-rule mining and clustering. Each of these stages is discussed below.


Preprocessing Tasks 


The prerequisite step to all of the techniques for providing users with recommendations is the identification of a set of user sessions from the raw usage data provided by the Web server. Ideally, each user session gives an exact accounting of who accessed the Web site, what pages were requested and in what order, and how long each page was viewed. Two of the biggest impediments to forming accurate user sessions are local caching and proxy servers. In order to improve performance and minimize network traffic, most Web browsers cache the pages that have been requested. As a result, when a user hits the "back" button, the cached page is displayed and the Web server is not aware of the repeat page access. Proxy servers provide an intermediate level of caching and create even more problems with identifying site usage. In a Web server log, all requests from a proxy server have the same identifier, even though the requests potentially represent more than one user. Also, due to proxy server level caching, a page obtained with a single request to the server could actually be viewed by multiple users over an extended period of time. The most reliable methods for resolving a server log into user sessions are the use of cookies or dynamic URLs with an embedded session ID. However, these techniques are not always available due to privacy concerns of the users, or limitations of the capabilities of the Web server. As described in detail in [CMS99], several simple heuristics using the referrer and agent fields of a server log can be used to identify user sessions and infer missing references with relative accuracy in the absence of additional information such as cookies.
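
A minimal sketch of one such heuristic follows. It is not the procedure of [CMS99]: it only separates requests by the (IP address, agent) pair and starts a new session after a period of inactivity. The 30-minute timeout, the dictionary keys, and the function name `identify_sessions` are assumptions chosen for illustration.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed cutoff, not taken from the paper


def identify_sessions(entries, timeout=SESSION_TIMEOUT):
    """Group log entries into sessions keyed by (ip, agent).

    entries: iterable of dicts with keys 'ip', 'agent', 'time' (in seconds)
    and 'url'. Returns a list of sessions, each a list of URLs in request order.
    """
    by_user = defaultdict(list)
    for entry in sorted(entries, key=lambda e: e["time"]):
        # Different agents behind one IP are treated as different users.
        by_user[(entry["ip"], entry["agent"])].append(entry)

    sessions = []
    for requests in by_user.values():
        current = [requests[0]["url"]]
        for prev, cur in zip(requests, requests[1:]):
            # A long gap between consecutive requests starts a new session.
            if cur["time"] - prev["time"] > timeout:
                sessions.append(current)
                current = []
            current.append(cur["url"])
        sessions.append(current)
    return sessions
```

Using the referrer field to split users who share both IP address and agent, or to infer pages served from a cache, would be a further refinement that this sketch omits.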


In addition to identifying user sessions, the raw log must also be cleaned, or transformed into a list of page views. Due to the stateless connection properties of the HTTP protocol, several file requests (HTML, images, sounds, etc.) are often made as the result of a single user action. The group of files that are sent due to a single click is referred to as a page view. Cleaning the server log involves removing all of the file accesses that are redundant, leaving only one entry per page view. This includes handling page views that have multiple frames, and dynamic pages that have the same template name for multiple page views. It may also be necessary to filter the log files by mapping the references to the site topology induced by physical links between pages. This is particularly important for usage-based personalization, since the recommendation engine should not provide dynamic links to "out-of-date" or non-existent pages.
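
The cleaning step can be approximated with a simple filter, sketched below under stated assumptions: embedded resources are recognized by file extension, and the optional `site_pages` set stands in for the current site topology. The extension list and both names are hypothetical, and frames and template-based dynamic pages are not handled.

```python
import os
from urllib.parse import urlsplit

# Assumed list of embedded-resource extensions; adjust for the site at hand.
EMBEDDED_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".wav", ".ico"}


def to_page_views(urls, site_pages=None):
    """Reduce a sequence of requested URLs to one entry per page view.

    Requests for embedded files are dropped. If site_pages (a set of valid
    paths) is given, references to pages no longer on the site are dropped too.
    """
    views = []
    for url in urls:
        path = urlsplit(url).path
        extension = os.path.splitext(path)[1].lower()
        if extension in EMBEDDED_EXTENSIONS:
            continue  # image, sound, style sheet, etc. loaded by a page
        if site_pages is not None and path not in site_pages:
            continue  # out-of-date or non-existent page
        views.append(url)
    return views
```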


Each user session in a user session file can be thought of in two ways: either as a single transaction of many page references, or as a set of many transactions each consisting of a single page reference. The goal of transaction identification is to dynamically create meaningful clusters of references for each user. Based on an underlying model of the user's browsing behavior, each page reference can be categorized as a content reference, auxiliary (or navigational) reference, or hybrid. In this way different types of transactions can be obtained from the user session file, including content-only transactions involving references to content pages, and navigation-content transactions involving a mix of page types. The details of methods for transaction identification are discussed in [CMS99]. For the purpose of this paper we assume that each user session is viewed as a single transaction. Finally, the session file may be filtered to remove very small transactions and very low support references (i.e., URL references that are not supported by a specified number of user transactions). This type of support filtering can be important in removing noise from the data, and can provide a form of dimensionality reduction in clustering tasks where URLs appearing in the session file are used as features.
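
The support filtering described above might look as follows. This is a sketch: the two thresholds and the function name `filter_by_support` are illustrative assumptions, and the paper does not state the order in which the two filters are applied.

```python
from collections import Counter


def filter_by_support(transactions, min_url_support=2, min_length=2):
    """Drop low-support URLs, then drop transactions that became too small.

    transactions: iterable of collections of URLs.
    min_url_support: minimum number of transactions that must contain a URL.
    min_length: minimum number of URLs a transaction must retain.
    """
    transactions = [set(t) for t in transactions]
    # Number of transactions in which each URL occurs at least once.
    counts = Counter(url for t in transactions for url in t)
    kept = []
    for t in transactions:
        reduced = {url for url in t if counts[url] >= min_url_support}
        if len(reduced) >= min_length:
            kept.append(reduced)
    return kept
```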
 


Given the preprocessing steps outlined above, for the rest of this paper we assume that there is a set of n unique URLs appearing in the preprocessed log:


$$U = \{url_1, url_2, \ldots, url_n\}$$


and a set of m user transactions:


$$T = \{t_1, t_2, \ldots, t_m\}$$


where each $t_i \in T$ is a non-empty subset of $U$.


Discovering Frequent Itemsets and Association Rules 


Association rule discovery methods, such as the Apriori algorithm [AS94], initially find groups of items (which in this case are the URLs appearing in the preprocessed log) occurring frequently together in many transactions. Such groups of items are referred to as frequent itemsets. Given a set $I = \{I_1, I_2, \ldots, I_k\}$ of frequent itemsets, the support of $I_i$ is defined as


$$\sigma(I_i) = \frac{|\{t \in T : I_i \subseteq t\}|}{|T|}$$


Generally, a support threshold is specified before mining and is used by the algorithm for pruning the search space. The itemsets returned by the algorithm satisfy this minimum support threshold. Furthermore, support is downward closed: if an itemset does not satisfy the minimum support criterion, then neither do any of its supersets.
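
The following level-wise search is a simplified sketch in the spirit of Apriori, not a faithful or optimized implementation of [AS94]. It shows how a support threshold and the downward-closure property prune the candidates; the function names are chosen for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({url for t in transactions for url in t})
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset, transactions)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        previous = set(level)
        # Prune step (downward closure): every (k-1)-subset must be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in previous for s in combinations(c, k - 1))
                 and support(c, transactions) >= min_support]
    return result
```

For example, with the four transactions {a, b, c}, {a, b}, {a, c}, {b} and a threshold of 0.5, the candidate {a, b, c} is discarded without counting because its subset {b, c} is not frequent.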


Association rules capture the relationships among items based on their patterns of co-occurrence across transactions. In the case of Web transactions, association rules capture relationships among URL references based on the navigational patterns of users. An association rule r is an expression of the form


$$X \Rightarrow Y \quad (\sigma_r, \alpha_r)$$


where X and Y are sets of items, $\sigma_r$ is the support of $X \cup Y$, and $\alpha_r$ is the confidence for the rule r, given by $\sigma(X \cup Y) / \sigma(X)$, with $\sigma(\cdot)$ denoting support.
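
As a small worked illustration of the confidence formula, the sketch below computes the support and confidence of a rule directly from a list of transactions. The helper names `support` and `confidence` are illustrative and the data in the usage note is made up.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X => Y: support(X union Y) / support(X)."""
    support_x = support(x, transactions)
    if support_x == 0:
        raise ValueError("the antecedent X never occurs in the transactions")
    return support(set(x) | set(y), transactions) / support_x
```

With the transactions {a, b, c}, {a, b}, {a, c}, {b}, the rule {a} => {b} has support 2/4 and confidence (2/4) / (3/4) = 2/3.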


Despite some shortcomings (which we point out in the discussion of our experimental results), in many cases frequent itemsets and association rules can be used directly to provide effective recommendations as part of the personalization task. We will describe a simple and efficient technique to do so in the subsequent sections. They are also the foundation of our usage clustering technique based on Association-Rule Hypergraph Partitioning [HKKM97, HKKM98], which is used as part of a more general and robust method for computing recommendations.


Clustering Transactions 


Traditional collaborative filtering techniques are often based on matching the current user's profile against clusters of similar profiles obtained by the system over time from other users. A similar technique can be used in the context of Web personalization by first clustering user transactions identified in the preprocessing stage. However, in contrast to collaborative filtering, clustering user transactions based on mined information from access logs does not require explicit ratings or interaction with users. In our case, user transactions are mapped into a multi-dimensional space as vectors of URL references. Standard clustering algorithms generally partition this space into groups of items that are close to each other based on a measure of distance. In the case of Web transactions, each cluster represents a group of transactions that are similar based on co-occurrence patterns of URL references.


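Given a user transaction $t \in T$, we can represent the transaction as a (bit) vector


$$\vec{t} = \langle u_1^t, u_2^t, \ldots, u_n^t \rangle$$

where, for the i-th URL $url_i$ in the preprocessed log,


$$u_i^t = \begin{cases} 1 & \text{if } url_i \in t \\ 0 & \text{otherwise} \end{cases}$$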

Some other proposals [SZAS97, YJGD96] have suggested using, instead of binary weights, feature weights based on the time a user spends on a particular page or the frequency of occurrence of a URL reference within the user transaction. However, neither of these seems intuitively or practically justifiable in the context of Web transactions. For example, studies have suggested [KMM+97] that for a particular user, the amount of time spent on a page may not generally be a good indication of interest. Furthermore, the frequency of a reference is not generally a good measure of importance of a page to the user; it only indicates the use of that page as a localized navigational nexus for that particular user. On the other hand, whether or not the URL reference occurs at all is clearly important. We have thus chosen to use only binary feature weights for our vector representation.
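
The binary representation can be sketched in a few lines. The function name `to_bit_vector` and the fixed URL ordering passed as `urls` are assumptions for illustration; repeated references within a transaction have no extra effect, in line with the choice of binary weights.

```python
def to_bit_vector(transaction, urls):
    """Encode a transaction as a 0/1 vector over a fixed ordering of URLs.

    Position i is 1 if urls[i] occurs in the transaction and 0 otherwise.
    """
    members = set(transaction)
    return [1 if url in members else 0 for url in urls]
```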


To cluster transactions we need a measure of distance between two transactions. Given two transactions t and s, we denote the similarity between them by sim(t,s). A variety of measures can be used to compute similarity. In this work we use the normalized cosine of the angle between the two vectors.
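
A minimal sketch of this measure for two equal-length transaction vectors is given below. Returning 0.0 when either vector is all zeros is a convention assumed here; it should not arise in practice if transactions are non-empty.

```python
import math


def cosine_similarity(t, s):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(t, s))
    norm = math.sqrt(sum(a * a for a in t)) * math.sqrt(sum(b * b for b in s))
    # Assumed convention: similarity with an all-zero vector is 0.
    return dot / norm if norm else 0.0
```

For binary vectors the dot product is simply the number of URLs the two transactions share, and each norm is the square root of the number of distinct URLs in that transaction.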


Note that we do not take into account the temporal order of URL references within transactions. While such a constraint can be easily added to the processes of deriving user transactions and clustering of transactions, sequential navigational patterns [SF99] seem to play a more important role when the purpose of Web usage mining is to improve the quality of the site design and the flow of traffic rather than to provide dynamic recommendations for users.
