Based on the above definition, a similarity matrix can be computed and used as the basis for clustering. However, distance-based clustering methods often break down when handling very high-dimensional data. In the context of clustering user transactions, the dimensions correspond to the URLs appearing in the access logs, and these usually number in the thousands. Our experiments suggest that filtering the logs to remove references to low-support URLs provides an effective dimensionality reduction method while actually improving clustering results. Additionally, filtering out URL references to pages that are no longer part of the site graph is accomplished during the preprocessing stage.
A variety of clustering techniques can be used for clustering transactions. For our experiments, we use a multivariate k-means algorithm to obtain transaction clusters. Regardless of which method is used, transaction clustering will result in a set C = {c1, c2, …, ck} of clusters, where each ci is a subset of T, i.e., a set of user transactions. Essentially, each cluster represents a group of users with "similar" access patterns. However, transaction clusters by themselves are not an effective means of capturing an aggregated view of common user profiles. Each transaction cluster may potentially contain thousands of user transactions involving hundreds of URL references. Our ultimate goal in clustering user transactions is to reduce these clusters into URL clusters. The URL clusters associated with each transaction cluster will serve as an aggregated view and a representative of the users whose behavior is captured in the transaction clusters.
For each cluster c ∈ C, we compute the mean vector mc. The mean value for each URL in the mean vector is computed as the ratio of the number of occurrences of that URL across transactions in the cluster to the total number of transactions in the cluster. The mean vector can be viewed as a set of URLs, each having an associated weight corresponding to the mean value for that URL in the mean vector. We then filter out low-support URLs, i.e., those with a mean value below a certain threshold m. Thus, the URL cluster associated with a transaction cluster c is the set of all URLs whose weight is greater than or equal to m. For example, if the threshold m is set at 0.5, then each URL cluster will contain only those URL references which appear in at least 50% of the transactions within its associated transaction cluster. Each URL cluster, in turn, can be represented as a vector in the original n-dimensional space.
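As a concrete illustration, the reduction of a transaction cluster to a URL cluster can be sketched as follows (the data and function names here are hypothetical, not part of the system described in the text):

```python
# Reduce one transaction cluster to a URL cluster: compute the mean
# vector (fraction of the cluster's transactions containing each URL)
# and keep only the URLs whose mean value reaches the threshold m.
# Data and names are illustrative.

def url_cluster(transactions, m):
    """transactions: list of sets of URLs forming one transaction cluster;
    m: minimum mean value a URL must reach to be kept."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for url in t:
            counts[url] = counts.get(url, 0) + 1
    mean = {url: c / n for url, c in counts.items()}   # the mean vector
    return {url: w for url, w in mean.items() if w >= m}

cluster = [{"A", "B"}, {"A", "C"}, {"A", "B", "D"}, {"B", "C"}]
print(url_cluster(cluster, 0.5))   # A and B (0.75) and C (exactly 0.5) survive
```

With m = 0.5, the surviving URLs are exactly those appearing in at least 50% of the cluster's transactions, as described above.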
Once the representative URL clusters have been computed, a partial session for the current user (the active session) can be assigned to a matching cluster. We must then determine which URLs from this cluster should be provided to the user as recommendations. The details of the matching and recommendation process are discussed in the next section.
Usage Clusters
Another approach we consider is to directly compute clusters of URL references based on how often they occur together across user transactions (rather than clustering the transactions themselves). We call the URL clusters obtained in this way usage clusters. In general, this technique will result in a different type of URL cluster compared to the transaction clustering technique. The URL clusters obtained by reducing transaction clusters group together pages that commonly co-occur across "similar" transactions. On the other hand, usage clusters tend to group together frequently co-occurring items across transactions, even if these transactions are themselves not deemed to be similar. This allows us to obtain clusters that potentially capture overlapping interests of different types of users. Which type of cluster is most appropriate for personalization tasks remains an open research question; the answer, in part, depends on the structure and content of the specific site. We revisit this issue in the discussion of our experimental results.
However, traditional clustering techniques, such as distance-based methods, generally cannot handle this type of clustering. The reason is that instead of using URLs as features, the transactions must be used as features, and their number ranges from tens to hundreds of thousands in a typical application. Furthermore, dimensionality reduction in this context may not be appropriate, as removing a significant number of transactions as features may lose too much information.
We have found that the Association Rule Hypergraph Partitioning (ARHP) technique [HKKM97, HKKM98] is well-suited for this task, especially since it provides automatic filtering capabilities and does not require distance computations. Furthermore, ARHP is particularly useful for efficiently clustering high-dimensional data sets without requiring dimensionality reduction as a preprocessing step. ARHP has been used successfully in a variety of domains, including the categorization of Web documents [HBG+99].
The set I of frequent itemsets discovered in the mining stage is used as the set of hyperedges to form a hypergraph H = <V, E>, where V ⊆ U and E ⊆ I. A hypergraph is an extension of a graph in the sense that each hyperedge can connect more than two vertices. The weight associated with each hyperedge is computed based on the confidence of the association rules involving the items in the frequent itemset. Note that additional constraints can be placed on the itemsets in I to obtain E. For example, we may be interested in itemsets satisfying criteria such as minimum support, minimum size, or temporal constraints among items within the itemset. The hypergraph H is then partitioned into a set of clusters C = {c1, c2, …, ch}. The similarity among items is captured implicitly by the frequent itemsets. Each cluster ci represents a group of items (URLs) that are very frequently accessed together across transactions.
Each cluster is examined to filter out vertices that are not highly connected to the rest of the vertices of the partition. The connectivity function of a vertex v (a URL appearing in the frequent itemsets) with respect to a cluster c is defined as follows:
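The formula itself did not survive in this copy. Based on the description that follows (the percentage of the partition's hyperedges with which the vertex is associated), a plausible reconstruction is:

```latex
\mathrm{conn}(v, c) \;=\;
\frac{\bigl|\{\, e \in E_c : v \in e \,\}\bigr|}{\bigl|E_c\bigr|},
\qquad E_c = \text{the set of hyperedges assigned to partition } c
```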
Connectivity measures the percentage of edges with which a vertex is associated. A high connectivity value suggests that the vertex has many edges, connecting a good proportion of the vertices in the partition. The vertices with a connectivity measure greater than a given threshold value are considered to belong to the partition, and the remaining vertices are dropped from the partition. In the ARHP method, additional filtering of non-relevant items can be achieved using the support criteria in the association rule discovery component of the algorithm: depending on the support threshold, items that do not meet the minimum support (i.e., URLs that do not appear often enough in transactions together with other URLs) will be pruned.
The connectivity value of an item (URL) defined above is also important because it is used as the weight associated with that item in the cluster. As noted in the case of transaction clustering, the weights associated with URLs in each cluster are used as part of the recommendation process when clusters are matched against an active user session.
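The filtering step can be sketched in code. Here conn(v, c) is taken to be the fraction of the partition's hyperedges that contain v — an assumption consistent with the prose description, not necessarily the paper's exact formula — and the hyperedge data is illustrative:

```python
# Filter a hypergraph partition by vertex connectivity. Each hyperedge is
# a frequent itemset of URLs; conn(v, c) is approximated as the fraction
# of the partition's hyperedges containing v. The surviving connectivity
# values double as the URL weights used later for cluster matching.

def connectivity(v, edges):
    return sum(1 for e in edges if v in e) / len(edges)

def filter_partition(edges, threshold):
    vertices = set().union(*edges)
    weights = {v: connectivity(v, edges) for v in vertices}
    # drop weakly connected vertices, keep the rest with their weights
    return {v: w for v, w in weights.items() if w >= threshold}

edges = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "D"}]
print(filter_partition(edges, 0.5))   # D (0.25) is dropped
```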
The Recommendation Process
The recommendation engine is the online component of a Web personalization system based on usage mining. The task of the recommendation engine is to compute a recommendation set for the current session, consisting of links to pages that the user may want to visit based on similar usage patterns. The recommendation set essentially represents a "short-term" view of potentially useful links based on the user's navigational activity through the site. These recommended links are then added to the last page in the session accessed by the user before that page is sent to the user's browser.
In general, there are several factors that we would like to consider in determining the recommendation set. These factors include:
the matching criteria for each cluster or frequent itemset, based on its similarity to the current active session;
whether the candidate URLs for recommendation have already been visited by the user in the current active session;
a short-term history depth for the current user, representing the portion of the user's activity history that we should consider relevant for the purpose of making recommendations; and
the length of the physical link path from the current active session window to the candidate URL for recommendation.
Using a fixed-size sliding window over the current active session can capture the current user's history depth. For example, if the current session (with a window size of 3) is <A, B, C>, and the user references the URL D, then the new active session becomes <B, C, D>. This makes sense in the context of personalization, since most users go back and forth while navigating a site to find the desired information. For example, a user might navigate to a particular "content" page along a path of a certain length before deciding to go back and follow a different path to another content page representing an independent piece of information. In many cases these sub-sessions have a length of no more than 2 or 3 references. In such a situation, it may not be valuable to use references a user made in a previous sub-session to make recommendations during the current sub-session.
Thus, a sliding window of size n over the active session allows only the last n visited pages to influence the recommendation value of items in the recommendation set. In this context, the notion of a sliding session window is similar to the notion of N-grams discussed in [Cha96]. An important question worth further investigation is how to determine the optimal window size automatically. Our experiments indicate that the average user session length may be a good measure for determining the window size.
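The sliding window itself is straightforward to realize; a minimal sketch using a bounded deque (the window size of 3 mirrors the <A, B, C> example above):

```python
from collections import deque

# A fixed-size sliding window over the active session: appending a new
# page reference automatically evicts the oldest one.
window = deque(maxlen=3)
for url in ["A", "B", "C", "D"]:
    window.append(url)
print(list(window))   # ['B', 'C', 'D'], matching the example in the text
```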
The length of the physical link path can be determined by maintaining a directed graph representing the topology of the site. Each node in the graph is a URL representing a page in the site. There is a directed edge from a node x to a node y if there is a physical link from the page corresponding to x to the page corresponding to y. We would like to weight a candidate URL higher (for the purpose of recommendation) if it is farther away from the current active session. In a sense, we would consider such a candidate URL a better recommendation than one in close proximity to the user's current location.
The physical link distance between two URLs u1 and u2 is the length of a minimal path from u1 to u2 in the site graph described above. Given a site graph G = (V, E), where the set of vertices V is a subset of the set of URLs, we denote the physical link distance between the active session s and a URL u ∉ s by dist(u, s, G), which is defined as the smallest physical link distance between u and any of the URLs in s. Then, the link distance factor of u with respect to s is defined as:
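The definition of ldf did not survive in this copy. A plausible reconstruction, consistent with the two properties stated next (ldf is 0 for URLs already in the session, and the distance is damped by a logarithm; the exact form, e.g. the offset inside the log, is an assumption), is:

```latex
\mathrm{ldf}(u, s) =
\begin{cases}
\log\bigl(1 + \mathrm{dist}(u, s, G)\bigr), & u \notin s\\
0, & u \in s
\end{cases}
```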
If the URL u is in the active session, then ldf(u, s) is taken to be 0. We take the log of the link distance so that it does not count too heavily compared to the item weights within clusters or the frequent itemsets.
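A sketch of computing dist(u, s, G) by breadth-first search over the site graph, combined with a logarithmic damping of the kind described above (the graph data, the adjacency-list encoding, and the 1 + dist offset are all illustrative assumptions):

```python
import math
from collections import deque

# dist(u, s, G): smallest number of links from any URL in the active
# session s to the candidate URL u, found by BFS on the site graph.
def dist(u, s, graph):
    frontier = deque((start, 0) for start in s)
    seen = set(s)
    while frontier:
        node, d = frontier.popleft()
        if node == u:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return math.inf   # u is unreachable from the session

def ldf(u, s, graph):
    # 0 for URLs already in the session; log-damped distance otherwise
    return 0.0 if u in s else math.log(1 + dist(u, s, graph))

site = {"A": ["B"], "B": ["C"], "C": ["D"]}
print(dist("D", {"A", "B"}, site))   # 2 (B -> C -> D)
```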
Computing Recommendations Directly from Frequent Itemsets
An efficient method for computing recommendation sets is to directly utilize the frequent itemsets obtained in the offline usage mining stage. After identifying user sessions in the preprocessing steps and then discovering frequent itemsets of URLs, we match the current user session window against the itemsets to find candidate URLs for recommendation. Given a session window of size w, we consider only itemsets of size w + 1 that satisfy a specified support threshold and contain the current session window. The recommendation value of each candidate URL is based on the confidence of the corresponding association rule whose consequent is the singleton containing the candidate URL. If the rule satisfies a specified confidence threshold, then the candidate URL is added to the recommendation set.
The basic outline of the algorithm is as follows:
Input: an active session window w of fixed maximum size n
       the site graph G (for computing the link distance factor)
       minimum support threshold σ
       minimum confidence threshold α

Recommend ← ∅
for each frequent itemset I of size |w| + 1 such that w ⊆ I do
    let u be the single URL in I − w
    if support(I) ≥ σ then
        let c = confidence(w ⇒ {u})
        if c ≥ α then
            u.rec_score ← c × ldf(u, w)
            Recommend ← Recommend ∪ {u}
        end if
    end if
end for
Figure 2. Algorithm for Computing Recommendations Using Frequent Itemsets
Note that the search for itemsets (of size |w| + 1) containing the current session window can be done efficiently, since during the mining process the discovered itemsets can be organized into a lexicographic tree [AAP99]. Each node at depth d in the tree corresponds to an itemset I of size d, with its children being itemsets of size d + 1 that contain I.
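Under some simplifying assumptions — frequent itemsets stored in a flat dictionary mapping itemsets to support counts (rather than the lexicographic tree), and the link distance factor passed in as a function — the algorithm of Figure 2 can be sketched as:

```python
# Sketch of the Figure 2 algorithm: for each frequent itemset I of size
# |w| + 1 containing the window w, the extra URL u is a candidate; u is
# recommended if support(I) and confidence(w => {u}) meet the thresholds.
# The itemset store and ldf here are simplified stand-ins, not the
# paper's actual data structures.

def recommend(window, itemsets, sigma, alpha, ldf):
    """window: set of URLs; itemsets: dict frozenset -> support count;
    sigma: minimum support; alpha: minimum confidence."""
    w = frozenset(window)
    scores = {}
    for I, support in itemsets.items():
        if len(I) == len(w) + 1 and w < I and support >= sigma:
            (u,) = I - w                        # the single candidate URL
            confidence = support / itemsets[w]  # conf of rule w => {u}
            if confidence >= alpha:
                scores[u] = confidence * ldf(u, w)
    return scores

itemsets = {frozenset("AB"): 40, frozenset("ABC"): 30, frozenset("ABD"): 10}
scores = recommend({"A", "B"}, itemsets, sigma=20, alpha=0.5,
                   ldf=lambda u, w: 1.0)
print(scores)   # {'C': 0.75}: ABD fails support; conf(AB => C) = 30/40
```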
It should be noted that, depending on the support threshold used in the above algorithm, it might be difficult to find large enough itemsets to use for providing recommendations. This is particularly true for sites with very small average session sizes. An alternative to reducing the support threshold in such cases is to reduce the session window size. This latter choice may itself lead to some undesired effects, since we may not be taking enough of the user's activity history into account. This kind of situation is further examined in our experimental results section below.
Computing Recommendations Based on URL Clusters
Either of the clustering methods described earlier (usage clustering based on hypergraph partitioning, or transaction clustering followed by computing the cluster mean values) will ultimately result in a set of clusters, each containing URLs that are highly related with respect to user transactions. Depending on the clustering method used, or the mechanism for reducing transaction clusters into a set of URLs, a weight will be associated with each URL u in a cluster c, which we denote by weight(u, c). Recall that in the case of transaction clustering, the weight for a URL is its mean value in the cluster mean vector mc. In the case of usage clusters obtained using the ARHP method, the weight is the connectivity value of the item within the cluster, conn(u, c).
Each of these URL clusters can be viewed as a virtual user profile indicating how various groups of users may access a set of links in the site within their respective transactions. Once a new user starts a session, our goal is to match, at each step, the partial user session with the appropriate clusters and provide recommendations to the user. The recommendations are presented as a set of links dynamically added to the top of the page corresponding to the last URL reference made by the user. So, the first task is to obtain a matching score for each cluster based on the cluster's similarity to the current active session.
We represent each cluster c ∈ C as a vector c = <w1, w2, …, wn>, where wi = weight(URLi, c) if URLi belongs to the URL cluster, and wi = 0 otherwise. Furthermore, the current active session s is also represented as a vector s = <s1, s2, …, sn>, where si = 1 if the user has accessed URLi in this session, and si = 0 otherwise. The cluster matching score is now computed as follows:
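The matching-score formula is missing from this copy. A plausible reconstruction, consistent with the weight(u, c) and si definitions above and with the normalization discussed next (the exact form may differ from the original), is:

```latex
\mathrm{match}(s, c) =
\frac{\sum_{i} \mathrm{weight}(URL_i, c)\, s_i}
     {\sqrt{\,|s| \cdot \sum_{i} \mathrm{weight}(URL_i, c)^2\,}}
```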
In computing the matching score, we normalize for the size of the clusters and the active session.