Website Data Mining

Collaborative filtering systems that are based on user feedback have two limitations. First, they rely heavily on explicit feedback which has several drawbacks. Second, since the user models are static, recommendations become inaccurate as the user models age. These systems are therefore only employed in subjective domains where the user’s interests stay relatively unchanged and where the user perceives a direct benefit from rating items.

A number of approaches to website data mining have been developed that can be used by collaborative filtering systems to make the recommendation task both automatic and dynamic. The data that these web mining techniques use consist of website data of users navigating through a web site.


A transaction is a list of web pages a user has visited on a web site during a continuous period of time. There are several ways in which the list of web pages can be chosen. The list could contain all the content pages a user has visited during a user session, a time interval in which the user browses through the web site. A list could also consist of one content page and all the navigation pages a user needed to navigate to the content page. Here it is assumed that a transaction is the list of all the web pages a user has visited during a user session.

Transaction Clustering

Transactions can be represented as vectors much like documents are represented in the vector space model. If there are m web pages, a transaction T is represented as an m-dimensional vector, where each dimension i corresponds to weight  assigned to the i-th page. Hence, a transaction T is written as .

Several different weighting schemes can be used. One weighting scheme is to use binary weights where the weight is 1 if the user has visited the page during a session and 0 otherwise. Other weighting schemes include the time users spend reading a page and the number of times a page is accessed during a session.

The cosine measure can be used to calculate the similarity between transaction  and :

Examples of techniques that can be used for transaction clustering are the k-means algorithm and the single pass algorithm. Regardless of which algorithm is used, the input is always a set  of m transaction vectors and the output is always a set  of n clusters. A cluster is a set of transaction vectors representing a group of users with similar access patterns. Transaction clusters themselves however are not suitable for predicting future access patterns since each cluster may contain hundreds of transactions and web pages. After the transactions are grouped into clusters, the centroid of every vector is therefore computed. Pages that appear infrequently in a cluster are sometimes left out by setting their corresponding value in the centroid to zero if a value is below a certain threshold.

Note: This representation method does not take the order in which pages are accessed into account. Although methods exist that use page order, they seem to play a more important role in systems that improve the structure of the web site rather than in systems that provide dynamic recommendations.

Computing Recommendations

The clustering technique described in the previous section provides a set of clusters that represent similar access patterns among users of a web site. These clusters can be compared with the session of an active user in order to find web pages the user hasn’t visited yet but other users with similar access patterns have. The active user session is maintained as a transaction vector P and has to be updated when the user requests a new page. To recommend pages, vector P is compared with the centroids of all the clusters and the pages in the most similar clusters that have not been visited are selected. The selected pages can be ordered by using the similarity values and the centroid weights.

Leave a Reply