Web mining is closely related to data mining, a process that discovers knowledge from large amounts of data without human interference. The term web mining is used when knowledge is discovered from internet data sources. Information filtering systems use web mining techniques for two types of web data. Content-based filtering systems abstract knowledge from web documents while collaborative filtering systems use information about web users.
Several web mining techniques use clustering methods to enhance information filtering systems. Clustering is an unsupervised learning technique that divides a collection of items into groups.
Clustering methods can follow several approaches to group items into clusters. Two examples are the iterative and the hierarchical agglomerative approach. In the iterative approach the items are assigned to a cluster one by one. In the hierarchical agglomerative approach the items are linked together to form a tree structure. Because a hierarchical agglomerative method compares every item to all other items it takes more time than an iterative method. Iterative methods however are dependent upon the order in which the items are received and do not always produce the same result for the same set of items. The two approaches are described in more detail in the next sections.
Most clustering methods require a similarity measure to compare items. Because the items mentioned here are all represented as vectors they can be compared by using the similarity measures of the vector space model. The similarity between two clusters can be determined in several ways. The least (or most) similar pair of items from the two clusters can be compared for example. Another approach is to compare the mean vector of each cluster which is called the centroid.
Hierarchical Agglomerative Clustering
This method is called hierarchical agglomerative because the result of the clustering is a hierarchy. The method starts by constructing an n x n matrix, where n is the total number of items, in which the similarity between every item in the collection is stored. Every item is then placed into its own cluster and the two most similar clusters are combined into a new cluster. After a new cluster is created, the similarity between the new cluster and the other clusters is computed. This process is repeated until only one cluster at the highest level remains.
The fastest and simplest iterative method is the single pass method. In this method each item is processed only once. Every time a new item is processed the similarity between the item and all existing clusters is computed. The item is assigned to the closest cluster or to a new cluster if no existing cluster is similar enough.
A more complex iterative method is the k-means clustering algorithm. This algorithm begins with a set of k empty clusters and selects an item (usually at random) that serves as a prototype for every cluster. The items are then added to a cluster based on their similarity with the prototype of a cluster. After al the items are assigned to a cluster a new prototype is constructed for every cluster, by determining the centroid of every cluster. The clusters are then emptied and the items are added to the clusters again. This process is repeated until the clusters or the prototypes do not change significantly.
Information Filtering Enhancements
Several web mining techniques exist that could improve information filtering systems. Although these techniques are all based on clustering methods, their purpose varies widely. Two potential enhancements for content-based filtering are documents clustering and using a thesaurus. With mining website data more dynamic user models for collaborative filtering systems are created. This increases the number of domains in which collaborative techniques can be used for filtering. Although the technique is usually based on the information found in access logs it can easily be extended with other data. These include explicit ratings and the content representation of the web pages. With a content representation a hybrid information filtering system could be created that is highly dynamic.