Content-based filtering, also referred to as cognitive filtering, recommends items based on a comparison between the content of the items and a user profile. The content of each item is represented as a set of descriptors or terms, typically the words that occur in a document. The user profile is represented with the same terms and built up by analyzing the content of items which have been seen by the user.
Several issues have to be considered when implementing a content-based filtering system. First, terms can either be assigned automatically or manually. When terms are assigned automatically a method has to be chosen that can extract these terms from items. Second, the terms have to be represented such that both the user profile and the items can be compared in a meaningful way. Third, a learning algorithm has to be chosen that is able to learn the user profile based on seen items and can make recommendations based on this user profile.
The information source that content-based filtering systems are mostly used with are text documents. A standard approach for term parsing selects single words from documents. The vector space model and latent semantic indexing are two methods that use these terms to represent documents as vectors in a multi dimensional space.
Relevance feedback, genetic algorithms, neural networks, and the Bayesian classifier are among the learning techniques for learning a user profile. The vector space model and latent semantic indexing can both be used by these learning methods to represent documents. Some of the learning methods also represent the user profile as one or more vectors in the same multi dimensional space which makes it easy to compare documents and profiles. Other learning methods such as the Bayesian classifier and neural networks do not use this space but represent the user profile in their own way.
Choosing a Learning Method
The efficiency of a learning method does play an important role in the decision of which method to choose. The most important aspect of efficiency is the computational complexity of the algorithm, although storage requirements can also become an issue as many user profiles have to be maintained. Neural networks and genetic algorithms are usually much slower compared to other learning methods as several iterations are needed to determine whether or not a document is relevant. Instance based methods slow down as more training examples become available because every example has to be compared to all the unseen documents. Among the best performers in terms of speed are the Bayesian classifier and relevance feedback.
The ability of a learning method to adapt to changes in the user’s preferences also plays an important role. The learning method has to be able to evaluate the training data as instances do not last forever but become obsolete as the user’s interests change. Another criteria is the number of training instances needed. A learning method that requires many training instances before it is able to make accurate predictions is only useful when the user’s interests remain constant for a long period of time. The Bayesian classifier does not do well here. There are many training instances needed before the probabilities will become accurate enough to base a prediction on. Conversely, a relevance feedback method and a nearest neighbor method that uses a notion of distance can start making suggestions with only one training instance.
Learning methods also differ in their ability to modulate the training data as instances age. In the nearest neighbor method and in a genetic algorithm old training instances will have to be removed entirely. The user models employed by relevance feedback methods and neural networks can be adjusted more smoothly by reducing weights of corresponding terms or nodes.
The learning methods applied to content-based filtering try to find the most relevant documents based on the user’s behavior in the past. Such approach however restricts the user to documents similar to those already seen. This is known as the over-specialization problem. As stated before the interests of a user are rarely static but change over time. Instead of adapting to the user’s interests after the system has received feedback one could try to predict a user’s interests in the future and recommend documents that contain information that is entirely new to the user.
A recommender system has to decide between two types of information delivery when providing the user with recommendations:
- Exploitation. The system chooses documents similar to those for which the user has already expressed a preference.
- Exploration. The system chooses documents where the user profile does not provide evidence to predict the user’s reaction.