Content-based Filtering

Content-based filtering, also referred to as cognitive filtering, recommends items based on a comparison between the content of the items and a user profile. The content of each item is represented as a set of descriptors or terms, typically the words that occur in a document. The user profile is represented with the same terms and built up by analyzing the content of items which have been seen by the user.

Several issues have to be considered when implementing a content-based filtering system. First, terms can either be assigned automatically or manually. When terms are assigned automatically a method has to be chosen that can extract these terms from items. Second, the terms have to be represented such that both the user profile and the items can be compared in a meaningful way. Third, a learning algorithm has to be chosen that is able to learn the user profile based on seen items and can make recommendations based on this user profile.

Content-based filtering systems are most often applied to text documents. A standard approach to term parsing selects single words from documents. The vector space model and latent semantic indexing are two methods that use these terms to represent documents as vectors in a multidimensional space.
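To make the vector representation concrete, here is a minimal sketch of the vector space model using tf-idf weights and cosine similarity. The tokenizer, the exact weighting formula (raw term count times log inverse document frequency), and the sample documents are illustrative assumptions rather than anything prescribed above; real systems usually add stemming, stop-word removal, and smoothing.

```python
import math
from collections import Counter

def tokenize(text):
    """Lower-case the text and split on whitespace (a deliberately naive parser)."""
    return text.lower().split()

def tfidf_vectors(docs):
    """Map each document to a sparse {term: weight} vector with weight = tf * idf."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sample documents, for illustration only.
docs = [
    "space shuttle launch orbit",
    "rocket launch orbit mission",
    "pasta recipe tomato sauce",
]
vectors = tfidf_vectors(docs)
sim_space = cosine(vectors[0], vectors[1])  # the two documents share "launch" and "orbit"
sim_food = cosine(vectors[0], vectors[2])   # the two documents share no terms
```

The two space-related documents end up closer to each other than to the recipe, which is the property the learning methods below rely on.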

Relevance feedback, genetic algorithms, neural networks, and the Bayesian classifier are among the techniques for learning a user profile. The vector space model and latent semantic indexing can both be used by these learning methods to represent documents. Some of the learning methods also represent the user profile as one or more vectors in the same multidimensional space, which makes it easy to compare documents and profiles. Other learning methods, such as the Bayesian classifier and neural networks, do not use this space but represent the user profile in their own way.
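As an illustration of a profile that lives in the same space as the documents, the sketch below uses a Rocchio-style update, one common formulation of relevance feedback. The choice of Rocchio, the coefficient values, and the sample term weights are assumptions made for this example; the text above does not commit to a specific formula.

```python
def rocchio_update(profile, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*profile + beta*mean(relevant) - gamma*mean(non_relevant).

    All vectors are sparse {term: weight} dicts. The coefficients are
    conventional illustrative defaults, not values taken from the article.
    """
    terms = set(profile)
    for doc in relevant + non_relevant:
        terms.update(doc)
    updated = {}
    for term in terms:
        pos = sum(d.get(term, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(term, 0.0) for d in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * profile.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0.0:  # non-positive weights are dropped in this sketch
            updated[term] = weight
    return updated

# Hypothetical feedback: one liked and one disliked document.
liked = [{"python": 1.0, "tutorial": 0.5}]
disliked = [{"celebrity": 1.0, "tutorial": 0.5}]
profile = rocchio_update({}, liked, disliked)
```

Because the resulting profile is just another term vector, it can be compared to unseen documents with the same similarity measure used between documents.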

Choosing a Learning Method

The efficiency of a learning method plays an important role in deciding which method to choose. The most important aspect of efficiency is the computational complexity of the algorithm, although storage requirements can also become an issue because many user profiles have to be maintained. Neural networks and genetic algorithms are usually much slower than other learning methods, as several iterations are needed to determine whether or not a document is relevant. Instance-based methods slow down as more training examples become available, because every example has to be compared to all the unseen documents. Among the best performers in terms of speed are the Bayesian classifier and relevance feedback.

The ability of a learning method to adapt to changes in the user’s preferences also plays an important role. The learning method has to be able to evaluate the training data, as instances do not last forever but become obsolete as the user’s interests change. Another criterion is the number of training instances needed. A learning method that requires many training instances before it is able to make accurate predictions is only useful when the user’s interests remain constant for a long period of time. The Bayesian classifier does not do well here: many training instances are needed before the probabilities become accurate enough to base a prediction on. Conversely, a relevance feedback method and a nearest neighbor method that uses a notion of distance can start making suggestions with only one training instance.
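The claim that a distance-based nearest neighbor method can work from a single training instance can be sketched as follows. The scoring rule (similarity to the closest rated example, signed by that example's rating), the +1/-1 rating convention, and the sample vectors are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def predict(candidate, rated_examples):
    """Score a candidate by its closest rated example.

    rated_examples is a list of (vector, rating) pairs with rating +1 (liked)
    or -1 (disliked). Every example is compared with the candidate, which is
    why this approach slows down as examples accumulate.
    """
    sim, rating = max(
        ((cosine(candidate, vec), rating) for vec, rating in rated_examples),
        key=lambda pair: pair[0],
    )
    return sim * rating

# A single liked document is already enough to rank unseen ones.
examples = [({"jazz": 1.0, "piano": 1.0}, 1)]
score_similar = predict({"jazz": 1.0, "saxophone": 1.0}, examples)
score_unrelated = predict({"football": 1.0}, examples)
```

With one example the scores are crude, but they already separate documents that overlap with the liked item from those that do not.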

Learning methods also differ in their ability to modulate the training data as instances age. In the nearest neighbor method and in a genetic algorithm old training instances will have to be removed entirely. The user models employed by relevance feedback methods and neural networks can be adjusted more smoothly by reducing weights of corresponding terms or nodes.
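The two ways of aging training data can be contrasted in a short sketch: hard removal of old instances, as an instance-based method would need, versus a gradual reduction of term weights in a vector profile. The function names, the decay factor, the cutoff, and the integer timestamps are hypothetical choices made for this example.

```python
def drop_old_instances(instances, now, max_age):
    """Hard forgetting: keep only (timestamp, vector) pairs no older than max_age."""
    return [(ts, vec) for ts, vec in instances if now - ts <= max_age]

def decay_profile(profile, factor=0.9, floor=1e-3):
    """Soft forgetting: shrink every term weight by `factor`.

    Terms whose decayed weight falls below `floor` are dropped so the
    profile does not accumulate negligible entries.
    """
    return {t: w * factor for t, w in profile.items() if w * factor >= floor}

# Hypothetical data: timestamps are plain integers (for example, days).
instances = [(1, {"jazz": 1.0}), (8, {"opera": 1.0})]
recent = drop_old_instances(instances, now=10, max_age=5)

profile = {"jazz": 1.0, "opera": 0.001}
aged = decay_profile(profile, factor=0.5)
```

Applying the decay before each feedback update lets recent feedback outweigh older feedback without any instance being discarded abruptly.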

Exploration Strategies

The learning methods applied to content-based filtering try to find the most relevant documents based on the user’s past behavior. Such an approach, however, restricts the user to documents similar to those already seen. This is known as the over-specialization problem. As stated before, the interests of a user are rarely static but change over time. Instead of adapting to the user’s interests only after the system has received feedback, one could try to predict the user’s future interests and recommend documents that contain information that is entirely new to the user.

A recommender system has to decide between two types of information delivery when providing the user with recommendations:

  • Exploitation. The system chooses documents similar to those for which the user has already expressed a preference.
  • Exploration. The system chooses documents where the user profile does not provide evidence to predict the user’s reaction.
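One simple way to mix the two delivery types is an epsilon-greedy rule: explore with a small probability, otherwise exploit. This strategy is an assumption added here for illustration and is not named in the text above; the scoring function, the definition of "no evidence" (no term shared with the profile), and the sample data are likewise hypothetical.

```python
import random

def overlap_score(doc_terms, profile):
    """Sum of profile weights for the terms the document contains."""
    return sum(profile.get(term, 0.0) for term in doc_terms)

def recommend(docs, profile, epsilon=0.2, rng=random):
    """Pick one document id from docs, a {doc_id: set_of_terms} mapping.

    With probability epsilon, explore: choose at random among documents that
    share no term with the profile, so the profile gives no evidence about them.
    Otherwise exploit: choose the document with the highest overlap score.
    """
    unknown = [doc_id for doc_id, terms in docs.items()
               if not any(t in profile for t in terms)]
    if unknown and rng.random() < epsilon:
        return rng.choice(unknown)  # exploration
    return max(docs, key=lambda doc_id: overlap_score(docs[doc_id], profile))  # exploitation

docs = {
    "d1": {"jazz", "piano"},
    "d2": {"jazz", "festival"},
    "d3": {"gardening", "roses"},
}
profile = {"jazz": 1.0, "piano": 0.5}

exploit_pick = recommend(docs, profile, epsilon=0.0)
explore_pick = recommend(docs, profile, epsilon=1.0, rng=random.Random(0))
```

Setting epsilon to 0 gives pure exploitation, while larger values trade short-term relevance for a chance to discover interests the profile does not yet cover.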

10 Responses to "Content-based Filtering"

  1. DIma says:


    Actually, I am searching for the source code of the algorithm
    Do you have any idea about how to get it?

    Thanks in advance

    • Kayode says:

      Hi, if you do not understand how the algorithm works looking for the source code won’t help.

      I’d advise you to build the algorithm yourself, as each case is unique and will have its own problems.

  2. siddu says:

    I am working on filtering unwanted messages from online social network users’ walls using “content-based filtering”, and I need an algorithm for that. Please send a content-based filtering algorithm.

  3. Nisy says:

    I want to know the various algorithms for content based message filtering and its source code. Can anyone help me?

  4. Sunny says:

    Can content-based filtering be achieved with MLlib in Scala? Please point to any references…

    • arun says:

      Hi Sunny,
      I have the same question. Did you find the answer?
      What are the various algorithms for content-based recommendations?

      • Premvardhan kumar says:

        There are various algorithms for content-based filtering; for example, you can use bag-of-words, tf-idf, word2vec, etc. This is probably no longer useful to you, since you asked the question long ago, but many of us can benefit from it.

  5. Amrutha says:

    I want to know the various algorithms for content based message filtering and its source code.
    please provide at least with the algorithm

  6. LuigiBlu says:

    I am trying to update preferences.

    That is, I have a user u with a preference vector P consisting of a triple (alpha, beta, gamma), together with the user’s history.

    I suggest a list to user u in which each element is characterized by features (say a, b, c). The rank of an element e is given by e(a) * alpha + e(b) * beta + e(c) * gamma.

    When user u sends me feedback, namely the element e’ that was chosen (which is not the first in the list), how do I correct the initial preference vector?

  7. Daniel says:

    Hi! I tried your program and these numbers appeared.

    2164: 0.1460
    2024: 0.0488
    453: 0.0359
    141: 0.0210
    641: 0.0196

    The numbers on the left are the IDs of the movies, right?
    But what are the numbers on the right side (the decimals)?
    Do the numbers on the right represent the ratings for each movie?
