A representation that is often used for text documents is the vector space model. In this model a document $D$ is represented as an $m$-dimensional vector, where each dimension corresponds to a distinct term and $m$ is the total number of distinct terms used in the collection of documents. The document vector is written as $D = (w_1, \dots, w_m)$, where $w_i$ is the weight of term $t_i$ and indicates its importance. If document $D$ does not contain term $t_i$, then the weight $w_i$ is zero.
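As a minimal sketch of this representation, the snippet below builds a vocabulary from a toy collection and maps each document to a vector with one dimension per distinct term. The helper names (`build_vocabulary`, `to_vector`), the whitespace tokenization, and the use of raw term counts as placeholder weights are assumptions made for illustration only.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the sorted list of distinct terms in the collection (length m)."""
    return sorted({term for doc in documents for term in doc.lower().split()})


def to_vector(document, vocabulary):
    """Map a document to an m-dimensional vector.

    Raw counts stand in for the weights here (an assumption); a term that
    does not occur in the document gets weight zero.
    """
    counts = Counter(document.lower().split())
    return [counts.get(term, 0) for term in vocabulary]


documents = ["the cat sat on the mat", "the dog chased the cat"]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(doc, vocabulary) for doc in documents]
```

Every vector has the same length `len(vocabulary)`, so documents of different lengths become directly comparable.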
In the Boolean vector approach the term weights simply indicate whether or not a term appears in a document: $w_i$ is 1 if term $t_i$ occurs in the document and 0 otherwise. A more sophisticated measure is the tf-idf scheme. In this approach each term is assigned a weight based on how often it appears in a particular document and how frequently it occurs in the entire document collection. The first part of the tf-idf scheme is the term frequency $tf_i$, the number of occurrences of term $t_i$ in document $D$. The second part is the inverse document frequency $idf_i$, calculated as follows:
$$idf_i = \log\frac{n}{df_i}$$
where $n$ is the total number of documents in the collection and $df_i$ is the number of documents in which term $t_i$ appears at least once. The weight $w_i$ of term $t_i$ in document $D$ is the product of the term frequency and the inverse document frequency:
$$w_i = tf_i \cdot idf_i$$
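The two weighting schemes can be sketched as follows. This is an illustrative implementation under stated assumptions, not a reference one: documents are assumed to be pre-tokenized lists of terms, the natural logarithm is used because the text does not fix a base, and the function names are hypothetical.

```python
import math
from collections import Counter


def boolean_vector(tokens, vocabulary):
    """Boolean weights: 1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) with weights tf * log(n / df)."""
    n = len(tokenized_docs)
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    # df[t]: number of documents in which term t appears at least once
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    # Natural log is an assumption; any base only rescales the weights.
    idf = {t: math.log(n / df[t]) for t in vocabulary}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # a missing term has count 0, hence weight 0
        vectors.append([tf[t] * idf[t] for t in vocabulary])
    return vocabulary, vectors


docs = [["cat", "sat", "cat"], ["dog", "sat"], ["cat", "dog", "bird"]]
vocabulary, vectors = tfidf_vectors(docs)
```

In this toy collection "bird" occurs in one of three documents and so receives the largest idf, while a term that occurred in every document would receive idf 0 and drop out of all vectors.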
The assumptions behind tf-idf are based on two characteristics of text documents. First, the more times a term appears in a document, the more relevant it is to the topic of the document. Second, the more documents in the collection a term occurs in, the more poorly it discriminates between documents.
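A small worked example may make the second assumption concrete. The numbers are hypothetical, and a base-10 logarithm is assumed purely for readability. In a collection of $n = 1000$ documents, a term that appears in 10 documents and a term that appears in 900 documents receive

$$\log_{10}\frac{1000}{10} = 2 \qquad \text{and} \qquad \log_{10}\frac{1000}{900} \approx 0.046$$

respectively, so at equal term frequency the rare term is weighted roughly 44 times more heavily than the common one.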
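Many variations of the tf-idf scheme have been studied. Salton and Buckley [Salton & Buckley 1988] describe a tf-idf weight for information retrieval that performs well when the documents in the collection consist of technical vocabulary and meaningful terms:
$$w_i = \frac{\left(0.5 + 0.5\,\dfrac{tf_i}{tf_{max}}\right)\log\dfrac{n}{df_i}}{\sqrt{\displaystyle\sum_{t_j \in D}\left(0.5 + 0.5\,\dfrac{tf_j}{tf_{max}}\right)^{2}\left(\log\dfrac{n}{df_j}\right)^{2}}}$$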
where $tf_{max}$ is the maximum term frequency over all terms in document $D$.
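The following sketch shows one reading of this variant: an augmented term frequency $0.5 + 0.5\,tf / tf_{max}$ multiplied by the inverse document frequency, optionally followed by length normalization. The function name, the `normalize` flag, the dictionary-based interface, and the natural logarithm are assumptions for illustration; they are not taken from Salton and Buckley.

```python
import math
from collections import Counter


def augmented_tfidf(doc_tokens, df, n, normalize=True):
    """Weights (0.5 + 0.5 * tf / max_tf) * log(n / df) for the terms of one document.

    df maps each term to its document frequency; n is the collection size.
    With normalize=True the weights are divided by the vector's Euclidean length.
    """
    tf = Counter(doc_tokens)
    if not tf:
        return {}
    max_tf = max(tf.values())  # highest term frequency within this document
    weights = {
        t: (0.5 + 0.5 * tf[t] / max_tf) * math.log(n / df[t]) for t in tf
    }
    if normalize:
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
    return weights


# Hypothetical document frequencies for a three-document collection.
df = {"cat": 2, "sat": 2, "dog": 2, "bird": 1}
weights = augmented_tfidf(["cat", "sat", "cat"], df, n=3)
```

Because the term frequency is scaled into the range 0.5 to 1, a term that occurs twice is not weighted twice as heavily as a term that occurs once, which dampens the influence of raw repetition.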
In the vector space model, user profiles can be represented just like documents, by one or more profile vectors. Different measures can be used to determine the degree of similarity between a profile vector $P = (u_1, \dots, u_m)$ and a document vector $D = (w_1, \dots, w_m)$. The most common of these is the cosine measure, where the cosine of the angle between the two vectors is determined as follows:
$$\cos(D, P) = \frac{\sum_{i=1}^{m} w_i u_i}{\sqrt{\sum_{i=1}^{m} w_i^{2}}\,\sqrt{\sum_{i=1}^{m} u_i^{2}}}$$
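A minimal sketch of the cosine measure on plain Python lists is given below. The function name and the convention of returning 0.0 when either vector has zero length are assumptions; the text does not say how that case should be handled.

```python
import math


def cosine_similarity(d, p):
    """Cosine of the angle between document vector d and profile vector p."""
    dot = sum(w * u for w, u in zip(d, p))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_p = math.sqrt(sum(u * u for u in p))
    if norm_d == 0 or norm_p == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_d * norm_p)


document = [1.0, 2.0, 0.0]
profile = [2.0, 1.0, 0.0]
score = cosine_similarity(document, profile)

# Scaling a vector changes its length but not its direction,
# so the cosine score is unaffected.
scaled_score = cosine_similarity([10 * w for w in document], profile)
```

Here the inner product is 4 and both vectors have length $\sqrt{5}$, so the score is $4/5$, and it stays the same when the document vector is multiplied by 10.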
Note that the cosine measure normalizes the inner (or dot) product of the document and profile vectors by their lengths. This prevents larger vectors from producing higher scores only because they have a higher chance of containing similar terms. Other measures, written over the components $w_i$ of $D$ and $u_i$ of $P$, include the Dice coefficient, defined as:
$$\mathrm{Dice}(D, P) = \frac{2\sum_{i=1}^{m} w_i u_i}{\sum_{i=1}^{m} w_i^{2} + \sum_{i=1}^{m} u_i^{2}}$$
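and the Jaccard coefficient, defined as:
$$\mathrm{Jaccard}(D, P) = \frac{\sum_{i=1}^{m} w_i u_i}{\sum_{i=1}^{m} w_i^{2} + \sum_{i=1}^{m} u_i^{2} - \sum_{i=1}^{m} w_i u_i}$$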
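The Dice and Jaccard coefficients can be sketched in the same style, using their usual generalization to weighted vectors. The function names and the choice to return 0.0 for a zero denominator are assumptions made for this illustration.

```python
def dice_coefficient(d, p):
    """Dice coefficient: 2 * (d . p) / (|d|^2 + |p|^2)."""
    dot = sum(w * u for w, u in zip(d, p))
    denom = sum(w * w for w in d) + sum(u * u for u in p)
    return 2 * dot / denom if denom else 0.0  # assumed zero-vector convention


def jaccard_coefficient(d, p):
    """Jaccard coefficient: (d . p) / (|d|^2 + |p|^2 - d . p)."""
    dot = sum(w * u for w, u in zip(d, p))
    denom = sum(w * w for w in d) + sum(u * u for u in p) - dot
    return dot / denom if denom else 0.0  # assumed zero-vector convention


document = [1.0, 2.0, 0.0]
profile = [2.0, 0.0, 1.0]
dice = dice_coefficient(document, profile)
jaccard = jaccard_coefficient(document, profile)
```

For Boolean vectors these reduce to the familiar set-based forms: the inner product counts the shared terms and each sum of squares counts the terms in one vector.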