# Latent Semantic Indexing

Describing documents and profiles with single terms has several drawbacks. The same concept can usually be described with several different words (synonymy), which means that a profile and a document must contain exactly the same terms in order to be matched. The words “car” and “vehicle”, for example, can be used to refer to the same type of information. Conversely, many words have more than one distinct meaning (polysemy). A profile containing the term “mouse”, for example, will match documents that also contain this term, regardless of whether it was used in the context of computers or animals.

Latent semantic indexing (LSI) is an extension of the vector space model that tries to overcome these deficiencies by incorporating semantic information. Latent semantic indexing starts with the construction of a term-document matrix in which each entry indicates the number of occurrences of a specific term in a specific document. This matrix is decomposed into three matrices by a process called singular value decomposition (SVD). These new matrices represent documents and terms as vectors of factor values that capture the pattern of term usage among documents. By reducing the number of factor values the underlying semantic structure is revealed while irrelevant noise is filtered out.
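The construction of the term-document matrix can be sketched in a few lines of Python with NumPy; the vocabulary and the three toy documents below are invented purely for illustration:

```python
import numpy as np

# Illustrative vocabulary and documents (not from any real collection)
terms = ["car", "vehicle", "mouse", "computer"]
docs = [
    "car vehicle car",          # d0: about cars
    "vehicle car",              # d1: about cars
    "mouse computer computer",  # d2: about computers
]

# X[i, j] = number of occurrences of term i in document j
X = np.array([[doc.split().count(t) for doc in docs] for t in terms],
             dtype=float)
print(X.shape)  # t rows (terms) by d columns (documents)
```

Note that rows correspond to terms and columns to documents, matching the t x d layout described below.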

Note that latent semantic indexing does not attempt to interpret the meaning of the factors but merely uses them to represent documents and terms. A mathematical description of the LSI method is given below.

*Figure: schematic of the singular value decomposition of the term-document matrix.*

From the complete collection of documents a term-document matrix X is formed with t rows (one for every term that appears in the set of documents) and d columns (one for every document in the set). An SVD of matrix X results in the product of three matrices:

$$X = U S V^\mathsf{T}$$

These three matrices consist of a t x m matrix U that contains the term vectors, an m x m diagonal matrix S of positive singular values, and a d x m matrix V that contains the document vectors. The singular values in S are then sorted by magnitude and only the k largest singular values are kept in S, while the others are set to zero. The value of k is determined via experimentation. The matrices U and V are also reduced such that only the corresponding k columns of U and V remain. With these matrices a new matrix $\hat{X} = U_k S_k V_k^\mathsf{T}$ can be computed that approximates the original matrix X.
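The decomposition and truncation steps above can be sketched with NumPy's `linalg.svd`; the toy matrix and the choice of k = 2 are only illustrative:

```python
import numpy as np

# Toy term-document matrix (t = 4 terms, d = 3 documents)
X = np.array([[2., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.],
              [0., 0., 2.]])

# Full SVD: X = U @ diag(s) @ Vt, singular values returned in
# descending order of magnitude
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of factors to keep; chosen experimentally in practice
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Rank-k approximation of the original term-document matrix
X_hat = U_k @ S_k @ Vt_k
```

Keeping only the k largest singular values (and the corresponding columns of U and V) yields the closest rank-k approximation of X in the least-squares sense.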

The result of the SVD can be interpreted geometrically as a k-dimensional space in which the rows of U and V are taken as coordinates of points that represent terms and documents respectively. When the axes are rescaled with the associated diagonal values of S, the cosine or dot product between any two points in the k-dimensional space can be determined. The matrix $\hat{X}\hat{X}^\mathsf{T}$, for example, contains the dot products between every pair of terms:

$$\hat{X}\hat{X}^\mathsf{T} = U_k S_k^2 U_k^\mathsf{T}$$

The dot products between all the documents are given by the following matrix:

$$\hat{X}^\mathsf{T}\hat{X} = V_k S_k^2 V_k^\mathsf{T}$$

The major advantage of LSI is that the terms in documents and profiles can be very different yet still produce a match, based on their semantic relation. A disadvantage is the computational complexity of the matrix computations, which can reduce run-time performance to an unacceptable level.
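The term-term and document-document comparisons described above can be sketched as follows; the toy matrix and variable names are again only illustrative:

```python
import numpy as np

# Toy term-document matrix (t = 4 terms, d = 3 documents)
X = np.array([[2., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.],
              [0., 0., 2.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

# Rank-k approximation of X
X_hat = U_k @ S_k @ V_k.T

# Term-term dot products: X_hat @ X_hat.T equals U_k S_k^2 U_k^T
term_sims = U_k @ S_k**2 @ U_k.T

# Document-document dot products: X_hat.T @ X_hat equals V_k S_k^2 V_k^T
doc_sims = V_k @ S_k**2 @ V_k.T
```

Because the columns of U_k and V_k are orthonormal, the two factored forms agree exactly with the dot products of the rank-k approximation, while avoiding the explicit reconstruction of the full t x d matrix.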