Text mining attempts to discover knowledge from text documents. Term extraction is usually the first step in a text mining process. Once the terms are found, several other text mining techniques can be used to enhance a content-based filtering system. Two of these techniques are document clustering and the use of thesauri.
Document Clustering
To find interesting documents, a content-based filtering system has to search through the entire document collection. Partitioning the collection into clusters reduces this search space. One text mining approach uses the hierarchical agglomerative clustering method to create clusters of related documents. Documents are represented as vectors in the vector space model and are compared using the Dice coefficient. The resulting document hierarchy can be searched in several ways. The search can, for example, start at the root of the tree. The centroid of each child node is compared to a profile vector, and the child node with the greatest similarity is searched next. This process repeats until the bottom of the tree is reached or the cluster size is smaller than a predefined threshold.
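The top-down search described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `Node` class, the `search` function, the `min_size` parameter, and the toy tree are hypothetical, and the Dice coefficient is assumed to take its common weighted-vector form, 2(x.y) / (|x|^2 + |y|^2).

```python
from dataclasses import dataclass, field


def dice(x, y):
    """Dice coefficient for sparse term-weight vectors (dicts: term -> weight).

    Assumes the weighted form 2 * (x . y) / (|x|^2 + |y|^2).
    """
    dot = sum(w * y[t] for t, w in x.items() if t in y)
    norm = sum(w * w for w in x.values()) + sum(w * w for w in y.values())
    return 2.0 * dot / norm if norm else 0.0


@dataclass
class Node:
    centroid: dict                                 # mean term vector of the cluster
    docs: list                                     # ids of the documents in the cluster
    children: list = field(default_factory=list)   # empty for a leaf


def search(root, profile, min_size=2):
    """Follow the most similar child until a leaf or a small cluster is reached."""
    node = root
    while node.children and len(node.docs) >= min_size:
        node = max(node.children, key=lambda c: dice(c.centroid, profile))
    return node.docs


# Hypothetical three-document hierarchy.
leaf_a = Node({"soccer": 1.0, "goal": 1.0}, ["d1"])
leaf_b = Node({"soccer": 1.0, "league": 1.0}, ["d2"])
sports = Node({"soccer": 1.0, "goal": 0.5, "league": 0.5}, ["d1", "d2"], [leaf_a, leaf_b])
finance = Node({"stock": 1.0, "market": 1.0}, ["d3"])
root = Node(
    {"soccer": 2 / 3, "goal": 1 / 3, "league": 1 / 3, "stock": 1 / 3, "market": 1 / 3},
    ["d1", "d2", "d3"],
    [sports, finance],
)

profile = {"soccer": 1.0, "league": 1.0}
print(search(root, profile))              # ['d2']
print(search(root, profile, min_size=3))  # ['d1', 'd2']
```

Raising `min_size` stops the descent earlier and returns a larger cluster, which trades precision for a broader set of candidate documents.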
A serious concern for document clustering is the computational complexity of clustering methods. The hierarchical agglomerative clustering method, for example, takes at least O(n^2) time to cluster n documents, because the similarity of every pair of documents has to be computed. Several methods have been proposed to speed up the clustering process. One approach reduces the dimensionality of the document vectors by using the latent semantic indexing (LSI) representation. Because the vectors are much shorter in the LSI space, it takes less time to calculate the similarity between two documents.
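To illustrate where the cost comes from, the following sketch implements a naive agglomerative clustering and reports how many pairwise similarities it computes up front, which is n(n-1)/2 for n documents. The source does not specify a linkage criterion, so single-link merging is an assumption here, as are the function name `agglomerate` and the toy documents.

```python
from itertools import combinations


def dice(x, y):
    """Dice coefficient for sparse term-weight vectors (dicts: term -> weight)."""
    dot = sum(w * y[t] for t, w in x.items() if t in y)
    norm = sum(w * w for w in x.values()) + sum(w * w for w in y.values())
    return 2.0 * dot / norm if norm else 0.0


def agglomerate(docs):
    """Naive single-link agglomerative clustering (linkage is an assumption).

    Returns the list of merges in order and the number of pairwise
    document similarities that had to be computed.
    """
    ids = list(docs)
    # One similarity per unordered pair of documents: n * (n - 1) / 2 entries.
    sim = {frozenset(p): dice(docs[p[0]], docs[p[1]]) for p in combinations(ids, 2)}

    def link(a, b):
        # Single link: similarity of the closest pair across the two clusters.
        return max(sim[frozenset((i, j))] for i in a for j in b)

    clusters = [(i,) for i in ids]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: link(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges, len(sim)


# Hypothetical four-document collection.
docs = {
    "d1": {"soccer": 1.0, "goal": 1.0},
    "d2": {"soccer": 1.0, "league": 1.0},
    "d3": {"stock": 1.0, "market": 1.0},
    "d4": {"stock": 1.0, "market": 1.0, "bank": 1.0},
}

merges, comparisons = agglomerate(docs)
print(merges[0])     # (('d3',), ('d4',))
print(comparisons)   # 6
```

Each call to `dice` costs time proportional to the vector length, which is why shortening the vectors reduces the total running time even though the number of pairs stays the same.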
Thesauri
A vector space representation based on the selection of single terms cannot fully characterize a document. A number of text mining approaches concentrate on the term selection process itself to address the problems of synonymy and polysemy. The parsing process can, for example, be extended to identify phrases. Many information retrieval systems identify phrases as frequently occurring pairs of terms that are not separated by a stop word. More complex methods use algorithms from natural language processing to select phrases.
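The simple phrase heuristic can be sketched as follows. The sketch reads "not separated by a stop word" as "two adjacent tokens, neither of which is a stop word", which is one possible interpretation. The stop word list, the `min_count` threshold, and the function name `candidate_phrases` are illustrative assumptions, and sentence boundaries are ignored for brevity.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real system would use a larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def candidate_phrases(texts, min_count=2):
    """Return adjacent non-stop-word pairs that occur at least min_count times."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for first, second in zip(tokens, tokens[1:]):
            if first not in STOP_WORDS and second not in STOP_WORDS:
                counts[(first, second)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


texts = [
    "Text mining is the first step in information retrieval.",
    "Information retrieval and text mining share techniques.",
    "The retrieval of text is hard.",
]

print(candidate_phrases(texts))
# {('text', 'mining'): 2, ('information', 'retrieval'): 2}
```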
Another approach is to group synonyms of terms together by using a thesaurus. A thesaurus is a set of terms plus a set of relations between these terms. Thesauri can be generated either manually or automatically. A hand-made thesaurus usually contains only domain-specific knowledge, since its construction is a very labor-intensive process. The automatic construction of a thesaurus requires the clustering of terms that frequently occur together in the document collection. One approach is to represent terms as vectors, where each weight corresponds to the number of occurrences of the term in a certain document. The terms are compared using the cosine measure. Various clustering methods can then be used to find groups of terms that often occur together. Instead of relying on the number of times two terms appear together in the same document, the similarity between two terms can also be determined by the number of times they appear together in the same context, that is, in close proximity to each other. Another possibility is to use the number of co-occurrences of two terms in the same document cluster.
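The term-vector approach can be sketched as follows. Each term is represented by its occurrence counts across the documents, and terms are compared with the cosine measure. As a simplifying assumption, a fixed similarity threshold stands in for a full clustering method; the function names, the threshold value, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def term_vectors(docs):
    """Map each term to a vector holding its occurrence count in every document."""
    vectors = {}
    for i, doc in enumerate(docs):
        for term, n in Counter(doc.lower().split()).items():
            vectors.setdefault(term, [0] * len(docs))[i] = n
    return vectors


def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_terms(vectors, term, threshold=0.8):
    """Terms whose document distribution is close to that of the given term."""
    target = vectors[term]
    return sorted(
        t for t, v in vectors.items()
        if t != term and cosine(target, v) >= threshold
    )


docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "stock market report",
]

vectors = term_vectors(docs)
print(related_terms(vectors, "engine"))  # ['repair']
print(related_terms(vectors, "stock"))   # ['market', 'report']
```

In this toy collection "car" and "automobile" reach a cosine of only 0.5, because they share just one of the documents in which they occur. Document-level co-occurrence therefore groups terms that are used together, which need not coincide with true synonymy.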
Automatically generated thesauri suffer from the fact that a relation between two terms is only found when the terms occur frequently in the document collection.