Terms can be assigned to the items either automatically or manually. When dealing with text documents, various methods can be used to extract terms automatically. Usually the term extraction process consists of the following parsing steps:
- Remove all HTML tags if the document is a web page.
- Recognize individual words.
- Ignore the so-called stop words. These are words such as “the” and “or” that occur in almost every document and therefore do not help to discriminate between documents.
- Reduce the remaining words to their stems by removing prefixes and suffixes. For instance, the words “computer”, “computers” and “computing” could all be reduced to “comput”.
These steps yield a list of terms that serves as a description of the document's content. To save processing time, some systems keep only the N most discriminating terms according to some measure. Several measures that could be used for this are discussed in the next section.
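
The parsing steps listed above can be illustrated with a minimal Python sketch. It is not the method of any particular system: the names `strip_html`, `tokenize`, `naive_stem` and `extract_terms`, the short stop-word list and the suffix list are all assumptions made for illustration. The stemmer in particular only strips a handful of suffixes and ignores prefixes; a real system would more likely use an established stemming algorithm.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "or", "and", "a", "an", "of", "to", "in", "is", "are"}

# Illustrative suffixes (assumption), tried in this order; not a real stemmer.
SUFFIXES = ("ing", "ers", "er", "es", "s")


class _TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(document):
    """Step 1: remove HTML tags (script/style contents are not filtered)."""
    parser = _TextExtractor()
    parser.feed(document)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Step 2: recognize individual words as lowercase runs of a-z."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Step 4: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(document):
    """Run all steps; step 3 (stop-word removal) happens before stemming."""
    words = tokenize(strip_html(document))
    return [naive_stem(w) for w in words if w not in STOP_WORDS]
```

For example, `extract_terms("<p>The computers are computing</p>")` returns `["comput", "comput"]`: the tags are dropped, “the” and “are” are removed as stop words, and both remaining words are reduced to the same stem. Selecting the N most discriminating terms would be a further step applied to this list.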