Department of Software Technology
Vienna University of Technology
The SOMLib Digital Library - Text Representation
Overview
In order to be able to automatically organize documents by content, a numerical representation of the various documents in a collection needs to be created.
This representation is created by full-text indexing the documents.
Template Vector
A list of all words occurring in the document collection, called the template vector, is created.
To obtain a better representation of content, words can be reduced to their rough stems by removing the most common suffixes, such as '-ed', '-ing' or plural (i.e. trailing) '-s'.
While some of these primitive stemming rules may produce erroneous results for some words, these usually do not affect the content representation: a falsely removed trailing '-s' (i.e. a non-plural trailing '-s') is removed at every occurrence of the respective word, so all occurrences are still mapped to the same word stem.
Please note that during this vector creation procedure, no manually constructed stop-word list is used, allowing for automatic processing of diverse document collections.
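A minimal sketch of such a suffix-stripping rule, in Python, might look as follows; the function name, the suffix list and the minimum stem length are illustrative assumptions, not the actual SOMLib implementation:

def rough_stem(word):
    # Strip the most common suffixes to obtain a rough word stem.
    # Suffix list and minimum stem length are illustrative choices.
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# 'plays', 'played' and 'playing' are all reduced to the stem 'play';
# a word like 'always' is stemmed to 'alway' at every occurrence,
# so the error stays consistent across the collection.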
As the word list resulting from this process, i.e. the template vector, commonly consists of a very large number of words (about 100,000, depending on the document collection), it should be pruned to a subset relevant for content-based classification of the documents.
This can be achieved by removing all words that appear in a very large number (say more than 90%) of documents, as these words do not contribute to content separation.
This mostly includes words frequently referred to as 'stop words', such as articles, pronouns etc. ('the', 'is', 'are', 'he', 'she'), but also very collection-specific words such as 'computer' in a collection of documents covering only computer-related topics.
Generally, however, the number of words removed by this criterion is rather small (~10-50).
Furthermore, we can remove all words that appear in only very few documents, as they would provide only a very fine-grained content separation between the respective documents.
This is not useful when we want to organize documents by overall content, i.e. classify them according to their subject matter.
By removing words appearing in fewer than, say, 3 documents, we also get rid of most spelling errors.
This pruning commonly reduces the word list considerably, resulting in lists of about 5,000 - 15,000 words.
The resulting word list is the pruned template vector for the respective document collection.
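The construction and pruning of the template vector can be summarized in a short sketch; the thresholds (90% upper bound, 3 documents lower bound) are the illustrative values used in the text, and the function and parameter names are assumptions, not SOMLib code:

from collections import Counter

def build_template_vector(documents, max_df_ratio=0.9, min_df=3):
    # `documents` is a list of token lists (already stemmed).
    # Words appearing in more than max_df_ratio of all documents or in
    # fewer than min_df documents are pruned from the template vector.
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))   # count each word at most once per document
    return sorted(word for word, df in doc_freq.items()
                  if min_df <= df <= max_df_ratio * n_docs)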
Individual Vectors
In the second step, a vector description following the template vector for the respective collection is created for each document.
Basically, every document is described by the words occurring in it.
In the simplest form of document representation, this can be a binary indication of whether or not a word is present, leading to a corresponding vector entry of 1 or 0.
This type of representation is usually referred to as binary representation.
A more sophisticated representation can be obtained by counting the number of occurrences of each word in the template vector in every document, yielding a word-histogram representation for each document.
This representation is also referred to as term frequency (tf) representation.
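Both representations can be sketched in a few lines; given a document's token list and the pruned template vector, the following helper functions (assumed names, not SOMLib code) produce the binary and the term frequency vector:

from collections import Counter

def binary_vector(tokens, template):
    # 1 if the template word occurs in the document, 0 otherwise
    present = set(tokens)
    return [1 if word in present else 0 for word in template]

def tf_vector(tokens, template):
    # number of occurrences of each template word (word histogram)
    counts = Counter(tokens)
    return [counts[word] for word in template]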
A variety of different measures can be used to determine the importance of every word in each document based on the term frequencies.
The most common choice for doing this is the so-called term frequency times inverse document frequency (tf x idf) representation, where the importance of each word is judged with respect to the number of documents it appears in.
A word is considered the more important the more often it occurs in a single document and the less often it occurs in the collection as a whole.
Very frequent words that appear in almost all documents (but in not quite enough documents to be removed during the vector pruning process) thus receive a very low weight.
In its simple form, the tf x idf representation for every word could thus be obtained by dividing the term frequency of a word in a given document by the document frequency, i.e. the number of documents this word appears in.
However, today most systems use an enhanced version of the simple tf x idf representation, weighting each word by multiplying the term frequency with the natural logarithm of (N / df), where N is the number of documents in the collection and df is the document frequency, i.e. the number of documents the respective word appears in.
Last, but not least, the feature vectors may be normalized to unit length to compensate for different document lengths within the collection.
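The complete weighting scheme, i.e. tf x ln(N / df) followed by normalization to unit length, might be sketched as follows; function and variable names are illustrative assumptions rather than the actual SOMLib implementation:

import math
from collections import Counter

def tfidf_vectors(documents, template):
    # `documents` is a list of token lists, `template` the pruned word list.
    n_docs = len(documents)
    doc_sets = [set(tokens) for tokens in documents]
    doc_freq = {w: sum(1 for s in doc_sets if w in s) for w in template}

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        # tf x ln(N / df); words occurring in every document receive weight 0
        vec = [counts[w] * math.log(n_docs / doc_freq[w]) if doc_freq[w] else 0.0
               for w in template]
        # normalize to unit length to compensate for different document lengths
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm > 0 else vec)
    return vectors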
Experimental Document Collections and Vector Representations
For our experiments we use a number of document archives of different sizes.
These include, amongst others, the following (some of the document collections are available online for download).
- ifs Abstracts Collection
This is a very small collection of abstracts of scientific publications of our department. It is mainly used to test and demonstrate the features of the SOMLib library using a simple and easy-to-understand document collection.
It consists of 50 abstracts, amounting to a total of 102 KB of ASCII text.
Parsing these files results in a pruned template vector of about 400 to 500 words, depending on the sophistication of the word stemming and the degree of pruning of the full template vector.
- CIA World Factbook, 1990 edition
This is the 1990 edition of the CIA World Factbook, describing various regions of the world in terms of their geographical, social, political, demographic etc. features.
Please note that for our experiments we used the 1990 edition, i.e. a description of the world based on the situation before the fall of the communist bloc.
It consists of 246 documents, or about 2,080 KB of plain ASCII text.
The list below presents some of the vector representations created using various degrees of pruning and different word stemming methods, resulting in vector representations of different dimensionality.
Parsing these files results in a pruned template vector of about 800 - 1,500 words.
- TIME Magazine Article Collection of the 1960's
A collection of articles from TIME Magazine from the 1960s.
It consists of 420 articles, or a total of 1,550 KB of ASCII text.
Parsing these files results in a pruned template vector of about 6,000 words.
- Astrophysical Journal Letters Collection
A collection of articles from the Astrophysical Journal Letters.
- TIME Magazine, AsiaWeek Edition 1995-1999
A collection of recent articles from TIME Magazine, especially its AsiaWeek edition.
This collection includes all articles from the years 1995 to the first half of 1999, or about 72 MB of HTML text.
The individual years are treated as separate collections, resulting in pruned template vectors of about 13,000 words.
- Scientific American 1995-1999
A collection of recent articles from Scientific American.
This collection includes all articles from the years 1996 to the first half of 1999, or about 13 MB of HTML text.
The individual years are treated as separate collections, resulting in pruned template vectors of about 7,000 words.
Comments: rauber@ifs.tuwien.ac.at