For the experiments presented below, we used the 1990 edition of the CIA Worldfactbook (http://www.odci.gov/cia/publications/factbook/index.html) as a sample document archive. The CIA Worldfactbook is a document collection containing information on countries and world regions. The information is split into categories such as Geography, People, Economy, Defense Forces, etc. In total, the 1990 edition of the CIA Worldfactbook consists of 245 documents.
The various documents are represented by means of simple histograms
of word occurrences.
We used all words occurring in more than 15 and fewer than 220 documents.
Thus, 1056 distinct words, i.e. index terms, remained that were weighted
according to a tf × idf
weighting scheme [8],
i.e. term frequency times inverse document frequency.
Such a weighting scheme assigns high weights to index terms that occur
frequently within a document but rarely within the whole document
collection.
Finally, the documents are represented by feature vectors where each feature
corresponds to an index term and the specific value of the feature is derived
by means of the tf × idf
weighting scheme.
These feature vectors are used as the input data during network training.
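
The preprocessing described above can be sketched as follows. This is a minimal illustration and not the code used in the experiments: the function name build_tfidf_vectors, its parameters, and the toy corpus are hypothetical, tokenization is assumed to have been done beforehand, and the inverse document frequency is assumed to be log(N / df), which is only one of several common variants of the scheme in [8].

```python
import math
from collections import Counter


def build_tfidf_vectors(docs, min_df=15, max_df=220):
    """Turn tokenized documents into tf x idf feature vectors.

    docs is a list of token lists, one per document. The thresholds
    default to the values reported in the text. Hypothetical helper.
    """
    n_docs = len(docs)

    # Document frequency: number of documents in which each word occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Keep words occurring in more than min_df and fewer than max_df documents.
    index_terms = sorted(w for w, n in df.items() if min_df < n < max_df)

    # One feature per index term: term frequency times inverse document
    # frequency. ASSUMPTION: idf = log(N / df); other variants exist.
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append([tf[w] * math.log(n_docs / df[w]) for w in index_terms])
    return index_terms, vectors


# Toy corpus with small thresholds, for illustration only.
toy_docs = [
    ["oil", "export", "island"],
    ["oil", "oil", "desert"],
    ["island", "export", "export"],
]
terms, vecs = build_tfidf_vectors(toy_docs, min_df=1, max_df=3)
```

In the toy corpus, "desert" occurs in only one document and is therefore dropped, leaving three index terms. With the thresholds from the text applied to the full archive, each document would instead be mapped to a vector with one component per retained index term, and these vectors would serve as the network input.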