Next: Experimental results Up: Creating an Order in Previous: Hierarchies of self-organizing maps

   
An experimental document archive

For the experiments presented below we used the 1990 edition of the CIA Worldfactbook (http://www.odci.gov/cia/publications/factbook/index.html) as a sample document archive. The CIA Worldfactbook is a document collection containing information on countries and world regions. The information is organized into categories such as Geography, People, Economy, and Defense Forces. In total, the 1990 edition of the CIA Worldfactbook consists of 245 documents.

The documents are represented by simple histograms of word occurrences. We used all words occurring in more than 15 and fewer than 220 documents. This left 1056 distinct words, i.e. index terms, which were weighted according to a $tf \times idf$ weighting scheme [8], i.e. term frequency times inverse document frequency. Such a weighting scheme assigns high weights to index terms that occur frequently within a document but rarely within the whole document collection. Each document is thus represented by a feature vector in which each feature corresponds to an index term and the value of the feature is given by its $tf \times idf$ weight. These feature vectors are used as the input data during network training.
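
The construction of the feature vectors can be illustrated with a minimal Python sketch. It is not the code used for the experiments. The tokenizer, the function names, and the concrete weight $tf \cdot \log(N/df)$ are assumptions made for illustration, since the text only specifies a $tf \times idf$ scheme [8] and the document frequency thresholds.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer (an assumption): lowercase runs of letters.
    return re.findall(r"[a-z]+", text.lower())


def build_tfidf_vectors(documents, min_df, max_df):
    """Return (index_terms, vectors) for a list of document strings.

    A word becomes an index term if it occurs in more than min_df and
    fewer than max_df documents. The weight tf * log(N / df) is one
    common tf x idf variant, assumed here for illustration.
    """
    # Word histogram (term frequencies) of every document.
    term_counts = [Counter(tokenize(doc)) for doc in documents]

    # Document frequency: number of documents containing each word.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    # Keep only words inside the document frequency band.
    index_terms = sorted(
        term for term, df in doc_freq.items() if min_df < df < max_df
    )

    n_docs = len(documents)
    idf = {term: math.log(n_docs / doc_freq[term]) for term in index_terms}

    # One feature per index term; absent terms get weight 0.
    vectors = [
        [counts[term] * idf[term] for term in index_terms]
        for counts in term_counts
    ]
    return index_terms, vectors


# Toy archive of four short documents (invented for this example).
toy_docs = [
    "oil oil export economy",
    "oil coast border",
    "economy coast coast",
    "border army",
]
terms, vectors = build_tfidf_vectors(toy_docs, min_df=1, max_df=4)
```

For the archive described above, the thresholds would correspond to `min_df=15` and `max_df=220`. In the toy example, the words occurring in only one document are dropped and the remaining words form the index terms.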


Andreas RAUBER
1998-09-10