
   
Document representation

For the experiments presented below we used the 1990 edition of the CIA World Factbook (http://www.odci.gov/cia/publications/factbook) as a sample document archive. The CIA World Factbook is a text collection containing information on countries and regions of the world. The information is split into different categories such as Geography, People, Government, Economy, Communications, and Defense Forces.

We used full-text indexing to represent the documents: the complete information on each country is used for indexing. In other words, for the present set of experiments we refrained from splitting the documents into the segments that cover the individual categories. In total, the 1990 edition of the CIA World Factbook consists of 245 documents. During indexing we omitted terms that appear in fewer than 15 or more than 196 documents. The indexing process thus identified 959 content terms, i.e. terms used for document representation. These terms are weighted according to a simple $tf \times idf$ weighting scheme [9]. With this indexing vocabulary the documents are represented according to the vector-space model of information retrieval, so that each document is described by a vector of 959 term weights. These document vectors are then used for neural network training.
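The indexing steps described above can be illustrated with a minimal Python sketch. It is not the implementation used for the experiments: the tokenizer, the function names, and the concrete weight $tf \cdot \log(N/df)$ are assumptions chosen for illustration, since the text only states that a simple $tf \times idf$ scheme [9] was applied. The document-frequency thresholds default to the values given above (15 and 196).

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lower-case alphabetic words only.
    return re.findall(r"[a-z]+", text.lower())


def document_frequencies(docs_tokens, min_df, max_df):
    # Count in how many documents each term occurs and keep only
    # terms whose document frequency lies in [min_df, max_df].
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {term: n for term, n in df.items() if min_df <= n <= max_df}


def tfidf_vectors(documents, min_df=15, max_df=196):
    # Returns the sorted indexing vocabulary and one weight vector
    # per document (vector-space model).
    docs_tokens = [tokenize(doc) for doc in documents]
    df = document_frequencies(docs_tokens, min_df, max_df)
    terms = sorted(df)
    n_docs = len(documents)
    vectors = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in terms])
    return terms, vectors


# Toy corpus with thresholds scaled down to fit four documents.
toy_docs = [
    "river river mountain",
    "river desert",
    "river coast mountain",
    "island",
]
toy_terms, toy_vectors = tfidf_vectors(toy_docs, min_df=2, max_df=3)
```

In the toy corpus only the terms occurring in two or three documents survive the frequency filter, so each document is reduced to a two-dimensional vector. Applied to the 245 Factbook documents with the default thresholds, the same procedure would yield one vector per country over the retained vocabulary, although the exact vocabulary depends on tokenization details not specified here.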


Andreas RAUBER
1998-09-10