Department of Software Technology
Vienna University of Technology

The SOMLib Digital Library - Experiments - CIA World Factbook 1990

Overview

This is the 1990 edition of the CIA World Factbook, which describes the regions of the world in terms of their geographical, social, political, demographic, and other features. Please note that our experiments used the 1990 edition, i.e. a description of the world based on the situation before the 'fall' of the communist hemisphere.
It consists of 246 documents, or about 2,080 KB of plain ASCII text.

On this page:

Data
Text Representation
Trained Self-Organizing Maps (not interactive)
Hierarchical Self-Organizing Maps (not interactive)
Integration of Distributed Self-Organizing Maps (not interactive)
GHSOM - Growing Hierarchical Self-Organizing Map (interactive exploration)

Data

The CIA World Factbook of 1990 consists of 246 documents, or about 2,080 KB of plain ASCII text.

List of countries and regions
Descriptions of countries and regions (individual files)

Text Representation

The list below presents some of the vector representations created using different degrees of vector pruning and different word stemming methods, which results in representations of different dimensionality.
Parsing the files results in a pruned template vector of about 800 to 1500 words.

Sample template vector 0 - dimensionality = 1056
Sample tf x idf individual vectors for template vector 0
Sample template vector 1 - dimensionality = 690
Sample tf x idf individual vectors for template vector 1
Sample template vector 2 - dimensionality = 977
Sample tf x idf individual vectors for template vector 2
Sample template vector 3 - dimensionality = 889
Sample tf x idf individual vectors for template vector 3
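
The construction of a pruned template vector and the corresponding tf x idf document vectors can be sketched as follows. This is a minimal illustration, not the parser used for the experiments: the function names, the pruning thresholds (min_df, max_df_ratio), the whitespace tokenization, and the toy documents are all assumptions, and word stemming is omitted.

```python
import math
from collections import Counter

def build_template_vector(docs, min_df=2, max_df_ratio=0.9):
    """Return the sorted terms that survive document-frequency pruning.

    min_df and max_df_ratio are hypothetical thresholds: terms occurring in
    too few or too many documents are dropped from the template vector.
    """
    n_docs = len(docs)
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    return sorted(t for t, n in df.items()
                  if n >= min_df and n / n_docs <= max_df_ratio)

def tfidf_vectors(docs, template):
    """Compute one tf x idf vector per document over the template terms."""
    n_docs = len(docs)
    index = {term: i for i, term in enumerate(template)}
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()).intersection(index))
    vectors = []
    for text in docs:
        tf = Counter(w for w in text.lower().split() if w in index)
        vec = [0.0] * len(template)
        for term, count in tf.items():
            # term frequency times inverse document frequency
            vec[index[term]] = count * math.log(n_docs / df[term])
        vectors.append(vec)
    return vectors

# Toy documents standing in for the country descriptions.
docs = [
    "island pacific fishing economy",
    "island atlantic fishing tourism",
    "republic europe industry economy",
    "republic europe industry tourism",
]
template = build_template_vector(docs, min_df=2, max_df_ratio=0.75)
vectors = tfidf_vectors(docs, template)
print(len(template), "template terms:", template)
```

Stricter thresholds shorten the template vector, which is one way in which representations of different dimensionality, such as those listed above, can arise.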

Trained Self-Organizing Maps

Based on the document representation outlined above, we trained a self-organizing map to represent the contents of the document archive. Below we provide a graphical representation of the training result. To make the rows of units easier to identify, we separated them by horizontal lines. Each unit is marked either by the names of one or more countries (or regions) or by a dot. The name of a country appears on a unit if that unit is the winner for that country (or, more precisely, for the input vector representing that country). A dot appears if the unit is never selected as winner for any document.

10 x 10 map of the world
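
The training and labeling procedure described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the software used for the experiments: the learning-rate and neighbourhood schedules, the iteration count, the 3 x 3 map size, and the toy input vectors are all made up for the example.

```python
import math
import random

def winner(weights, x):
    """Return the grid position of the unit whose weight vector is closest to x."""
    return min(weights,
               key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)))

def train_som(data, rows, cols, iterations=1000, seed=0):
    """Train a rows x cols map with a Gaussian neighbourhood (illustrative settings)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        decay = 1.0 - t / iterations
        rate = 0.5 * decay                           # learning rate shrinks over time
        radius = 0.5 + max(rows, cols) / 2 * decay   # so does the neighbourhood
        x = rng.choice(data)
        wr, wc = winner(weights, x)
        for (r, c), w in weights.items():
            h = math.exp(-((r - wr) ** 2 + (c - wc) ** 2) / (2 * radius ** 2))
            for i in range(dim):
                w[i] += rate * h * (x[i] - w[i])
    return weights

def label_units(weights, data, names):
    """Attach each name to the unit that is the winner for its input vector."""
    labels = {pos: [] for pos in weights}
    for name, x in zip(names, data):
        labels[winner(weights, x)].append(name)
    return labels

def render(labels, rows, cols):
    """Print names per unit, a dot for units that never win, rows split by lines."""
    lines = [" | ".join(", ".join(labels[(r, c)]) or "." for c in range(cols))
             for r in range(rows)]
    return ("\n" + "-" * 40 + "\n").join(lines)

# Toy input vectors; the real inputs are the tf x idf document vectors.
names = ["Island A", "Island B", "Republic C", "Republic D"]
data = [[1.0, 0.0, 0.1], [1.0, 0.0, 0.1], [0.0, 1.0, 0.9], [0.0, 1.0, 0.9]]
weights = train_som(data, rows=3, cols=3)
labels = label_units(weights, data, names)
print(render(labels, 3, 3))
```

Documents with identical vectors always share a winner, and units that are closest to no document are printed as dots, mirroring the map display.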

The self-organizing map was quite successful in arranging the input data according to their mutual similarity. In general, countries belonging to the same geographical region are rather similar with respect to the categories described in the CIA World Factbook, and these geographical regions can be found in the two-dimensional map display as well. To ease the interpretation of the training result, we have marked several regions manually. For example, the area on the left-hand side of the map is allocated to documents describing various islands; the CIA World Factbook contains a large number of island descriptions. Interestingly, the descriptions of the oceans can be found in a map region neighboring the area of islands.
In the lower center of the map we find the European countries. The cluster representing these countries is further decomposed into a cluster of small countries, e.g. San Marino and Liechtenstein, a cluster of Western European countries, and finally a cluster of Eastern European countries. The latter cluster is represented by a single unit in the last row of the output space. The neighbors of this unit hold other countries that are usually regarded as belonging to the Communist hemisphere, e.g. Cuba, North Korea, Albania, and the Soviet Union. At this point it is important to recall that our document archive is the 1990 edition of the CIA World Factbook. Thus, the descriptions refer to a time before the 'fall' of the Communist hemisphere.
Other clusters of interest are the region containing countries from Latin America (lower right of the map), the cluster containing Arab countries (middle right of the map), and the cluster of African countries (upper right of the map).

A different flat SOM of the World Factbook data was produced using the GHSOM, the Growing Hierarchical Self-Organizing Map, with parameters set such that no hierarchical expansion was created. The resulting map is available for interactive browsing in the GHSOM section further down this page.

Overall, the representation of the document space is highly successful in that similar documents are located close to one another. Thus, it is easy to orient oneself in this document space.

Trained Self-Organizing Maps - Hierarchical SOMs

In the previous section we described the results of applying self-organizing maps to the data of the CIA World Factbook. The major shortcoming of this neural network model is that all documents are represented within a single two-dimensional output space, which makes it difficult to identify cluster boundaries without profound insight into the underlying document collection.
For the experiment presented below we used a hierarchical feature map with four layers. The respective maps have the following dimensions: 3x3 on the first layer, 4x4 on the second layer, and 3x3 on the third and fourth layers. This setup was determined empirically.
In the remainder of this discussion we present only the branch of the hierarchical feature map that contains what we called economically developed countries, located on unit (1/2) in the middle right area of the top-level map. The other branches, however, are formed quite similarly. Next, we show the arrangement of the second layer within the branch of economically developed countries. In this map, the countries are separated roughly according to either their geographic location or their political system. The clusters are indicated by different shades of grey. Finally, we present the full-blown branch of economically developed countries. Here it is straightforward to identify the cluster boundaries because each cluster is represented by an individual self-organizing map. Higher-level similarities are shown in higher levels of the hierarchical feature map.

top-level map
second layer
full-blown branch of economically developed countries
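
The layered setup described above (3x3, 4x4, 3x3, 3x3) can be sketched as a recursive procedure in which every unit of a map passes its documents on to a map of its own on the next layer. The sketch below is a simplified illustration, not the original hierarchical feature map implementation: the training schedule, the min_docs stopping rule, and the random toy data are assumptions.

```python
import math
import random

LAYER_SIZES = [3, 4, 3, 3]  # side lengths of the maps on layers 1 to 4

def winner(weights, x):
    """Return the grid position of the unit whose weight vector is closest to x."""
    return min(weights,
               key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)))

def train_som(data, size, iterations=300, seed=0):
    """Train a size x size map (illustrative schedule, not the original one)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(size) for c in range(size)}
    for t in range(iterations):
        decay = 1.0 - t / iterations
        radius = 0.5 + size / 2 * decay
        x = rng.choice(data)
        wr, wc = winner(weights, x)
        for (r, c), w in weights.items():
            h = math.exp(-((r - wr) ** 2 + (c - wc) ** 2) / (2 * radius ** 2))
            for i in range(dim):
                w[i] += 0.5 * decay * h * (x[i] - w[i])
    return weights

def build_hierarchy(names, data, layer=0, min_docs=2):
    """Train one map, then one child map per unit on the documents it won.

    min_docs is a hypothetical shortcut: units with fewer documents are not
    expanded, which keeps the toy example small.
    """
    size = LAYER_SIZES[layer]
    weights = train_som(data, size)
    members = {}
    for name, x in zip(names, data):
        members.setdefault(winner(weights, x), []).append((name, x))
    units = {}
    for pos, docs in members.items():
        child = None
        if layer + 1 < len(LAYER_SIZES) and len(docs) >= min_docs:
            child = build_hierarchy([n for n, _ in docs], [x for _, x in docs],
                                    layer + 1, min_docs)
        units[pos] = {"docs": [n for n, _ in docs], "child": child}
    return {"layer": layer + 1, "size": size, "units": units}

def leaves(node):
    """Yield every document name at the deepest map it reaches."""
    for unit in node["units"].values():
        if unit["child"] is None:
            yield from unit["docs"]
        else:
            yield from leaves(unit["child"])

def depth(node):
    """Deepest layer reached anywhere below this map."""
    children = [u["child"] for u in node["units"].values() if u["child"]]
    return max([node["layer"]] + [depth(c) for c in children])

rng = random.Random(1)
names = [f"doc{i}" for i in range(12)]
data = [[rng.random() for _ in range(4)] for _ in names]
tree = build_hierarchy(names, data)
print("layers used:", depth(tree))
```

Because each cluster ends up on a map of its own, cluster boundaries follow directly from the tree structure rather than from visual inspection of a single flat map.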

Integration of Distributed Self-Organizing Maps

To create a number of first-level SOMs, the whole set of 245 documents was randomly split into 5 subsets of 50 documents each, i.e. 5 documents are represented twice, in different subsets. Next, we independently trained 5 maps, each consisting of a grid of nodes, on these subsets. Each field in the maps represents a node labeled with the names of the countries for which it is the best-matching representative, i.e. the winner. Units that were not the winner for any country appear as empty fields. Each of these 5 maps is in itself a topologically ordered mapping of the corresponding documents, which means that countries considered similar to each other in terms of the facts given in their CIA World Factbook descriptions are located on the same node or close to each other. In the lower left corner of the first map, for example, we find a number of nodes representing South American countries, followed to the right by an area of European and developed countries. A cluster of Asian and African countries is situated above the South American cluster. Another interesting cluster in the upper middle of the first map is formed by the Arctic and Antarctic oceans, followed by the Antarctic continent and a number of islands. Another European cluster can be found in the upper left part of the map on the right-hand side. Similar clusters may be found in all the other maps, e.g. clusters of Eastern European countries or of Arab countries. Note, however, that the clustering provided by the mapping does not necessarily represent a geographical structuring of regions. Rather, the countries are organized on the map according to their overall similarity based on the descriptions in the CIA World Factbook.

Subset 1
Subset 2
Subset 3
Subset 4
Subset 5
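
The arithmetic of the split is 5 x 50 = 250 slots for 245 documents, so exactly 5 documents must appear twice. One way to produce such a split is sketched below; how the duplicated documents were actually chosen is not stated here, so the scheme (one extra document per subset, drawn from a neighbouring subset) and the function name are assumptions.

```python
import random
from collections import Counter

def split_with_overlap(doc_ids, n_parts=5, seed=0):
    """Split doc_ids into n_parts disjoint parts, then give each part one
    extra document drawn from another part, so that n_parts documents occur twice.
    """
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    base = len(ids) // n_parts
    if base * n_parts != len(ids):
        raise ValueError("number of documents must be divisible by n_parts")
    parts = [ids[i * base:(i + 1) * base] for i in range(n_parts)]
    # Each extra document comes from a different source part, so the
    # duplicated documents are distinct and land in a subset other than their own.
    extras = [rng.choice(parts[(i + 1) % n_parts]) for i in range(n_parts)]
    return [part + [extra] for part, extra in zip(parts, extras)]

subsets = split_with_overlap(range(245))
counts = Counter(doc for subset in subsets for doc in subset)
duplicated = sorted(doc for doc, n in counts.items() if n == 2)
print([len(s) for s in subsets], "duplicated:", duplicated)
```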

In a second step, these 5 maps are integrated into a single SOM consisting of 7x7 units that represents the whole document collection. We thus obtain a mapping of all nodes of the 5 lower-level maps onto the nodes of the higher-level SOM. The idea behind this approach is that nodes in different lower-level SOMs that represent similar documents (e.g. the nodes representing oceans, which are distributed across 3 lower-level maps in our experimental setup) should be mapped onto one node in the higher-level SOM, i.e. we should expect one cluster for every region described in the document collection.

Integrating Map of the World
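
The integration step can be sketched as training the higher-level map on the weight vectors of the lower-level nodes instead of on the documents themselves. The code below is an illustration under stated assumptions, not the original implementation: the 3 x 3 size of the lower-level maps, the training schedule, and the random toy data are made up; only the 7 x 7 size of the integrating map comes from the description above.

```python
import math
import random

def winner(weights, x):
    """Return the grid position of the unit whose weight vector is closest to x."""
    return min(weights,
               key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)))

def train_som(data, rows, cols, iterations=1000, seed=0):
    """Train a rows x cols map with a Gaussian neighbourhood (illustrative settings)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        decay = 1.0 - t / iterations
        radius = 0.5 + max(rows, cols) / 2 * decay
        x = rng.choice(data)
        wr, wc = winner(weights, x)
        for (r, c), w in weights.items():
            h = math.exp(-((r - wr) ** 2 + (c - wc) ** 2) / (2 * radius ** 2))
            for i in range(dim):
                w[i] += 0.5 * decay * h * (x[i] - w[i])
    return weights

def integrate(lower_maps, rows=7, cols=7):
    """Train the higher-level map on the weight vectors of all lower-level nodes
    and record which higher-level unit each lower-level node is mapped onto."""
    node_vectors = [w for m in lower_maps for w in m.values()]
    top = train_som(node_vectors, rows, cols)
    mapping = {(k, pos): winner(top, w)
               for k, m in enumerate(lower_maps) for pos, w in m.items()}
    return top, mapping

def locate(doc_vector, map_index, lower_maps, mapping):
    """Find a document on the higher-level map via its lower-level winner node."""
    return mapping[(map_index, winner(lower_maps[map_index], doc_vector))]

# Toy stand-in for the 5 document subsets (random 4-dimensional vectors).
rng = random.Random(2)
subsets = [[[rng.random() for _ in range(4)] for _ in range(6)] for _ in range(5)]
lower_maps = [train_som(s, 3, 3, iterations=300, seed=k) for k, s in enumerate(subsets)]
top, mapping = integrate(lower_maps)
print(locate(subsets[0][0], 0, lower_maps, mapping))
```

A document is never presented to the higher-level map directly; its position follows from the lower-level node it was assigned to, which is why lower-level nodes holding similar content are expected to coincide on the integrating map.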

Note that the main clusters are clearly visible in the map representation because the country descriptions accumulate on the cluster center nodes, with the nodes of the lower-level maps being mapped at a higher level of abstraction. We find a cluster of African countries in the upper left part of the map, with its center on the second node of the first row, followed by a node representing South American countries in the middle of the first row. In the upper right corner of the map we find a node representing Western European and developed countries, followed below by a node representing the former communist hemisphere. Keep in mind that the documents used in these examples were taken from the 1990 edition of the CIA World Factbook, prior to the 'fall' of the communist hemisphere. We further find several clusters of islands as well as a single node representing the oceans mentioned before, situated in the lower right part of the map. Please note that the countries that were present in the testbed twice (Austria, Comoros, Iceland, Japan and Mozambique) are now all mapped onto identical nodes. Generally, nodes in the various maps that represent highly similar or identical information are now mapped onto the same area in the higher-level representation of the text collection. The higher-level map thus forms an orderly mapping of all the input data used for training the individual lower-level maps, with these maps now being integrated at a higher level of abstraction.

GHSOM - Growing Hierarchical Self-Organizing Map

Below we provide some of the results of our experiments in training the GHSOM model on the CIA World Factbook text collection, which describes countries and regions of the world by their geography, climate, economy, population, political system, etc.

Flat Hierarchy GHSOM of the CIA World Factbook.
The parameter t1 was set such that at each layer of the GHSOM hierarchy the map has to represent the data from the higher-layer unit that it originates from at a significantly larger degree of detail. This results in a rather flat hierarchy.
Deep Hierarchy GHSOM of the CIA World Factbook.
The parameter t1 was set to a smaller value, such that the new layer in the hierarchy has to represent the data at a more detailed level, but less so than with the flat GHSOM example. This results in a rather deep hierarchy of smaller maps, which is favourable for large datasets where the various clusters are better separated.
Flat GHSOM of the CIA World Factbook.
The parameter t1 was set such that already at the first layer of the GHSOM the map grew large enough to represent the data at a granularity above threshold t2. Thus, none of the units was expanded at a second layer, leaving us with one small map of the entire CIA World Factbook. In this case the GHSOM resembles the conventional Growing Grid SOM.
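
The roles of t1 and t2 can be illustrated with the two stopping checks as they are commonly given in published descriptions of the GHSOM. This page does not state the exact criteria, so the comparisons below, the function names, and all numeric values are assumptions: a map keeps growing while its mean quantization error is at least t1 times the error of the unit it was expanded from, and a unit is expanded into a new map on the next layer while its own error is at least t2 times the error of the whole data set.

```python
import math

def mean_quantization_error(vectors, weight):
    """Mean Euclidean distance between a unit's weight vector and its inputs."""
    return sum(math.dist(v, weight) for v in vectors) / len(vectors)

def map_should_grow(unit_mqes, parent_mqe, t1):
    """Assumed breadth criterion: keep adding rows or columns while the map's
    mean error has not yet dropped below the fraction t1 of the parent unit's error."""
    map_mqe = sum(unit_mqes) / len(unit_mqes)
    return map_mqe >= t1 * parent_mqe

def units_to_expand(unit_mqes, mqe0, t2):
    """Assumed depth criterion: expand every unit whose error is still at least
    the fraction t2 of the error of the whole data set (mqe0)."""
    return [i for i, mqe in enumerate(unit_mqes) if mqe >= t2 * mqe0]

# Made-up error values for a map with two units.
unit_mqes = [0.2, 0.4]
print(map_should_grow(unit_mqes, parent_mqe=1.0, t1=0.5))
print(map_should_grow(unit_mqes, parent_mqe=1.0, t1=0.1))
print(units_to_expand([0.02, 0.3], mqe0=1.0, t2=0.05))
```

When the list returned by units_to_expand is empty for the first-layer map, no second layer is created, which corresponds to the single flat map described in the last example above.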

Comments: rauber@ifs.tuwien.ac.at