Department of Software Technology
Vienna University of Technology

The SOMLib Digital Library - Self-Organizing Maps

The self-organizing map (SOM) forms the basic core of the SOMLib system by organizing documents on a 2-dimensional map according to their contents. This sectrion provides a brief introduction to the topic of self-organizing maps and related architectures.

Overview

Today's information age may be characterized by constant massive production and dissemination of written information. Powerful tools for exploring, searching, and organizing this mass of information are needed. Particularly the aspect of exploration has found only limited attention. While the importance of exploration of document archives has been widely agreed on, current information retrieval technology still relies on systems that retrieve documents based on the similarity between keyword-based document and query representations.

However, the map metaphor for displaying the contents of a document library in a two-dimensional map has gained some interest . Maps are used to visualize the similarity between documents in terms of distances within a two-dimensional map display. Hence, similar documents may be found in neighboring regions of the map display, which is comparable to the situation encountered in conventional libraries, where books are organized in topical sections. Finding one book on a desired topic using any of the available information retrieval (IR) techniques then leaves you with several others on the same topic nearby. Apart from interactive retrieval of documents within one topical cluster, a map representation also allows users to browse an unfamiliar library in order to get an overview of the documents found in the repository. Given an enhanced visualization it not only provides information on which topics are covered in a specific collection, but also to which extent they are covered, i.e. how much information is available on a certain topic. Thus, the map serves as a kind of intuitive index to the document repository similar to road signs or overview maps of conventional libraries.

A way for automatically organizing a set of documents on such a map display is provided by the self-organizing map (SOM), a popular unsupervised neural network. It provides a topology preserving mapping from a high-dimensional document space to a two-dimensional output space. On the SOM output space the documents are organized according to their topical cluster structure, i.e. documents on similar topics are grouped together, following the map metaphor of library representation.

The Self-Organizing Map (SOM)

The self-organizing map (SOM, Kohonen-Map) is one of the most prominent artificial neural network models adhering to the unsupervised learning paradigm. The model consists of a number of neural processing elements, i.e. units. Each of the units i is assigned an n-dimensional weight vector m_i. It is important to note that the weight vectors have the same dimensionality as the input patterns.

The training process of self-organizing maps may be described in terms of input pattern presentation and weight vector adaptation. Each training iteration t starts with the random selection of one input pattern x(t). This input pattern is presented to the self-organizing map and each unit determines its activation. Usually, the Euclidean distance between weight vector and input pattern is used to calculate a unit's activation. The unit with the lowest activation is referred to as the winner, c, of the training iteration, i.e. $m_c(t) = \min_i{\vert\vert x(t)-m_i(t)\vert\vert}$ . Finally, the weight vector of the winner as well as the weight vectors of selected units in the vicinity of the winner are adapted. This adaptation is implemented as a gradual reduction of the component-wise difference between input pattern and weight vector, i.e. $m_i(t+1) = m_i(t) \cdot \alpha(t) \cdot h_{ci}(t) \cdot [x(t)-m_i(t)]$ . Geometrically speaking, the weight vectors of the adapted units are moved a bit towards the input pattern. The amount of weight vector movement is guided by a so-called learning rate, $\alpha$ , decreasing in time. The number of units that are affected by adaptation is determined by a so-called neighborhood function, h_ci. This number of units also decreases in time. This movement has as a consequence, that the Euclidean distance between those vectors decreases and thus, the weight vectors become more similar to the input pattern. The respective unit is more likely to win at future presentations of this input pattern. The consequence of adapting not only the winner alone but also a number of units in the neighborhood of the winner leads to a spatial clustering of similar input patters in neighboring parts of the self-organizing map. Thus, similarities between input patterns that are present in the n-dimensional input space are mirrored within the two-dimensional output space of the self-organizing map. The training process of the self-organizing map describes a topology preserving mapping from a high-dimensional input space onto a two-dimensional output space where patterns that are similar in terms of the input space are mapped to geographically close locations in the output space.

Consider Figure 1 for a graphical representation of self-organizing maps. The map consists of a square arrangement of $7 \times 7$ neural processing elements, i.e. units, shown as circles on the left hand side of the figure. The black circle indicates the unit that was selected as the winner for the presentation of input pattern x(t). The weight vector of the winner, m_c(t), is moved towards the input pattern and thus, m_c(t+1) is nearer to x(t) than was m_c(t). Similar, yet less strong, adaptation is performed with a number of units in the vicinity of the winner. These units are marked as shaded circles in Figure 1. The degree of shading corresponds to the strength of adaptation. Thus, the weight vectors of units shown with a darker shading are moved closer to x than units shown with a lighter shading.

**Figure 1:** Architecture of a $7 \times 7$ self-organizing map
$\begin{figure}\begin{center} \leavevmode \epsfxsize=50mm % \epsffile{somarch.eps} \end{center}\end{figure}$

As a result of the training process, similar input data are mapped onto neighboring regions of the map. In the case of text document classification, documents on similar topics as indicated by their feature vector representations, are grouped accordingly. For examples of resulting document mappings, see below.

SOM and Related Architectures

While the SOM itself offers a very stable and convenient architecture for organizing text document collections, related architectures can be included in the SOMLib system as well, as long as the document representation and classification is based on feature vectors of similar semantics.
Especially the hierarchical featuer map has shown to provide a convenient metaphor for document representation.

The key idea of hierarchical feature maps as proposed by R. Miikkulainen in 1990 is to use a hierarchical setup of multiple layers where each layer consists of a number of independent self-organizing maps. One self-organizing map is used at the first layer of the hierarchy. For every unit in this map a self-organizing map is added to the next layer of the hierarchy. This principle is repeated with the third and any further layers of the hierarchical feature map. In Figure 2 we provide an example of a hierarchical feature map with three layers. The first layer map consists of $2 \times 2$ units, thus we find four independent self-organizing maps on the second layer. Since each map on the second layer consists again of $2 \times 2$ units, there are 16 maps on the third layer.

**Figure 2:** Architecture of a three-layer hierarchical feature map
$\begin{figure}\begin{center} \leavevmode \epsfxsize=40mm % \epsffile{hfmarch.eps} \end{center}\end{figure}$

The training process of hierarchical feature maps starts with the self-organizing map on the first layer. This map is trained according to the standard training process of self-organizing maps as described above. When this first self-organizing map is stable, i.e. only minor further adaptation of the weight vectors are recorded, training proceeds with the maps of the second layer. Here, each map is trained with only that portion of the input data that is mapped on the respective unit in the higher layer map. By this, the amount of training data for a particular self-organizing map is reduced on the way down the hierarchy. Additionally, the vectors representing the input patterns may be shortened on the transition from one layer to the next. This shortage is due to the fact that some input vector components can be expected to be equal among those input data that are mapped onto the same unit. These equal components may be omitted for training the next layer maps without loss of information. There is no loss of information because these portions of the input vector are already represented by the higher layer unit.

Other possible architectures include the Incremental Grid Growing or Growing Grid models.

GHSOM

One of the shortcommings of the SOM and Hierarchical Feature Map lies with their fixed architecture that has to be defined a-priori. Dynamically growing variants of the SOM, on the other hand, tend to produce huge maps that are hard to handle. This lead to the development of the GHSOM , a new architecture which grows both in a hierarchical way according to the data distribution, allowing a hierarchical decomposition and navigation in sub-parts of the data, and in a horizontal way, meaning that the size of each individual map adapts itself to the requirements of the input space. This provides a convenient interface for navigating large digital libraries, as it closely follows the model of conventional libraries, which are also structured by, e.g. different floors, sections and shelves.

The starting point for the growth process is the overall deviation of the input data as measured with the single-unit SOM at layer 0. This unit is assigned a weight vector , $m_0 = [\mu_{0_1}, \mu_{0_2}, \dots, \mu_{0_n}]^T$ , computed as the average of all input data. The deviation of the input data, i.e. the mean quantization error of this single unit, is computed as given in Expression (1) with representing the number of input data . We will refer to the mean quantization error of a unit as mqe in lower case letters.

$\begin{displaymath} {\bf mqe}_0 = \frac{1}{d} \cdot \vert\vert m_0 - x \vert\vert \end{displaymath}$

(1)

After the computation of ${\bf mqe}_0$ , training of the GHSOM starts with its first layer SOM. This first layer map initially consists of a rather small number of units, e.g. a grid of $2 \times 2$ units. Each of these units is assigned an -dimensional weight vector , $m_i = [\mu_{i_1}, \mu_{i_2}, \dots, \mu_{i_n}]^T$ , $m_i \in \Re^n$ , which is initialized with random values. It is important to note that the weight vectors have the same dimensionality as the input patterns.

The learning process of SOMs may be described as a competition among the units to represent the input patterns. The unit with the weight vector being closest to the presented input pattern in terms of the input space wins the competition. The weight vector of the winner as well as units in the vicinity of the winner are adapted in such a way as to resemble more closely the input pattern.

The degree of adaptation is guided by means of a learning-rate parameter $\alpha$ , decreasing in time. The number of units that are subject to adaptation also decreases in time such that at the beginning of the learning process a large number of units around the winner is adapted, whereas towards the end only the winner is adapted. These units are chosen by means of a neighborhood function $h_{ci}$ which is based on the units' distances to the winner as measured in the two-dimensional grid formed by the neural network. In combining these principles of SOM training, we may write the learning rule as given in Expression (2), where represents the current input pattern, and refers to the winner at iteration .

$\begin{displaymath} m_{i}(t+1)=m_{i}(t)+\alpha(t) \cdot h_{ci}(t) \cdot [x(t)-m_{i}(t)] \end{displaymath}$

(2)

In order to adapt the size of this first layer SOM, the mean quantization error of the map is computed ever after a fixed number $\lambda$ of training iterations as given in Expression (3). In this formula, refers to the number of units contained in the SOM . In analogy to Expression (1), ${\bf mqe}_i$ is computed as the average distance between weight vector and the input patterns mapped onto unit . We will refer to the mean quantization error of a map as MQE in upper case letters.

$\begin{displaymath} {\bf MQE}_m = \frac{1}{u} \cdot \sum_i {\bf mqe}_i \end{displaymath}$

(3)

The basic idea is that each layer of the GHSOM is responsible for explaining some portion of the deviation of the input data as present in its preceding layer. This is done by adding units to the SOMs on each layer until a suitable size of the map is reached. More precisely, the SOMs on each layer are allowed to grow until the deviation present in the unit of its preceding layer is reduced to at least a fixed percentage $\tau_m$ . Obviously, the smaller the parameter $\tau_m$ is chosen the larger will be the size of the emerging SOM. Thus, as long as ${\bf MQE}_m \ge \tau_m \cdot {\bf mqe}_0$ holds true for the first layer map , either a new row or a new column of units is added to this SOM. This insertion is performed neighboring the unit with the highest mean quantization error, ${\bf mqe}_e$ , after $\lambda$ training iterations. We will refer to this unit as the error unit. The distinction whether a new row or a new column is inserted is guided by the location of the most dissimilar neighboring unit to the error unit. Similarity is measured in the input space. Hence, we insert a new row or a new column depending on the position of the neighbor with the most dissimilar weight vector. The initialization of the weight vectors of the new units is simply performed as the average of the weight vectors of the existing neighbors. After the insertion, the learning-rate parameter $\alpha$ and the neighborhood function $h_{ci}$ are reset to their initial values and training continues according to the standard training process of SOMs. Note that we currently use the same value of the parameter $\tau_m$ for each map in each layer of the GHSOM. It might be subject to further research, however, to search for alternative strategies, where layer or even map-dependent quantization error reduction parameters are utilized.

Consider the following figure for a graphical representation of the insertion of units. In this figure the architecture of the SOM prior to insertion is shown on the left-hand side where we find a map of $2 \times 3$ units with the error unit labeled by e and its most dissimilar neighbor signified by d. Since the most dissimilar neighbor belongs to another row within the grid, a new row is inserted between units e and d. The resulting architecture is shown on the right-hand side of the figure as a map of now $3 \times 3$ units.

**Figure 1:** Insertion of units to a self-organizing map

As soon as the growth process of the first layer map is finished, i.e. ${\bf MQE}_m < \tau_m \cdot {\bf mqe}_0$ , the units of this map are examined for expansion on the second layer. In particular, those units that have a large mean quantization error will add a new SOM to the second layer of the GHSOM. The selection of these units is based on the mean quantization error of layer 0. A parameter $\tau_u$ is used to describe the desired level of granularity in input data discrimination in the final maps. More precisely, each unit fulfilling the criterion given in Expression (4) will be subject to hierarchical expansion.

$\begin{displaymath} {\bf mqe}_i > \tau_u \cdot {\bf mqe}_0 \end{displaymath}$

(4)

The training process and unit insertion procedure now continues with these newly established SOMs. The major difference to the training process of the second layer map is that now only that fraction of the input data is selected for training which is represented by the corresponding first layer unit. The strategy for row or column insertion as well as the termination criterion is essentially the same as used for the first layer map. The same procedure is applied for any subsequent layers of the GHSOM.

The training process of the GHSOM is terminated when no more units require further expansion. Note that this training process does not necessarily lead to a balanced hierarchy, i.e. a hierarchy with equal depth in each branch. Rather, the specific requirements of the input data is mirrored in that clusters might exist that are more structured than others and thus need deeper branching. Consider Figure 2 for a graphical representation of a trained GHSOM. In particular, the neural network depicted in this figure consists of a single-unit SOM at layer 0, a SOM of $2 \times 3$ units in layer 1, three SOMs in layer 2, i.e. one for each unit in the layer 1 map. Note that each of these maps might have a different number and different arrangements of units as shown in the figure. Finally, we have several SOMs in layer 3 which were expanded from one of the layer 2 units, just indicated by dotted arrows.

**Figure 2:** Architecture of a trained *GHSOM*

To summarize, the growth process of the GHSOM is guided by two parameters $\tau_u$ and $\tau_m$ . The parameter $\tau_u$ specifies the desired quality of input data representation at the end of the training process. Each unit with ${\bf mqe}_i > \tau_u \cdot {\bf mqe}_0$ will be expanded, i.e. a map is added to the next layer of the hierarchy, in order to explain the input data in more detail. Contrary to that, the parameter $\tau_m$ specifies the desired level of detail that is to be shown in a particular SOM. In other words, new units are added to a SOM until the MQE of the map is a certain fraction, $\tau_m$ , of the mqe of its preceding unit. Hence, the smaller $\tau_m$ the larger will be the emerging maps. Conversely, the larger $\tau_m$ the deeper will be the hierarchy.

Experiments

Below we provide some of the results of our experiments using the various data collections.

7 x 7 SOM of the ifs Abstracts Collection.
Time Magazine Article Collection of the 1960's
- 10 x 15 map of the TIME Article Collection 10x15 SOM.
- A second 10 x 15 map of the TIME Article Collection 10x15 SOM.
- For a second set of experiments, the Time Article Collection was split into 6 independent subsets of articles to simulate the subsequent release of various editions. Each subset was parsed separately and used to train a single map.
  - Subset 1: articles T000 - T099: 6 x 10 SOM - set 0.
  - Subset 2: articles T100 - T199: 7 x 10 SOM - set 1.
  - Subset 2: Articles T200 - T299: 7 x 10 SOM - set 2.
  - Subset 3: Articles T300 - T399: 7 x 9 SOM - set 3.
  - Subset 4: Articles T400 - T499: 6 x 9 SOM - set 4.
  - Subset 5: Articles T500 - T599: 6 x 7 SOM - set 5.

Publications

Some selected publications covering the topic of text classification using self-organizing maps and related architectures.

M. Dittenbach, D. Merkl, and A. Rauber: The Growing Hierarchical Self-Organizing Map. In: Proceedings of the International Joint Conference on Neural Networks 2000 (IJCNN'2000), 24. - 27. 7. 2000, Como, Italy, 2000.
gnu-zipped Postscript (93 KB), PDF (219 KB)
D. Merkl and A. Rauber: Document Classification with Unsupervised Neural Networks, in Fabio Crestani, Gabriella Pasi (Eds.), Soft Computing in Information Retrieval, Physica Verlag & Co, ISBN:3790812994, pp. 102 - 121,Germany, 2000.
HTML, gnu-zipped Postscript (108 KB), PDF (233 KB)
D. Merkl and A. Rauber. Uncovering the Hierarchical Structure of Text Archives by Using an Unsupervised Neural Network with Adaptive Architecture. Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2000), April 18. - 20. 2000, Kyoto, Japan, Springer Lecture Notes in Artificial Intelligence (LNCS-LNAI 1805), pp. 384 - 395, Springer, Germany, 2000.
HTML, gnu-zipped Postscript (129 KB), PDF (185 KB)
The SOMLib Digital Library System. A.~Rauber and D.~Merkl
Proceedings of the 3rd Europ. Conf. on Research and Advanced Technology for Digital Libraries (ECDL'99), Paris, France, September 22. - 24. 1999, Lecture Notes in Computer Science (LNCS 1696), Springer, 1999.
HTML, gnu-zipped Postscript, gnu-zipped PDF,
CIA's view of the world and what neural networks learn from it: A comparison of geographical document space representation metaphors, D.~Merkl and A.~Rauber
Proc. 9th International Conf. on Database and Expert Systems (DEXA98), Vienna, Austria, 1998
HTML, gnu-zipped postscript,
Finding Structure in Text Archives, A.~Rauber and D.~Merkl
Proc European Symposium on Artificial Neural Networks (ESANN98), Bruges, Belgium, 1998
HTML, gnu-zipped postscript,
Text Data Mining, D.~Merkl
in: R. Dale, H. Moisl, and H. Somers (Eds.). A Handbook of Natural Language Processing: Techniques and Applications for the Processing of Language as Text. Marcel Dekker. New York. 1998.
gnu-zipped postscript,
Lessons Learned in Text Document Classification, D.~Merkl
Proceedings of the Workshop on Self-Organizing Maps (WSOM'97). Espoo. Finland. June 4-6. 1997.
gnu-zipped PostScript
Exploration of Document Collections with Self-Organizing Maps: A Novel Approach to Similarity Representation, D.~Merkl
Proceedings of the 1st European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD'97). Trondheim. Norway. June 25-27. 1997.
gnu-zipped PostScript
Exploration of Text Collections with Hierarchical Feature Maps, D.~Merkl Proceedings of the 20th Annual Int'l ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'97). Philadelphia, PA. July 27-31. 1997.
gnu-zipped PostScript
Exploration of Legal Text Corpora with Hierarchical Neural Networks: A Guided Tour in Public International Law, D.~Merkl and E. Schweighofer
Proceedings of the 6th Int'l Conference on Artificial Intelligence and Law (ICAIL'97). Melbourne. Australia. June 30 - July 3. 1997.
gnu-zipped PostScript
En Route to Data Mining in Legal Text Corpora: Clustering, Neural Computation, and International Treaties, D.~Merkl and E. Schweighofer
Proceedings of the 8th Int'l Workshop on Database and Expert Systems Applications (DEXA'97). Toulouse. France. 1997.
gnu-zipped PostScript

Up to the SOMLib Digital Library Homepage
Comments: rauber@ifs.tuwien.ac.at