With no a priori knowledge of the data, even providing information on cluster boundaries reveals nothing about the relevance of individual attributes for the clustering and classification process. The LabelSOM approach determines those vector elements (i.e. features of the input space) that are most relevant for the mapping of an input vector onto a specific unit. This is done by determining the contribution of every vector element to the overall Euclidean distance between an input vector and the winner's weight vector, which forms the basis of the SOM training process.
The LabelSOM method builds on the observation that, after SOM training, the weight vector elements resemble as closely as possible the corresponding input vector elements of all input signals mapped onto that particular unit, as well as, to some extent, those of the input signals mapped onto neighboring units. Vector elements having about the same value within the set of input vectors mapped onto a certain unit describe the unit in so far as they denote a common feature of all data signals on this unit. If a majority of input signals mapped onto a particular unit exhibit a highly similar input vector value for a particular feature, the corresponding weight vector value will be highly similar as well. We can thus select those weight vector elements that show by and large the same value for all input signals mapped onto a particular unit to serve as descriptors for that very unit. The quantization error of each individual feature serves as a guide to its relevance as a class label: we select those vector elements whose quantization error vector value is close to 0. The quantization error vector is computed for every unit i as the accumulated distance between the unit's weight vector elements and the corresponding elements of all input signals mapped onto unit i. More formally, let C_i be the set of input patterns mapped onto unit i, and let m_i denote the weight vector of unit i. Summing up the distances for each vector element k over all vectors x_j ∈ C_i yields a quantization error vector q_i for every unit i (Equation 1):

    q_ik = Σ_{x_j ∈ C_i} |m_ik − x_jk|,   k = 1, …, n        (1)
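The per-unit quantization error vector described above can be sketched in a few lines of NumPy; the function and variable names here are illustrative assumptions, not identifiers from the paper:

```python
import numpy as np

def quantization_error_vectors(weights, data, bmu_index):
    """Per-feature quantization error vector q_i for every unit i.

    weights   : (n_units, n_features) trained SOM weight vectors m_i
    data      : (n_samples, n_features) input vectors x_j
    bmu_index : (n_samples,) index of the best-matching unit of each x_j
    """
    n_units, n_features = weights.shape
    q = np.zeros((n_units, n_features))
    for i in range(n_units):
        C_i = data[bmu_index == i]  # input signals mapped onto unit i
        if len(C_i):
            # accumulate |m_ik - x_jk| over all x_j in C_i, per element k
            q[i] = np.abs(weights[i] - C_i).sum(axis=0)
    return q
```

Features with q_i values close to 0 are then candidates for labeling unit i.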
Selecting those weight vector elements that exhibit a corresponding quantization error close to 0 thus yields a list of attributes that are shared, by and large, by all input signals on the respective unit and thus describe the characteristics of the data on that unit. These attributes can then serve as labels for regions of the map in data mining applications.
In text mining applications we usually face a further restriction. Due to the high dimensionality of the vector space and the characteristics of the document feature vector representation, we usually find a high number of input vector elements with a value of 0, i.e. there is a large number of terms that are not present in a group of documents. These terms obviously yield a quantization error of 0 and would thus be chosen as labels for the units. This would result in labeling the units with attributes that are not present in the data on the respective unit. While this may be perfectly acceptable for some data analysis tasks, where even the absence of an attribute is a distinctive characteristic, it is definitely not the goal in text mining applications, where we want to describe the present features that are responsible for a certain clustering rather than describe a cluster via the features absent from its data. Hence, we need to determine those vector elements of each weight vector which, on the one hand, exhibit about the same value for all input signals mapped onto that specific unit and, on the other hand, have a high overall weight vector value indicating their importance. To achieve this we define a threshold τ in order to select only those attributes that, apart from having a very low quantization error, exhibit a corresponding weight vector value above τ. In these terms, τ can be thought of as indicating the minimum importance an attribute must have in the representation to be selected as a label.
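A minimal sketch of this selection rule for a single unit follows; the function name, the concrete threshold values, and the label cap are illustrative assumptions rather than values from the paper:

```python
def select_labels(q_i, m_i, terms, tau=0.25, eps=1e-6, max_labels=5):
    """Select labels for one unit: keep features k whose per-feature
    quantization error q_i[k] is close to 0 AND whose weight value
    m_i[k] exceeds tau, so absent terms (weight ~ 0) are excluded.
    tau, eps, and max_labels are illustrative, not from the paper."""
    candidates = [k for k in range(len(m_i))
                  if q_i[k] < eps and m_i[k] > tau]
    candidates.sort(key=lambda k: -m_i[k])  # most important terms first
    return [terms[k] for k in candidates[:max_labels]]
```

Note how a term with quantization error 0 but weight 0 (i.e. a term consistently absent from the unit's documents) is rejected by the tau test.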
In addition to increased interpretability, the labels derived by the LabelSOM method facilitate cluster detection. Clear cluster boundaries can be defined by combining units that share a set of labels; these shared labels can then serve as higher-level class identifiers.
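One simple way to realize this combination step is to group units whose label sets coincide, with the shared label set acting as the class identifier; this sketch and its names are an illustrative assumption, not the paper's procedure:

```python
from collections import defaultdict

def group_units_by_labels(unit_labels):
    """Group map units carrying the same set of labels into one region;
    the shared label set serves as a higher-level class identifier.
    unit_labels: dict mapping unit coordinate -> set of label strings."""
    groups = defaultdict(list)
    for unit, labels in unit_labels.items():
        groups[frozenset(labels)].append(unit)
    return dict(groups)
```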