With no a priori knowledge of the data, even providing information on cluster boundaries reveals nothing about the relevance of individual attributes for the clustering and classification process. The LabelSOM approach determines those vector elements (i.e. features of the input space) that are most relevant for the mapping of an input vector onto a specific unit. This is done by determining the contribution of every vector element to the overall Euclidean distance between an input vector and the winner's weight vector, which forms the basis of the SOM training process.
The LabelSOM method builds on the observation that, after SOM training, the weight vector elements resemble as closely as possible the corresponding input vector elements of all input signals mapped onto that particular unit, as well as, to some extent, those of the input signals mapped onto neighboring units. Vector elements having about the same value within the set of input vectors mapped onto a certain unit describe the unit in so far as they denote a common feature of all data signals on this unit. If a majority of input signals mapped onto a particular unit exhibit a highly similar input vector value for a particular feature, the corresponding weight vector value will be highly similar as well. We can thus select those weight vector elements that show by and large the same value for all input signals mapped onto a particular unit to serve as descriptors for that very unit. The quantization error of each individual feature serves as a guide to its relevance as a class label: we select those vector elements whose quantization error vector value is close to 0. The quantization error vector is computed for every unit i as the accumulated distance between the unit's weight vector elements and the corresponding elements of all input signals mapped onto unit i. More formally, let C_i be the set of input patterns mapped onto unit i, and let m_i denote the weight vector of unit i. Summing up the distances for each vector element k over all vectors x_j ∈ C_i yields a quantization error vector q_i for every unit i (Equation 1):

    q_ik = Σ_{x_j ∈ C_i} |m_ik − x_jk|,   k = 1, …, n        (1)
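The per-unit quantization error vector described above can be sketched in a few lines of NumPy; the function and variable names here are illustrative assumptions, not identifiers from the paper:

```python
import numpy as np

def quantization_error_vectors(weights, data, bmu_index):
    """Per-feature quantization error vector q_i for every unit i.

    weights   : (n_units, n_features) trained SOM weight vectors m_i
    data      : (n_samples, n_features) input vectors x_j
    bmu_index : (n_samples,) index of the best-matching unit of each x_j
    """
    n_units, n_features = weights.shape
    q = np.zeros((n_units, n_features))
    for i in range(n_units):
        C_i = data[bmu_index == i]  # input signals mapped onto unit i
        if len(C_i):
            # accumulate |m_ik - x_jk| over all x_j in C_i, per element k
            q[i] = np.abs(weights[i] - C_i).sum(axis=0)
    return q
```

Features with q_i values close to 0 are then candidates for labeling unit i.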
Selecting those weight vector elements that exhibit a corresponding quantization error close to 0 thus yields a list of attributes that are shared, by and large, by all input signals on the respective unit and thus describe the characteristics of the data on that unit. These attributes can then serve as labels for regions of the map in data mining applications.
In text mining applications we usually face a further restriction. Due to the high dimensionality of the vector space and the characteristics of the document feature vector representation, we usually find a high number of input vector elements with a value of 0, i.e. there is a large number of terms that are not present in a group of documents. These terms obviously yield a quantization error of 0 and would thus be chosen as labels for the units. This would result in labeling the units with attributes that are not present in the data on the respective unit. While this may be perfectly acceptable for some data analysis tasks, where even the absence of an attribute is a distinctive characteristic, it is definitely not the goal in text mining applications, where we want to describe the present features that are responsible for a certain clustering rather than describe a cluster via the features absent from its data. Hence, we need to determine those vector elements of each weight vector which, on the one hand, exhibit about the same value for all input signals mapped onto that specific unit and, on the other hand, have a high overall weight vector value indicating their importance. To achieve this we define a threshold τ in order to select only those attributes that, apart from having a very low quantization error, exhibit a corresponding weight vector value above τ. In these terms, τ can be thought of as indicating the minimum importance an attribute must have in the representation to be selected as a label.
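A minimal sketch of this selection rule for a single unit follows; the function name, the concrete threshold values, and the label cap are illustrative assumptions rather than values from the paper:

```python
def select_labels(q_i, m_i, terms, tau=0.25, eps=1e-6, max_labels=5):
    """Select labels for one unit: keep features k whose per-feature
    quantization error q_i[k] is close to 0 AND whose weight value
    m_i[k] exceeds tau, so absent terms (weight ~ 0) are excluded.
    tau, eps, and max_labels are illustrative, not from the paper."""
    candidates = [k for k in range(len(m_i))
                  if q_i[k] < eps and m_i[k] > tau]
    candidates.sort(key=lambda k: -m_i[k])  # most important terms first
    return [terms[k] for k in candidates[:max_labels]]
```

Note how a term with quantization error 0 but weight 0 (i.e. a term consistently absent from the unit's documents) is rejected by the tau test.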
In addition to increased interpretability, the labels derived by the LabelSOM method facilitate cluster detection. Clear cluster boundaries can be defined by combining units that share a set of labels; these shared labels can then serve as higher-level class identifiers.
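One simple way to realize this combination step is to group units whose label sets coincide, with the shared label set acting as the class identifier; this sketch and its names are an illustrative assumption, not the paper's procedure:

```python
from collections import defaultdict

def group_units_by_labels(unit_labels):
    """Group map units carrying the same set of labels into one region;
    the shared label set serves as a higher-level class identifier.
    unit_labels: dict mapping unit coordinate -> set of label strings."""
    groups = defaultdict(list)
    for unit, labels in unit_labels.items():
        groups[frozenset(labels)].append(unit)
    return dict(groups)
```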