Data
The ifs abstracts collection consists of 50 abstracts stored as ASCII text files, 102 KB in total. Not every abstract was used in every experiment.
The 50 abstracts were randomly sampled from the department's list of publications as it stood when the document set was created.
Text Representation
Parsing these files results in a pruned template vector of about 400 to 500 words, depending on the sophistication of the word stemming and the degree of pruning of the full template vector.
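As a rough illustration of how a pruned template vector might be built, the sketch below counts in how many documents each word occurs and keeps only the words inside a document-frequency band. The function name, the thresholds `min_df` and `max_df_ratio`, the regex tokenizer, and the omission of stemming are all assumptions made for illustration; the actual pruning criteria are not stated here.

```python
import re
from collections import Counter

def build_template_vector(docs, min_df=2, max_df_ratio=0.9):
    """Return the sorted list of words kept after document-frequency pruning.

    Hypothetical helper: thresholds and tokenizer are illustrative assumptions.
    """
    df = Counter()
    for text in docs:
        # Count each word once per document (document frequency).
        df.update(set(re.findall(r"[a-z]+", text.lower())))
    max_df = max_df_ratio * len(docs)
    # Drop words that are too rare or too common to discriminate documents.
    return sorted(w for w, n in df.items() if min_df <= n <= max_df)

# Toy stand-in texts, not the actual abstracts.
docs = [
    "gait pattern analysis for the patient",
    "software reuse and the software process",
    "gait malfunction of the patient",
]
vocab = build_template_vector(docs)
```

With stricter thresholds the vocabulary shrinks, which is one way the 400 to 500 word range mentioned above could vary with the degree of pruning.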
Several ways of weighting the importance of individual words were tried. The tf x idf (term frequency times inverse document frequency) weighting proved the most appropriate for representing content, with the vectors usually normalized to unit length before SOM training.
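The tf x idf weighting with unit-length normalization can be sketched as follows. This is a minimal example that assumes the common idf form log(N / df); the exact variant used in these experiments is not specified, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs, vocab):
    """Return one unit-length tf x idf vector per document.

    Assumes idf = log(N / df); other idf variants exist.
    """
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for doc in tokenized_docs for w in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vec = [tf[w] * math.log(n_docs / df[w]) if df[w] else 0.0 for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        # Normalize to unit length; leave all-zero vectors untouched.
        vectors.append([x / norm for x in vec] if norm > 0 else vec)
    return vectors

# Toy stand-in documents, already tokenized.
docs = [["gait", "pattern", "gait"], ["software", "reuse"], ["gait", "software"]]
vocab = ["gait", "pattern", "reuse", "software"]
vectors = tfidf_vectors(docs, vocab)
```

Words that occur in every document receive weight zero under this idf form, so they contribute nothing to the distances the SOM later computes.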
Trained Self-Organizing Maps
A 7 x 7 SOM is trained with the scientific abstracts data. It is intended to provide a clustering of the documents based on their contents, similar to the organization of documents in a conventional library.
The units are labeled with the names of the document vectors, which consist of the first 3 letters of the author's name followed by the short name of the conference or workshop where the paper was published. Without additional knowledge of either the conferences or the authors, this representation is hard to interpret, although we might draw some conclusions about the cluster structure by treating the authors' names as indicators.
Due to the small size of the data collection, interpreting the resulting SOM is rather intuitive and can be achieved by reading the various abstracts mapped onto the units of the map.
The abstracts can be accessed by clicking on the article names on the map.
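The training of a 7 x 7 map described in this section can be sketched with a standard online SOM update rule. This is a minimal sketch, not the software used for these experiments: the learning-rate and neighborhood schedules, the iteration count, the function names, and the random stand-in data are all assumptions.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Return the (row, col) of the unit whose weight vector is closest to x."""
    d = np.linalg.norm(weights - x, axis=-1)
    return tuple(int(i) for i in np.unravel_index(np.argmin(d), d.shape))

def train_som(data, rows=7, cols=7, n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a rows x cols SOM; returns weights of shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every unit, used by the neighborhood function.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)               # linearly decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5   # shrinking neighborhood radius
        x = data[rng.integers(len(data))]     # pick one document at random
        bmu = best_matching_unit(weights, x)
        dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-dist2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
        weights += lr * h[..., None] * (x - weights)
    return weights

# Random stand-in for 50 unit-length document vectors (not the real abstracts).
rng = np.random.default_rng(1)
data = rng.random((50, 20))
data /= np.linalg.norm(data, axis=1, keepdims=True)

weights = train_som(data)
positions = [best_matching_unit(weights, x) for x in data]
```

Each entry of `positions` is the map unit a document lands on, which is the information a labeled map display would be built from.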
Labeled Self-Organizing Maps
The trained SOMs are labeled automatically using the LabelSOM method.
The various labels can then be used to identify clusters within the map by identifying regions which are labeled with identical keywords.
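A simplified sketch in the spirit of LabelSOM is given below: for each unit, it ranks words by how consistently the documents mapped onto that unit agree with the unit's weight vector, and skips words whose weight is close to zero. The function name, the `min_weight` threshold, and the toy data are assumptions for illustration, and details of the published method may differ.

```python
import numpy as np

def label_units(weights, data, bmus, vocab, n_labels=10, min_weight=0.05):
    """Return {unit: [labels]} for every unit with at least one mapped document.

    Hypothetical helper; thresholds are illustrative assumptions.
    """
    labels = {}
    for unit in set(bmus):
        members = data[[i for i, b in enumerate(bmus) if b == unit]]
        w = weights[unit]
        # Per-word quantization error summed over the documents on this unit.
        qerr = np.abs(members - w).sum(axis=0)
        # Ignore words whose weight is near zero (effectively absent here).
        qerr = np.where(w >= min_weight, qerr, np.inf)
        order = np.argsort(qerr)[:n_labels]
        labels[unit] = [vocab[k] for k in order if np.isfinite(qerr[k])]
    return labels

# Toy 1 x 2 map with hand-set weights, not a trained SOM.
vocab = ["gait", "pattern", "software", "reuse"]
weights = np.array([[[0.8, 0.55, 0.0, 0.0], [0.0, 0.0, 0.6, 0.9]]])
data = np.array([
    [0.8, 0.6, 0.0, 0.0],
    [0.8, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.8],
])
bmus = [(0, 0), (0, 0), (0, 1)]
labels = label_units(weights, data, bmus, vocab, n_labels=2)
```

Neighboring units that end up with the same keywords can then be grouped into a labeled region of the map.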
Having a set of 10 labels automatically assigned to the single nodes in the figure leaves us with a somewhat clearer picture of the underlying text archive. It allows us to understand the reasons for a certain cluster assignment as well as to identify overlapping topics and areas of interest within the document collection. For example, in the upper left corner we find a group of nodes sharing labels such as skeletal plans, clinical, guideline, patient, health, which deal with the development and representation of skeletal plans for medical applications. Another homogeneous cluster can be found in the upper right corner. It is identified by labels such as gait, pattern, malfunction and deals with the analysis of human gait patterns to identify malfunctions and support diagnosis and therapy. A set of nodes in the lower left corner of the map carries labels including software, process, reuse and identifies a group of papers dealing with software process models and software reuse. To its right lies a large cluster labeled with cluster, intuitive, document, archive, text, input, containing papers on cluster visualization and its application in the context of document archives. Further clusters can be identified in the center of the map on topics such as plan validation, quality analysis, and neural networks.