Next: Sammon's Mapping Up: Finding Structure in Text Previous: Finding Structure in Text

Introduction

Traditional text archives exhibit a kind of structure that allows the user to understand the overall organization of the text collection and provides a means to search and browse the collection to retrieve relevant texts. However, with the increasing number of electronically available text collections, exploring these archives becomes a challenging task both for users of large text corpora and for researchers trying to provide the means necessary for intuitive analysis. Generally, exploration is not primarily a problem of query processing and retrieval of relevant documents, but rather one of understanding the text collection and its structure at a higher level of abstraction. The individual texts in any text collection span a high-dimensional input space defined by the words occurring in the various documents. The goal is to provide a method that allows easy and intuitive access to this high-dimensional document space and aids in understanding it, enabling both query-based retrieval of documents and interactive browsing to locate relevant documents and to make the overall structure of the text collection intelligible to the user.
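To make the notion of a document space spanned by words concrete, the following minimal sketch maps a toy collection to term-frequency vectors with one dimension per distinct word. The helper names (`build_vocabulary`, `to_vector`), the whitespace tokenization, and the raw term counts are assumptions made for illustration only; this section does not specify how the documents are actually represented.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return a sorted list of all distinct lowercase words in the collection."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def to_vector(document, vocabulary):
    """Map one document to a term-frequency vector, one dimension per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


# Toy collection (hypothetical); real archives yield thousands of dimensions.
docs = [
    "neural networks learn maps",
    "maps show text structure",
    "text archives store text",
]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
```

Even this three-document collection yields a 9-dimensional space, which suggests why realistic corpora cannot be inspected directly and call for a low-dimensional visualization.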

Numerous approaches to the problem of structure analysis of text corpora have been developed, either trying to impose a hierarchy on a given text collection or to provide some other way of clustering, using either supervised or unsupervised analysis methods [2]. However, most systems primarily provide a method for convenient and 'intelligent' document retrieval based on query systems of differing degrees of sophistication, and have so far placed too little emphasis on visualization. As a consequence, interactive exploration is usually not supported. One well-known technique for the visualization of high-dimensional data spaces is Sammon's Mapping (SM) [5], which aims to represent the distances between data points in the high-dimensional input space as closely as possible in a 2-dimensional plot. Recent approaches use neural networks to structure large text corpora and to provide an interface for intuitive browsing of these collections. A prominent neural network architecture based on unsupervised learning is the self-organizing map (SOM), which has repeatedly been used to analyze and visualize text archives, the most prominent example probably being the WEBSOM project [1].
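The idea of preserving pairwise distances in a 2-dimensional plot can be quantified by the stress measure commonly associated with Sammon's Mapping: the squared differences between input-space and output-space distances, each weighted by the inverse of the input-space distance and normalized by the sum of all input-space distances. The sketch below only evaluates this measure for a given layout; it does not perform the iterative minimization, and the function names and the choice of Euclidean distance are assumptions for illustration.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sammon_stress(high_dim, low_dim):
    """Sammon stress of a low-dimensional layout; 0.0 means all distances are preserved."""
    total = 0.0
    error = 0.0
    for i, j in combinations(range(len(high_dim)), 2):
        d_high = euclidean(high_dim[i], high_dim[j])
        if d_high == 0.0:
            continue  # identical input points carry no distance information
        d_low = euclidean(low_dim[i], low_dim[j])
        total += d_high
        error += (d_high - d_low) ** 2 / d_high
    return error / total if total > 0.0 else 0.0


# Three points that lie in a plane of the 3-dimensional input space.
points = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
faithful = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]   # keeps every distance
collapsed = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]  # loses every distance
stress_faithful = sammon_stress(points, faithful)
stress_collapsed = sammon_stress(points, collapsed)
```

A layout that preserves all distances has stress 0, whereas collapsing all points onto one location gives stress 1 for this example.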

The standard map display used to represent the results of SOM training has its limits in that cluster boundaries are difficult to detect. To overcome this problem, we apply a new visualization technique based on an extended learning rule for the SOM, resulting in an intuitive representation of clusters as groups of nodes in a 2-dimensional output space. The basic idea of this Adaptive Coordinate (AC) approach [4] is to have the nodes of the SOM arrange themselves in a 2-dimensional output space during the training process in such a way as to approximate their geometric relationship in the high-dimensional vector space as faithfully as possible. The resulting visualization of the trained SOM is similar in spirit to the SM, but arises from the self-organization during the learning process.
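The following sketch illustrates one plausible way to realize this idea; it is not taken from this section, and the actual AC learning rule is defined in [4] and may differ. It combines a standard SOM weight update with an assumed coordinate update: each node moves its 2-dimensional coordinate toward the winner's coordinate by the same relative amount by which its weight vector moved toward the input. All names (`train_step`, `coords`, `grid`) and the Gaussian neighbourhood are illustrative assumptions.

```python
import math


def distance(a, b):
    """Euclidean distance between two vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train_step(weights, coords, grid, x, learning_rate, radius):
    """One SOM training step with an ASSUMED adaptive-coordinate update.

    weights: weight vectors, one per node (modified in place)
    coords:  adaptive 2-D output coordinates, one per node (modified in place)
    grid:    fixed grid positions used only for the neighbourhood function
    """
    before = [distance(w, x) for w in weights]
    winner = min(range(len(weights)), key=before.__getitem__)
    winner_coord = list(coords[winner])  # copy so later moves do not shift the target
    for i, w in enumerate(weights):
        # Gaussian neighbourhood strength measured on the fixed grid.
        g = distance(grid[i], grid[winner])
        h = math.exp(-(g * g) / (2.0 * radius * radius))
        # Standard SOM update: move the weight vector toward the input.
        weights[i] = [wk + learning_rate * h * (xk - wk) for wk, xk in zip(w, x)]
        after = distance(weights[i], x)
        if before[i] > 0.0:
            # Assumption: relative distance reduction drives the coordinate move.
            rel = (before[i] - after) / before[i]
            coords[i] = [c + rel * (wc - c) for c, wc in zip(coords[i], winner_coord)]
    return winner


# Hypothetical 2x2 map with 2-dimensional inputs.
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
coords = [[float(r), float(c)] for r, c in grid]
winner = train_step(weights, coords, grid, [0.9, 0.1], learning_rate=0.5, radius=1.0)
```

Under this assumed rule, nodes whose weight vectors adapt strongly to the same inputs drift together in the output space, so that clusters would appear as visible groups of nodes rather than as regions on a regular grid.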

In this paper we demonstrate the application of the SOM enhanced with the AC visualization technique to the problem of structure visualization of free-form text corpora. We further compare the resulting AC visualization both with the standard SOM visualization and with the corresponding SM to analyze its capabilities in the field of text archive exploration.


Andreas RAUBER
1998-04-28