Department of Software Technology
Vienna University of Technology

The SOMLib Digital Library - Integration of Distributed Libraries

Overview

When a digital library exists only in distributed form across several sites, it may be more efficient to train independent self-organizing maps that each represent one part of the library than to transfer all of the data to a single site for training. However, when some form of uniform access to the data is required, the contents of the various sites have to be integrated. In our approach to digital library organization, we suggest using self-organizing maps to perform this integration. In particular, the map that integrates different portions of the digital library may be trained using the weight vectors of the maps to be integrated as its input. This strategy may be applied recursively to build hierarchies of arbitrary depth, as shown in Figure 1, where a $3 \times 3$ and a $4 \times 5$ map are integrated in a $3 \times 4$ map. Note that selected parts of self-organizing maps may also be integrated using essentially the same architecture: the user simply selects areas of interest scattered across different maps for which an integration is to be performed. In this way, the user may tie together pieces of information to build her own library, fine-tuned to her particular interests.

**Figure 1:** Integration of two self-organizing maps

The effect of such an integration is that input data items that are separated in different low-level maps are grouped together in the high-level map. Input data that are mapped onto the same low-level unit remain represented together in the high-level map.
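The integration step described above can be sketched as follows. This is a minimal pure-Python illustration, not the SOMLib implementation: the names `train_som` and `best_matching_unit`, the linearly decaying learning rate and neighbourhood radius, and the random stand-in weight vectors are all assumptions. The sketch also assumes that both low-level maps already share one vector structure.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)),
    )


def train_som(inputs, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols SOM on a list of equal-length vectors.

    The decay schedules below are illustrative assumptions only.
    """
    rng = random.Random(seed)
    dim = len(inputs[0])
    # One randomly initialised weight vector per grid position.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    radius0 = max(rows, cols) / 2.0
    total_steps = epochs * len(inputs)
    step = 0
    for _ in range(epochs):
        order = list(inputs)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            radius = max(radius0 * (1.0 - frac), 0.5)
            win = best_matching_unit(weights, x)
            for pos, w in weights.items():
                d2 = (pos[0] - win[0]) ** 2 + (pos[1] - win[1]) ** 2
                h = math.exp(-d2 / (2.0 * radius * radius))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


# Stand-ins for the weight vectors of a 3 x 3 and a 4 x 5 low-level map
# (random values; real maps would supply their trained weight vectors).
rng = random.Random(1)
dim = 5
map_a = [[rng.random() for _ in range(dim)] for _ in range(3 * 3)]
map_b = [[rng.random() for _ in range(dim)] for _ in range(4 * 5)]

# The 3 x 4 integrating map is trained on all 9 + 20 = 29 weight vectors.
high = train_som(map_a + map_b, rows=3, cols=4)

# Each low-level unit is then assigned to a unit of the high-level map.
assignment = [best_matching_unit(high, w) for w in map_a + map_b]
```

Every entry of `assignment` is a grid position in the $3 \times 4$ map, so all documents held by one low-level unit end up on the same high-level unit.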

The SOMLib architecture comprises two types of SOMs. First, there is a set of independent, small SOMs, referred to as first order maps, which are trained with the feature vectors obtained by parsing the documents. Every node thus represents a set of documents, and the whole map represents a topographically ordered mapping of all documents in the library.
In a second step, higher order maps are trained using the weight vectors of those first order maps as input vectors. Note, however, that the vocabularies, and thus the vector structures, of the separate libraries differ from each other. A uniform feature vector structure therefore has to be created by merging the vector structures of the libraries to be included before the higher order map can be trained. The resulting map is conceptually identical to the library maps it is based upon, except that each node now represents a set of nodes from the lower order maps. Analogously, small SOMs trained with relevant documents can be used as user profiles to enhance keyword queries.
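Merging the vector structures of separate libraries might look like the following sketch. It assumes that a vector structure is simply an ordered list of terms and that a term missing from a library receives the weight 0.0. Neither assumption is stated in the text, and the helper names `merge_vocabularies` and `reproject` are hypothetical.

```python
def merge_vocabularies(vocabularies):
    """Build one uniform vector structure as the sorted union of all terms."""
    return sorted({term for vocab in vocabularies for term in vocab})


def reproject(vector, vocab, merged_vocab, fill=0.0):
    """Express a weight vector in the merged vector structure.

    Terms unknown to the source library get `fill` (an assumption).
    """
    by_term = dict(zip(vocab, vector))
    return [by_term.get(term, fill) for term in merged_vocab]


# Two toy libraries with partly overlapping vocabularies.
vocab_a = ["library", "map", "neural"]
vocab_b = ["map", "query", "user"]
merged = merge_vocabularies([vocab_a, vocab_b])
# merged == ["library", "map", "neural", "query", "user"]

unit_a = reproject([0.2, 0.7, 0.1], vocab_a, merged)
# unit_a == [0.2, 0.7, 0.1, 0.0, 0.0]
unit_b = reproject([0.5, 0.3, 0.9], vocab_b, merged)
# unit_b == [0.0, 0.5, 0.0, 0.3, 0.9]
```

After this step all weight vectors have the same length and term order, so they can serve together as input for one higher order map.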

Referencing of Integrated Libraries

From the library administrator's point of view, two situations have to be considered.

On the one hand, there are the first level SOMs, which are trained with the feature vectors created by parsing (a subset of) the existing documents. The resulting maps are relatively small, since they only need to represent the documents present in the library. New documents can be added to a map by parsing them using the previously extracted vector structure and mapping the resulting feature vectors onto the map. As long as the general scope of the library does not change extensively, new documents can be added without destroying the topology-preserving mapping. As new topics emerge, the small first level libraries need to be retrained. The 'old' SOMLib map can either be retained to serve other higher order maps that reference it, or the nodes of the old map can be mapped onto the corresponding nodes of the new SOMLib map: the weight vector of each old node is modified to match the new vector structure and presented to the new map, and the winning node is determined. If a first order map tends to grow too big, the underlying documents can be split into groups to create separate first order SOMLibs, which are then combined in a higher order map.

On the other hand, there are higher order maps to be administered. These are based on several lower order maps, the vector structures of which are merged to create a new vector structure. The modified weight vectors of the lower order maps are then used to train the higher level SOMLib map. In many cases a natural hierarchy will evolve in institutional settings: for example, several university departments may have their own SOMLibs as first order maps, which are integrated in a single second order map at university level, which in turn may be combined at a national level, and so on. Others may choose to combine first or higher order SOMLibs of institutions covering a certain topic of interest, with the possibility of mutual referencing, to create their personal library system.
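Mapping the nodes of an old map onto a retrained map by winner determination can be sketched as follows. The sketch assumes the old weight vectors have already been converted to the new vector structure and that the winner is the unit with the smallest squared Euclidean distance. The names `winner` and `map_old_nodes` and the toy weights are hypothetical.

```python
def winner(weights, x):
    """Grid position of the unit whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)),
    )


def map_old_nodes(old_weights, new_weights):
    """For every node of the old map, find its winning node in the new map."""
    return {pos: winner(new_weights, w) for pos, w in old_weights.items()}


# Toy maps: positions are (row, column), vectors have two components.
old_map = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0]}
new_map = {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5], (1, 0): [0.1, 0.8]}

redirect = map_old_nodes(old_map, new_map)
# redirect == {(0, 0): (0, 0), (0, 1): (1, 0)}
```

A higher order map that still references nodes of the old map could use a table such as `redirect` to follow each reference into the new map. The same `winner` lookup also covers adding a new document, whose feature vector is presented to the map in place of an old weight vector.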

Experiments

Below we provide some of the results of our experiments.

The Time Article Collection was split into 6 independent subsets of articles to simulate the subsequent release of various editions. Each subset was parsed separately and used to train a single map.
- Subset 1: Articles T000 - T099: 6 x 10 SOM - set 0.
- Subset 2: Articles T100 - T199: 7 x 10 SOM - set 1.
- Subset 3: Articles T200 - T299: 7 x 10 SOM - set 2.
- Subset 4: Articles T300 - T399: 7 x 9 SOM - set 3.
- Subset 5: Articles T400 - T499: 6 x 9 SOM - set 4.
- Subset 6: Articles T500 - T599: 6 x 7 SOM - set 5.
Next, the weight vector structures of these independent maps were merged to create a uniform weight vector structure, allowing the integration of the maps by training one single SOM using the weight vectors of the individual maps as input.
- 10 x 15 Integrating SOM.
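The map sizes listed above determine how many input vectors the integrating SOM receives, assuming every unit of every individual map contributes exactly one weight vector. The short calculation below makes this explicit; the dictionary keys reuse the set labels from the list, and the variable names are illustrative.

```python
# Map sizes (rows, columns) of the six individually trained SOMs.
map_sizes = {
    "set 0": (6, 10),
    "set 1": (7, 10),
    "set 2": (7, 10),
    "set 3": (7, 9),
    "set 4": (6, 9),
    "set 5": (6, 7),
}

# One weight vector per unit, so units per map = rows * columns.
units_per_map = {name: rows * cols for name, (rows, cols) in map_sizes.items()}
total_inputs = sum(units_per_map.values())  # 60 + 70 + 70 + 63 + 54 + 42 = 359

integrating_units = 10 * 15  # 150 units in the integrating SOM
```

Under that assumption, 359 weight vectors are used to train an integrating map of 150 units.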

Publications

Some selected publications covering the aspect of integrating distributed SOMLib maps.

- The SOMLib Digital Library System. A. Rauber and D. Merkl. Proceedings of the 3rd Europ. Conf. on Research and Advanced Technology for Digital Libraries (ECDL'99), Paris, France, September 22-24, 1999, Lecture Notes in Computer Science (LNCS 1696), Springer, 1999.
- Organization of Distributed Digital Libraries: A Neural Network-based Approach. A. Rauber and D. Merkl. Proc. Intl. Symposium on Intelligent Data Engineering and Learning (IDEAL98), Hong Kong, 1998.
- Creating an Order in Distributed Digital Libraries by Integrating Independent Self-Organizing Maps. A. Rauber and D. Merkl. Proc. International Conf. on Artificial Neural Networks (ICANN98), Skövde, Sweden, 1998.
- SOMLib: A Distributed Digital Library System Based on Self-Organizing Maps. A. Rauber. Proc. 10th Italian Workshop on Neural Nets (WIRN98), Vietri sul Mare, Italy, 1998.

Comments: rauber@ifs.tuwien.ac.at