Department of Software Technology
Vienna University of Technology
The SOMLib Digital Library - Integration of Distributed Libraries
Overview
If a digital library exists only distributed over
several sites, it may be more efficient to train independent
self-organizing maps that each represent one part of the
digital library than to transfer all of the information to one site
for training.
However, when some form of uniform access to the data is required, the
contents of the various sites have to be integrated.
In our approach to digital library organization, we suggest using
self-organizing maps to perform such an integration.
In particular, the map that shall integrate different portions of the
digital library may be trained by using the weight vectors of the
maps to be integrated.
Such a strategy may be applied recursively in order to build hierarchies
of arbitrary depth as shown in Figure 1.
In this figure, two lower level maps are integrated in a single higher level map.
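The training step described above can be sketched as follows. This is a minimal illustration in plain Python, not the SOMLib implementation: the function names, the learning-rate and neighborhood schedules, and the small hand-written lower level maps are all assumptions, and the sketch presumes that the lower level maps already share one vector structure.

```python
import math
import random


def best_matching_unit(som, vector):
    """Return (row, col) of the unit whose weight vector is closest to `vector`."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(som):
        for c, w in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(w, vector))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(inputs, rows, cols, epochs=200, lr0=0.5, seed=0):
    """Train a rows x cols SOM with a Gaussian neighborhood (illustrative schedule)."""
    rng = random.Random(seed)
    dim = len(inputs[0])
    som = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
           for _ in range(rows)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # linearly decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)    # shrinking neighborhood
        for x in rng.sample(inputs, len(inputs)):    # shuffled presentation order
            br, bc = best_matching_unit(som, x)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * radius ** 2))
                    w = som[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
    return som


def flatten(som):
    """List all weight vectors of a map, row by row."""
    return [w for row in som for w in row]


# Hand-written stand-ins for two already trained lower level maps (assumption).
map_a = [[[0.9, 0.1], [0.8, 0.2]], [[0.7, 0.3], [0.6, 0.4]]]
map_b = [[[0.1, 0.9], [0.2, 0.8]]]

# The weight vectors of the lower level maps are the input data of the higher level map.
inputs = flatten(map_a) + flatten(map_b)
high_map = train_som(inputs, rows=2, cols=2)
```

Because `train_som` only needs a list of vectors, the same call can be repeated with `flatten(high_map)` from several higher level maps as input, which corresponds to the recursive construction of deeper hierarchies.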
Note that selected parts of self-organizing maps may also be integrated
using essentially the same architecture.
The user simply selects areas of interest scattered across different maps
for which an integration shall be performed.
In this way, the user may tie together pieces of information to build her
own library fine-tuned to her particular interests.
Figure 1: Integration of two self-organizing maps
The effect of such an integration, obviously, is that input data items that
are separated in different low level maps are grouped together in the high
level map.
Input data that are mapped onto the same low level unit are represented
together in the high level map.
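The grouping effect can be illustrated with a small lookup chain from document to low level unit to high level unit. The following sketch uses made-up weight vectors, site names, and document names purely for illustration; the high level map is given by hand rather than trained.

```python
def winner(units, vector):
    """Index of the unit with the smallest squared Euclidean distance to `vector`."""
    return min(
        range(len(units)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(units[i], vector)),
    )


# Hypothetical low level maps at two sites: unit id -> weight vector.
low_units = {
    ("site_a", 0): [0.9, 0.1],
    ("site_a", 1): [0.2, 0.8],
    ("site_b", 0): [0.85, 0.15],
    ("site_b", 1): [0.1, 0.9],
}

# Hypothetical assignment of documents to low level units.
doc_to_low = {
    "doc1": ("site_a", 0),
    "doc2": ("site_a", 0),
    "doc3": ("site_b", 0),
    "doc4": ("site_b", 1),
}

# Hand-written stand-in for a trained high level map with two units.
high_units = [[1.0, 0.0], [0.0, 1.0]]

# Each low level unit is assigned to its winning high level unit ...
low_to_high = {uid: winner(high_units, w) for uid, w in low_units.items()}
# ... and every document inherits the high level unit of its low level unit.
doc_to_high = {doc: low_to_high[uid] for doc, uid in doc_to_low.items()}
```

Here `doc1` and `doc2` stay together because they share a low level unit, while `doc3`, which lives at a different site, joins them on the same high level unit because its low level unit has a similar weight vector.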
The SOMLib architecture uses two different types of SOMs. First, there is a set of independent, small SOMs, referred to as first order
maps, which are trained with the feature vectors obtained by parsing the documents. Every node thus represents a set of documents, and the whole map represents a
topographically ordered mapping of all documents in the library.
In a second step, higher order maps are trained using the weight vectors of those first order maps as
input vectors. Note, however, that the vocabularies, and thus the vector structures, of the separate libraries differ from each other. A unified feature vector structure
therefore has to be created by merging the vector structures of the libraries to be included, and the weight vectors are adapted to it before they are used to train the higher order map. The resulting map is
conceptually identical to the library maps it is based upon, except that each node now represents a set of nodes from the lower order maps.
Analogously, small SOMs trained with relevant documents can be used as user profiles to enhance keyword queries.
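The merging of vector structures mentioned above might look like the following sketch. It assumes that a vector structure is simply an ordered list of terms and that a term unknown to a library receives the weight 0.0 in that library's adapted vectors; both are assumptions made for illustration, as are the function names and the toy data.

```python
def merge_vocabularies(vocabularies):
    """Return the union of several term lists, keeping first-seen order."""
    merged, seen = [], set()
    for vocab in vocabularies:
        for term in vocab:
            if term not in seen:
                seen.add(term)
                merged.append(term)
    return merged


def project_weights(weights, vocab, merged):
    """Re-express weight vectors given over `vocab` in the order of `merged`.

    Terms that the source map does not know get weight 0.0 (an assumption).
    """
    index = {term: i for i, term in enumerate(vocab)}
    return [[w[index[t]] if t in index else 0.0 for t in merged] for w in weights]


# Toy vector structures and weight vectors of two independent libraries.
vocab_a = ["neural", "network", "library"]
vocab_b = ["library", "map", "neural"]
weights_a = [[0.9, 0.8, 0.1], [0.1, 0.2, 0.7]]
weights_b = [[0.6, 0.3, 0.2]]

merged = merge_vocabularies([vocab_a, vocab_b])
inputs = (project_weights(weights_a, vocab_a, merged)
          + project_weights(weights_b, vocab_b, merged))
```

After this step all vectors in `inputs` have the same length and term order, so they can serve as the training data of a higher order map.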
Referencing of Integrated Libraries
From the library administrator's point of view there are two different situations to consider. First, there are the first level SOMs, which are trained
with the feature vectors created by parsing (a subset of) the existing documents. The resulting maps are relatively small, since they only need to represent the
documents present in the library. New documents can be added to a map by parsing them using the previously extracted vector structure and mapping the resulting
feature vectors. As long as the general scope of the library does not change extensively, new documents can be added without destroying the topology preserving
mapping. As new topics emerge, the small first level libraries need to be retrained. The 'old' SOMLib map can either be retained to serve higher
order maps that reference it, or the nodes of the old map can be mapped onto the corresponding nodes of the new SOMLib map: the weight vectors of the old map's nodes are
modified to match the new vector structure and presented to the new map, and the winning node is determined for each. If a first order map tends to grow too big, one can split the
underlying documents into groups to create separate first order SOMLibs, which are then combined in a higher order map.
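Mapping the nodes of an old map onto a retrained map can be sketched as below. The sketch assumes that adapting a weight vector to the new vector structure means reordering it by term, dropping terms that no longer exist, and setting newly introduced terms to 0.0; this adaptation rule, the function names, and the toy maps are illustrative assumptions.

```python
def adapt_vector(weights, old_terms, new_terms):
    """Rewrite a weight vector from the old vector structure into the new one."""
    lookup = dict(zip(old_terms, weights))
    return [lookup.get(term, 0.0) for term in new_terms]


def winner(units, vector):
    """Index of the unit with the smallest squared Euclidean distance to `vector`."""
    return min(
        range(len(units)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(units[i], vector)),
    )


# Toy old map (two units) and retrained map (three units, one new term).
old_terms = ["som", "library"]
new_terms = ["som", "library", "web"]
old_units = [[0.9, 0.1], [0.1, 0.9]]
new_units = [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.1, 0.9]]

# For each old node, the winning node in the new map.
old_to_new = [
    winner(new_units, adapt_vector(w, old_terms, new_terms)) for w in old_units
]
```

The resulting list `old_to_new` can then be used to redirect references that still point at nodes of the old map.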
Second, there are higher order maps to be administered. These are based on several lower order maps, whose vector structures are merged to create a new
vector structure. The weight vectors of the lower order maps, modified to fit this structure, are then used to train the higher level SOMLib map. In many cases a natural hierarchy will evolve in
institutional settings: for example, several university departments may have their own SOMLibs as first order maps, which are integrated in a single second order map at
university level, which in turn may be combined at a national level, and so on. Others may choose to combine first or higher order SOMLibs of institutions covering a
certain topic of interest, with the possibility of mutual referencing, to create their personal library system.
Experiments
Below we provide some of the results of our experiments.
- The Time Article Collection was split into 6 independent subsets of articles to simulate the subsequent release of various editions.
Each subset was parsed separately and used to train a single map.
Next, the weight vector structures of these independent maps were merged to create a uniform weight vector structure, allowing the integration of the maps by training one single SOM using the weight vectors of the individual maps as input.
Publications
Selected publications on the integration of distributed
SOMLib maps:
- The SOMLib Digital Library System.
A. Rauber and D. Merkl
Proceedings of the 3rd European Conference on Research and Advanced
Technology for Digital Libraries (ECDL'99),
Paris, France, September 22-24, 1999,
Lecture Notes in Computer Science (LNCS 1696), Springer, 1999.
- Organization of Distributed Digital Libraries:
A Neural Network-based Approach,
A. Rauber and D. Merkl
Proc. Intl. Symposium on Intelligent Data Engineering and Learning
(IDEAL98), Hong Kong, 1998.
- Creating an Order in Distributed Digital Libraries by
Integrating Independent Self-Organizing Maps,
A. Rauber and D. Merkl
Proc. International Conf. on Artificial Neural Networks (ICANN98),
Skövde, Sweden, 1998.
- SOMLib: A Distributed Digital Library System Based on
Self-Organizing Maps,
A. Rauber
Proc. 10th Italian Workshop on Neural Nets (WIRN98),
Vietri sul Mare, Italy, 1998.
Comments: rauber@ifs.tuwien.ac.at