Department of Software Technology
Vienna University of Technology
The SOMLib Digital Library - Experiments
Overview
For our experiments we use several document archives of differing sizes.
This section is intended to serve as a kind of guided tour through the various stages of the SOMLib digital library system based on the various document archives.
Some of the experiments are available on-line for interactive exploration, with links to the experimental results, while for others we can currently provide only descriptions. For those who want to explore the various modules of the SOMLib system, we recommend looking at the experiments based on the TIME Magazine Collection from the 1960s presented below.
Further collections will be added as they become available.
Document Archives
- Experiments available on-line:
- ifs Abstracts Collection
This is a very small collection of abstracts of scientific publications of our department. It is mainly used to test and demonstrate the features of the SOMLib library using a simple and easy-to-understand document collection.
It consists of 50 abstracts, amounting to a total of 102 KB of ASCII text.
Available Results: For the smallest of our data collections we provide the data set, extracted feature vectors, interactive SOMs, and labels for the map.
Experiments are available online
- CIA World Factbook, 1990 edition
This is the 1990 edition of the CIA World Factbook, describing various regions of the world in terms of their geographical, social, political, demographic, and other features.
Please note that for our experiments we used the 1990 edition, i.e. a description of the world based on the situation before the 'fall' of the communist hemisphere.
It consists of 246 documents, or about 2,080 KB of plain ASCII text.
Available Results: For this experimental setting we provide the following results: data collection, feature vectors, descriptions of trained SOMs, Hierarchical Feature Maps (HFM) and integrated maps (non-interactive descriptions only).
Experiments are available online
- TIME Magazine Article Collection of the 1960s
A collection of articles from TIME Magazine from the 1960s.
It consists of 420 articles, or a total of 1,550 KB of ASCII text.
Available results: Although this is a rather small collection of publicly available articles, we are able to provide the most extensive set of experimental results for this setting. Results available include the full article set, feature vectors, SOMs, labeled SOMs, integration of distributed archives, a hierarchical archive using the GHSOM, as well as libViewer visualizations.
Experiments are available online
- Der Standard 1999
A collection of all articles from the Austrian daily newspaper Der Standard.
This collection includes about 50,000 German-language text files, or about 355 MB of HTML text.
Available results: For datasets of this size, flat SOMs become too big for interaction. We thus provide results of a hierarchical classification using our GHSOM model. Results available include the feature vectors and labeled hierarchical archives using the GHSOM, to allow comparison of the effects of different parameter settings.
Experiments are available online
- Russian Information Agency Novosti (RIAN)
A collection of articles from the Russian News Agency Novosti in several languages, including Russian, English, French, German, and Arabic, providing an ideal setting for multilingual experiments using the SOMLib system.
A non-parallel corpus of articles from a 14-day period in March 2001 is used to demonstrate the language-independence of the SOMLib system.
Furthermore, all articles were automatically translated to provide a single view of the multilingual document collection. Despite the noise introduced by the low-quality automatic translation, the SOMLib system succeeds in detecting correct topic hierarchies.
Available results: Both the separate GHSOM topic hierarchies for the individual languages and the combined translated hierarchy are available for interactive exploration, together with the articles and their respective feature vectors.
Experiments are available online
- Other collections analyzed:
- Astrophysical Journal Letters Collection
A collection of articles from the Astrophysical Journal Letters.
- Scientific American 1995-1999
A collection of recent articles from Scientific American.
This collection includes all articles from the years 1996 to the first half of 1999, or about 13 MB of HTML text.
- TIME Magazine, AsiaWeek Edition 1995-1999
A collection of recent articles from TIME Magazine, especially its AsiaWeek edition.
This collection includes all articles from the years 1995 to the first half of 1999, or about 72 MB of HTML text.
Comments: rauber@ifs.tuwien.ac.at