Department of Software Technology
Vienna University of Technology

The SOMLib Digital Library Project - Downloads

www.ifs.tuwien.ac.at/~andi/somlib

Introduction

Many people have asked us to provide the modules for our SOMLib Digital Library System. So, here we go: below you will find links to a variety of modules that we use for creating our SOMLib library maps. These include modules for preprocessing and text cleansing, feature extraction, SOM and GHSOM training, scripts to put the various things together, etc.
The modules were originally implemented in a variety of languages, such as C, C++ and Java. Below we provide compiled binaries for Intel platforms running the Linux operating system. You should be able to use most of the modules in a pretty straightforward way.

However, please note that the present system is a research prototype under constant development, extension and modification, and by no means a production version. You are free to use it for non-commercial purposes as you please (you would not seriously consider using it for commercial purposes anyway), and we would be happy to learn from your experiences. We will also try to help wherever we can; however, please be aware that we will not be able to provide any guaranteed support for the system :-) If you want to work with the system in more detail, and possibly fix some bugs that we have left there, or plan on extending or adapting it to your needs, we are happy to share the source code with you. In this case, just drop us an e-mail at rauber@ifs.tuwien.ac.at, and we'll send you whatever you need.

Modules

Step-by-Step-Guide: A rudimentary step-by-step guide on how to build a SOMLib system with the modules listed below.
Datafiles Descriptions: The parser modules as well as the SOM/GHSOM programs store their results in the form of simple ASCII-files. This report provides a coarse description of the various files, their formats and semantics.
HTML, gnuzipped Postscript, PDF
Demo-Collection: A collection of 51 short scientific abstracts from the Department of Software Technology (IFS). (Note: the gnu-zipped tar archive does NOT expand into a separate subdirectory)
democollection
html2txt: A program that converts HTML text into plain ASCII text. Reads from stdin, writes to stdout. NOTE: you may need to process files twice to get rid of nested HTML tags.
html2txt (binary LINUX x86)
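To illustrate why a second pass can be needed, here is a minimal Java sketch. It is not the html2txt source, and we are only assuming that html2txt behaves similarly: a simple stripper that removes innermost tags leaves a new tag behind when tags are nested, and only a further pass removes it. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; not the implementation used by html2txt.
class StripTagsSketch {
    // Matches an innermost tag: '<', then any characters except '<' and '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^<>]*>");

    /** One pass: removes every innermost tag found in the text. */
    public static String stripOnce(String text) {
        return TAG.matcher(text).replaceAll("");
    }

    /** Repeats single passes until the text no longer changes. */
    public static String stripAll(String text) {
        String previous;
        do {
            previous = text;
            text = stripOnce(text);
        } while (!text.equals(previous));
        return text;
    }

    public static void main(String[] args) {
        String input = "<<b>i>nested</i> text";
        System.out.println(stripOnce(input)); // prints "<i>nested text"
        System.out.println(stripAll(input));  // prints "nested text"
    }
}
```

With html2txt itself, the equivalent is simply to feed the output of a first run into a second run.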
SOMLib Java Package: A collection of Java programs that can be used to create SOMLib library systems. Includes (1) feature extraction, (2) feature space pruning, (3) feature vector creation, (4) feature vector normalization, (5) SOM training, (6) SOM labeling, and (7) libViewer template generation. NOTE: as the training of a SOM is computationally demanding, you might consider using a non-Java implementation for that part for large text collections.
More information can be found on the SOMLib Java Package Quick Reference.
SOMLIB.tar.gz (Java source and class files, gnu-zipped tar archive, 283 KB)
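As a rough illustration of what step (4), feature vector normalization, can look like, the sketch below scales a vector to Euclidean length 1. This is an assumption made for illustration: the page does not say which normalization the package applies, and the class and method names are hypothetical and not part of the SOMLib Java Package.

```java
// Illustrative sketch only; the SOMLib package may normalize differently.
class NormalizeSketch {
    /**
     * Returns a copy of v scaled to Euclidean length 1.
     * An all-zero vector is returned as an all-zero vector.
     */
    public static double[] toUnitLength(double[] v) {
        double sumOfSquares = 0.0;
        for (double x : v) {
            sumOfSquares += x * x;
        }
        double norm = Math.sqrt(sumOfSquares);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (norm == 0.0) ? 0.0 : v[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] unit = toUnitLength(new double[] {3.0, 4.0});
        System.out.println(unit[0] + " " + unit[1]); // components close to 0.6 and 0.8
    }
}
```

Normalizing to unit length keeps long documents from dominating the distance computations during SOM training merely because they contain more terms.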
SOMLib Parser Script: The SOMLib Parser Script performs all the necessary parsing steps one after the other, calling the respective modules from the SOMLib JAVA package.
somlib_parser_script (shell script, 10 KB)
GHSOM: The GHSOM is capable of producing (1) "conventional" SOMs, (2) growing SOMs and (3) growing hierarchical SOMs.
The source code provided is tested to compile under Linux, but should work equally well on most other systems. Just do a make configure and then make to compile the system. Otherwise, a compiled binary is provided as well.
The links provided here always point to the latest release of the GHSOM software. For older versions, as well as for more detailed information, property files, etc., refer to the download page of the GHSOM project.
- ghsom-binary: (binary LINUX x86)
- GHSOM.tgz: Source code (gnu-zipped tar archive, 210 KB)
- ghsom-guide: A short guide describing the file formats, the parameters of the GHSOM to be specified in the property file, etc.
- If you want to do several runs with the same data collection, it is unnecessary (and rather time-consuming) to compute the mqe0 every time. You can instead compute the mqe0 for a given vector file once and store it in an mqe0-file. This filename can then be specified in the property-file. If the file is found, the GHSOM does not need to compute the mqe0 at the beginning of the training iteration.
  calc_mqe0.c: a small program that calculates the mqe0 for a given vector file and stores it into an mqe0-file to be listed in the property-file.
- Sample property files for:
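On the mqe0 mentioned in the list above: the page does not define it. As we understand the GHSOM literature, it is the mean quantization error of layer 0, i.e. the average Euclidean distance of all input vectors to their mean vector. The Java sketch below computes that quantity under this assumption. It is not a port of calc_mqe0.c, it does not read or write the vector and mqe0 file formats, and the class and method names are hypothetical.

```java
// Illustrative sketch only; calc_mqe0.c may differ in details.
class Mqe0Sketch {
    /** Average Euclidean distance of the input vectors to their mean vector. */
    public static double mqe0(double[][] vectors) {
        if (vectors.length == 0) {
            throw new IllegalArgumentException("no input vectors");
        }
        int n = vectors.length;
        int dim = vectors[0].length;

        // Mean vector of the whole data collection.
        double[] mean = new double[dim];
        for (double[] v : vectors) {
            for (int d = 0; d < dim; d++) {
                mean[d] += v[d];
            }
        }
        for (int d = 0; d < dim; d++) {
            mean[d] /= n;
        }

        // Sum of the distances between each vector and the mean.
        double total = 0.0;
        for (double[] v : vectors) {
            double squared = 0.0;
            for (int d = 0; d < dim; d++) {
                double diff = v[d] - mean[d];
                squared += diff * diff;
            }
            total += Math.sqrt(squared);
        }
        return total / n;
    }

    public static void main(String[] args) {
        double[][] data = { {0.0, 0.0}, {2.0, 0.0} };
        System.out.println(mqe0(data)); // prints 1.0
    }
}
```

Since this value depends only on the input vectors and not on any training parameter, it is the same for every run on a given vector file, which is why it can be computed once and reused.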
GHSOM MATLAB Toolbox: We also have a Matlab implementation of the GHSOM available, which was developed in a joint project with the University of Aberdeen. Detailed documentation and examples can be found at the GHSOM Toolbox homepage.
GHSOM_MATLAB.tar.gz (gnu-zipped tar archive, 138 KB)
LibViewer and libServer: The Java sources and class files of both the libViewer and the 1-to-1-Server are made available due to frequent requests, together with a set of demo-files.
LIBVIEWER.tar.gz (gnu-zipped tar archive, 135 KB)
LIBSERVER.tar.gz (gnu-zipped tar archive, 86 KB)
Labeling SOMLib with KEA Phrases: Instead of extracting keywords using the LabelSOM method, key phrases may be extracted to label the document clusters. We use KEA for the extraction of these key labels. Sources for integrating KEA into SOMLib as well as a step-by-step guide with links to KEA sources are provided below.

Version 2.0:
(addresses some scalability issues and offers additional options with respect to output generation)
step-by-step guide: a guide on how to integrate KEA into SOMLib
LabelSOM_II-2.0.tgz: The binaries as well as some java classes
parse_unitfile-2.0.tgz: C-Source for unit file parser
unit2html-2.0.tgz: C-source for unit-file to html converter

Version 1.0:
step-by-step guide: a guide on how to integrate KEA into SOMLib
LabelSOM_II.tgz: The binaries as well as some java classes
parse_unitfile.tgz: C-Source for unit file parser
unit2html.tgz: C-source for unit-file to html converter

We will keep adding further modules as soon as they become sufficiently stable that we dare to put them up on this page (and as soon as we find the time to do so). If you have any comments, questions, requests, corrections, improvements, etc., feel free to contact us. Have fun! :-D


Comments: rauber@ifs.tuwien.ac.at