Department of Software Technology
Vienna University of Technology
The SOMLib Digital Library Project - Downloads
www.ifs.tuwien.ac.at/~andi/somlib
Introduction
Many people have asked us to provide the modules for our SOMLib Digital Library System. So, here we go: below you will find links to a variety of modules that we use for creating our SOMLib library maps. These include modules for preprocessing and text cleansing, feature extraction, SOM and GHSOM training, scripts to put the various pieces together, etc.
The modules were originally implemented in a variety of languages, such as C, C++, and Java. Below we provide compiled binaries for Intel platforms running the Linux operating system. You should be able to use most of the modules in a fairly straightforward way.
However, please note that the present system is a research prototype under constant development, extension, and modification, and by no means a production version. You are free to use it for non-commercial purposes as you please (you would not seriously consider using it for commercial purposes anyway), and we would be happy to learn from your experiences. We will also try to help wherever we can; however, please be aware that we will not be able to provide any guaranteed support for the system :-)
If you think you really want to play around with the system in more detail, and possibly fix some bugs that we have left there, or plan on extending or adapting it to your needs, we are happy to share the source code with you. In this case, just drop us an e-mail at rauber@ifs.tuwien.ac.at, and we'll send you whatever you think you want or need.
Modules
- Step-by-Step-Guide:
Here you can find a rudimentary step-by-step guide on how to build a SOMLib system with the modules listed below.
- Datafiles Descriptions:
The parser modules as well as the SOM/GHSOM programs store their results in the form of simple ASCII-files. This report provides a coarse description of the various files, their formats and semantics.
HTML, gnuzipped Postscript, PDF
- Demo-Collection:
A collection of 51 short scientific abstracts from the Department of Software Technology (IFS).
(Note: the gnu-zipped tar archive does NOT expand into a separate subdirectory)
democollection
- html2txt:
A program converting HTML text into plain ASCII text. Reads from stdin, writes to stdout. NOTE: you may need to process files twice to get rid of nested HTML tags.
html2txt (binary LINUX x86)
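Why a second pass can be necessary is easiest to see with a small Python sketch. This is not the html2txt implementation; it assumes a naive stripper that removes only the innermost `<...>` spans in each pass, and the function names are hypothetical.

```python
import re

# Matches a tag that contains no further angle brackets (an "innermost" tag).
INNERMOST_TAG = re.compile(r"<[^<>]*>")


def strip_tags_once(text: str) -> str:
    """Remove every innermost <...> span in a single pass."""
    return INNERMOST_TAG.sub("", text)


def strip_tags_until_stable(text: str, max_passes: int = 10) -> str:
    """Repeat the single pass until the text no longer changes."""
    for _ in range(max_passes):
        stripped = strip_tags_once(text)
        if stripped == text:
            break
        text = stripped
    return text


if __name__ == "__main__":
    nested = "<<b>i>Hello</<b>i>"
    print(strip_tags_once(nested))          # one pass leaves tags behind
    print(strip_tags_until_stable(nested))  # repeated passes remove them
```

With the real binary, the equivalent is simply to feed the output of one html2txt run into a second run via stdin and stdout.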
- SOMLib Java Package:
A collection of JAVA programs that can be used to create SOMLib library systems.
Includes (1) feature extraction, (2) feature space pruning, (3) feature vector creation, (4) feature vector normalization, (5) SOM training, (6) SOM labeling, (7) libViewer template generation.
NOTE: as the training of a SOM is computationally demanding, you might consider using a non-JAVA implementation for that part for large text collections.
More information can be found on the SOMLib Java Package Quick Reference.
SOMLIB.tar.gz (Java source and class files,
gnu-zipped tar archive, 283 KB)
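To give a rough idea of what steps (1) to (4) involve, here is a minimal Python sketch. It does not use or mirror the classes of the SOMLib Java package: whitespace tokenization, raw term counts, document-frequency thresholds for pruning, and unit-length (Euclidean) normalization are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
from collections import Counter


def term_counts(text):
    """Feature extraction: count lower-cased, whitespace-separated terms."""
    return Counter(text.lower().split())


def prune_vocabulary(doc_counts, min_df, max_df):
    """Feature space pruning: keep terms occurring in min_df..max_df documents."""
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())
    return sorted(t for t, n in df.items() if min_df <= n <= max_df)


def to_unit_vector(counts, vocabulary):
    """Vector creation and normalization: scale the count vector to length 1."""
    vec = [float(counts.get(t, 0)) for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


if __name__ == "__main__":
    docs = [
        "self organizing map",
        "digital library map",
        "digital library system",
    ]
    counts = [term_counts(d) for d in docs]
    vocabulary = prune_vocabulary(counts, min_df=2, max_df=2)
    for d, c in zip(docs, counts):
        print(d, to_unit_vector(c, vocabulary))
```

The resulting vectors are what a SOM training step would consume; the actual package may weight and prune terms differently.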
- SOMLib Parser Script:
The SOMLib Parser Script performs all the necessary parsing steps one after the other, calling the respective modules from the SOMLib JAVA package.
somlib_parser_script (shell script, 10 KB)
- GHSOM
The GHSOM is capable of producing (1) "conventional" SOMs, (2) growing SOMs and (3) growing hierarchical SOMs.
The source code provided is tested to compile under Linux, but should
work equally well on most other systems. Just do a make configure
and then make to compile the system. Otherwise, a compiled binary
is provided as well.
The links provided here always point to the latest release of the GHSOM software. For older versions, as well as for more detailed information, property files, etc., refer to the download page of the GHSOM project.
- ghsom-binary: (binary LINUX x86)
- GHSOM.tgz: Source code
(gnu-zipped tar archive, 210 KB)
- ghsom-guide: A short guide describing the
file formats, the parameters of the GHSOM to be specified in the property
file, etc.
- If you want to do several runs with the same data collection, recomputing
the mqe0 every time is unnecessary (and rather time consuming). It is thus
possible to compute the mqe0 once for a given vector file and store it in an
mqe0-file. This filename can then be specified in the property-file. If the
file is found, the GHSOM does not need to compute the mqe0 at the beginning
of the training iteration.
calc_mqe0.c: a small program that
calculates the mqe0 for a given vector file and stores it into an mqe0-file
to be listed in the property-file.
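The compute-once idea can be sketched in a few lines of Python. This is an illustration only, not a port of calc_mqe0.c: it assumes the usual GHSOM definition of mqe0 as the mean Euclidean distance between the input vectors and their mean vector, and the cache file written here (a single number as text) is a hypothetical format, not the actual mqe0-file format.

```python
import math
import os


def compute_mqe0(vectors):
    """Mean Euclidean distance of all vectors to their mean vector (assumed definition)."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return sum(math.dist(v, mean) for v in vectors) / n


def load_or_compute_mqe0(vectors, cache_path):
    """Reuse a stored value if the cache file exists; otherwise compute and store it."""
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            return float(fh.read().strip())
    value = compute_mqe0(vectors)
    with open(cache_path, "w") as fh:
        fh.write(repr(value))  # hypothetical format: one number as plain text
    return value


if __name__ == "__main__":
    data = [[0.0, 0.0], [2.0, 0.0]]
    print(load_or_compute_mqe0(data, "demo.mqe0"))
```

Note that a cached value is only valid for the vector file it was computed from; the sketch does not check this.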
- Sample property files for:
- GHSOM MATLAB Toolbox
We also have a Matlab implementation of the GHSOM available, which was
developed in a joint project with the University of Aberdeen. Detailed documentation and
examples can be found at the GHSOM Toolbox
homepage.
GHSOM_MATLAB.tar.gz (gnu-zipped tar
archive, 138 KB)
- LibViewer and libServer
The Java sources and class files of both the libViewer and the 1-to-1-Server
are made available due to frequent requests, together with a set of demo-files.
LIBVIEWER.tar.gz (gnu-zipped tar archive, 135 KB)
LIBSERVER.tar.gz (gnu-zipped tar archive, 86 KB)
- Labeling SOMLib with KEA Phrases
Instead of extracting keywords using the LabelSOM method, key phrases may be extracted to label the document clusters.
We use KEA for the extraction of these key labels. Sources for integrating KEA into SOMLib as well as a step-by-step
guide with links to KEA sources are provided below.
Version 2.0:
(addresses some scalability issues and offers additional options with respect to
output generation)
step-by-step guide: a guide on how to integrate KEA into SOMLib
LabelSOM_II-2.0.tgz: The binaries as well as some Java classes
parse_unitfile-2.0.tgz: C source for unit file parser
unit2html-2.0.tgz: C source for unit-file to HTML converter
Version 1.0
step-by-step guide: a guide on how to integrate KEA into SOMLib
LabelSOM_II.tgz: The binaries as well as some Java classes
parse_unitfile.tgz: C source for unit file parser
unit2html.tgz: C source for unit-file to HTML converter
We will keep adding further modules as soon as they become sufficiently stable that we dare to put them up on this page (and as soon as we find the time to put them here). If you have any comments, questions, requests, corrections, improvements, etc., feel free to contact us. Have fun! :-D
Comments: rauber@ifs.tuwien.ac.at