The various steps described are: 1.) Preprocessing, 2.) Parsing, 3.) Training, and 4.) libViewer Representation.

1.) Preprocessing
For the following steps we assume that all documents are available in some ASCII file format, such as plain text or HTML, in a directory named experiments/files.
Step 1: Create a directory called experiments/files, preferably within your www subdirectory, and copy all files to be included in the SOMLib library into this directory in an ASCII file format (e.g. plain text or HTML):
bash$ mkdir experiments
bash$ cd experiments
bash$ mkdir files
bash$ cd files

Alternatively (and preferably), you can create a symbolic link to the directory where the files are stored:
bash$ mkdir experiments
bash$ cd experiments
bash$ ln -s PATH_TO_YOUR_EXPERIMENTS_FILES files

For our experiments we will use a collection of abstracts from scientific papers, which you may download:
bash$ wget http://www.ifs.tuwien.ac.at/~andi/somlib/download/democollection.tar.gz
--16:06:28--  http://www.ifs.tuwien.ac.at:80/%7Eandi/somlib/download/democollection.tar.gz
           => `democollection.tar.gz'
Connecting to www.ifs.tuwien.ac.at:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 20,561 [application/x-tar]

    0K -> .......... ..........                                 [100%]

16:06:28 (542.68 KB/s) - `democollection.tar.gz' saved [20561/20561]

Unzip and untar the file to extract the articles; afterwards, the original tar.gz archive can be deleted:
bash$ tar -xvzf democollection.tar.gz
alb_lisa98.html
ber_hicss98.html
[...]
tjo_02.html
win_nlp96.html
bash$ rm democollection.tar.gz

We now have a collection of 50 HTML files in our directory, such as the abstract of a paper published at WIRN 1998: experiments/files/rau_wirn98.html
bash$ ls -al
total 208
drwxr-xr-x 2 andi ifs_staf 4096 Mar 8 16:08 .
drwxr-xr-x 3 andi ifs_staf 4096 Mar 8 15:50 ..
-rw-r--r-- 1 andi ifs_staf 2093 Mar 8 16:01 alb_lisa98.html
-rw-r--r-- 1 andi ifs_staf 1774 Mar 8 16:01 ber_hicss98.html
-rw-r--r-- 1 andi ifs_staf 1773 Mar 8 16:01 duf_vv98.html
-rw-r--r-- 1 andi ifs_staf 1885 Mar 8 16:01 ell_compsac96.html
-rw-r--r-- 1 andi ifs_staf 1704 Mar 8 16:01 ell_dexa96a.html
-rw-r--r-- 1 andi ifs_staf 1849 Mar 8 16:01 ell_dexa96b.html
-rw-r--r-- 1 andi ifs_staf 2249 Mar 8 16:01 ell_sast96.html
-rw-r--r-- 1 andi ifs_staf 1733 Mar 8 16:01 han_idamap97.html
-rw-r--r-- 1 andi ifs_staf 1540 Mar 8 16:01 has_eann97.html
-rw-r--r-- 1 andi ifs_staf 2280 Mar 8 16:01 hor_jcbm97.html
-rw-r--r-- 1 andi ifs_staf 1938 Mar 8 16:01 hor_jieee98.html
-rw-r--r-- 1 andi ifs_staf 1683 Mar 8 16:01 koh_acnn96.html
-rw-r--r-- 1 andi ifs_staf 1597 Mar 8 16:01 koh_cbms97.html
-rw-r--r-- 1 andi ifs_staf 1769 Mar 8 16:01 koh_eann97.html
-rw-r--r-- 1 andi ifs_staf 1236 Mar 8 16:01 koh_esann96.html
-rw-r--r-- 1 andi ifs_staf 1958 Mar 8 16:01 koh_icann96.html
-rw-r--r-- 1 andi ifs_staf 1554 Mar 8 16:01 koh_icann98.html
-rw-r--r-- 1 andi ifs_staf 1755 Mar 8 16:01 kor_twd98.html
-rw-r--r-- 1 andi ifs_staf 1502 Mar 8 16:01 mer_aiem96.html
-rw-r--r-- 1 andi ifs_staf 2157 Mar 8 16:01 mer_cise98.html
-rw-r--r-- 1 andi ifs_staf 2252 Mar 8 16:01 mer_codas96.html
-rw-r--r-- 1 andi ifs_staf 1758 Mar 8 16:01 mer_dexa97.html
-rw-r--r-- 1 andi ifs_staf 2145 Mar 8 16:01 mer_dexa98.html
-rw-r--r-- 1 andi ifs_staf 1485 Mar 8 16:01 mer_eann98.html
-rw-r--r-- 1 andi ifs_staf 1360 Mar 8 16:01 mer_fns97.html
-rw-r--r-- 1 andi ifs_staf 2276 Mar 8 16:01 mer_icail97.html
-rw-r--r-- 1 andi ifs_staf 1881 Mar 8 16:01 mer_nlp98.html
-rw-r--r-- 1 andi ifs_staf 1987 Mar 8 16:01 mer_pkdd97.html
-rw-r--r-- 1 andi ifs_staf 2004 Mar 8 16:01 mer_sigir97.html
-rw-r--r-- 1 andi ifs_staf 2124 Mar 8 16:01 mer_wirn97.html
-rw-r--r-- 1 andi ifs_staf 1662 Mar 8 16:01 mer_wsom97.html
-rw-r--r-- 1 andi ifs_staf 1467 Mar 8 16:01 mer_wsom97a.html
-rw-r--r-- 1 andi ifs_staf 2354 Mar 8 16:01 mik_aa97.html
-rw-r--r-- 1 andi ifs_staf 2088 Mar 8 16:01 mik_aips98.html
-rw-r--r-- 1 andi ifs_staf 1828 Mar 8 16:01 mik_bidamap97.html
-rw-r--r-- 1 andi ifs_staf 1800 Mar 8 16:01 mik_ecp97.html
-rw-r--r-- 1 andi ifs_staf 2272 Mar 8 16:01 mik_ijcai97.html
-rw-r--r-- 1 andi ifs_staf 2193 Mar 8 16:01 mik_jaim96.html
-rw-r--r-- 1 andi ifs_staf 1949 Mar 8 16:01 mik_keml97.html
-rw-r--r-- 1 andi ifs_staf 2407 Mar 8 16:01 mik_scamc96.html
-rw-r--r-- 1 andi ifs_staf 2567 Mar 8 16:01 rau_caise98dc.html
-rw-r--r-- 1 andi ifs_staf 1551 Mar 8 16:01 rau_esann98.html
-rw-r--r-- 1 andi ifs_staf 1953 Mar 8 16:01 rau_icann98.html
-rw-r--r-- 1 andi ifs_staf 1704 Mar 8 16:01 rau_wirn98.html
-rw-r--r-- 1 andi ifs_staf 2264 Mar 8 16:01 sha_aime97.html
-rw-r--r-- 1 andi ifs_staf 2119 Mar 8 16:01 sha_jaim98.html
-rw-r--r-- 1 andi ifs_staf 1964 Mar 8 16:01 sha_scamc96.html
-rw-r--r-- 1 andi ifs_staf 2134 Mar 8 16:01 tjo_01.html
-rw-r--r-- 1 andi ifs_staf 1414 Mar 8 16:01 tjo_02.html
-rw-r--r-- 1 andi ifs_staf 1335 Mar 8 16:01 win_nlp96.html
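As a quick sanity check that the collection is complete, you can count the files with standard shell tools:

bash$ ls *.html | wc -l   # should print 50 for the demo collection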
Next, convert the HTML files to plain text by stripping the markup; here we use the html2txt program (assumed to reside in the programs directory):

bash$ mkdir files_cleaned
bash$ cd files
bash$ for file in *; do
> ../programs/html2txt -t -a < $file > ../files_cleaned/$file
> done
bash$ cd ../files_cleaned

For some HTML files you may have to repeat the process to get rid of nested HTML tags, storing the intermediate files in a temporary directory or piping each file through the command twice, before finally obtaining a clean ASCII-text-only version:
bash$ for file in *; do
> ../programs/html2txt -t -a < $file | ../programs/html2txt -t -a > ../files_cleaned/$file
> done
bash$ cd ../files_cleaned
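To verify that the conversion really removed all markup, you can scan the cleaned files for anything that still looks like an HTML tag; this is a minimal sketch using standard grep (not part of the SOMLib tools). Any file listed should be run through html2txt once more:

bash$ grep -l '<[a-zA-Z/!]' *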
Optionally, the cleaned files can be stemmed, e.g. using a Porter stemmer script:

bash$ mkdir ../stemmed
bash$ for file in *; do
> porterstem.pl $file > ../stemmed/$file
> done

At the end of the preprocessing stage you should have a directory containing the pure ASCII files that you want to process with the SOMLib Digital Library system.
2.) Parsing
The parsing process creates feature vectors describing the contents of the documents.
For details on the feature vector creation process, see the Section on Text Representation at the SOMLib Project Homepage.
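As background: the feature vectors created below carry so-called tfxidf weights. The exact variant implemented by the SOMLib tools may differ in detail, but the classical tf×idf weighting assigns to term $j$ in document $i$ the weight

$w_{ij} = \mathrm{tf}_{ij} \cdot \ln\left(\frac{N}{\mathrm{df}_j}\right)$

where $\mathrm{tf}_{ij}$ is the number of occurrences of term $j$ in document $i$, $\mathrm{df}_j$ is the number of documents containing term $j$, and $N$ is the total number of documents. Terms that occur frequently in a document but rarely in the collection thus receive high weights.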
We use the feature extraction programs of the SOMLib Java package to obtain the feature vectors. Download the SOMLib Java package and put it in a directory called programs in your experiments directory. Unpacking the program extracts all necessary class files into a somlib subdirectory. Make sure you set the classpath to include the current directory. A detailed description of the various modules in the SOMLib Java package is provided at the package's homepage.
Instead of calling all modules separately, you may use the somlib parser script to call the appropriate modules. Calling the script without any parameters prints a listing of the applicable parameters.
bash$ programs/somlib_parser_script
ERROR: Usage: programs/somlib_parser_script Name InputDir MinWordLength Min_df Max_df Verbosity
Example: programs/somlib_parser_script somlib_test inp_dir 3 0.01 0.6 2
or simple version:
ERROR: Usage: programs/somlib_parser_script Name InputDir
Example: programs/somlib_parser_script somlib_test inp_dir

It is recommended to set all parameters individually, especially the upper and lower boundaries for the pruning process, as the percentage values need to be adapted to the number of files in the data set. For example, with the 50 files of the demo collection, Min_df=0.05 prunes terms occurring in fewer than 0.05 × 50 = 2.5 documents, i.e. in 2 or fewer. Some notes on the parameters:
[andi@student experiments]$ somlib_parser_script demo_1 files_cleaned 3 0.05 0.6 2
demo_1
Thu Apr 5 10:28:24 CEST 2001
somlib_parser_script demo_1 files_cleaned 3 0.05 0.6 2
(Usage: somlib_parser_script Name InputFiles MinWordLength Min_df Max_df Verbosity)
------------------------------------------------------
somlib_parser_script: created directory parser/ for parsing files
somlib_parser_script: created directory parser/histo/ for histogram files
somlib_parser_script: created directory vectors/ for vector files
somlib_parser_script: created local symbolic link to /usr/local/somlib/bin/somlib_java/ directory
------------------------------------------------------
somlib_parser_script: calling java -Xmx10000m somlib.textrepresentation.wordsexc -i files_cleaned -o parser/histo -m 3 -v 2 to create wordhistograms
somlib_parser_script: finished somlib.textrepresentation.wordsexc
------------------------------------------------------
somlib_parser_script: calling java -Xmx10000m somlib.textrepresentation.templatevectorexc -i parser/histo -o parser/demo_1.tv.hash -v 2 to extract template vector
somlib_parser_script: finished somlib.textrepresentation.templatevectorexc
------------------------------------------------------
somlib_parser_script: calling java -Xmx10000m somlib.textrepresentation.reducerexc -i parser/demo_1.tv.hash -o parser/demo_1.tv.red.hash -n 0.05 -x 0.6 -r vectors/demo_1.removed.txt -v 2 to create reduced templatevector
somlib_parser_script: finished somlib.textrepresentation.reducerexc
------------------------------------------------------
somlib_parser_script: calling java -Xmx10000m somlib.textrepresentation.extractorexc -i ./histo -j ./demo_1.gen.red.hash -o ./vectors/demo_1 -f t -v 2 to create individual vectors
somlib_parser_script: finished somlib.textrepresentation.extractorexc
------------------------------------------------------
somlib_parser_script: creating html file demo_1.parser.html
------------------------------------------------------
ls -al
total 28
drwxr-xr-x  5 andi ifs_staf 4096 Apr 5 10:28 .
drwxr-xr-x 11 andi ifs_staf 4096 Apr 5 09:56 ..
-rw-r--r--  1 andi ifs_staf 1247 Apr 5 10:28 demo_1.parser.html
-rw-r--r--  1 andi ifs_staf 2265 Apr 5 10:28 demo_1.parser.log
lrwxrwxrwx  1 andi ifs_staf   44 Apr 5 09:57 files -> /home/lehre/vo_dl/collections/democollection
drwxr-xr-x  2 andi ifs_staf 4096 Apr 5 10:06 files_cleaned
drwxr-xr-x  3 andi ifs_staf 4096 Apr 5 10:28 parser
lrwxrwxrwx  1 andi ifs_staf   33 Apr 5 10:28 somlib -> /usr/local/somlib/bin/somlib_java
drwxr-xr-x  2 andi ifs_staf 4096 Apr 5 10:28 vectors
ls -al vectors/*
-rw-r--r--  1 andi ifs_staf 14471 Apr 5 10:28 vectors/demo_1.removed.txt
-rw-r--r--  1 andi ifs_staf 74061 Apr 5 10:28 vectors/demo_1.tfxidf
-rw-r--r--  1 andi ifs_staf 14186 Apr 5 10:28 vectors/demo_1.tv
------------------------------------------------------
content parser done
Thu Apr 5 10:28:30 CEST 2001
------------------------------------------------------

During the parsing procedure, two directories are created: a parser directory containing the hash and histogram files of the documents (it can safely be removed after parsing), and a vectors directory containing the feature vectors as well as the list of pruned (i.e. removed) words. You should find the following files there: demo_1.removed.txt, demo_1.tfxidf, demo_1.tv
[andi@student experiments]$ dir vectors/
total 124
drwxr-xr-x 2 andi ifs_staf  4096 Apr 4 12:49 .
drwxr-xr-x 7 andi ifs_staf  4096 Apr 4 12:45 ..
-rw-r--r-- 1 andi ifs_staf 14471 Apr 4 12:48 demo_1.removed.txt
-rw-r--r-- 1 andi ifs_staf 74061 Apr 4 12:48 demo_1.tfxidf
-rw-r--r-- 1 andi ifs_staf 14186 Apr 4 12:48 demo_1.tv

By analyzing the list of removed words we can check whether the settings for Min_df and Max_df were appropriate. This file contains the list of all removed words, with a flag in the first column indicating whether a word was removed due to the upper (H) or lower (L) threshold. The number in the second column is the document frequency, i.e. the number of documents the respective term occurred in, whereas the third column lists the term frequency, i.e. the total number of times the term appeared in the collection. Usually only a small number of terms is removed due to the threshold on the maximum document frequency; these are the typical stop words such as articles etc. (typically between 20 and 300 terms). A quite significant reduction in dimensionality can be obtained by increasing the lower document frequency threshold (removing up to tens of thousands of terms), i.e. by removing rare words that do represent content but are too rare to differentiate between different content clusters. This threshold more or less defines a kind of minimum cluster size or topic granularity.
[andi@student vectors]$ grep L demo_1.removed.txt | sort +1 -rg | more
L 2 9 mortality
L 2 7 validation
L 2 7 protocols
L 2 7 criteria
L 2 6 tree
L 2 6 assessment
L 2 5 security
L 2 5 logit
L 2 5 asbruview
L 2 4 web
L 2 4 som
L 2 4 semantic
L 2 4 resources
:
:
[andi@student vectors]$ grep H demo_1.removed.txt | sort +1 -g | more
H 31 48 that
H 31 53 this
H 32 68 are
H 44 125 for
H 44 222 and
H 50 102 step
H 50 103 technology
H 50 50 comments
H 50 50 guide
H 50 50 ifs
H 50 50 rauber
H 50 50 tuwien
H 50 51 creating
H 50 51 university
H 50 51 vienna
H 50 52 somlib
:
:

Also worth analyzing in this context is the template vector file, to find out which words were not removed (and perhaps should be): if we take a look at the words with the highest document frequencies and find them to be mostly stop words, we may decide to lower the respective threshold to remove them from the list. The same applies to the lower threshold if we want to further reduce the dimensionality of the feature space.
[andi@student vectors]$ sort +2 -g demo_1.tv | more
$TYPE template
$VEC_DIM 492
$XDIM 7
$YDIM 50
100 services 3 3 1 1 1.0
103 straight 3 3 1 1 1.0
105 active 3 3 1 1 1.0
109 collected 3 4 1 2 1.3333333333333333
110 operations 3 3 1 1 1.0
112 regarded 3 4 1 2 1.3333333333333333
113 compensation 3 3 1 1 1.0
115 identify 3 3 1 1 1.0
116 pattern 3 4 1 2 1.3333333333333333
119 propose 3 3 1 1 1.0
122 explicitly 3 3 1 1 1.0
126 understanding 3 3 1 1 1.0
127 technique 3 4 1 2 1.3333333333333333
[andi@student vectors]$ sort +2 -rg demo_1.tv | more
229 with 29 43 1 3 1.4827586206896552
159 based 28 50 1 5 1.7857142857142858
137 data 27 94 1 12 3.4814814814814814
470 from 26 42 1 5 1.6153846153846154
416 paper 25 26 1 2 1.04
378 abstract 24 25 1 2 1.0416666666666667
313 neural 23 46 1 6 2.0
294 such 22 32 1 4 1.4545454545454546
258 which 22 31 1 3 1.4090909090909092
187 using 21 27 1 2 1.2857142857142858
177 knowledge 21 40 1 3 1.9047619047619047
167 representation 21 30 1 4 1.4285714285714286
74 model 20 23 1 2 1.15
251 approach 20 30 1 3 1.5
148 results 20 21 1 2 1.05
450 classification 19 37 1 5 1.9473684210526316

If you find the thresholds either too high or too low you may need to re-run the parsing process to obtain better feature vector representations. The resulting feature space typically ends up somewhere between 3,000 and 15,000 dimensions, with lower dimensionalities greatly reducing computation times for training the subsequent map. Before training the SOM we might want to take a look at the input vector file containing the documents' feature vectors as well as the name of each vector.
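The dimensionality actually obtained is recorded in the $VEC_DIM header of the generated files, so you can check it directly; a quick sketch using standard grep:

bash$ grep '\$VEC_DIM' demo_1.tv
$VEC_DIM 492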
[andi@student vectors]$ cd ~/www/experiments/
[andi@student experiments]$ more vectors/demo_1.tfxidf
$TYPE vec_tfxidf
$XDIM 50
$YDIM 1
$VEC_DIM 492
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0986123 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0986123 0 0 0 0 2.4849067 0 0 0 0 0 0 2.3025851 0 0 0 2.7725887 0 0 0 0 0 0 0 0 0 0 5.5451775 0 0 0 1.0986123 0 1.609438 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0986123 0 0 0 1.609438 2.7725887 0 0 0 0 0 0 0 0 2.0794415 0 0 0 0 0 0 0 0 0 0 2.7725887 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0 0 0 0 2.4849067 0 0 0 0 0 0.6931472 0 0 0 0 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0 0.6931472 0 0 0 0 0 0 0 0 0 2.0794415 0 0 2.4849067 0 0 0 0 1.9459101 0 0.6931472 0 0 0 0 0 0 0 0 0 0 0 0 1.9459101 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.6931472 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7.45472 0 0 0 0 0 1.0986123 0 2.0794415 0 0 0 0 0 0 0 0 0 0 0 2.0794415 0 0 0 0 2.4849067 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.3862944 0 0 0 0 0 0 0 0 0 4.9698133 0 0 0 2.4849067 0 0 0 0 0.6931472 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0794415 0 0 0 0 0 1.9459101 0 0 0 0 0 0 2.0794415 0 0 0 1.3862944 0 0 1.0986123 0 0 0 0 0 0 0 0 0 2.7725887 2.7725887 0 0 0 0 0 0 0 0 2.7725887 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4.9698133 0 0 4.9698133 0 0 1.9459101 0 0 2.3025851 0 0 0 0 0 0 0 0 0 0 2.4849067 0 0 0 2.7725887 0 1.9459101 0 0 0 0 0 0 0 0 0 1.9459101 0 1.609438 0 1.0986123 1.609438 0 2.7725887 0 0 0 0 0 0 0 0 0 0 0 0 2.3025851 0 0 0 0 0.6931472 0 0 0 0 0 0 0 0 2.7725887 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.7725887 0 0 0 0 0 parser/histo/alb_lisa98.html.txt.idv
0 0 3.583519 0 0 0 0 0 0 0 0 0 0 ... etc.

This file contains the tfxidf values of the various attributes, i.e. words, as described in the template vector file. The last entry in each line is the name of the vector, i.e. the name of the file. This filename will later be used to create the link to the documents, allowing users to browse the library.
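Since each vector occupies one line ending with its name, you can quickly list the documents contained in the input vector file by printing the last field of every line that is not a $-prefixed header line; a minimal sketch with standard awk:

bash$ awk '!/^\$/ {print $NF}' vectors/demo_1.tfxidf | more
parser/histo/alb_lisa98.html.txt.idv
...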
3.) Training
Following the vector creation process we can train the self-organizing maps.
For this we use the GHSOM program, which implements the Growing Hierarchical Self-Organizing Map and is capable of producing (1) "conventional" SOMs, (2) growing SOMs, and (3) growing hierarchical SOMs.
Download the GHSOM and put it in your programs directory.
The GHSOM program can thus be used to create three different flavours of SOMs: (1) the traditional, static SOM, which requires a fixed map size to be specified for the training process; (2) the growing SOM, where rows and columns are added to a SOM until it has reached a size sufficient for explaining the data to a certain degree of granularity; and (3) the growing hierarchical SOM (GHSOM), which adapts both its size and its hierarchical structure according to the data. Examples for all three kinds of maps are provided below.
The GHSOM reads all parameters from a so-called property file.
Create separate directories for the property files (which contain the parameters for the experiment runs you want to perform) and for the outputs of your runs, e.g.:
bash$ mkdir properties
bash$ mkdir output

Create or edit the property files in the properties directory according to your needs as follows (sample property files are provided below):
property1=value1
property2=value2
property3=value3
...
ATTENTION: no white space is allowed between property, value, and the equals sign. Furthermore, no trailing white space may be present after the value.
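A quick way to catch violations of this rule is to scan a property file for spaces around the equals sign or at line ends; this is a sketch using standard grep (not part of the GHSOM distribution), assuming the values carry no inline comments:

bash$ grep -nE ' =|= | +$' properties/som_static.prop

Any line reported should be cleaned up before starting a training run.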
If you don't provide one or more of the following properties, a default value will be used.
Property | Type | Range | Description
EXPAND_CYCLES | int | >=1 | number of cycles after which the map is checked for possible expansion; 1 cycle means presenting as many randomly chosen patterns as there are input vectors. Example: 100 input vectors, 10 cycles = 1000 randomly chosen patterns presented to the SOM for learning
TAU_1 | real | [0-1] | percentage of the remaining error that has to be explained by each map; a stopping criterion for horizontal growth. The smaller this value, the larger each map will grow and the flatter the hierarchy will be. A good starting point is a value of about 0.25
TAU_2 | real | [0-1] | final degree of granularity represented by the maps in the lowest layer. The smaller the value, the more detailed the data representation and thus the bigger the overall GHSOM structure. An appropriate value for testing is 0.1 or less; if you set this property to 1, only one single SOM in the first layer will be trained
INITIAL_LEARNRATE | real | [0-1] | determines how strongly the winner and its neighboring units are initially adapted; decreases over time. Good starting point: 0.8
INITIAL_NEIGHBOURHOOD | int | >=0 | initial neighborhood range; decreases over time. If you are training a GHSOM starting with a 2x2 initial map, a value of 2 or 3 is sufficient. If you are using the GHSOM to train a conventional SOM of size XxY, you might want to set it to X or Y, whichever is higher
HTML_PREFIX | string | - | prefix for the output files. All files will be labeled this way, followed by an underscore and subsequent numbering
DATAFILE_EXTENSION | string | may be empty | suffix for the references to the data files in the HTML tables; we usually name the vectors in the input vector file so that they link to the actual files, but omit the extension to get "better looking" maps; if you do so, you have to provide the extension here to get the correct links to the document files. For browsing, the document files are always expected in a subdirectory files of the directory where the HTML files are located
randomSeed | int | any | initial seed value for the random number generator, enabling repeatable training runs
inputFile | string | - | path (relative to the current directory or absolute) and name of the input vector file (e.g. vectors/test.in)
descriptionFile | string | - | path (relative to the current directory or absolute) and name of the template vector file (e.g. vectors/test.tv)
savePath | string | - | directory where the output files are written (without trailing slash). Note: make sure that this directory exists and that you have write permission for it! (e.g. output)
normInputVectors | string | NONE, LENGTH, INTERVAL | whether and how the input vectors are normalized: NONE = raw input data is used; LENGTH = vectors are normalized to length 1; INTERVAL = vector elements are transformed into the interval [0-1]
INITIAL_X_SIZE | int | >=1 | initial size of new maps in x-direction. For any growing map you will want to set this to 2, from which the map will start to grow; however, you can set it to any desired size right away
INITIAL_Y_SIZE | int | >=1 | initial size of new maps in y-direction. For any growing map you will want to set this to 2, from which the map will start to grow; however, you can set it to any desired size right away. If you set this value to 1, you will create a one-dimensional SOM that grows only linearly, resulting, if expanded hierarchically, in a tree-like representation of your data
LABELS_NUM | int | >=0 | maximum number of labels per unit; 0 = no labels. The LabelSOM method is used to select those features that are most characteristic of the respective unit to describe it
LABELS_ONLY | bool | true, false | if 'true', only the labels will be shown on units which have been expanded into the next layer, along with a link labeled "down". Setting this property to 'false' is only useful for testing small data sets, to see which data is mapped onto the corresponding map in the next layer
LABELS_THRESHOLD | real | [0-1] | the most important features are used as labels; a value of 0.8 means that only the features with values in the top 20% are printed as labels; the lower this value, the more labels will be shown (limited by LABELS_NUM)
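Since a missing input file or a non-existing savePath is only discovered when the program starts, it can save time to check the referenced paths beforehand; a minimal sketch with standard shell tools (again assuming values without inline comments):

bash$ grep -E '^(inputFile|descriptionFile)=' properties/som_static.prop | cut -d= -f2 |
> while read f; do [ -f "$f" ] || echo "missing file: $f"; done
bash$ d=$(grep '^savePath=' properties/som_static.prop | cut -d= -f2)
bash$ [ -d "$d" ] && [ -w "$d" ] || echo "savePath not a writable directory: $d"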
Static SOM
This property file setting emulates a static SOM: by setting both TAU_1 and TAU_2 to 1.0, the SOM immediately fulfills the stopping criteria for both horizontal and hierarchical growth. Thus, after EXPAND_CYCLES iterations, when the MQEs are evaluated, the training process stops and the map file is stored to disk.
EXPAND_CYCLES must be set to a rather high value, because the complete map is trained within this single training phase: as no expansion takes place, there are no further training repetitions.
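In numbers, for the demo collection: with EXPAND_CYCLES=100 and 50 input vectors, 100 × 50 = 5000 randomly chosen patterns are presented to the map before the expansion check runs, finds the (immediately satisfied) stopping criteria fulfilled, and ends the training.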
INITIAL_X_SIZE and INITIAL_Y_SIZE are directly set to the final map size, resulting in a map of 6x5 (X×Y) units for the given example.
EXPAND_CYCLES=100                  # iterations = #cycles * #input_vecs
MAX_CYCLES=0                       # max nr. of cycles, 0 = unlimited
TAU_1=1.0                          # stopping criterion for horizontal growth
TAU_2=1.0                          # absolute stopping criterion
INITIAL_LEARNRATE=0.5
INITIAL_NEIGHBOURHOOD=3
HTML_PREFIX=static_demo1           # output filename
DATAFILE_EXTENSION=                # link filename extension
randomSeed=17
inputFile=vectors/demo_1.tfxidf    # path to vector files
descriptionFile=vectors/demo_1.tv  # path to template vector
savePath=output                    # directory for results
printMQE=false                     # debug
normInputVectors=LENGTH            # normalize vecs to unit length
saveAsHTML=true                    # save html result files
saveAsSOMLib=true                  # save somlib datafiles
INITIAL_X_SIZE=6                   # size of SOM, horizontal
INITIAL_Y_SIZE=5                   # size of SOM, vertical
LABELS_NUM=15                      # max nr of labels
LABELS_ONLY=true                   # only labels for expanded units plus "down" link
LABELS_THRESHOLD=0.35              # threshold for label selection
ORIENTATION=false                  # ignore orientation of lower-level maps; overrides X and Y size if set true
Flat Growing SOM
The following property file results in a flat growing SOM, i.e. a map where, starting from an initial 2x2 units, rows and columns are added until the data is explained to a sufficient degree, as indicated by the parameter TAU_1. By setting TAU_2 again to 1.0, no hierarchical expansion takes place, as that stopping criterion is immediately met.
EXPAND_CYCLES can now be set to a somewhat lower value, as several training cycles will be performed anyway as new rows and columns are inserted.
Note that INITIAL_X_SIZE and INITIAL_Y_SIZE are set to 2, the size from which the map will start to grow.
EXPAND_CYCLES=40
MAX_CYCLES=0
TAU_1=0.01
TAU_2=1
INITIAL_LEARNRATE=0.5
INITIAL_NEIGHBOURHOOD=3
HTML_PREFIX=growing_demo1
DATAFILE_EXTENSION=
randomSeed=17
inputFile=vectors/demo_1.tfxidf
descriptionFile=vectors/demo_1.tv
savePath=output
printMQE=false
normInputVectors=LENGTH
saveAsHTML=true
saveAsSOMLib=true
INITIAL_X_SIZE=2
INITIAL_Y_SIZE=2
LABELS_NUM=15
LABELS_ONLY=true
LABELS_THRESHOLD=0.35
ORIENTATION=true

GHSOM 1
EXPAND_CYCLES=4
MAX_CYCLES=0
TAU_1=0.1
TAU_2=0.01
INITIAL_LEARNRATE=0.5
INITIAL_NEIGHBOURHOOD=3
HTML_PREFIX=ghsom_demo1_a
DATAFILE_EXTENSION=
randomSeed=17
inputFile=vectors/demo_1.tfxidf
descriptionFile=vectors/demo_1.tv
savePath=output
printMQE=false
normInputVectors=LENGTH
saveAsHTML=true
saveAsSOMLib=true
INITIAL_X_SIZE=2
INITIAL_Y_SIZE=2
LABELS_NUM=15
LABELS_ONLY=true
LABELS_THRESHOLD=0.35
ORIENTATION=true

GHSOM 2
EXPAND_CYCLES=4
MAX_CYCLES=0
TAU_1=0.15
TAU_2=0.05
INITIAL_LEARNRATE=0.5
INITIAL_NEIGHBOURHOOD=3
HTML_PREFIX=ghsom_demo1_b
DATAFILE_EXTENSION=
randomSeed=17
inputFile=vectors/demo_1.tfxidf
descriptionFile=vectors/demo_1.tv
savePath=output
printMQE=false
normInputVectors=LENGTH
saveAsHTML=true
saveAsSOMLib=true
INITIAL_X_SIZE=2
INITIAL_Y_SIZE=2
LABELS_NUM=15
LABELS_ONLY=true
LABELS_THRESHOLD=0.35
ORIENTATION=true

After editing the property files, simply run the ghsom program with the property file you want to use, e.g.:
[andi@student experiments]$ ghsom properties/som_static.prop
[andi@student experiments]$ nice -19 ghsom properties/som_flat_growing.prop &
[andi@student experiments]$ nice -19 ghsom properties/som_ghsom1.prop &
[andi@student experiments]$ nice -19 ghsom properties/som_ghsom2.prop &

Note: we use "nice -19" to assign a lower priority to the training process, so that we can still work interactively on the machine while the GHSOM is trained.
The GHSOM training process is performed, and the result files are written into the output directory specified in the properties file.
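For longer-running experiments you may want to detach the run from the terminal and keep the console output for later inspection; a sketch using standard Unix tools (the log file name is only an example):

bash$ nohup nice -19 ghsom properties/som_ghsom1.prop > ghsom1.log 2>&1 &
bash$ tail -f ghsom1.log   # follow the training progress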
[andi@student experiments]$ ghsom properties/ghsom1.prop
EXPAND_CYCLES = 4
MAX_CYCLES = 0
TAU_1 = 0.1
TAU_2 = 0.01
INITIAL_LEARNRATE = 0.5
INITIAL_NEIGHBOURHOOD = 3
HTML_PREFIX = ghsom_demo1_a
DATAFILE_EXTENSION =
randomSeed = 17
inputFile = vectors/demo_1.tfxidf
descriptionFile = vectors/demo_1.tv
savePath = output
printMQE = false
normInputVectors = LENGTH
saveAsHTML = true
saveAsSOMLib = true
INITIAL_X_SIZE = 2
INITIAL_Y_SIZE = 2
LABELS_NUM = 15
LABELS_ONLY = true
LABELS_THRESHOLD = 0.35
ORIENTATION = true
added alb_lisa98.html
added ber_hicss98.html
added duf_vv98.html
added ell_compsac96.html
added ell_dexa96a.html
added ell_dexa96b.html
added ell_sast96.html
added han_idamap97.html
added has_eann97.html
added hor_jcbm97.html
added hor_jieee98.html
added koh_acnn96.html
added koh_cbms97.html
added koh_eann97.html
added koh_esann96.html
added koh_icann96.html
added koh_icann98.html
added kor_twd98.html
added mer_aiem96.html
added mer_cise98.html
added mer_codas96.html
added mer_dexa97.html
added mer_dexa98.html
added mer_eann98.html
added mer_fns97.html
added mer_icail97.html
added mer_nlp98.html
added mer_pkdd97.html
added mer_sigir97.html
added mer_wirn97.html
added mer_wsom97.html
added mer_wsom97a.html
added mik_aa97.html
added mik_aips98.html
added mik_bidamap97.html
added mik_ecp97.html
added mik_ijcai97.html
added mik_jaim96.html
added mik_keml97.html
added mik_scamc96.html
added rau_caise98dc.html
added rau_esann98.html
added rau_icann98.html
added rau_wirn98.html
added sha_aime97.html
added sha_jaim98.html
added sha_scamc96.html
added tjo_01.html
added tjo_02.html
added win_nlp96.html
calculating MQE0
MQE: 44.8245
....MQE ; 9.77425, to go : 4.48245
neuron with max MQE : 0,1
inserting column:1
....MQE ; 5.9087, to go : 4.48245
neuron with max MQE : 1,1
inserting row:1
....MQE ; 3.69849, to go : 4.48245
MQE: 3.69849
UL: 0.000448 / 0.008255
UR: 0.002435 / 0.005777
LL: 0.000797 / 0.010301
LR: 0.007144 / 0.012635
....MQE ; 0.676031, to go : 0.339001
neuron with max MQE : 0,1
inserting row:1
....MQE ; 0.26785, to go : 0.339001
MQE: 0.26785
....MQE ; 0.329538, to go : 0.186332
neuron with max MQE : 0,0
inserting row:1
....MQE ; 0.19162, to go : 0.186332
neuron with max MQE : 0,0
inserting row:1
....MQE ; 0.0432246, to go : 0.186332
MQE: 0.0432246
....MQE ; 0.533793, to go : 0.305209
neuron with max MQE : 1,1
inserting row:1
....MQE ; 0.226509, to go : 0.305209
MQE: 0.226509
....MQE ; 0.751386, to go : 0.52348
neuron with max MQE : 0,1
inserting row:1
....MQE ; 0.567074, to go : 0.52348
neuron with max MQE : 1,0
inserting row:1
....MQE ; 0.288368, to go : 0.52348
MQE: 0.288368
....MQE ; 1.9259, to go : 1.04072
neuron with max MQE : 0,0
inserting column:1
....MQE ; 1.50876, to go : 1.04072
neuron with max MQE : 2,1
inserting row:1
....MQE ; 1.09421, to go : 1.04072
neuron with max MQE : 0,0
inserting column:1
....MQE ; 0.543406, to go : 1.04072
MQE: 0.543406
....MQE ; 0.226696, to go : 0.155404
neuron with max MQE : 1,0
inserting row:1
....MQE ; 0.13269, to go : 0.155404
MQE: 0.13269
....MQE ; 0.13099, to go : 0.173675
MQE: 0.13099
....MQE ; 0.770394, to go : 0.53574
neuron with max MQE : 1,0
inserting column:1
....MQE ; 0.494487, to go : 0.53574
MQE: 0.494487
....MQE ; 0.0268877, to go : 0.0690829
MQE: 0.0268877
....MQE ; 0.0207163, to go : 0.0550141
MQE: 0.0207163
....MQE ; 0.109155, to go : 0.0513641
neuron with max MQE : 0,0
inserting row:1
....MQE ; 0.0383167, to go : 0.0513641
MQE: 0.0383167
....MQE ; 0.0182991, to go : 0.0470159
MQE: 0.0182991
....MQE ; 0.0243392, to go : 0.0553909
MQE: 0.0243392
....MQE ; 0.0317231, to go : 0.0653689
MQE: 0.0317231
....MQE ; 0.311141, to go : 0.196105
neuron with max MQE : 0,1
inserting column:1
....MQE ; 0.12854, to go : 0.196105
MQE: 0.12854
....MQE ; 0.0928975, to go : 0.058814
neuron with max MQE : 1,1
inserting column:1
....MQE ; 0.025703, to go : 0.058814
MQE: 0.025703
....MQE ; 0.174803, to go : 0.0797663
neuron with max MQE : 0,1
inserting column:1
....MQE ; 0.0184409, to go : 0.0797663
MQE: 0.0184409
....MQE ; 0.0934771, to go : 0.0566015
neuron with max MQE : 1,1
inserting column:1
....MQE ; 0.0554816, to go : 0.0566015
MQE: 0.0554816
....MQE ; 0.0600263, to go : 0.0449625
neuron with max MQE : 0,1
inserting column:1
....MQE ; 0.0596367, to go : 0.0449625
neuron with max MQE : 2,0
inserting column:2
....MQE ; 0.000341286, to go : 0.0449625
MQE: 0.000341286
0.26
saving output/ghsom_demo1_a_1_1_0_0.html
saving output/ghsom_demo1_a_2_2_0_0.html
saving output/ghsom_demo1_a_3_2_1_0.html
saving output/ghsom_demo1_a_4_2_2_0.html
saving output/ghsom_demo1_a_5_2_0_1.html
saving output/ghsom_demo1_a_6_2_1_1.html
saving output/ghsom_demo1_a_7_2_2_1.html
saving output/ghsom_demo1_a_8_2_0_2.html
saving output/ghsom_demo1_a_9_2_1_2.html
saving output/ghsom_demo1_a_10_2_2_2.html
saving output/ghsom_demo1_a_11_3_1_2.html
saving output/ghsom_demo1_a_12_3_0_3.html
saving output/ghsom_demo1_a_13_3_1_3.html
saving output/ghsom_demo1_a_14_3_1_0.html
saving output/ghsom_demo1_a_15_3_2_0.html
saving output/ghsom_demo1_a_16_3_2_1.html
saving output/ghsom_demo1_a_17_3_3_2.html
saving output/ghsom_demo1_a_18_3_1_0.html
saving output/ghsom_demo1_a_19_3_0_1.html
saving output/ghsom_demo1_a_20_3_2_1.html
saving output/ghsom_demo1_a_1_1_0_0.mapdescr
saving output/ghsom_demo1_a_1_1_0_0.wgt
saving output/ghsom_demo1_a_1_1_0_0.unit
saving output/ghsom_demo1_a_2_2_0_0.mapdescr
saving output/ghsom_demo1_a_2_2_0_0.wgt
saving output/ghsom_demo1_a_2_2_0_0.unit
saving output/ghsom_demo1_a_3_2_1_0.mapdescr
saving output/ghsom_demo1_a_3_2_1_0.wgt
saving output/ghsom_demo1_a_3_2_1_0.unit
saving output/ghsom_demo1_a_4_2_2_0.mapdescr
saving output/ghsom_demo1_a_4_2_2_0.wgt
saving output/ghsom_demo1_a_4_2_2_0.unit
saving output/ghsom_demo1_a_5_2_0_1.mapdescr
saving output/ghsom_demo1_a_5_2_0_1.wgt
saving output/ghsom_demo1_a_5_2_0_1.unit
saving output/ghsom_demo1_a_6_2_1_1.mapdescr
saving output/ghsom_demo1_a_6_2_1_1.wgt
saving output/ghsom_demo1_a_6_2_1_1.unit
saving output/ghsom_demo1_a_7_2_2_1.mapdescr
saving output/ghsom_demo1_a_7_2_2_1.wgt
saving output/ghsom_demo1_a_7_2_2_1.unit
saving output/ghsom_demo1_a_8_2_0_2.mapdescr
saving output/ghsom_demo1_a_8_2_0_2.wgt
saving output/ghsom_demo1_a_8_2_0_2.unit
saving output/ghsom_demo1_a_9_2_1_2.mapdescr
saving output/ghsom_demo1_a_9_2_1_2.wgt
saving output/ghsom_demo1_a_9_2_1_2.unit
saving output/ghsom_demo1_a_10_2_2_2.mapdescr
saving output/ghsom_demo1_a_10_2_2_2.wgt
saving output/ghsom_demo1_a_10_2_2_2.unit
saving output/ghsom_demo1_a_11_3_1_2.mapdescr
saving output/ghsom_demo1_a_11_3_1_2.wgt
saving output/ghsom_demo1_a_11_3_1_2.unit
saving output/ghsom_demo1_a_12_3_0_3.mapdescr
saving output/ghsom_demo1_a_12_3_0_3.wgt
saving output/ghsom_demo1_a_12_3_0_3.unit
saving output/ghsom_demo1_a_13_3_1_3.mapdescr
saving output/ghsom_demo1_a_13_3_1_3.wgt
saving output/ghsom_demo1_a_13_3_1_3.unit
saving output/ghsom_demo1_a_14_3_1_0.mapdescr
saving output/ghsom_demo1_a_14_3_1_0.wgt
saving output/ghsom_demo1_a_14_3_1_0.unit
saving output/ghsom_demo1_a_15_3_2_0.mapdescr
saving output/ghsom_demo1_a_15_3_2_0.wgt
saving output/ghsom_demo1_a_15_3_2_0.unit
saving output/ghsom_demo1_a_16_3_2_1.mapdescr
saving output/ghsom_demo1_a_16_3_2_1.wgt
saving output/ghsom_demo1_a_16_3_2_1.unit
saving output/ghsom_demo1_a_17_3_3_2.mapdescr
saving output/ghsom_demo1_a_17_3_3_2.wgt
saving output/ghsom_demo1_a_17_3_3_2.unit
saving output/ghsom_demo1_a_18_3_1_0.mapdescr
saving output/ghsom_demo1_a_18_3_1_0.wgt
saving output/ghsom_demo1_a_18_3_1_0.unit
saving output/ghsom_demo1_a_19_3_0_1.mapdescr
saving output/ghsom_demo1_a_19_3_0_1.wgt
saving output/ghsom_demo1_a_19_3_0_1.unit
saving output/ghsom_demo1_a_20_3_2_1.mapdescr
saving output/ghsom_demo1_a_20_3_2_1.wgt
saving output/ghsom_demo1_a_20_3_2_1.unit

First, the input vectors are read. In the second step, the initial MQE of layer 0 is computed, which is subsequently used to guide the training process and to decide on the ultimate stopping criterion for the lowest-level granularity.
The results can be analyzed using any browser. For the GHSOM, the primary entry file is always called xxxxxx_1_1_0_0.html. To allow direct access to the documents we should also create a link from the output directory to the source files:
[andi@student experiments]$ cd ~/www/experiments/
[andi@student experiments]$ ln -s ../files output/files
[andi@student experiments]$ netscape output/ghsom_demo1_a_1_1_0_0.html &

The resulting HTML files should now be analyzed with respect to:
4.) libViewer Representation
The libViewer provides a graphical representation of your documents as a library of books on shelves. It reads a library description file, which specifies the graphical look and feel of the documents and their positions according to the trained maps.
Such a libViewer description file may now be created, providing a graphical user interface to the document collection.
For this, you need to have available some metadata on your documents that you want to depict graphically.
(A description of the libViewer file format etc. is still to be added. If you need the information right away, take a look at the demo files provided with the packages available for download on the web, and/or send me an e-mail if you have problems with those.)