The various steps described are: 1.) Preprocessing, 2.) Parsing, 3.) Training, and 4.) libViewer Representation.

1.) Preprocessing
For the following steps we assume that all documents are available in some ASCII file format, such as plain text or HTML, in a directory named experiments/files.
Step 1: Create a directory called experiments/files, preferably within your www subdirectory, and copy all files to be included in the SOMLib library into this directory in an ASCII file format (e.g. plain text or HTML):
bash$ mkdir experiments
bash$ cd experiments
bash$ mkdir files
bash$ cd files

Alternatively (and preferably), you can create a symbolic link to the directory where the files are stored:
bash$ mkdir experiments
bash$ cd experiments
bash$ ln -s PATH_TO_YOUR_EXPERIMENTS_FILES files

For our experiments we will use a collection of abstracts from scientific papers, which you may download:
bash$ wget http://www.ifs.tuwien.ac.at/~andi/somlib/download/democollection.tar.gz
--16:06:28--  http://www.ifs.tuwien.ac.at:80/%7Eandi/somlib/download/democollection.tar.gz
           => `democollection.tar.gz'
Connecting to www.ifs.tuwien.ac.at:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 20,561 [application/x-tar]

    0K -> .......... ..........                                 [100%]

16:06:28 (542.68 KB/s) - `democollection.tar.gz' saved [20561/20561]

Unzip and untar the file to extract the articles; afterwards, the original tar.gz archive can be deleted:
bash$ tar -xvzf democollection.tar.gz
alb_lisa98.html
ber_hicss98.html
[...]
tjo_02.html
win_nlp96.html
bash$ rm democollection.tar.gz

We now have a collection of 50 HTML files in our directory, such as the abstract of a paper published at WIRN 1998: experiments/files/rau_wirn98.html
bash$ ls -al
total 208
drwxr-xr-x 2 andi ifs_staf 4096 Mar 8 16:08 .
drwxr-xr-x 3 andi ifs_staf 4096 Mar 8 15:50 ..
-rw-r--r-- 1 andi ifs_staf 2093 Mar 8 16:01 alb_lisa98.html
-rw-r--r-- 1 andi ifs_staf 1774 Mar 8 16:01 ber_hicss98.html
-rw-r--r-- 1 andi ifs_staf 1773 Mar 8 16:01 duf_vv98.html
-rw-r--r-- 1 andi ifs_staf 1885 Mar 8 16:01 ell_compsac96.html
-rw-r--r-- 1 andi ifs_staf 1704 Mar 8 16:01 ell_dexa96a.html
-rw-r--r-- 1 andi ifs_staf 1849 Mar 8 16:01 ell_dexa96b.html
-rw-r--r-- 1 andi ifs_staf 2249 Mar 8 16:01 ell_sast96.html
-rw-r--r-- 1 andi ifs_staf 1733 Mar 8 16:01 han_idamap97.html
-rw-r--r-- 1 andi ifs_staf 1540 Mar 8 16:01 has_eann97.html
-rw-r--r-- 1 andi ifs_staf 2280 Mar 8 16:01 hor_jcbm97.html
-rw-r--r-- 1 andi ifs_staf 1938 Mar 8 16:01 hor_jieee98.html
-rw-r--r-- 1 andi ifs_staf 1683 Mar 8 16:01 koh_acnn96.html
-rw-r--r-- 1 andi ifs_staf 1597 Mar 8 16:01 koh_cbms97.html
-rw-r--r-- 1 andi ifs_staf 1769 Mar 8 16:01 koh_eann97.html
-rw-r--r-- 1 andi ifs_staf 1236 Mar 8 16:01 koh_esann96.html
-rw-r--r-- 1 andi ifs_staf 1958 Mar 8 16:01 koh_icann96.html
-rw-r--r-- 1 andi ifs_staf 1554 Mar 8 16:01 koh_icann98.html
-rw-r--r-- 1 andi ifs_staf 1755 Mar 8 16:01 kor_twd98.html
-rw-r--r-- 1 andi ifs_staf 1502 Mar 8 16:01 mer_aiem96.html
-rw-r--r-- 1 andi ifs_staf 2157 Mar 8 16:01 mer_cise98.html
-rw-r--r-- 1 andi ifs_staf 2252 Mar 8 16:01 mer_codas96.html
-rw-r--r-- 1 andi ifs_staf 1758 Mar 8 16:01 mer_dexa97.html
-rw-r--r-- 1 andi ifs_staf 2145 Mar 8 16:01 mer_dexa98.html
-rw-r--r-- 1 andi ifs_staf 1485 Mar 8 16:01 mer_eann98.html
-rw-r--r-- 1 andi ifs_staf 1360 Mar 8 16:01 mer_fns97.html
-rw-r--r-- 1 andi ifs_staf 2276 Mar 8 16:01 mer_icail97.html
-rw-r--r-- 1 andi ifs_staf 1881 Mar 8 16:01 mer_nlp98.html
-rw-r--r-- 1 andi ifs_staf 1987 Mar 8 16:01 mer_pkdd97.html
-rw-r--r-- 1 andi ifs_staf 2004 Mar 8 16:01 mer_sigir97.html
-rw-r--r-- 1 andi ifs_staf 2124 Mar 8 16:01 mer_wirn97.html
-rw-r--r-- 1 andi ifs_staf 1662 Mar 8 16:01 mer_wsom97.html
-rw-r--r-- 1 andi ifs_staf 1467 Mar 8 16:01 mer_wsom97a.html
-rw-r--r-- 1 andi ifs_staf 2354 Mar 8 16:01 mik_aa97.html
-rw-r--r-- 1 andi ifs_staf 2088 Mar 8 16:01 mik_aips98.html
-rw-r--r-- 1 andi ifs_staf 1828 Mar 8 16:01 mik_bidamap97.html
-rw-r--r-- 1 andi ifs_staf 1800 Mar 8 16:01 mik_ecp97.html
-rw-r--r-- 1 andi ifs_staf 2272 Mar 8 16:01 mik_ijcai97.html
-rw-r--r-- 1 andi ifs_staf 2193 Mar 8 16:01 mik_jaim96.html
-rw-r--r-- 1 andi ifs_staf 1949 Mar 8 16:01 mik_keml97.html
-rw-r--r-- 1 andi ifs_staf 2407 Mar 8 16:01 mik_scamc96.html
-rw-r--r-- 1 andi ifs_staf 2567 Mar 8 16:01 rau_caise98dc.html
-rw-r--r-- 1 andi ifs_staf 1551 Mar 8 16:01 rau_esann98.html
-rw-r--r-- 1 andi ifs_staf 1953 Mar 8 16:01 rau_icann98.html
-rw-r--r-- 1 andi ifs_staf 1704 Mar 8 16:01 rau_wirn98.html
-rw-r--r-- 1 andi ifs_staf 2264 Mar 8 16:01 sha_aime97.html
-rw-r--r-- 1 andi ifs_staf 2119 Mar 8 16:01 sha_jaim98.html
-rw-r--r-- 1 andi ifs_staf 1964 Mar 8 16:01 sha_scamc96.html
-rw-r--r-- 1 andi ifs_staf 2134 Mar 8 16:01 tjo_01.html
-rw-r--r-- 1 andi ifs_staf 1414 Mar 8 16:01 tjo_02.html
-rw-r--r-- 1 andi ifs_staf 1335 Mar 8 16:01 win_nlp96.html
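As a quick sanity check that the collection is complete, you can count the files with standard shell tools:

bash$ ls *.html | wc -l   # should print 50 for the demo collection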
Next, convert the HTML files to plain text by stripping the markup; here we use the html2txt program (assumed to reside in the programs directory):

bash$ mkdir files_cleaned
bash$ cd files
bash$ for file in *; do
> ../programs/html2txt -t -a < $file > ../files_cleaned/$file
> done
bash$ cd ../files_cleaned

For some HTML files you may have to repeat the process to get rid of nested HTML tags, storing the intermediate files in a temporary directory or piping each file through the command twice, before finally obtaining a clean ASCII-text-only version:
bash$ for file in *; do
> ../programs/html2txt -t -a < $file | ../programs/html2txt -t -a > ../files_cleaned/$file
> done
bash$ cd ../files_cleaned
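To verify that the conversion really removed all markup, you can scan the cleaned files for anything that still looks like an HTML tag; this is a minimal sketch using standard grep (not part of the SOMLib tools). Any file listed should be run through html2txt once more:

bash$ grep -l '<[a-zA-Z/!]' *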
Optionally, the cleaned files can be stemmed, e.g. using a Porter stemmer script:

bash$ mkdir ../stemmed
bash$ for file in *; do
> porterstem.pl $file > ../stemmed/$file
> done

At the end of the preprocessing stage you should have a directory containing the pure ASCII files that you want to process with the SOMLib Digital Library system.
2.) Parsing
The parsing process creates feature vectors describing the contents of the documents.
For details on the feature vector creation process, see the Section on Text Representation at the SOMLib Project Homepage.
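As background: the feature vectors created below carry so-called tfxidf weights. The exact variant implemented by the SOMLib tools may differ in detail, but the classical tf×idf weighting assigns to term $j$ in document $i$ the weight

$w_{ij} = \mathrm{tf}_{ij} \cdot \ln\left(\frac{N}{\mathrm{df}_j}\right)$

where $\mathrm{tf}_{ij}$ is the number of occurrences of term $j$ in document $i$, $\mathrm{df}_j$ is the number of documents containing term $j$, and $N$ is the total number of documents. Terms that occur frequently in a document but rarely in the collection thus receive high weights.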
We use the feature extraction programs of the SOMLib Java package to obtain the feature vectors. Download the SOMLib Java package and put it in a directory called programs in your experiments directory. Unpacking the program extracts all necessary class files into a somlib subdirectory. Make sure you set the classpath to include the current directory. A detailed description of the various modules in the SOMLib Java package is provided at the package's homepage.
Instead of calling all modules separately, you may use the somlib parser script to call the appropriate modules. Calling the script without any parameters prints a listing of the applicable parameters.
bash$ programs/somlib_parser_script
ERROR: Usage: programs/somlib_parser_script Name InputDir MinWordLength Min_df Max_df Verbosity
Example: programs/somlib_parser_script somlib_test inp_dir 3 0.01 0.6 2
or simple version:
ERROR: Usage: programs/somlib_parser_script Name InputDir
Example: programs/somlib_parser_script somlib_test inp_dir

It is recommended to set all parameters individually, especially the upper and lower boundaries for the pruning process, as the percentage values need to be adapted to the number of files in the data set. For example, with the 50 files of the demo collection, Min_df=0.05 prunes terms occurring in fewer than 0.05 × 50 = 2.5 documents, i.e. in 2 or fewer. Some notes on the parameters:
[andi@student experiments]$ somlib_parser_script demo_1 files_cleaned 3 0.05 0.6 2
demo_1
Thu Apr 5 10:28:24 CEST 2001
somlib_parser_script demo_1 files_cleaned 3 0.05 0.6 2
(Usage: somlib_parser_script Name InputFiles MinWordLength Min_df Max_df Verbosity)
------------------------------------------------------
somlib_parser_script: created directory parser/ for parsing files
somlib_parser_script: created directory parser/histo/ for histogram files
somlib_parser_script: created directory vectors/ for vector files
somlib_parser_script: created local symbolic link to /usr/local/somlib/bin/somlib_java/ directory
------------------------------------------------------
somlib_parser_script: calling java -Xmx10000m somlib.textrepresentation.wordsexc -i files_cleaned -o parser/histo -m 3 -v 2 to create wordhistograms
somlib_parser_script: finished somlib.textrepresentation.wordsexc
------------------------------------------------------
somlib_parser_script: calling java -Xmx10000m somlib.textrepresentation.templatevectorexc -i parser/histo -o parser/demo_1.tv.hash -v 2 to extract template vector
somlib_parser_script: finished somlib.textrepresentation.templatevectorexc
------------------------------------------------------
somlib_parser_script: calling java -Xmx10000m somlib.textrepresentation.reducerexc -i parser/demo_1.tv.hash -o parser/demo_1.tv.red.hash -n 0.05 -x 0.6 -r vectors/demo_1.removed.txt -v 2 to create reduced templatevector
somlib_parser_script: finished somlib.textrepresentation.reducerexc
------------------------------------------------------
somlib_parser_script: calling java -Xmx10000m somlib.textrepresentation.extractorexc -i ./histo -j ./demo_1.gen.red.hash -o ./vectors/demo_1 -f t -v 2 to create individual vectors
somlib_parser_script: finished somlib.textrepresentation.extractorexc
------------------------------------------------------
somlib_parser_script: creating html file demo_1.parser.html
------------------------------------------------------
ls -al
total 28
drwxr-xr-x  5 andi ifs_staf 4096 Apr 5 10:28 .
drwxr-xr-x 11 andi ifs_staf 4096 Apr 5 09:56 ..
-rw-r--r--  1 andi ifs_staf 1247 Apr 5 10:28 demo_1.parser.html
-rw-r--r--  1 andi ifs_staf 2265 Apr 5 10:28 demo_1.parser.log
lrwxrwxrwx  1 andi ifs_staf   44 Apr 5 09:57 files -> /home/lehre/vo_dl/collections/democollection
drwxr-xr-x  2 andi ifs_staf 4096 Apr 5 10:06 files_cleaned
drwxr-xr-x  3 andi ifs_staf 4096 Apr 5 10:28 parser
lrwxrwxrwx  1 andi ifs_staf   33 Apr 5 10:28 somlib -> /usr/local/somlib/bin/somlib_java
drwxr-xr-x  2 andi ifs_staf 4096 Apr 5 10:28 vectors
ls -al vectors/*
-rw-r--r--  1 andi ifs_staf 14471 Apr 5 10:28 vectors/demo_1.removed.txt
-rw-r--r--  1 andi ifs_staf 74061 Apr 5 10:28 vectors/demo_1.tfxidf
-rw-r--r--  1 andi ifs_staf 14186 Apr 5 10:28 vectors/demo_1.tv
------------------------------------------------------
content parser done
Thu Apr 5 10:28:30 CEST 2001
------------------------------------------------------

During the parsing procedure, two directories are created: a parser directory containing the hash and histogram files of the documents (it can safely be removed after parsing), and a vectors directory containing the feature vectors as well as the list of pruned (i.e. removed) words. You should find the following files there: demo_1.removed.txt, demo_1.tfxidf, demo_1.tv
[andi@student experiments]$ dir vectors/
total 124
drwxr-xr-x 2 andi ifs_staf  4096 Apr 4 12:49 .
drwxr-xr-x 7 andi ifs_staf  4096 Apr 4 12:45 ..
-rw-r--r-- 1 andi ifs_staf 14471 Apr 4 12:48 demo_1.removed.txt
-rw-r--r-- 1 andi ifs_staf 74061 Apr 4 12:48 demo_1.tfxidf
-rw-r--r-- 1 andi ifs_staf 14186 Apr 4 12:48 demo_1.tv

By analyzing the list of removed words we can check whether the settings for Min_df and Max_df were appropriate. This file contains the list of all removed words, with a flag in the first column indicating whether a word was removed due to the upper (H) or lower (L) threshold. The number in the second column is the document frequency, i.e. the number of documents the respective term occurred in, whereas the third column lists the term frequency, i.e. the total number of times the term appeared in the collection. Usually only a small number of terms is removed due to the threshold on the maximum document frequency; these are the typical stop words such as articles etc. (typically between 20 and 300 terms). A quite significant reduction in dimensionality can be obtained by increasing the lower document frequency threshold (removing up to tens of thousands of terms), i.e. by removing rare words that do represent content but are too rare to differentiate between different content clusters. This threshold more or less defines a kind of minimum cluster size or topic granularity.
[andi@student vectors]$ grep L demo_1.removed.txt | sort +1 -rg | more
L 2 9 mortality
L 2 7 validation
L 2 7 protocols
L 2 7 criteria
L 2 6 tree
L 2 6 assessment
L 2 5 security
L 2 5 logit
L 2 5 asbruview
L 2 4 web
L 2 4 som
L 2 4 semantic
L 2 4 resources
:
:
[andi@student vectors]$ grep H demo_1.removed.txt | sort +1 -g | more
H 31 48 that
H 31 53 this
H 32 68 are
H 44 125 for
H 44 222 and
H 50 102 step
H 50 103 technology
H 50 50 comments
H 50 50 guide
H 50 50 ifs
H 50 50 rauber
H 50 50 tuwien
H 50 51 creating
H 50 51 university
H 50 51 vienna
H 50 52 somlib
:
:

Also worth analyzing in this context is the template vector file, to find out which words were not removed (and perhaps should be): if we take a look at the words with the highest document frequencies and find them to be mostly stop words, we may decide to lower the respective threshold to remove them from the list. The same applies to the lower threshold if we want to further reduce the dimensionality of the feature space.
[andi@student vectors]$ sort +2 -g demo_1.tv | more
$TYPE template
$VEC_DIM 492
$XDIM 7
$YDIM 50
100 services 3 3 1 1 1.0
103 straight 3 3 1 1 1.0
105 active 3 3 1 1 1.0
109 collected 3 4 1 2 1.3333333333333333
110 operations 3 3 1 1 1.0
112 regarded 3 4 1 2 1.3333333333333333
113 compensation 3 3 1 1 1.0
115 identify 3 3 1 1 1.0
116 pattern 3 4 1 2 1.3333333333333333
119 propose 3 3 1 1 1.0
122 explicitly 3 3 1 1 1.0
126 understanding 3 3 1 1 1.0
127 technique 3 4 1 2 1.3333333333333333
[andi@student vectors]$ sort +2 -rg demo_1.tv | more
229 with 29 43 1 3 1.4827586206896552
159 based 28 50 1 5 1.7857142857142858
137 data 27 94 1 12 3.4814814814814814
470 from 26 42 1 5 1.6153846153846154
416 paper 25 26 1 2 1.04
378 abstract 24 25 1 2 1.0416666666666667
313 neural 23 46 1 6 2.0
294 such 22 32 1 4 1.4545454545454546
258 which 22 31 1 3 1.4090909090909092
187 using 21 27 1 2 1.2857142857142858
177 knowledge 21 40 1 3 1.9047619047619047
167 representation 21 30 1 4 1.4285714285714286
74 model 20 23 1 2 1.15
251 approach 20 30 1 3 1.5
148 results 20 21 1 2 1.05
450 classification 19 37 1 5 1.9473684210526316

If you find the thresholds either too high or too low you may need to re-run the parsing process to obtain better feature vector representations. The resulting feature space typically ends up somewhere between 3,000 and 15,000 dimensions, with lower dimensionalities greatly reducing computation times for training the subsequent map. Before training the SOM we might want to take a look at the input vector file containing the documents' feature vectors as well as the name of each vector.
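The dimensionality actually obtained is recorded in the $VEC_DIM header of the generated files, so you can check it directly; a quick sketch using standard grep:

bash$ grep '\$VEC_DIM' demo_1.tv
$VEC_DIM 492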
[andi@student vectors]$ cd ~/www/experiments/
[andi@student experiments]$ more vectors/demo_1.tfxidf
$TYPE vec_tfxidf
$XDIM 50
$YDIM 1
$VEC_DIM 492
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0986123 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0986123 0 0 0 0 2.4849067 0 0 0 0 0 0 2.3025851 0 0 0 2.7725887 0 0 0 0 0 0 0 0 0 0 5.5451775 0 0 0 1.0986123 0 1.609438 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0986123 0 0 0 1.609438 2.7725887 0 0 0 0 0 0 0 0 2.0794415 0 0 0 0 0 0 0 0 0 0 2.7725887 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0 0 0 0 2.4849067 0 0 0 0 0 0.6931472 0 0 0 0 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0 0.6931472 0 0 0 0 0 0 0 0 0 2.0794415 0 0 2.4849067 0 0 0 0 1.9459101 0 0.6931472 0 0 0 0 0 0 0 0 0 0 0 0 1.9459101 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.6931472 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7.45472 0 0 0 0 0 1.0986123 0 2.0794415 0 0 0 0 0 0 0 0 0 0 0 2.0794415 0 0 0 0 2.4849067 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.3862944 0 0 0 0 0 0 0 0 0 4.9698133 0 0 0 2.4849067 0 0 0 0 0.6931472 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0794415 0 0 0 0 0 1.9459101 0 0 0 0 0 0 2.0794415 0 0 0 1.3862944 0 0 1.0986123 0 0 0 0 0 0 0 0 0 2.7725887 2.7725887 0 0 0 0 0 0 0 0 2.7725887 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4.9698133 0 0 4.9698133 0 0 1.9459101 0 0 2.3025851 0 0 0 0 0 0 0 0 0 0 2.4849067 0 0 0 2.7725887 0 1.9459101 0 0 0 0 0 0 0 0 0 1.9459101 0 1.609438 0 1.0986123 1.609438 0 2.7725887 0 0 0 0 0 0 0 0 0 0 0 0 2.3025851 0 0 0 0 0.6931472 0 0 0 0 0 0 0 0 2.7725887 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.7725887 0 0 0 0 0 parser/histo/alb_lisa98.html.txt.idv
0 0 3.583519 0 0 0 0 0 0 0 0 0 0 ... etc.

This file contains the tfxidf values of the various attributes, i.e. words, as described in the template vector file. The last entry in each line is the name of the vector, i.e. the name of the file. This filename will later be used to create the link to the documents, allowing users to browse the library.
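Since each vector occupies one line ending with its name, you can quickly list the documents contained in the input vector file by printing the last field of every line that is not a $-prefixed header line; a minimal sketch with standard awk:

bash$ awk '!/^\$/ {print $NF}' vectors/demo_1.tfxidf | more
parser/histo/alb_lisa98.html.txt.idv
...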
3.) Training
Following the vector creation process we can train the self-organizing maps.
For this we use the GHSOM program, which implements the Growing Hierarchical Self-Organizing Map and is capable of producing (1) "conventional" SOMs, (2) growing SOMs, and (3) growing hierarchical SOMs.
Download the GHSOM and put it in your programs directory.
The GHSOM program can thus be used to create three different flavours of SOMs: (1) the traditional, static SOM, which requires a fixed map size to be specified for the training process; (2) the growing SOM, where rows and columns are added to a SOM until it has reached a size sufficient for explaining the data to a certain degree of granularity; and (3) the growing hierarchical SOM (GHSOM), which adapts both its size and its hierarchical structure according to the data. Examples for all three kinds of maps are provided below.
The GHSOM reads all parameters from a so-called property file.
Create separate directories for the property files (which contain the parameters for the experiment runs you want to perform) and for the outputs of your runs, e.g.:
bash$ mkdir properties
bash$ mkdir output

Create or edit the property files in the properties directory according to your needs as follows (sample property files are provided below):
property1=value1
property2=value2
property3=value3
...
ATTENTION: no white space is allowed between property, value, and the equals sign. Furthermore, no trailing white space may be present after the value.
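A quick way to catch violations of this rule is to scan a property file for spaces around the equals sign or at line ends; this is a sketch using standard grep (not part of the GHSOM distribution), assuming the values carry no inline comments:

bash$ grep -nE ' =|= | +$' properties/som_static.prop

Any line reported should be cleaned up before starting a training run.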
If you don't provide one or more of the following properties, a default value will be used.
Property | Type | Range | Description
EXPAND_CYCLES | int | >=1 | number of cycles after which the map is checked for possible expansion; 1 cycle means presenting as many randomly chosen patterns as there are input vectors. Example: 100 input vectors, 10 cycles = 1000 randomly chosen patterns presented to the SOM for learning
TAU_1 | real | [0-1] | percentage of the remaining error that has to be explained by each map; a stopping criterion for horizontal growth. The smaller this value, the larger each map will grow and the flatter the hierarchy will be. A good starting point is a value of about 0.25
TAU_2 | real | [0-1] | final degree of granularity represented by the maps in the lowest layer. The smaller the value, the more detailed the data representation and thus the bigger the overall GHSOM structure. An appropriate value for testing is 0.1 or less; if you set this property to 1, only one single SOM in the first layer will be trained
INITIAL_LEARNRATE | real | [0-1] | determines how strongly the winner and its neighboring units are initially adapted; decreases over time. Good starting point: 0.8
INITIAL_NEIGHBOURHOOD | int | >=0 | initial neighborhood range; decreases over time. If you are training a GHSOM starting with a 2x2 initial map, a value of 2 or 3 is sufficient. If you are using the GHSOM to train a conventional SOM of size XxY, you might want to set it to X or Y, whichever is higher
HTML_PREFIX | string | - | prefix for the output files. All files will be labeled this way, followed by an underscore and subsequent numbering
DATAFILE_EXTENSION | string | may be empty | suffix for the references to the data files in the HTML tables; we usually name the vectors in the input vector file so that they link to the actual files, but omit the extension to get "better looking" maps; if you do so, you have to provide the extension here to get the correct links to the document files. For browsing, the document files are always expected in a subdirectory files of the directory where the HTML files are located
randomSeed | int | any | initial seed value for the random number generator, enabling repeatable training runs
inputFile | string | - | path (relative to the current directory or absolute) and name of the input vector file (e.g. vectors/test.in)
descriptionFile | string | - | path (relative to the current directory or absolute) and name of the template vector file (e.g. vectors/test.tv)
savePath | string | - | directory where the output files are written (without trailing slash). Note: make sure that this directory exists and that you have write permission for it! (e.g. output)
normInputVectors | string | NONE, LENGTH, INTERVAL | whether and how the input vectors are normalized: NONE = raw input data is used; LENGTH = vectors are normalized to length 1; INTERVAL = vector elements are transformed into the interval [0-1]
INITIAL_X_SIZE | int | >=1 | initial size of new maps in x-direction. For any growing map you will want to set this to 2, from which the map will start to grow; however, you can set it to any desired size right away
INITIAL_Y_SIZE | int | >=1 | initial size of new maps in y-direction. For any growing map you will want to set this to 2, from which the map will start to grow; however, you can set it to any desired size right away. If you set this value to 1, you will create a one-dimensional SOM that grows only linearly, resulting, if expanded hierarchically, in a tree-like representation of your data
LABELS_NUM | int | >=0 | maximum number of labels per unit; 0 = no labels. The LabelSOM method is used to select those features that are most characteristic of the respective unit to describe it
LABELS_ONLY | bool | true, false | if 'true', only the labels will be shown on units which have been expanded into the next layer, along with a link labeled "down". Setting this property to 'false' is only useful for testing small data sets, to see which data is mapped onto the corresponding map in the next layer
LABELS_THRESHOLD | real | [0-1] | the most important features are used as labels; a value of 0.8 means that only the features with values in the top 20% are printed as labels; the lower this value, the more labels will be shown (limited by LABELS_NUM)
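Since a missing input file or a non-existing savePath is only discovered when the program starts, it can save time to check the referenced paths beforehand; a minimal sketch with standard shell tools (again assuming values without inline comments):

bash$ grep -E '^(inputFile|descriptionFile)=' properties/som_static.prop | cut -d= -f2 |
> while read f; do [ -f "$f" ] || echo "missing file: $f"; done
bash$ d=$(grep '^savePath=' properties/som_static.prop | cut -d= -f2)
bash$ [ -d "$d" ] && [ -w "$d" ] || echo "savePath not a writable directory: $d"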
Static SOM
This property file setting emulates a static SOM: by setting both TAU_1 and TAU_2 to 1.0, the SOM immediately fulfills the stopping criteria for both horizontal and hierarchical growth. Thus, after EXPAND_CYCLES iterations, when the MQEs are evaluated, the training process stops and the map file is stored to disk.
EXPAND_CYCLES must be set to a rather high value, because the complete map is trained within this single training phase: as no expansion takes place, there are no further training repetitions.
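In numbers, for the demo collection: with EXPAND_CYCLES=100 and 50 input vectors, 100 × 50 = 5000 randomly chosen patterns are presented to the map before the expansion check runs, finds the (immediately satisfied) stopping criteria fulfilled, and ends the training.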
INITIAL_X_SIZE and INITIAL_Y_SIZE are directly set to the final map size, resulting in a map of 6x5 (X×Y) units for the given example.
EXPAND_CYCLES=100                  # iterations = #cycles * #input_vecs
MAX_CYCLES=0                       # max nr. of cycles, 0 = unlimited
TAU_1=1.0                          # stopping criterion for horizontal growth
TAU_2=1.0                          # absolute stopping criterion
INITIAL_LEARNRATE=0.5
INITIAL_NEIGHBOURHOOD=3
HTML_PREFIX=static_demo1           # output filename
DATAFILE_EXTENSION=                # link filename extension
randomSeed=17
inputFile=vectors/demo_1.tfxidf    # path to vector files
descriptionFile=vectors/demo_1.tv  # path to template vector
savePath=output                    # directory for results
printMQE=false                     # debug
normInputVectors=LENGTH            # normalize vecs to unit length
saveAsHTML=true                    # save html result files
saveAsSOMLib=true                  # save somlib datafiles
INITIAL_X_SIZE=6                   # size of SOM, horizontal
INITIAL_Y_SIZE=5                   # size of SOM, vertical
LABELS_NUM=15                      # max nr of labels
LABELS_ONLY=true                   # only labels for expanded units plus "down" link
LABELS_THRESHOLD=0.35              # threshold for label selection
ORIENTATION=false                  # ignore orientation of lower-level maps; overrides X and Y size if set true
Flat Growing SOM
The following property file results in a flat growing SOM, i.e. a map where, starting from an initial 2x2 units, rows and columns are added until the data is explained to a sufficient degree, as indicated by the parameter TAU_1. By setting TAU_2 again to 1.0, no hierarchical expansion takes place, as that stopping criterion is immediately met.
EXPAND_CYCLES can now be set to a somewhat lower value, as several training cycles will be performed anyway as new rows and columns are inserted.
Note that INITIAL_X_SIZE and INITIAL_Y_SIZE are set to 2, the size from which the map will start to grow.
EXPAND_CYCLES=40
MAX_CYCLES=0
TAU_1=0.01
TAU_2=1
INITIAL_LEARNRATE=0.5
INITIAL_NEIGHBOURHOOD=3
HTML_PREFIX=growing_demo1
DATAFILE_EXTENSION=
randomSeed=17
inputFile=vectors/demo_1.tfxidf
descriptionFile=vectors/demo_1.tv
savePath=output
printMQE=false
normInputVectors=LENGTH
saveAsHTML=true
saveAsSOMLib=true
INITIAL_X_SIZE=2
INITIAL_Y_SIZE=2
LABELS_NUM=15
LABELS_ONLY=true
LABELS_THRESHOLD=0.35
ORIENTATION=true

GHSOM 1
EXPAND_CYCLES=4
MAX_CYCLES=0
TAU_1=0.1
TAU_2=0.01
INITIAL_LEARNRATE=0.5
INITIAL_NEIGHBOURHOOD=3
HTML_PREFIX=ghsom_demo1_a
DATAFILE_EXTENSION=
randomSeed=17
inputFile=vectors/demo_1.tfxidf
descriptionFile=vectors/demo_1.tv
savePath=output
printMQE=false
normInputVectors=LENGTH
saveAsHTML=true
saveAsSOMLib=true
INITIAL_X_SIZE=2
INITIAL_Y_SIZE=2
LABELS_NUM=15
LABELS_ONLY=true
LABELS_THRESHOLD=0.35
ORIENTATION=true

GHSOM 2
EXPAND_CYCLES=4
MAX_CYCLES=0
TAU_1=0.15
TAU_2=0.05
INITIAL_LEARNRATE=0.5
INITIAL_NEIGHBOURHOOD=3
HTML_PREFIX=ghsom_demo1_b
DATAFILE_EXTENSION=
randomSeed=17
inputFile=vectors/demo_1.tfxidf
descriptionFile=vectors/demo_1.tv
savePath=output
printMQE=false
normInputVectors=LENGTH
saveAsHTML=true
saveAsSOMLib=true
INITIAL_X_SIZE=2
INITIAL_Y_SIZE=2
LABELS_NUM=15
LABELS_ONLY=true
LABELS_THRESHOLD=0.35
ORIENTATION=true

After editing the property files, simply run the ghsom program with the property file you want to use, e.g.:
[andi@student experiments]$ ghsom properties/som_static.prop
[andi@student experiments]$ nice -19 ghsom properties/som_flat_growing.prop &
[andi@student experiments]$ nice -19 ghsom properties/som_ghsom1.prop &
[andi@student experiments]$ nice -19 ghsom properties/som_ghsom2.prop &

Note: we use "nice -19" to assign a lower priority to the training process, so that we can still work interactively on the machine while the GHSOM is trained.
The GHSOM training process is performed, and the result files are written into the output directory specified in the properties file.
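For longer-running experiments you may want to detach the run from the terminal and keep the console output for later inspection; a sketch using standard Unix tools (the log file name is only an example):

bash$ nohup nice -19 ghsom properties/som_ghsom1.prop > ghsom1.log 2>&1 &
bash$ tail -f ghsom1.log   # follow the training progress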
[andi@student experiments]$ ghsom properties/ghsom1.prop
EXPAND_CYCLES = 4
MAX_CYCLES = 0
TAU_1 = 0.1
TAU_2 = 0.01
INITIAL_LEARNRATE = 0.5
INITIAL_NEIGHBOURHOOD = 3
HTML_PREFIX = ghsom_demo1_a
DATAFILE_EXTENSION =
randomSeed = 17
inputFile = vectors/demo_1.tfxidf
descriptionFile = vectors/demo_1.tv
savePath = output
printMQE = false
normInputVectors = LENGTH
saveAsHTML = true
saveAsSOMLib = true
INITIAL_X_SIZE = 2
INITIAL_Y_SIZE = 2
LABELS_NUM = 15
LABELS_ONLY = true
LABELS_THRESHOLD = 0.35
ORIENTATION = true
added alb_lisa98.html
added ber_hicss98.html
added duf_vv98.html
added ell_compsac96.html
added ell_dexa96a.html
added ell_dexa96b.html
added ell_sast96.html
added han_idamap97.html
added has_eann97.html
added hor_jcbm97.html
added hor_jieee98.html
added koh_acnn96.html
added koh_cbms97.html
added koh_eann97.html
added koh_esann96.html
added koh_icann96.html
added koh_icann98.html
added kor_twd98.html
added mer_aiem96.html
added mer_cise98.html
added mer_codas96.html
added mer_dexa97.html
added mer_dexa98.html
added mer_eann98.html
added mer_fns97.html
added mer_icail97.html
added mer_nlp98.html
added mer_pkdd97.html
added mer_sigir97.html
added mer_wirn97.html
added mer_wsom97.html
added mer_wsom97a.html
added mik_aa97.html
added mik_aips98.html
added mik_bidamap97.html
added mik_ecp97.html
added mik_ijcai97.html
added mik_jaim96.html
added mik_keml97.html
added mik_scamc96.html
added rau_caise98dc.html
added rau_esann98.html
added rau_icann98.html
added rau_wirn98.html
added sha_aime97.html
added sha_jaim98.html
added sha_scamc96.html
added tjo_01.html
added tjo_02.html
added win_nlp96.html
calculating MQE0
MQE: 44.8245
....MQE ; 9.77425, to go : 4.48245
neuron with max MQE : 0,1
inserting column:1
....MQE ; 5.9087, to go : 4.48245
neuron with max MQE : 1,1
inserting row:1
....MQE ; 3.69849, to go : 4.48245
MQE: 3.69849
UL: 0.000448 / 0.008255
UR: 0.002435 / 0.005777
LL: 0.000797 / 0.010301
LR: 0.007144 / 0.012635
....MQE ; 0.676031, to go : 0.339001
neuron with max MQE : 0,1
inserting row:1
....MQE ; 0.26785, to go : 0.339001
MQE: 0.26785
....MQE ; 0.329538, to go : 0.186332
neuron with max MQE : 0,0
inserting row:1
....MQE ; 0.19162, to go : 0.186332
neuron with max MQE : 0,0
inserting row:1
....MQE ; 0.0432246, to go : 0.186332
MQE: 0.0432246
....MQE ; 0.533793, to go : 0.305209
neuron with max MQE : 1,1
inserting row:1
....MQE ; 0.226509, to go : 0.305209
MQE: 0.226509
....MQE ; 0.751386, to go : 0.52348
neuron with max MQE : 0,1
inserting row:1
....MQE ; 0.567074, to go : 0.52348
neuron with max MQE : 1,0
inserting row:1
....MQE ; 0.288368, to go : 0.52348
MQE: 0.288368
....MQE ; 1.9259, to go : 1.04072
neuron with max MQE : 0,0
inserting column:1
....MQE ; 1.50876, to go : 1.04072
neuron with max MQE : 2,1
inserting row:1
....MQE ; 1.09421, to go : 1.04072
neuron with max MQE : 0,0
inserting column:1
....MQE ; 0.543406, to go : 1.04072
MQE: 0.543406
....MQE ; 0.226696, to go : 0.155404
neuron with max MQE : 1,0
inserting row:1
....MQE ; 0.13269, to go : 0.155404
MQE: 0.13269
....MQE ; 0.13099, to go : 0.173675
MQE: 0.13099
....MQE ; 0.770394, to go : 0.53574
neuron with max MQE : 1,0
inserting column:1
....MQE ; 0.494487, to go : 0.53574
MQE: 0.494487
....MQE ; 0.0268877, to go : 0.0690829
MQE: 0.0268877
....MQE ; 0.0207163, to go : 0.0550141
MQE: 0.0207163
....MQE ; 0.109155, to go : 0.0513641
neuron with max MQE : 0,0
inserting row:1
....MQE ; 0.0383167, to go : 0.0513641
MQE: 0.0383167
....MQE ; 0.0182991, to go : 0.0470159
MQE: 0.0182991
....MQE ; 0.0243392, to go : 0.0553909
MQE: 0.0243392
....MQE ; 0.0317231, to go : 0.0653689
MQE: 0.0317231
....MQE ; 0.311141, to go : 0.196105
neuron with max MQE : 0,1
inserting column:1
....MQE ; 0.12854, to go : 0.196105
MQE: 0.12854
....MQE ; 0.0928975, to go : 0.058814
neuron with max MQE : 1,1
inserting column:1
....MQE ; 0.025703, to go : 0.058814
MQE: 0.025703
....MQE ; 0.174803, to go : 0.0797663
neuron with max MQE : 0,1
inserting column:1
....MQE ; 0.0184409, to go : 0.0797663
MQE: 0.0184409
....MQE ; 0.0934771, to go : 0.0566015
neuron with max MQE : 1,1
inserting column:1
....MQE ; 0.0554816, to go : 0.0566015
MQE: 0.0554816
....MQE ; 0.0600263, to go : 0.0449625
neuron with max MQE : 0,1
inserting column:1
....MQE ; 0.0596367, to go : 0.0449625
neuron with max MQE : 2,0
inserting column:2
....MQE ; 0.000341286, to go : 0.0449625
MQE: 0.000341286
0.26
saving output/ghsom_demo1_a_1_1_0_0.html
saving output/ghsom_demo1_a_2_2_0_0.html
saving output/ghsom_demo1_a_3_2_1_0.html
saving output/ghsom_demo1_a_4_2_2_0.html
saving output/ghsom_demo1_a_5_2_0_1.html
saving output/ghsom_demo1_a_6_2_1_1.html
saving output/ghsom_demo1_a_7_2_2_1.html
saving output/ghsom_demo1_a_8_2_0_2.html
saving output/ghsom_demo1_a_9_2_1_2.html
saving output/ghsom_demo1_a_10_2_2_2.html
saving output/ghsom_demo1_a_11_3_1_2.html
saving output/ghsom_demo1_a_12_3_0_3.html
saving output/ghsom_demo1_a_13_3_1_3.html
saving output/ghsom_demo1_a_14_3_1_0.html
saving output/ghsom_demo1_a_15_3_2_0.html
saving output/ghsom_demo1_a_16_3_2_1.html
saving output/ghsom_demo1_a_17_3_3_2.html
saving output/ghsom_demo1_a_18_3_1_0.html
saving output/ghsom_demo1_a_19_3_0_1.html
saving output/ghsom_demo1_a_20_3_2_1.html
saving output/ghsom_demo1_a_1_1_0_0.mapdescr
saving output/ghsom_demo1_a_1_1_0_0.wgt
saving output/ghsom_demo1_a_1_1_0_0.unit
saving output/ghsom_demo1_a_2_2_0_0.mapdescr
saving output/ghsom_demo1_a_2_2_0_0.wgt
saving output/ghsom_demo1_a_2_2_0_0.unit
saving output/ghsom_demo1_a_3_2_1_0.mapdescr
saving output/ghsom_demo1_a_3_2_1_0.wgt
saving output/ghsom_demo1_a_3_2_1_0.unit
saving output/ghsom_demo1_a_4_2_2_0.mapdescr
saving output/ghsom_demo1_a_4_2_2_0.wgt
saving output/ghsom_demo1_a_4_2_2_0.unit
saving output/ghsom_demo1_a_5_2_0_1.mapdescr
saving output/ghsom_demo1_a_5_2_0_1.wgt
saving output/ghsom_demo1_a_5_2_0_1.unit
saving output/ghsom_demo1_a_6_2_1_1.mapdescr
saving output/ghsom_demo1_a_6_2_1_1.wgt
saving output/ghsom_demo1_a_6_2_1_1.unit
saving output/ghsom_demo1_a_7_2_2_1.mapdescr
saving output/ghsom_demo1_a_7_2_2_1.wgt
saving output/ghsom_demo1_a_7_2_2_1.unit
saving output/ghsom_demo1_a_8_2_0_2.mapdescr
saving output/ghsom_demo1_a_8_2_0_2.wgt
saving output/ghsom_demo1_a_8_2_0_2.unit
saving output/ghsom_demo1_a_9_2_1_2.mapdescr
saving output/ghsom_demo1_a_9_2_1_2.wgt
saving output/ghsom_demo1_a_9_2_1_2.unit
saving output/ghsom_demo1_a_10_2_2_2.mapdescr
saving output/ghsom_demo1_a_10_2_2_2.wgt
saving output/ghsom_demo1_a_10_2_2_2.unit
saving output/ghsom_demo1_a_11_3_1_2.mapdescr
saving output/ghsom_demo1_a_11_3_1_2.wgt
saving output/ghsom_demo1_a_11_3_1_2.unit
saving output/ghsom_demo1_a_12_3_0_3.mapdescr
saving output/ghsom_demo1_a_12_3_0_3.wgt
saving output/ghsom_demo1_a_12_3_0_3.unit
saving output/ghsom_demo1_a_13_3_1_3.mapdescr
saving output/ghsom_demo1_a_13_3_1_3.wgt
saving output/ghsom_demo1_a_13_3_1_3.unit
saving output/ghsom_demo1_a_14_3_1_0.mapdescr
saving output/ghsom_demo1_a_14_3_1_0.wgt
saving output/ghsom_demo1_a_14_3_1_0.unit
saving output/ghsom_demo1_a_15_3_2_0.mapdescr
saving output/ghsom_demo1_a_15_3_2_0.wgt
saving output/ghsom_demo1_a_15_3_2_0.unit
saving output/ghsom_demo1_a_16_3_2_1.mapdescr
saving output/ghsom_demo1_a_16_3_2_1.wgt
saving output/ghsom_demo1_a_16_3_2_1.unit
saving output/ghsom_demo1_a_17_3_3_2.mapdescr
saving output/ghsom_demo1_a_17_3_3_2.wgt
saving output/ghsom_demo1_a_17_3_3_2.unit
saving output/ghsom_demo1_a_18_3_1_0.mapdescr
saving output/ghsom_demo1_a_18_3_1_0.wgt
saving output/ghsom_demo1_a_18_3_1_0.unit
saving output/ghsom_demo1_a_19_3_0_1.mapdescr
saving output/ghsom_demo1_a_19_3_0_1.wgt
saving output/ghsom_demo1_a_19_3_0_1.unit
saving output/ghsom_demo1_a_20_3_2_1.mapdescr
saving output/ghsom_demo1_a_20_3_2_1.wgt
saving output/ghsom_demo1_a_20_3_2_1.unit

First, the input vectors are read. In the second step, the initial MQE of layer 0 is computed, which is subsequently used to guide the training process and to decide on the ultimate stopping criterion for the lowest-level granularity.
The results can be analyzed using any browser. For the GHSOM, the primary entry file is always called xxxxxx_1_1_0_0.html. To allow direct access to the documents we should also create a link from the output directory to the source files:
[andi@student experiments]$ cd ~/www/experiments/
[andi@student experiments]$ ln -s ../files output/files
[andi@student experiments]$ netscape output/ghsom_demo1_a_1_1_0_0.html &

The resulting HTML files should now be analyzed with respect to:
4.) libViewer Representation
The libViewer provides a graphical representation of your documents as a library of books on shelves. It reads a library description file, which specifies the graphical look and feel of the documents and their positions according to the trained maps.
Such a libViewer description file may now be created, providing a graphical user interface to the document collection.
For this, you need to have available some metadata on your documents that you want to depict graphically.
(A description of the libViewer file format etc. is still to be added. If you need the information right away, take a look at the demo files provided with the packages available for download on the web, and/or send me an e-mail if you have problems with those.)