1. General
The following sections provide a short overview of how to use our implementation of the GHSOM.
The source code, as well as a compiled version, can be obtained via the download-page of the SOMLib Digital Library project at http://www.ifs.tuwien.ac.at/~andi/somlib.
In order to compile the source code, unpack the gzipped tar archive, change into the respective directory, and simply run
./configure
make
This should leave you with an executable program called ghsom that you can either use from your current directory or put anywhere on your executable search path. (If it does not... well... then it's time for trouble-shooting :)
2. Input Files
The data to be clustered using the GHSOM is represented by two input files in plain-text ASCII format. The fields enclosed in angle brackets have to be substituted by the actual values. You can use any of the demo files provided on the web or build your own data files following this format:
Inputvector file
The number of features and the number of input vectors are integer values; the individual vector elements may be integer or real values >=0.
$TYPE inputvec
$XDIM <# of input vectors>
$YDIM 1
$VECDIM <# of features>
<feat 1> <feat 2> ... <name of input vector 1>
<feat 1> <feat 2> ... <name of input vector 2>
.
.
.
The first line, $TYPE, contains just a free-form text tag (a single word) allowing you to label the vector file.
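If you build your own data files, the layout above can be generated with a few lines of Python. This is only a minimal sketch and not part of the GHSOM distribution: the function name format_input_vectors and the sample data are hypothetical, and it assumes that fields are separated by single spaces, which the format description suggests but does not state explicitly.

```python
def format_input_vectors(vectors, tag="inputvec"):
    """Return the text of an input vector file.

    `vectors` is a list of (name, features) pairs; every feature list
    must have the same length and contain only values >= 0.
    """
    if len(tag.split()) != 1:
        raise ValueError("the $TYPE tag must be a single word")
    if not vectors:
        raise ValueError("at least one input vector is required")
    dim = len(vectors[0][1])
    lines = [
        f"$TYPE {tag}",
        f"$XDIM {len(vectors)}",   # number of input vectors
        "$YDIM 1",
        f"$VECDIM {dim}",          # number of features
    ]
    for name, features in vectors:
        if len(features) != dim:
            raise ValueError(f"{name}: expected {dim} features")
        if any(value < 0 for value in features):
            raise ValueError(f"{name}: feature values must be >= 0")
        # Features first, the vector's name last, as in the format above.
        lines.append(" ".join(str(v) for v in features) + f" {name}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data: three documents with four features each.
sample = [
    ("doc_a", [0.0, 1.5, 2.0, 0.0]),
    ("doc_b", [3.0, 0.0, 0.5, 1.0]),
    ("doc_c", [0.0, 0.0, 4.0, 2.5]),
]
text = format_input_vectors(sample)
print(text, end="")
```

Writing the returned string to a file gives you something you can reference as the input vector file in the property file.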
Template vector file
The template vector file basically describes the dimensions of your feature space, and may be used to assign labels to the clusters by selecting the dimension identifiers that are most characteristic for a given cluster.
The file format is as follows:
$TYPE template
$XDIM 7
$YDIM <# of input vectors>
$VECDIM <# of features>
0 <name of feature 1> <df> <tf> <min_tf> <max_tf> <mean_tf>
1 <name of feature 2> <df> <tf> <min_tf> <max_tf> <mean_tf>
2 <name of feature 3> <df> <tf> <min_tf> <max_tf> <mean_tf>
:
:
:
<# of features - 1> <name of feature n> <df> <tf> <min_tf> <max_tf> <mean_tf>
$TYPE again is a free-text identifier characterizing the file.
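A template vector file can be written in the same way. The following Python sketch is an illustration only: the function name format_template and the sample statistics are hypothetical, and since this page does not define how <df>, <tf>, <min_tf>, <max_tf> and <mean_tf> are computed, the values are simply passed through as given rather than derived from the data.

```python
def format_template(feature_stats, num_vectors, tag="template"):
    """Return the text of a template vector file.

    `feature_stats` is a list of (name, df, tf, min_tf, max_tf, mean_tf)
    tuples, one per feature, in the same order as the feature columns
    of the input vector file.
    """
    lines = [
        f"$TYPE {tag}",
        "$XDIM 7",                         # value given in the format description
        f"$YDIM {num_vectors}",            # number of input vectors
        f"$VECDIM {len(feature_stats)}",   # number of features
    ]
    # Feature lines are numbered from 0 to (number of features - 1).
    for index, (name, df, tf, min_tf, max_tf, mean_tf) in enumerate(feature_stats):
        lines.append(f"{index} {name} {df} {tf} {min_tf} {max_tf} {mean_tf}")
    return "\n".join(lines) + "\n"


# Hypothetical statistics for two features over three input vectors.
stats = [
    ("apple", 2, 5, 1, 4, 2.5),
    ("pear", 1, 3, 3, 3, 3.0),
]
template_text = format_template(stats, num_vectors=3)
print(template_text, end="")
```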
3. Usage
All parameters for the GHSOM training have to be specified in a property file, which is described in more detail below. In order to train a map, you simply call ghsom with that property file as input, i.e.
ghsom <property-file>
The property file is a simple plain-text file consisting of several property-value pairs like this:
property1=value1
property2=value2
property3=value3
...
ATTENTION: no white space is allowed between the property or value and the equal sign. Furthermore, no trailing white space should be present after the value.
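Because stray white space is easy to introduce by hand, it may help to generate or check the property file with a small script. The Python sketch below is not part of the GHSOM distribution; the function names format_properties and find_whitespace_problems are hypothetical. It checks only the two rules stated above (no white space next to the equal sign, none after the value) and assumes that empty lines can be skipped, which this page does not confirm.

```python
def format_properties(properties):
    """Return property-file text with one 'name=value' pair per line."""
    lines = []
    for name, value in properties.items():
        name, value = str(name), str(value)
        if name != name.rstrip() or value != value.strip():
            raise ValueError(f"{name!r}: white space around name or value")
        lines.append(f"{name}={value}")
    return "\n".join(lines) + "\n"


def find_whitespace_problems(text):
    """Return the 1-based numbers of lines that break the white-space rule."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue  # assumption: empty lines are ignored
        name, sep, value = line.partition("=")
        # Bad if '=' is missing, white space precedes or follows '=',
        # or white space trails the value.
        if not sep or name != name.rstrip() or value != value.strip():
            bad.append(number)
    return bad


good = format_properties({"EXPAND_CYCLES": 10, "TAU_1": 0.25})
bad_text = "TAU_1 = 0.25\nTAU_2=0.1 \nEXPAND_CYCLES=10\n"
problems = find_whitespace_problems(bad_text)
print(good, end="")
print(problems)
```

In the second text, lines 1 and 2 are reported: the first has spaces around the equal sign, the second a trailing space after the value.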
If you omit one or more of the following properties, a default value will be used for each of them.
Property: EXPAND_CYCLES
Type: int
Range: >=1
Description: Number of cycles after which the map is checked for possible expansion. One cycle corresponds to as many pattern presentations as there are input vectors. Example: with 100 input vectors and 10 cycles, a randomly chosen pattern is presented to the SOM 1000 times for learning.

Property: TAU_1
Type: real
Range: [0-1]
Description: Fraction of the remaining error that has to be explained by each map, i.e. a kind of stopping criterion for horizontal growth. The smaller this value, the larger each map will grow, and the flatter the hierarchy will be. A good starting point may be a value of about 0.25.

Property: TAU_2
Type: real
Range: [0-1]
Description: Final degree of granularity represented by the maps in the lowest layer. The smaller this value, the more detailed the data representation will be, and thus the bigger the overall GHSOM structure. An appropriate value for testing may be 0.1 or less; if you set this property to 1, only a single SOM in the first layer will be trained.

Property: INITIAL_LEARNRATE
Type: real
Range: [0-1]
Description: Determines how strongly the winner and its neighboring units are adapted initially; decreases over time. A good starting point is 0.8.

Property: INITIAL_NEIGHBOURHOOD
Type: int
Range: >=0
Description: Initial neighborhood range; decreases over time. If you are training a GHSOM starting with a 2x2 initial map, a value of 2 or 3 is sufficient. If you are using the GHSOM to train a conventional SOM of size XxY, you might want to set it to X or Y, whichever is higher.

Property: HTML_PREFIX
Type: string
Range: -
Description: Prefix for the output files. All files will be named with this prefix, followed by an underscore and a running number.

Property: DATAFILE_EXTENSION
Type: string
Range: may be empty
Description: Suffix for the references to the data files in the HTML tables. We usually name the vectors in the input vector file so that they link to the actual files, but omit the extension to get "better looking" maps. If you do so, you have to provide the extension here to get the correct links to the document files. For browsing, the document files are always expected in a subdirectory files of the directory where the HTML files are located.

Property: randomSeed
Type: int
Range: any
Description: Initial seed value for the random number generator, to enable repeatable training runs.

Property: inputFile
Type: string
Range: -
Description: Path (absolute, or relative to the current directory) and name of the input vector file (vectors/test.in).

Property: descriptionFile
Type: string
Range: -
Description: Path (absolute, or relative to the current directory) and name of the template vector file (vectors/test.tv).

Property: savePath
Type: string
Range: -
Description: Directory where the output files are written (without trailing slash). Note: make sure that this directory exists, and that you have write permission on it! :) (output)

Property: normInputVectors
Type: string
Range: NONE | LENGTH | INTERVAL
Description: Whether and how the input vectors are normalized. NONE = the raw input data will be used; LENGTH = vectors are normalized to length 1; INTERVAL = vector elements are transformed into the interval [0-1].

Property: INITIAL_X_SIZE
Type: int
Range: >=1
Description: Initial size of new maps in x-direction. For any growing map you will want to set this to 2, from which the map will start to grow. However, you can set it to any desired size right away.

Property: INITIAL_Y_SIZE
Type: int
Range: >=1
Description: Initial size of new maps in y-direction. For any growing map you will want to set this to 2, from which the map will start to grow. However, you can set it to any desired size right away. If you set this value to 1, you will create a 1-dimensional SOM that grows only linearly, resulting, if expanded hierarchically, in a tree-like representation of your data.

Property: LABELS_NUM
Type: int
Range: >=0
Description: Maximum number of labels per unit; 0 = no labels. The labelSOM method is used to select those features that are most characteristic of the respective unit to describe it.

Property: LABELS_ONLY
Type: bool
Range: true | false
Description: If 'true', only the labels will be shown on nodes which have been expanded into the next layer, along with a link labeled "down". Setting this property to 'false' is only useful for testing small data sets, to see which data is mapped onto the corresponding map in the next layer.

Property: LABELS_THRESHOLD
Type: real
Range: [0-1]
Description: The most important features are used as labels; a value of 0.8 means that only the features with values in the top 20% of all are printed as labels. The lower this value, the more labels will be shown (limited by LABELS_NUM).
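Before starting a long training run, you may want to check your settings against the types and ranges listed above. The following Python sketch does that for the numeric, boolean and enumerated properties. It is an illustration under stated assumptions, not a description of what ghsom itself checks: the names RULES and validate are hypothetical, the rules are transcribed from the property list above, and it assumes that 'true' and 'false' must be written in lower case.

```python
def _in_unit_interval(value):
    return 0.0 <= value <= 1.0


def _parse_bool(text):
    # Assumption: only the lower-case spellings from the list are accepted.
    if text not in ("true", "false"):
        raise ValueError(text)
    return text == "true"


# property name -> (parser, range check), transcribed from the property list
RULES = {
    "EXPAND_CYCLES": (int, lambda v: v >= 1),
    "TAU_1": (float, _in_unit_interval),
    "TAU_2": (float, _in_unit_interval),
    "INITIAL_LEARNRATE": (float, _in_unit_interval),
    "INITIAL_NEIGHBOURHOOD": (int, lambda v: v >= 0),
    "randomSeed": (int, lambda v: True),
    "normInputVectors": (str, lambda v: v in ("NONE", "LENGTH", "INTERVAL")),
    "INITIAL_X_SIZE": (int, lambda v: v >= 1),
    "INITIAL_Y_SIZE": (int, lambda v: v >= 1),
    "LABELS_NUM": (int, lambda v: v >= 0),
    "LABELS_ONLY": (_parse_bool, lambda v: True),
    "LABELS_THRESHOLD": (float, _in_unit_interval),
}


def validate(properties):
    """Return a list of problems found; the list is empty if none were found."""
    problems = []
    for name, text in properties.items():
        rule = RULES.get(name)
        if rule is None:
            continue  # free-form string or unknown property: not checked here
        parse, in_range = rule
        try:
            value = parse(text)
        except ValueError:
            problems.append(f"{name}: cannot parse {text!r}")
            continue
        if not in_range(value):
            problems.append(f"{name}: {text!r} is out of range")
    return problems


settings = {"EXPAND_CYCLES": "10", "TAU_1": "0.25", "TAU_2": "0.1",
            "INITIAL_LEARNRATE": "0.8", "normInputVectors": "LENGTH",
            "LABELS_ONLY": "true"}
errors_ok = validate(settings)
errors_bad = validate({"TAU_1": "1.5", "INITIAL_X_SIZE": "0", "LABELS_ONLY": "yes"})
print(errors_ok)
print(errors_bad)
```

The values are passed as strings because that is how they appear in the property file; file names, prefixes and paths are deliberately left unchecked.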
Comments: rauber@ifs.tuwien.ac.at