Step-by-step guide to train and view Maps
- Data Preprocessing
- Feature Extraction
- Feature Processing
- SOM Training
- SOM Viewing
Data preprocessing
Depending on your data, you might need to perform some preprocessing steps. Examples of preprocessing include:
- Text data: converting different formats to plain text, applying a stemming algorithm, ...
- Music: converting all the data to the same format, sample rate and stereo/mono setting
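As an illustration of the text case, the following is a minimal sketch of converting one format (HTML) to plain text using only the Python standard library. The function name html_to_plain_text is hypothetical and not part of the SOMToolbox; real collections may need a more robust converter.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and skips the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_plain_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    # Chunks are joined with a space, so a word that is split by inline
    # markup (e.g. "wor<b>ld</b>") would come out as two tokens.
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    print(html_to_plain_text("<p>Hello <b>world</b></p><script>x=1</script>"))
```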
Feature Extraction
The Self-Organising Map can handle only numerical representations of data. Thus, you might need to apply some feature extraction, which is the process of describing certain characteristics of the data with numeric attributes. Some data (such as sales data) might already be in a numeric form and thus might require no feature extraction (but maybe some processing, such as normalisation, see below).
Specifically, our implementation of the SOM requires the data to be in the SOMLib file format, a rather simple ASCII format describing the features and the numeric representation of the data instances.
Data that might need feature extraction includes, for example:
- Text documents. Text can be represented in many ways, with simple bag-of-words approaches, phrase detection, or Latent Semantic Indexing. Our group has developed a Java feature extractor which describes text documents with bag-of-words features and generates SOMLib files.
- Music (Speech). A series of feature extractors exist that try to capture rhythmic information, beat, instrumentation, etc. You may utilise our suite of Rhythm Patterns features (available in Matlab).
- Images. Images may be described by colour histograms, objects, ...
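To make the bag-of-words idea concrete, here is a minimal sketch in Python that turns a few documents into term-count vectors. It is not the Java feature extractor mentioned above and does not write SOMLib files; the tokenisation rule (lower-cased runs of letters and digits) and the function name bag_of_words are assumptions made for illustration.

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one term-count vector per document."""
    # Assumed tokenisation: lower-case, keep runs of ASCII letters and digits.
    tokenised = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in documents]

    # One vector dimension per distinct term, in alphabetical order.
    vocabulary = sorted({term for tokens in tokenised for term in tokens})
    index = {term: i for i, term in enumerate(vocabulary)}

    vectors = []
    for tokens in tokenised:
        vector = [0] * len(vocabulary)
        for term, count in Counter(tokens).items():
            vector[index[term]] = count
        vectors.append(vector)
    return vocabulary, vectors


if __name__ == "__main__":
    vocab, vecs = bag_of_words(["the cat sat", "the cat and the dog"])
    print(vocab)
    for vec in vecs:
        print(vec)
```

Each document becomes one row of numbers, which is the kind of numeric representation the SOM needs. Most entries are zero for real text collections, which is why such vectors are called sparse.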
Feature processing: normalisation
This is an optional step, and you should be aware of what kind of normalisation you want to apply to your data. The Java SOMToolbox provides the following normalisation methods:
- Unit length: all vectors will be scaled to the same length. This is useful e.g. when processing text documents, when you don't want differing document lengths to influence the values in the feature vector.
- Min-Max: each vector attribute will be normalised between 0 and 1.
- Standard score: each vector attribute will be scaled to have zero mean and unit standard deviation.
./somtoolbox.sh SOMLibVectorNormalization -m UNIT_LEN <inputfile> <outputfile>
(in Windows use somtoolbox.bat instead of ./somtoolbox.sh)
For a brief introduction on the SOMLib input vector format see the quick guide on input files, or take a look at the detailed specification.
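The three normalisation methods listed above can be sketched in plain Python as follows. This only illustrates the textbook definitions; it is not the SOMToolbox code, and the toolbox may differ in details (for example, the sketch uses the population standard deviation and maps constant attributes to 0.0, both of which are assumptions).

```python
import math


def unit_length(vectors):
    """Scale every vector (row) to Euclidean length 1."""
    result = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        # All-zero vectors are left unchanged to avoid dividing by zero.
        result.append([x / norm for x in v] if norm > 0 else list(v))
    return result


def min_max(vectors):
    """Scale every attribute (column) to the range [0, 1]."""
    columns = list(zip(*vectors))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(v, lows, highs)]
        for v in vectors
    ]


def standard_score(vectors):
    """Shift every attribute to zero mean and divide by its standard deviation."""
    n = len(vectors)
    columns = list(zip(*vectors))
    means = [sum(c) / n for c in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / n)
        for c, m in zip(columns, means)
    ]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(v, means, stds)]
        for v in vectors
    ]


if __name__ == "__main__":
    data = [[1.0, 10.0], [3.0, 20.0]]
    print(unit_length(data))
    print(min_max(data))
    print(standard_score(data))
```

Note that unit length works per vector (row), while Min-Max and standard score work per attribute (column).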
Self-Organizing Map training
Setup
Download the som.prop properties file and edit:
outputDirectory = <directory where files will be created; empty means use workingDirectory>
namePrefix = <any project name you like>
vectorFileName = <name of *normalized* vector file - see "Feature processing: normalisation" above>
sparseData = <yes|no> ... use yes if vectors are sparse (e.g. text data), no if vectors are not sparse (audio!)
isNormalized = <yes|no> ... set yes if vectorFile has been previously normalized
templateFileName=vector.tv (the template vector file - see below)
Note: Under Windows use double backslashes \\ as path separator.
The remaining parameters control the SOM algorithm and can be experimented with:
ySize=14 ... size of map in y direction
learnrate=0.75
#sigma=12
#tau=
#metricName=
numIterations=2000 ... should be larger than the # of vectors in vectorFile (recommended: 5*<#_of_vectors>)
You have to provide an appropriate template vector file:
- For text vectors you have to create an individual template vector file for each data set. If you use the TeSeTool, for example, it will be created for you automatically.
- For Rhythm Patterns vector files extracted from audio, use this rhythm_patterns.tv file.
Note: you can also take a look at the complete and documented properties file.
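If you prefer to generate the properties file from a script, the following minimal sketch writes key=value lines for the parameters named above. The helper names and all example values are hypothetical; only the property keys come from this guide. The sketch doubles backslashes, as required for Windows paths, and applies the recommended rule of thumb of five iterations per input vector.

```python
def recommended_iterations(num_vectors):
    """Rule of thumb from this guide: 5 * <number of vectors>."""
    return 5 * num_vectors


def render_properties(settings):
    """Render a dict as key=value lines, doubling backslashes in values."""
    lines = []
    for key, value in settings.items():
        # Backslashes must be written twice in the properties file.
        text = str(value).replace("\\", "\\\\")
        lines.append(f"{key}={text}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    settings = {
        "outputDirectory": "C:\\data\\som-output",
        "namePrefix": "demo",
        "vectorFileName": "demo-normalised.vec",
        "sparseData": "yes",
        "isNormalized": "yes",
        "templateFileName": "vector.tv",
        "numIterations": recommended_iterations(400),
    }
    print(render_properties(settings), end="")
```

Redirect the output to som.prop, or write the returned string to a file, and then add the remaining parameters (such as the map size) by hand.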
Training
Now you are ready to train the SOM:
./somtoolbox.sh GrowingSOM [path/to/]som.prop
If an error occurs, please check the parameters provided.
At this point, check that four files have been created in your outputDirectory, named with the namePrefix set in som.prop and the following extensions:
- .dwm.gz - Data winner mapping file.
- .map.gz - Map description file.
- .unit.gz - Unit description file.
- .wgt.gz - Weight vector file.
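The check for the four output files can also be scripted. The sketch below assumes that each file is named exactly namePrefix followed by the extension (for example demo.unit.gz); the function name missing_output_files is hypothetical.

```python
from pathlib import Path

# Extensions of the files the training step is expected to produce.
EXPECTED_EXTENSIONS = (".dwm.gz", ".map.gz", ".unit.gz", ".wgt.gz")


def missing_output_files(output_directory, name_prefix):
    """Return the names of expected output files that do not exist."""
    directory = Path(output_directory)
    return [
        name_prefix + ext
        for ext in EXPECTED_EXTENSIONS
        if not (directory / (name_prefix + ext)).is_file()
    ]


if __name__ == "__main__":
    # "output" and "demo" are placeholder values for illustration.
    missing = missing_output_files("output", "demo")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All four output files are present.")
```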
Analysing with the SOM Viewer
./somtoolbox.sh SOMViewer -u /path/to/file.unit.gz -w /path/to/file.wgt.gz --dw /path/to/file.dwm.gz
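When you train several maps, typing the three paths by hand becomes tedious. The following minimal sketch builds the viewer command shown above from an output directory and a name prefix. It assumes the files are named namePrefix plus extension and that somtoolbox.sh is in the current directory; the function name build_viewer_command is hypothetical.

```python
from pathlib import Path


def build_viewer_command(output_directory, name_prefix):
    """Return the SOMViewer command line as a list of arguments."""
    directory = Path(output_directory)
    return [
        "./somtoolbox.sh",
        "SOMViewer",
        "-u", str(directory / (name_prefix + ".unit.gz")),
        "-w", str(directory / (name_prefix + ".wgt.gz")),
        "--dw", str(directory / (name_prefix + ".dwm.gz")),
    ]


if __name__ == "__main__":
    # "output" and "demo" are placeholder values for illustration.
    command = build_viewer_command("output", "demo")
    print(" ".join(command))
    # To launch the viewer, pass the list to subprocess.run(command).
```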