Institut
für Softwaretechnik
Favoritenstr. 9 - 11 / 188
A - 1040 Wien
Tel.: (+43) 1 58801 18801, Fax.: (+43) 1 5040532
Working Group: Information & Software Engineering
SOMLib Data Files
Technical Report TR-IR98-1
Ver. 1.3.5 - 18. 07. 2000 (internal)
SOMLib Data Files - General Information
Basically, there are 5 different types of data files, which are used to create 6 different
files, namely:
- SOMLib Map Description File
- SOMLib Weight Vector File
- SOMLib Template Vector File
- SOMLib Unit Description File
- SOMLib Vector Description File
All of these files are built around the same basic structure, which is defined as follows:
- Entries can be comments or parameter values
- Comments are indicated by a # character at the beginning of a line
- Parameters are indicated by a $ character at the beginning of the line
- Comments are allowed
  - as a block of comment lines at the beginning of every file
  - after a parameter, introduced by ' #' and running until the end of the line
    (e.g. $TYPE vec # input vector file)
  Note: no comment lines are allowed after the initial block of comments.
- Parameters are identified by a certain KEYWORD, followed by a blank and the corresponding
value, which can be either a real, integer, or string value, or a
list of these values separated by blanks
- For real numbers the decimal separator is a dot.
- If a value is not available, a default NULL value is given, which is
  - for string: VOID
  - for real, integer: -1
- It is suggested to follow the order of the entries in the data files. If parameters are given in
a different order, a warning shall be printed to stdout/log when trying to read the file; however, this should not be relied upon.
- Some of the parameters to be read are mandatory. When mandatory parameters
are missing, reading fails with an error message.
- Some of the parameters to be read are optional. When optional parameters
are missing, a warning shall be printed to stdout/log when trying to read the file, and the reading process continues.
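The rules above can be sketched as a small reader. This is only a minimal illustration, not a reference implementation: the function names parse_entries and is_null are hypothetical, blank lines are assumed to be ignorable, and only single-line entries are handled (a multi-line $DESCRIPTION is not covered).

```python
import warnings

# NULL defaults from the rules above: VOID for strings, -1 for real/integer.
NULL_TOKENS = ("VOID", "-1")


def parse_entries(lines, mandatory=(), optional=()):
    """Return {KEYWORD: [value tokens]} for the single-line '$' entries in lines."""
    params = {}
    comment_block_over = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        if line.startswith("#"):
            if comment_block_over:
                raise ValueError("comment line after the initial comment block")
            continue
        comment_block_over = True
        if not line.startswith("$"):
            raise ValueError(f"not a parameter entry: {line!r}")
        # A trailing comment is introduced by ' #' and runs to the end of the line.
        content = line.split(" #", 1)[0]
        tokens = content[1:].split()
        if not tokens:
            raise ValueError("parameter entry without a keyword")
        params[tokens[0]] = tokens[1:]
    missing = [key for key in mandatory if key not in params]
    if missing:
        # Missing mandatory parameters make reading fail.
        raise ValueError("missing mandatory parameter(s): " + ", ".join(missing))
    for key in optional:
        if key not in params:
            # Missing optional parameters only produce a warning.
            warnings.warn(f"optional parameter ${key} is missing")
    return params


def is_null(token):
    """True if the token is one of the NULL defaults (VOID or -1)."""
    return token in NULL_TOKENS
```

Values are kept as string tokens here; converting them to integer or real is left to the caller, since the expected type depends on the keyword.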
In the following sections the 6 files are described in more detail, covering the contents and the purpose of each file as well as its structure in terms of the order of parameters and the distinction between mandatory (M) and optional (O) parameters. Furthermore, the relationships between the parameters are listed.
SOMLib Map Description File
Standard filename: XXX.map
Produced by: SOM training program
Modified by: SOM mapping program, SOM quant-error program
Demo-File: demo.mapdescr
This file describes the basic structure of the Self-Organizing Map, giving all the parameters
used in the training process. It is initially written as the result of the training process of the SOM. Additional information attributes may be added as required by various programs.
Parameter Entries:
- # Block of Comments: (optional) several lines of comments each starting with #
- $TYPE : string, mandatory
describes the type of the file, currently used value: descr
- $TOPOLOGY : describes the topology of the map, currently used values: rect, hex, hfm, gcs, gg, ghsom
- $XDIM : integer, mandatory
number of units in x-direction
- $YDIM : integer, mandatory
number of units in y-direction
- $VEC_DIM : integer, mandatory
dimensionality of weight vectors of map
- $STORAGE_DATE : string (or date, format tbd), optional
date on which the trained map was stored
- $STORAGE_TIME : string (or time, format tbd), optional
time at which the trained map was stored, possibly to be combined
with $STORAGE_DATE in one string (tbd)
- $TRAINING_TIME : integer, optional
training time for map in seconds
- $LEARNRATE_TYPE : string, optional
type of learn rate given as free text string
- $LEARNRATE_INIT : real, optional
initial learn rate a0
- $NEIGHBORHOOD_TYPE : string, optional
type of neighborhood region as free text string
- $NEIGHBORHOOD_INIT : real, optional
initial neighborhood range e0
- $RAND_INIT : integer, optional
init value for random number generator
- $ITERATIONS_TOTAL : integer, optional
number of iterations of training process
- $ITERATIONS_BUFFERED : integer, optional
number of iterations of one training process cycle
when using buffered reading
- $NR_TRAINVEC_TOTAL : integer, optional
number of input vectors used for training in total
- $NR_TRAINVEC_BUFFERED : integer, optional
number of input vectors used for training in one cycle
when using buffered reading of input vectors
- $VEC_NORMALIZED : integer, optional
indicator whether input vectors were normalized prior
to the training process. permitted values 0, 1
- $QUANTERROR_MAP : real, optional
quantization error of map
- $QUANTERROR_VEC : real, optional
average input vector quantization error of map, i.e. the quantization error of the map divided by the number of vectors mapped onto the SOM ($QUANTERROR_MAP / $NR_TRAINVEC_TOTAL)
- $URL_TRAINING_VEC : string, optional
URL of file containing input vectors used for training
(Input Vector File, XXX.in)
- $URL_TRAINING_VEC_DESCR : string, optional
URL of file containing description of input vectors used
for training (Input Vector Description File, XXX.vec)
- $URL_WEIGHT_VEC : string, optional
URL of file containing weight vectors of trained map
(Weight Vector File, XXX.wgt)
- $URL_QUANTERR_MAP : string, optional
URL of file containing quantization error vectors of trained map
(Quantization Error File, XXX.err)
written by SOM quant-error program
- $URL_MAPPED_INPUT_VEC : list of strings, optional
URLs of files containing input vectors mapped onto trained map
(Input Vector File, XXX.in)
written by SOM mapping program
- $URL_MAPPED_INPUT_VEC_DESCR : list of strings, optional
URLs of files containing descriptions of input vectors
mapped onto trained map
(Input Vector Description File, XXX.vec) written by SOM mapping program
- $URL_UNIT_DESCR : string, optional
URL of file containing description of units of trained map
(Unit Description File, XXX.unit)
- $DESCRIPTION: string or memo, optional
free form text description of map to be used for display.
Read to TO_EOF, i.e. description may span multiple lines.
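As an illustration of how a reading program might check these entries, the following sketch tests the mandatory parameters and the stated relationship $QUANTERROR_VEC = $QUANTERROR_MAP / $NR_TRAINVEC_TOTAL. The function name check_map_description, the tolerance, and the representation of the parameters as a dictionary of strings are assumptions made for this example.

```python
import math

MANDATORY = ("TYPE", "XDIM", "YDIM", "VEC_DIM")


def check_map_description(params):
    """params maps each keyword (without '$') to its raw string value.

    Raises ValueError if a mandatory parameter is missing and returns a
    list of warning texts for inconsistent optional parameters.
    """
    missing = [key for key in MANDATORY if key not in params]
    if missing:
        raise ValueError("missing mandatory parameter(s): " + ", ".join(missing))
    notes = []
    # Permitted values are 0 and 1; -1 is accepted as the numeric NULL value.
    if params.get("VEC_NORMALIZED", "0") not in ("0", "1", "-1"):
        notes.append("$VEC_NORMALIZED should be 0 or 1")
    needed = ("QUANTERROR_MAP", "QUANTERROR_VEC", "NR_TRAINVEC_TOTAL")
    if all(key in params for key in needed):
        qe_map = float(params["QUANTERROR_MAP"])
        qe_vec = float(params["QUANTERROR_VEC"])
        n_vec = int(params["NR_TRAINVEC_TOTAL"])
        # Negative entries are treated as NULL (-1) and skipped.
        if qe_map >= 0 and qe_vec >= 0 and n_vec > 0:
            # Tolerance of 1e-6 is an arbitrary choice for this sketch.
            if not math.isclose(qe_vec, qe_map / n_vec, rel_tol=1e-6):
                notes.append("$QUANTERROR_VEC != $QUANTERROR_MAP / $NR_TRAINVEC_TOTAL")
    return notes
```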
SOMLib Weight Vector File
Standard filename: XXX.wgt
Produced by: SOM init program, SOM training program
Modified by: -
Demo-File: demo.wgt
This file describes the weight vectors of the trained Self-Organizing Map.
It is initially written as the result of the SOM init program, read by the SOM training program as the initialized map, and finally written by the SOM training program after training.
The file consists of two blocks, the first one describing the general SOM structure, the second giving the weight vectors of the SOM.
The first 4 parameter entries are given as a sanity check to find out whether the given
SOM map file and weight vector file match. If any of the first 4 parameters does not match,
the program should print a detailed error message and exit.
Parameter Entries:
- # Block of Comments: (optional) several lines of comments each starting with #
- $TYPE : string, mandatory
describes the filetype and/or topology of the map, currently used values: hex, rect, som, ghsom, rect_som, hex_som
- $XDIM : integer, mandatory
number of units in x-direction
- $YDIM : integer, mandatory
number of units in y-direction
- $VEC_DIM : integer, mandatory
dimensionality of weight vectors of map, = n
- <x_1_1> .......... <x_1_n> <label_1>
- :::::::::::::::::::::::::::::::: :::::::
- <x_m_1> .......... <x_m_n> <label_m>
These lines list the n vector elements (n dimensions, i.e. n entries per line) of m weight vectors,
where m = XDIM x YDIM, being real values, followed by the label of the weight vector,
being a string value like "SOM_MAP_Name_(X/Y)". All values are mandatory.
If the number of weight vectors listed is smaller than XDIM x YDIM, the program reading this
file should print a warning message.
The order of vectors should be line by line, i.e. (0/0), (1/0), (2/0), from left to right, starting with (0/0) in the upper left corner of the map.
If the number of vector elements does not match the given dimensionality VEC_DIM, the
program should print a detailed error message and exit.
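The reading rules for the vector block can be sketched as follows. This is a minimal illustration under several assumptions: the function name read_weight_vectors is hypothetical, labels are assumed to contain no blanks, the header is assumed to precede the vector lines, and the unit position is derived from the line-by-line order described above (x runs fastest).

```python
import warnings


def read_weight_vectors(lines):
    """Return (header, units); units is a list of (x, y, elements, label)."""
    header = {}
    units = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank and comment lines are skipped
        if line.startswith("$"):
            # Parameter entry; a trailing ' #' comment is cut off.
            tokens = line.split(" #", 1)[0][1:].split()
            header[tokens[0]] = tokens[1:]
            continue
        xdim = int(header["XDIM"][0])
        vec_dim = int(header["VEC_DIM"][0])
        *elements, label = line.split()
        if len(elements) != vec_dim:
            # A wrong number of elements is an error, not a warning.
            raise ValueError(
                f"vector {label!r} has {len(elements)} elements, expected {vec_dim}"
            )
        index = len(units)
        # Line-by-line order: (0/0), (1/0), (2/0), ... then the next row.
        units.append((index % xdim, index // xdim, [float(e) for e in elements], label))
    expected = int(header["XDIM"][0]) * int(header["YDIM"][0])
    if len(units) < expected:
        warnings.warn(f"only {len(units)} of {expected} weight vectors found")
    return header, units
```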
SOMLib Quantization Error Map File
Standard filename: XXX.err
Produced by: SOM quantization error program
Modified by: -
Demo-File: none
This file describes the quantization error vectors of the trained Self-Organizing Map.
It is written by the SOM quantization error program based on a trained map and given
input vectors.
The file consists of two blocks, the first one describing the general SOM structure, the second giving the quantization error vectors of the SOM.
The file structure is identical to the general weight vector description file.
The first 4 parameter entries are given as a sanity check to find out whether the given
SOM map file and quantization error file match. If any of the first 4 parameters does not match,
the program should print a detailed error message and exit.
Parameter Entries:
- The parameters and the file structure are identical to the
SOMLib Weight Vector File, with the $TYPE parameter being set to qerr, qerr_rect, qerr_hex, err etc.
SOMLib Input Vector File
Standard filename: XXX.in
Produced by: Parser, Vector Generator
Modified by: -
Demo-File: demo.tfxidf
This file describes the input vectors to be used for the training process of a Self-Organizing Map.
It is written by the parser or vector generator program creating the vector structure.
The file consists of two blocks, the first one describing the input vectors in order to follow the general file structure of weight vector files, the second giving the input vectors.
The file structure is identical to the SOMLib Weight Vector File.
However, the semantics of the first 4 parameter entries differ as follows:
Parameter Entries:
- # Block of Comments: (optional) several lines of comments each starting with #
- $TYPE : string, mandatory
vec, vec_tf, vec_tfxidf, vec_bin, vec_structure to indicate input vector file
further information about the type of quantization and encoding used
can be packed into this string
- $XDIM : integer, mandatory
number of input vectors in file
- $YDIM : integer, mandatory
1; this allows again for XDIM x YDIM to give the total number of vectors to be read from file. NOTE: for any program reading this file: the number of vectors listed in the file is given by XDIM * YDIM, and not by XDIM alone!
- $VEC_DIM : integer, mandatory
dimensionality of the input vectors, = n
The remainder of the file is identical to the
SOMLib Weight Vector File:
- <x_1_1> .......... <x_1_n> <VEC_ID_1>
- :::::::::::::::::::::::::::::::: :::::::
- <x_m_1> .......... <x_m_n> <VEC_ID_m>
These lines list the n vector elements (n dimensions, i.e. n entries per line) of m input vectors,
where m = XDIM (i.e. = XDIM x YDIM with YDIM being 1), being real values, followed by the
<VEC_ID>, i.e. the label of the input vector, being a string value. All values are mandatory.
If the number of input vectors listed is smaller than XDIM x YDIM, the program reading this
file should print a warning message.
If the number of vector elements does not match the given dimensionality VEC_DIM, the
program should print a detailed error message and exit.
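A writer for this file can be sketched as follows. It is an illustration only: the function name format_input_vector_file and the representation of vectors as (VEC_ID, values) pairs are assumptions, and the VEC_IDs are assumed to contain no blanks.

```python
def format_input_vector_file(vectors, vec_type="vec"):
    """Return the file contents for a list of (vec_id, values) pairs."""
    if not vectors:
        raise ValueError("at least one input vector is required")
    dim = len(vectors[0][1])
    lines = [
        f"$TYPE {vec_type}",
        f"$XDIM {len(vectors)}",  # number of input vectors in the file
        "$YDIM 1",                # so that XDIM x YDIM is the vector count
        f"$VEC_DIM {dim}",
    ]
    for vec_id, values in vectors:
        if len(values) != dim:
            raise ValueError(f"vector {vec_id!r} does not have {dim} elements")
        # repr() of a float always uses a dot as the decimal separator.
        elements = " ".join(repr(float(v)) for v in values)
        lines.append(f"{elements} {vec_id}")
    return "\n".join(lines) + "\n"
```

A reader of the resulting file should still take the number of vectors from XDIM x YDIM rather than from XDIM alone, as noted for $YDIM.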
SOMLib Template Vector File
Standard filename: XXX.tv
Produced by: Parser, Vector Generator
Modified by: -
Demo-File: demo.tv
This file describes the template vectors providing the attribute structure of the input vectors used for the training process of a Self-Organizing Map.
It is written by the parser or vector generator program creating the vector structure.
Parameter Entries:
- # Block of Comments: (optional) several lines of comments each starting with #
- $TYPE : string, mandatory
template to indicate template vector file
further information may be packed into this string
- $XDIM : integer, mandatory
nr. of columns used in layout, min.: 2 (Nr. and Attribute), max. currently 7
- $YDIM : integer, mandatory
number of feature vectors in corresponding SOMLib Input Vector File
- $VEC_DIM : integer, mandatory
dimensionality of the input vectors, i.e. the number of attributes listed, = n
The remainder of this file lists the attributes of the vectors in up to 7 columns of information as follows:
- <nr> <attr> [<df> <tf_coll> <max_tf> <min_tf> <mean_tf> # comment]
- :::::::::::::::::::::::::::::::: :::::::
- <nr> <attr> [<df> <tf_coll> <max_tf> <min_tf> <mean_tf> # comment]
- with
- <nr>: int, consecutive numbering of attributes, starting with 0, up to VEC_DIM-1
- <attr>: string, label or name of the attribute, i.e. keyword etc.
- <df>: int, document frequency - in how many documents or feature vectors this attribute is present, i.e. has a non-zero input vector value
- <tf_coll>: real, term frequency in the whole collection - how often this attribute shows up in the whole collection of feature vectors, a kind of counter for the attribute, i.e. the sum of all values of the attribute (sum across all feature vectors)
- <min_tf>: real, minimal value of this attribute in the collection of feature vectors
- <max_tf>: real, maximum value of this attribute in the collection of feature vectors
- <mean_tf>: real, mean value of this attribute in the collection of feature vectors
- # comment: optional comment for attributes till end of line
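Reading one attribute line can be sketched as below. The column order is taken from the layout line above (<max_tf> before <min_tf>), which is an assumption since the attribute list names <min_tf> first. The function name parse_template_row is hypothetical, and attribute names are assumed to contain no blanks.

```python
# Column order as in the layout line; XDIM states how many columns are used.
COLUMNS = ("nr", "attr", "df", "tf_coll", "max_tf", "min_tf", "mean_tf")
CASTS = (int, str, int, float, float, float, float)


def parse_template_row(line, xdim):
    """Parse one attribute line using the first xdim columns (2 to 7)."""
    if not 2 <= xdim <= len(COLUMNS):
        raise ValueError("XDIM must be between 2 and 7")
    # An optional comment is introduced by ' #' and runs to the end of the line.
    content = line.split(" #", 1)[0]
    tokens = content.split()
    if len(tokens) != xdim:
        raise ValueError(f"expected {xdim} columns, found {len(tokens)}")
    # zip stops after the xdim tokens, so unused columns are simply absent.
    return {name: cast(tok) for name, cast, tok in zip(COLUMNS, CASTS, tokens)}
```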
SOMLib Unit Description File
Standard filename: XXX.unit
Produced by: SOM training program
Modified by: SOM mapping program, LabelSOM program
Demo-File: demo.unit
This file describes the units of the trained Self-Organizing Map.
It is written by the SOM training program.
The file consists of two blocks, the first one describing the general SOM structure, the second giving a specific description of every unit.
The first 3 parameter entries are given as a sanity check to find out whether the given
SOM map file and unit description file match. If any of the first 3 parameters does not match,
the program should print a detailed error message and exit.
Parameter Entries:
- # Block of Comments: (optional) several lines of comments each starting with #
- $TYPE : string, mandatory
describes the topology of the map, currently used values: hex, rect
- $XDIM : integer, mandatory
number of units in x-direction
- $YDIM : integer, mandatory
number of units in y-direction
This header describes the general SOM structure.
Following this block, the second block contains the following set of attributes per unit:
- # Block of Comments: (optional) several lines of comments each starting with #
- $POS_X : integer, mandatory
x coordinate of unit in standard visualization of SOM (column)
- $POS_Y : integer, mandatory
y coordinate of unit in standard visualization of SOM (line)
- $UNIT_ID : string, optional
short label / id of unit as free text string, e.g. (0/0), (1/0), etc.
- $QUANTERROR_UNIT : real, optional
quantization error of unit
- $QUANTERROR_UNIT_AVG : real, optional
average input vector quantization error of unit, i.e. QUANTERROR_UNIT divided by the number of input vectors mapped onto this unit (NR_VEC_MAPPED)
- $AC_POS_X : real, optional
x coordinate of unit in AC visualization of SOM
- $AC_POS_Y : real, optional
y coordinate of unit in AC visualization of SOM
- $UMAT_UNIT : real, optional
averaged distance for U-Matrix representation for unit
- $UMAT_RIGHT : real, optional
distance to the right neighbor for U-Matrix representation
- $UMAT_DOWN : real, optional
hexagonal map arrangement: averaged distance to the lower neighbor
for U-Matrix representation;
rectangular map arrangement: distance to the lower neighbor
for U-Matrix representation
- $UMAT_DOWN_LEFT : real, optional
hexagonal map arrangement: averaged distance to the left lower neighbor
for U-Matrix representation;
rectangular map arrangement: averaged distance to the lower left neighbor
for U-Matrix representation
- $UMAT_DOWN_RIGHT : real, optional
hexagonal map arrangement: averaged distance to the right lower neighbor
for U-Matrix representation;
rectangular map arrangement: averaged distance to the lower right neighbor
for U-Matrix representation
- $NR_VEC_MAPPED : integer, optional
number of input vectors mapped onto this unit
written by SOM training program
- $MAPPED_VECS : list of string, optional
list of strings giving the VEC_IDs of input vectors
(labels) mapped onto this unit.
Used for static referencing.
The number should be identical to NR_VEC_MAPPED.
If not, a warning should be printed.
Written by SOM mapping program
- $MAPPED_VECS_DIST : list of real, optional
distances by which vectors are mapped onto the unit
- $NR_SOMS_MAPPED : integer, optional
number of other SOMs mapped onto this unit
written by hierarchical SOM training program or
by integrating SOM training program
- $URL_MAPPED_SOMS : list of strings, optional
list of strings giving the URLs of
SOM Map Description Files (filename XXX.map).
Used for dynamic referencing.
The number should be identical to NR_SOMS_MAPPED.
If not, a warning should be printed.
Written by SOM mapping program
- $MAPPED_SOM_DIST : list of real, optional
distances by which SOM vectors are mapped onto the unit.
for GHSOM: mqe of that unit
- $NR_UNIT_LABELS : int, optional
number of labels for this unit, written by LabelSOM program
- $UNIT_LABELS : list of strings, optional
list of labels for unit, written by LabelSOM program
- $UNIT_LABELS_QE : list of real, optional
quantization error of the labels
- $UNIT_LABELS_WGT : list of real, optional
weight of the labels
- $UNIT_LABELS_LEFT : list of strings, optional
list of labels for unit, written by LabelSOM program
- $UNIT_LABELS_LEFT_DIFF : list of real, optional
difference to left neighbor labels of the labels
- $UNIT_LABELS_RIGHT : list of strings, optional
list of labels for unit, written by LabelSOM program
- $UNIT_LABELS_RIGHT_DIFF : list of real, optional
difference to right neighbor labels of the labels
- $UNIT_LABELS_UP : list of strings, optional
list of labels for unit, written by LabelSOM program
- $UNIT_LABELS_UP_DIFF : list of real, optional
difference to upper neighbor labels of the labels
- $UNIT_LABELS_DOWN : list of strings, optional
list of labels for unit, written by LabelSOM program
- $UNIT_LABELS_DOWN_DIFF : list of real, optional
difference to down neighbor labels of the labels
- $URL_RELATED_UNITS : list of string, optional
list of strings giving URLs of related units.
These can be links to units within the same map
or to units on other SOM maps.
The URL will most probably consist of the URL of
the SOM map file (XXX.map) plus the unit location
within the map given as '#(x/y)', details tbd.
- $DESCRIPTION : string, optional
free form text description of unit, terminated by newline
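The two-block structure can be read as sketched below. This is a minimal illustration with several assumptions: the function name read_unit_blocks is hypothetical, every unit block is assumed to start with its $POS_X entry, comment lines are skipped wherever they occur, and all values are kept as lists of string tokens.

```python
import warnings


def read_unit_blocks(lines):
    """Return (header, units); each unit is a dict of keyword -> value tokens."""
    header, units, current = {}, [], None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank and comment lines are skipped
        if not line.startswith("$"):
            raise ValueError(f"not a parameter entry: {line!r}")
        tokens = line.split(" #", 1)[0][1:].split()
        key, values = tokens[0], tokens[1:]
        if key == "POS_X":
            # Assumption: $POS_X opens the description of the next unit.
            current = {}
            units.append(current)
        target = header if current is None else current
        target[key] = values
    for unit in units:
        if "NR_VEC_MAPPED" in unit and "MAPPED_VECS" in unit:
            stated = int(unit["NR_VEC_MAPPED"][0])
            listed = len(unit["MAPPED_VECS"])
            if stated != listed:
                # A mismatch leads to a warning only, reading continues.
                warnings.warn(
                    f"unit ({unit['POS_X'][0]}/{unit['POS_Y'][0]}): "
                    f"NR_VEC_MAPPED is {stated} but {listed} VEC_IDs are listed"
                )
    return header, units
```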
SOMLib Vector Description File
Standard filename: XXX.vec
Produced by: Parser or vector generator program
Modified by: SOM browsing software
This file describes the input vectors for a self-organizing map.
It is written by the parser or vector generator program and describes the properties of each vector.
The file consists of one set of attributes per vector, with the attributes themselves still being subject to modification, or rather, extension. The structure of the description of the vectors follows in general the structure of the unit description file. Further attributes will be added as the necessity arises, especially in the context of metaphor graphics.
Furthermore, the question whether each of the description files should be kept as an independent file or be part of one large file comprising the whole collection has not been fully decided upon.
The attributes considered so far are:
Parameter Entries:
- # Block of Comments: (optional) several lines of comments each starting with #
- $TYPE: string, mandatory
vecdescr to indicate input vector description file, further information may be packed into this string
- $NR_Files: int, mandatory
number of files / vectors described in this file
- $NR_METADATA_ATTR: int, mandatory
number of metadata attributes per entry
The header above describes the general file structure.
Following this block, the second block contains the following set of attributes per vector/file:
- # Block of Comments: (optional) several lines of comments each starting with #
- $VEC_ID : string, mandatory
ID of vector, a kind of short label or unique ID, especially for documents
split into several vectors
- $LABEL : string, optional
label of vector, full name, file name, possibly identical to $VEC_ID
- $URL_DOC : string, optional
URL of document being the basis for the vector
- $TYPE : string, mandatory
giving the type of the vector with currently supported types being DOC,
SOM and VEC, with DOC for vectors describing documents (that can be referenced),
SOM for other SOMs (that can be referenced) and VEC for general vectors (that cannot be referenced)
- DUBLIN-CORE Metadata Attribute Set: all attributes of the Dublin Core Metadata Set - recommended set of attributes for Input Vector Description, such as creator, subject, keywords etc.
Additional attributes only where necessary, e.g. price, domain, etc.
May also be used to directly accommodate libViewer attributes.
- $SIZE : integer, optional
length of document in bytes, size of SOM in terms of units or
documents mapped, details tbd.
- $PRICE: price for the document etc.
- $NR_TIMES_REFERENCED : integer, optional
number of times this vector was referenced.
initialized to 0 by parser program, modified by
SOM browsing software
- $LAST_REFERENCED : string / date, optional
date of last reference to vector
modified by SOM browsing software
- $DESCRIPTION : string, optional
free form text description of vector, terminated by newline
History
- Vers. 1.3.5: (18.7.2000):
* fixed formatting
* changed SOMLib Input Vector Description File structure and setup
* renamed file to SOMLib Vector Description File
- Vers. 1.3.4: (17.7.2000):
* fixed formatting (added $)
* changed structure and occurrence of comments
* added Dublin Core recommendation to Input Vector Description File
- Vers. 1.3.3: (11.7.2000):
* fixed formatting errors
* added Distances in Unit Description Files for mapped vecs, soms, and labels
* added $NR_UNIT_LABELS
- Vers. 1.3.2: (10.7.2000):
* fixed formatting errors
* added 7. attribute for template vector file: mean
- Vers. 1.3.1: (8.7.2000):
* fixed formatting errors
- Vers. 1.3: (6.7.2000):
* adapted SOM Input Vector File
* changed $TOPOLOGY into $TYPE
* changed NODE into UNIT
* removed $SIGNATURE
* added template vector file description
* added some demo-files (artificially created - real ones to be added)
* removed hex-SOM required condition for x/y-Pos values
- Vers. 1.2: (18.11.1998):
* SOM Unit Description File: X_POS, Y_POS mandatory instead of optional
* 0 <= POS_X < 2*XDIM and 0 <= POS_Y < 2*YDIM to accommodate hex-location
* 0 <= AC_POS_X < 2*XDIM and 0 <= AC_POS_Y < 2*YDIM to accommodate hex-location
- Vers. 1.1: (03.11.1998):
* added UMAT_RIGHT, UMAT_UNIT, UMAT_DOWNRIGHT, UMAT_DOWNLEFT to SOM Unit Description
* changed URL_VEC to URL_DOC in Input Vector Descriptions
* added (keyword) to SOM Map Description File to indicate whether a description follows
* spelling
- Vers. 1.0: (17.09.1998):
* basic Datafile Structure
Comments:
rauber@ifs.tuwien.ac.at