To split an existing mailbox into separate email-files, an EMail-Preprocess has been developed. Its first stage is
started with java somlib.ui.Mailfile.
'Mailfile to Emails' divides the mailfile into many EMail-files, keeping header and format unchanged. The
mailbox-file is given as input; the directory for the divided emails is declared in the output-section.
The emails are given the extension '.email'.
Another preprocess-module ('Bookmarks') is started with java somlib.ui.Bookmarks.
The first two parser modules ('Plaintext to Wordhistograms' and 'Plaintext to n-grams') have the same function: plain-text-files are translated to histogram-files.
'Plaintext to Wordhistograms' takes the files in the 'Plaintext Directory' and writes the results to the directory
'Word Histograms'. The new files are serialised Hashtables that contain the individual words and their number of
occurrences. A 'Minimum Word Length' can be defined.
'Histograms to Template Vector' combines the files that were created in the last step to a single file 'Template Vector'.
An already created vector can be extended.
The 'Template Vector' is reduced to the 'Reduced Vector' by setting the minimum and the maximum occurrence of a word
(or an n-gram respectively). If 'Save Removed' is set, the removed words are stored at the location defined.
The various histogram-files have to be adjusted to the 'Reduced Template Vector'. After the location of
the vector ('TV') and the directory containing the histogram-files created in one of the first two steps have been set, the vector-files are
written to the directory 'Individual Vector Files' in the defined format.
The bookmark preprocessing can also be started from the command line:
java somlib.textrepresentation.bkmkexc -f bookmark-file -d destination-dir -b buffer-dir -r vector-description-dir -m mode
After Step 1 (Preprocess), the command-line version of the parser continues with:
Step 2, Plaintext to Histograms
Step 3, Histograms to Template Vector
Step 4, Reduce Template Vector
Step 5, Extract Individual Vectors
After Step 1 (Preprocess), the command-line version of the SOM modules continues with:
Step 2, Training
Step 3, Mapping
Step 4, Labelling
Step 5, Postprocess
Comments: rauber@ifs.tuwien.ac.at
the graphical interface
To keep the process as flexible as possible, the parser is divided into several modules. All of them are
structured in a similar way:
Depending on the specific module, a file or a directory has to be given as input. The parameter Output
is handled in the same way. Options can be defined for the various modules. In addition, a logfile can be
created if required, and the verbosity of the system can be adjusted from 0 to 3, where 3 is the most verbose
level.
The control block should be self-explanatory.
Preprocess
The parser requires files in plain-text format. Since data is usually not available in this form, various
steps of preprocessing have been introduced.
'EMails to Plaintext' is started by java somlib.ui.EMail. This second part converts the formatted files
into plain-text. Input is the output-directory of the former step or any other directory containing separated mailfiles.
For the resulting plain-text-files, which receive the extension '.plain', a directory has to be given in the block 'Output'.
In 'Options' a directory can be defined where 'vector description files' will be stored.
Either a bookmark-file of a Netscape-browser or existing html-pages serve as input. If a bookmark-file is
given (and depending on the mode), all links are downloaded into a buffer-directory (in 'Intermediate')
and subsequently converted from their initial html-format to plain-text (in a directory that is set in the
output-section). Four different modes are available:
1, clear: the directories will be cleared of all files and the newly downloaded files will be stored there
2, overwrite: existing files with the same name as downloaded files are overwritten, the others are kept as they are
3, add new: no files are deleted, none are overwritten, only new files are added
4, use present: the bookmark-file isn't used (thus the parameter will not be set), only files already present in the buffer-directory are used
Additionally vector description files can be written. This is achieved by defining a directory at 'VecDes' in the
'Options'-block.
Parser
java somlib.ui.Parser starts the parser.
It consists of 5 modules. Each module can be activated with the checkbox to its left.
'Plaintext to n-grams' creates a Hashtable as well. It does not take the whole words of the plain-text-file but creates
character strings of a predefined length between 2 and 5 ('N [2-5]'). Using 4-grams, for example, the word 'Plaintext' is translated to:
plai, lain, aint, inte, ntex, text. It can also be defined whether or not numbers and spaces should
be included in the n-grams.
the minimalist version: command-line
In each step the Verbosity can be set between 0 and 3.
Step 1, Preprocess:
The parser requires files in plain-text format. Since data is usually not available in this form, various
steps of preprocessing have been introduced.
java somlib.textrepresentation.emailexc -i input_mailfile -o partitioned_emails -u plain_emails [-r vector-description-dir] [-v verbosity]
The name of the mailbox is set by input_mailfile.
partitioned_emails defines the buffer-directory to which the divided mails are copied. The formatting and the header
are left unchanged at this point. Subsequently each mail is converted to plain-text and written to the directory plain_emails
separately. In addition, vector-description-dir will hold descriptions of the emails. If this parameter is not set, no
vector-description-files are written.
Either a bookmark-file of a Netscape-browser (bookmark-file, -f) or existing html-pages serve as input. If a
bookmark-file is given, all links are downloaded into a buffer-directory (buffer-dir, -b) and subsequently
converted from their initial html-format to plain-text in the directory destination-dir (-d).
Four different modes (-m) are available:
1, clear: the directories will be cleared of all files and the newly downloaded files will be stored there
2, overwrite: existing files with the same name as downloaded files are overwritten, the others are kept as they are
3, add new: no files are deleted, none are overwritten, only new files are added
4, use present: the bookmark-file isn't used (thus the parameter will not be set), only files already present in the buffer-directory are used
Additionally vector-description-dir will hold descriptions of the processed documents. If this parameter is not set, no
vector-description-files are written.
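The four modes listed above can be read as a small decision table. The following sketch only illustrates that reading; the enum, its method names and the mapping of the numbers 1 to 4 onto the modes are assumptions and not taken from the SOMLib sources.

```java
class DownloadModeSketch {
    // Hypothetical representation of the four modes described in the text.
    enum Mode {
        CLEAR, OVERWRITE, ADD_NEW, USE_PRESENT;

        // Assumes that the modes are numbered 1 to 4 in the order of the list.
        static Mode fromNumber(int number) {
            return values()[number - 1];
        }

        // Only 'clear' empties the directories before anything is stored.
        boolean clearsDirectories() {
            return this == CLEAR;
        }

        // 'use present' ignores the bookmark-file, so nothing is downloaded.
        boolean downloads() {
            return this != USE_PRESENT;
        }

        // Decides whether a downloaded file may be written to the buffer-directory.
        boolean shouldWrite(boolean fileExists) {
            switch (this) {
                case CLEAR:
                case OVERWRITE:
                    return true;
                case ADD_NEW:
                    return !fileExists;
                default:
                    return false;
            }
        }
    }

    public static void main(String[] args) {
        Mode mode = Mode.fromNumber(2);
        System.out.println(mode + " overwrites an existing file: " + mode.shouldWrite(true));
    }
}
```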
The following step can be done by two different modules. Both translate plain-text-files to histogram-files.
java somlib.textrepresentation.wordsexc -i input_words -o output_words -m minLength [-v verbosity]
The files stored in the 'Plaintext Directory' (input_words) are processed and the results written to the directory
'Word Histograms' (output_words). The new files are serialised Hashtables that contain the individual
words and their number of occurrences. A 'Minimum Word Length' (minLength) can be defined.
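A word histogram of this kind can be sketched in a few lines. This is only an illustration and not the code of wordsexc: the class and method names are hypothetical, and the tokenisation (lower-casing, splitting on non-letter characters) is an assumption.

```java
import java.util.Hashtable;

class WordHistogramSketch {
    // Counts the occurrences of every word with at least minLength characters.
    // Lower-casing and splitting on non-letter characters are assumptions.
    static Hashtable<String, Integer> histogram(String text, int minLength) {
        Hashtable<String, Integer> counts = new Hashtable<>();
        for (String word : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!word.isEmpty() && word.length() >= minLength) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input text with a minimum word length of 3.
        System.out.println(histogram("A map, a Map and maps", 3));
    }
}
```

Since Hashtable implements Serializable, such a table could be written to a file with an ObjectOutputStream; whether SOMLib stores its histograms exactly this way is not stated here.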
java somlib.textrepresentation.ngramsexc -i input_ngrams -o output_ngrams -s size_ngrams -p spaces [t/f] -n numbers [t/f] [-v verbosity]
'Plaintext to n-grams' creates a Hashtable as well. It does not take the whole words of the plain-text-file but creates
character strings of a predefined length between 2 and 5 (size_ngrams). Using 4-grams, for example, the word 'Plaintext' is translated to:
plai, lain, aint, inte, ntex, text. It can also be defined whether or not numbers and spaces should
be included in the n-grams. All files stored in the directory input_ngrams are processed; the results are
written to the directory output_ngrams.
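The n-gram construction can be illustrated with a sliding window over the text. The sketch below reproduces the 'Plaintext' example; the names are hypothetical, and the way spaces and numbers are filtered is an assumption about what the -p and -n switches do.

```java
import java.util.ArrayList;
import java.util.List;

class NGramSketch {
    // Optionally drops whitespace and digits before the n-grams are built (assumed behaviour).
    static String prepare(String text, boolean keepSpaces, boolean keepNumbers) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toLowerCase().toCharArray()) {
            if (Character.isWhitespace(c) && !keepSpaces) continue;
            if (Character.isDigit(c) && !keepNumbers) continue;
            sb.append(c);
        }
        return sb.toString();
    }

    // Slides a window of n characters over the text, one character at a time.
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [plai, lain, aint, inte, ntex, text]
        System.out.println(ngrams(prepare("Plaintext", false, true), 4));
    }
}
```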
java somlib.textrepresentation.templatevectorexc -i input_dir -o output_template [-a old_template] [-v verbosity]
This module combines the files that were created in the last step (and written to input_dir) to a single file
output_template. An already created vector old_template can be extended.
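How several histograms are combined into one template is not specified in detail. A minimal sketch, assuming that the template records for every term the number of histograms (documents) it occurs in, and that extending an old template simply continues these counts, could look like this; all names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TemplateVectorSketch {
    // Adds the terms of all histograms to a copy of the existing template.
    // Assumption: the template stores the document frequency of each term.
    static Map<String, Integer> extend(Map<String, Integer> template,
                                       List<Map<String, Integer>> histograms) {
        Map<String, Integer> result = new TreeMap<>(template);
        for (Map<String, Integer> histogram : histograms) {
            for (String term : histogram.keySet()) {
                result.merge(term, 1, Integer::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two hypothetical histograms; 'map' occurs in both of them.
        Map<String, Integer> first = Map.of("som", 2, "map", 1);
        Map<String, Integer> second = Map.of("map", 5);
        System.out.println(extend(new TreeMap<>(), List.of(first, second)));
    }
}
```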
java somlib.textrepresentation.reducerexc -i input_template -o output_reduced -n min_reducer -x max_reducer [-r removed_template] [-v verbosity]
The 'Template Vector' input_template is reduced to the 'Reduced Vector' output_reduced by setting
the minimum (min_reducer) and the maximum (max_reducer) occurrence of a word
(or an n-gram respectively). Both limits are given as fractions between 0 and 1,
with a point as the decimal separator (e.g. 0.02 and 0.8).
If removed_template is set, the removed words are stored at the location defined.
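The reduction can be pictured as a filter over the template. The sketch below assumes that the two limits refer to the fraction of documents a term occurs in and that both limits are inclusive; neither is confirmed here, and all names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class ReducerSketch {
    // Keeps the terms whose document fraction lies within [min, max];
    // all other terms are collected in 'removed'.
    static Map<String, Integer> reduce(Map<String, Integer> template, int numDocs,
                                       double min, double max,
                                       Map<String, Integer> removed) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : template.entrySet()) {
            double fraction = entry.getValue() / (double) numDocs;
            if (fraction >= min && fraction <= max) {
                kept.put(entry.getKey(), entry.getValue());
            } else {
                removed.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical template over 10 documents.
        Map<String, Integer> removed = new TreeMap<>();
        Map<String, Integer> kept =
                reduce(Map.of("the", 10, "som", 3, "rare", 1), 10, 0.2, 0.8, removed);
        System.out.println("kept: " + kept + ", removed: " + removed);
    }
}
```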
java somlib.textrepresentation.extractorexc -i input_hgs_dir -j input_tv -o output_extractor -t chkTF [t/f] -b chkBin [t/f] -f chkTFxIDF [t/f] [-v verbosity]
The histogram-files stored in the directory input_hgs_dir are adjusted to the 'Reduced Template Vector'
input_tv. The vector-files are written to the directory output_extractor;
the format is chosen with the switches chkTF (*.tf), chkBin (*.bin) and chkTFxIDF (*.tfxidf).
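The three formats differ only in how a term is weighted. The following sketch shows one common reading: tf is the raw term count, bin marks presence with 1.0, and tfxidf multiplies the count by ln(N/df). The exact weighting used by extractorexc is not given here, so the formula and all names are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class ExtractorSketch {
    // Term frequency: the raw count of each template term in the histogram.
    static double[] tf(List<String> terms, Map<String, Integer> histogram) {
        double[] vector = new double[terms.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = histogram.getOrDefault(terms.get(i), 0);
        }
        return vector;
    }

    // Binary: 1.0 if the term occurs in the document, 0.0 otherwise.
    static double[] bin(List<String> terms, Map<String, Integer> histogram) {
        double[] vector = new double[terms.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = histogram.containsKey(terms.get(i)) ? 1.0 : 0.0;
        }
        return vector;
    }

    // tf x idf with idf = ln(numDocs / df); assumes every term has a df of at least 1.
    static double[] tfxidf(List<String> terms, Map<String, Integer> histogram,
                           Map<String, Integer> docFreq, int numDocs) {
        double[] vector = tf(terms, histogram);
        for (int i = 0; i < vector.length; i++) {
            vector[i] *= Math.log((double) numDocs / docFreq.get(terms.get(i)));
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("map", "som");
        Map<String, Integer> histogram = Map.of("som", 2);
        System.out.println(Arrays.toString(tf(terms, histogram)));
        System.out.println(Arrays.toString(bin(terms, histogram)));
        System.out.println(Arrays.toString(
                tfxidf(terms, histogram, Map.of("map", 4, "som", 1), 4)));
    }
}
```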
an example-log:
java somlib.textrepresentation.emailexc -i /test/Dbworld.mail -o /test/mails -u /test/plain -r /test/descr -v 1
java somlib.textrepresentation.bkmkexc -f /test/bookmark.html -d /test/bkmk -b /test/html -r /test/descr -m 2
java somlib.textrepresentation.wordsexc -i /test/mails -o /test/words -m 4
java somlib.textrepresentation.ngramsexc -i /test/mails -o /test/ngrams -s 3 -p f -n t
java somlib.textrepresentation.templatevectorexc -i /test/words -o /test/tv -v 2
java somlib.textrepresentation.reducerexc -i /test/tv -o /test/rtv -n 0.02 -x 0.8
java somlib.textrepresentation.extractorexc -i /test/words -j /test/rtv -o /test/extractor -t f -b f -f t
SOM Modules
The SOM can be executed in command-line-mode.
command-line
In each step the Verbosity can be set between 0 and 3.
Step 1, Preprocess:
java somlib.som.preprocess.Vec2Vec -i SOM-Input-Vector_in [-o SOM-Input-Vector_out] [-n normalize [t/f]] [-v verbosity]
A SOM Input Vector File written by a parser (the SOMLib-Parser, for example, produces *.tf, *.tfxidf and *.bin files) is converted and the result written to a new SOM Input Vector File. The vectors can be normalised to a length of 1.0 using the parameter normalize. If normalize isn't activated, the file is only copied.
The data is read from the file SOM-Input-Vector_in and written to the Vector File SOM-Input-Vector_out.
parameter-name | [optional] | type [value] | description | example
-i SOM-Input-Vector_in | - | filename | vector-file generated by the Parser | tiere/tiere.tfxidf
-o SOM-Input-Vector_out | optional | filename | generated converted vector-file | tiere/tiere.in
-n normalize | optional | Boolean (true, false) [t/f] | normalise vectors to a length of 1.0 | f
-v verbosity | optional | integer [0-3] | verbosity | 1
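Normalising a vector to a length of 1.0 means dividing each component by the Euclidean length of the vector. A minimal sketch follows; the class name is hypothetical, and leaving an all-zero vector unchanged is an assumption.

```java
import java.util.Arrays;

class NormalizeSketch {
    // Returns a copy of v scaled to Euclidean length 1.0.
    // A vector of length 0 cannot be scaled and is returned unchanged.
    static double[] normalize(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x * x;
        }
        double length = Math.sqrt(sum);
        double[] out = v.clone();
        if (length == 0.0) {
            return out;
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= length;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(normalize(new double[] {3.0, 4.0})));
    }
}
```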
java somlib.som.Training -t SOM-Input-Vector_in [-d SOM-Input-Vector-Description_in] -i iterations -n init_neighborhood [-g neighborhood_name] -l init_learnrate [-h learnrate_name] -x XDim -y YDim -f SOM-Map-Description_out -c SOM-Weight-Vector_out -r SOM-Node-Description_out [-s seed for map-init] [-u SOM-Training-File_out] [-v verbosity]
This is the most important part of the SOM-package, and by far the most time-consuming one. With this application a SOM can be trained with the training vectors stored in the Input Vector File SOM-Input-Vector_in, together with the Input Vector Descriptions SOM-Input-Vector-Description_in.
The neighborhood function used for the training is defined by the class somlib.som.functions.'neighborhood_name'. By default the function somlib.som.functions.ENeighbor is used. It is initialized with the value init_neighborhood.
The learnrate is given by the function somlib.som.functions.'learnrate_name', by default somlib.som.functions.ELearn, and is initialized with init_learnrate.
The dimensions of the map are set to XDim x YDim, and the nodes are initialized using the random seed given for map-init.
The information produced in the training is stored in the SOM Map Description File SOM-Map-Description_out, the SOM Weight Vectors SOM-Weight-Vector_out and the SOM Node Descriptions SOM-Node-Description_out, the latter with the x/y-position on the map appended. A log of the training process can be written to the SOM-Training-File.
parameter-name | [optional] | type [value] | description | example
-t SOM-Input-Vector_in | - | filename | vector-file generated by the Parser | tiere/tiere.in
-d SOM-Input-Vector-Description_in | optional | filename | Vector-description-Files generated by the Parser | tiere/descriptions
-i iterations | - | integer [1- ] | iterations of Training | 5000
-n init_neighborhood | - | real number [0.0- ] | initial value for Neighborhood-function | 3.5
-g neighborhood_name | optional | string | name of Neighborhood-function | ENeighbor
-l init_learnrate | - | real number [0.0- ] | initial value for Learnrate | 0.8
-h learnrate_name | optional | string | name of Learnrate-function | ELearn
-x XDim | - | integer [1- ] | size of the map along the x-axis | 8
-y YDim | - | integer [1- ] | size of the map along the y-axis | 3
-f SOM-Map-Description_out | - | filename | generated SOM-Map-Description | tiere/training/mapdescr.map
-c SOM-Weight-Vector_out | - | filename | generated SOM-Weight-Vectors | tiere/training/weightvec.wgt
-r SOM-Node-Description_out | - | filename | generated SOM-Node-Descriptions | tiere/training/nodedescr.node
-u SOM-Training-File_out | optional | filename | generated SOM-Training-File | tiere/training/tiere.tng
-s seed | optional | integer | initial value for random number generator | 20
-v verbosity | optional | integer [0-3] | verbosity | 1
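To show how these parameters interact, the following sketch implements a generic SOM training loop. It is not the code of somlib.som.Training: the exponential decay of learnrate and neighborhood, the Gaussian neighborhood kernel, the random choice of the input vector and all names are assumptions; the behaviour of ENeighbor and ELearn is not documented here.

```java
import java.util.Random;

class SomTrainingSketch {
    // Returns {x, y} of the unit whose weight vector is closest to the input.
    static int[] bestMatchingUnit(double[][][] weights, double[] input) {
        int[] best = {0, 0};
        double bestDist = Double.MAX_VALUE;
        for (int y = 0; y < weights.length; y++) {
            for (int x = 0; x < weights[y].length; x++) {
                double d = 0.0;
                for (int k = 0; k < input.length; k++) {
                    double diff = input[k] - weights[y][x][k];
                    d += diff * diff;
                }
                if (d < bestDist) {
                    bestDist = d;
                    best = new int[] {x, y};
                }
            }
        }
        return best;
    }

    // Assumed exponential decay from the initial value over the iterations.
    static double decay(double initial, int t, int iterations) {
        return initial * Math.exp(-(double) t / iterations);
    }

    // Moves every unit towards the input, weighted by its map distance to the winner.
    static void trainStep(double[][][] weights, double[] input,
                          double learnRate, double radius) {
        int[] bmu = bestMatchingUnit(weights, input);
        for (int y = 0; y < weights.length; y++) {
            for (int x = 0; x < weights[y].length; x++) {
                double dx = x - bmu[0];
                double dy = y - bmu[1];
                double h = Math.exp(-(dx * dx + dy * dy) / (2 * radius * radius));
                for (int k = 0; k < input.length; k++) {
                    weights[y][x][k] += learnRate * h * (input[k] - weights[y][x][k]);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Values taken from the example column: 8 x 3 map, 5000 iterations, seed 20.
        int xDim = 8, yDim = 3, iterations = 5000;
        double[][] inputs = {{1.0, 0.0}, {0.0, 1.0}}; // hypothetical training vectors
        Random random = new Random(20);
        double[][][] weights = new double[yDim][xDim][2];
        for (double[][] row : weights) {
            for (double[] unit : row) {
                for (int k = 0; k < unit.length; k++) unit[k] = random.nextDouble();
            }
        }
        for (int t = 0; t < iterations; t++) {
            double[] input = inputs[random.nextInt(inputs.length)];
            trainStep(weights, input, decay(0.8, t, iterations), decay(3.5, t, iterations));
        }
        int[] bmu = bestMatchingUnit(weights, inputs[0]);
        System.out.println("unit for first input: x=" + bmu[0] + ", y=" + bmu[1]);
    }
}
```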
java somlib.som.Mapping -m SOM-Map-Description_in -i SOM-Input-Vector_in [-d SOM-Input-Vector-Description_in] [-f SOM-Map-Description_out] [-c SOM-Weight-Vector_out] [-r SOM-Node-Description_out] [-p SOM-Quantisation-Error-Map_out] [-n SOM-Mapping-File_out] [-v verbosity]
In a mapping, a previously trained map described in SOM-Map-Description_in is read. The input vectors found in SOM-Input-Vector_in, and described in SOM-Input-Vector-Description_in, are mapped onto this map. The new SOM Map Description File is written to SOM-Map-Description_out, the SOM Weight Vectors to SOM-Weight-Vector_out, the SOM Node Descriptions to SOM-Node-Description_out and a SOM Mapping File to SOM-Mapping-File_out. If one of these filenames is not given, the respective file overwrites its old version; the SOM Weight Vector File, however, is not written in this case, as it remains unchanged in this step.
parameter-name | [optional] | type [value] | description | example
-m SOM-Map-Description_in | - | filename | SOM-Map-Description generated in training | tiere/training/mapdescr.map
-i SOM-Input-Vector_in | - | filename | vector-file generated by the Parser | tiere/tiere.in
-d SOM-Input-Vector-Description_in | optional | filename | Vector-description-Files generated by the Parser | tiere/descriptions
-f SOM-Map-Description_out | optional | filename | generated SOM-Map-Description | tiere/mapping/mapdescr.map
-c SOM-Weight-Vector_out | optional | filename | generated SOM-Weight-Vectors | tiere/mapping/weightvec.wgt
-r SOM-Node-Description_out | optional | filename | generated SOM-Node-Descriptions | tiere/mapping/nodedescr.node
-p SOM-Quantisation-Error-Map_out | optional | filename | generated SOM-Quantisation-Error-Map | tiere/mapping/quanterrmap.err
-n SOM-Mapping-File_out | optional | filename | generated SOM-Mapping-File | tiere/mapping/tiere.mpn
-v verbosity | optional | integer [0-3] | verbosity | 1
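Mapping a vector onto a trained map means finding the node with the closest weight vector. The sketch below also accumulates a quantisation error per node, here taken to be the sum of the Euclidean distances of all vectors mapped onto that node. This definition and all names are assumptions; the file formats used by somlib.som.Mapping are not modelled.

```java
import java.util.Arrays;

class MappingSketch {
    // Euclidean distance between two vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Index of the node whose weight vector is closest to the input vector.
    static int bestUnit(double[][] units, double[] input) {
        int best = 0;
        for (int u = 1; u < units.length; u++) {
            if (distance(units[u], input) < distance(units[best], input)) {
                best = u;
            }
        }
        return best;
    }

    // Sums, for every node, the distances of the inputs mapped onto it.
    static double[] quantisationErrors(double[][] units, double[][] inputs) {
        double[] errors = new double[units.length];
        for (double[] input : inputs) {
            int u = bestUnit(units, input);
            errors[u] += distance(units[u], input);
        }
        return errors;
    }

    public static void main(String[] args) {
        double[][] units = {{0.0, 0.0}, {10.0, 10.0}};  // hypothetical weight vectors
        double[][] inputs = {{3.0, 4.0}, {10.0, 10.0}}; // hypothetical input vectors
        System.out.println(Arrays.toString(quantisationErrors(units, inputs)));
    }
}
```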
Labelling a map means picking the most important dimensions of a node, as the objects mapped onto this node are described by these characteristics. This can be achieved in different ways:
java somlib.som.labelling.NWords -i SOM-Map-Description_in -w SOM-Template-Vector_in [-o SOM-Node-Description_out] [-l SOM-Label-File_out] -n number_of_labels [-v verbosity]
NWords is a very simple labelling. It just picks the n (number_of_labels) dimensions with the largest values. The map to be labelled is read from SOM-Map-Description_in and the labels are taken from SOM-Template-Vector_in.
Optionally a new location for the altered Node Description Files can be defined in SOM-Node-Description_out, and a SOM Label File can be written to SOM-Label-File_out.
parameter-name | [optional] | type [value] | description | example
-i SOM-Map-Description_in | - | filename | SOM-Map-Description generated in mapping | tiere/mapping/mapdescr.map
-w SOM-Template-Vector-Description_in | - | filename | input SOM-Template-Vector-Description | tiere/mapping/tiere.desc
-o SOM-Node-Description_out | optional | filename | generated SOM-Node-Descriptions | tiere/mapping/nodedescr.node
-l SOM-Label-File_out | optional | filename | generated SOM-Label-File | tiere/mapping/tiere.lbl
-n number_of_labels | - | integer [1- ] | number of labels | 4
-v verbosity | optional | integer [0-3] | verbosity | 1
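Picking the n dimensions with the largest values, as NWords is described to do, amounts to sorting the dimensions of a node's weight vector in descending order. A minimal sketch with hypothetical names and terms:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NWordsSketch {
    // Returns the terms of the n dimensions with the largest weights.
    static List<String> labels(double[] weights, String[] terms, int n) {
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort the dimension indices by weight, largest first.
        Arrays.sort(order, (a, b) -> Double.compare(weights[b], weights[a]));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, order.length); i++) {
            result.add(terms[order[i]]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] weights = {0.1, 0.9, 0.5};        // hypothetical weight vector of a node
        String[] terms = {"cat", "dog", "horse"};  // hypothetical template terms
        System.out.println(labels(weights, terms, 2));
    }
}
```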
java somlib.som.labelling.LabelSOM -i SOM-Map-Description_in -w SOM-Template-Vector_in [-o SOM-Node-Description_out] [-l SOM-Label-File_out] -m minimum weight of label -n maximum number of labels [-v verbosity]
This module labels the nodes according to the LabelSOM algorithm. For each node, up to maximum number of labels labels are chosen, each with a weight not smaller than minimum weight of label and with a quantisation error as small as possible.
The map to be labelled is read from SOM-Map-Description_in and the labels are taken from SOM-Template-Vector_in.
Optionally a new location for the altered Node Description Files can be defined in SOM-Node-Description_out, and a SOM Label File can be written to SOM-Label-File_out.
parameter-name | [optional] | type [value] | description | example
-i SOM-Map-Description_in | - | filename | SOM-Map-Description generated in mapping | tiere/mapping/mapdescr.map
-w SOM-Template-Vector-Description_in | - | filename | input SOM-Template-Vector-Description | tiere/mapping/tiere.desc
-o SOM-Node-Description_out | optional | filename | generated SOM-Node-Descriptions | tiere/mapping/nodedescr.node
-l SOM-Label-File_out | optional | filename | generated SOM-Label-File | tiere/mapping/tiere.lbl
-m minimum weight | - | real number [0.0- ] | minimal weight | 0.85
-n maximum number of labels | - | integer [1- ] | maximal number of labels | 5
-v verbosity | optional | integer [0-3] | verbosity | 1
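The selection rule described for LabelSOM can be sketched for a single node as follows. The sketch assumes that the quantisation error of a dimension is the sum of the absolute differences between the node's weight and the mapped vectors in that dimension; the details of the actual implementation may differ, and all names and values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class LabelSomSketch {
    // Picks up to maxLabels dimensions whose weight is at least minWeight,
    // preferring the dimensions with the smallest quantisation error.
    static List<String> labels(double[] weights, double[][] mapped, String[] terms,
                               double minWeight, int maxLabels) {
        double[] qe = new double[weights.length];
        for (double[] vector : mapped) {
            for (int k = 0; k < weights.length; k++) {
                qe[k] += Math.abs(weights[k] - vector[k]);
            }
        }
        List<Integer> candidates = new ArrayList<>();
        for (int k = 0; k < weights.length; k++) {
            if (weights[k] >= minWeight) {
                candidates.add(k);
            }
        }
        candidates.sort((a, b) -> Double.compare(qe[a], qe[b]));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(maxLabels, candidates.size()); i++) {
            result.add(terms[candidates.get(i)]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] weights = {0.9, 0.9, 0.1};                       // weight vector of one node
        double[][] mapped = {{0.9, 0.5, 0.1}, {0.9, 0.7, 0.1}};  // vectors mapped onto it
        String[] terms = {"fur", "legs", "wings"};
        System.out.println(labels(weights, mapped, terms, 0.85, 5));
    }
}
```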
writing a HTML-Description
java somlib.som.postprocess.HTMLDescr -i SOM-Map-Description_in -o html_out [-r html-relative Path] [-e cut # extensions [0-2]] [-p show General Parameters [t/f] ] [-q show Quantisation Error of SOM [t/f] ] [-a show Average Quantisation Error of SOM [t/f] ] [-t show ID of Unit [t/f] ] [-u show Quantisation Error of Unit [t/f] ] [-b show Average Quantisation Error of Unit [t/f] ] [-l show Labels [t/f] ] [-m show Label Quantisation Error [t/f] ] [-w show Label Weight [t/f] ] [-s show Mapped Vectors [t/f] ] [-d show Vectors Distance [t/f] ] [-v verbosity]
This produces an HTML description of a given SOM-Map-Description to facilitate the interpretation of a SOM's topology. The HTML-file is written to html_out, and html-relative Path is the path to the mapped files relative to the html_out file. Optionally one or two extensions can be cut off the names of the files mapped onto the map. This is controlled by the parameter -e; if it isn't given, the names remain unchanged.
The other parameters control what is included in the description. By default each of them is "false".
parameter-name | [optional] | type [value] | description | example
-i SOM-Map-Description_in | - | filename | SOM-Map-Description generated in mapping | tiere/mapping/mapdescr.map
-o html_out | - | filename | generated HTML-file | tiere/tiere.html
-r html-relative Path | optional | directory | relative path to mapped objects | tiere/mails
-e cut # extensions | optional | integer [0-2] | that many extensions of the mapped objects are deleted | 2
-p show General Parameters | optional | Boolean (true, false) [t/f] | include general parameters in description | t
-q show Quantisation Error of SOM | optional | Boolean (true, false) [t/f] | include total quantisation error in description | t
-a show Average Quantisation Error of SOM | optional | Boolean (true, false) [t/f] | include average quantisation error in description | t
-t show ID of Unit | optional | Boolean (true, false) [t/f] | include ID of a Unit in description | t
-u show Quantisation Error of Unit | optional | Boolean (true, false) [t/f] | include quantisation error of a Unit in description | f
-b show Average Quantisation Error of Unit | optional | Boolean (true, false) [t/f] | include average quantisation error of a Unit in description | f
-l show Labels | optional | Boolean (true, false) [t/f] | include labels in description | t
-m show Label Quantisation Error | optional | Boolean (true, false) [t/f] | include quantisation error of a label in description | f
-w show Label Weight | optional | Boolean (true, false) [t/f] | include the weight of a label in description | f
-s show Mapped Vectors | optional | Boolean (true, false) [t/f] | include mapped vectors in description | t
-d show Vectors Distance | optional | Boolean (true, false) [t/f] | include distance of mapped vectors in description | f
-v verbosity | optional | integer [0-3] | verbosity | 1
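The effect of the parameter -e can be illustrated by stripping trailing extensions from a file name. The sketch is only one plausible reading of 'cut # extensions'; the class name and the file name used are hypothetical.

```java
class CutExtensionsSketch {
    // Removes up to 'count' trailing extensions (".xyz") from a file name.
    static String cut(String name, int count) {
        String result = name;
        for (int i = 0; i < count; i++) {
            int dot = result.lastIndexOf('.');
            if (dot <= 0) {
                break; // no further extension to remove
            }
            result = result.substring(0, dot);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical name of a mapped file with two extensions.
        System.out.println(cut("dog.email.plain", 2));
    }
}
```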
creating a description for the LibViewer
java somlib.som.postprocess.LibViewer -i SOM-Map-Description_in -o LibViewer_out [-l imageLocation] [-s SOM-Vector-Description-Suffix] [-e cut # extensions [0-2]] [-v verbosity]
This class takes an existing map and produces a LibViewer-file. With the LibViewer the vectors mapped onto the map can be viewed as digital books. A SOM Map Description File (SOM-Map-Description_in) is needed to produce the description for the LibViewer. From the map description the SOM Node Descriptions and the directories containing the Vector Description Files (of the vectors mapped onto the map) are extracted, which are also needed to produce the LibViewer-description. If the Vector Description Files have a suffix other than ".vec", the standard extension, this has to be specified by the parameter SOM-Vector-Description-Suffix. Optionally one or two extensions can be cut off the names of the Vector Description Files by setting the parameter -e.
A logo for the sender's domain can be shown on the digital book's cover. For this purpose a picture file in gif-format is used that is named after the domain (e.g. edu.gif, at.gif, com.gif, ...). The directory searched for the file is imageLocation. If the parameter is not defined, the standard directory name images is used.
parameter-name | [optional] | type [value] | description | example
-i SOM-Map-Description_in | - | filename | SOM-Map-Description generated in mapping | tiere/mapping/mapdescr.map
-o LibViewer_out | - | filename | generated LibViewer-file | tiere/tiere.lib
-l imageLocation | optional | directory | path to pictures | images
-s SOM-Vector-Description-Suffix | optional | string | suffix of SOM-Vector-Description-files | desc
-e cut # extensions | optional | integer [0-2] | that many extensions of mapped objects are deleted | 2
-v verbosity | optional | integer [0-3] | verbosity | 1
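How a logo file could be located for a sender is sketched below. The sketch assumes that 'domain' means the top-level domain of the sender's address, as the examples edu.gif, at.gif and com.gif suggest; the class, the method and the addresses are hypothetical.

```java
class DomainLogoSketch {
    // Builds the path of the logo for a sender address,
    // falling back to the standard directory 'images'.
    static String logoPath(String sender, String imageLocation) {
        String dir = (imageLocation == null || imageLocation.isEmpty())
                ? "images" : imageLocation;
        String host = sender.substring(sender.lastIndexOf('@') + 1).trim();
        String tld = host.substring(host.lastIndexOf('.') + 1).toLowerCase();
        return dir + "/" + tld + ".gif";
    }

    public static void main(String[] args) {
        System.out.println(logoPath("someone@example.edu", null));
    }
}
```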
an example-log:
java somlib.som.preprocess.Vec2Vec -i tiere/tiere.in -o tiere/tiere.in -n f
java somlib.som.Training -t tiere/tiere.in -i 5000 -n 3.5 -l 0.8 -x 8 -y 3 -f tiere/Training/mapdescr.map -c tiere/Training/weightvec.wgt -r tiere/Training/nodedescr.node -s 20
java somlib.som.Mapping -m tiere/Training/mapdescr.map -i tiere/tiere.in -d tiere/descriptions -f tiere/Mapping/mapdescr.map -c tiere/Mapping/weightvec.wgt -p tiere/Mapping/quanterrmap.err -r tiere/Mapping/nodedescr.node -n tiere/Mapping/mapping.mpn
java somlib.som.labelling.NWords -i tiere/Mapping/mapdescr.map -w tiere/tiere.desc -l tiere/Mapping/label.lbl -n 4
java somlib.som.labelling.LabelSOM -i tiere/Mapping/mapdescr.map -l tiere/Mapping/label.lbl -w tiere/tiere.desc -m 0.85 -n 5
java somlib.som.postprocess.HTMLDescr -i tiere/Mapping/mapdescr.map -o tiere/HTMLdescription.html -p t -q t -a t -t t -u t -b t -l t -m t -w t -s t -d t
java somlib.som.postprocess.LibViewer -i tiere/Mapping/mapdescr.map -o tiere/map.lib