This step-by-step guide describes how to label maps created by the SOMLib Digital Library System with key phrases rather than simple keywords. It uses the KEA keyphrase extraction tool, rather than the labelSOM method, to extract phrases from the document clusters. Links to the downloadable software are provided, along with short descriptions of the available options.
This section describes where to get the various modules and how to install them.
You can install the LabelSOM II binaries, for example, in /usr/local/labelsomII/; just make sure that this directory is included in your $PATH variable (set, e.g., in your .bashrc).
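As a minimal sketch, assuming the binaries were unpacked to the example location /usr/local/labelsomII/ and that your shell reads ~/.bashrc, lines like the following could be added there. The variable name LABELSOM_HOME is only illustrative and is not used by the tools themselves.

```shell
# Hypothetical helper variable; adjust to your actual install location.
LABELSOM_HOME=/usr/local/labelsomII

# Append the directory to PATH only if it is not already listed.
case ":$PATH:" in
    *":$LABELSOM_HOME:"*) ;;
    *) PATH="$PATH:$LABELSOM_HOME" ;;
esac
export PATH
```

The case test keeps PATH from growing each time the file is sourced again.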
Some KEA modules require the Java Development Kit, which you can get from http://java.sun.com/products/jdk/1.2/
KEA 2.0
You will find the software package plus a detailed description at http://www.nzdl.org/Kea/index.html
You should
install the modules in
/usr/local/KEA-2.0/KEA-2.0/
Please include this path in the environment variable $CLASSPATH. Otherwise you have to pass -cp KEAINSTALLPATH with each KEA call (KEA is written entirely in Java).
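The two alternatives could look roughly like this, assuming KEA is installed in /usr/local/KEA-2.0/KEA-2.0/ as suggested above. KEA_HOME and the kea_java wrapper are hypothetical names introduced for this sketch; they are not part of KEA.

```shell
# Install location used in this guide; adjust if KEA lives elsewhere.
KEA_HOME=/usr/local/KEA-2.0/KEA-2.0

# Option 1: make the KEA classes visible to every java call in this shell.
export CLASSPATH="$KEA_HOME${CLASSPATH:+:$CLASSPATH}"

# Option 2: leave CLASSPATH alone and pass -cp on each call instead.
# kea_java is a hypothetical wrapper, not part of KEA.
kea_java() {
    java -cp "$KEA_HOME" "$@"
}
```

With the wrapper, a call such as `kea_java KEAModelBuilder -l CSTR_abstracts_train -m defaultmodel` would not depend on CLASSPATH being set.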
You should have a trained GHSOM as described in the SOMLib step-by-step guide. For our example we are using the TIME60 collection, so after the training process you should end up with a directory structure like this:
[user@Kenny time60]# ls -l
total 7752
drwxr-xr-x 2 user user 32768 Aug 22 2001 files_cleaned
lrwxrwxrwx 1 user user 16 Oct 23 2001 files -> /www/time/files/
drwxr-xr-x 3 user user 12288 Jan 7 2002 output
drwxrwxrwx 1 user user 17 Oct 23 2001 parser
drwxr-xr-x 2 user user 4096 Aug 22 2001 props
drwxrwxrwx 1 user user 18 Oct 23 2001 vectors
To extract keyphrases for new documents, you first need to build a KEA keyphrase extraction model from a set of documents (preferably from the same domain) for which you have author-assigned keyphrases. To this end you have to go through the following steps:
For our example we can build our KEA model from the training documents that come with the KEA package. So we change into the KEA directory and build the model defaultmodel with the default values:
[user@Kenny /]# cd /usr/local/KEA-2.0/KEA-2.0/
[user@Kenny KEA-2.0]# java KEAModelBuilder -l CSTR_abstracts_train -m defaultmodel
Building model with options: -l CSTR_abstracts_train -m defaultmodel -e default -x 3 -y 1 -o 2
[user@Kenny KEA-2.0]#
To extract keyphrases for our collection, we have to put the documents (for example, the cleaned files) into a directory. The file names have to end with the suffix .txt. So we create a subdirectory in our working directory time60, copy the collection into it and rename the files.
[user@Kenny time60]# mkdir key
[user@Kenny time60]# cp -R files/* key
[user@Kenny time60]# cd key/
[user@Kenny key]# for file in *; do
> mv -f "$file" "$file.txt"
> done
[user@Kenny key]#
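The loop above appends the suffix to every file, so running it a second time would produce names ending in .txt.txt. A rerunnable variant might look like the following sketch; add_txt_suffix is a hypothetical helper name, not part of KEA or LabelSOM II.

```shell
# add_txt_suffix DIR: append ".txt" to every regular file in DIR that
# does not already end in ".txt". Hypothetical helper, not part of KEA.
add_txt_suffix() {
    for file in "$1"/*; do
        [ -f "$file" ] || continue      # skip directories and an unmatched glob
        case "$file" in
            *.txt) ;;                   # already has the suffix
            *) mv -- "$file" "$file.txt" ;;
        esac
    done
}
```

For the example collection it would be called from the working directory as `add_txt_suffix key`.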
NOTE: Whatever your original files are called, the keyphrase extraction and the later processing need files named in the following format:
original_filename.txt
NOTE II: You will get better keyphrase extraction results from plain text than from HTML files or similar formats, so you should clean them up first.
Before we can extract the keyphrases from our documents we have to fix one shortcoming of KEA. KEA works with a fixed (hardcoded) English stopword list, which gives bad results when you are working with a collection in a different language. So we need to recompile the relevant module with the collection-specific stopwords found in the parsing process. (Note that you should adapt the script if you haven't set the CLASSPATH variable and/or KEA is installed in a different directory. You will find the script generate_stopwords in the directory where you have installed all the LABELSOM_II binaries.)
[user@Kenny time60]# cd vectors/
[user@Kenny vectors]# generate_stopwords time.removed.txt
Programme: generate_stopwords 1.0
extracting the stopwords from a giving file which has been generated through
somlib_parser_script
...pressuming KEA is installed on /usr/local/KEA-2.0/KEA-2.0/
Starting ....
greping stopwords from time.removed.txt ...
generating code
Compiling Code
Done ... you will find the defined stopwords in the file
generate_stopwords.removed in this directory
On your next run of the KEA_ModelBuilder or KEA_KeyphraseExtractor these words
will be assumed as stopwords
[user@Kenny vectors]# cd ..
We now have clean text files ready for keyphrase extraction; we just have to start the KEAKeyphraseExtractor with our previously built KEA model "defaultmodel". Don't forget the -a option, otherwise the further processing will not work. More detailed information and an explanation of all options are given in the KEA documentation (local copy).
[user@Kenny time60]# java KEAKeyphraseExtractor -m /usr/local/KEA-2.0/KEA-2.0/defaultmodel -l key -a -n 15
Extracting keyphrases with options: -l key -m /usr/local/KEA-2.0/KEA-2.0/defaultmodel -e default -n 15 -a
Avg. number of correct keyphrases: 0 +/- 0
Based on 0 documents
[user@Kenny time60]#
This will create a ".key" file for each document in the directory. Each file will contain fifteen (option -n) extracted keyphrases for the corresponding document. If the ".txt" files are no longer needed, we can remove them from the directory to save disk space.
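Before deleting anything, it may be worth checking that the extraction really produced a ".key" file for every document. The following is a small sketch of such a check; it assumes that KEA writes <name>.key next to each <name>.txt, and list_missing_keys is a hypothetical helper name.

```shell
# list_missing_keys DIR: print every DIR/*.txt that has no matching .key file.
# Assumes KEA names its output <name>.key for an input <name>.txt.
list_missing_keys() {
    for txt in "$1"/*.txt; do
        [ -f "$txt" ] || continue       # unmatched glob: nothing to check
        key="${txt%.txt}.key"
        [ -f "$key" ] || printf '%s\n' "$txt"
    done
}
```

If `list_missing_keys key` prints nothing, every document has its keyphrase file and the ".txt" copies can be removed.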
The labelling process is divided into two parts. First, we parse the unit files generated by the GHSOM, calculate the labels and write a new unit file; in a second step, an HTML output is generated from the new unit file. The two programmes for these steps are parse_unitfile and unit2html. Both use the same configuration file, which we will explain first.
Configuration file
· NUMBEROFPHRASES 15
This option affects both programmes and defines how many labels are calculated for each cluster.
· OUTPUTPATH output/
This is the relative path where parse_unitfile and unit2html expect the ".unit" files and where all output is written.
· KEYPATH output/key/
This is the path, relative to the current directory, where the ".key" files are expected.
· KEAFILEEXTENSION .key
Keep this option unchanged as long as you do not change the suffix of the ".key" files.
· SOURCEPATH files/
The path, relative to the current directory, where the original source files can be found.
· TMPKEAPATH output/tmp_kea/
A temporary directory, relative to the current directory.
· KEYFILESLINKPATH key/
The directory path to the ".key" files, relative to the output directory.
· SOURCEFILESLINKPATH files/
The directory path to the original source files, relative to the output directory.
· KEYLINKS_OUTPUT 1
This option affects the HTML output only. It determines whether the hyperlinks to the ".key" files are generated or not (0 = FALSE, everything else = TRUE, default = true).
· LABELSOMI_OUTPUT 0
This option affects the HTML output only. If set to TRUE and the labels of LABELSOM I are available, they are written (default = true).
· FILELINKS_OUTPUT 1
This option affects the HTML output only. If set to TRUE, the links to the mapped files are generated (default = true).
NOTE: Please make sure all paths end with a slash /.
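Since a missing trailing slash is easy to overlook, a quick check of the configuration file can help. The sketch below assumes the file holds one "OPTION value" pair per line, as in the list above, and treats every option whose name ends in PATH as a directory path; check_config_paths is a hypothetical helper, not part of LabelSOM II.

```shell
# check_config_paths FILE: report every option whose name ends in PATH
# and whose value lacks the trailing slash. Returns 1 if any is found.
# Assumes one "OPTION value" pair per line, as in the list above.
check_config_paths() {
    rc=0
    while read -r option value || [ -n "$option" ]; do
        case "$option" in
            *PATH)
                case "$value" in
                    */) ;;
                    *) echo "$option: '$value' should end with /" >&2
                       rc=1 ;;
                esac ;;
        esac
    done < "$1"
    return "$rc"
}
```

It could be run as `check_config_paths labels.config` before starting parse_unitfile.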
Before we can start, we have to set up some directories for the HTML output:
[user@Kenny time60]# cd output
[user@Kenny output]# ln -s ../key key
[user@Kenny output]# ln -s ../files files
[user@Kenny output]# mkdir tmp_kea
parse_unitfile
Usage: parse_unitfile processfile config_file
The programme takes two input parameters: the first is a process file listing all the files to be parsed, the second is the configuration file. In our example we have a trained GHSOM named time60 in our output directory, with 38 maps (unit files).
[user@Kenny output]# ls *.unit
time60_10_2_2_2.unit time60_2_2_0_0.unit time60_33_4_1_1.unit
time60_1_1_0_0.unit time60_22_3_0_1.unit time60_34_4_2_1.unit
time60_11_2_0_3.unit time60_23_3_1_1.unit time60_35_4_1_2.unit
time60_12_2_1_3.unit time60_24_3_2_1.unit time60_36_4_2_2.unit
time60_13_2_2_3.unit time60_25_3_0_2.unit time60_37_5_1_2.unit
time60_14_3_1_0.unit time60_26_3_2_2.unit time60_38_5_3_1.unit
time60_15_3_1_0.unit time60_27_3_1_3.unit time60_4_2_2_0.unit
time60_16_3_0_1.unit time60_28_3_2_3.unit time60_5_2_0_1.unit
time60_17_3_1_1.unit time60_29_3_2_0.unit time60_6_2_1_1.unit
time60_18_3_0_3.unit time60_30_3_0_2.unit time60_7_2_2_1.unit
time60_19_3_0_0.unit time60_31_4_1_0.unit time60_8_2_0_2.unit
time60_20_3_1_0.unit time60_3_2_1_0.unit time60_9_2_1_2.unit
time60_21_3_2_0.unit time60_32_4_2_0.unit
Due to the hierarchical structure of a GHSOM, we need to process the files in the right order, from the bottom (lowest layer) to the top. We have written a script which generates a proper process file. The usage of the script is generate_processfile <name_of_the_map>, and the output is written to the file process.parse_unitfile.
[user@Kenny output]# generate_processfile time60
...written process.parse_unitfile !
[user@Kenny output]# more process.parse_unitfile
time60_38_5_3_1.unit
time60_37_5_1_2.unit
time60_36_4_2_2.unit
time60_35_4_1_2.unit
time60_34_4_2_1.unit
time60_33_4_1_1.unit
....
time60_6_2_1_1.unit
time60_5_2_0_1.unit
time60_4_2_2_0.unit
time60_3_2_1_0.unit
time60_2_2_0_0.unit
time60_1_1_0_0.unit
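The internals of generate_processfile are not shown here. If the script is not at hand, the ordering seen in the listing can be approximated with sort, as in the following sketch. It rests on two assumptions inferred from the file names, not from any documentation: that the names follow the pattern <map>_<index>_<layer>_<x>_<y>.unit, and that the map name itself contains no underscore. order_unitfiles is a hypothetical stand-in, not the original script.

```shell
# order_unitfiles MAP: print MAP_*.unit sorted deepest layer first, then
# by descending map index. Hypothetical stand-in, not the original script.
# Assumes names of the form MAP_<index>_<layer>_<x>_<y>.unit.
order_unitfiles() {
    for f in "$1"_*.unit; do
        [ -e "$f" ] || continue         # unmatched glob: print nothing
        printf '%s\n' "$f"
    done | sort -t_ -k3,3nr -k2,2nr
}
```

From the working directory it could be used as `(cd output && order_unitfiles time60 > process.parse_unitfile)`.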
So we can
start the parsing process.
[user@Kenny time60]# parse_unitfile output/process.parse_unitfile labels.config
Processing output/time60_38_5_3_1.unit and writing output/time60_38_5_3_1.unit.labelunit
...
Processing output/time60_37_5_1_2.unit and writing output/time60_37_5_1_2.unit.labelunit
...
Processing output/time60_36_4_2_2.unit and writing output/time60_36_4_2_2.unit.labelunit
...
...
Processing output/time60_2_2_0_0.unit and writing output/time60_2_2_0_0.unit.labelunit
...
Processing output/time60_1_1_0_0.unit and writing output/time60_1_1_0_0.unit.labelunit
...
[user@Kenny time60]#
The programme parse_unitfile parses each file and writes a new one with the additional suffix ".labelunit", so the old files are preserved. Unfortunately, the ".labelunit" files, like the ".unit" files, are not very comfortable to read and interpret, so we wrote a second application which generates an HTML output that is easy to read and browse.
unit2html
Usage: unit2html processfile config_file
Again there are two arguments, a process file and the configuration file. We are using the same configuration file as above. To generate a process file we can do the following:
[user@Kenny time60]# cd output
[user@Kenny output]# ls *.labelunit > process.unit2html
[user@Kenny output]# more process.unit2html
time60_10_2_2_2.unit.labelunit
time60_1_1_0_0.unit.labelunit
... and so on !
Now we can create the HTML pages with:
[user@Kenny time60]# unit2html output/process.unit2html labels.config
Processing output/time60_10_2_2_2.unit.labelunit ....and writing output/time60_10_2_2_2.unit.labelunit.html !
Processing output/time60_1_1_0_0.unit.labelunit ....and writing output/time60_1_1_0_0.unit.labelunit.html !
....
Processing output/time60_8_2_0_2.unit.labelunit ....and writing output/time60_8_2_0_2.unit.labelunit.html !
Processing output/time60_9_2_1_2.unit.labelunit ....and writing output/time60_9_2_1_2.unit.labelunit.html !
[user@Kenny time60]#
We can browse the top layer of the map with the following command:
[user@Kenny time60]# netscape output/time60_1_1_0_0.unit.labelunit.html &
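For repeated runs, the labelling commands used in this guide can be collected in one shell function. This is only a sketch: label_maps is a hypothetical name, and it assumes that the function is started in the working directory (time60 in the example), that the ".key" files and the output directories are already in place, and that generate_processfile, parse_unitfile and unit2html are reachable via PATH.

```shell
# label_maps MAP CONFIG: run the labelling steps of this guide in order.
# Hypothetical wrapper; must be started in the working directory.
label_maps() {
    map=$1
    config=$2
    # 1. Order the .unit files from the lowest layer to the top.
    ( cd output && generate_processfile "$map" ) || return 1
    # 2. Compute the labels; writes one .labelunit file per .unit file.
    parse_unitfile output/process.parse_unitfile "$config" || return 1
    # 3. List the labelled unit files for the HTML generator.
    ( cd output && ls *.labelunit > process.unit2html ) || return 1
    # 4. Write one HTML page per labelled unit file.
    unit2html output/process.unit2html "$config"
}
```

For the example collection the call would be `label_maps time60 labels.config`; the function stops at the first step that fails.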
Comments: michael.majdic@gmx.at, rauber@ifs.tuwien.ac.at