This step-by-step guide describes how to label maps created by the SOMLib Digital Library System with key phrases rather than simple keywords. It uses the KEA keyphrase extraction tool, rather than the labelSOM method, to extract phrases from the document clusters. Links to the downloadable software are provided, together with short descriptions of the available options.
Java JDK 1.2: http://java.sun.com/products/jdk/1.2/
KEA 2.0: The software package and a detailed description are available at http://www.nzdl.org/Kea/index.html
Install the modules in /usr/local/KEA-2.0/KEA-2.0/
Add this path to the environment variable $CLASSPATH. Otherwise you have to pass -cp KEAINSTALLPATH in each KEA call (KEA is written entirely in Java).
[root@Kenny time60]# ls -l
total 7752
drwxr-xr-x 2 root root 32768 Aug 22 2001 files_cleaned
lrwxrwxrwx 1 root root 16 Oct 23 2001 files -> /www/time/files/
drwxr-xr-x 3 root root 12288 Jan 7 2002 output
drwxrwxrwx 1 root root 17 Oct 23 2001 parser
drwxr-xr-x 2 root root 4096 Aug 22 2001 props
drwxrwxrwx 1 root root 18 Oct 23 2001 vectors
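The CLASSPATH setting mentioned above can be made repeatable with a small helper. This is a minimal sketch, assuming a POSIX shell; the names KEA_HOME and add_to_classpath are illustrative and not part of KEA.

```shell
# KEA install directory as named in this guide.
KEA_HOME=/usr/local/KEA-2.0/KEA-2.0/

# Hypothetical helper: append a directory to CLASSPATH unless it is already listed.
add_to_classpath() {
    case ":${CLASSPATH:-}:" in
        *":$1:"*) ;;                                   # already present, nothing to do
        *) CLASSPATH="${CLASSPATH:+$CLASSPATH:}$1" ;;  # append, adding ':' only if needed
    esac
    export CLASSPATH
}

add_to_classpath "$KEA_HOME"
```

Calling the helper from a shell start-up file keeps the entry from being duplicated when the file is sourced more than once.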
[root@Kenny /]# cd /usr/local/KEA-2.0/KEA-2.0/
[root@Kenny KEA-2.0]# java KEAModelBuilder -l CSTR_abstracts_train -m defaultmodel
Building model with options: -l CSTR_abstracts_train -m defaultmodel -e default -x 3 -y 1 -o 2
[root@Kenny KEA-2.0]#
[root@Kenny time60]# mkdir key
[root@Kenny time60]# cp -R files/* key
[root@Kenny time60]# cd key/
[root@Kenny key]# for file in *; do
> mv -f "$file" "$file.txt"
> done
[root@Kenny key]#
NOTE: Depending on how your original files are named, this step may differ. The keyphrase extraction and the later processing need files named with the suffix ".txt", as produced by the loop above.
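If the copy step is repeated, a plain rename loop would append ".txt" a second time. The following sketch avoids that; the function name add_txt_suffix is an illustrative assumption, not part of SOMLib or KEA.

```shell
# Hypothetical helper: append ".txt" to every regular file in a directory,
# skipping files that already carry the suffix.
add_txt_suffix() {
    dir=$1
    for file in "$dir"/*; do
        [ -f "$file" ] || continue            # skip directories and an unmatched glob
        case "$file" in
            *.txt) ;;                         # already named correctly
            *) mv -f -- "$file" "$file.txt" ;;
        esac
    done
}

# Usage in the example directory:
# add_txt_suffix key
```

Quoting the variables keeps file names that contain spaces intact.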
[root@Kenny time60]# cd vectors/
[root@Kenny vectors]# generate_stopwords time.removed.txt
Programme: generate_stopwords 1.0
extracting the stopwords from a giving file which has been generated through
somlib_parser_script
...pressuming KEA is installed on /usr/local/KEA-2.0/KEA-2.0/
Starting ....
greping stopwords from time.removed.txt ...
generating code
Compiling Code
Done ... you will find the defined stopwords in the file
generate_stopwords.removed in this directory
On your next run of the KEA_ModelBuilder or KEA_KeyphraseExtractor these words
will be assumed as stopwords
[root@Kenny vectors]# cd ..
In this case we now have clean text files ready for keyphrase extraction, so we just have to start the KEAKeyphraseExtractor with our previously built KEA model "defaultmodel". Don't forget the -a option, otherwise the further processing will not work. More detailed information and an explanation of all options are given in the KEA documentation.
[root@Kenny time60]# java KEAKeyphraseExtractor -m /usr/local/KEA-2.0/KEA-2.0/defaultmodel -l key -a -n 15
Extracting keyphrases with options: -l key -m /usr/local/KEA-2.0/KEA-2.0/defaultmodel -e default -n 15 -a
Avg. number of correct keyphrases: 0 +/- 0
Based on 0 documents
[root@Kenny time60]#
This creates a ".key" file for each document in the directory. Each file contains fifteen (option -n) extracted keyphrases for the corresponding document. If the ".txt" files are no longer needed, we can remove them from the directory to save disk space.
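Removing the ".txt" files can be limited to documents that actually received keyphrases. The sketch below assumes that the extractor writes name.key next to name.txt in the same directory; the function name remove_extracted_txt is illustrative.

```shell
# Hypothetical helper: delete <name>.txt only when a non-empty <name>.key exists.
remove_extracted_txt() {
    dir=$1
    for txt in "$dir"/*.txt; do
        [ -f "$txt" ] || continue             # nothing matched, or not a regular file
        key="${txt%.txt}.key"
        if [ -s "$key" ]; then
            rm -f -- "$txt"
        else
            echo "no keyphrases for $txt, keeping it" >&2
        fi
    done
}

# Usage in the example directory:
# remove_extracted_txt key
```

Files without a matching ".key" file stay in place, so a failed extraction can be repeated without copying the sources again.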
NUMBEROFPHRASES 15
This option affects both programmes and defines how many labels are calculated for each cluster.

OUTPUTPATH output/
The relative path where parse_unitfile and unit2html look for the ".unit" files and where all output is written.

KEYPATH output/key/
The path, relative to the current directory, where the ".key" files are expected.

KEAFILEEXTENSION .key
Keep this option unchanged as long as you do not change the suffix of the ".key" files.

SOURCEPATH files/
The path, relative to the current directory, where the original source files can be found.

TMPKEAPATH output/tmp_kea/
A temporary directory, relative to the current directory.

KEYFILESLINKPATH key/
The directory path to the ".key" files, relative to the output directory.

SOURCEFILESLINKPATH files/
The directory path to the original source files, relative to the output directory.

KEYLINKS_OUTPUT 1
This option affects the HTML output only. It determines whether the hyperlinks to the ".key" files are generated (0 = TRUE, everything else = FALSE).

LABELSOMI_OUTPUT 0
This option affects the HTML output only. If it is set to TRUE and the labels of the LABELSOM I are available, they are written.

NOTE: Please let all paths end with a slash /.
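The trailing-slash rule is easy to check before running the tools. This is a minimal sketch, assuming the configuration file holds one whitespace-separated "OPTION value" pair per line as in the listing above; check_config_paths is an illustrative name.

```shell
# Hypothetical helper: print every *PATH option whose value lacks a trailing
# slash and return a non-zero status if any such option is found.
check_config_paths() {
    awk '$1 ~ /PATH$/ && $2 !~ /\/$/ { print $1; bad = 1 }
         END { exit bad + 0 }' "$1"
}

# Usage with the configuration file used later in this guide:
# check_config_paths labels.config
```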
Before we can start, we have to set up some directories for the HTML output.
[root@Kenny time60]# cd output
[root@Kenny output]# ln -s ../key key
[root@Kenny output]# ln -s ../files files
[root@Kenny output]# mkdir tmp_kea
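The same directory layout can be created in a way that is safe to run more than once. This is a sketch under the assumption that key/ and files/ sit next to output/, as in the example directory; setup_output_links is an illustrative name.

```shell
# Hypothetical helper: create output/tmp_kea and the two symbolic links,
# replacing existing links instead of failing on a second run.
setup_output_links() {
    base=$1
    mkdir -p "$base/output/tmp_kea"
    ln -sfn ../key "$base/output/key"
    ln -sfn ../files "$base/output/files"
}

# Usage from inside the example directory:
# setup_output_links .
```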
parse_unitfile
Usage: parse_unitfile processfile config_file
The programme takes two input parameters: the first is a process file that lists all files to be parsed, the second is the configuration file. In this example we have a trained GHSOM named time60 in our output directory, with 38 maps (unit files).
[root@Kenny output]# ls *.unit
time60_10_2_2_2.unit time60_2_2_0_0.unit time60_33_4_1_1.unit
time60_1_1_0_0.unit time60_22_3_0_1.unit time60_34_4_2_1.unit
time60_11_2_0_3.unit time60_23_3_1_1.unit time60_35_4_1_2.unit
time60_12_2_1_3.unit time60_24_3_2_1.unit time60_36_4_2_2.unit
time60_13_2_2_3.unit time60_25_3_0_2.unit time60_37_5_1_2.unit
time60_14_3_1_0.unit time60_26_3_2_2.unit time60_38_5_3_1.unit
time60_15_3_1_0.unit time60_27_3_1_3.unit time60_4_2_2_0.unit
time60_16_3_0_1.unit time60_28_3_2_3.unit time60_5_2_0_1.unit
time60_17_3_1_1.unit time60_29_3_2_0.unit time60_6_2_1_1.unit
time60_18_3_0_3.unit time60_30_3_0_2.unit time60_7_2_2_1.unit
time60_19_3_0_0.unit time60_31_4_1_0.unit time60_8_2_0_2.unit
time60_20_3_1_0.unit time60_3_2_1_0.unit time60_9_2_1_2.unit
time60_21_3_2_0.unit time60_32_4_2_0.unit
Due to the hierarchical structure of a GHSOM, we need to process the files in the right order, from the bottom (lowest layer) to the top. We have written a script which generates a proper process file. The usage of the script is generate_processfile <name_of_the_map>, and the output is written to the file process.parse_unitfile.
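The bottom-to-top ordering can be approximated with standard tools. The sketch below is not the script mentioned in this guide; it only assumes that the second underscore-separated field of a unit file name is a running map number that grows with depth, as the example file names suggest, and that the map name itself contains no underscore. The function name list_units_bottom_up is illustrative.

```shell
# Hypothetical helper: list the unit files of one map by descending map number.
# $1 = directory holding the ".unit" files, $2 = map name (e.g. time60)
list_units_bottom_up() {
    (
        cd "$1" || exit 1
        for f in "$2"_*.unit; do
            [ -e "$f" ] && printf '%s\n' "$f"   # skip the pattern itself if nothing matched
        done | sort -t_ -k2,2nr                 # field 2 = map number, numeric, reversed
    )
}

# Usage in the example directory:
# list_units_bottom_up output time60 > output/process.parse_unitfile
```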
[root@Kenny output]# generate_processfile time60
...written process.parse_unitfile !
[root@Kenny output]# more process.parse_unitfile
time60_38_5_3_1.unit
time60_37_5_1_2.unit
time60_36_4_2_2.unit
time60_35_4_1_2.unit
time60_34_4_2_1.unit
time60_33_4_1_1.unit
....
time60_6_2_1_1.unit
time60_5_2_0_1.unit
time60_4_2_2_0.unit
time60_3_2_1_0.unit
time60_2_2_0_0.unit
time60_1_1_0_0.unit
Now we can start the parsing process.
[root@Kenny time60]# parse_unitfile output/process.parse_unitfile labels.config
Processing output/time60_38_5_3_1.unit and writing output/time60_38_5_3_1.unit.labelunit ...
Processing output/time60_37_5_1_2.unit and writing output/time60_37_5_1_2.unit.labelunit ...
Processing output/time60_36_4_2_2.unit and writing output/time60_36_4_2_2.unit.labelunit ...
...
Processing output/time60_2_2_0_0.unit and writing output/time60_2_2_0_0.unit.labelunit ...
Processing output/time60_1_1_0_0.unit and writing output/time60_1_1_0_0.unit.labelunit ...
[root@Kenny time60]#
parse_unitfile parses each file and writes a new one with the additional suffix ".labelunit", so the old files are preserved. Unfortunately, the ".labelunit" files, like the ".unit" files, are not comfortable to read and interpret, so we wrote a second application which generates HTML output that is easy to read and browse.
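Because every ".unit" file should end up with a ".labelunit" companion, a quick check can reveal maps that were skipped. This is a minimal sketch; missing_labelunits is an illustrative name and not one of the SOMLib tools.

```shell
# Hypothetical helper: print every ".unit" file that has no ".labelunit" companion.
missing_labelunits() {
    dir=$1
    for unit in "$dir"/*.unit; do
        [ -f "$unit" ] || continue                       # nothing matched
        [ -f "$unit.labelunit" ] || printf '%s\n' "$unit"
    done
}

# Usage in the example directory:
# missing_labelunits output
```

An empty result means that the parsing step covered all maps.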
unit2html
Usage: unit2html processfile config_file
Again there are two arguments, a process file and the configuration file. We use the same configuration file as above. To generate a process file we can do the following:
[root@Kenny time60]# cd output
[root@Kenny output]# ls *.labelunit > process.unit2html
[root@Kenny output]# more process.unit2html
time60_10_2_2_2.unit.labelunit
time60_1_1_0_0.unit.labelunit
... and so on !
Now we can create the HTML pages with
[root@Kenny time60]# unit2html output/process.unit2html labels.config
Processing output/time60_10_2_2_2.unit.labelunit ....and writing output/time60_10_2_2_2.unit.labelunit.html !
Processing output/time60_1_1_0_0.unit.labelunit ....and writing output/time60_1_1_0_0.unit.labelunit.html !
....
Processing output/time60_8_2_0_2.unit.labelunit ....and writing output/time60_8_2_0_2.unit.labelunit.html !
Processing output/time60_9_2_1_2.unit.labelunit ....and writing output/time60_9_2_1_2.unit.labelunit.html !
[root@Kenny time60]#
The top layer of the map can be browsed with the following command:
[root@Kenny time60]# netscape output/time60_1_1_0_0.unit.labelunit.html &