Department of Software Technology
Vienna University of Technology
The SOMLib Digital Library - Experiments - Der Standard
Overview
For the experiments described below we use a collection of articles from the Austrian daily newspaper Der Standard. (Special thanks to Der Standard for providing us with the article collection and allowing us to use it in our experiments.)
On this page:
- Data
- Text Representation
- Trained Growing Hierarchical Self-Organizing Maps with Labels
Data
The complete text collection consists of almost 50,000 documents. The following experiments use a subset of these articles,
covering the second quarter of 1999, i.e. the months April, May, and June 1999. This subset consists of 11,627 articles. All HTML
tags were removed from the articles to obtain text-only representations suitable for content analysis by the SOMLib system.
Experiments using the full dataset are presented as well.
Text Representation
To be used for map training, a vector-space representation of the individual documents is created by full-text indexing.
For each document collection, a list of all words appearing in the respective collection is extracted, applying some basic word-stemming techniques.
Words that do not contribute to content description are removed from these lists.
Instead of defining language- or content-specific stop-word lists, we discard terms that appear in more than 813 articles (7%) or in fewer than 65 articles (0.56%).
We thus end up with a vector dimensionality of 3,799 unique terms.
The individual documents are then represented by feature vectors using a tf x idf (term frequency times inverse document frequency) weighting scheme as described by Salton.
This weighting scheme assigns high values to terms that are important for describing and discriminating between the documents.
The 11,627 articles are thus represented by automatically extracted 3,799-dimensional feature vectors of word histograms, weighted by the tf x idf scheme and normalized to unit length.
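The indexing pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual SOMLib indexer: the function name and the parameterization of the document-frequency bounds are our own, and the thresholds simply default to the 65 / 813 counts used for the subset.

```python
import math
from collections import Counter

def build_vectors(docs, low=65, high=813):
    """docs: list of token lists (already stemmed).
    Keep only terms whose document frequency lies within [low, high]
    (terms outside these bounds are treated as stop words), weight the
    remaining terms by tf x idf, and normalize each vector to unit length."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    # template vector: terms that are neither too rare nor too frequent
    vocab = sorted(t for t, f in df.items() if low <= f <= high)
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in index)
        vec = [0.0] * len(vocab)
        for t, f in tf.items():
            vec[index[t]] = f * math.log(n / df[t])   # tf x idf
        norm = math.sqrt(sum(w * w for w in vec)) or 1.0
        vectors.append([w / norm for w in vec])
    return vocab, vectors
```

On the real collection this yields the 3,799-term template vector and the 11,627 unit-length input vectors listed below.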
The listing below provides the template vector, i.e. the list of words used for document representation, the list of removed
"stop words", as well as the feature vectors used for training the maps of the subset of the article collection.
- Template Vector: List of the 3,799 words, i.e. index terms, used for representing the content of the documents
plain text (162 KB),
gnu-zipped (65 KB)
- Removed Words: List of words removed because they appear either too frequently, i.e. in more than 813 articles (7%), or too
rarely, i.e. in fewer than 65 articles (0.56%), indicated by a High or Low tag in the file.
gnu-zipped (1.3 MB)
- Input Vectors: List of vectors used for training the SOMs, i.e. a list of 11,627 vectors of dimensionality 3,799,
weighted by the tf x idf weighting scheme.
gnu-zipped (3.3 MB)
Furthermore, for more extensive experiments, the complete collection of 47,684 articles was used.
Words appearing in fewer than 0.9% of all articles (i.e. in fewer than 429 articles) or in more than 8% (i.e. in more than
3,814 articles) were removed from the feature list, resulting in 2,487-dimensional feature vectors (down from an initial 438,313
words).
- Template Vector: List of the 2,487 words, i.e. index terms, used for representing the content of the
47,684 documents
std_1999_5.tv.gz (gnu-zipped, 52 KB)
- Removed Words: List of words removed because they appear either too frequently, i.e. in more than 3,814 articles (8%), or too
rarely, i.e. in fewer than 429 articles (0.9%), indicated by a High or Low tag in the file.
std_1999_5.removed.txt.gz (gnu-zipped, 3.1 MB)
- Input Vectors: List of vectors used for training the SOMs, i.e. a list of 47,684 vectors of dimensionality 2,487,
weighted by the tf x idf weighting scheme.
std_1999_5.tfxidf.gz (11.8 MB)
Document Clustering:
Trained Growing Hierarchical Self-Organizing Maps with Labels
With document collections of this size, single flat SOMs no longer offer a convenient interface, as the resulting maps would become too large.
We thus use our new Growing Hierarchical Self-Organizing Map (GHSOM) model to create a hierarchical representation of the document archive.
Based on the document description as outlined above, we trained several growing hierarchical self-organizing maps to represent the contents of the document archive. Using the labelSOM method, characteristic keywords were automatically extracted from the trained maps, describing the various topical clusters.
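The keyword extraction step can be illustrated with a strongly simplified sketch of the labelSOM idea: for the documents mapped to one unit, select terms that have a high mean weight and a low per-term quantization error (i.e. terms that are both important and consistently present). The function name and the threshold parameter below are illustrative, not part of the actual labelSOM implementation.

```python
def label_unit(doc_vectors, vocab, top_k=3, tau=0.1):
    """Simplified labelSOM-style sketch: rank the terms of the documents
    mapped to a single unit by low quantization error (small spread around
    the unit's mean) among those with mean weight above tau."""
    n = len(doc_vectors)
    dim = len(vocab)
    # mean tf x idf weight of each term over the unit's documents
    mean = [sum(v[i] for v in doc_vectors) / n for i in range(dim)]
    # per-term quantization error: summed deviation from the mean
    qe = [sum(abs(v[i] - mean[i]) for v in doc_vectors) for i in range(dim)]
    cands = [i for i in range(dim) if mean[i] > tau]
    cands.sort(key=lambda i: (qe[i], -mean[i]))
    return [vocab[i] for i in cands[:top_k]]
```

Terms selected this way serve as the cluster labels shown on the maps below.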
GHSOMs with a subset of the article collection: 2nd Quarter 1999
- GHSOM 1
: A rather deep hierarchy with very well separated topical branches. Among the main topical branches we find Sports, Culture, Radio and TV Programs, the Political Situation in the Balkans, Internal Affairs, and Business.
Of this map, we also provide a version labeled with an extension to the standard labeling algorithm, incorporating phrases:
GHSOM 1 with phrase labels.
(standq2_3799_6.prop: property file for this run)
- GHSOM 2
: A shallower hierarchy, yet one describing all topical clusters at the same level of detail in the lowest layer, due to an identical setting of t2. A different setting of t1, however, results in larger maps at each layer, representing the various topics in more detail.
Of this map, we also provide a version labeled with an extension to the standard labeling algorithm, incorporating phrases:
GHSOM 2 with phrase labels.
(standq2_3799_4.prop: property file for this run)
- GHSOM of the full article collection
The full collection of almost 50,000 articles, with limited browsing support (see explanations below)
Of this map, we also provide a version labeled with an extension to the standard labeling algorithm, incorporating phrases:
GHSOM of the full article collection with phrase labels.
(std_1999_5.prop: property file for this run)
A brief discussion of the results
Training the GHSOM with parameters t1=0.035 and t2=0.0035 results in the shallower hierarchical structure of up to 7 layers provided as GHSOM 2.
The layer 1 map grows to a size of 7 x 4 units, all of which are expanded at subsequent layers.
Among the most dominant branches we find, for example, Sports in the upper right corner of the map, Internal Affairs in the lower right corner, and Internet-related articles on the left-hand side of the map, to name but a few.
However, due to the large size of the resulting first layer map, a fine-grained representation of the data is already provided at this layer.
This results in some larger clusters being represented by two neighboring units already at the first layer, rather than being split up in a lower layer of the hierarchy.
For example, we find the cluster on Internal Affairs to be represented by two neighboring units.
One of these, on position (6/4), covers solely articles related to the Freedom Party and its political leader Jörg Haider, representing one of the most dominant political topics in Austria for some time now, resulting in an accordingly large number of news articles covering this topic.
The neighboring unit to the right, i.e. located in the lower right corner on position (7/4), covers other Internal Affairs, with one of the main topics being the elections to the European Parliament.
The topics of articles covered by these two maps are closely related.
We also find, for example, articles related to the Freedom Party on the branch covering the more general Internal Affairs, reporting on their role and campaigns for the elections to the European Parliament.
As might be expected, these are closely related to the other articles on the Freedom Party, which are located in the neighboring branch.
Obviously, we would like them to be presented on the left hand side of this map, so as to allow the transition from one map to the next, with a continuous orientation of topics.
Due to the initialization of the added maps during the training process, this continuous orientation is preserved, as can easily be seen from the automatically extracted labels on the two maps.
Continuing from the second-layer map of unit (6/4) to the right, we reach the corresponding second-layer map of unit (7/4), where we first find articles focusing on the Freedom Party, before moving on to the Social Democrats, the People's Party, the Green Party, and the Liberal Party.
As all units at layer two in these branches have a quantization error below 42.63, no unit is further expanded at a third layer.
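The roles of t1 and t2 in these growth and expansion decisions can be sketched as follows. This is our reading of the two GHSOM stopping criteria as described in the GHSOM literature, not the actual training code; the function name and argument layout are illustrative.

```python
def ghsom_decisions(unit_qes, qe_parent, qe_root, t1=0.035, t2=0.0035):
    """Sketch of the two GHSOM stopping criteria:
    - a map keeps growing (adding rows/columns of units) while its mean
      quantization error (MQE) exceeds t1 times the quantization error
      of the parent unit it represents in more detail;
    - an individual unit is expanded into a child map on the next layer
      while its own quantization error exceeds t2 times the quantization
      error of the whole collection (the virtual layer-0 unit)."""
    mqe = sum(unit_qes) / len(unit_qes)
    grow_map = mqe > t1 * qe_parent
    expand_units = [qe > t2 * qe_root for qe in unit_qes]
    return grow_map, expand_units
```

A smaller t1 thus produces larger, flatter maps, while a smaller t2 drives the hierarchy deeper; units whose quantization error already falls below the t2 threshold, like those mentioned above, are simply not expanded.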
A similar situation is encountered with the Sports cluster, represented by 3 units in the upper right corner of the first-layer map, which specialize in Tennis and International Soccer (UEFA Cup) on unit (5/1), Austrian Soccer on unit (6/1), and other sports articles, such as short result listings, on unit (7/1).
Again, we find the second-layer maps organized accordingly: for example, the reports on International Soccer are located on the lower right side of their second-layer map, neighboring the second-layer map representing solely Austrian Soccer.
GHSOMs with the complete article collection of 1999
As a more extensive experiment, we created a GHSOM of the full article collection, comprising 47,684 articles represented by
2,487-dimensional feature vectors.
The top-layer map evolved to a size of 2 x 3 units.
Please note how a very concise cluster of articles in the upper right corner is already sufficiently
represented at the top-layer map, so this cluster is not further expanded at subsequent layers.
The articles in this cluster are weather reports. In spite of their large number (and thus a rather unfavourable layout of the
resulting table), they do not require further detailed representation.
Next to it, in the upper left corner, we find another very concise cluster, representing articles on cultural events and listings of
theater performances ("watchlist" etc.). This cluster is represented in more detail on a second-layer map of 5 x 6 units.
Again, the degree of representation at this level is sufficient, and the second-layer map in this branch is not expanded any
further.
A similar situation can be found with a cluster of articles on German
national politics originating from the lower right corner of the top-layer map, or a cluster of
sports articles originating from the unit above it.
The majority of articles are represented in greater detail in the branch originating from the lower right corner of the top-layer
map. This second-layer map expanded into 6 x 6 units, most of
which were expanded into several further layers of
the hierarchy, whereas others did not require further expansion.
Publications
More detailed descriptions of experiments using this data collection can be found in our publications:
- M. Dittenbach, A. Rauber, and D. Merkl:
Business, Culture, Politics, and Sports -- How to Find Your Way Through a Bulk of News? On Content-Based Hierarchical Structuring
and Organization of Large Document Archives
In: Proceedings of the 12th International Conference on Database and Expert Systems Applications (DEXA01), September 3-7, 2001, Munich,
Germany, Springer Lecture Notes in Computer Science, Springer, 2001.
Abstract,
HTML,
gnu-zipped Postscript (160 KB),
PDF (197 KB),
BibTeX
- M. Dittenbach, A. Rauber, and D. Merkl:
Recent Advances with the Growing Hierarchical Self-Organizing Map
In: Allinson, N., Yin, H., Allinson, L., and Slack, J. (eds.): Advances in Self-Organizing Maps: Proceedings of the 3rd Workshop
on Self-Organizing Maps, June 13-15, 2001, Lincoln, England, Springer, 2001.
Abstract,
HTML,
gnu-zipped Postscript (160 KB),
PDF (1.9 MB),
BibTeX
- A. Rauber, M. Dittenbach and D. Merkl:
Automatically Detecting and Organizing Documents into Topic Hierarchies: A Neural Network Based Approach to Bookshelf Creation and Arrangement.
In: Proceedings of the 4th European Conference on Research and Advanced Technologies for Digital Libraries (ECDL2000), September 18-20, 2000, Lisbon, Portugal.
Abstract,
HTML,
gnu-zipped Postscript (110 KB),
PDF (126 KB),
BibTeX
Up to the SOMLib Digital Library Homepage
Comments: rauber@ifs.tuwien.ac.at