Vienna University of Technology
Institute of Software Technology and Interactive Systems
Data Mining with the Java SOMToolbox

Benchmark Data Sets, Trained Maps, Sample Files

Iris

The Iris data set is one of the best-known data sets in the machine learning domain, and was first published in [1].
The data set consists of three classes of iris plants, each with 50 instances. The plants are described by four attributes: sepal length and width, and petal length and width. One class (setosa) is linearly separable from the other two (virginica and versicolor), while the latter two are not linearly separable from each other.

[1] Fisher, R. A. "The use of multiple measurements in taxonomic problems". In Annals of Eugenics, 7, Part II, 179-188 (1936)

[2] Iris dataset in the UCI Machine Learning Repository

Boston Housing

This dataset describes housing values in the suburbs of Boston [1]. The dataset contains 506 instances, which are described by 13 attributes. The dataset can be categorised into 92 classes, each representing a district or suburb of Boston.

[1] Harrison, D. and Rubinfeld, D.L. "Hedonic prices and the demand for clean air". In Journal of Environmental Economics and Management, vol. 5, 81-102, 1978.

[2] Boston Housing dataset in the UCI Machine Learning Repository

Animals (16 animals)

This data set has been used for several experiments with SOMs. It comprises 16 records of different kinds of animals, described by 13 binary-valued attributes. The animals can be categorised into three classes: birds, carnivores, and herbivores.

[1] Helge Ritter and Teuvo Kohonen: "Self-organizing semantic maps". In Biological Cybernetics, 61(4):241-254. Springer, 1989

Zoo (101 animals)

This data set is similar to the Animals data set above, but is larger, containing 101 animals. They are described by 20 boolean-valued attributes, and can be categorised into seven different classes.

Spambase

Artificial data sets

Projection of the chainlink data set

The chain link data set, sometimes also called intertwined rings, is a classic example of a data set that provokes topology preservation violations. The data set contains two rings, each two-dimensional, that are intertwined in a three-dimensional space. When this data set is projected onto a two-dimensional output space, the rings have to "break".
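
The geometry described above can be sketched in a few lines of Java. This is only an illustration, not the generator used for the files offered here: the class name ChainLinkSketch, the unit radius, the ring centres, and the absence of noise are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: two interlocked rings in 3D (not the original generator).
class ChainLinkSketch {

    /** Returns 2 * pointsPerRing points; the first half lies on ring 1, the second half on ring 2. */
    static double[][] generate(int pointsPerRing, long seed) {
        Random rnd = new Random(seed);
        double[][] data = new double[2 * pointsPerRing][];
        for (int i = 0; i < pointsPerRing; i++) {
            // Ring 1: unit circle in the x-y plane, centred at the origin (assumed layout).
            double a = 2 * Math.PI * rnd.nextDouble();
            data[i] = new double[] { Math.cos(a), Math.sin(a), 0.0 };

            // Ring 2: unit circle in the x-z plane, centred at (1, 0, 0),
            // so it passes through the middle of ring 1 and the two are linked.
            double b = 2 * Math.PI * rnd.nextDouble();
            data[pointsPerRing + i] = new double[] { 1.0 + Math.cos(b), 0.0, Math.sin(b) };
        }
        return data;
    }

    public static void main(String[] args) {
        double[][] data = generate(500, 42L);
        System.out.println(data.length + " points, first point: " + Arrays.toString(data[0]));
    }
}
```

Each ring on its own lies in a plane, but the two planes are perpendicular and the rings pass through each other, which is why no two-dimensional map can keep both rings closed at once.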

10 clusters

Projection of the 10-clusters data set

This artificial data set consists of data points arranged in 10 distinct clusters. The clusters were generated from Gaussian distributions with different densities (standard deviations). The data set has ten dimensions.
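
A data set of this kind could be generated roughly as follows. This is a minimal sketch under stated assumptions, not the procedure used to create the file offered here: the class name TenClustersSketch, the range of the cluster centres, the number of points per cluster, and the particular standard deviations are all hypothetical.

```java
import java.util.Random;

// Hypothetical sketch: isotropic Gaussian clusters with differing standard deviations.
class TenClustersSketch {

    /** Returns clusters * pointsPerCluster rows of length dim, grouped cluster by cluster. */
    static double[][] generate(int clusters, int pointsPerCluster, int dim, long seed) {
        Random rnd = new Random(seed);
        double[][] data = new double[clusters * pointsPerCluster][dim];
        for (int c = 0; c < clusters; c++) {
            // Assumed: cluster centres drawn uniformly from [0, 10) in every dimension.
            double[] centre = new double[dim];
            for (int d = 0; d < dim; d++) {
                centre[d] = 10.0 * rnd.nextDouble();
            }
            // Assumed: each cluster gets its own standard deviation, hence its own density.
            double sigma = 0.1 + 0.05 * c;
            for (int p = 0; p < pointsPerCluster; p++) {
                double[] row = data[c * pointsPerCluster + p];
                for (int d = 0; d < dim; d++) {
                    row[d] = centre[d] + sigma * rnd.nextGaussian();
                }
            }
        }
        return data;
    }

    public static void main(String[] args) {
        double[][] data = generate(10, 100, 10, 7L);
        System.out.println(data.length + " points with " + data[0].length + " dimensions");
    }
}
```

With randomly placed centres, clusters may overlap; a generator that needs clearly separated clusters would have to enforce a minimum distance between centres.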

Sample properties files

Sample template vector files

Below is a range of sample template vector files that can be used together when analysing music described by the Rhythm Patterns suite of audio features. For more details, please visit the Audio Feature Extraction website of our Music Information Retrieval Group, and check the how-to for feature extraction.