at.tuwien.ifs.somtoolbox.visualization.clustering
Class KMeans

java.lang.Object
  extended by at.tuwien.ifs.somtoolbox.visualization.clustering.KMeans
Direct Known Subclasses:
UnitKMeans

public class KMeans
extends Object

A straightforward implementation of classic K-Means clustering, kept deliberately simple.

Version:
$Id: KMeans.java 3921 2010-11-05 12:54:53Z mayer $
Author:
Robert Neumayer

Nested Class Summary
static class KMeans.InitType
           
 
Field Summary
protected  Cluster[] clusters
           
protected  double[][] data
           
private  double[] differences
           
private  Hashtable<Integer,Integer> instancesInClusters
           
private  int k
           
private  int lastNumberOfUpdates
           
private  double[] maxValues
           
private  double[] minValues
           
private static int NUMBER_OF_UPDATE_RANGE
           
private  int numberOfAttributes
           
private  int numberOfInstances
           
private static long RANDOM_SEED
           
 
Constructor Summary
KMeans(int k, double[][] data)
          Default constructor (as much defaulting as possible).
KMeans(int k, double[][] data, KMeans.InitType initialisation)
          Instantiate a new KMeans object with the given initialisation method.
KMeans(int k, double[][] data, KMeans.InitType initialisation, DistanceMetric distanceFunction)
          Construct a new KMeans object with the given initialisation method and distance metric.
 
Method Summary
private  void calculateNewCentroids()
          Batch calculation of all cluster centroids.
 double[][] getClusterCentroids()
          Get a double[][] of all cluster centroids.
 Cluster[] getClusters()
           
 double[][] getClusterVariances()
           
 double[][] getData()
           
 double[] getDifferences()
           
private  int getIndexOfClosestCluster(double[] instance)
          Get the index of the closest cluster for the given instance.
 double[] getMaxValues()
           
 double[][] getMinMaxNormalisedClusterCentroids()
          Get a double[][] of all cluster centroids, normalised in the range of the original data.
 double[][] getMinMaxNormalisedClusterCentroidsWithin()
          Get a double[][] of all cluster centroids, normalised in the range of the centroids.
 double[] getMinValues()
           
 int[][] getOccurrenceLabels(int numberOfLabels)
          Get a set of labels for the given clustering based on the occurrences of attributes within clusters.
 double getSSE()
          Get the sum of the squared error for all clusters.
 double[] getSSEs()
          Get the sum of the squared error for single clusters.
private  double[] getSubstituteCentroid()
          Get a new centroid for empty clusters.
private  void initClustersEqualNumbers(DistanceMetric distanceFunction)
          Cluster centres are initialised from equally sized random chunks of the input data. For example, with 150 instances and k = 3, 50 randomly chosen instances are assigned to each cluster and its centre is calculated from these (the last cluster may be larger if numInstances mod k > 0).
private  void initClustersLinearly(DistanceMetric distanceFunction)
          This one does linear initialisation.
private  void initClustersLinearlyOnInstances(DistanceMetric distanceFunction)
          Like initClustersLinearly(DistanceMetric), but after computing the exact linear point, finds the closest instance in the data set and uses it as the centroid.
private  void initClustersRandomly(DistanceMetric distanceFunction)
          Calculate random centroids for each cluster.
private  void initClustersRandomlyOnInstances(DistanceMetric distanceFunction)
          Take random points from the input data as centroids.
private  void initMinAndMaxValues()
          Utility method to get the min, max, and diff values of the data set.
 void printCentroids()
           
 void printCentroidsShort()
           
 void printClusterIndices()
           
private  void removeEmptyClusters()
          Searches for clusters which have no instances assigned.
 void setClusterCentroids(double[][] centroids)
          Initialise the cluster centres with the given centres.
 void train()
          Train for as long as instances move between clusters.
 void train(int numberOfSteps)
          Train for a certain number of steps.
private  boolean trainingStep()
          A classic training step in the K-Means world.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

data

protected double[][] data

k

private int k

numberOfInstances

private int numberOfInstances

numberOfAttributes

private int numberOfAttributes

minValues

private double[] minValues

maxValues

private double[] maxValues

differences

private double[] differences

instancesInClusters

private Hashtable<Integer,Integer> instancesInClusters

clusters

protected Cluster[] clusters

RANDOM_SEED

private static long RANDOM_SEED

lastNumberOfUpdates

private int lastNumberOfUpdates

NUMBER_OF_UPDATE_RANGE

private static final int NUMBER_OF_UPDATE_RANGE
See Also:
Constant Field Values
Constructor Detail

KMeans

public KMeans(int k,
              double[][] data)
Default constructor (as much defaulting as possible). Uses linear initialisation and Euclidean distance.

Parameters:
k - number of clusters
data - the data to cluster

KMeans

public KMeans(int k,
              double[][] data,
              KMeans.InitType initialisation)
Instantiate a new KMeans object with the given initialisation method.

Parameters:
k - number of clusters
data - the data to cluster
initialisation - the initialisation method used (to be chosen from InitType)

KMeans

public KMeans(int k,
              double[][] data,
              KMeans.InitType initialisation,
              DistanceMetric distanceFunction)
Construct a new KMeans object with the given initialisation method and distance metric.

Parameters:
k - number of clusters
data - the data set
initialisation - initialisation type
distanceFunction - the distance metric to use (an LnMetric of your choice)
Method Detail

train

public void train(int numberOfSteps)
Train for a fixed number of steps. Training does not stop early: all steps are always carried out.

Parameters:
numberOfSteps - the number of training steps to perform

train

public void train()
Train for as long as instances move between clusters. "Not moving" means that there hasn't been a change in the last NUMBER_OF_UPDATE_RANGE steps (5).
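
The stopping rule described above can be sketched as follows. This is a minimal illustration and not the toolbox code: the class StableStopSketch, the method trainUntilStable and the idea of a step callback that reports the number of changed assignments are all assumptions made for the example. Only the constant name and its value of 5 come from the text.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch of a "train until nothing moves" loop. */
final class StableStopSketch {
    // Name and value as quoted in the description above.
    static final int NUMBER_OF_UPDATE_RANGE = 5;

    /**
     * Calls step (assumed to return how many instances changed cluster) until it
     * has reported zero changes NUMBER_OF_UPDATE_RANGE times in a row.
     * Returns the number of steps performed.
     */
    static int trainUntilStable(IntSupplier step) {
        int quietSteps = 0;
        int steps = 0;
        while (quietSteps < NUMBER_OF_UPDATE_RANGE) {
            int changes = step.getAsInt();
            steps++;
            // Any change resets the count of consecutive quiet steps.
            quietSteps = (changes == 0) ? quietSteps + 1 : 0;
        }
        return steps;
    }

    public static void main(String[] args) {
        int[] changes = {3, 1, 0, 0, 2, 0, 0, 0, 0, 0};
        int[] next = {0};
        System.out.println(trainUntilStable(() -> changes[next[0]++]));
    }
}
```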


removeEmptyClusters

private void removeEmptyClusters()
Searches for clusters which have no instances assigned. These are then replaced.

FIXME: the source comment still carries the following code fragment:

    private void substituteEmptyClusters() {
        System.out.println("Removing empty clusters:");
        double[] replacementCentroid = new double[clusters[0].getCentroid().length];
        for (int i = 0; i < clusters.length; i++) {
            if (clusters[i].getNumberOfInstances() != 0)
                replacementCentroid = clusters[i].getCentroid().clone();
        }
        for (int i = 0; i < clusters.length; i++) {
            if (clusters[i].getNumberOfInstances() == 0)
                clusters[i].setCentroid(replacementCentroid);
        }
    }


trainingStep

private boolean trainingStep()
A classic training step in the K-Means world.

Returns:
true if this step brought any changes, false otherwise. Note that false is also returned if there were as many changes as in the last step.

calculateNewCentroids

private void calculateNewCentroids()
Batch calculation of all cluster centroids.
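
A batch centroid update of this kind is commonly computed as the per-cluster mean of the assigned instances. The sketch below assumes that reading. CentroidUpdateSketch, meanCentroids and the assignment array are hypothetical names for illustration and do not reflect the private implementation.

```java
import java.util.Arrays;

/** Hypothetical sketch: recompute all centroids in one pass over the data. */
final class CentroidUpdateSketch {
    /**
     * assignment[i] is the cluster index of data[i]. Each centroid becomes the
     * mean of its instances; a cluster without instances is left as a zero
     * vector here and would need separate handling.
     */
    static double[][] meanCentroids(double[][] data, int[] assignment, int k) {
        int dims = data[0].length;
        double[][] centroids = new double[k][dims];
        int[] counts = new int[k];
        for (int i = 0; i < data.length; i++) {
            counts[assignment[i]]++;
            for (int d = 0; d < dims; d++) {
                centroids[assignment[i]][d] += data[i][d];
            }
        }
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                continue; // empty cluster
            }
            for (int d = 0; d < dims; d++) {
                centroids[c][d] /= counts[c];
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {2, 2}, {10, 10}};
        int[] assignment = {0, 0, 1};
        System.out.println(Arrays.deepToString(meanCentroids(data, assignment, 2)));
    }
}
```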


getSubstituteCentroid

private double[] getSubstituteCentroid()
Get a new centroid for empty clusters. The cluster with the largest SSE is selected, and within it the instance with the largest squared error to that cluster's centroid is taken as the new centroid.

Returns:
a new centroid (a clone of the selected instance)
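
The selection described above can be sketched as follows. The names SubstituteCentroidSketch, substitute and the assignment array are assumptions for illustration, and squared Euclidean error is assumed as the error measure. The actual method works on the class's own Cluster objects.

```java
import java.util.Arrays;

/** Hypothetical sketch: pick a replacement centroid for an empty cluster. */
final class SubstituteCentroidSketch {
    static double squaredError(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns a clone of the worst-fitting instance of the cluster with the largest SSE. */
    static double[] substitute(double[][] data, int[] assignment, double[][] centroids) {
        double[] sse = new double[centroids.length];
        for (int i = 0; i < data.length; i++) {
            sse[assignment[i]] += squaredError(data[i], centroids[assignment[i]]);
        }
        // Start from a cluster known to be non-empty, then look for a larger SSE.
        int worst = assignment[0];
        for (int c = 0; c < sse.length; c++) {
            if (sse[c] > sse[worst]) {
                worst = c;
            }
        }
        int farthest = -1;
        double farthestError = -1;
        for (int i = 0; i < data.length; i++) {
            if (assignment[i] != worst) {
                continue;
            }
            double error = squaredError(data[i], centroids[worst]);
            if (error > farthestError) {
                farthestError = error;
                farthest = i;
            }
        }
        return data[farthest].clone();
    }

    public static void main(String[] args) {
        double[][] data = {{0}, {1}, {10}, {20}};
        int[] assignment = {0, 0, 1, 1};
        double[][] centroids = {{0.5}, {14}, {0}}; // cluster 2 is empty
        System.out.println(Arrays.toString(substitute(data, assignment, centroids)));
    }
}
```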

getIndexOfClosestCluster

private int getIndexOfClosestCluster(double[] instance)
Get the index of the closest cluster for the given instance. If several clusters are equally distant, the first one found is assigned, so clusters with lower indices tend to become larger. The impact is expected to be small; assigning randomly among equally distant clusters would avoid the bias, but would require a few additional steps here.

Parameters:
instance - the data vector to be assigned
Returns:
index of the closest cluster centre
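
The tie-breaking behaviour can be illustrated with a short sketch. It assumes squared Euclidean distance, whereas the class accepts any DistanceMetric, and the names ClosestClusterSketch and indexOfClosest are hypothetical.

```java
/** Hypothetical sketch: nearest-centroid lookup where the first minimum wins. */
final class ClosestClusterSketch {
    static int indexOfClosest(double[] instance, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double distance = 0;
            for (int d = 0; d < instance.length; d++) {
                double diff = instance[d] - centroids[c][d];
                distance += diff * diff;
            }
            // Strict '<' keeps the earlier cluster when distances are equal.
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0}, {2}};
        // The instance 1.0 is equally far from both centroids.
        System.out.println(indexOfClosest(new double[] {1.0}, centroids));
    }
}
```

Replacing the strict comparison with a random choice among equal distances would remove the bias towards low indices, at the cost of collecting all tied candidates first.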

getOccurrenceLabels

public int[][] getOccurrenceLabels(int numberOfLabels)
Get a set of labels for the given clustering based on the occurrences of attributes within clusters, i.e. there is a preference for labels occurring in many instances.

Returns:
labels

initClustersRandomly

private void initClustersRandomly(DistanceMetric distanceFunction)
Calculate random centroids for each cluster.


initClustersEqualNumbers

private void initClustersEqualNumbers(DistanceMetric distanceFunction)
Cluster centres are initialised from equally sized random chunks of the input data. For example, with 150 instances and k = 3, 50 randomly chosen instances are assigned to each cluster and its centre is calculated from these (the last cluster may be larger if numInstances mod k > 0).
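
One way to realise this chunked initialisation is sketched below. The class EqualChunksInitSketch, its methods and the seed parameter are assumptions for illustration, and the sketch assumes 1 <= k <= number of instances. The private method itself may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: centroids from equally sized random chunks of the data. */
final class EqualChunksInitSketch {
    /** n / k instances per chunk; the remainder (n mod k) goes to the last chunk. */
    static int[] chunkSizes(int n, int k) {
        int[] sizes = new int[k];
        Arrays.fill(sizes, n / k);
        sizes[k - 1] += n % k;
        return sizes;
    }

    static double[][] initialCentroids(double[][] data, int k, long seed) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            order.add(i);
        }
        Collections.shuffle(order, new Random(seed)); // random assignment to chunks
        int dims = data[0].length;
        int[] sizes = chunkSizes(data.length, k);
        double[][] centroids = new double[k][dims];
        int next = 0;
        for (int c = 0; c < k; c++) {
            for (int j = 0; j < sizes[c]; j++) {
                double[] instance = data[order.get(next++)];
                for (int d = 0; d < dims; d++) {
                    centroids[c][d] += instance[d];
                }
            }
            for (int d = 0; d < dims; d++) {
                centroids[c][d] /= sizes[c]; // mean of the chunk
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(chunkSizes(150, 3)));
        System.out.println(Arrays.toString(chunkSizes(10, 3)));
    }
}
```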


initClustersRandomlyOnInstances

private void initClustersRandomlyOnInstances(DistanceMetric distanceFunction)
Take random points from the input data as centroids.
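
A minimal sketch of picking data points as initial centroids follows. Choosing distinct instances and passing a seed are assumptions of this example (the class has a RANDOM_SEED field, but how it is used is not documented here). RandomInstancesInitSketch and its methods are hypothetical names, and k must not exceed the number of instances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: use k distinct, randomly chosen instances as centroids. */
final class RandomInstancesInitSketch {
    static int[] pickIndices(int numberOfInstances, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numberOfInstances; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        int[] picked = new int[k];
        for (int c = 0; c < k; c++) {
            picked[c] = indices.get(c); // first k of a shuffled list are distinct
        }
        return picked;
    }

    static double[][] centroids(double[][] data, int k, long seed) {
        int[] picked = pickIndices(data.length, k, seed);
        double[][] result = new double[k][];
        for (int c = 0; c < k; c++) {
            result[c] = data[picked[c]].clone(); // copy, so the data stays untouched
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {1, 1}, {5, 5}, {9, 9}};
        System.out.println(Arrays.deepToString(centroids(data, 2, 1L)));
    }
}
```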


initClustersLinearly

private void initClustersLinearly(DistanceMetric distanceFunction)
Performs linear initialisation. In two-dimensional space, the cluster centres are placed on the diagonal of a square.
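
The diagonal placement can be sketched as below. The exact spacing used by the toolbox is not stated, so the even spacing at fractions (c + 1) / (k + 1) between the per-attribute minimum and maximum is an assumption, as are the names LinearInitSketch and linearCentroids.

```java
import java.util.Arrays;

/** Hypothetical sketch: centroids spread evenly along the min-to-max diagonal. */
final class LinearInitSketch {
    static double[][] linearCentroids(double[] minValues, double[] maxValues, int k) {
        double[][] centroids = new double[k][minValues.length];
        for (int c = 0; c < k; c++) {
            // Assumed spacing: interior points only, never the corners themselves.
            double fraction = (c + 1) / (double) (k + 1);
            for (int d = 0; d < minValues.length; d++) {
                centroids[c][d] = minValues[d] + fraction * (maxValues[d] - minValues[d]);
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] centroids = linearCentroids(new double[] {0, 0}, new double[] {4, 8}, 3);
        System.out.println(Arrays.deepToString(centroids));
    }
}
```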


initClustersLinearlyOnInstances

private void initClustersLinearlyOnInstances(DistanceMetric distanceFunction)
Like initClustersLinearly(DistanceMetric), but after computing the exact linear point, finds the closest instance in the data set and uses it as the centroid.
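
Snapping each linear point to the nearest data instance could look like the sketch below. It assumes evenly spaced points on the min-to-max diagonal and squared Euclidean distance. LinearOnInstancesSketch and its method are hypothetical names, not the toolbox implementation.

```java
import java.util.Arrays;

/** Hypothetical sketch: linear initialisation snapped to the closest instances. */
final class LinearOnInstancesSketch {
    static double[][] centroids(double[][] data, int k) {
        int dims = data[0].length;
        double[] min = data[0].clone();
        double[] max = data[0].clone();
        for (double[] row : data) {
            for (int d = 0; d < dims; d++) {
                min[d] = Math.min(min[d], row[d]);
                max[d] = Math.max(max[d], row[d]);
            }
        }
        double[][] result = new double[k][];
        for (int c = 0; c < k; c++) {
            double fraction = (c + 1) / (double) (k + 1); // assumed spacing
            double[] nearest = data[0];
            double best = Double.POSITIVE_INFINITY;
            for (double[] row : data) {
                double distance = 0;
                for (int d = 0; d < dims; d++) {
                    double point = min[d] + fraction * (max[d] - min[d]);
                    distance += (row[d] - point) * (row[d] - point);
                }
                if (distance < best) {
                    best = distance;
                    nearest = row;
                }
            }
            result[c] = nearest.clone();
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {1, 2.1}, {2, 4}, {3, 5.5}, {4, 8}};
        System.out.println(Arrays.deepToString(centroids(data, 3)));
    }
}
```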


setClusterCentroids

public void setClusterCentroids(double[][] centroids)
                         throws MoreCentresThanKException
Initialise the cluster centres with the given centres.

Parameters:
centroids - centroids for clusters.
Throws:
MoreCentresThanKException - if the number of given centres is larger or smaller than k.

initMinAndMaxValues

private void initMinAndMaxValues()
Utility method to get the min, max, and diff values of the data set. This is used for scaling the (random) values in the initialisation functions.
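
The sketch below shows how per-attribute minimum, maximum and difference values can be gathered and then used to scale a random value into the data range. MinMaxSketch and randomPoint are hypothetical names, and the formula min + r * diff is an assumption about how the scaling is applied.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch: per-attribute min, max and difference of a data set. */
final class MinMaxSketch {
    final double[] min;
    final double[] max;
    final double[] diff;

    MinMaxSketch(double[][] data) {
        int dims = data[0].length;
        min = data[0].clone();
        max = data[0].clone();
        diff = new double[dims];
        for (double[] row : data) {
            for (int d = 0; d < dims; d++) {
                min[d] = Math.min(min[d], row[d]);
                max[d] = Math.max(max[d], row[d]);
            }
        }
        for (int d = 0; d < dims; d++) {
            diff[d] = max[d] - min[d];
        }
    }

    /** Scales uniform random numbers from [0, 1) into the range of each attribute. */
    double[] randomPoint(Random random) {
        double[] point = new double[min.length];
        for (int d = 0; d < point.length; d++) {
            point[d] = min[d] + random.nextDouble() * diff[d];
        }
        return point;
    }

    public static void main(String[] args) {
        MinMaxSketch stats = new MinMaxSketch(new double[][] {{1, 5}, {3, 2}, {2, 9}});
        System.out.println(Arrays.toString(stats.diff));
    }
}
```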


getClusterCentroids

public double[][] getClusterCentroids()
Get a double[][] of all cluster centroids.

Returns:
all cluster centroids

getClusterVariances

public double[][] getClusterVariances()

getMinMaxNormalisedClusterCentroids

public double[][] getMinMaxNormalisedClusterCentroids()
Get a double[][] of all cluster centroids. Normalised in the range of the original data.

Returns:
all cluster centroids

getMinMaxNormalisedClusterCentroidsWithin

public double[][] getMinMaxNormalisedClusterCentroidsWithin()
Get a double[][] of all cluster centroids. Normalised in the range of the centroids.

Returns:
all cluster centroids
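
The two normalised getters above differ only in where the range comes from. The sketch below assumes min-max scaling to [0, 1] with (value - min) / (max - min), using either the data range or the range of the centroids themselves. The names NormalisedCentroidsSketch, normalise and normaliseWithin are hypothetical, and mapping a zero range to 0 is a choice made for this example.

```java
import java.util.Arrays;

/** Hypothetical sketch: min-max normalisation of centroids. */
final class NormalisedCentroidsSketch {
    /** Scales each attribute with the given range, e.g. the range of the original data. */
    static double[][] normalise(double[][] centroids, double[] min, double[] max) {
        double[][] result = new double[centroids.length][];
        for (int c = 0; c < centroids.length; c++) {
            result[c] = new double[centroids[c].length];
            for (int d = 0; d < centroids[c].length; d++) {
                double range = max[d] - min[d];
                result[c][d] = (range == 0) ? 0 : (centroids[c][d] - min[d]) / range;
            }
        }
        return result;
    }

    /** Same scaling, but the range is taken from the centroids themselves. */
    static double[][] normaliseWithin(double[][] centroids) {
        double[] min = centroids[0].clone();
        double[] max = centroids[0].clone();
        for (double[] centroid : centroids) {
            for (int d = 0; d < centroid.length; d++) {
                min[d] = Math.min(min[d], centroid[d]);
                max[d] = Math.max(max[d], centroid[d]);
            }
        }
        return normalise(centroids, min, max);
    }

    public static void main(String[] args) {
        double[][] centroids = {{1, 10}, {3, 10}, {2, 10}};
        System.out.println(Arrays.deepToString(normaliseWithin(centroids)));
    }
}
```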

getMinValues

public double[] getMinValues()

getMaxValues

public double[] getMaxValues()

getDifferences

public double[] getDifferences()

getClusters

public Cluster[] getClusters()

getSSE

public double getSSE()
Get the sum of the squared error for all clusters.

Returns:
SSE.

getSSEs

public double[] getSSEs()
Get the sum of the squared error for single clusters.

Returns:
the SSE of each cluster.
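
For reference, the sum of squared errors of a cluster is the sum of the squared distances between its instances and its centroid, and the overall SSE is the sum over all clusters. The sketch below assumes squared Euclidean distance. SseSketch and its methods are hypothetical names, not the toolbox code.

```java
import java.util.Arrays;

/** Hypothetical sketch: per-cluster and total sum of squared errors. */
final class SseSketch {
    /** assignment[i] is the cluster index of data[i]. */
    static double[] clusterSses(double[][] data, int[] assignment, double[][] centroids) {
        double[] sses = new double[centroids.length];
        for (int i = 0; i < data.length; i++) {
            double[] centroid = centroids[assignment[i]];
            for (int d = 0; d < data[i].length; d++) {
                double diff = data[i][d] - centroid[d];
                sses[assignment[i]] += diff * diff;
            }
        }
        return sses;
    }

    static double totalSse(double[][] data, int[] assignment, double[][] centroids) {
        double total = 0;
        for (double sse : clusterSses(data, assignment, centroids)) {
            total += sse;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] data = {{0}, {2}, {10}};
        int[] assignment = {0, 0, 1};
        double[][] centroids = {{1}, {10}};
        System.out.println(Arrays.toString(clusterSses(data, assignment, centroids)));
        System.out.println(totalSse(data, assignment, centroids));
    }
}
```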

printCentroids

public void printCentroids()

printCentroidsShort

public void printCentroidsShort()

printClusterIndices

public void printClusterIndices()

getData

public double[][] getData()
Returns:
the data.