at.tuwien.ifs.somtoolbox.reportgenerator
Class DatasetInformation

java.lang.Object
  extended by at.tuwien.ifs.somtoolbox.reportgenerator.DatasetInformation

public class DatasetInformation
extends Object

FIXME: most probably all the methods in this class should be part of InputData and SOMLibClassInformation, respectively.
This class collects all available information about the values in the input dataset (from the input file, the template vector file, ...) and may compute some properties of its own. Its job is to provide one centralized place where the actual report generators (the output objects) can ask for the data.

Version:
$Id: DatasetInformation.java 3590 2010-05-21 10:43:45Z mayer $
Author:
Sebastian Skritek (0226286, Sebastian.Skritek@gmx.at)

Field Summary
private  SOMLibClassInformation classInfo
           
private  String classInformationFilename
           
private  String[] classNames
           
(package private)  boolean denseData
           
private  boolean[] discrete
          only an estimation - we call values discrete if they are integer values
static int DISCRETE
           
private  EditableReportProperties EP
           
private  InputData inputData
           
private  String inputDataFilename
           
private  TemplateVector inputTemplate
           
private  double[] max
          holds for each dimension the maximal value
static int MAX_VALUE
           
private  double[] mean
          holds for each dimension the mean value
static int MEAN_VALUE
           
private  double[] min
          holds for each dimension the minimal value
static int MIN_VALUE
           
private  boolean[] only01
          we check whether there are values != 0 or 1
static int ONLY01
           
private  Vector<Integer> selectedIndices
           
private  String tvFilename
           
private  double[] var
          holds for each dimension the variance
static int VAR_VALUE
           
static int ZERO_VALUE
           
private  int[] zeroValues
          holds for each dimension the number of 0 - values.
 
Constructor Summary
DatasetInformation(Vector<Integer> selectedIndices, String inputDataFilename, String tvFilename, String classInformationFile, EditableReportProperties EP)
          creates a new object storing information about a given dataset
DatasetInformation(Vector<Integer> selectedIndices, String inputDataFilename, String tvFilename, String classInformationFile, EditableReportProperties EP, CommonSOMViewerStateData state)
           
 
Method Summary
private static String applyNameFix(String target)
          small helper method for getTrainingDataInfo
 double calculateAccumulatedVariance()
          This is a small helper method used to display the dimensions in the top part of the output document. It accumulates the variances and calculates their percentage of the total variance.
private  void checkDatatypes()
          Runs over all dimensions of the input vectors and gathers information about their data ranges and other properties: the min and max value within each dimension (this.min, this.max); whether a dimension contains only 0/1 values (this.only01); whether a dimension contains only plain integer values (this.discrete); how many 0 (=missing?) values are in each dimension (this.zeroValues). The results are stored in the appropriate arrays.
 boolean classInfoAvailable()
          Returns whether class information is attached to the input vectors. Does not check whether it is a valid file, only whether a String with length > 0 has been specified as path.
 String getAttributeLabel(int dim)
          returns the label (that is the name defined for an attribute in the template vector file) for the specified attribute.
 boolean getBoolDataProps(int type, int attribute)
          FIXME: split this into simple single getter methods...
 int[] getClassColorRGB(int c)
          returns an array of length three containing the r,g,b values of the colour used to colour the specified class
 int getClassIndexOfInput(String inputLabel)
          returns the index of the class that the input vector specified by its label belongs to
 SOMLibClassInformation getClassInfo()
           
 String getClassInformationFilename()
          returns the path of the file containing the class information
 double[] getClassMeanVector(int classId)
          returns the mean vector of all input items belonging to the given class
 Vector<String> getClusterName(ClusterNode node, int clusterByValue, int nodeDepth)
          Tries to name a cluster by the input data mapped to units lying within the cluster. For naming the cluster, some very simple heuristics are used: first, if there are any labels of the cluster that correspond to 0/1 attributes, and their values are all 0 (or 1) in the cluster, the name of this attribute is included in the name of the cluster.
 EditableReportProperties getEP()
          Returns the Editable Report Properties for the Semantic Report
 InputData getInputData()
          returns the InputData object storing information about the input data used for training the SOM.
 String getInputDataFilename()
          Returns the complete filename of the file containing the input data. Complete filename means including the path.
 InputDatum getInputDatum(int d)
          returns the InputDatum at the specified index
 InputDatum getInputDatum(String name)
          returns the InputDatum labelled with the specified name
 String[] getInputLabelsofClass(int classId)
          returns a list of labels of all input items belonging to the given class
 String getNameOfClass(int c)
          returns the name of the class specified by the index
 int getNumberOfClasses()
          returns the number of classes.
 int getNumberOfClassmembers(int c)
          Returns the number of input elements belonging to the given class. If no class information is attached to this input, -1 is returned.
 int getNumberOfInputVectors()
          returns the number of input vectors used for training the SOM, that is the number of different vectors present in the input file for the SOM training.
 int getNumberOfSelectedInputs()
          returns the number of inputs the user has selected to get information about their position on the SOM
 int getNumberOfZeroValues(int index)
          returns the number of input vectors that have 0 as value in the given dimension
 double getNumericalDataProps(int type, int attribute)
          FIXME: split this into simple single getter methods...
 double[][] getPCAdeterminedDims()
          This method calculates the most important dimensions of the dataset according to the results of a PCA, and places the resulting dimension indices in a new array under the first index.
 int getSelectedInputId(int index)
          Returns the id of the input vector at position index in the list of selected inputs. Each input vector is identified by an id, which is its index in the complete input.
 String getTemplateFilename()
          Returns the complete filename of the file containing the template data. Complete filename means including the path.
 String[] getTrainingDataInfo()
          Returns the names of the 3 files used for training
 int getVectorDim()
          returns the dimension of the input vectors, that is the same as the number of attributes used to describe the objects.
 boolean is01(int index)
          returns whether the values in the given dimension are all only 0 or 1
 boolean isDiscrete(int index)
          Returns whether our heuristic estimates this dimension to contain discrete values. This is the case if all values in this dimension are exact integer values.
 boolean isNormalized()
          returns whether the input set has been normalized (in fact, this function returns the result of InputData.isNormalizedToUnitLength())
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MIN_VALUE

public static final int MIN_VALUE
See Also:
Constant Field Values

MAX_VALUE

public static final int MAX_VALUE
See Also:
Constant Field Values

MEAN_VALUE

public static final int MEAN_VALUE
See Also:
Constant Field Values

VAR_VALUE

public static final int VAR_VALUE
See Also:
Constant Field Values

ZERO_VALUE

public static final int ZERO_VALUE
See Also:
Constant Field Values

ONLY01

public static final int ONLY01
See Also:
Constant Field Values

DISCRETE

public static final int DISCRETE
See Also:
Constant Field Values

selectedIndices

private Vector<Integer> selectedIndices

inputData

private InputData inputData

inputDataFilename

private String inputDataFilename

tvFilename

private String tvFilename

inputTemplate

private TemplateVector inputTemplate

classInfo

private SOMLibClassInformation classInfo

classNames

private String[] classNames

classInformationFilename

private String classInformationFilename

EP

private EditableReportProperties EP

only01

private boolean[] only01
we check whether there are values != 0 or 1


discrete

private boolean[] discrete
only an estimation - we call values discrete if they are integer values


min

private double[] min
holds for each dimension the minimal value


max

private double[] max
holds for each dimension the maximal value


mean

private double[] mean
holds for each dimension the mean value


var

private double[] var
holds for each dimension the variance


zeroValues

private int[] zeroValues
holds for each dimension the number of 0 - values. Using this we estimate the missing values


denseData

boolean denseData

Constructor Detail

DatasetInformation

public DatasetInformation(Vector<Integer> selectedIndices,
                          String inputDataFilename,
                          String tvFilename,
                          String classInformationFile,
                          EditableReportProperties EP)
creates a new object storing information about a given dataset

Parameters:
selectedIndices - Vector of indices of the input items selected for more information
inputDataFilename - the path to the file containing the input data
tvFilename - the path to the file containing the template vector
classInformationFile - the path to the file containing the class information
EP - the customized Report Features of the Semantic Report

DatasetInformation

public DatasetInformation(Vector<Integer> selectedIndices,
                          String inputDataFilename,
                          String tvFilename,
                          String classInformationFile,
                          EditableReportProperties EP,
                          CommonSOMViewerStateData state)

Method Detail

classInfoAvailable

public boolean classInfoAvailable()
Returns whether class information is attached to the input vectors. Does not check whether it is a valid file, only whether a String with length > 0 has been specified as path.

Returns:
true if a class information file (.cls) has been specified, false otherwise

getClassInfo

public SOMLibClassInformation getClassInfo()

getNumberOfInputVectors

public int getNumberOfInputVectors()
returns the number of input vectors used for training the SOM, that is the number of different vectors present in the input file for the SOM training.

Returns:
the number of input vectors that appear in the input file

getClassMeanVector

public double[] getClassMeanVector(int classId)
returns the mean vector of all input items belonging to the given class

Parameters:
classId - the id of the class for which the mean vector shall be calculated
Returns:
the mean vector of the class
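
A class mean vector of this kind can be sketched as follows. This is a minimal stand-alone illustration, not the SOMToolbox implementation: the class name ClassMean, the plain double[][] data layout, and the classOf array mapping each input vector to a class index are all assumptions made for the example.

```java
// Hypothetical sketch: averages all vectors that belong to one class.
class ClassMean {

    // data[i] is input vector i; classOf[i] is its class index (assumed layout).
    static double[] classMeanVector(double[][] data, int[] classOf, int classId) {
        int dim = data[0].length;
        double[] mean = new double[dim];
        int members = 0;
        for (int i = 0; i < data.length; i++) {
            if (classOf[i] != classId) {
                continue; // vector belongs to another class
            }
            members++;
            for (int d = 0; d < dim; d++) {
                mean[d] += data[i][d];
            }
        }
        if (members == 0) {
            return mean; // assumption: an empty class yields the zero vector
        }
        for (int d = 0; d < dim; d++) {
            mean[d] /= members;
        }
        return mean;
    }

    public static void main(String[] args) {
        double[][] data = { {1, 2}, {3, 4}, {10, 20} };
        int[] classOf = { 0, 0, 1 };
        System.out.println(java.util.Arrays.toString(classMeanVector(data, classOf, 0)));
    }
}
```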

getVectorDim

public int getVectorDim()
returns the dimension of the input vectors, that is the same as the number of attributes used to describe the objects.

Returns:
the dimension of the input vectors

is01

public boolean is01(int index)
returns whether the values in the given dimension are all only 0 or 1

Parameters:
index - the dimension (starting with 0) for which this property is requested
Returns:
true if all input vectors contain only 0 or 1 in this dimension, false otherwise

isDiscrete

public boolean isDiscrete(int index)
Returns whether our heuristic estimates this dimension to contain discrete values. This is the case if all values in this dimension are exact integer values.

Parameters:
index - the dimension (starting with 0) for which the estimation is requested
Returns:
true if all input vectors have only plain integers as values in this dimension, false otherwise

getNumberOfZeroValues

public int getNumberOfZeroValues(int index)
returns the number of input vectors that have 0 as value in the given dimension

Parameters:
index - the dimension (starting with 0) for which the number is requested
Returns:
the number of input vectors having the value 0 in the given dimension

isNormalized

public boolean isNormalized()
returns whether the input set has been normalized (in fact, this function returns the result of InputData.isNormalizedToUnitLength())

Returns:
true if the data set is normalized, false if not

getNumericalDataProps

public double getNumericalDataProps(int type,
                                    int attribute)
FIXME: split this into simple single getter methods... !
Returns the requested value describing the distribution of the input values. The types of information available are described by the constant members of this class (this function returns numerical properties). All information is returned for the given dimension (argument attribute).

Parameters:
type - specifies the type of information to be returned: allowed are some constants defined by this class (see above)
attribute - the index of the attribute for which the value shall be returned (starting with 0)
Returns:
the requested value. If the requested type is not available, -1 is returned
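
The constant-based lookup described here, and the single getters the FIXME asks for, might look like the sketch below. The class name NumericalProps, the numeric values of the constants, and the assumption that ZERO_VALUE is served by the numerical variant are all hypothetical; the actual values are the ones listed under "Constant Field Values".

```java
// Hypothetical sketch of a type-dispatched getter next to simple single getters.
class NumericalProps {
    // Assumed constant values, chosen only for this example.
    static final int MIN_VALUE = 0;
    static final int MAX_VALUE = 1;
    static final int MEAN_VALUE = 2;
    static final int VAR_VALUE = 3;
    static final int ZERO_VALUE = 4;

    private final double[] min, max, mean, var;
    private final int[] zeroValues;

    NumericalProps(double[] min, double[] max, double[] mean, double[] var, int[] zeroValues) {
        this.min = min;
        this.max = max;
        this.mean = mean;
        this.var = var;
        this.zeroValues = zeroValues;
    }

    // Dispatches on the type constant; returns -1 for an unknown type.
    double getNumericalDataProps(int type, int attribute) {
        switch (type) {
            case MIN_VALUE:  return min[attribute];
            case MAX_VALUE:  return max[attribute];
            case MEAN_VALUE: return mean[attribute];
            case VAR_VALUE:  return var[attribute];
            case ZERO_VALUE: return zeroValues[attribute];
            default:         return -1;
        }
    }

    // The alternative suggested by the FIXME: one plain getter per property.
    double getMin(int attribute) { return min[attribute]; }
    double getMax(int attribute) { return max[attribute]; }

    public static void main(String[] args) {
        NumericalProps p = new NumericalProps(
                new double[] { -1.0, 0.0 }, new double[] { 2.0, 5.0 },
                new double[] { 0.5, 2.5 }, new double[] { 1.0, 4.0 }, new int[] { 0, 3 });
        System.out.println(p.getNumericalDataProps(MAX_VALUE, 1));
        System.out.println(p.getMin(0));
    }
}
```

One reason to prefer the single getters: with the dispatched variant, a legitimate minimum of -1 cannot be told apart from the "type not available" result.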

getBoolDataProps

public boolean getBoolDataProps(int type,
                                int attribute)
FIXME: split this into simple single getter methods... !
Returns the requested value describing the distribution of the input values. The types of information available are described by the constant members of this class (this function returns boolean properties). All information is returned for the given dimension (argument attribute).

Parameters:
type - specifies the type of information to be returned: allowed are some constants defined by this class (see above)
attribute - the index of the attribute for which the value shall be returned (starting with 0)
Returns:
the requested value; since the return type is boolean, the -1 fallback of getNumericalDataProps does not apply

getAttributeLabel

public String getAttributeLabel(int dim)
returns the label (that is the name defined for an attribute in the template vector file) for the specified attribute. If no template file is given, only the index of the attribute is returned.

Parameters:
dim - the index within the vector of the attribute whose label shall be returned
Returns:
the label specified in the template vector file or (if not present) the index of the attribute
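
The fallback from template label to attribute index can be illustrated with a small sketch. The class name AttributeLabels and the String[] holding the template labels are assumptions for the example; the real class reads labels from a TemplateVector. Returning the index for an out-of-range dimension is also an assumption.

```java
// Hypothetical sketch: label lookup that falls back to the attribute index.
class AttributeLabels {

    // templateLabels may be null when no template vector file was given.
    static String attributeLabel(String[] templateLabels, int dim) {
        if (templateLabels == null || dim < 0 || dim >= templateLabels.length) {
            return String.valueOf(dim); // no label available: use the index
        }
        return templateLabels[dim];
    }

    public static void main(String[] args) {
        String[] labels = { "legs", "feathers" };
        System.out.println(attributeLabel(labels, 1));
        System.out.println(attributeLabel(null, 1));
    }
}
```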

getNumberOfClasses

public int getNumberOfClasses()
returns the number of classes. If no class information is attached to the data, -1 is returned.

Returns:
the number of classes or -1

getNameOfClass

public String getNameOfClass(int c)
returns the name of the class specified by the index

Parameters:
c - the index of the class (starting with 0)
Returns:
the name of the class specified by the index, the empty String in case of any error finding the name

getInputLabelsofClass

public String[] getInputLabelsofClass(int classId)
returns a list of labels of all input items belonging to the given class

Parameters:
classId - the id of the class for which the input items are requested
Returns:
a list containing the labels of the input items belonging to this class

getClassColorRGB

public int[] getClassColorRGB(int c)
returns an array of length three containing the r,g,b values of the colour used to colour the specified class

Parameters:
c - the index of the class for which the colour is requested
Returns:
an array containing the r, g and b definitions of a color

getNumberOfClassmembers

public int getNumberOfClassmembers(int c)
Returns the number of input elements belonging to the given class. If no class information is attached to this input, -1 is returned.

Parameters:
c - the index of the class (starting with 0)
Returns:
the number of elements belonging to this class, or -1

getClassIndexOfInput

public int getClassIndexOfInput(String inputLabel)
returns the index of the class that the input vector specified by its label belongs to


getClassInformationFilename

public String getClassInformationFilename()
returns the path of the file containing the class information

Returns:
path to the file containing the class information

checkDatatypes

private void checkDatatypes()
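Runs over all dimensions of the input vectors and gathers information about their data ranges and other properties: the min and max value within each dimension (this.min, this.max); whether a dimension contains only 0/1 values (this.only01); whether a dimension contains only plain integer values (this.discrete); how many 0 (=missing?) values are in each dimension (this.zeroValues). The results are stored in the appropriate arrays.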
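
A single pass that gathers the per-dimension minimum, maximum, 0/1 flag, integer flag and zero count could look like the sketch below. It is a stand-alone illustration under stated assumptions, not the actual checkDatatypes code: the class name DimensionStats and the dense, non-empty double[][] input are hypothetical, while the array names mirror the fields documented above.

```java
import java.util.Arrays;

// Hypothetical sketch of a one-pass scan over all dimensions.
class DimensionStats {
    final double[] min;
    final double[] max;
    final boolean[] only01;
    final boolean[] discrete;
    final int[] zeroValues;

    // data[i][d] is the value of input vector i in dimension d (assumed dense, non-empty).
    DimensionStats(double[][] data) {
        int dim = data[0].length;
        min = new double[dim];
        max = new double[dim];
        only01 = new boolean[dim];
        discrete = new boolean[dim];
        zeroValues = new int[dim];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        Arrays.fill(only01, true);   // assume true until a counterexample is seen
        Arrays.fill(discrete, true);
        for (double[] vector : data) {
            for (int d = 0; d < dim; d++) {
                double v = vector[d];
                min[d] = Math.min(min[d], v);
                max[d] = Math.max(max[d], v);
                if (v != 0.0 && v != 1.0) only01[d] = false;
                if (v != Math.rint(v)) discrete[d] = false; // not an exact integer
                if (v == 0.0) zeroValues[d]++;
            }
        }
    }

    public static void main(String[] args) {
        double[][] data = { {0, 1.5, 3}, {1, 2.0, 0}, {0, 0.0, 7} };
        DimensionStats s = new DimensionStats(data);
        System.out.println(Arrays.toString(s.only01));
        System.out.println(Arrays.toString(s.discrete));
        System.out.println(Arrays.toString(s.zeroValues));
    }
}
```

Note that a dimension flagged as only01 is necessarily also flagged as discrete, since 0 and 1 are integers.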


getInputData

public InputData getInputData()
Returns the InputData object storing information about the input data used for training the SOM. Needed by objects of type TestRunResult for some analysis.

Returns:
the input data used to train the SOM

getInputDatum

public InputDatum getInputDatum(String name)
returns the InputDatum labelled with the specified name


getInputDatum

public InputDatum getInputDatum(int d)
returns the InputDatum at the specified index


getNumberOfSelectedInputs

public int getNumberOfSelectedInputs()
returns the number of inputs the user has selected to get information about their position on the SOM

Returns:
the number of inputs selected by the user.

getSelectedInputId

public int getSelectedInputId(int index)
Returns the id of the input vector at position index in the list of selected inputs. Each input vector is identified by an id, which is its index in the complete input. The vectors selected by the user (to display their position on the SOM) are also stored in a list. To retrieve the "real" id of the vector at position index in this list, this function should be used.

Parameters:
index - the index of the vector in the list of selected inputs
Returns:
the id of the corresponding input, that is the index in the complete input list, -1 if error
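
The mapping from a position in the selection list to the id in the complete input can be sketched as below. The class name SelectedInputs is hypothetical, and treating a null list or an out-of-range position as the "error" case that yields -1 is an assumption; the source only states that -1 is returned on error.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: translate a position in the selection into an input id.
class SelectedInputs {

    // selectedIndices holds the ids (indices in the complete input) chosen by the user.
    static int selectedInputId(List<Integer> selectedIndices, int index) {
        if (selectedIndices == null || index < 0 || index >= selectedIndices.size()) {
            return -1; // assumed error case
        }
        return selectedIndices.get(index);
    }

    public static void main(String[] args) {
        List<Integer> selected = Arrays.asList(4, 17, 42);
        System.out.println(selectedInputId(selected, 1));
        System.out.println(selectedInputId(selected, 3));
    }
}
```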

getInputDataFilename

public String getInputDataFilename()
Returns the complete filename of the file containing the input data. Complete filename means including the path. The string is not verified to point to a valid input file (or any file at all).

Returns:
the complete filename (including absolute path) of the input filename

getTemplateFilename

public String getTemplateFilename()
Returns the complete filename of the file containing the template data. Complete filename means including the path. The string is not verified to point to a valid template file (or any file at all).

Returns:
the complete filename (including absolute path) of the template filename

getClusterName

public Vector<String> getClusterName(ClusterNode node,
                                     int clusterByValue,
                                     int nodeDepth)
Tries to name a cluster by the input data mapped to units lying within the cluster. For naming the cluster, some very simple heuristics are used:
First, if there are any labels of the cluster that correspond to 0/1 attributes, and their values are all 0 (or 1) in the cluster, the name of this attribute is included in the name of the cluster. (Attributes of 0/1 type are supposed to encode "has this property" yes/no information, where the value 1 is interpreted as "cluster has this property" and 0 as "has not".)
Second, if there are any labels that do not correspond to 0/1 attributes, it is checked whether both subclusters have the same value for this label. If yes, the name of this label is included in the name of the cluster.
If none of the properties above is valid, the first nodeDepth-1 labels of the cluster suggested by the clustering algorithm are used. (At least for the animal map this works quite well.)

Parameters:
node - the node representing the cluster that shall be named
clusterByValue - indicates whether the labels for the cluster shall be created by value (is handed unchanged to ClusterNode.getLabels(clusterByValue, boolean))
nodeDepth - the depth of the node in the tree, whereby the root (i.e. the cluster containing the whole map) node has depth 1
Returns:
the list of labels found for this cluster
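
Only the first heuristic (0/1 attributes whose value is constant inside the cluster) is sketched here, in a much simplified form. The class name ClusterNaming, the flat double[][] holding the cluster's vectors, and the "no " prefix for attributes that are constantly 0 are assumptions for illustration; the real method works on ClusterNode labels.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the first naming heuristic only.
class ClusterNaming {

    // Collects the names of 0/1 attributes that have the same value in every vector.
    static List<String> constantBinaryAttributes(double[][] clusterVectors,
                                                 String[] names,
                                                 boolean[] only01) {
        List<String> result = new ArrayList<>();
        for (int d = 0; d < names.length; d++) {
            if (!only01[d]) {
                continue; // heuristic applies to 0/1 attributes only
            }
            double first = clusterVectors[0][d];
            boolean constant = true;
            for (double[] vector : clusterVectors) {
                if (vector[d] != first) {
                    constant = false;
                    break;
                }
            }
            if (constant) {
                // 1 is read as "has this property", 0 as "has not" (prefix is an assumption)
                result.add(first == 1.0 ? names[d] : "no " + names[d]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] cluster = { {1, 0, 2.5}, {1, 0, 3.0}, {1, 1, 2.5} };
        String[] names = { "feathers", "fur", "size" };
        boolean[] only01 = { true, true, false };
        System.out.println(constantBinaryAttributes(cluster, names, only01));
    }
}
```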

getPCAdeterminedDims

public double[][] getPCAdeterminedDims()
This method calculates the most important dimensions of the dataset according to the results of a PCA, and places the resulting dimension indices in a new array under the first index. Under index 2, the corresponding percentage of the total variance is stored (as a quality measure).

Returns:
new array with most important dims ranked decreasingly.

calculateAccumulatedVariance

public double calculateAccumulatedVariance()
This is a small helper method used to display the dimensions in the top part of the output document. It accumulates the variances and calculates their percentage of the total variance.
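
Accumulating variances and expressing them as a percentage of the total can be sketched as follows. The documented method takes no arguments, so the explicit variances and selectedDims parameters, as well as the class name AccumulatedVariance, are assumptions made to keep the example self-contained.

```java
// Hypothetical sketch: share of the total variance covered by some dimensions.
class AccumulatedVariance {

    // variances[d] is the variance of dimension d; selectedDims lists the dimensions to accumulate.
    static double accumulatedVariancePercent(double[] variances, int[] selectedDims) {
        double total = 0.0;
        for (double v : variances) {
            total += v;
        }
        if (total == 0.0) {
            return 0.0; // assumption: no variance at all means 0 percent
        }
        double accumulated = 0.0;
        for (int d : selectedDims) {
            accumulated += variances[d];
        }
        return 100.0 * accumulated / total;
    }

    public static void main(String[] args) {
        double[] variances = { 4.0, 1.0, 3.0, 2.0 };
        System.out.println(accumulatedVariancePercent(variances, new int[] { 0, 2 }));
    }
}
```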


getTrainingDataInfo

public String[] getTrainingDataInfo()
Returns the names of the 3 files used for training


applyNameFix

private static String applyNameFix(String target)
small helper method for getTrainingDataInfo


getEP

public EditableReportProperties getEP()
Returns the Editable Report Properties for the Semantic Report