at.tuwien.ifs.somtoolbox.data
Class AbstractSOMLibSparseInputData

java.lang.Object
  extended by at.tuwien.ifs.somtoolbox.data.AbstractSOMLibSparseInputData
All Implemented Interfaces:
InputData
Direct Known Subclasses:
DataBaseSOMLibSparseInputData, RandomAccessFileSOMLibInputData, SimpleMatrixInputData, SOMLibSparseInputData

public abstract class AbstractSOMLibSparseInputData
extends Object
implements InputData

This abstract implementation provides basic support for operating on an InputData. Subclasses must implement constructors and methods that read input vectors and create an InputData object, for example by reading from a file or a database.

Version:
$Id: AbstractSOMLibSparseInputData.java 3883 2010-11-02 17:13:23Z frank $
Author:
Rudolf Mayer

Field Summary
protected  SOMLibClassInformation classInfo
          Any class label information attached to the input vectors.
protected  String content_subtype
          The specific subtype of content type (user-definable, for example "rp", "rh", or "ssd" for Rhythm Patterns, Rhythm Histograms or Statistical Spectrum Descriptor audio feature types).
protected  String content_type
          The content type of the vectors ("text", "audio", ...).
 String[] dataNames
          The label/name of the vector.
protected  int dim
          The dimension of the input vectors, i.e. the number of attributes.
private  double[][] distanceMatrix
          A matrix containing the pairwise distances between the input vectors.
FIXME: use LeightWeightMemoryInputVectorDistanceMatrix instead
protected static String ERROR_MESSAGE_FILE_FORMAT_CORRUPT
           
protected  int featureMatrixCols
          Column dimension of the feature matrix before it was vectorised into an input vector.
protected  int featureMatrixRows
          Row dimension of the feature matrix before it was vectorised into an input vector.
private  double[][] intervals
           
protected  boolean isNormalized
          Indicates whether the input data has been normalised.
protected  cern.colt.matrix.impl.DenseDoubleMatrix1D meanVector
          The mean of all the input vectors.
protected  double mqe0
           
protected  LinkedHashMap<String,Integer> nameCache
          A mapping from the name to the index of an input vector, for faster access.
protected  int numVectors
          The number of vectors in this input data collection.
protected  Random rand
           
protected  String source
          Where this input data was read from, e.g. a file or database table.
protected  TemplateVector templateVector
          A TemplateVector attached to this input data.
private  double[][] transformedVectors
          A transformation of the input vectors.
 
Fields inherited from interface at.tuwien.ifs.somtoolbox.data.InputData
inputFileNameSuffix, MISSING_VALUE
 
Constructor Summary
protected AbstractSOMLibSparseInputData()
           
protected AbstractSOMLibSparseInputData(boolean norm, Random random)
           
protected AbstractSOMLibSparseInputData(String[] dataNames, int dim, boolean norm, Random rand, TemplateVector tv, SOMLibClassInformation clsInfo)
           
 
Method Summary
private  boolean assertEqual(Object name, Object i1, Object i2)
           
 SOMLibClassInformation classInformation()
          Gets the class info associated with this input data.
static AbstractSOMLibSparseInputData create(InputDatum[] inputData, SOMLibClassInformation classInfo)
           
 int dim()
          Gets the dimension of the input data.
 boolean equals(Object obj)
           
 InputDatum[] getByNameDistanceSorted(double[] vector, Collection<String> inputNames, DistanceMetric metric)
          Retrieves the InputDatum corresponding to the given input names, and sorted by their distance to the given vector.
 String getContentSubType()
          Gets the content sub-type.
 String getContentType()
          Gets the content type.
 double[][] getData()
          Returns the input data as a double array, i.e. a matrix of numVectors x dim.
 double[][] getData(String className)
          Returns the vectors of all inputs associated with the given class name
 double[][] getDataIntervals()
          Returns the min and max values for each feature, in a matrix of dim x 2.
 String getDataSource()
          Returns the name/URI/etc. of the source where this input data was read from.
 double[][] getDistanceMatrix()
           
 ArrayList<InputDistance> getDistances(int inputIndex, DistanceMetric metric)
          Returns the distances between the vector at the given index and the vectors of the dataset.
 Hashtable<Integer,Integer> getFeatureDensities()
          Returns feature density statistics of the input data, namely a mapping from the number of input objects in which a specific feature is non-zero to the total number of features with that density.
 int getFeatureMatrixColumns()
          Gets the number of columns before vectorisation.
 int getFeatureMatrixRows()
          Gets the number of rows before vectorisation.
static String getFileNameSuffix()
           
static String getFormatName()
           
 InputDatum getInputDatum(String label)
          Get an input datum with a specified label.
 InputDatum[] getInputDatum(String[] labels)
          Returns an array of input data with the specified labels.
 int getInputDatumIndex(String label)
           
 String getLabel(int index)
          Return the label of the input vector at the given index.
 String[] getLabels()
          Returns an array containing the labels of all the input data.
 cern.colt.matrix.DoubleMatrix1D getMeanVector()
          Gets the mean vector of the input vectors.
 cern.colt.matrix.DoubleMatrix1D getMeanVector(String[] labels)
          Returns the mean vector of the input vectors identified by the given String[] array of labels.
 InputDatum[] getNearestN(double[] vector, DistanceMetric metric, int number)
          Retrieves the given number of InputDatum that are closest to the given vector.
 InputDatum[] getNearestN(int inputIndex, DistanceMetric metric, int number)
          Returns the n input vectors nearest to the vector at the given index of the dataset.
 InputDatum[] getNearestNUnsorted(int inputIndex, DistanceMetric metric, int number)
           
private  InputDatum[] getNNearest(ArrayList<InputDistance> distances)
           
private  InputDatum[] getNNearest(int number, ArrayList<InputDistance> distances)
           
 InputDatum getRandomInputDatum(int iteration, int numIterations)
          Gets a random input sample from the input data set.
 void initDistanceMatrix(DistanceMetric metric)
          Calculates the distanceMatrix - careful, this is a lengthy process and should be done only if needed.
 boolean isNormalizedToUnitLength()
          Indicates whether this data set has been normalised to the unit length.
 int numVectors()
          Gives the size of this input data set.
 void setClassInfo(SOMLibClassInformation classInfo)
           
 void setTemplateVector(TemplateVector templateVector)
          Sets the template vector to be associated with this input data.
 TemplateVector templateVector()
          Gets the template vector associated with this input data.
 void transformValues(DistanceMetric metric)
          Calculates the matrix of transformedVectors using DistanceMetric.transformVector(double[]) of the given metric.
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface at.tuwien.ifs.somtoolbox.data.InputData
getInputDatum, getInputVector, getValue, mqe0, subset
 

Field Detail

ERROR_MESSAGE_FILE_FORMAT_CORRUPT

protected static final String ERROR_MESSAGE_FILE_FORMAT_CORRUPT
See Also:
Constant Field Values

source

protected String source
Where this input data was read from, e.g. a file or database table.


classInfo

protected SOMLibClassInformation classInfo
Any class label information attached to the input vectors.


dataNames

public String[] dataNames
The label/name of the vector.


content_type

protected String content_type
The content type of the vectors ("text", "audio", ...).

An input file should use the following header format for content types:
$DATA_TYPE text
or
$DATA_TYPE audio-rp
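
As an illustration of the header format above, the following is a minimal sketch of how such a line could be split into a content type and a content sub-type. The class `DataTypeHeaderSketch` and its `parse` method are hypothetical and not part of the SOMToolbox API. The sketch also assumes that the first hyphen separates type and sub-type, which the `audio-rp` example suggests but this page does not state explicitly.

```java
/** Hypothetical helper for illustration; not part of the SOMToolbox API. */
final class DataTypeHeaderSketch {
    private static final String PREFIX = "$DATA_TYPE";

    /** Returns {contentType, contentSubType}; the sub-type is "" when absent. */
    static String[] parse(String headerLine) {
        String line = headerLine.trim();
        if (!line.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not a $DATA_TYPE header: " + headerLine);
        }
        String value = line.substring(PREFIX.length()).trim();
        // Assumption: the first hyphen separates the type from the sub-type.
        int dash = value.indexOf('-');
        if (dash < 0) {
            return new String[] { value, "" };
        }
        return new String[] { value.substring(0, dash), value.substring(dash + 1) };
    }

    public static void main(String[] args) {
        System.out.println(String.join(" / ", parse("$DATA_TYPE audio-rp")));
        System.out.println(String.join(" / ", parse("$DATA_TYPE text")));
    }
}
```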


content_subtype

protected String content_subtype
The specific subtype of content type (user-definable, for example "rp", "rh", or "ssd" for Rhythm Patterns, Rhythm Histograms or Statistical Spectrum Descriptor audio feature types).


featureMatrixRows

protected int featureMatrixRows
Row dimension of the feature matrix before it was vectorised into an input vector.


featureMatrixCols

protected int featureMatrixCols
Column dimension of the feature matrix before it was vectorised into an input vector.


dim

protected int dim
The dimension of the input vectors, i.e. the number of attributes.


isNormalized

protected boolean isNormalized
Indicates whether the input data has been normalised.


meanVector

protected cern.colt.matrix.impl.DenseDoubleMatrix1D meanVector
The mean of all the input vectors.


mqe0

protected double mqe0

numVectors

protected int numVectors
The number of vectors in this input data collection.


rand

protected Random rand

templateVector

protected TemplateVector templateVector
A TemplateVector attached to this input data.


transformedVectors

private double[][] transformedVectors
A transformation of the input vectors. This can be used, for example, to transform the input data for distance calculations once for all vectors, which improves performance.


distanceMatrix

private double[][] distanceMatrix
A matrix containing the pairwise distances between the input vectors.
FIXME: use LeightWeightMemoryInputVectorDistanceMatrix instead


nameCache

protected LinkedHashMap<String,Integer> nameCache
A mapping from the name to the index of an input vector, for faster access.
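
The idea behind such a cache can be sketched as follows: the labels are scanned once, and each later lookup by name becomes a hash lookup instead of a linear search through `dataNames`. The class `NameCacheSketch` is hypothetical, and returning -1 for an unknown label is an assumption of this sketch, since this page does not document what the class itself returns in that case.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of a name-to-index cache; not the SOMToolbox code. */
final class NameCacheSketch {
    /** Builds the mapping label -> position; iteration order follows dataNames. */
    static Map<String, Integer> build(String[] dataNames) {
        Map<String, Integer> cache = new LinkedHashMap<>();
        for (int i = 0; i < dataNames.length; i++) {
            // A duplicate label would overwrite the earlier index.
            cache.put(dataNames[i], i);
        }
        return cache;
    }

    /** Assumption: -1 signals an unknown label. */
    static int indexOf(Map<String, Integer> cache, String label) {
        Integer index = cache.get(label);
        return index == null ? -1 : index;
    }

    public static void main(String[] args) {
        Map<String, Integer> cache = build(new String[] { "doc1", "doc2", "doc3" });
        System.out.println(indexOf(cache, "doc2"));
        System.out.println(indexOf(cache, "missing"));
    }
}
```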


intervals

private double[][] intervals
Constructor Detail

AbstractSOMLibSparseInputData

protected AbstractSOMLibSparseInputData(String[] dataNames,
                                        int dim,
                                        boolean norm,
                                        Random rand,
                                        TemplateVector tv,
                                        SOMLibClassInformation clsInfo)

AbstractSOMLibSparseInputData

protected AbstractSOMLibSparseInputData(boolean norm,
                                        Random random)

AbstractSOMLibSparseInputData

protected AbstractSOMLibSparseInputData()
Method Detail

dim

public int dim()
Description copied from interface: InputData
Gets the dimension of the input data.

Specified by:
dim in interface InputData
Returns:
the dimension.

getContentType

public String getContentType()
Description copied from interface: InputData
Gets the content type.

Specified by:
getContentType in interface InputData
Returns:
the content type

getContentSubType

public String getContentSubType()
Description copied from interface: InputData
Gets the content sub-type.

Specified by:
getContentSubType in interface InputData
Returns:
the content sub-type

getFeatureMatrixRows

public int getFeatureMatrixRows()
Description copied from interface: InputData
Gets the number of rows before vectorisation.

Specified by:
getFeatureMatrixRows in interface InputData
Returns:
the number of rows of the feature matrix before it was vectorised into an input vector, or -1 if not available.

getFeatureMatrixColumns

public int getFeatureMatrixColumns()
Description copied from interface: InputData
Gets the number of columns before vectorisation.

Specified by:
getFeatureMatrixColumns in interface InputData
Returns:
the number of columns of the feature matrix before it was vectorised into an input vector, or -1 if not available.

getMeanVector

public cern.colt.matrix.DoubleMatrix1D getMeanVector()
Description copied from interface: InputData
Gets the mean vector of the input vectors.

Specified by:
getMeanVector in interface InputData
Returns:
the mean vector.

getMeanVector

public cern.colt.matrix.DoubleMatrix1D getMeanVector(String[] labels)
Description copied from interface: InputData
Returns the mean vector of the input vectors identified by the given String[] array of labels.

Specified by:
getMeanVector in interface InputData
Parameters:
labels - label names of the input data.
Returns:
the mean vector.
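
A minimal sketch of what computing a mean vector over a labelled subset involves is given below. It uses plain `double[]` arrays instead of Colt's `DoubleMatrix1D` to stay dependency-free. The class `MeanVectorSketch` is hypothetical, and skipping unknown labels is an assumption of this sketch, not documented behaviour of the class.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; not the SOMToolbox implementation. */
final class MeanVectorSketch {
    /** Averages the rows of data whose names appear in labels. */
    static double[] mean(String[] dataNames, double[][] data, String[] labels) {
        if (data.length == 0) {
            return new double[0];
        }
        List<String> names = Arrays.asList(dataNames);
        double[] sum = new double[data[0].length];
        int count = 0;
        for (String label : labels) {
            int index = names.indexOf(label);
            if (index < 0) {
                continue; // Assumption: unknown labels are ignored.
            }
            for (int f = 0; f < sum.length; f++) {
                sum[f] += data[index][f];
            }
            count++;
        }
        if (count > 0) {
            for (int f = 0; f < sum.length; f++) {
                sum[f] /= count;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] names = { "a", "b", "c" };
        double[][] data = { { 1, 2 }, { 3, 4 }, { 5, 6 } };
        System.out.println(Arrays.toString(mean(names, data, new String[] { "a", "c" })));
    }
}
```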

isNormalizedToUnitLength

public boolean isNormalizedToUnitLength()
Description copied from interface: InputData
Indicates whether this data set has been normalised to the unit length.

Specified by:
isNormalizedToUnitLength in interface InputData
Returns:
true if this data set is normalised, false otherwise.

numVectors

public int numVectors()
Description copied from interface: InputData
Gives the size of this input data set.

Specified by:
numVectors in interface InputData
Returns:
the number of vectors.

templateVector

public TemplateVector templateVector()
Description copied from interface: InputData
Gets the template vector associated with this input data.

Specified by:
templateVector in interface InputData
Returns:
the template vector, or null if the template vector was not specified.

classInformation

public SOMLibClassInformation classInformation()
Description copied from interface: InputData
Gets the class info associated with this input data.

Specified by:
classInformation in interface InputData
Returns:
the class info, or null if the class info file was not specified.

setTemplateVector

public void setTemplateVector(TemplateVector templateVector)
Description copied from interface: InputData
Sets the template vector to be associated with this input data.

Specified by:
setTemplateVector in interface InputData
Parameters:
templateVector - the new template vector.

getInputDatum

public InputDatum getInputDatum(String label)
Description copied from interface: InputData
Get an input datum with a specified label.

Specified by:
getInputDatum in interface InputData
Parameters:
label - the name of the input datum.
Returns:
the input datum.

getInputDatumIndex

public int getInputDatumIndex(String label)

getRandomInputDatum

public InputDatum getRandomInputDatum(int iteration,
                                      int numIterations)
Description copied from interface: InputData
Gets a random input sample from the input data set.

Specified by:
getRandomInputDatum in interface InputData
Returns:
the random input data.

getInputDatum

public InputDatum[] getInputDatum(String[] labels)
Description copied from interface: InputData
Returns an array of input data with the specified labels.

Specified by:
getInputDatum in interface InputData
Parameters:
labels - the labels of the input data.
Returns:
the input data.

transformValues

public void transformValues(DistanceMetric metric)
Calculates the matrix of transformedVectors using DistanceMetric.transformVector(double[]) of the given metric.

Parameters:
metric - the metric to be used to transform the values.

initDistanceMatrix

public void initDistanceMatrix(DistanceMetric metric)
                        throws MetricException
Calculates the distanceMatrix - careful, this is a lengthy process and should be done only if needed. Requires the matrix of transformedVectors to be initialised (e.g. via transformValues(DistanceMetric)).

Parameters:
metric - the metric to use for calculating the distances.
Throws:
MetricException - if DistanceMetric.distance(double[], double[]) encounters a problem.
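
To illustrate why this step is expensive, here is a minimal sketch of a pairwise distance matrix: for n vectors it performs n(n-1)/2 distance computations and stores n x n values. The class `DistanceMatrixSketch` is hypothetical, and the Euclidean distance stands in for whatever DistanceMetric is passed to the real method.

```java
import java.util.Arrays;

/** Hypothetical illustration; Euclidean distance stands in for a DistanceMetric. */
final class DistanceMatrixSketch {
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Fills a symmetric n x n matrix; the diagonal stays 0. */
    static double[][] pairwise(double[][] vectors) {
        int n = vectors.length;
        double[][] matrix = new double[n][n];
        for (int i = 0; i < n; i++) {
            // Only the upper triangle is computed; the lower one is mirrored.
            for (int j = i + 1; j < n; j++) {
                double d = euclidean(vectors[i], vectors[j]);
                matrix[i][j] = d;
                matrix[j][i] = d;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        double[][] vectors = { { 0, 0 }, { 3, 4 }, { 6, 8 } };
        System.out.println(Arrays.deepToString(pairwise(vectors)));
    }
}
```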

getNearestN

public InputDatum[] getNearestN(int inputIndex,
                                DistanceMetric metric,
                                int number)
                         throws MetricException
Returns the n input vectors nearest to the vector at the given index of the dataset. Uses the pre-calculated distance matrix, if it exists; otherwise calculates the distances as needed.

Parameters:
inputIndex - the index of the vector.
metric - the metric to use for the distance comparison. Only used when the distanceMatrix is not pre-calculated.
number - the number of nearest input vectors desired.
Returns:
the n nearest input vectors.
Throws:
MetricException - if DistanceMetric.distance(DoubleMatrix1D, double[]) encounters a problem.
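
The lookup can be sketched as follows, under stated assumptions: the sketch returns indices rather than InputDatum objects, uses squared Euclidean distance in place of a DistanceMetric, and excludes the query vector itself from the result. None of these choices is confirmed by this page, and the class `NearestNSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration; not the SOMToolbox implementation. */
final class NearestNSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns the indices of the (at most) number vectors closest to vectors[inputIndex]. */
    static int[] nearestN(double[][] vectors, int inputIndex, int number) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) {
            if (i != inputIndex) { // Assumption: the query itself is excluded.
                candidates.add(i);
            }
        }
        final double[] query = vectors[inputIndex];
        // List.sort is stable, so ties keep their original index order.
        candidates.sort(Comparator.comparingDouble(
                (Integer i) -> squaredDistance(query, vectors[i])));
        int n = Math.min(number, candidates.size());
        int[] result = new int[n];
        for (int k = 0; k < n; k++) {
            result[k] = candidates.get(k);
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] vectors = { { 0, 0 }, { 10, 0 }, { 1, 0 }, { 3, 0 } };
        System.out.println(Arrays.toString(nearestN(vectors, 0, 2)));
    }
}
```

Sorting all candidates costs O(n log n) per query; when a distance matrix has already been computed, the distances could be read from row inputIndex instead of being recalculated.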

getDistances

public ArrayList<InputDistance> getDistances(int inputIndex,
                                             DistanceMetric metric)
                                      throws MetricException
Returns the distances between the vector at the given index and the vectors of the dataset. Uses the pre-calculated distance matrix, if it exists; otherwise calculates the distances as needed.

Parameters:
inputIndex - the index of the vector.
metric - the metric to use for the distance comparison. Only used when the distanceMatrix is not pre-calculated.
Returns:
the distances to the vector at the given index.
Throws:
MetricException - if DistanceMetric.distance(DoubleMatrix1D, double[]) encounters a problem.

getNNearest

private InputDatum[] getNNearest(ArrayList<InputDistance> distances)

getNNearest

private InputDatum[] getNNearest(int number,
                                 ArrayList<InputDistance> distances)

getNearestNUnsorted

public InputDatum[] getNearestNUnsorted(int inputIndex,
                                        DistanceMetric metric,
                                        int number)
                                 throws MetricException
Throws:
MetricException

getNearestN

public InputDatum[] getNearestN(double[] vector,
                                DistanceMetric metric,
                                int number)
                         throws MetricException
Retrieves the given number of InputDatum that are closest to the given vector.

Throws:
MetricException

getByNameDistanceSorted

public InputDatum[] getByNameDistanceSorted(double[] vector,
                                            Collection<String> inputNames,
                                            DistanceMetric metric)
                                     throws MetricException
Retrieves the InputDatum corresponding to the given input names, and sorted by their distance to the given vector.

Throws:
MetricException

getData

public double[][] getData()
Description copied from interface: InputData
Return the input data as a double array, i.e. a matrix of numVectors x dim

Specified by:
getData in interface InputData

getData

public double[][] getData(String className)
                   throws SOMToolboxException
Description copied from interface: InputData
Returns the vectors of all inputs associated with the given class name

Specified by:
getData in interface InputData
Throws:
SOMToolboxException - If no class information file is loaded

setClassInfo

public void setClassInfo(SOMLibClassInformation classInfo)
Specified by:
setClassInfo in interface InputData

getDistanceMatrix

public double[][] getDistanceMatrix()

getDataIntervals

public double[][] getDataIntervals()
Description copied from interface: InputData
Returns the min and max values for each feature, in a matrix of dim x 2.

Specified by:
getDataIntervals in interface InputData
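
A minimal sketch of such an interval matrix follows. It assumes that column 0 holds the minimum and column 1 the maximum, matching the order in which the description names them. The class `DataIntervalsSketch` is hypothetical.

```java
import java.util.Arrays;

/** Hypothetical illustration; not the SOMToolbox implementation. */
final class DataIntervalsSketch {
    /** Returns a dim x 2 matrix; assumption: [f][0] is the min, [f][1] the max of feature f. */
    static double[][] intervals(double[][] data, int dim) {
        double[][] result = new double[dim][2];
        for (int f = 0; f < dim; f++) {
            result[f][0] = Double.POSITIVE_INFINITY;
            result[f][1] = Double.NEGATIVE_INFINITY;
        }
        for (double[] vector : data) {
            for (int f = 0; f < dim; f++) {
                result[f][0] = Math.min(result[f][0], vector[f]);
                result[f][1] = Math.max(result[f][1], vector[f]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] data = { { 1, 5 }, { -2, 7 }, { 3, 6 } };
        System.out.println(Arrays.deepToString(intervals(data, 2)));
    }
}
```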

getFeatureDensities

public Hashtable<Integer,Integer> getFeatureDensities()
Returns feature density statistics of the input data, namely a mapping from the number of input objects in which a specific feature is non-zero to the total number of features with that density.
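
The mapping is easier to follow with a small sketch: first count, per feature, the number of inputs in which it is non-zero, then count how many features share each of those counts. The class `FeatureDensitiesSketch` is hypothetical. Whether the real method also reports features that are zero everywhere (density 0) is not stated on this page; this sketch includes them.

```java
import java.util.Hashtable;

/** Hypothetical illustration; not the SOMToolbox implementation. */
final class FeatureDensitiesSketch {
    /** Maps "number of inputs with a non-zero value" to "number of features with that count". */
    static Hashtable<Integer, Integer> densities(double[][] data, int dim) {
        int[] nonZeroCount = new int[dim];
        for (double[] vector : data) {
            for (int f = 0; f < dim; f++) {
                if (vector[f] != 0.0) {
                    nonZeroCount[f]++;
                }
            }
        }
        Hashtable<Integer, Integer> result = new Hashtable<>();
        for (int count : nonZeroCount) {
            // Assumption: features with density 0 are counted as well.
            result.merge(count, 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] data = { { 1, 1, 0 }, { 1, 1, 0 } };
        System.out.println(densities(data, 3));
    }
}
```

In the `main` example, two features are non-zero in both inputs and one feature is non-zero in none, so density 2 maps to 2 features and density 0 maps to 1 feature.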


getLabels

public String[] getLabels()
Description copied from interface: InputData
Returns an array containing the labels of all the input data.

Specified by:
getLabels in interface InputData

getLabel

public String getLabel(int index)
Description copied from interface: InputData
Return the label of the input vector at the given index.

Specified by:
getLabel in interface InputData

equals

public boolean equals(Object obj)
Overrides:
equals in class Object

assertEqual

private boolean assertEqual(Object name,
                            Object i1,
                            Object i2)

create

public static AbstractSOMLibSparseInputData create(InputDatum[] inputData,
                                                   SOMLibClassInformation classInfo)

getFormatName

public static String getFormatName()

getFileNameSuffix

public static String getFileNameSuffix()

getDataSource

public String getDataSource()
Description copied from interface: InputData
Returns the name/URI/etc. of the source where this input data was read from.

Specified by:
getDataSource in interface InputData