Text Mining applies Data Mining methods to text documents. The goal is to uncover higher-level information, such as topics of documents, similarity between documents, etc. To this end, text mining applies methods for indexing and feature extraction, and supervised and unsupervised machine learning methods for collection organisation and document classification.
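As an illustration of how indexing and feature extraction feed into document similarity, the following is a minimal sketch that weights terms by TF-IDF and compares documents by cosine similarity. The whitespace tokenizer, the function names and the three sample texts are assumptions made for this example and are not taken from our systems.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it on whitespace (deliberately naive)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (a dict term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample texts, made up for illustration.
docs = [
    "patent claims describe the invention",
    "the patent describes an invention and its claims",
    "jazz and hip hop are music genres",
]
vectors = tfidf_vectors(docs)
sim_patents = cosine_similarity(vectors[0], vectors[1])  # two patent-like texts
sim_mixed = cosine_similarity(vectors[0], vectors[2])    # no shared vocabulary
```

The two patent-like texts share several terms and therefore obtain a positive similarity, whereas the first and third text share none. A realistic pipeline would add stemming, stop-word handling and a proper index.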
One of our latest projects in this area deals with the search for emotionality in text. Not every text that is supposed to be objective meets this requirement. To assess the emotional content of text documents, we implemented an emotionality measurement method for German-language texts.
Patents are a special type of text document: they exhibit certain properties that distinguish them from, e.g., newspaper articles. Patents must comply with certain restrictions regarding their structure, with mandatory and optional sections. Further, they use rather legalistic language and contain many inter-document references. Additionally, patents may contain images and drawings.
Most information retrieval settings, such as web search, are precision-oriented, i.e. they focus on retrieving a small number of highly relevant documents. However, in specific domains, such as patent retrieval or law, recall becomes more important than precision: in these cases the goal is to find all relevant documents. This raises important questions with respect to retrievability and search engine bias: depending on the similarity measure and retrieval model, certain documents may be more or less retrievable, while some documents are not retrievable at all within common threshold settings. Biases may be oriented towards the popularity of documents (increasing the weight of references) or towards the length of documents; they may favour rare or common words, or rely on structural information such as metadata or headings. We investigate an improved accessibility measurement by considering sets of relevant and irrelevant queries for each document. This simulates how recall-oriented users create their queries when searching for relevant information.
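To make the notion of retrievability concrete, the sketch below counts, for each document, how many queries return it within a rank cutoff. This is a basic cumulative count, not the improved measurement described above. The toy term-count ranking function, the documents and the queries are all assumptions made for illustration.

```python
from collections import Counter


def rank(query, documents):
    """Rank document ids by a toy score: number of query-term occurrences.

    Documents that match no query term are not returned at all.
    """
    terms = query.lower().split()
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in terms)
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrievability(documents, queries, cutoff):
    """Count, per document, the queries that return it within the top `cutoff`."""
    counts = {doc_id: 0 for doc_id in documents}
    for query in queries:
        for doc_id in rank(query, documents)[:cutoff]:
            counts[doc_id] += 1
    return counts


# Hypothetical mini-collection: d1 is long and repetitive, d4 matches no query.
documents = {
    "d1": "engine engine engine valve piston engine",
    "d2": "engine valve",
    "d3": "piston ring seal",
    "d4": "gasket",
}
queries = ["engine", "engine valve", "piston", "seal engine"]
scores = retrievability(documents, queries, cutoff=1)
```

With a cutoff of 1, the long document d1 is the only one ever retrieved, which illustrates a length bias of this particular scoring function. Raising the cutoff to 2 makes d2 and d3 reachable, while d4 remains unretrievable for this query set.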
Music Information Retrieval deals with ways of organising and accessing large collections of music. Important aspects are audio feature extraction, to capture characteristics of the musical pieces such as instrumentation, rhythm, chords, etc., as well as classification, clustering, and other means of organisation. A small selection of research topics is outlined below; more detailed information can be found on our dedicated Music Information Retrieval Website.
We are researching advanced methods for extracting semantic information from music, such as rhythm, presence of voice, timbre, etc., using digital signal processing and psycho-acoustics. These feature extraction algorithms are the basis for many subsequent tasks, such as automatic music categorization and organization.
We developed a chord detection algorithm that incorporates music-theoretical knowledge in the form of key detection, beat tracking and chord-change frequencies. This improves the detection of chords in audio without restricting the algorithm to a narrow range of applicable music styles.
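The following sketch shows one simple way in which key information could inform chord detection. It is not the algorithm described above: it uses plain template matching on a single 12-bin chroma frame (energy per pitch class, assumed to be computed elsewhere) and adds a small bonus to triads that are diatonic to a given major key. Beat tracking and chord-change frequencies are left out, and the names and the `key_bonus` value are assumptions.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Intervals in semitones above the root for the two triad types considered.
TRIADS = {"maj": (0, 4, 7), "min": (0, 3, 7)}


def chord_templates():
    """Build a binary 12-bin template for each of the 24 major/minor triads."""
    templates = {}
    for root in range(12):
        for quality, intervals in TRIADS.items():
            template = [0.0] * 12
            for interval in intervals:
                template[(root + interval) % 12] = 1.0
            templates[PITCH_CLASSES[root] + ":" + quality] = template
    return templates


def diatonic_triads(major_key_root):
    """Names of the six major/minor triads on degrees I to vi of a major key."""
    degrees = [(0, "maj"), (2, "min"), (4, "min"), (5, "maj"), (7, "maj"), (9, "min")]
    return {PITCH_CLASSES[(major_key_root + step) % 12] + ":" + quality
            for step, quality in degrees}


def detect_chord(chroma, key_root=None, key_bonus=0.1):
    """Return the best-matching triad name for one 12-bin chroma frame."""
    total = sum(chroma) or 1.0
    in_key = diatonic_triads(key_root) if key_root is not None else set()
    best_name, best_score = None, float("-inf")
    for name, template in chord_templates().items():
        # Fraction of the frame's energy that falls on the chord tones.
        score = sum(c * t for c, t in zip(chroma, template)) / total
        if name in in_key:
            score += key_bonus  # assumed prior: prefer chords that fit the key
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# A frame containing only E and G fits both C major and E minor equally well.
ambiguous = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
without_key = detect_chord(ambiguous)           # first of the tied candidates
with_key = detect_chord(ambiguous, key_root=2)  # key of D major (D = index 2)
```

Without key information the tie is broken arbitrarily by iteration order. Once the key of D major is assumed, E minor is preferred because C major is not diatonic to that key.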
Applying machine learning methods to the features calculated from audio signal analysis, we built a system that categorizes pieces of music into a pre-defined taxonomy corresponding to the user's preferences. The system has to be trained with a number of examples and is then able to categorize music into different classes, e.g. music genres (classical, jazz, hip hop, electronic, ...), or to identify artists. We also investigate the combination of textual and audio information for music classification, and the recognition of moods and emotions in music.
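The train-then-categorize workflow can be sketched with a nearest-centroid classifier, chosen here only because it is short; it is not necessarily the learning method used in our system. The two-dimensional feature vectors and their values are invented for illustration and do not correspond to real audio features.

```python
import math


def train_centroids(examples):
    """Average the feature vectors of each class.

    `examples` is a list of (feature_vector, label) pairs.
    """
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def classify(centroids, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical training examples: (feature vector, genre label).
training = [
    ([0.2, 0.1], "classical"),
    ([0.3, 0.2], "classical"),
    ([0.8, 0.9], "electronic"),
    ([0.9, 0.8], "electronic"),
]
centroids = train_centroids(training)
label_a = classify(centroids, [0.1, 0.2])
label_b = classify(centroids, [0.7, 0.95])
```

The same two functions work for any label set, so the classes could equally be artists or moods, provided suitable labelled examples are available.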