The Information Management and Preservation Lab at the Department of Software Technology and Interactive Systems at Vienna University of Technology covers all aspects related to making information accessible. This includes research pertaining to Data Mining and Machine Learning (DM/ML) to analyze structure in data and provide means to visualize and interpret the data. This extends to research activities in Information Retrieval (IR) on multimodal data, particularly text, music and images. On top of that, research in Digital Preservation aims to ensure, that data, workflows and processes are authentically available over long time spans.
Contact: Andreas Rauber
More specifically, the core activities in these areas comprise:
In the area of data mining we focus on the detection of structure using a range of data mining as well as information visualization technologies. Specifically, we focus on the Self-Organizing Map (SOM) as a tool for topology-preserving data projection and vector quantization, as well as on approaches for classifying data into predefined categories using ensemble approaches. these techniques are evaluated on a range of benchmakr data as well as applied in real-world data mining settings ranging from sensor data to tasks in information retrieval.
SOM : The Self-Organizing Map (SOM) is an unsupervised neural network model providing a non-linear topology-preserving mapping from a high-dimensional input space onto a commonly two-dimensional output space while also performing vector quantization. Our research concentrates on novel visualization techniques and extensions to the SOM model to facilitate analysis and interpretation of the resulting map structures. Results are made available via the popular SOMToolbox, released as open source software.
Machine Learning Algorithms : We use a range of supervised machine learning algorithms to classify data, including Support Vector Machines, Naive Bayes, Random Forests, and others. Decision Manifolds allow the piecewise linear approximation of decision planes using self-organizing principles. Other aspects address include ensemble learning as well as parameter optimization.
Benchmark evaluations : Solid evaluation is essential for understanding performance differences in machine learning. Starting from characteristics of benchmark data and evaluation principles we are moving towards more solidly expressed evaluation workflows, analyzing documentation requirements and automated evaluation procedures.
Text Retrieval focuses on searching within text information. This includes indexing, feature extraction and supervised and unsupervised machine learning methods for collection organisation and document classification. Domain-specific adaptations of these methods are investigated for domains including medical and health documents, patent information, scientific papers, e-mail routing and others. Extensive consultation with end users in these domains is undertaken to guide the developments.
Music Retrieval research focuses on various methods of indexing and structuring audio collections, as well as providing intuitive user interfaces to facilitate exploration of musical libraries. Machine learning techniques are used to extract semantic information, group audio by similarity, or classify it into various categories. We developed advanced visualization techniques to provide intuitive interfaces to audio collections on standard PC as well as mobile devices. Our solutions automatically organize music according to its sound characteristics such that we find similar pieces of music grouped together, enabling direct access to and intuitive instant playback according to one's current mood.
Image Retrieval research focuses on retrieving images based on a range of similarity measures, from low-level colour and texture similarity to estimates of the emotional response elicted by images. Work on the domain of technical drawings for patent retrieval is a current focus of our research.
Cross-modal Retrieval : An important aspect of our work covers the cross-modal aspects inherently present in any type of information: documents consist of more than the textual content, including genre and layout information as well as images; music goes beyond the mere acoustic content, including song lyrics, artist biographies, album artwork and music videos. All of these information sources need to be analyzed together to offer comprehensive information retrieval capabilities in each domain.
Evaluation aims at measuring how effectively information retrieval algorithms perform. Current emphasis is on the evaluation of retrieval in the intellectual property domain, for which we organise the CLEF-IP and TREC-CHEM evaluation campaigns. The PatOlympics events go beyond laboratory evaluation by making the patent search professional a central part of the evaluation process.
Digital Preservation is the process of keeping electronic material and processes accessible and usable for longer periods of time. It has turned into one of the most pressing challenges within the digital library community. Long-term archiving of digital material (data and processes) is a highly complex and diverse matter. This is partially due to rapid changes and ongoing developments in file formats. At the same time, hardware and information technology infrastructure are subject to constant changes, making the task even more difficult. Extending our core competencies in digital preservation we are closely affiliated with the digital preservation group at Secure Business Austria (SBA), a national competence center on security.
Preservation Planning : We developed a methodology that allows to create preservation plans according to the particular requirements of stakeholders. The approach builds on Utility Analysis and a thoroughly defined workflow for reaching informed and accountable decisions on which preservation solution to adopt. The planning tool PLATO, which implements this workflow, has been released as open source software and experiences tremendous take-up in the community.
Up-/downscaling Preservation : Automation is crucial to allow preservation solutions to scale up to massive volumes of data, as well as to be deployable in small institutions without strong in-house expertise. In this area we investiage approaches to automate quality assurance in preservation actions, devise concepts and prototype systems for outsourcing preservation expertise, as well as cloud-deployment approaches for large-scale preservation actions.