MUSCLE Network of Excellence |
Text Analysis Tools
Within the MUSCLE Network of Excellence on multimedia understanding, data mining and machine learning, researchers have developed a range of tools for text analysis, text annotation, natural language processing, text classification and semantic indexing. This WP4 deliverable is the final inventory of the text analysis tools developed: |
||
Part-of-Speech Tagger, Spatial Query Extractor
Bilkent University, Ugur Gudukbay, Ozgur Ulusoy
BilVideo is a video database management system. The first version, BilVideo v1.0, supports complex spatio-temporal object queries expressed in an SQL-like textual query language, as sketches, or as simple English sentences (NLP interface). BilVideo v2.0 is currently under development and is designed to be an MPEG-7 compliant video database management system. The visual query interface, relevant publications and user manuals are available online (see the URLs below). When completed, BilVideo v2.0 will also be made accessible through the web site. The whole system is composed of several components and is therefore not publicly available. BilVideo can extract the following spatial relations from natural language queries:
For example, given a video search query such as "Retrieve segments where James Kelly is to the right of his assistant", the system extracts the spatial relation right(JamesKelly, assistant), which can be sent to a further query processing engine. The tool can be tried out over the Web through a Web client. There is a demo video for query processing, together with some examples, queries and tutorials.
Website: http://pcvideo.cs.bilkent.edu.tr/
Short description and references: http://www.cs.bilkent.edu.tr/~bilmdg/bilvideo
Downloadable tools: http://pcvideo.cs.bilkent.edu.tr/querying.html |
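The mapping from an English query to a relation such as right(JamesKelly, assistant) can be illustrated with a small pattern-matching sketch. This is not BilVideo's NLP interface, which relies on part-of-speech tagging. The phrase table, the function names and the rule for building identifiers are all assumptions made for illustration.

```python
import re

# Hypothetical phrase-to-predicate table (an assumption for this sketch,
# not the list of relations supported by BilVideo).
RELATION_PHRASES = {
    "to the right of": "right",
    "to the left of": "left",
    "above": "above",
    "below": "below",
}

# Words dropped when turning a noun phrase into an identifier.
SKIPPED_WORDS = {"the", "a", "an", "his", "her", "its", "their"}


def to_identifier(phrase):
    """Join the words of a noun phrase, skipping determiners and possessives."""
    return "".join(w for w in phrase.split() if w.lower() not in SKIPPED_WORDS)


def extract_spatial_relation(query):
    """Return 'predicate(A, B)' for queries of the form
    '... where A is <relation phrase> B', or None if nothing matches."""
    for phrase, predicate in RELATION_PHRASES.items():
        pattern = (
            r"where\s+(.+?)\s+is\s+" + re.escape(phrase) + r"\s+(.+?)\s*[.?!]?\s*$"
        )
        match = re.search(pattern, query, flags=re.IGNORECASE)
        if match:
            left = to_identifier(match.group(1))
            right = to_identifier(match.group(2))
            return f"{predicate}({left}, {right})"
    return None


if __name__ == "__main__":
    q = "Retrieve segments where James Kelly is to the right of his assistant"
    print(extract_spatial_relation(q))
```

A regular expression only covers rigid sentence templates. A tagger-based approach can handle more varied phrasing, which is presumably why the tool is paired with a part-of-speech tagger.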
||
Updatable Probabilistic Latent Semantic Indexing
AUTH, Constantine Kotropoulos |
||
Natural Language Processing Tools, OWL version of WordNet
CEA LIST, Olivier Mesnard
These functions allow the transformation of raw text into symbolic knowledge that can be used to describe, index and access textual information, such as that associated with image captions, or in raw descriptions. The CEA has also developed an OWL ontology version of the WordNet lexical hierarchy. A reduced version of this ontology restricted to all the picturable objects in WordNet (30 Mbytes) is available from the CEA LIST. Contact: Adrian.popescu@cea.fr |
||
SOMLib Java Package
TU-WIEN - IFS, Andreas Rauber
TU Vienna - IFS has developed software for analyzing text documents and organizing them on a Self-Organizing Map (SOM), a representation with reduced semantic dimension that brings similar documents or objects closer together on a two- or three-dimensional plane. The SOMLib Java Package is a collection of Java programs that can be used to create SOMLib library systems for organizing text collections. The package includes
Website: http://www.ifs.tuwien.ac.at/~andi/somlib/download/index.html
Quick Reference: http://www.ifs.tuwien.ac.at/~andi/somlib/download/java_package/ |
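To illustrate how a Self-Organizing Map places similar documents close together, here is a minimal sketch in Python. It is not the SOMLib implementation, which is written in Java. The class name TinySOM, the decay schedule and the toy term-frequency vectors are assumptions chosen for brevity.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


class TinySOM:
    """A rows x cols grid of units, each holding a weight vector of size dim."""

    def __init__(self, rows, cols, dim, seed=0):
        rng = random.Random(seed)
        self.weights = {
            (r, c): [rng.random() for _ in range(dim)]
            for r in range(rows)
            for c in range(cols)
        }

    def bmu(self, x):
        """Grid coordinates of the best-matching unit (closest weights to x)."""
        return min(self.weights, key=lambda unit: sq_dist(self.weights[unit], x))

    def update(self, x, lr, sigma):
        """Pull the BMU and its grid neighbours toward x.
        A Gaussian over grid distance scales the pull for each unit."""
        br, bc = self.bmu(x)
        for (r, c), w in self.weights.items():
            grid_d2 = (r - br) ** 2 + (c - bc) ** 2
            h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
            for i in range(len(w)):
                w[i] += lr * h * (x[i] - w[i])

    def train(self, data, epochs=50, lr0=0.5, sigma0=1.5):
        """Sweep over the data with a linearly decaying learning rate and radius."""
        for epoch in range(epochs):
            decay = 1.0 - epoch / epochs
            lr = lr0 * decay
            sigma = max(sigma0 * decay, 0.3)  # keep the radius from collapsing to 0
            for x in data:
                self.update(x, lr, sigma)


if __name__ == "__main__":
    # Toy term-frequency vectors: two documents per topic (made-up data).
    docs = [[3, 0, 0], [2, 1, 0], [0, 0, 3], [0, 1, 2]]
    som = TinySOM(rows=3, cols=3, dim=3)
    som.train(docs)
    for d in docs:
        print(d, "->", som.bmu(d))
```

After training, documents with similar term vectors tend to map to the same or adjacent grid units. That placement is the property a SOM-based library system uses to organize a collection.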
||
Semi-automated Corpus Annotator (CNRS LLACAN)
CNRS, Fathi Debili
Automatic versus Interactive Analysis of Arabic Corpora
The language tools of CNRS LLACAN relate to the automatic processing of the Arabic language. Based on a dictionary of forms, they enable morphological analysis, POS tagging, phrase chunking and dependency analysis of Modern Standard Arabic, with variable levels of coverage and performance. |
||
UTIA Text Classifier
UTIA, Jana Novovicova
Text categorization (also known as text classification) is the task of automatically sorting a set of documents into predefined classes based on their contents. Document classification is needed in many applications, including e-mail filtering, mail routing, spam filtering, news monitoring, selective dissemination of information to information consumers, and automated indexing of scientific articles. The Prague-based UTIA team has produced a method for text classification using Oscillating Search which, unlike traditional approaches, evaluates groups of features rather than individual features, and which improves classification accuracy in experiments. |
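As a point of reference for the text categorization task described above, the following is a minimal multinomial Naive Bayes baseline. It is not the UTIA method and does not perform Oscillating Search feature selection. It only shows the basic task of assigning a document to one of several predefined classes. The class name, the whitespace tokenizer and the toy training data are assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately crude assumption)."""
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(documents, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {
            word for counts in self.word_counts.values() for word in counts
        }
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        """Return the class with the highest log posterior for the text."""
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / self.total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocabulary:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Made-up training examples for a spam/legitimate split.
    train_docs = [
        "cheap pills buy now",
        "win money now",
        "meeting agenda attached",
        "project meeting tomorrow",
    ]
    train_labels = ["spam", "spam", "ham", "ham"]
    clf = NaiveBayesTextClassifier().fit(train_docs, train_labels)
    print(clf.predict("buy cheap money"))
```

A feature selection step would sit between tokenization and fitting, restricting the vocabulary to the terms judged most useful. The baseline above simply keeps every word it has seen.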
||
ECUE Spam Concept Drift Datasets
NUID / UCD, Sarah Jane Delany
The ECUE Spam Concept Drift Datasets each consist of more than 10,000 emails collected over a period of approximately 2 years. Each is a collection of spam and legitimate email received by an individual. The following files are included in each dataset:
Papers describing the work: https://www.cs.tcd.ie/publications/tech-reports/reports.06/TCD-CS-2006-05.pdf (ECAI 2006), https://www.cs.tcd.ie/publications/tech-reports/reports.05/TCD-CS-2005-19.pdf (FLAIRS 2006) |
||
TechTC - Repository of Text Categorization Datasets
Technion-ML, Shaul Markovitch
While numerous works have studied text categorization (TC), good test collections are far less abundant. The TechTC-300 Test Collection contains 300 labeled datasets whose categorization difficulty (as measured by baseline SVM accuracy) is uniformly distributed between 0.6 and 1.0. Each dataset consists of a pair of ODP categories with an average of 150-200 documents (depending on the specific test collection), and defines a binary classification task that consists in telling these two categories apart. The average document size after filtering is slightly over 11 Kilobytes. HTML documents were converted into plain text and organized into datasets, which were rendered in a simple XML-like format. The data is available in two formats:
The following test collections are currently available:
Papers describing the work: http://www.muscle-noe.org/images/DocumentPDF/MP_504_Gabrilovich-Markovitch-aaai2006.pdf |
||