Text Analysis Tools

Within the MUSCLE Network of Excellence on multimedia understanding, data mining and machine learning, researchers have developed a range of tools for text analysis, text annotation, Natural Language Processing, text classification and semantic indexing. This deliverable of WP4 represents the final inventory of the text analysis tools developed:
Part-of-Speech Tagger, Spatial Query Extractor
Bilkent University, Ugur Gudukbay, Ozgur Ulusoy
BilVideo is a video database management system. The first version, BilVideo v1.0, supports complex spatio-temporal object queries through an SQL-like textual query language, through sketches, or through simple English sentences (an NLP interface). BilVideo v2.0, currently under development, is designed to be an MPEG-7 compliant video database management system. The visual query interface, relevant publications and user manuals are available online (see URLs below). When completed, BilVideo v2.0 will also be made accessible through the web site. The system as a whole is composed of several components and is therefore not publicly available.
BilVideo can extract from natural language queries the following spatial relations:
- topological relations that describe order in 2D space (disjoint, touch, inside, contain, overlap, cover, coveredby)
- directional relations that describe the neighborhood of objects (directions: north, south, east, west, northeast, northwest, southeast, southwest; neighborhood: left, right, below, above)
- 3D relations that describe object positions in 3D space (infrontof, strictlyinfrontof, behind, strictlybehind, touchfrombehind, touchedfrombehind, samelevel)
For example, given a video search query such as "Retrieve segments where James Kelly is to the right of his assistant", the system extracts the spatial relation right(JamesKelly, assistant), which can then be sent to a further query processing engine.
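As an illustration of the idea (not BilVideo's actual NLP interface), a minimal pattern-based extractor for a few of the relations listed above might look like the following Python sketch; the phrase table and preamble handling are assumptions made for the example.

```python
import re

# Map natural-language phrases to BilVideo-style relation names.
# Illustrative subset only; not the tool's real grammar.
PHRASE_TO_RELATION = {
    "to the right of": "right",
    "to the left of": "left",
    "in front of": "infrontof",
    "behind": "behind",
    "inside": "inside",
}

def extract_spatial_relation(condition: str):
    """Return (relation, subject, object) for the first phrase matched."""
    for phrase, relation in PHRASE_TO_RELATION.items():
        m = re.search(rf"(.+?)\s+is\s+{phrase}\s+(.+)", condition, re.IGNORECASE)
        if m:
            return relation, m.group(1).strip(), m.group(2).strip().rstrip(".")
    return None

query = "Retrieve segments where James Kelly is to the right of his assistant"
# Strip the retrieval preamble before matching (a simplification).
condition = query.replace("Retrieve segments where ", "")
print(extract_spatial_relation(condition))
# -> ('right', 'James Kelly', 'his assistant')
```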
The tool can be tried out through a Web client. A demo video illustrating query processing is available, together with some example queries and tutorials.
Website: http://pcvideo.cs.bilkent.edu.tr/
Short description and references: http://www.cs.bilkent.edu.tr/~bilmdg/bilvideo
Downloadable tools: http://pcvideo.cs.bilkent.edu.tr/querying.html
Presentation in MUSCLE BSCW
Updatable Probabilistic Latent Semantic Indexing
AUTH, Constantine Kotropoulos
Probabilistic latent semantic indexing (PLSI) is a semantic space reduction method that folds documents and the concepts that appear in them into a lower-dimensional semantic space, which can then be used to index and classify new documents. Building a reduced semantic space is time consuming, of order O(N^3). AUTH has implemented a new method for updating PLSI when new documents arrive. The method incrementally adds the words of any new document to the term-document matrix and derives the updating equations for the probabilities of terms given the class (i.e. latent) variables, as well as those of documents given the latent variables. This quick updating would be useful in a web crawler, where the term-document matrix must be refreshed very often.
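To make the incremental idea concrete, the following Python sketch shows the classic PLSI "folding-in" update for a single new document, where the trained term-topic probabilities P(w|z) are held fixed and only the new document's topic weights are re-estimated by EM. This is a generic textbook construction, not AUTH's exact updating equations; all sizes and names are illustrative.

```python
import numpy as np

# Folding in one new document: P(w|z) stays fixed, only the new
# document's topic weights P(z|d_new) are re-estimated by EM.

rng = np.random.default_rng(0)
V, K = 1000, 8                                       # vocabulary size, latent classes
p_w_given_z = rng.dirichlet(np.ones(V), size=K).T    # (V, K); each column sums to 1

def fold_in(word_counts, p_w_given_z, n_iter=50):
    """EM over the topic weights of one new document, P(w|z) fixed."""
    _, K = p_w_given_z.shape
    p_z = np.full(K, 1.0 / K)                        # P(z|d_new), uniform start
    words = np.nonzero(word_counts)[0]
    counts = word_counts[words]
    for _ in range(n_iter):
        # E-step: responsibilities P(z|w, d_new) for each observed word
        joint = p_w_given_z[words] * p_z             # (n_words, K)
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M-step: expected counts give the new topic weights
        p_z = (counts[:, None] * resp).sum(axis=0)
        p_z /= p_z.sum()
    return p_z

doc = np.zeros(V)
doc[rng.integers(0, V, size=40)] += 1                # toy new document
print(fold_in(doc, p_w_given_z))
```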
Website: http://www.aiia.csd.auth.gr/EN/
Presentation in MUSCLE BSCW
Natural Language Processing Tools, OWL version of WordNet
CEA LIST, Olivier Mesnard
The CEA has a suite of natural language processing tools for the following languages: English, French, Italian, Spanish, German, Chinese, and Arabic. Alpha versions exist for Hungarian, Japanese, and Russian. They perform the following functions:
- language identification and text encoding identification
- UNICODE translation of codesets
- tokenization, dividing the input stream into individual words
- morphological analysis (recognizing conjugated word forms and providing their normalized dictionary-entry forms)
- part-of-speech tagging (choosing the grammatical function of each word in a text)
- entity recognition (identifying people, organizations, place names, products, money, time)
- dependency extraction (recognizing subject-verb-object relations and modifier relations)
These functions allow the transformation of raw text into symbolic knowledge that can be used to describe, index and access textual information, such as that associated with image captions or raw descriptions.
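The CEA tools themselves are commercial, but the same pipeline stages can be illustrated with an open-source library. Below is a minimal Python sketch using spaCy (it assumes the en_core_web_sm model has been downloaded); it is an analogy for the functions listed above, not CEA's software.

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("James Kelly photographed the old bridge in Prague.")

for tok in doc:
    # tokenization, morphological normalization (lemma), POS tagging,
    # and dependency extraction, roughly stages 3-7 of the list above
    print(tok.text, tok.lemma_, tok.pos_, tok.dep_, tok.head.text)

for ent in doc.ents:
    # entity recognition: people, organizations, places, etc.
    print(ent.text, ent.label_)
```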
The CEA has also developed an OWL ontology version of
the WordNet lexical hierarchy. A reduced version of this ontology
restricted to all the picturable objects in WordNet (30 Mbytes) is
available from the CEA LIST. Contact: Adrian.popescu@cea.fr
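Once obtained, such an ontology can be loaded with any RDF/OWL toolkit. A small Python sketch using the open-source rdflib library follows; the file name is hypothetical, since the actual ontology is available from CEA LIST on request (see contact above).

```python
from rdflib import Graph
from rdflib.namespace import RDFS

g = Graph()
g.parse("wordnet-picturable.owl", format="xml")   # hypothetical local copy

# Print a few labelled resources, e.g. synsets modelled as OWL classes.
for subject, _, label in list(g.triples((None, RDFS.label, None)))[:10]:
    print(subject, label)
```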
Website of the commercial version of these tools: http://www.new-phenix.com
Presentation in MUSCLE BSCW
SOMLib Java Package
TU-WIEN - IFS, Andreas Rauber
TU Vienna - IFS has developed software for analyzing text documents and organizing them on a Self-Organizing Map (SOM), a reduced-dimension semantic representation that brings similar documents or objects closer together on a two- or three-dimensional plane (a minimal sketch of SOM training follows the list below). The SOMLib Java Package is a collection of Java programs that can be used to create SOMLib library systems for organizing text collections. The package includes:
- feature extraction
- feature space pruning
- feature vector creation
- feature vector normalization
- SOM training
- SOM labeling
- libViewer template generation
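The SOMLib package itself is written in Java; purely to illustrate the SOM training step named above, here is a compact NumPy sketch of the standard online SOM algorithm on toy document vectors. All sizes and parameters are illustrative, not the package's defaults.

```python
import numpy as np

rng = np.random.default_rng(42)
n_docs, n_feats = 200, 50
docs = rng.random((n_docs, n_feats))        # toy feature vectors (e.g. tf-idf)

rows, cols = 8, 8                           # map size
weights = rng.random((rows * cols, n_feats))
grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)

n_iter = 1000
for t in range(n_iter):
    lr = 0.5 * (1 - t / n_iter)             # decaying learning rate
    sigma = max(rows, cols) / 2 * (1 - t / n_iter) + 0.5   # neighborhood radius
    x = docs[rng.integers(n_docs)]
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))      # best-matching unit
    dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)          # grid distance to BMU
    h = np.exp(-dist2 / (2 * sigma ** 2))                  # Gaussian neighborhood
    weights += lr * h[:, None] * (x - weights)             # pull units toward x

# Map each document to its best-matching unit on the 2-D grid.
assignments = [int(np.argmin(((weights - d) ** 2).sum(axis=1))) for d in docs]
print(assignments[:10])
```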
Website: http://www.ifs.tuwien.ac.at/~andi/somlib/download/index.html
Quick Reference: http://www.ifs.tuwien.ac.at/~andi/somlib/download/java_package/
Semi-automated Corpus Annotator (CNRS LLACAN)
CNRS, Fathi Debili
Automatic versus Interactive Analysis of Arabic Corpora
The tools presented here make it possible to interactively annotate large corpora of Arabic texts. The annotations involve splitting text into words, lemmatization, vowelization, tagging, segmentation into nominal and verbal chains, and the construction of dependency relations.
Within this process, interactive annotation serves automatic parsing by supplying it with large amounts of annotated text from which rules can be learned and evaluated. Automatic parsing in turn serves interactive annotation, whose performance is measured against the quality and cost of fully manual processing.
The language tools of CNRS LLACAN address the automatic processing of the Arabic language. Based on a dictionary of forms, they enable morphological analysis, POS tagging, phrase chunking and dependency analysis of Modern Standard Arabic, with variable levels of coverage and performance.
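As a rough picture of what such multi-layer annotation yields per token, here is a hypothetical Python sketch of the annotation layers; the field names and the toy example are illustrative and do not reflect CNRS LLACAN's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    surface: str        # word form as split from the raw text
    lemma: str          # result of lemmatization
    vowelized: str      # fully vowelized form
    pos: str            # part-of-speech tag
    chunk: str          # nominal/verbal chain label, e.g. "NP" or "VP"
    head: int = -1      # index of the dependency head (-1 = root)
    relation: str = ""  # dependency relation to the head

@dataclass
class AnnotatedSentence:
    text: str
    tokens: list = field(default_factory=list)

# Toy example, transliterated for readability ("the boy wrote a letter"):
sent = AnnotatedSentence(text="kataba alwaladu risaalatan")
sent.tokens = [
    Token("kataba", "katab", "kataba", "VERB", "VP"),
    Token("alwaladu", "walad", "alwaladu", "NOUN", "NP", head=0, relation="subj"),
    Token("risaalatan", "risaala", "risaalatan", "NOUN", "NP", head=0, relation="obj"),
]
print(sent.tokens[1].relation)   # -> subj
```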
The need to produce large training corpora, together with difficulties specific to the Arabic language, led to the development of interactive analysis tools. These tools currently run under MS Windows; an intranet version is being developed. They have also been used to prepare a tagged corpus of about 250,000 words (available from ELDA).
Presentation in MUSCLE BSCW
UTIA Text Classifier
UTIA, Jana Novovicova
Text categorization (also known as text classification) is the task of automatically sorting a set of documents into predefined classes based on their contents. Document classification is needed in many applications, including e-mail filtering, mail routing, spam filtering, news monitoring, selective dissemination of information to information consumers, and automated indexing of scientific articles. The Prague-based team of UTIA has produced a method for text classification using Oscillating Search which, unlike traditional approaches, evaluates groups of features instead of individual features, and which improves classification accuracy in experiments.
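The following compact Python sketch illustrates the oscillating search idea: start from a feature subset of the target size and alternately swap small groups of features out and in, keeping any swap that improves a criterion function and deepening the oscillation only when stuck. It illustrates the group-wise evaluation described above; it is not UTIA's exact implementation, and the toy criterion stands in for a real one.

```python
from itertools import combinations

def oscillating_search(score, n_features, d, max_depth=2):
    """score(frozenset) -> float; d = desired subset size."""
    current = frozenset(range(d))             # arbitrary initial subset
    best = score(current)
    depth = 1
    while depth <= max_depth:
        improved = False
        outside = sorted(set(range(n_features)) - current)
        # try swapping a group of `depth` features out and `depth` in
        for drop in combinations(sorted(current), depth):
            for add in combinations(outside, depth):
                cand = frozenset((current - set(drop)) | set(add))
                if score(cand) > best:
                    current, best = cand, score(cand)
                    improved = True
                    break
            if improved:
                break
        depth = 1 if improved else depth + 1  # deepen only when stuck
    return current, best

# Toy criterion: prefer even-numbered features (a stand-in for, e.g.,
# cross-validated classifier accuracy on the selected features).
score = lambda s: sum(1 for f in s if f % 2 == 0)
print(oscillating_search(score, n_features=10, d=3))
# -> (frozenset({0, 2, 4}), 3)
```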
Paper describing the work: http://staff.utia.cas.cz/novovic/files/CIARP06_NSP.pdf
ECUE Spam Concept Drift Datasets
NUID / UCD, Sarah Jane Delany
The ECUE Spam Concept Drift Datasets each consist of more than 10,000 emails collected over a period of approximately 2 years. Each is a collection of spam and legitimate email received by an individual. The following files are included in each dataset:
- SpamTraining.txt = all spam emails used as initial training data in the concept drift experiments performed using this dataset.
- NonspamTraining.txt = all legitimate emails used as initial training data in the concept drift experiments performed using this dataset.
- TestMMM99.txt = all emails used as 'test' data in the concept drift experiments using this dataset, where MMM represents the month and 99 the year in which the emails were originally received.
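Given that naming scheme, a drift experiment would iterate over the monthly test files in chronological order. The Python sketch below shows one way to do that; the directory name, encoding, and the assumption of two-digit 20xx years are all hypothetical, not part of the official distribution.

```python
import re
from pathlib import Path

MONTHS = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def month_key(path):
    """Sort TestMMM99.txt files chronologically (assumes 20xx years)."""
    m = re.match(r"Test([A-Za-z]{3})(\d{2})\.txt", path.name)
    return int(m.group(2)), MONTHS[m.group(1).lower()]

dataset = Path("ecue_dataset1")                    # hypothetical location
for f in sorted(dataset.glob("Test*.txt"), key=month_key):
    text = f.read_text(encoding="latin-1")
    print(f.name, len(text.splitlines()), "lines")
```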
Website (download): http://www.comp.dit.ie/sjdelany/Dataset.htm
Papers describing the work:
https://www.cs.tcd.ie/publications/tech-reports/reports.06/TCD-CS-2006-05.pdf (ECAI 2006)
https://www.cs.tcd.ie/publications/tech-reports/reports.05/TCD-CS-2005-19.pdf (FLAIRS 2006)
TechTC - Repository of Text Categorization Datasets
Technion-ML, Shaul Markovitch
While numerous past works have studied text categorization (TC), good test collections are far less abundant. The TechTC-300 Test Collection contains 300 labeled datasets whose categorization difficulty (as measured by baseline SVM accuracy) is uniformly distributed between 0.6 and 1.0. Each dataset consists of a pair of ODP categories with an average of 150-200 documents each (depending on the specific test collection) and defines a binary classification task that consists in telling these two categories apart. The average document size after filtering is slightly over 11 kilobytes. HTML documents were converted into plain text and organized into datasets, which are rendered in a simple XML-like format.
The data is available in two formats:
- Plain text: each dataset consists of a pair of files corresponding to the two categories comprising the dataset. Each file contains all the documents in one category in ASCII text format, as produced by HTML-to-text conversion.
- Preprocessed feature vectors: in this format, texts were only tokenized and digitized, but underwent no other preprocessing whatsoever.
The following test collections are currently available:
- TechTC-300 - a collection of 300 datasets whose categorization difficulty (as measured by baseline SVM accuracy) is uniformly distributed between 0.6 and 1.0.
- TechTC-100 - a collection of 100 datasets whose categorization difficulty (as measured by baseline SVM accuracy) is uniformly distributed between 0.6 and 0.92.
Note: TechTC-100 is a subset of TechTC-300.
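The difficulty measure above is simply the cross-validated accuracy of a baseline SVM on the two-category task. A Python sketch of that kind of baseline run, using scikit-learn, is shown below; the file names and the document-splitting step are hypothetical, since the real plain-text files use a simple XML-like per-document markup that would need proper parsing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def read_docs(path):
    # Placeholder splitting on blank lines; the real files need
    # parsing of their XML-like per-document markup.
    with open(path, encoding="ascii", errors="ignore") as fh:
        return [d for d in fh.read().split("\n\n") if d.strip()]

docs_a = read_docs("category_a.txt")               # hypothetical file names
docs_b = read_docs("category_b.txt")
texts = docs_a + docs_b
labels = [0] * len(docs_a) + [1] * len(docs_b)

X = TfidfVectorizer().fit_transform(texts)
scores = cross_val_score(LinearSVC(), X, labels, cv=5)
print("baseline SVM accuracy:", scores.mean())
```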
Website (download): http://techtc.cs.technion.ac.il/
Papers describing the work: http://www.muscle-noe.org/images/DocumentPDF/MP_504_Gabrilovich-Markovitch-aaai2006.pdf