Information Retrieval - Hybrid Semantic Search, TU Vienna

HS³: A Hybrid Semantic Search System

The Hybrid Semantic Search System (HS³) was built with the aim to i) create a semantic search system that can use arbitrary ontologies, ii) automatically fetch and maintain related and relevant information from Web resources, iii) utilize this information to enhance the search process by combining keyword-based and concept-based search and iv) interactively assist the user during the search process with an intuitive UI.

Difference between concept, instance and keyword
Architecture of HS³
Query interface of HS³
Online Demo of HS³
Evaluation
Downloads

Difference between concept, instance and keyword

In the following we give a short explanation of the difference between a concept, an instance and a keyword, which is crucial for the understanding of hybrid semantic search. A concept is an abstract description of an object. Some examples of concepts are Hotel, Airplane, Car and BroadLocation. An instance is a specific occurrence of a concept. The following are instances of the previously listed concepts: "Hotel Savoy" is an instance of the concept Hotel, "Boeing 747" is an instance of the concept Airplane, "Ford Mustang" is an instance of the concept Car and "Salzburg" is an instance of the concept BroadLocation. There can be multiple instances per concept. For example, "Salzburg", "Vienna", "Retz" and "Korneuburg" are all instances of the concept BroadLocation, and "VW Beetle", "Chevy Camaro" and "Ford Mustang" are all instances of the concept Car.

Figure 1 depicts part of a Web page in which the textual representations of two instances of two different concepts have been highlighted. The first instance is Mayrhofen, which is an instance of the concept BroadLocation. The second instance is Elisabeth, which is an instance of the concept Hotel. "Mayrhofen" and "Elisabeth" are their textual representations in the Web page. The system associates the textual representation "Elisabeth" with the actual instance Elisabethhotel, which is located in Mayrhofen. Hence, the textual representation "Elisabeth" has a semantic meaning to the system. However, not all terms within the Web page need to be textual representations of instances or concepts that are known to the system. Such terms have no semantic meaning to the system and are therefore regarded as mere keywords (= sequences of characters). These keywords are still of importance, because they may describe in more detail the concepts and instances whose textual representations appear in the Web page. For example, the highlighted terms "magical" and "atmosphere" in the Web page presented in Figure 1 describe the instance Elisabethhotel located in Mayrhofen in more detail. We regard every term of a Web page that is unknown to the system (= has no semantic association) as belonging to the concept Keyword, with the textual representation of the term as its instance. Therefore, the term "magical" in the Web page is of type Keyword and its instance is the textual representation "magical".

Figure 1: Concept, instance and keyword
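The distinction described above can be illustrated with a minimal sketch. It is not taken from the HS³ code base: the names `Instance`, `KNOWN_REPRESENTATIONS` and `classify` are hypothetical, and the lookup table only contains the two instances mentioned for Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str     # identifier of the instance, e.g. "Elisabethhotel"
    concept: str  # concept the instance belongs to, e.g. "Hotel"


# Hypothetical lookup table: textual representation -> known instance.
KNOWN_REPRESENTATIONS = {
    "mayrhofen": Instance("Mayrhofen", "BroadLocation"),
    "elisabeth": Instance("Elisabethhotel", "Hotel"),
}


def classify(term: str) -> Instance:
    """Return the known instance for a term, or a Keyword instance otherwise."""
    known = KNOWN_REPRESENTATIONS.get(term.lower())
    if known is not None:
        return known
    # Unknown terms are treated as instances of the concept Keyword,
    # with the textual representation itself as the instance.
    return Instance(term, "Keyword")
```

With this table, `classify("Elisabeth")` yields the Hotel instance Elisabethhotel, whereas `classify("magical")` falls back to the concept Keyword.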

Architecture

The architecture of the system is depicted in Figure 2. The system consists of five main sets of components, namely the Persistence components, the Data Fetching components, the Annotation & Indexing components, the Search & Ranking services and the Interface Layer. The Persistence components are provided with data from the Ontology Enricher and the Transformation Engine. The Ontology Enricher adds textual representations of concepts to the ontology and the Transformation Engine transforms custom data structures to RDF and stores the RDF triples in the KB.

Figure 2: HS³'s Architecture
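To give an impression of what the Transformation Engine does, the following sketch turns a custom record into RDF-style triples. It is only an assumption about the general idea: the record layout, the namespace `BASE` and the function `record_to_triples` are hypothetical, and triples are modelled as plain tuples rather than with an RDF library.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

BASE = "http://example.org/etourism#"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def record_to_triples(record: Dict[str, str]) -> List[Triple]:
    """Transform one custom record into (subject, predicate, object) triples."""
    subject = BASE + record["id"]
    # The first triple states which concept the instance belongs to.
    triples = [(subject, RDF_TYPE, BASE + record["concept"])]
    for key, value in record.items():
        if key in ("id", "concept"):
            continue
        # Every remaining field becomes a property of the instance.
        triples.append((subject, BASE + key, value))
    return triples
```

A record such as `{"id": "HotelSavoy", "concept": "Hotel", "locatedIn": "Salzburg"}` would yield one type triple and one property triple, which could then be stored in the KB.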

The Ontology Enricher issues a query to the WordNet API for every concept in the ontology, creates a textual representation and stores it as part of the ontology. A textual representation of a concept contains one or more terms that describe the concept. For example, terms such as "canyoning" and "rafting" are textual representations of the concept Canyoning. The Transformation Engine is an integral part of the system, because it is used to create the initial data set of the KB and can deal with different custom data structures. Therefore, the knowledge generated from this data is trusted and serves the same purpose as the pre-populated KB of KIM. The Persistence components are the ontology, the KB and the Document Store. The Document Store holds copies of fetched documents and the respective metadata. Annotations of documents are stored in the Document Store and can be modified or extended by annotators or indexers. Work queues are managed by so-called Work Queue Managers, which use them to distribute work packages consisting of documents to the different components of the system. Every component exposes services that can be used by other components.

The Data Fetching components use services of the Persistence components. Data Fetchers and Metadata Fetchers read data from the KB and write fetched document data to the Document Store. The Metadata Fetcher is used to fetch metadata for instances of concepts that are stored in the KB. Metadata includes attributes such as the size, the name, the format, the summary and the URL of a document. This information is subsequently used by the Data Fetcher to fetch data such as HTML documents and store them in the Document Store for further processing. The Metadata Fetcher can be equipped with plug-ins that are specialized in retrieving results for a specific concept. It is therefore possible to customize the data fetching routines according to the ontology that is used. The default implementation of the Metadata Fetcher creates the textual query based on the IndexGraph that is attached to a specific concept. In a nutshell, an IndexGraph defines all other instances that are of relevance when considering a specific instance of a concept. A query generated by the default plug-in contains the textual representation of the instance itself and the textual representations of all instances that are on the IndexGraph of this specific instance. If additional or more specific information is needed in the textual queries, a domain expert can modify the default plug-in or create a new plug-in.
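The query generation of the default plug-in can be sketched as follows. This is an illustration under assumptions, not the actual plug-in: the dictionaries `INDEX_GRAPH` and `TEXTUAL_REPRESENTATION`, the function `build_default_query` and the quoting of terms are all hypothetical.

```python
from typing import Dict, List

# Hypothetical IndexGraph: instance -> other instances relevant to it.
INDEX_GRAPH: Dict[str, List[str]] = {
    "Elisabethhotel": ["Mayrhofen"],
}

# Hypothetical textual representations of the instances.
TEXTUAL_REPRESENTATION: Dict[str, str] = {
    "Elisabethhotel": "Elisabeth",
    "Mayrhofen": "Mayrhofen",
}


def build_default_query(instance: str) -> str:
    """Combine the instance's own representation with those on its IndexGraph."""
    terms = [TEXTUAL_REPRESENTATION[instance]]
    for related in INDEX_GRAPH.get(instance, []):
        terms.append(TEXTUAL_REPRESENTATION[related])
    # Quote every term so that multi-word representations stay together.
    return " ".join(f'"{term}"' for term in terms)
```

For the instance Elisabethhotel this produces a textual query containing "Elisabeth" and "Mayrhofen"; a concept-specific plug-in could add further terms in the same place.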

The Annotation & Indexing components use data that was fetched by the Data Fetching components. Annotators use the ontologies and the KB to annotate documents in the Document Store. The Annotators use so-called Annotation Pipelines that adhere to a specific structure and can have multiple processing resources that operate on the data to identify concepts and instances and create annotations. These processing resources may be either local or remote, so it is possible to use an external annotation service from within an Annotation Pipeline and process its result. The Annotation Pipeline includes a mandatory final processing step in which the data is transformed into a structure that can be used by the Indexers. It is possible to include custom Annotation Pipelines. If different Annotation Pipelines create conflicting annotations that cannot be resolved via disambiguation rules defined by a domain expert, the system currently uses the annotation that refers to a concept of the main ontology. The main ontology can be defined via a parameter.
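The following sketch illustrates the pipeline structure and the fallback to the main ontology. It is a simplification under assumptions: expert-defined disambiguation rules are left out, two annotations are considered conflicting only if they cover exactly the same span, and all names (`Annotation`, `run_pipeline`, the two example resources) are hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Annotation(NamedTuple):
    start: int
    end: int
    concept: str
    ontology: str


ProcessingResource = Callable[[str], List[Annotation]]


def run_pipeline(text: str, resources: List[ProcessingResource],
                 main_ontology: str) -> List[Dict[str, str]]:
    # Step 1: every processing resource (local or remote) adds annotations.
    annotations: List[Annotation] = []
    for resource in resources:
        annotations.extend(resource(text))
    # Step 2: group by span; annotations on the same span are in conflict.
    by_span: Dict[Tuple[int, int], List[Annotation]] = {}
    for ann in annotations:
        by_span.setdefault((ann.start, ann.end), []).append(ann)
    resolved = []
    for span in sorted(by_span):
        candidates = by_span[span]
        preferred = [a for a in candidates if a.ontology == main_ontology]
        resolved.append((preferred or candidates)[0])
    # Step 3 (mandatory final step): transform into a structure for Indexers.
    return [{"text": text[a.start:a.end], "concept": a.concept} for a in resolved]


def tourism_resource(text: str) -> List[Annotation]:
    start = text.find("Salzburg")
    if start < 0:
        return []
    return [Annotation(start, start + len("Salzburg"), "BroadLocation", "e-Tourism")]


def news_resource(text: str) -> List[Annotation]:
    start = text.find("Salzburg")
    if start < 0:
        return []
    return [Annotation(start, start + len("Salzburg"), "Location", "News")]
```

Running both resources on the text "Hotel Savoy in Salzburg" with "e-Tourism" as the main ontology keeps the BroadLocation annotation and discards the conflicting one.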

Indexers operate on the ontology, the KB and the Document Store, which holds the annotated documents. The system includes a Semantic Indexer out of the box, but custom Indexers can be added as well; they just need to implement a pre-defined interface. The Semantic Indexer creates a Combined Index that consists of a full-text index and a concept index that holds concepts, instances and their relations. The system is service-oriented and multi-threaded to support parallelism and facilitate scalability.

The Search & Ranking Services comprise services such as the Search Services, the Instance Suggestor and the Ontology Wrapper, which are accessed via the Interface Layer. The system includes a default search service, namely the Semantic Search Service. It can be extended with any custom implementation of a search service as long as it implements the pre-defined interfaces. The Instance Suggestor service can be used to get all instances of the KB that match a certain textual representation. The Ontology Wrapper is a service to access the ontologies managed by the system.

The Interface Layer offers three different types of interfaces: a GWT-based Web User Interface, a Web Service Interface that enables access to the Semantic Search Service and a communication facility for autonomous software agents, which is realized as a Web Service as well.
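The idea of a Combined Index can be illustrated with a small sketch. It is a strong simplification and not the Semantic Indexer itself: relations between concepts and instances are omitted, both indexes are plain in-memory inverted indexes, and the class `CombinedIndex` with its methods is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class CombinedIndex:
    """Full-text index plus concept index over the same documents."""

    def __init__(self) -> None:
        self.full_text: Dict[str, Set[int]] = defaultdict(set)  # term -> doc ids
        self.concepts: Dict[str, Set[int]] = defaultdict(set)   # concept -> doc ids

    def add(self, doc_id: int, text: str, concepts: Iterable[str]) -> None:
        for term in text.lower().split():
            self.full_text[term].add(doc_id)
        for concept in concepts:
            self.concepts[concept].add(doc_id)

    def search(self, keywords: Iterable[str], concepts: Iterable[str]) -> Set[int]:
        """Return documents that match all given keywords and all given concepts."""
        hits = [self.full_text.get(k.lower(), set()) for k in keywords]
        hits += [self.concepts.get(c, set()) for c in concepts]
        if not hits:
            return set()
        return set.intersection(*hits)
```

A hybrid query then intersects both kinds of evidence: searching for the keyword "magical" together with the concept Hotel only returns documents that contain the term and were annotated with a Hotel.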

Query Interface

HS³ uses a novel interactive, ontology-aware, keyword-based input interface. This section describes how to use the interface and closes with a short video that presents it in action. The interactive ontology-enhanced query formulation interface enables users to formulate queries to search for Web pages that hold specific concepts, instances, keywords and their relations.

Queries are formulated interactively by typing into the input text field marked with a red rectangle in Figure 3.

Figure 3

To create a simple query, just start typing the name of a concept, the name of an instance or a keyword. For example, if you are looking for a guesthouse, start typing "gue" and you will be presented with a popup that holds suggestions matching your current input, as shown in Figure 4. If the popup holds the concept or instance you are looking for, select it with a mouse click or by pressing return when the concept or instance is highlighted. The popup lists the concept suggestions first, followed by the instance suggestions.

Figure 4
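The ordering of the suggestion popup can be sketched as follows. The sketch assumes case-insensitive prefix matching, which the description above does not specify, and the function `suggest` as well as the instance name "Guesthouse Alpenblick" are hypothetical.

```python
from typing import Iterable, List, Tuple


def suggest(user_input: str, concepts: Iterable[str],
            instances: Iterable[str]) -> List[Tuple[str, str]]:
    """Return matching suggestions, concepts first, then instances."""
    prefix = user_input.lower()
    concept_hits = sorted(c for c in concepts if c.lower().startswith(prefix))
    instance_hits = sorted(i for i in instances if i.lower().startswith(prefix))
    return ([("concept", c) for c in concept_hits]
            + [("instance", i) for i in instance_hits])
```

Typing "gue" against the concepts Guesthouse and Hotel and the instances "Guesthouse Alpenblick" and "Hotel Savoy" would list the concept Guesthouse before the instance "Guesthouse Alpenblick".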

As soon as you have chosen a concept or instance by mouse click or return, a popup with all possible properties associated with that concept or instance will be displayed. For example, if you are looking for a guesthouse that provides a specific facility, you would choose the "provides facility" property from the popup as depicted in Figure 5.

Figure 5

Next, you may input another concept or instance that shall be related to the previous concept or instance (in our example "guesthouse") via the chosen property (in our example "provides facility"). Let's say we are looking for a guesthouse that provides the facility mountain biking. We would therefore type "MountainBik" and choose the concept MountainBiking as depicted in Figure 6.

Figure 6

Again, a popup will be displayed with all properties that are associated with the chosen instance or concept. If you don't want to choose a property, just hit the ESC key. In case you want to specify a further query restriction for the guesthouse concept, you can type the word "and", resulting in a popup that includes the And concept as depicted in Figure 7. To pick the And concept, just hit return or click on it with the left mouse button, which will bring up the properties popup of the previous concept or instance again. In our example this is the properties popup of the concept guesthouse as depicted in Figure 8. We choose the property "suitable for target audience" and hit return.

Figure 7

Figure 8

Now we just need to define the target audience we want the guesthouse to be suitable for. Since we are in our late fifties, we want the guesthouse to be suitable for elderly people and choose the elderlyPeople concept from the popup as depicted in Figure 9.

Figure 9

However, we prefer guesthouses that are child-friendly. Therefore, we modify our query by inserting an entry in front of the guesthouse concept. We do so by hovering in front of the guesthouse concept until the mouse cursor turns into a text cursor and then clicking the left mouse button. Now we are able to input text in front of the guesthouse concept as depicted in Figure 10. Since the ontology and the knowledge base do not contain a concept child-friendly, and hence no corresponding instances, we choose the keyword concept as depicted in Figure 11. A keyword always modifies the concept or instance to its right, in our case the concept guesthouse, because we are looking for a child-friendly guesthouse.

Figure 10

Figure 11
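The query built in this walkthrough can be thought of as a small data structure: a concept, the keywords that modify it and a list of property restrictions joined by And. The sketch below is only an illustration of that reading; the class `QuerySubject` and its fields are hypothetical and do not reflect the internal query representation of HS³.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QuerySubject:
    concept: str
    # Keywords modify the concept or instance to their right.
    keywords: List[str] = field(default_factory=list)
    # (property, target) pairs, e.g. ("provides facility", "MountainBiking").
    restrictions: List[Tuple[str, str]] = field(default_factory=list)

    def describe(self) -> str:
        head = " ".join(self.keywords + [self.concept])
        parts = [f"{prop} {target}" for prop, target in self.restrictions]
        if not parts:
            return head
        return head + " " + " and ".join(parts)


query = QuerySubject(
    concept="guesthouse",
    keywords=["child-friendly"],
    restrictions=[
        ("provides facility", "MountainBiking"),
        ("suitable for target audience", "elderlyPeople"),
    ],
)
```

Calling `query.describe()` reproduces the sequence of entries in the input field: the keyword, the concept and the two restrictions joined by "and".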

In addition, here is a video that demonstrates how to use HS³:
(HS3Input.mov)

Online Demo of HS³

You can access the prototype of the system, which uses an index containing a couple of e-Tourism Websites automatically fetched from several e-Tourism portals, at HS³ Prototype. As this is a prototype, there might still be a couple of issues. To get started with the system, you can for example input the query Hotel and pick the Hotel concept.

Evaluation

The following table lists all queries that have been used to evaluate the performance of HS³, together with their corresponding ontology.

Number	Query	Ontology
1.	Give me information about hotel Panhans	e-Tourism
2.	Give me information about bed and breakfast accommodations where I can play squash	e-Tourism
3.	Give me information about guesthouses that have a whirlpool	e-Tourism
4.	Give me information about youth hostels that are equipped with an air condition	e-Tourism
5.	Give me information about hotels that are located in Mattersburg	e-Tourism
6.	Give me information about locations where youth hostels are located	e-Tourism
7.	Give me information about holiday flats in Mayrhofen that are equipped with a sauna	e-Tourism
8.	Give me information about hotels in Salzburg that are equipped with a whirlpool	e-Tourism
9.	Give me information about guesthouses in Mayrhofen that are located near a mountain	e-Tourism
10.	Give me information about recommended bed and breakfast accommodations in Neustift that have a garden	e-Tourism
11.	Give me news about persons that work at Walt Disney	News (KIM)
12.	Give me news about the person Oliver Kahn	News (KIM)
13.	Give me news about companies that are located in Chile	News (KIM)
14.	Give me news about companies in the retail industry that are located in Chile	News (KIM)
15.	Give me news about organizations in the retail industry that are traded on the New York Stock Exchange	News (KIM)
16.	Give me news about organizations in the financial services industry that are located in Japan and traded on the OTC Stock Exchange	News (KIM)
17.	Give me news about organizations that are located on the Cayman Islands and are traded on the NASDAQ	News (KIM)
18.	Give me news about market research reports of organizations that are located on the Bermudas	News (KIM)

Downloads

The datasets that have been used to evaluate the performance of HS³ are listed below:

e-Tourism dataset

e-Tourism Ontology based on the Harmonise Ontology (300KB) (from http://www.harmonise.org)
e-Tourism KB using the Harmonise Ontology (3MB)
Complementary e-Tourism document corpus (in GATE format, 54MB)

News dataset (KIM)

KIM system including the KIM Ontology and KB on Ontotext's Website (external: http://www.ontotext.com)
Complementary news document corpus (in GATE format, 3.8GB)
