CLEF-IP, University of Technology Vienna

Previous CLEF-IP Labs

CLEF-IP ran for the first time in 2009. Since then, the document corpus has grown, and the number and types of tasks have changed. Some details about the earlier CLEF-IP challenges are listed below.

Much of the information below can be found on the previous CLEF-IP website.

Documents in the CLEF-IP Corpus

Format and Content

The documents in the patent collection are stored as XML files. They originate from the European Patent Office (EPO) and have mixed content in English, German and French.
The files contain bibliographic data as well as descriptive text. The XML files are quite comprehensive, containing detailed information on inventors, assignees, priority dates, etc. Of the many elements in the XML files, these are the ones to look at first:

invention-title
classifications-ipcr
abstract
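
As a minimal sketch of how these three elements might be read from one file, the Python snippet below collects their text with the standard library. Only the three element names come from the list above. The surrounding document structure, the `lang` attribute, the absence of XML namespaces, and the helper name `extract_fields` are assumptions made for illustration, so the sample should be adapted to the actual corpus files.

```python
import xml.etree.ElementTree as ET

# Element names taken from the list above.
FIELDS = ("invention-title", "classifications-ipcr", "abstract")


def extract_fields(xml_text):
    """Return {element name: [whitespace-normalized text, ...]}.

    Hypothetical helper: it searches the whole tree for each element
    name, so it does not depend on where the element is nested.
    Assumes the file uses no XML namespaces.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for name in FIELDS:
        texts = []
        for elem in root.iter(name):
            # itertext() also gathers text from nested child elements.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                texts.append(text)
        result[name] = texts
    return result


# Invented sample document; the real corpus layout may differ.
SAMPLE = """<patent-document>
  <bibliographic-data>
    <invention-title lang="EN">Example title</invention-title>
    <classifications-ipcr>
      <classification-ipcr>A61K 31/00</classification-ipcr>
    </classifications-ipcr>
  </bibliographic-data>
  <abstract lang="EN"><p>Example abstract text.</p></abstract>
</patent-document>"""

fields = extract_fields(SAMPLE)
print(fields["invention-title"])
```

Searching by element name rather than by a fixed path keeps the sketch independent of the exact nesting, which is not specified here.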

Number of Documents

2009: 1.9 million patent documents, corresponding to approximately 1 million individual patents filed between 1985 and 2000.
2010: 2.6 million patent documents, corresponding to approximately 1.3 million individual patents published until 2001.
2011: All EPO documents with an application date earlier than 2002 (more than 2.5 million patent documents constituting more than 1 million patents). In addition, for EuroPCT applications, the corresponding patent documents published by the WIPO were added (more than 400,000 documents).
2012: The data corpus used in this year is the same as the one used in 2011.

Tasks and Topics

2009

There was only one kind of task: find documents that constitute prior art. 10,000 topics were made available, and participants could choose to submit experiments using subsets of this largest topic set. Accepted subsets had to contain results for the first 500, 1000, or 5000 topics of the complete set.

The language of the topic documents was not restricted. The 2009 track also made available optional language tasks for English, German and French, where the topics had textual content in only one of the three languages.

2010

Two kinds of tasks were available:

Prior Art Candidate Search Task: find patent documents that are likely to constitute prior art to a given patent application.
Classification Task: classify a given patent document according to the IPC.

Both tasks contained 2000 topics. Participants in the Prior Art task were allowed to submit results for a smaller set of 500 topics.

2011

There were four tasks in the 2011 track:

Prior Art Candidate Search: Find patent documents that are likely to constitute prior art to a given patent application.
Classification: Classify a given patent document according to the IPC system, up to the subclass level. A new optional sub-task is to classify a given patent document up to the group/subgroup level, when the subclass is given.
Image-based Patent Retrieval: Find patent documents relevant to a given patent document containing images.
Image-based Classification: Categorize given patent images into pre-defined categories of images (such as graph, flowchart, drawing, etc.).

2012

Three tasks were organized in 2012:

Passage retrieval starting from claims (patentability or novelty search): The topics in this task were based on the claims in patent application documents. Given a claim, participants were asked to retrieve relevant documents in the collection and mark the relevant passages in these documents.
Flowchart Recognition Task: The topics in this task were patent images representing flowcharts. Participants were asked to extract the information in these images and return it in a predefined textual format.
Chemical Structure Recognition Task: The topics in this task were patent pages in TIFF format. Participants were asked to identify the location of the chemical structures depicted on these pages and, for each of them, return the corresponding structure in a MOL file (a chemical structure file format).

Relevance Assessments

Obtaining Relevance Judgements

Relevance judgements are produced by an automatic method using patent citations from seed patents.
In 2009, for a small number of queries, (pooled) search results were reviewed by Intellectual Property experts.

Document vs. Patent IDs

In 2009, relevance was measured at the patent level, not at the patent-document level. That is, a relevant item is a patent, not a patent file (or document).
A patent is identified by its patent ID. This means that a valid result is of the form EP0383071 rather than EP0383071-B1.xml or EP-0383071-B1 (which are document IDs). Note that patent-level relevance can be applied to EP patents, where the patent number/ID appears in every patent document in the data set and identifies a patent uniquely. This may not be the case for publications from other patent offices, a typical example being the USPTO.
In 2010, relevance was measured at the document level. In 2009 and 2011, it was measured at the patent level.
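
To illustrate the difference between the two kinds of identifier, the following Python sketch reduces document IDs to patent IDs and collapses a ranked list of documents into a ranked list of patents. The accepted ID shapes are inferred only from the examples above (EP0383071-B1.xml, EP-0383071-B1, EP0383071). The regular expression and the function names `to_patent_id` and `collapse_to_patents` are assumptions made for illustration, not part of any official CLEF-IP tooling.

```python
import re

# Assumed shape: two-letter office code, optional hyphen, digits,
# optional kind code such as "-B1", optional ".xml" extension.
_DOC_ID = re.compile(r"^([A-Z]{2})-?(\d+)(?:-[A-Z]\d?)?(?:\.xml)?$")


def to_patent_id(doc_id):
    """Map a document ID (or file name) to its patent ID."""
    match = _DOC_ID.match(doc_id.strip())
    if match is None:
        raise ValueError(f"unrecognized document ID: {doc_id!r}")
    return match.group(1) + match.group(2)


def collapse_to_patents(ranked_doc_ids):
    """Turn a ranked list of document IDs into a ranked list of
    patent IDs, keeping only the first occurrence of each patent."""
    seen = set()
    patents = []
    for doc_id in ranked_doc_ids:
        patent_id = to_patent_id(doc_id)
        if patent_id not in seen:
            seen.add(patent_id)
            patents.append(patent_id)
    return patents


ranking = ["EP-0383071-A1", "EP0383071-B1.xml", "EP-0000001-A2"]
print(collapse_to_patents(ranking))
```

Keeping the first occurrence preserves the rank at which a patent first appears, which is one plausible convention for patent-level evaluation; other conventions are possible.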
