Identification of Low/High Retrievable Patents using Content-Based Features

S. Bashir, A. Rauber:
"Identification of Low/High Retrievable Patents using Content-Based Features";
in: "Proceeding of the 2nd ACM International workshop on Patent information retrieval (ACM-PAIR '09), Hong Kong, China, 6 November 2009", edited by: Conference on Information and Knowledge Management (ACM-CIKM2009); ACM, New York, NY, USA, 2009, ISBN: 978-1-60558-809-4, pp. 9-16.


Document retrievability is a measure used in information retrieval to identify the bias of retrieval systems. To measure system bias for a specific document collection, an exhaustive set of queries is processed and the frequency with which each document is retrieved is recorded. To better understand and handle system bias, we need to understand the characteristics of documents that influence retrievability and, ideally, be able to identify documents with high and low retrievability in advance. For this purpose, we identify a number of content-based features that can be used effectively to classify a corpus into documents with low and high retrievability with respect to a specific retrieval system. Our experiments on patent collections show that these features achieve more than 80% classification accuracy on different systems, and hint at the need to combine different retrieval systems to optimize recall.
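The retrievability measurement described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The term-count ranker, the rank cutoff `top_c`, the median-based low/high split, and the three toy documents are all assumptions made for illustration; the paper's retrieval systems, query generation, and content-based features are not reproduced here.

```python
from collections import Counter


def retrieve(query, docs, top_c):
    """Toy ranker (an assumption): score = occurrences of query terms in the document."""
    q_terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        tokens = Counter(text.lower().split())
        score = sum(tokens[t] for t in q_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id so results are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_c]]


def retrievability(docs, queries, top_c):
    """Count how often each document appears in the top_c results over all queries."""
    counts = {doc_id: 0 for doc_id in docs}
    for query in queries:
        for doc_id in retrieve(query, docs, top_c):
            counts[doc_id] += 1
    return counts


def split_low_high(counts):
    """Label documents 'low' or 'high' relative to the median count (hypothetical threshold)."""
    ordered = sorted(counts.values())
    median = ordered[len(ordered) // 2]
    return {d: ("high" if c >= median else "low") for d, c in counts.items()}


# Hypothetical miniature collection and query set.
docs = {
    "p1": "battery electrode lithium battery cell",
    "p2": "lithium cell housing",
    "p3": "optical lens coating",
}
queries = ["lithium battery", "battery cell", "lithium cell", "lens coating"]

r = retrievability(docs, queries, top_c=1)
labels = split_low_high(r)
print(r)
print(labels)
```

With a cutoff of one result per query, document p2 is never returned even though it matches three of the four queries. This is the kind of system bias that retrievability is meant to expose.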