The MAREC data set

The MAtrixware REsearch Collection


Description

MAREC is a static collection of over 19 million patent applications and granted patents in a unified file format normalized from EP, WO, US, and JP sources, spanning a range from 1976 to June 2008. MAREC is intended as raw material for research and evaluation in areas such as information retrieval, natural language processing or machine translation, which require large amounts of complex documents. It allows experiments with real data on a realistic scale.
The collection contains documents in several languages, the majority being English, German and French, and about half of the documents include full text.

In MAREC, the documents from different countries and sources are normalized to a common XML format with a uniform patent numbering scheme and citation format. The standardized fields include dates, countries, languages, references, person names, and companies as well as rich subject classifications. It is a comparable corpus, where many documents are available in similar versions in other languages.
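
Because every document shares one XML format, a quick way to get familiar with the schema is to count which element names occur in a file. The sketch below does this with Python's standard library. The sample document and its element names (`doc`, `date`, `claims`, `claim`) are hypothetical stand-ins, not the actual MAREC schema, so treat it as a starting point rather than a description of the real files.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_text: str) -> Counter:
    """Count how often each element tag occurs in one XML document."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(elem.tag for elem in root.iter())


# Hypothetical stand-in document; real MAREC element names may differ.
sample = (
    '<doc lang="EN">'
    "<date>20080601</date>"
    "<claims><claim>A</claim><claim>B</claim></claims>"
    "</doc>"
)

for tag, n in count_tags(sample).most_common():
    print(tag, n)
```

Running the same function over a handful of real files from each source (EP, WO, US, JP) would show which fields are present everywhere and which are optional.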

The 19,386,697 XML files total 621 GB. Further statistics are available on the original website.

Download

MAREC by IRF is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Permissions beyond the scope of this license may be available at mailto:marec@fandan.net.

Download IREC.tar.bz2 (75GB) md5
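
With an archive of this size it is worth checking the download against the published MD5 value before unpacking. The following is a minimal sketch using Python's `hashlib`. It reads the file in chunks so the 75GB archive never has to fit in memory. The function names and the chunk size are illustrative choices, and the expected checksum has to be taken from the md5 file linked above.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected_hex):
    """Compare a file's MD5 digest with a published hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A call such as `matches_published_md5("IREC.tar.bz2", expected)` returns `True` only when the local file is byte-identical to the published one, where `expected` is the hex string copied from the md5 file.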

The original MAREC collection was missing part of the claims section of the European granted patents (EP-B documents). An EPB_Bugfix folder was provided with corrected versions of those files. IREC simply merges the original EPB folder with EPB_Bugfix in order to provide a uniform representation. The CLEF-IP collections were never affected by this issue, as they were specially curated.
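
For readers who hold the original MAREC files rather than IREC, a merge of this kind can be sketched as follows. This is not the procedure used to build IREC. It assumes that a corrected file in the bugfix folder has the same relative path as the file it replaces, and that the corrected version should win. The function name and directory arguments are hypothetical.

```python
import shutil
from pathlib import Path


def merge_with_bugfix(original_dir, bugfix_dir, merged_dir):
    """Copy original_dir into merged_dir, then overlay bugfix_dir on top.

    Assumption: a bugfix file shares the relative path of the file it
    corrects, so the second copy overwrites the faulty version.
    Requires Python 3.8+ for the dirs_exist_ok parameter.
    """
    merged_dir = Path(merged_dir)
    shutil.copytree(original_dir, merged_dir, dirs_exist_ok=True)
    shutil.copytree(bugfix_dir, merged_dir, dirs_exist_ok=True)
    return merged_dir
```

Writing into a separate output folder leaves both inputs untouched, which makes it easy to rerun or verify the merge.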