PatentSemTech 2023

4th Workshop on Patent Text Mining and Semantic Technologies

PatentSemTech aims to establish a long-term collaboration and a two-way communication channel between the IP industry and academia from relevant fields such as natural-language processing (NLP), text and data mining (TDM) and semantic technologies (ST) in order to explore and transfer new knowledge, methods and technologies for the benefit of industrial applications as well as support research in applied sciences for the IP and neighbouring domains.

The PatentSemTech'23 workshop will be held as a full-day onsite event in conjunction with SIGIR 2023.

Important Dates:


Time zone: Anywhere on Earth (AoE)

Submission deadline: April 30 (25), 2023
Acceptance notification: May 23, 2023
SIGIR PatentSemTech2023 workshop: July 27, 2023

Challenges of using IP data for IR


From the perspective of search task definition, users of patent information systems are highly specialised information professionals who cooperate with the research and/or legal departments of their institutions or companies. Search in this area is generally business critical. There are high requirements on the correctness and completeness of the data to be searched, on the efficiency of the search interface, on the trustworthiness of the provider, and on the quality of the search results. For general language documents (like news articles or Wikipedia articles) there is a variety of tools and methods to process and prepare them for a specific task. Adapting or re-designing such tools to meet the requirements of working with patent and legal documents is a highly challenging undertaking.

Patent Data Traits

Patents are a type of scientific text that is complex and difficult to analyse compared to general language. Without claiming completeness, some reasons are:

  • Patents are very heterogeneous, both as a corpus and as single documents. A patent corpus covers very diverse scientific subject areas, such as chemistry, pharmacology, mining, and all areas of engineering, with the consequence that all kinds of terminology can be found in it.
  • A patent corpus usually covers a long time span, often from the 1950s to the present.
  • Typographical errors are not uncommon, since many patents in their machine-readable form are derived from OCR processing and machine translation.
  • Patents are composed of detailed descriptions of the invention and the claims. As a result, patents are on average two to five times longer than scientific articles.
  • Patents are usually characterized by the use of legal language.

Why work with Patent Data?


Despite its challenges, working with patent data offers a wealth of facets that can be exploited with text-mining and semantic methods:

  • Patent data constitutes a huge corpus of scientific-technical documents covering a variety of technological domains.
  • They are rich in available meta-data such as spatial data, bibliographic data, classifications, temporal data, etc.
  • Patents describe essential scientific-technical knowledge enclosing solutions for real-world applications.
  • They provide knowledge complementary to scientific literature, e.g. chemical and physical properties or bio-science knowledge on drug-target interactions, which appears first in patents and is mostly not published elsewhere.