Monday, December 12, 2011

Jay Kesan et al.--Developing a Comprehensive Patent Related Information Retrieval Tool

How can one reliably gather and identify pertinent data from sources as varied as litigation history, press releases, news articles, scientific literature, and foreign databases?  In a recent article, Jay Kesan and his colleagues in the theoretical and computational biophysics group at Illinois (Hang Yu) and in the civil and environmental engineering department at Stanford (Kincho Law, Siddarth Taduri, and Gloria Lau) explore potential frameworks for approaching this non-trivial task.  Poorly performing search tools can cost millions of dollars in patent litigation or lost revenue, whether through an erroneous preclearance search or a resulting patent invalidation proceeding.  Accordingly, the inability to find critical information regarding patents and patent applications is a very serious problem.  Currently, cross-referencing hundreds of incompatible databases operated by independent entities, in which millions of documents are stored in various file formats with data arranged in different ways, is a laborious and perhaps impossible task.  In Developing a Comprehensive Patent Related Information Retrieval Tool, Kesan and his colleagues directly address the need for algorithms capable of cross-referencing data across multiple domains to identify the most relevant documents for a user's specific needs, and they establish a basic framework for developing such tools.

There is a glut of potentially relevant data for any patent issue.  To illustrate, the USPTO has issued over 7 million patents.  The biomedical literature alone contains over 19 million records.  The Derwent World Patent Index contains over 49 million patents.  In addition, there are tens of millions of pages of potentially relevant U.S. legislative history, statutes, notices, proposals, guidance documents, case documents, and administrative regulations.  These numbers are staggering and present a unique challenge to researchers, such as Kesan and his colleagues, who strive to develop patent-related information retrieval tools.

Comprehensive retrieval tools differ from simple keyword search tools in many ways.  Typically, keyword search tools are most useful and efficient when searching a single type of domain.  In this context, “domain” refers to one of many areas where relevant data may be located.  For example, (i) administrative agencies, (ii) federal court systems, (iii) private databases, (iv) scientific literature, and (v) news media would each be considered a different domain.  The differences among these domains and data storage within each create inherent problems for traditional keyword search tools.

Kesan and his colleagues identify several problems that complicate the development of comprehensive patent-related information search tools.  First, many domains are incompatible because each domain organizes and expresses data differently and stores data in different formats.  For example, PACER (Public Access to Court Electronic Records) does not support keyword searches because documents within PACER are stored and rendered as image files.  Second, many domains are highly distributed among multiple entities.  For instance, many journals, courts, and conferences maintain their own independent databases.  Third, the language used to identify certain technologies varies a great deal among domains.  Court cases are usually devoid of the technical jargon present in scientific articles, patents, and patent applications, while news reports often use “buzz words” not found in a patent to create catchy headlines.  Further, scientists often enjoy being lexicographers for their discoveries, so scientists in the same field may identify the same natural phenomenon using different words or phrases.  For these reasons, the development of comprehensive patent-related search tools is not a simple endeavor.

Kesan and his colleagues approached this herculean task by creating a framework of computational steps and then using erythropoietin as a test case.  (Erythropoietin is a protein that plays a vital role in the production of red blood cells and is used to treat several medical conditions, including anemia.)  Their approach consisted of keyword expansion and independent database searches, followed by cross-referencing of the results.
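
The three steps named above (keyword expansion, independent searches, cross-referencing) can be illustrated with a minimal Python sketch.  This is not the authors' implementation: the synonym table, the toy in-memory corpora, the document identifiers, and every function name are hypothetical, and cross-referencing is approximated here, as an assumption, by looking for patent identifiers cited in the text of court documents.

```python
import re

# Hypothetical synonym table standing in for an ontology lookup.
SYNONYMS = {"erythropoietin": ["epo", "epoetin"]}

# Toy corpora invented for illustration; these are not real records.
PATENTS = {
    "US-A": "Method for producing recombinant erythropoietin in host cells",
    "US-B": "Epoetin formulation for treating anemia",
    "US-C": "Bicycle gear assembly",
}
COURT_DOCS = {
    "case-1": "Plaintiff asserts infringement of erythropoietin patent US-A.",
    "case-2": "Dispute over a lease agreement.",
}


def expand(keyword):
    """Step 1: expand a keyword into itself plus its known synonyms."""
    key = keyword.lower()
    return {key, *SYNONYMS.get(key, [])}


def search(corpus, terms):
    """Step 2: search one domain independently; return matching ids."""
    hits = set()
    for doc_id, text in corpus.items():
        words = set(re.findall(r"[a-z0-9-]+", text.lower()))
        if words & terms:
            hits.add(doc_id)
    return hits


def cross_reference(patent_hits, court_hits, court_corpus):
    """Step 3: link court documents to the patents they cite by id."""
    links = {}
    for doc_id in court_hits:
        text = court_corpus[doc_id].lower()
        cited = {p for p in patent_hits if p.lower() in text}
        if cited:
            links[doc_id] = cited
    return links


terms = expand("Erythropoietin")
patent_hits = search(PATENTS, terms)
court_hits = search(COURT_DOCS, terms)
links = cross_reference(patent_hits, court_hits, COURT_DOCS)
```

In this toy run the synonym "epoetin" is what retrieves the second patent, which a search on the original keyword alone would miss, and the cited identifier is what ties the court document back to a patent.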

The test case showed that synonyms and related concepts (such as terms derived from ontologies) could be used in the keyword expansion and searching steps to capture additional documents in the patent domain.  Kesan and his colleagues then improved upon basic keyword searching by incorporating heuristic weighting functions into each search to raise the scores of relevant documents and lower the scores of irrelevant ones.  (Improvements were measured by the final weighted scores of five “core” erythropoietin patents.)  They also demonstrated that court documents frequently cite the patent number and omit many of the details found in patent applications.  This allowed them to search the court-document domain using fewer keywords than in the patent domain and still discover all of the predetermined “core” court documents.  As a result, Kesan and his colleagues established a framework for developing a comprehensive search tool and used a test case to establish proof of concept.
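
To make the idea of heuristic weighting concrete, here is a small Python sketch of one possible scoring scheme.  The weights, field names, and sample documents are all assumptions made for illustration; the post does not describe the actual weighting functions used in the article.  The sketch assumes that a match in a title counts more than a match in an abstract and that the original keyword counts more than an expanded synonym.

```python
# Hypothetical weights, chosen only for illustration.
FIELD_WEIGHTS = {"title": 3.0, "abstract": 1.0}
TERM_WEIGHTS = {"erythropoietin": 1.0, "epo": 0.5}

# Toy documents invented for this example.
DOCS = {
    "core": {
        "title": "Recombinant erythropoietin production",
        "abstract": "erythropoietin is expressed in host cells",
    },
    "related": {
        "title": "Anemia therapy",
        "abstract": "patients received epo weekly",
    },
    "noise": {
        "title": "Gear assembly",
        "abstract": "a bicycle gear",
    },
}


def score(doc, term_weights=TERM_WEIGHTS, field_weights=FIELD_WEIGHTS):
    """Sum field weight * term weight * occurrence count over all pairs."""
    total = 0.0
    for field, field_weight in field_weights.items():
        words = doc.get(field, "").lower().split()
        for term, term_weight in term_weights.items():
            total += field_weight * term_weight * words.count(term)
    return total


def rank(docs):
    """Return document ids ordered from highest to lowest score."""
    return sorted(docs, key=lambda doc_id: score(docs[doc_id]), reverse=True)


ranking = rank(DOCS)
```

A scheme of this kind pushes documents that use the original term prominently toward the top and leaves documents with no matching terms at a score of zero, which is the general effect the weighting step is meant to achieve.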

Kesan and his colleagues identified multiple limitations in their work, including the inability to automate the retrieval of documents from Lexis or PACER and long processing times.  However, the intent of the paper was to create a framework for developing search tools capable of identifying relevant information across multiple domains, which Kesan and his colleagues successfully demonstrated across two very large and very different domains.

With additional 102(e) prior art becoming available per the Leahy-Smith America Invents Act, the development of robust search tools is now more important than ever before.  As noted by Kesan and his colleagues, a substantial amount of work remains before such a framework can be successfully scaled to meet the computational demands required to use such a tool on a database of millions of files.  Although significant strides must be made before the methods described in this article can be developed into a functional framework for patent searching, this work represents an important first step toward a much-needed solution to a pervasive problem.

Posted by Randall Beane, PhD (rbeane@smu.edu), a 2013 Juris Doctor candidate at SMU Dedman School of Law and research assistant to Sarah Tran.  His primary scholarship interests are intellectual property, biotechnology, and environmental law.