Monday, December 11, 2017

Big Patent Data from Google

UPDATE: I got some new information from the folks at Google, discussed below.

Getting patent data should be easier, but it's not. It is public information, but gathering, organizing, and cleaning it takes time. Combining data sets also takes time. Companies charging a fee do a good job providing different data, but none of them have it all, and some of the best ones can be costly for regular access.

Google has long had patent data, and it has been helpful. Google Patents is way easier to use than the USPTO website (though I think their "improvements" have actually made it worse for my purposes, but that's for another day). Google also hosted "bulk" data from the PTO, but those data dumps required work to import into a usable form. I spent two days writing a Python script to parse the XML assignment files, but then I had to keep rerunning it to stay current, and pay for storage of the huge database. The PTO Chief Economist has since released (and kept up to date) the same data in Stata format, which is a big help. But it's still not tied to, say, inventor data or litigation data.
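To give a flavor of the work involved, here is a minimal sketch of that kind of parsing script. It is not the script I wrote, and the element names (assignment, reel-frame, assignor, assignee, patent-number) are a simplified, hypothetical stand-in for the real PTO schema, which is more deeply nested. The point is the approach: stream through the file one record at a time rather than loading the whole thing into memory.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample. The real bulk files are far larger and
# use a different, more deeply nested schema than these element names.
SAMPLE_XML = b"""<?xml version="1.0"?>
<assignments>
  <assignment>
    <reel-frame>012345/0678</reel-frame>
    <recorded-date>20170103</recorded-date>
    <assignor>Alice Inventor</assignor>
    <assignee>Example Corp</assignee>
    <patent-number>9000001</patent-number>
    <patent-number>9000002</patent-number>
  </assignment>
  <assignment>
    <reel-frame>012346/0001</reel-frame>
    <recorded-date>20170315</recorded-date>
    <assignor>Bob Inventor</assignor>
    <assignee>Other LLC</assignee>
    <patent-number>9000003</patent-number>
  </assignment>
</assignments>
"""


def iter_assignments(source):
    """Yield one dict per <assignment> element, streaming through the file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "assignment":
            continue
        yield {
            "reel_frame": elem.findtext("reel-frame"),
            "recorded": elem.findtext("recorded-date"),
            "assignor": elem.findtext("assignor"),
            "assignee": elem.findtext("assignee"),
            # One assignment can cover several patents.
            "patents": [p.text for p in elem.findall("patent-number")],
        }
        # Drop the record's children once it has been consumed, to limit memory use.
        elem.clear()


records = list(iter_assignments(io.BytesIO(SAMPLE_XML)))
for record in records:
    print(record["assignee"], record["patents"])
```

Even in this toy form you can see the chore: every new data dump means running the parser again and reloading the results into your own database.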

So, now, Google is trying to change that. It has announced the Google Patents Public Datasets service on its Cloud Platform and in Google BigQuery. A blog post describing the setup is here and the actual service is here. With the service, you can use SQL queries to search across multiple databases, including: patent data, assignment data, patent claim data, PTAB data, litigation notice data, examiner data, and so forth.
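To illustrate what "search across multiple databases" buys you, here is a small sketch of the kind of cross-table query I have in mind. It runs on a local in-memory SQLite database rather than on BigQuery, and the table names, column names, and rows are all made up for the example; the actual BigQuery tables are named and structured differently. The shape of the query is the part that carries over: join patents to assignments and litigation on the patent number, then aggregate.

```python
import sqlite3

# In-memory stand-in for the hosted datasets. All tables, columns, and rows
# here are hypothetical and exist only to show the shape of the query.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patents (patent_number TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE assignments (patent_number TEXT, assignee TEXT, recorded TEXT);
    CREATE TABLE litigation (patent_number TEXT, case_name TEXT);

    INSERT INTO patents VALUES
        ('9000001', 'Widget'),
        ('9000002', 'Gadget'),
        ('9000003', 'Gizmo');
    INSERT INTO assignments VALUES
        ('9000001', 'Example Corp', '2017-01-03'),
        ('9000002', 'Example Corp', '2017-02-10'),
        ('9000003', 'Other LLC', '2017-03-15');
    INSERT INTO litigation VALUES
        ('9000001', 'Example Corp v. Rival Inc'),
        ('9000001', 'Example Corp v. Another Co'),
        ('9000003', 'Other LLC v. Someone');
    """
)

# For each assignee: how many patents it holds and how many litigation
# records touch those patents. LEFT JOIN keeps patents never litigated.
QUERY = """
SELECT a.assignee,
       COUNT(DISTINCT p.patent_number) AS patents,
       COUNT(l.case_name) AS cases
FROM patents AS p
JOIN assignments AS a ON a.patent_number = p.patent_number
LEFT JOIN litigation AS l ON l.patent_number = p.patent_number
GROUP BY a.assignee
ORDER BY a.assignee
"""

rows = conn.execute(QUERY).fetchall()
for assignee, patents, cases in rows:
    print(f"{assignee}: {patents} patents, {cases} litigation records")
```

Doing this by hand means importing and cleaning three separate data sets first; with everything hosted in one place, it is a single query.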

There's good news and bad news with the system. The good news is that it seems to work pretty well. I was able to construct a basic query, though I thought the user interface could be improved with some of the features you see in better drag-and-drop database tools (especially where there are so many long database names).

The other good news is that it is apparently expandable. Google will be working with data aggregators to include their data (presumably requiring a membership), so that you can easily combine data from multiple sources at once. Further, there is other data in the system, including HathiTrust books - so you could, for example, see if inventors have written books, or tie book publishing to inventing over periods of years.

Now, the bad news. First, some of the databases haven't been updated in a while - they are what they were when first released. This leads to the second problem: you are at the mercy of the PTO and Google. If all is well, then that's great. But if the PTO doesn't update, or Google decides this isn't important any more (any Google Reader fans out there?), then bye-bye.

I look forward to using this as long as I can - it's definitely worth a look.

Here is what I learned from Google:
1. Beyond data vendors, anyone can upload their own tables to BigQuery and choose who has access. This makes it a great fit for research groups analyzing private data before publication, as well as operating companies and law firms generating reports that combine private portfolio information with paid and public datasets.

2. Each incremental data addition expands the potential queries to be done, and you're no longer limited by what a single vendor can collect and license.

3. The core tables are updated quarterly, so the next update is due out shortly. As Google adds more data vendors, alternatives for the core tables will appear and be interchangeable with different update frequencies.
