Tuesday, October 20, 2015

How Often are DMCA Takedown Notices Wrong?

A couple weeks ago, I blogged about Lenz v. Universal Music and wondered how often "bad" DMCA notices are actually sent. My theory was one of availability and salience - we talk about the few nutty requests, but largely ignore the millions of real takedown requests. I wrote:
How important is this case in the scheme of things? On the one hand, it seems really important - it's really unfair (pardon the pun) to take down fair use works. But how often does it happen? Once in a while? A thousand times a month? Ten thousand? It seems like often, because these are the takedowns we tend to hear about; blogs and press releases abound. However, I've never seen an actual number discerned from data (though the data is available).
While there are some older studies on smaller data sets, no one has attempted to tease out the millions of notices that come in each month now (like 50 million requests per month!). It turns out, though, that someone has attempted a comprehensive study through 2012. Daniel Seng (Assoc. Prof. at NUS/JSD student at Stanford) downloaded Google's transparency data and performed cross-checks with Chilling Effects data to give us 'Who Watches the Watchmen?' An Empirical Analysis of Errors in DMCA Takedown Notices:
Under the Digital Millennium Copyright Act (DMCA) takedown system, to request for the takedown of infringing content, content providers and agents issuing takedown notice are required to identify the infringed work and the infringing material, and attest to the accuracy of such information and their authority to act on behalf of the copyright owner. Online service providers are required to evaluate such notices for their effectiveness and compliance before successfully acting on them. To this end, Google and Twitter as service providers are claiming very different successful takedown rates. There is also anecdotal evidence that many of these successful takedowns are "abusive" as they do not contain legitimate complaints of copyright or erroneously target legitimate content sites. This paper seeks to answer these questions by systematically examining the issue of errors in takedown notices. By parsing each individual notice in the dataset of half a million takedown notices and more than fifty million takedown requests served on Google up to 2012, this paper identifies the various types of errors made by content providers and their agents when issuing takedown notices, and the various notices which were erroneously responded to by Google. The paper finds that up to 8.4% of all successfully-processed requests in the dataset had "technical" errors, and that additionally, at least 1.4% of all successfully-processed requests had some "substantive" errors. As all these errors are avoidable at little or no cost, this paper proposes changes to the DMCA that would improve the takedown system. By strengthening the attestation requirements of notices, subjecting notice senders to penalties for submitting notices with unambiguously substantive errors and clarifying the responsibilities of service providers in response to non-compliant notices, the takedown system will remain a fast, efficient and nuanced system that balances the diverse interests of content providers, service providers and the Internet community at large.
I think this is a really interesting and useful paper, and the literature review alone is well worth a read. The takeaways, though, depend on your priors. Some thoughts on the paper after the jump.

The strongest part of the paper is the analysis of technical defects - failure to attest under penalty of perjury (which happens very rarely, given that web forms are used) as well as submission of mangled or missing URLs (technically URIs), which happens in about 8% of the notices but only 5% of the URLs (there are multiple URLs per notice). Here's where priors matter. Many would say that a 5% error rate of blank or bad URLs is too high and shows that copyright owners are firing off takedown notices indiscriminately. Others (count me among them) would say that when you are processing millions of URLs where content is transient, and where robots can make mistakes, 5% isn't a big deal - especially because it means there's nothing for the provider (who has automated URL checking) to do differently other than reject the URL as bad.
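As an aside, here is a minimal sketch, in Python, of the kind of automated URL sanity check at issue. This is purely my own illustration - the notice format and function names are invented, and it is not the paper's methodology or Google's actual pipeline:

from urllib.parse import urlparse

def classify_reported_url(url):
    # Bucket a reported URL as blank, malformed, or plausibly well-formed.
    if not url or not url.strip():
        return "blank"
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return "malformed"
    return "ok"

def tally_url_errors(notices):
    # Count URL-level outcomes across a batch of notices; each notice is
    # assumed (hypothetically) to be a dict with a "urls" list.
    counts = {"blank": 0, "malformed": 0, "ok": 0}
    for notice in notices:
        for url in notice.get("urls", []):
            counts[classify_reported_url(url)] += 1
    return counts

print(tally_url_errors([{"urls": ["http://example.com/file1", "", "not a url"]}]))
# prints {'blank': 1, 'malformed': 1, 'ok': 1}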

Another technical error is when the copyright owner's name is left out. I'm not convinced that this is an error. The article notes that this only happens with foreign copyright holders, and the examples provided (when considered via Chilling Effects) show non-Latin characters in the requests. It's possible that the copyright holder name is merely missing from the database due to technical reasons on Google's end. In any event, there are very few of these problems, and Seng classifies them as technical.

The most important question is also the hardest one to answer: substantive errors. Seng concedes that fair use analysis is impossible at this scale. I'm not so sure - I wonder whether you could easily weed out the URLs that are obviously pirated works (virtually all of them, as Seng later shows) and then consider the remaining takedowns by hand. I don't know, because we don't get a sense of how many blog posts, etc. are buried in with the BitTorrent links. That would be useful information.
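To make the idea concrete, a rough triage might look something like the following - this is entirely hypothetical (the marker list is invented, and a real study would need a far more careful classifier plus human checking of whatever remains):

# Hypothetical markers of obvious piracy links.
PIRACY_MARKERS = ("torrent", "megaupload", "rapidshare", "filesonic")

def needs_hand_review(url):
    # True if the URL does not look like an obvious piracy link and so
    # should be set aside for manual (e.g., fair use) consideration.
    lowered = url.lower()
    return not any(marker in lowered for marker in PIRACY_MARKERS)

reported = [
    "http://example-torrent-site.com/album.torrent",
    "http://someblog.example.com/2012/03/review-of-the-album",
]
print([u for u in reported if needs_hand_review(u)])
# only the blog-post-style URL survives for hand review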

The article uses two proxies to find substantive errors. The first is improper attestations of copyright ownership. Seng points to improperly named copyright holders as an example (that is, the reporting agency lists a Microsoft employee's name as the copyright holder instead of Microsoft). I disagree with Seng and would put this in the technical error category. Putting a contact name in the field asking for the copyright holder is likely a mistype. No rational Microsoft employee or agent believes he or she owns the copyright, and the company name is clearly marked on the notice. I think this, too, is pretty minor, as there is no indication that the copyrighted work is not identified. Rather, it's probably a problem with the web form - and this error is also a very small percentage of all claims.

The second proxy is the "dead site" proxy. The article examines takedown notices sent after sites had shut down, and some sent after Google had apparently removed all listings of those sites from its search engine. The article considers these errors substantive because they show that the copyright holder is not considering actual infringement. I'm skeptical of this conclusion. First, one of the sources the article cites (a blog post) includes comments showing that Google search was still reporting results from one of the closed sites eight months after Google supposedly removed all the listings. Of course, the target site, Megaupload, was long dead; your priors will affect whether you think the reporters had to check that these were live links before reporting them. I think it's not a big deal, but I can see how others might disagree.

Second, even if the reporters should have made sure those links were live, there is no real dispute that all of the dead links pointed to pirated copies of the works that were the subject of the takedowns. In other words, the most one can conclude is that the reporting agencies were aggressive about removing links without checking that those links were actually in Google's index or even still in use. I don't think this pattern of behavior leads to the conclusion that valid fair use was being ignored by the reporting agencies on any great scale.
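For concreteness, here is what such a pre-flight liveness check might look like - again, my own illustration, not anything from the paper or from any reporting agency's actual tooling:

import urllib.error
import urllib.request

def link_appears_live(url, timeout=10):
    # Rough liveness probe: does the URL still respond with a non-error status?
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError):
        return False

# e.g., link_appears_live("http://example.com/") -> True if the page still resolves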

Where does this leave us? On the one hand, the article presents us with a new and better look at DMCA takedown notices - a window we didn't have before. On the other hand, we still don't know how often substantive errors are made, or whether those errors are substantial in absolute magnitude or as a percentage of all takedowns. With any luck, Professor Seng can extend his analysis in future work to provide some answers.