Thursday, February 5, 2015

What is the Value of the Public Domain?

Paul Heald (Illinois), Martin Kretschmer, and Kris Erickson (both Univ. of Glasgow) have posted The Valuation of Unprotected Works: A Case Study of Public Domain Photographs on Wikipedia. The article is an attempt to value a small slice of the public domain. Here is the abstract:
What is the value of works in the public domain? We study the biographical Wikipedia pages of a large data set of authors, composers, and lyricists to determine whether the public domain status of available images leads to a higher rate of inclusion of illustrated supplementary material and whether such inclusion increases visitorship to individual pages. We attempt to objectively place a value on the body of public domain photographs and illustrations which are used in this global resource. We find that the most historically remote subjects are more likely to have images on their web pages because their biographical life-spans pre-date the existence of in-copyright imagery. We find that the large majority of photos and illustrations used on subject pages were obtained from the public domain, and we estimate their value in terms of costs saved to Wikipedia page builders and in terms of increased traffic corresponding to the inclusion of an image. Then, extrapolating from the characteristics of a random sample of a further 300 Wikipedia pages, we estimate a total value of public domain photographs on Wikipedia of between $246 to $270 million dollars per year.
I think this is a fantastic study. My thoughts follow the jump.

The authors calculate the value of the public domain in two different ways. First, they calculate what these images would have cost on pay services like Getty and Corbis. It turns out to be more than I thought, with the default rate being around $110-$120 per year per image. This quickly adds up, both in their sample of authors, lyricists, and composers and for the site as a whole, where they estimate some 43% of all pages contain a public domain image (more on that later).

They then use the following logic (backed by some evidence): pages with images get more hits, and pages with more hits are worth more. Calculating just how many more hits are caused by images (and not something else) is tricky, but they use a variety of reasonable ways to do so. Applying these estimates from their sample, they extrapolate to increased page views worth more than $200 million due to the ability to use public domain images.

I'm skeptical of empirical papers (I know, people who live in glass houses), and I'm doubly skeptical of valuation papers. I think the authors do a good job of answering most of my skepticism here. They anticipate objections, such as differences in hit rates due to author popularity or other page quality. The comparison to pay services was especially effective. Interestingly, they show for one segment that the increased value equals exactly the cost the services would have charged. Sounds like the services have a pretty accurate pricing model.

Despite my general agreement with the paper's conclusion, that having these photos in the public domain added value through saved costs and increased viewing, I remain skeptical and have a couple of general thoughts:

1. The sample size of pages used to extrapolate value to the whole site is 300, which seems awfully small given the great diversity of web page types. The authors discuss representativeness for many types of pages (like bios and places), but I wonder whether pages not conducive to photos were adequately represented. Typically, you address this uncertainty with a confidence interval on the final value, but I didn't notice one reported. They report a confidence interval on the percentage of public domain photos at about 6% either direction, which is not huge (and perhaps why they don't focus on it for the final number), but that assumes that the 300 pages were truly representative of the stratified population.

2. The valuation data relies on page view values from a third party, based on advertising. The authors point to other studies that show similar values (though there is a range), but valuing hits is dicey, especially so when the report maker may have an interest. There is nothing the authors could have done about it, and they handle it well, but it's nonetheless an important piece of the puzzle that is not thoroughly vetted, which leaves me skeptical. Of course, you could cut the value in half, and it would still be more than $100 million - the key point is still made.
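
For readers who want to check the arithmetic behind these two points, here is a minimal Python sketch. It assumes a simple random sample, a 95% confidence level, and the normal approximation for a proportion; the paper may well have computed its interval differently, and the function name is purely illustrative. The 43% share, the sample of 300, and the $246 to $270 million range are the figures quoted above.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of a normal-approximation interval for a sample proportion.

    Assumption: simple random sample; z=1.96 corresponds to a 95% level.
    """
    return z * math.sqrt(p * (1 - p) / n)

p_hat = 0.43  # share of pages with a public domain image (quoted above)
n = 300       # size of the random sample of pages (quoted above)

moe = margin_of_error(p_hat, n)
print(f"margin of error: +/- {moe:.3f}")
print(f"interval: {p_hat - moe:.3f} to {p_hat + moe:.3f}")
print(f"margin relative to the estimate: {moe / p_hat:.1%}")

# Point 2: halve the headline range and see whether it stays above $100M.
low, high = 246e6, 270e6
print(f"halved range: ${low / 2 / 1e6:.0f}M to ${high / 2 / 1e6:.0f}M")
```

Under these assumptions the margin comes out at roughly 5.6 percentage points, in line with the "about 6%" reported, and the halved range still sits above $100 million. What the sketch cannot address is the concern in point 1: the formula only measures sampling noise, not whether the 300 pages were representative in the first place.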

This leads to some final takeaways for me:

1. The authors talk about the ability to use orphan works whose public domain status may be unclear. I think that's important. I also think that copyright terms are too long and should certainly not be lengthened, though I believe this becomes a complicated question for long-term serialized works, like Disney and comic books. Having term lengths based on continued "working" of the copyright through new works wouldn't bother me in the least.

2. I would have coded some of the "public domain" photos differently. A lot of the photos were used under a Creative Commons license. I would have still included them in the study; this is a dedication in the sense that the public may use them (and that costs are lowered). But analytically, when we are considering the value of the public domain, we are valuing two different things if they are lumped together: we are valuing expiry v. dedication/licensing.

3. And to be clear, I do not consider most Creative Commons licenses to be public domain, because most include some sort of string attached, whether it be non-commercial use or attribution. That means that the owner is exerting control. To me, this is copyright acting as it should - giving the author the control to license as he/she sees fit. Attribution licenses can save gobs of money, but I don't know what they tell us about copyright policy for expired and orphan works, because those dedicating their works under Creative Commons might not want their works treated that way; they might want that attribution.

4. I don't know what to make of the fact that Getty/Corbis license, for money, images that are in the public domain. Is this snake oil? Probably, but maybe part of it is ease of search - a single repository to search, and a quick transaction. This leads me to think that part of the value (or rather, cost) not captured here is the search time by the page builder, who must find the image and clear the rights. This is surely not $100 per year, but it's not zero.

5. One interesting finding of the study is that older authors were more likely to have images on their pages, not less. This is surprising, because we would expect fewer photos the older someone is. But a public domain story explains much of it as far as Wikipedia goes - the older the author, the more likely photos are in the public domain and to be used for free.

6. I wonder what in-line linking does to all of this. Why not just link to authorized web photos with an image tag without copying the image over? Is there a Wikipedia rule against this?

This all leads me to wonder where all of this leaves copyright in general. This paper is in part intended as an answer to claims that copyright must be extended to encourage more distribution. Here, we see more distribution and viewing on Wikipedia if the image is freely shareable - these folks are not going to pay, and if they had to pay, they would have to ramp up advertising to generate the funds to do so.

But at the same time, why were these photos available? Were they dug out of private photo albums? I suspect that many of these images first appeared in books - books that encouraged distribution and sharing of photos because they were initially protected by copyright. Without those books, it might have been harder for Wikipedia page builders to go find those photos. But now, in an age of crowdsourcing, perhaps individual users can plunder the private photo albums and we no longer need to rely on source research and the protection copyright might bring to the results. In either event, I'm convinced we don't want that protection to last forever.

