Tuesday, January 23, 2024

Beyond the AI Black Box: Links to Articles and Excerpts From Interview With Charlotte Tschider

Maybe others are in the same boat. Until ChatGPT took the world by storm, I didn't have specific plans to write about AI at a granular level. Now my students and in-laws are asking about it. I have to write essays explaining how AI affects my work. The cases are cropping up. For others seeking to write about AI, and especially generative AI, I am sharing links to some articles I found helpful for understanding basic issues, as well as excerpts from an interview I did with Professor Charlotte Tschider, where I asked her questions about AI that she patiently answered.

Links to readings:

The Sedona Conference published a very user-friendly explanation and survey of the implications of generative AI for the law in general, spotting all sorts of issues across practice areas: Artificial Intelligence (AI) and the Practice of Law by Hon. Xavier Rodriguez.

Two articles I found useful for understanding the very specific mechanics of how generative AIs are actually created (with an eye to assessing copyright issues) are: Building and Using Generative Models Under US Copyright Law by Van Lindberg and Talkin’ ‘Bout AI Generation: Copyright and the Generative-AI Supply Chain, by Katherine Lee, A. Feder Cooper, and James Grimmelmann.

On trade secrecy and privacy issues implicated by people using large language models like ChatGPT, I found Dave Levine's essay and this HJOLT note by Amy Winograd really helpful. A pre-ChatGPT paper co-authored by Sharon Sandeen questions whether AI-related information like algorithms and training data qualifies for trade secrecy.

On patentability of AI, I found this Stanford Law panel from 2019 featuring Lisa Ouellette and Mark Lemley very informative and not outdated. On patentability of AI and patentability of AI outputs, I found this 2020 USPTO report very, very helpful. Other pre-ChatGPT work on patents and AI that I found helpful includes articles by Ryan Abbott (the "everything is obvious" concern), Liza Vertinsky (introducing "M/PHOSITA" for obviousness that takes into account machine as well as human skill in the art), Dennis Crouch (assessing patentability of AI-generated inventions through the lens of corporate ownership and copyright work for hire doctrine), Keith Robinson (enablement issues raised by patenting AI), John Villasenor (including his recent Brookings paper on patentability of AI-generated inventions), and Melissa Wasserman et al. (discussing patentability of AI-generated inventions from the perspective of patent incentives theory).

I found two articles especially compelling on why AI training on copyrighted works is not generally copyright infringement...except when it is. Mark Lemley and Bryan Casey's article argues that AI training is generally "fair learning," but that some outputs may infringe, such as wholesale copying that competes with the copyright owner's "core market." Matt Sag's article argues that training by AIs is generally not actionable because it's "non-expressive" use (copying expressive works for non-expressive purposes, just like in the Google Books case), but concedes that there is a risk that the generative AI will "memorize" and spit out whole copyrighted works, as may have occurred with the New York Times' content (although it's now appearing the prompts may have been highly leading?). Both approaches are outputs-focused: they look at whether the output of the AI is "substantially similar" to copyrightable expression and/or is fair use, and adopt a default assumption that training alone usually isn't infringing.

On how copyright fair use doctrine will apply to AI training: Pam Samuelson gives a summary of on-point fair use cases that might be applied in asking whether training on copyrighted works is fair use, giving arguments for both sides.  I also found this article on EU law by Andres Guadamuz helpful for thinking through high-level copyright questions of AI-authorship and AI-infringement.   

Other pre-ChatGPT work I found helpful: Carys Craig (copyright and AI authorship), Daryl Lim (AI and innovation policy), Arti Rai & Nicholson Price (machine learning in medicine and drug development).

And last but not least, Charlotte Tschider's papers, including Beyond the Black Box and Humans Outside the Loop.  Excerpts from our interview are below.

Excerpts from interview with Prof. Tschider:

A few weeks ago I interviewed Charlotte Tschider, a professor at Loyola University Chicago School of Law. Her 2021 paper, Beyond the Black Box, significantly deepened my understanding of AI. I asked her some very basic questions and tried to clear up some confusions I've been having.

CAH: Let's start with what artificial intelligence is. If you do an internet search for artificial intelligence, or AI, you often get something like "machine-assisted intelligence." Do you agree with that definition of AI?

CT: No. I do not. It's more complicated, and definitions matter a lot, including for intellectual property law. It would be better to say something like "using a machine or computer to perform tasks commonly associated with humans." That way we avoid making the claim that machines emulate human intelligence, because they don't. When we talk about AI, what we are really talking about is very, very robust computing power that creates algorithms that can mine data and create inferences and predictions from that data.

CAH: How does AI differ from a prior technology with which IP scholars have become familiar: Software?    

CT: AI right now is mostly versions of machine learning, which leverages data to create more and more complex algorithms. Software, loosely, is computer code and systems that are created and fully imagined by humans. Typically, with software, we know what it can do. Its functionality is limited. With AI, there is a degree to which we cannot know how decisions are made. So, for example, we get unpredictable results. We might get inaccurate results, so-called "hallucinations."

Another difference is that software involves creating computer code that can be re-used for other purposes. Typically, if you took a fully developed AI model and wanted to re-use it in a new application, you would probably have to do something different.

That said, a lot of software now incorporates AI too, so the line between them is blurring.

CAH: I read your article, Beyond the Black Box, twice. But before I read your article I often thought of AI as "just algorithms." But it is not accurate to say that AI is ultimately "just algorithms," correct? You talk in the paper about how AI runs on algorithms and that it's the algorithms that make it operate, but you draw a distinction between today's complex AI systems that run on machine learning, deep learning, neural networks, what have you, and the "human-created algorithms" that we often think about in areas like internet search or credit scoring. (689-690).

Can you explain this distinction between human-created algorithms and the complex algorithms in advanced machine learning models?

CT: Ordinarily algorithms are made by humans. Most human-created algorithms are "locked" on release. We have an algorithm and it does "XYZ." Some machine learning algorithms might operate a lot like human-created algorithms. But often machine learning algorithms are much more dynamic. They continue to learn from new data over time. This means that the algorithm itself can be different a minute or an hour after the previous decision was made. It will be evolving and self-learning.

But humans are still involved at every link in the chain of AI development, not just at that initial moment of creation of the algorithm. This is not a linear model even for data scientists. The layers of algorithms can be unintelligible to humans, but there is still human involvement at all times. Humans are needed to fix and refine the algorithms, even if they can't fully understand them.

Although we focus on "algorithms," the paper explains that every other human decision about the AI system helps create the algorithm in a machine-learning system. The infrastructure, such as server selection, database selection, application or app design, and networking solution, all affects the performance of the algorithm. The algorithm results from several decisions about training data, production data, limits of decisions, hard-coded goals and expected outcomes, and the number of algorithms and outputs needed to achieve these outcomes (in neural networks, we call these 'layers').
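
To make the "locked" versus continually learning distinction concrete, here is a minimal Python sketch. It is an illustration only and does not come from the interview or the paper: the scoring rule, the `OnlineLinearModel` class, the learning rate, and all of the numbers are hypothetical. The fixed rule returns the same answer for the same input every time, while the toy model's answer shifts after it sees a single new example.

```python
# Illustrative only: a hand-written "locked" rule versus a model that keeps
# updating its weights as new data arrives. All names and numbers are hypothetical.

def locked_score(income: float, debt: float) -> float:
    """Human-created rule: fixed at release, same inputs always give the same output."""
    return 0.7 * income - 0.3 * debt


class OnlineLinearModel:
    """A tiny linear model updated by one gradient step per new example."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights = [0.0] * n_features
        self.learning_rate = learning_rate

    def predict(self, x: list[float]) -> float:
        return sum(w * xi for w, xi in zip(self.weights, x))

    def learn(self, x: list[float], target: float) -> None:
        # Squared-error gradient step: nudge each weight toward the target.
        error = self.predict(x) - target
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]


model = OnlineLinearModel(n_features=2)
applicant = [1.0, 0.5]

before = model.predict(applicant)
model.learn([1.0, 0.0], target=1.0)   # new data arrives after "release"
after = model.predict(applicant)

print(locked_score(1.0, 0.5), locked_score(1.0, 0.5))  # identical every time
print(before, after)  # the learned model's answer has changed
```

Even in this toy, a human chose the features, the learning rate, and the data the model sees, which is the point made above about human decisions shaping the resulting algorithm.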

CAH: And that issue, the continuing role of the human decision-maker, is the topic of your new paper, Humans Outside the Loop, correct? I'll post that link; it is forthcoming in the Yale Journal of Law and Technology.

Let's turn to generative AI, the species of AI that ChatGPT falls into. How would you define this as a species of AI? How is generative AI different from other AI?

CT: I think of generative AI as using machine learning models but in a way that also includes some ability to mine and analyze human language and other expression (most of the current focus is on pure language models, rather than language-to-image models as we might see in Stable Diffusion). Human language is actually very inefficient and difficult to analyze and express, whether it's English or anything else. Machines communicate in a truncated format that can be very efficient. Language models have to be able to convert speedy computing language (zeros and ones) into something that comes back out as human language. The more complex the task you expect the computer to perform, the harder that is. The greater variation of language needed for a particular use defines whether the language model is a small or large model.

Basic chat functions, like IVRs (interactive voice response systems) that you interact with when you call a toll-free number, or basic chat on Amazon, those are comparatively simple AI implementations if the universe of responses is limited. We know at least some universe of what people might ask about. If you're on an Amazon chat, for example, you might be asking about shipping speed or cost of shipping or something like that. And if it gets too complex, where the AI doesn't understand it, it hikes it over to a human.

So that's a very simple language model. In contrast, ChatGPT is a "large language model." The amount and the variety of language is much larger, as is the complexity of potential prompts, including a universe of contextual cues, including prompts for linguistic tone. The universe of language that this system is trained on is a subset of what you might find on the web. This is part of why each release may not be 100% current with what is currently available on the web through a Google search.
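
The "limited universe of responses" idea, with a handoff to a human when the system is out of its depth, can be sketched in a few lines of Python. This is a deliberately simplistic stand-in, not how any real retailer's chat works: production systems typically use trained models to classify what the customer wants, whereas this sketch just matches keywords. The topics, replies, and the `respond` function are all hypothetical.

```python
# Illustrative only: hypothetical topics and placeholder replies for a support chat.
CANNED_REPLIES = {
    "shipping speed": "Placeholder reply about delivery times.",
    "shipping cost": "Placeholder reply about shipping fees.",
    "return": "Placeholder reply about how to start a return.",
}

HANDOFF = "Let me connect you with a human agent."


def respond(message: str) -> str:
    """Answer from the small known universe of topics, or hand off to a human."""
    text = message.lower()
    for keyword, reply in CANNED_REPLIES.items():
        if keyword in text:
            return reply
    # Nothing recognized: the request is outside the limited universe.
    return HANDOFF


print(respond("What is the shipping cost to Chicago?"))
print(respond("Can you write me a sonnet about my package?"))
```

The contrast with a large language model is the size of that universe: here it is three topics, while a system like ChatGPT is expected to handle open-ended prompts.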

CAH: One thing I am getting from this is just how much variety there is in how AI is designed, even between so-called chat bots.

CT: The reality is that every single AI implementation is different. You can't necessarily generalize about how AI functions, and you really have to talk to the person that creates it to understand specifically how it's implemented. But you can at least say, OK, generally machine learning does this. Generally neural networks do this. And the degree to which you can understand what's happening is inversely related to the number of layers in the system and the complexity of the system. In the law, we deal in generalities, but in regulating AI this becomes very difficult when we discuss things like safety, which is dependent on the details of a given implementation and its design for specific goals.

CAH: Let's talk about intellectual property. Let's talk about patents. First there is the patentability of the AI itself, the algorithms and so forth. You suggest in Beyond the Black Box that, from the public policy perspective, patents are superior because mandatory disclosure in patents could aid in transparency. (714-716). But when are patents actually available for AI?

CT: First, I'm not sure that patents are always superior, but they are recognized as a stronger form of IP because an inventor can enforce the patent against infringers, that is, the right to exclude others from making, using, or offering for sale the patented invention. Determining what is patentable about AI can be challenging in part because we have a history of denying patents for "abstract ideas," and some might consider AI as just an abstract idea. If we limit AI to just an algorithm, sure. But that is not all AI is. AI is not just taking an existing process and automating it. It's inherently transformative when we factor in the underlying system powering it and the process that creates it.

CAH: So even under the Alice framework, you think a generative AI model, for example, could be patented?

CT: Yes, the "process" involved in creating AI is a lot more than just an "abstract idea."  So except for a simple algorithm itself,  I think there is a way to get there.  The patent office patents many AI inventions. Within the limits of obviousness, enablement, and the other criteria of course.

CAH: And then there's the outputs of AI, the Thaler v. Vidal issue of whether AI can be an inventor for purposes of patent law.  What if I said, "An AI-generated invention cannot be patented." Do you agree?

CT: I don't agree with that, no. And I honestly think that the Patent Office really was following the Copyright Office on this, as a starting point, without fully understanding this technology yet.

CAH: You don't agree with the stance that "an AI-generated invention cannot be patented" as a matter of law or as a matter of policy?

CT: Either actually. The reason I don't agree as a matter of law is because there are a variety of inventions that are patentable where humans have set things in motion but may not understand every detail of how it works. As humans, we've made decisions that ultimately result in something that is protectable. But did we make discrete choices about every single step in that process? I would say no, especially when we're talking, for example, about things like pharmaceuticals. In many cases, we have combined some things together, and the molecular compound produced a result we didn't expect. And we patented those. There's this presumption that a human is sort of making every single discrete choice related to an invention, but the reality is that a lot of inventions occur by mistake and by happenstance and by luck. What matters is that a human designed it, not that a human understands all of it. Moreover, patents are cabined by what is actually written in the claims and written description, so what the invention is, or its ability to produce additional inventions, presumably would be limited in nature. However, expressive works, which are stand-alone and not cabined by a prosecution process, are likely distinct.

If a human creates a patentable AI system that then produces something that is potentially inventive and would meet the other characteristics of an invention, why would we prevent that human from claiming the result of the invention that they created?

CAH:  So at minimum, with respect to the human that created the AI, there might be a way to get a patent for AI-generated outputs.

CT:   Yes. Let's use a real example. Let's say you create an AI application that is designed to identify new molecular compositions related to pharmaceuticals.   And guess what? It identifies 5 candidates. You test them and let's say one of them is totally legitimate. It very much could be a new composition. Is the expectation that you can't patent that?

CAH: Leaving patents behind, you suggest in Beyond the Black Box that AI is especially conducive to trade secrecy. You say, "the natural status of a dynamically inscrutable algorithm" makes it "a natural trade secret."  Parsing this, there could be many trade secrets here. You've got the algorithms. You've got the training data, that is the data that a specific AI was trained on. And you've got the overall methodology for how the model was developed and trained. 

What is the choice between patent and trade secrecy going to look like for AI?

CT: We often think about patents and trade secrets as interchangeable for AI. But the calculus is different depending on the type of information we are considering. Trade secrecy is probably going to be best for the training data and the algorithms, because they're reasonably easy to keep secret.  At least for the algorithm, we can't even understand what it is, at least today, or how it functions.  And for the training data set, that is not revealed to users or the public, and it is going to have some security around it. There is limited access to the dataset. This is all kept "reasonably" secret through a variety of different strategies.

But I think when it comes to protecting the whole system architecture, the design choices, the process of model development, I don't know that trade secrecy is the best approach. I actually think patent is better for this, for a variety of reasons. First, those would be comparatively easier to reverse engineer. Second, there's the public benefit in disclosure, even later disclosure (for innovation, safety, and fairness concerns). I should also mention that trade secrets are not the only impediment to disclosure. Confidentiality obligations, limited data sharing agreements, and other contractual obligations may increase the amount of information maintained as proprietary and are relatively intractable as creatures of private law. I talk about this in some detail in Legal Opacity: Artificial Intelligence's Sticky Wicket.

CAH: Ok, so for algorithms and training data, trade secrecy works, but for the overall method used to develop the AI, the "process" as you said, patent may be better. 

Now for the crazy part. In Beyond the Black Box, you suggest that complex AI algorithms can "survive reverse engineering..." I have heard that there are ways to reverse engineer even the most complex AIs. I sent you a 2016 paper on so-called model extraction attacks. This paper suggests it is possible to reverse engineer some aspects of AI, even when the user only has "black-box" access like with ChatGPT. What are your thoughts? Is it possible?

CT: It depends on what we are talking about. Most deep learning algorithms will not be susceptible to reverse engineering without other accompanying data about the training data and foundation model. If the data scientist who is privy to this information cannot determine why an algorithm made a particular decision, I would say it is nearly impossible to reverse engineer. Although model extraction attacks could provide some information about the algorithm(s), I don't think that would be enough to completely destroy trade secrecy. I think it might be enough to figure out something about the AI, but not to discover all of it, so it would be at best a "partial reverse engineering," is how I would describe it.

CAH: Yeah. There is still a lot of secret information left over. Some elements are known, but not all of it.  The whole combination of training data can be what trade secrecy calls a "combination" trade secret. 

CT: And again, this is where distinguishing between one algorithm and many algorithms matters, because with a very simple AI implementation, you might be able to figure out how the algorithm works, but in complex implementations where you have, let's say, 1000 layers of algorithms (models more like 'deep learning' neural networks), it will be nearly impossible. Understanding how those thousand layers work together and how each of those algorithms functions in relation to the other, with different weightings between the layers? No, it's not going to happen.
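
For readers curious what "black-box" reverse engineering looks like in the simplest possible case, here is a hedged Python sketch. It is not taken from the 2016 paper or from the interview; the secret weights, the `black_box` function, and `extract_linear_model` are hypothetical. It assumes the hidden "algorithm" is a single linear formula, in which case a handful of carefully chosen queries recovers it exactly.

```python
# Illustrative only: a toy "black box" whose secret is a single linear formula.
SECRET_WEIGHTS = [2.0, -1.0, 0.5]   # hypothetical; hidden from the outside user
SECRET_BIAS = 3.0


def black_box(x: list[float]) -> float:
    """The only thing an outside user can do: send an input, read the output."""
    return SECRET_BIAS + sum(w * xi for w, xi in zip(SECRET_WEIGHTS, x))


def extract_linear_model(query, n_features: int) -> tuple[list[float], float]:
    """Recover a linear model exactly using n_features + 1 queries."""
    bias = query([0.0] * n_features)          # all-zero input reveals the bias
    weights = []
    for i in range(n_features):
        probe = [0.0] * n_features
        probe[i] = 1.0                        # switch on one feature at a time
        weights.append(query(probe) - bias)   # the change in output is that weight
    return weights, bias


stolen_weights, stolen_bias = extract_linear_model(black_box, n_features=3)
print(stolen_weights, stolen_bias)
```

This shortcut works only because the toy model is one simple formula. With many stacked layers and different weightings between them, as described above, there is no comparably simple set of probes, which fits the characterization of extraction as at best a "partial reverse engineering."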

CAH: Thank you so much for talking this through with me. This has been so helpful.