Eps 95 - Understanding Data Integrity and Liability in Generative AI with Joy Butler Transcript

Erin Austin: Hello, ladies. Welcome to this week's episode of the Hourly to Exit podcast. I have a very special guest today: my law school classmate, Joy Butler. Joy, welcome, and thank you so much for joining us.

Joy Butler: Thank you, Erin. I am honored to have been asked to be a guest.

Erin Austin: Well, we're very excited to have you, because AI could not be more top of mind for this audience.

And so, as someone who has written and spoken extensively about AI, I definitely wanted to have you on to go deep. So before we get started, would you introduce yourself to the audience?

Joy Butler: Sure. So, as you already shared, I am an attorney, and in my law firm practice I provide product counsel services.

That essentially means I provide a combination of strategic and legal advice to companies that are going into new lines of business, launching new products or new features of existing products, or forming strategic partnerships. I come by that from two areas of law where I have in-depth knowledge. The first is the technology side, where I have worked on and helped to structure probably, literally, over 1,000 contracts over the course of my career: all the contracts one would need when doing business online and in digital technology, including end user license agreements and terms and conditions.

The other prong of my in-depth legal knowledge concerns entertainment and copyright, and this is where you and I overlap quite a bit. I work on a lot of creative content contracts, advise companies on protecting their copyrights and trademarks, and work with companies that want to use someone else's content, doing a lot of work in the rights clearance area.

Just to give your audience a little more of a flavor of the types of projects I might work on, most of them are in the digital technology and entertainment space. A couple of projects include helping an entertainment social media network launch, and working with an e-commerce retail site that was incorporating a lot of album cover work and original artwork. Another was an ad-supported stock simulation game. And here's something that may resonate with your audience: helping a professional in the finance area take a niche financial service he was offering and convert it into an online software-as-a-service product.

So that is me in a nutshell.

Erin Austin: Awesome. I think I can even tell you the day I first heard about AI. Where were you when you first heard about it? What was the context, and what were your initial thoughts?

Joy Butler: I don't remember the first time I heard about ChatGPT, right? That may be what you're referring to.

But actually, within my practice, I have for quite some time been experimenting with trying to take some of my knowledge and develop it into digital tools, making it more accessible to people. As you know, I've written a couple of books on my areas of in-depth knowledge. So one of the things I've been experimenting with is taking some of that knowledge and offering it in a digital format. One experiment I believe I shared with you was a contest and promotion tool, which asked a number of questions and then gave you a kind of checklist of the legal questions you might ask before going forward with that.

And I've actually used a tool that a lot of attorneys use. It is an interview construction tool targeted to the legal space, called Docassemble. It's open source, and I've spent a little bit of time tinkering around with it. That is a long way to answer your question: I was familiar with automation and artificial intelligence through that process.

But ChatGPT may have come to my attention around the same time as it came to everyone else's attention. I kept hearing about it and...

Erin Austin: Right.

Joy Butler: Right.

Erin Austin: I guess I'd heard about it, but it was just noise to me, kind of like blockchain or crypto. Like, I don't need to know that, I don't want to know it. Until finally I could no longer ignore it, which was during an MCLE, where I needed to get some credits so I wasn't delinquent. So I'm listening to this one about AI, and they were talking about ChatGPT in particular, describing what you could do with it.

They're showing these samples and I'm like, it can do what? And while I'm still in there (you know, it was just online; I'm silly, God forbid I go someplace in person), I'm on my computer doing stuff with it. I'm like, oh my God, this is bad. And that was February 2023, and that was my initiation.

So the last year has actually been a fire hose of information and changes.

Joy Butler: So I think ChatGPT is the AOL of our times. It's a technology that's been around for a while, but we finally have an application that has made it much more accessible and user-friendly for a much wider group of people.

Erin Austin: Yeah. I mean, when you think about it, artificial intelligence has been around a while. Obviously we've always had autocorrect and things like that, and all those things were artificial intelligence, right? Things like Alexa and Siri, those are versions of it. We just didn't think of it the way that we think of AI now.

Joy Butler: Exactly. It's been around for a while. We just finally got a killer app in ChatGPT.

Erin Austin: Right. Awesome. So a lot of the questions that I get are around: where is this data coming from? What is the black box of generative AI in particular? And what do I need to worry about?

Are they taking my prompts, and what are they doing with them? Or a client who is signing an agreement to use a contract review AI asks, what are the issues with using one of those? So everybody has questions about what happens when I use AI and what I need to worry about.

Where does that data come from, and what is my exposure? So I would just like to start from the top. I think most of the audience is familiar, in general, with generative AI, but let's talk about what training data is. Where does it get its information from? How does it get in there? Just start there, in general.

Joy Butler: So when we talk about AI models and some of the copyright and licensing issues, there are kind of two categories when we're talking about generative AI. The first category is the input, and the second category is the output. So when you mention training material, you're talking about the first category, input.

There has been a lot of controversy over whether or not the training material that is required to train these models can be used without permission. When I say foundation models, I mean the maybe eight or ten models around that literally take millions of pieces of content into their kind of black box and analyze them so that the result can be a general-use large language model.

What many of these models do is source that data from anywhere that they can, including scraping the Internet for millions and millions of pieces of data. So there's been, as I said, a lot of controversy around whether or not permission is required for them to do that. What many of these model companies are relying on now is an argument that their use of that material as training material qualifies as a fair use under the Copyright Act.

I believe there are a number of factors that will gradually push these AI foundation models towards licensing that material. One of them is that a number of lawsuits have been filed against them, charging them with copyright infringement and other related infractions over their use of this material.

All of those suits are still pending, and they may take a very long time to play out, but I think we're going to see progress towards more licensing prior to that. That's because people are very anxious to use generative AI, but before they use it, they want some comfort level that their use of that material is not going to subject them to any type of copyright infringement or other claim.

So in order to make their customers comfortable that they can use this material without taking on any legal liability, we are seeing more and more of these AI companies gradually move towards licensing the content.

Erin Austin: I want to follow up on that, but first I'm going to step back just a second, because you said large language models, and then we have machine learning, and we have generative AI.

Are those synonyms? Or are they all different elements?

Joy Butler: Okay, I'm not the expert here, but I'll share with you my understanding. The large language models are the general models that can produce the generated output. That means they take all of the input, all of that training material, and they basically analyze it to see what the relationship of each data point is to every other data point.

So when you ask it to produce something, it is estimating, or putting forth its analysis of, what word should come next, or what should come next in this particular paragraph, which is why it needs so much training material from which to learn.
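The "what word should come next" idea can be illustrated with a toy sketch. This is not how a production large language model works internally; it is a simple word-pair counter, assumed here purely for illustration, that learns from a tiny made-up corpus which word most often follows another. The function names and the sample sentence are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word in the text, which words follow it and how often."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following


def predict_next(model, word):
    """Return the most frequently observed follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up training text; a real model learns from vastly more material.
corpus = "the cat sat on the mat and the cat slept on the sofa"
model = train_bigram_model(corpus)

print(predict_next(model, "the"))  # "cat" follows "the" most often in this corpus
print(predict_next(model, "on"))
```

With so little text the predictions are crude, which hints at why these models need so much training material: the more examples of word relationships they see, the better the estimate of what comes next.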

Erin Austin: Got it. Okay. Now, going back to what you mentioned about things moving towards licensing because users of AI want to know that they're not going to get sued when they use the output.

What does that mean for all of the current data that has previously been scraped from the Internet and all these places? I mean, aren't the data sets, and our use of AI as is, almost too big to fail? What could happen with these lawsuits that are happening right now, if there are billions of pieces of, let's say, pirated information in, say, ChatGPT's, OpenAI's training data set?

What could the possible remedy be if they lose?

Joy Butler: Okay, so I do want to separate this into two categories again, because when we talk about infringement, there are two separate questions. The first question is whether or not just the process of the AI companies taking in data as training material and using it to train their models is copyright infringement.

That's one question. The second question is: if you, as a user of these models, use them to produce generated content, is there any legal liability for you? Now, one can imagine circumstances where the model's use of training data is considered a fair use, but the way you've used it in creating output is infringing or violating in some way. I'm not saying that scenario has actually come up or may come up often, but one can imagine a set of circumstances where that might be true. So, back to your original question: where is all this going? What are the potential remedies?

Well, one remedy with respect to these lawsuits is that the AI companies will settle with a lot of these plaintiffs, because the parties that have sued them have been the largest companies with the most resources and very large organizations like the Authors Guild. So they may settle and come to some agreement on what a settlement fee should be.

It's also possible that part of their settlement might be a licensing agreement going forward. That resolves matters for the large organizations that have sued and the large private companies. If it's an organization or association representing much smaller players, it remains to be seen how much might flow to them as part of any settlement.

It may be that, as opposed to a private settlement, we get some sort of a judicial settlement. I think it's perhaps less likely, but it might be one of the outcomes, and it might be a settlement like the one that was proposed in Google Books. That is the Google Books lawsuit, if anyone remembers this lawsuit from 2015.

This is the lawsuit that came out of Google Books starting its program where it digitized millions of books and used them, and still uses them today, to give a snippet of a book in response to a search. That is one of the cases on which a lot of these AI model companies rely when they argue that their use of the training material is a fair use.

For those who remember, the parties in the Google Books case initially tried to resolve it via a judicial settlement agreement that would have permitted the snippets of those books and allowed the digitization. But that judicial settlement, or rather the private settlement that was proposed, went very much beyond just providing snippets, which is one of the reasons that it was ultimately not approved by the court. The case kept going, and the court ultimately said, in effect: okay, we're stripping out all these extra things you tried to do in the settlement, but as consolation, we decide, Google Books, that your use is a fair use.

So it might be that some of the parties try to move in that direction, towards some type of settlement that encompasses both small and larger players. Some of the other types of resolutions that have been thrown out include a kind of collective that would be parallel to the way we collect public performance royalties in the music industry.

For example, when a song is performed on the radio, songwriters receive some income from the songs they've written being played. Well, the radio station is not going out and entering into license agreements with millions of songwriters. There are collectives, in the case of music ASCAP and BMI and a couple of others, that have these collective agreements under which they issue blanket licenses.

So something like that has been proposed, potentially, for the training material space. That brings in both rights owners with very large catalogs and rights owners with very small catalogs. The Copyright Office had a comment period where it asked a bunch of industry players what they thought of this, and most of the people who commented were very much in favor of direct licensing, or perhaps even aggregated licensing. That may be in part because that's where larger companies are going to get the premium licensing.

The direct licenses we've seen to date have been between AI model companies and very large organizations, for millions of dollars. Just as in the collective licensing example, those very large companies are not going to enter into license agreements with millions of small players.

So there will need to be some balancing so that the small players too can participate. One potential example of how this might be alleviated is through aggregators. One aggregator we have right now is the Copyright Clearance Center, which is aggregating scientific papers for use as training material. That allows smaller rights owners to participate and be paid for their material being used as training material, if that's what they choose to do.

Another example I've seen come forward in this space is a startup called Dappier. It is dedicated to giving those smaller rights owners the opportunity to be a part of training material, and to making that training material more accessible both to the large AI models and to the smaller AI companies that might have fewer resources and not be as able to compete when license agreements are going for millions and millions of dollars.

Erin Austin: Yeah. I mean, it sounds like this would all have to be prospective. If the AI companies have been scraping the Internet for we don't know how long, is it even possible to distinguish one piece of data in the data set from another? I don't know. How would you compensate all the people whose information is already in there?

And how would you parcel out payments, whatever fraction of a penny I might get for something? And going forward, if you are a small content creator, your everyday content creator like the audience here, it would then be on you to make sure that your content is registered somewhere.

So you'd be part of some aggregator that has a license and is getting paid by the AI companies?

Joy Butler: Okay. So there are several issues in that question. Let's go with that first part, where you talk about the provenance: what was the source of the data? Is it even traceable? This is one of the pain points.

This is also where that analysis comes in about whether your output subjects you to any type of liability. To back up for a second: if you are any type of content creator, and you are trying to determine whether or not the content you've created is violating any rights, you need to know its source, right?

So for the output of these models, if that provenance is not available, you, using the generative AI, can't even do that analysis. That's part of the pressure on the AI model companies not just to wait for these lawsuits to play out, but to make their potential customers comfortable that they can use the AI models to produce output that they can then use.

Part of doing that is knowing the provenance. Now, the extent to which they currently do that, I don't know. AI model companies have often been quite opaque and not very transparent about how the sausage is being made. On the outside, though, as another example of where the industry is going, there has cropped up another startup in this space called Fairly Trained, which is offering certification for AI model companies that have produced their models relying solely on an authorized data set.

The theory is, if you are a company that wants to leverage AI, you can get more comfort in knowing that you're relying on an AI model, an AI company, that is Fairly Trained certified. And the last time I checked, there were only a few dozen companies that had that certification, but that may grow.

Erin Austin: So maybe enterprise users would go for the Fairly Trained type, because they're much more concerned, frankly, than most everyday users about the quality of that output. It seems like if they're using it to create public-facing materials, they would want that fairly trained data set behind it.

And do they also give reps and warranties, when you go through them, regarding the quality of the output?

Joy Butler: Do they give representations and warranties? Fairly Trained, you mean?

Erin Austin: Well, Fairly Trained, or someone who has licensed their data from Fairly Trained. Would they then, in their terms of use, have representations...

Joy Butler: Fairly Trained doesn't license data. Fairly Trained is a certification program. So if an AI company wants this certification to show everyone that it has relied on an authorized data set, then this is a certification that it can apply for.

Erin Austin: Got it. Okay. I believe that there are some platforms that do provide indemnification, although they have a bunch of provisos. Where's that going, so that users feel more...

Joy Butler: Comfortable there, right. I think that's part of their responding to this pain point of needing to make their customers more comfortable with using their product. They are providing certain indemnifications. It remains to be seen how effective those indemnifications would be if a customer were actually sued.

And as you mentioned, they do have a lot of exclusions. Personally, I think that is just an intermediate stopgap, and they are going to be pushed towards more licensing of their data sets. And I would say, while we wait for this to play out, as you know, these lawsuits could take, and probably will take, a very long time.

The Google Books case, for example, on which the AI companies are relying, took 10 years before finally reaching the conclusion that Google Books' digitization was a fair use. So I would say in the interim, AI companies and those producing AI models should look more to using authorized data sets for construction of their models, and authorized data sets with a traceable provenance, so that their customers, when wanting to put the output into use, can know what the source is and do that analysis: is this violating any copyright? Is this violating any right of publicity or trademark or anything else?

I would say for the companies that want to leverage AI, when you're looking for partners, you do want to look at partners who are using authorized data sets.

Right now, what I see is that a lot of companies, brands, and companies in the film and television industry are actually leveraging AI. It is being leveraged, but they're using it for a first draft or a proof of concept, for things that are iterative and need to be turned around very quickly. They're not using it as part of the final consumer-facing output, due to those copyright reasons: both the reason we just discussed, fear of having any type of legal liability, and also because there are limitations on the degree to which you can protect output that's generated by artificial intelligence.

Erin Austin: Right. So when they have an authorized data set, does the output come with footnotes? What does it come with? What does that look like? Have you seen what that looks like?

Joy Butler: To tell us what the sources are?

Erin Austin: Like, does it identify them? Yeah.

Joy Butler: Oh, I see. When the data set is authorized. Right.

To my knowledge, it is not coming with anything. And you're talking about the Fairly Trained component, right? Yeah, it is not coming with anything. But that does need to be a path toward which we're traveling. There has been a lot of conversation in this space about how output needs to be marked and needs to be traceable in terms of: what was the source? What did you rely on to do that?

Erin Austin: Yeah. And as far as the magic that happens inside of a generative AI platform, do we know what that is? Or is that kind of the trade secret of each company? Or is there some general technology that makes the magic happen?

Joy Butler: I am not the expert on the technology inside of the AI models, but I can share what I know. In part, it depends on the approach that they've used, whether it's supervised learning or unsupervised learning. To make it very simple, that depends on how much you assisted the machine. Did you mark things and tell it, this is a dog and this is a cat? Or did you just give it millions of pictures and let it figure it out? When you let it figure it out, when it's unsupervised, it is more of a black box in terms of how it got to that answer, which brings up all sorts of other societal issues, right?
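The dog-versus-cat distinction can be sketched in a few lines. This is a deliberately tiny illustration under stated assumptions, not how any particular AI company trains its models: each animal is reduced to a single made-up number (a weight in kilograms), the supervised part averages examples per human-provided label, and the unsupervised part is a simple one-dimensional two-group clustering that never sees a label. All names and values are hypothetical.

```python
def train_supervised(labeled_examples):
    """Supervised: every example arrives with a human-provided label."""
    totals, counts = {}, {}
    for value, label in labeled_examples:
        totals[label] = totals.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    # The "model" is just the average value seen for each label.
    return {label: totals[label] / counts[label] for label in totals}


def classify(averages, value):
    """Pick the label whose average is closest to the new value."""
    return min(averages, key=lambda label: abs(averages[label] - value))


def cluster_unsupervised(values, rounds=10):
    """Unsupervised: split unlabeled values into two groups on their own.

    Assumes the list holds at least two distinct values.
    """
    low, high = min(values), max(values)
    for _ in range(rounds):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return group_low, group_high


# Supervised: a person marked each example as "cat" or "dog".
averages = train_supervised([(4, "cat"), (5, "cat"), (30, "dog"), (34, "dog")])
print(classify(averages, 6))

# Unsupervised: the same kind of numbers with no labels at all.
print(cluster_unsupervised([4, 5, 6, 30, 32, 35]))
```

In the unsupervised case the code finds two groups but cannot say which one is "cat" and which is "dog"; it only knows the values fall into two clumps. That is a small-scale version of why the unsupervised route is harder to explain after the fact.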

Erin Austin: Interesting. Well, can we wrap up with some best practices just for your everyday ChatGPT or Gemini (that is the Google one, Gemini?) user? For this audience, the expertise-based business, maybe they're using it to create first drafts or to help them with social media posts or something like that. Just some general best practices.

Joy Butler: Sure. You want to be circumspect about any confidential or proprietary information you include in a prompt. You may want to anonymize it.

You need to keep in mind that whatever output you get from the AI model may not be eligible for copyright protection. If this is material or output that you are passing on to a client or to a customer, you may need to disclose that use. It depends on the extent to which you're using the AI output: are you using it just for a little bit of assistance in modifying a few sentences, or are you actually producing images with it, or producing an entire report with it?

You're going to want to make sure that your procedures for using generative AI are consistent with the contract that you have with your customer. If you want to know whether or not your prompts are being incorporated into the training data and being used to further train that AI model, take a look at the terms and conditions.

To give you an example for ChatGPT: if you're using the free model, and it is recording the history of your prompts, then the prompts that you put in there are subject to being included as part of future training data.
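As one way to act on the advice to anonymize a prompt, here is a minimal sketch of scrubbing obvious identifiers before text is pasted into an AI tool. The patterns, the placeholder tags, and the `anonymize_prompt` helper are all assumptions made for illustration; they catch only simple email addresses, US-style phone numbers, and names you list yourself, so the result still needs a human read-through.

```python
import re

# Illustrative patterns only; real-world data has many more formats.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def anonymize_prompt(prompt, confidential_names):
    """Replace emails, phone numbers, and listed names with neutral tags."""
    redacted = EMAIL_PATTERN.sub("[EMAIL]", prompt)
    redacted = PHONE_PATTERN.sub("[PHONE]", redacted)
    for index, name in enumerate(confidential_names, start=1):
        # re.escape keeps punctuation in a name from being read as a pattern.
        redacted = re.sub(
            re.escape(name), f"[PARTY_{index}]", redacted, flags=re.IGNORECASE
        )
    return redacted


prompt = (
    "Draft a renewal letter from Acme Corp to jane.doe@example.com, "
    "phone 555-123-4567."
)
print(anonymize_prompt(prompt, ["Acme Corp"]))
```

The tags can be swapped back for the real names after the AI tool returns its draft, so the confidential details never leave your own machine.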

Erin Austin: Yeah. So if, along the left-hand side there, you can scroll through and see all the chats you have, that means it is going into the training data. That is excellent. So thank you so much.

Joy Butler: Oh, go ahead. I can't say with certainty that it is going into the training data. But I would say it is susceptible to being used. ChatGPT has not provided you any representation that they will not use it for training.

Right.

Erin Austin: Very good. Thank you for making that distinction, and thank you for this. You know, this podcast is here to help create a society and an economy that works for more of us. So I love to ask my guests: is there an organization or a person doing the good and hard work to help make an economy that works for more of us that you'd like to share with us?

Joy Butler: Sure. I really like organizations whose mission is to bridge the digital divide. One of my favorites is Girls Who Code, which has as part of its mission introducing more women into the technology field. That's very apropos to our conversation today, because as part of making AI beneficial for all humankind, we really do need diverse perspectives.

Erin Austin: Yeah. I mean, we know that just from when we talk about the training data sets and what data is going in there. Obviously the output is only as diverse as the input, right, and how it's being trained. I know that has come up in a number of controversial ways as well, about whether something's leaning this way or that way, but we definitely want to make sure everyone has a voice in the future.

Thank you for that one. We will put that in the show notes, along with how people can reach you. Where do you hang out, Joy? And how can people get in touch with you to find out more?

Joy Butler: Sure. I'm always happy to chat with people doing innovative things with technology, especially in the digital technology, online, and entertainment space. They can find me through my website, which is www.joybutler.com. And I'm also on LinkedIn, as Joy Butler.

Erin Austin: Awesome. Well, thank you so much. And yes, everyone, please follow Joy, and let us know if you have any other questions about AI. I know it's constantly evolving. There's always going to be something new, and we can continue this conversation in the future.

Thanks again, Joy.

Joy Butler: Thank you.