Ep. 91 Behind the AI Curtain: Sourcing: A Look at the Data That Powers Generative AI (Transcript)
Erin Austin: So today we're talking about AI again, but more specifically about the training data sets for generative AI: where the data comes from, what some of the legal issues are, and what training data even is. When we think about ChatGPT and other AI tools, I know I talk about ChatGPT all the time; frankly, it's the one that I use.
Erin Austin: And so it's the one I'm most familiar with, but this applies to all generative AI platforms. You hear about the vast amounts of data that they utilize, and as you can imagine, training data plays a crucial role in the development and effectiveness of generative AI platforms. But where does all of that data come from?
Erin Austin: And I know you have lots of questions about that, because you're worried that it's coming from your website. So let's start with: what is training data? Training data is the backbone of any machine learning project, which is what generative AI is. It consists of large sets of information used to teach an algorithm how to recognize patterns and make predictions.
Erin Austin: That's how it is creative, i.e. generative. You put in this vast amount of data, labeled in certain ways; the model learns the patterns and can then make informed predictions and create new content based on them. Given the scale of modern AI requirements, the data sets are absolutely enormous, and the models trained on them often encompass billions of parameters.
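To make "learning patterns and making predictions" a little more concrete, here is a minimal, self-contained Python sketch. It "trains" a toy next-word predictor by counting which word follows which in a tiny corpus; the three-sentence corpus is, of course, a stand-in for the billions of tokens real models see, and nothing here resembles how a production model actually works.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for "training data" (a real training set would
# be billions of tokens scraped or licensed from many sources).
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

# "Training": count which word tends to follow which (pattern learning).
follows = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1

def predict(word):
    """'Generation': return the most likely next word from learned patterns."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]

print(predict("the"))   # "cat" follows "the" most often in this corpus
print(predict("sat"))   # "on"
```

Chaining `predict` from a starting word would already "generate" new text; scaling that counting idea up to neural networks with billions of parameters is, very loosely, what the large platforms do.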
Erin Austin: And that, of course, will change depending on the size and complexity of the model being trained. So the primary source of training data, or I should say, traditionally, the source of training data for platforms like OpenAI's was content scraped from the internet for free. That data was used to train the first generative AI models like ChatGPT, and they've done a pretty good job, I'd say, of learning to mimic human creativity.
Erin Austin: They, of course, believed, and I think they're still sticking to this story, that it was legal and ethical for them to do so, relying on some prior cases holding that you can use publicly available information so long as the use is transformative, essentially making a fair use argument.
Erin Austin: I'm not going to go into the fair use argument here, but that is the basis on which they thought they could do this. As you probably know, there have been a number of high-profile lawsuits about their use, and those have not been resolved yet. So we will see whether their reasoning and their defenses hold up.
Erin Austin: So, let's discuss a few of the ways that they do get training data. First, web scraping, which I've already mentioned. They send out crawlers that scour the internet. A crawler should only be scouring for things that are publicly available, not behind a paywall. You can, I'm assuming, ask a crawler to go behind a paywall, but that would obviously be a breach of a site's terms and conditions. And even if there is no paywall, many sites have terms and conditions that say you are not allowed to use crawlers.
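As a sketch of what "well-behaved" crawling looks like mechanically, here is a short Python example using the standard library's robots.txt parser. The robots.txt content and the bot name are hypothetical; keep in mind that robots.txt is only a voluntary convention, while the contractual restrictions live in a site's terms and conditions.

```python
from urllib import robotparser

# A hypothetical robots.txt: the voluntary signal sites publish to tell
# crawlers what they may fetch. Honoring it is a convention, not a law.
robots_txt = """\
User-agent: *
Disallow: /members/
Disallow: /paywalled/
Allow: /blog/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved crawler checks permission before fetching each URL.
# "TrainingDataBot" and example.com are placeholders for illustration.
print(rp.can_fetch("TrainingDataBot", "https://example.com/blog/post-1"))   # True
print(rp.can_fetch("TrainingDataBot", "https://example.com/paywalled/x"))   # False
```

A crawler that ignores these signals, or that authenticates past a login screen or paywall, is the kind of scraping that raises the contract and copyright issues discussed here.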
Erin Austin: And if you don't comply with those terms and conditions, then you're obviously breaching them as well. On top of that, when they scrape the data, many times, if not always (we can't really see what's inside the black box of that training data), they're stripping off copyright notices, and it is a violation of the Copyright Act to remove copyright notices. So there are a number of issues involved with web scraping, and it is obviously falling into disfavor. So what is replacing it? Licensed data sets: very large data sets licensed from entities that own large amounts of data.
Erin Austin: I read a report regarding this new path forward. There is a rush right now to go to copyright holders that have private collections of material that is not available to be scraped. This comes from a lawyer who is advising content owners on deals worth tens of millions of dollars apiece to license archives of photos, movies, and books for AI training. Reuters spoke to more than 30 people with knowledge of AI data deals, including current and former executives of the companies involved, plus lawyers and consultants, to provide the first in-depth exploration of this fledgling market: detailing the types of content being bought, the prices being paid, and the emerging concerns that come from harvesting this type of data even when it's licensed, because of the personal data risks. When large amounts of data are harvested, the personal data of the humans it belongs to is often collected without their knowledge or consent.
Erin Austin: So who are these huge licensees? There are a number of them. We have tech companies that have been quietly buying content that sits behind locked paywalls and login screens, from companies like Instacart, Meta, Microsoft, X, and Zoom. This might be long-forgotten chat logs or long-forgotten photos from old apps that are being licensed.
Erin Austin: Tumblr's parent company Automattic said last month (I'm recording this in April 2024) that it was sharing content with select AI companies. And in February 2024, Reuters reported that Reddit struck a deal with Google to make its content available for training the latter's AI models. Of course, there's going to be some customer blowback.
Erin Austin: And so, while this type of content licensing is accelerating, there will probably be some amendments to it, because, yes, Meta goes in and changes its terms of use, but does anybody read the terms of use of Meta, or of X, or even of Zoom? They're going in and changing their terms and conditions without saying in bright red letters, "Hey, we're going to be selling your data now for AI training."
Erin Austin: We'll see what comes from that. All right. Then there are owned archives, such as those of the Associated Press and Getty Images (really an aggregator; they don't own all of those images). You can go to them and license their entire archives, and that provides a great amount of data for your data sets.
Erin Austin: Universities and research institutions are also owners or controllers of vast amounts of data that can be licensed in one fell swoop. And then there are some nonprofit organizations that want to encourage the use of AI, just as we've had other types of nonprofits in the past, such as Creative Commons, that want to help people get more access to copyrightable materials.
Erin Austin: And now there are some who feel the same way about making AI training data more accessible. For instance, the nonprofit Allen Institute for AI released a data set of 3 trillion tokens drawn from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. Another source is synthetic data. This one was new to me, but it really points to how powerful AI can be.
Erin Austin: Synthetic data generation means that you use one generative AI tool to create synthetic data, and then you use that synthetic data to train another generative AI tool. So let's say you're developing a customer service AI model. You could use another generative AI tool to create fictional customers, situations, and interactions.
Erin Austin: Then you use those fictional customers, situations, and interactions as the training data for your public-facing AI model. That way you're not at risk of exposing private customer information: instead of putting your customer information directly into your AI tool, you first, in effect, anonymize it using another generative AI tool.
Erin Austin: And it's not enough just to de-identify it, because there could be customer situations so specific that they could point to only one person. It's possible. So you may also have to make up new situations, new backgrounds, things like that. But then you can use those fictional customers to train your AI-driven customer service model to help provide customer service on an AI basis. We will see this with hospitals and banks as well, which hold sensitive information. Obviously they cannot use their customers' sensitive information as training data, but they do want access to what is really part of doing business these days: having some sort of AI-based system.
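The generate-then-train separation described above can be sketched in a few lines of Python. Everything here is a stand-in: the "generator" is a template picker rather than a real generative model, the names and issues are invented, and the "trained" downstream model is just a lookup table. The point is only the flow: fictional records in, model trained, no real customer data ever touched.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Step 1: a stand-in for the generative AI tool that invents fictional
# customers and situations. All names and issues are made up; no real
# customer data is involved anywhere in this pipeline.
NAMES = ["Alex Rivera", "Sam Chen", "Jordan Lee"]
ISSUES = ["late delivery", "billing error", "password reset"]
RESOLUTIONS = {
    "late delivery": "apologize and offer expedited reshipment",
    "billing error": "refund the overcharge and confirm by email",
    "password reset": "send a secure reset link",
}

def generate_synthetic_record():
    issue = random.choice(ISSUES)
    return {
        "customer": random.choice(NAMES),  # fictional, not a real person
        "issue": issue,
        "resolution": RESOLUTIONS[issue],
    }

# Step 2: build the synthetic training set. A real pipeline would draw
# thousands of varied records from an actual generative model.
synthetic_data = [generate_synthetic_record() for _ in range(100)]

# Step 3: "train" the public-facing customer-service model on synthetic
# data only. Training here is simply a lookup from issue to resolution.
model = {record["issue"]: record["resolution"] for record in synthetic_data}

print(model.get("billing error"))
```

The design point is the firewall between steps: the downstream model in step 3 only ever sees the synthetic records from step 2, which is why a hospital or bank could deploy it without exposing real customer files.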
Erin Austin: All right. And then, of course, last but not least, there is the data that comes from you and me. What does that mean? When we are using generative AI platforms and we input our prompts, if we put in something that we've written and ask for a summary, or put in a transcript and ask for show notes, everything that we put into that platform has the potential to become training data for that platform. So if we are doing that, we need to be aware of the platform's terms of use. Most of them will tell you that your input can become part of the training data. It might also end up as an output for someone else whose query or prompt happens to be one that your input answers perfectly. You just don't know.
Erin Austin: So we need to be careful about what we are putting in as prompts, or as input for whatever AI platform you're using. Make sure you are aware of its terms and conditions. Do not put any confidential information in there, whether it's yours or your client's. So make sure that you're really aware of that.
Erin Austin: Some AI platforms, and I'm thinking in particular of DocuSign, do use AI. Obviously, when you're using DocuSign, legal agreements are going through it that contain identifiable information about the parties, commercial terms, and things like that. DocuSign has said that it does use the agreements as training data, but that it strips out any identifying information from them first.
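To illustrate what "stripping out identifying information" can mean mechanically, here is a hedged sketch of a regex-based redaction pass. The actual pipeline a platform like DocuSign runs is not public, and real de-identification is far more sophisticated; the patterns and the sample clause below are purely illustrative.

```python
import re

# Illustrative redaction patterns: emails, US-style phone numbers, and
# dollar amounts are replaced with placeholders before any text is used
# as training data. Real systems use many more patterns plus ML-based
# entity detection; this is only a sketch of the idea.
PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
    (re.compile(r"\$\s?\d[\d,]*(\.\d{2})?"), "[AMOUNT]"),
]

def redact(text):
    """Replace each matched identifier with its placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

# A fictional contract clause, invented for this example.
clause = "Contact jane.doe@example.com or 555-867-5309; fee is $12,500.00."
print(redact(clause))
# Contact [EMAIL] or [PHONE]; fee is [AMOUNT].
```

As noted earlier in the synthetic-data discussion, pattern-stripping alone may not be enough: a situation can be so specific that it identifies one person even with every name and number removed, which is why de-identification is often paired with fully synthetic records.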
Erin Austin: So, things to be aware of. In summary, I think we've covered the legal issues, but to sum them up: there are the copyright issues of putting data into the database. Last week, I believe, I talked about the copyrightability issues of the output; now I'm talking about the copyright issues with the input: whether or not the AI platform, or you, has the right to add information to the training data set, and whether or not doing so is copyright infringement. Is it fair use of that data? One of the issues on the copyright side is that sometimes the output will literally be an exact replica of what went in, and it's hard to make a fair use argument when verbatim paragraphs come out as the output, as in the case of The New York Times, which is the basis of its lawsuit against OpenAI.
Erin Austin: Where's the fair use there? Same with Getty Images: they've had exact replicas of their images come out of an AI platform. So that's obviously an issue. In addition to copyright issues, we have privacy concerns. There are instances of real images of people coming out, and most certainly private photos from somebody's old Facebook, or whatever the old platforms were, or old blog posts, old journals. Think about what the original blogs were: kind of like journals, right? People would use them as journals, and those posts are probably still hanging around somewhere. I'm thinking about a Blogger blog, I guess it was, that I started back then.
Erin Austin: It didn't last very long and no one ever saw it, but it's still somewhere. I don't know if I could find it today, but it's still out there, and somebody's web crawler could find it. I don't think they'd be very interested, but it's there. So we do have the privacy concerns. And then we have contract breach: if we are using, say, a client's confidential information and entering it into an AI chatbot, it has the potential to be shared, and we are breaching our contractual obligations to our client if we're doing that without permission. Some contracts are now explicit about whether or not you can use AI, but even if a contract is silent on the specifics, if you are obligated to keep the client's information confidential and to share it only under very specific circumstances, putting it into an AI platform is probably not one of those permitted uses.
Erin Austin: So you do have issues there as well. All right. That is what I wanted to cover today regarding AI training data. As you know, this is a fast-moving area; who knows what will come next week. I'll try to keep you up to date, but always feel free to connect with me and let me know what your questions are.
Erin Austin: I'm always happy to answer them. Thanks again. And don't forget: IP is fuel.