Welcome again to the Data Connections blog! My last blog entry explored the thorny question of whether data is good or bad. If you have not already read it, you may want to start there before reading this entry. In this entry, we’ll take the next step and look at where data comes from (spoiler alert: it is not the stork!).
If you were to read the popular press or listen to regulators, you might believe that companies are creating all the data they use. Or, recalling my 3rd blog post, you might think that, with 175 zettabytes of data in the world now, there should be no problem getting data. So why is this even a topic? Companies already have all the data ... and, if not, data must be falling off the trees!
Neither of these things is completely wrong, but neither addresses the need for large volumes of high-quality (see my last blog post) and structured (i.e., organized, as in a spreadsheet) data. Even large data creation companies like Google and Amazon, whose consumers and users provide them essentially free data all the time, simply don’t have enough of the right type of data for all that they want to do, and they still seek out even more data.[i] I recall talking to an internal client on my company’s research team who was building an artificial intelligence system. She was constantly submitting requests to ingest third-party datasets. Each dataset had to be vetted, and its terms had to be reviewed and negotiated by my legal team. We simply could not keep up with her requests. So, I asked her to help me prioritize which types of data were most important to her work and which she needed most urgently. Her simple, if unhelpful for my purposes, response was “all of it… I need ALL the data I can get as quickly as I can get it.”
To get all the data, where would you look?
Sources for acquiring third-party data include: 1) commercial datasets, bought or licensed either from academic researchers and small data providers (such as a dataset you can download from GitHub) or from a large commercial provider like Dun & Bradstreet or FactSet that regularly pushes data out to you; 2) open data that is shared, much like open source software, at low or no cost, such as in a “papers with code/data” research scenario or by government entities under open government programs (especially in the European Union); 3) creating or acquiring synthetic data[ii]; and 4) perhaps most controversially, data scraping.[iii]
Each of these options has positive and negative implications. Setting data scraping aside for the moment, the biggest single issue with these options is that they simply do not provide enough of the right type of data, specifically targeted to the needs of, say, an AI developer, at a low enough cost to enable rapid development of data-intensive applications like AI.
Below is a graphic that I created for my Data Law class to try to simplify this discussion a bit.
Of course, this graphic doesn’t cover all the issues with these types of data. It does make clear that each of the three most obvious ways to get data (create it, buy it, or get it from public sources) has issues that are problematic for data developers. These include, for some or all of these sources, the cost of the data, the supply/quantity of available data, and the relevance/quality of the data.
This is to say nothing of issues with the terms and conditions related to the data. Large commercial data providers like Dun & Bradstreet and FactSet might provide warranties as to certain aspects of the quality of their data (e.g., that it is current) or even the ability to use their data without intellectual property issues. However, their data is expensive to use in large volumes and its coverage is not comprehensive. By contrast, smaller data developers or even researchers will typically make data available for a one-time download, usually from a platform like GitHub, on an “as is” basis with no real warranties or commitments about where it came from. Open Data providers (e.g., government entities) likewise will typically make data available on an “as is” basis, since there is little or no cost to use the data.
Importantly, Open Data providers may also add unexpected additional terms in their data use agreements, such as limiting the use of the data to non-commercial purposes (e.g., academic research). This makes the data essentially worthless for compliant AI developers. Meta’s Llama 2 license goes further, restricting generative models created using its large language models to being used only with Llama models and their derivatives (and expressly not with Meta’s competitors).
Where can (and do) developers go for the large quantities of data that they need for product development, including training AI systems?
One increasingly popular resource is synthetic data. This is data created, typically using AI, to have the same parameters as real data, and it is often based on altering real datasets. The benefit of this type of data is that, once all the required characteristics are defined for the system creating the data, it can be created in large volumes without the concerns that plague actual datasets, like the need to eliminate personally identifiable information or to correct inaccurate/bad labeling of the data in the metadata.[iv] Synthetic data has these positive aspects, but it also has potential negatives. Unless the parameters used to create the data and the details on any original dataset used to create it are disclosed, we may know little about the quality of the synthetic data and whether it is, among other things, unbiased and representative.
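To make the idea a bit more concrete, here is a minimal sketch of one common approach to synthetic data: estimate the statistical parameters of a real, numeric dataset and then sample brand-new records from that fitted distribution. The column names and values below are hypothetical, and a real synthetic data pipeline would be considerably more sophisticated (and would need to document exactly these parameters to address the transparency concern above).

```python
# A minimal sketch (not production code): fit the mean and covariance of a
# small "real" dataset, then sample synthetic records from that distribution.
# All names and values here are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for a real dataset: rows of (age, annual_income).
real_data = np.array([
    [34, 52_000],
    [41, 61_500],
    [29, 48_000],
    [55, 83_000],
    [38, 57_250],
], dtype=float)

# Estimate the parameters of the real data.
mean = real_data.mean(axis=0)
cov = np.cov(real_data, rowvar=False)

# Generate as many synthetic records as we like. No real individual's record
# appears in the output, but the synthetic data preserves the broad
# statistical shape of the original.
synthetic_data = rng.multivariate_normal(mean, cov, size=1000)

print(synthetic_data[:3])  # e.g., three synthetic (age, income) pairs
```

Notice that whether the synthetic output is representative or biased depends entirely on the original sample and the fitted parameters, which is exactly why disclosure of those details matters.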
We now know where some of the data comes from, but each of those options seems to have potential issues with scaling. So, where does all the data come from? It would not be an exaggeration to say that many of our best-known AI systems today would not exist without data scraping. While this approach to gathering training data is critical to AI, it is also quite controversial. Companies who need data would rather not speak about their data scraping activities. Case in point: I was at a conference for lawyers of companies in the AI space, and one of the key speakers backed out of a panel discussion at the last minute when she realized, and told her general counsel, that we would be discussing data scraping.
Data scraping involves exploring web pages, scanning them, and then storing the data in a semi-structured format. Done correctly, the bots doing the scanning only scan the public portions of web pages and comply with the “do not scan” or robots.txt[v] notices that website owners and hosts post for other portions of their sites. However, even when data scraping honors the robots.txt protocol, it is still collecting and storing copyrighted information. Not only do the owners of this copyrighted information often feel that they should be compensated for this use, but the AI systems may later be asked to use the data collected on a particular author or artist to create an entirely new work in the style of that author or artist, without any compensation to the artist.
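For readers curious what “honoring robots.txt” actually looks like in practice, here is a minimal sketch using only Python’s standard library. The site, page, and bot name are hypothetical; a real scraper would also add rate limiting, error handling, and the parsing and storage steps described above.

```python
# A minimal sketch of the "check robots.txt before scraping" step.
# The URLs and user agent below are hypothetical examples.
from urllib import robotparser

SITE = "https://example.com"                              # hypothetical site
TARGET = "https://example.com/articles/some-public-page"  # hypothetical page
USER_AGENT = "ExampleResearchBot"                         # hypothetical bot name

# Load and parse the site's robots.txt rules.
rp = robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()

# Only fetch the page if the site's robots.txt allows this bot to crawl it.
if rp.can_fetch(USER_AGENT, TARGET):
    print(f"Allowed to fetch {TARGET}")
    # ... fetch the page and store its content in a semi-structured format ...
else:
    print(f"robots.txt disallows fetching {TARGET}; skipping")
```

Of course, as the paragraph above notes, passing this check says nothing about whether the content behind the URL is copyrighted; it only tells the bot which parts of the site the owner has asked crawlers to avoid.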
The concerns over substitution of AI for human creators are so great that the recent writers’ and actors’ union strikes in Hollywood focused, in no small part, on the possibility that writers might lose work to AI in the future. Likewise, programmers have accused Microsoft of using code that they store and access on Microsoft’s GitHub platform to train its Copilot AI system to do programming without human coders.
AI developers, on the other hand, lean on the fair use provisions of the US Copyright Act in Title 17 of the US Code as justification for their use of this data. Among other things, the fair use test allows for transformative use of an original copyrighted work, provided that the new (or transformed) work has a new meaning or message. In the case of AI, the data that is used to train the systems is rarely reproduced in the outputs the AI generates. If, for example, I ask an AI to create a children’s story in the style of Stephen King, it will do that without copying over text from prior works. Instead, it will use those prior works to inform how the story is written and will create a remarkably similar new work.
A great deal of litigation is currently underway on whether the use of data scraping in AI is a fair use. These cases are being closely watched and may determine the future livelihood of both creatives and AI companies. The recent Supreme Court case Andy Warhol v. Goldsmith (2023) did not involve AI or data scraping, but it did give us a sense of where the US Supreme Court might come out in this discussion. In that case, an Andy Warhol painting of the singer Prince, which arguably made minimal changes to a photograph of Prince by Lynn Goldsmith, was used in a memorial publication of Vanity Fair magazine. In the editorial process at Vanity Fair, the staff needed to choose between the painting and the original photo, and they elected to use the painting. This led the Court to find that when a new work (the Warhol painting) is used as a substitute for the original work, it does not qualify as a transformative use (and thus is not a “fair” use).
This case may indicate that if a work of AI authorship, for example the Stephen King-style children’s book I mentioned above, created with the help of scraped data, eliminates the sale of an actual work by the author, it will not qualify as fair use.
That would be a good result for the author, but likely not one that would be embraced by AI developers. Of course, AI creators could license these materials from the authors, but that adds costs and complexity that AI developers generally do not want. More to come on this, I am sure.
The quest to get “all the data” is a challenging one. Sources like open data and commercial data provide the most legally straightforward approach to data licensing but continue to present challenges in providing the quantity of data needed for AI at the lowest possible cost. Synthetic data may be an answer, but there is a need for much more transparency given the challenge of understanding artificially created data and the unknowns related to potential bias in its creation. Finally, there is the wild west that is data scraping, with its promise of great quantities of relevant data offset by uncertainty around fair use and the public uproar over the damage that AI trained on this data will do to the livelihoods of creators of all kinds of works (computer code, books, movies, paintings, etc.).
The good news is that if you get hit with the awkward question from your children about “where does data come from?”… I am hoping you can now confidently answer!
In my next blog entry, we’ll look at an overview of the four legs of what I call the data stool. Then, we’ll take a deeper dive into each of these legs.
See you next time as we continue to explore data connections!
[i] Even Google needs more data: https://www.eff.org/deeplinks/2020/04/google-fitbit-merger-would-cement-googles-data-empire
[ii] Per Wikipedia: “Synthetic data is information that is artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models.”
[iii] Data scraping (also called web scraping) is the process of extracting data from web pages, often using bots to download the data. It is frequently seen as a benign action when used on public data, but when data scraping is done without a website’s permission, it can be malicious. Once collected, the data is put into a format more useful to an end user.
[iv] Metadata is like an index or card catalog for the data. It is the data that describes the dataset.
[v] More on robots.txt here: https://developers.google.com/search/docs/crawling-indexing/robots/intro
If you have questions about your data and your legal compliance programs for data, Mortinger & Mortinger LLC can help! Contact me directly at: steve@mortingerlaw.com