The Battle Against Involuntary Data Harvesting
Restrictions on AI web scraping have swept through the world of journalism in recent months, becoming a wide-spanning movement amid the proliferation of generative AI in 2023. With top names in the field, including the New York Times, BBC, Bloomberg, CNN, and many more, implementing safeguards to protect their content from unapproved data harvesting, how exactly will these measures protect intellectual property, and what risks do they carry?
A Closer Look at Web Scraping and Generative AI
Sitting at the intersection of Natural Language Processing (NLP) and Machine Learning (ML), generative AI depends almost entirely on the variety of data it can collect from the internet to learn the structure of its outputs. With models such as ChatGPT, LLaMA, and Bard growing exponentially in size, trained on potentially hundreds of billions of tokens' worth of information, the internet provides the trove of diverse training data needed to ensure coherent, competitive outputs.
This competitiveness is key to the viability of models against their peers. Katie Gardner, a partner at international law firm Gunderson Dettmer, noted the importance of user-generated content and writing for firms that host and share blog work in its various forms, a point that applies even more strongly to journalism. The “competitive moat is in the data,” as Gardner puts it, and companies such as X (formerly Twitter) and Reddit have recognized this advantage, charging for access to their data instead.
The Word from News Sites
One of the most significant reasons journalism has taken such a strong stance against web scraping is the nature of the business: news sites have a high concentration of reputable, high-quality written information to train models on. Per a CNN interview, a variety of actions are being taken or considered, ranging in severity from licensing arrangements to lawsuits grounded in copyright law.
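In practice, many of the publisher safeguards described above take the form of robots.txt directives that disallow known AI training crawlers (OpenAI's GPTBot being the most widely cited example) while leaving ordinary search crawlers untouched. As a minimal sketch, assuming an illustrative robots.txt of that shape, Python's standard-library `urllib.robotparser` shows how a compliant crawler would interpret such a policy; the URLs and the `NewsIndexBot` agent name here are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt of the kind many news sites adopted in 2023:
# block the AI training crawler site-wide, allow everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant AI crawler must skip the article; other crawlers may fetch it.
print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
print(parser.can_fetch("NewsIndexBot", "https://example.com/article"))  # True
```

Note that robots.txt is purely advisory: it only restrains crawlers that choose to honor it, which is one reason publishers are also pursuing the licensing and legal routes described here.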
Another top concern of news and publishing companies is the use of their content in false or misleading outputs, as reported by the Guardian. Known as hallucinations, these are responses in which generative AI inadvertently presents false or misleading information as fact. More fundamentally, many in the world of writing and journalism see these web scrapers as a blatant violation of their intellectual property, feeding a system that can then mimic their work without just compensation. The same holds for content and information gathered from individuals in non-work contexts. Gardner notes, “It’s very difficult to know if personal data was used, and if so, how it is being used and what the potential harms are from that use — whether those harms are to an individual or society in general.” With several intricate, overlapping concerns, the legal angle must also be considered.
The Legal Argument
The legal approach comes with several nuances, chief among them the differing jurisdictions across countries. In the US, data scraping is argued to be permissible under the “fair use” doctrine, which “permits limited use of copyrighted material without having to first acquire permission from the copyright holder”. Companies such as OpenAI rally behind this notion, particularly in light of Supreme Court precedent permitting the use of copyrighted material as long as the use is transformative (altering the work rather than transplanting it). That said, many question whether this transformative quality always holds, as Gardner continues: “...we’ve already seen many cases where user prompting can generate output that very plainly appears to be a derivative of a copyrighted work, and thus infringing”.
In Europe, the GDPR stands as the current barrier to large-scale data collection without explicit agreement from affected parties. Italy’s temporary ban of ChatGPT in March 2023 is a concrete example of these concerns, with use permitted again only once OpenAI had addressed the data privacy issues. Additionally, as the EU AI Act comes into force, generative AI will face increased disclosure requirements alongside general AI safety obligations.
As a consequence of these legal concerns, the “black box” nature of many generative AI models has begun to recede. Margaret Mitchell, researcher and chief ethics scientist at Hugging Face, reflected on these developments and the discussions around data scraping, calling it a “good sign that AI ethics discourse is further enriching public understanding”.
Beyond Journalism: Crackdowns in Enterprise
Notably, journalism is not the only field caught up in this clash over AI data scraping. In music production, Universal Music Group, one of the largest distributors on the market, requested that Spotify and Apple block the scraping of its catalogs for data, amid the rise of AI-generated music. Moreover, several listing and informational sites such as Amazon and Airbnb have put a stop to web crawlers fueling generative AI, as have Ikea, Indeed.com, and even BAILII, the UK’s court records system.
Implications on the Field of AI
With a growing number of firms across all fields banning the use of web scraping tools, it is worth considering what lies ahead for generative AI companies. As datasets become more restricted and costly, the quality of outputs may face a significant decline as reputable sources move out of reach for such firms. Much more careful AI risk management must be carried out, ensuring that the outputs produced by GenAI models are sourced from reliable, trustworthy data, with careful attention to avoiding bias. Consequently, Calvin looks forward to guiding this area through its quantitative approach to AI risk management, providing enterprises with verifiable results across the models in their AI portfolios.
Interested in learning more about our Responsible AI proposition? Book a demo with us today!