Practical Viability of Opt-Out Regime for AI Training


Explore the practicality of implementing an opt-out regime for AI training, delving into source-based and content-based opt-outs, edge cases, and safe harbors. Understand the impact on AI development, the necessity of massive pre-training data for frontier models, and the crucial role of the internet as the primary data source for AI advancement.

  • AI Training
  • Opt-Out Regime
  • Source-Based
  • Content-Based
  • Internet Data

Presentation Transcript


  1. On the Practical Viability of an Opt-Out Regime for AI Training. Andy Gass, Latham & Watkins LLP. April 23, 2025. WIPO Conversation, AI & IP: Infrastructure for rights holders and innovation.

  2. Agenda: Inescapable Technological Facts; Source-based Opt Outs; Content-based Opt Outs; Edge Cases & Safe Harbors; Impact on AI development?

  3. Frontier Models Require Massive Amounts of Pre-Training Data
     Reported pre-training corpus sizes:
     • "We pre-train Llama 3 on a corpus of about 15T multilingual tokens" (Nov. 2024)
     • "We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens . . ." (Feb. 2025)
     • "The overall data mixture for training consisted of more than 30 trillion tokens" (Apr. 2025)
     For comparison:
     • Largest collection of public domain books and newspapers (OpenCorpus): ~885 billion tokens
     • 100-year archive of New York Times articles: ~5.5 billion words
     • Music library with 1M songs: ~300 million words
     • All the books in the world: ~13 trillion words
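To make the scale gap on this slide concrete, the following is a rough arithmetic sketch rather than anything taken from the presentation. The token counts are the figures quoted on the slide. The tokens-per-word conversion (`ASSUMED_TOKENS_PER_WORD`) is an assumption introduced here purely for illustration, since the slide mixes word counts and token counts; real ratios vary by tokenizer and language.

```python
# Figures quoted on the slide, in tokens.
LLAMA_3_TOKENS = 15e12
DEEPSEEK_V3_TOKENS = 14.8e12
OPENCORPUS_TOKENS = 885e9

# Figure quoted on the slide, in words.
NYT_ARCHIVE_WORDS = 5.5e9

# ASSUMPTION (not from the slides): a rough tokens-per-word ratio, used only
# so that word counts can be compared with token counts.
ASSUMED_TOKENS_PER_WORD = 1.3


def words_to_tokens(words, tokens_per_word=ASSUMED_TOKENS_PER_WORD):
    """Convert a word count to an approximate token count."""
    return words * tokens_per_word


def share_of_corpus(source_tokens, corpus_tokens):
    """Return the fraction of a training corpus a single source could supply."""
    return source_tokens / corpus_tokens


if __name__ == "__main__":
    opencorpus_share = share_of_corpus(OPENCORPUS_TOKENS, LLAMA_3_TOKENS)
    nyt_share = share_of_corpus(words_to_tokens(NYT_ARCHIVE_WORDS), LLAMA_3_TOKENS)
    print(f"OpenCorpus vs. a 15T-token corpus: {opencorpus_share:.1%}")
    print(f"NYT archive vs. a 15T-token corpus: {nyt_share:.3%}")
```

Under these assumptions, even the largest public-domain collection cited on the slide covers only a single-digit percentage of a 15T-token corpus.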

  4. Frontier Models Wouldn't Exist Without Training On The Internet. The web has acted as the primary data commons for general-purpose AI. Its scale and heterogeneity have become fundamental to advances in capabilities. Sources: Shayne Longpre et al., Consent in Crisis: The Rapid Decline of the AI Data Commons, arXiv (2024); Kyle Lo et al., Opening the Language Model Pipeline: A Tutorial on Data Preparation, Model Training, and Adaptation, NeurIPS 2024.

  5. The Internet Is Messy

  6. The Internet Is Messy

  7. The Internet Is Messy

  8. The Internet Is Messy

  9. The Internet Is Messy

  10. Agenda: Inescapable Technological Facts; Source-based Opt Outs; Content-based Opt Outs; Edge Cases & Safe Harbors; Impact on AI development?

  11. Robots.txt (example: https://sfgate.com/robots.txt)
     • No scraping by any bot (User-agent: *) of these sub-domains
     • Googlebot-News not allowed to scrape these sub-domains
     • No scraping of any part of the site by Common Crawl (CCBot)
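The three kinds of rule described on this slide can be sketched with Python's standard-library `urllib.robotparser`. The robots.txt content, the paths, and the `example.com` URLs below are hypothetical stand-ins, not the actual sfgate.com file. Note that `Disallow` rules are matched as URL path prefixes, and that robots.txt only expresses a request: compliance depends on the crawler choosing to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt mirroring the three rule types on the slide.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot-News
Disallow: /archive/

User-agent: CCBot
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


PARSER = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    checks = [
        ("CCBot", "https://example.com/news/story.html"),
        ("Googlebot-News", "https://example.com/archive/1999.html"),
        ("Googlebot-News", "https://example.com/news/story.html"),
        ("SomeOtherBot", "https://example.com/private/x.html"),
    ]
    for agent, url in checks:
        print(agent, url, may_fetch(PARSER, agent, url))
```

A crawler that honors the protocol would run a check like `may_fetch` before each request and skip any URL for which it returns False.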

  12. User-Agent Disclosures

  13. What is not captured by source-based opt-outs. [Diagram: five sources labeled "robots.txt", one showing "User-agent: GPTbot / Disallow: /", and four sources labeled "???"]

  14. Agenda: Inescapable Technological Facts; Source-based Opt Outs; Content-based Opt Outs; Edge Cases & Safe Harbors; Impact on AI development?

  15. What do content-based opt-outs look like? Find the text to filter through n-gram searching. Sample text shown on the slide:
     The hallway smelled like dust and old paint. Chairs scraped against the tile as students filtered in. Nothing on the whiteboard had changed in weeks. Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty. The ceiling tiles didn't match and nobody seemed to care. He scrolled through the report again, finding the same typo. The blinds were always closed, even on sunny days. There were three pens in the drawer, none of them worked. A plastic fork sat on the windowsill, warped from heat. They updated the software but didn't tell anyone what changed. Traffic moved slower than expected, even for a Monday morning. She poured the coffee and didn't drink it. One shoe was under the couch, the other in the hallway. They never fixed the leak under the sink. It made a slow drip that echoed through the night. Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty. The keys clattered as he typed, pausing only to delete half the sentence. Something buzzed in the wall but maintenance never found the source. The vending machine had the same items since last fall. She opened the door to a room full of silence. There was a calendar from 2022 still pinned to the wall. He passed the same billboard every day but never read it. The water from the cooler tasted faintly metallic. He left a sticky note on the monitor with a reminder he didn't need. Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty. The hallway light flickered every third second. The delivery came early, for once, and nobody was ready. The conference room clock was five minutes fast. They kept the old projector even though it barely worked. She clicked through

  16. What do content-based opt-outs look like? Delete text before training. (The slide repeats the sample text from slide 15.)
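The two steps named on these slides, finding opted-out text by n-gram search and deleting it before training, can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any AI lab's pipeline: the function names, the word-level tokenizer, the sentence splitter, and the choice of n = 5 are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return the set of all contiguous n-token sequences."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def filter_opted_out(document, opted_out_text, n=5):
    """Drop every sentence that shares at least one n-gram with the opted-out text."""
    blocked = ngrams(tokenize(opted_out_text), n)
    # Naive sentence split on terminal punctuation (an assumption for this sketch).
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if not (ngrams(tokenize(s), n) & blocked)]
    return " ".join(kept)


OPTED_OUT = (
    "Four score and seven years ago our fathers brought forth, upon this "
    "continent, a new nation, conceived in liberty."
)

DOCUMENT = (
    "The hallway smelled like dust and old paint. "
    "Four score and seven years ago our fathers brought forth, upon this "
    "continent, a new nation, conceived in liberty. "
    "She poured the coffee and didn't drink it."
)

if __name__ == "__main__":
    print(filter_opted_out(DOCUMENT, OPTED_OUT))
```

The sketch presupposes that the filter already holds the full opted-out text, which is exactly the practical question the next slide raises.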

  17. How do you know what to filter out?
     • "I opt-out of AI training for all my works": What works are covered?
     • "I opt-out of AI training for my book, Harry Potter": How do I get the text for my filter?
     • "I opt-out of AI training for the text attached as Ex. A": Content goes in centralized registry accessible to any AI labs for filtering?

  18. Agenda: Inescapable Technological Facts; Source-based Opt Outs; Content-based Opt Outs; Edge Cases & Safe Harbors; Impact on AI development?

  19. What about derivatives / non-perfect matches?
     • Modified text might not match
     • How to avoid filtering out public domain text (e.g. Bible verses that appear in books)?
     • Images / video hard to search
     • Impossible for musical compositions
     • How to deal with translations?
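The first point above, that modified text might not match, can be illustrated with a short sketch. An exact-match filter misses a lightly edited copy of an opted-out sentence, while a fuzzy comparison still scores it as highly similar. The fuzzy scoring via the standard-library `difflib.SequenceMatcher` is a technique chosen here only for illustration; the slides do not propose it, and any similarity threshold would itself be a policy choice that trades missed matches against over-filtering of legitimate or public domain text.

```python
from difflib import SequenceMatcher

ORIGINAL = (
    "Four score and seven years ago our fathers brought forth, upon this "
    "continent, a new nation"
)

# Hypothetical lightly modified copy: a numeral swap and changed punctuation.
MODIFIED = (
    "Four score and 7 years ago our fathers brought forth on this "
    "continent a new nation"
)


def exact_match(opted_out, candidate):
    """Return True only if the opted-out text appears verbatim in the candidate."""
    return opted_out in candidate


def similarity(a, b):
    """Return a character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


if __name__ == "__main__":
    print("exact match:", exact_match(ORIGINAL, MODIFIED))
    print("similarity:", round(similarity(ORIGINAL, MODIFIED), 2))
```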

  20. Metadata solution?
     • Protocol for embedding "do not train" metadata in a work itself
     • Doesn't work for text
     • Metadata can be stripped (including inadvertently)
     • Prospective solution only

  21. Timing? [Timeline diagram: Web Scrape / Common Crawl Download; Processing, filtering, tokenization, etc.; Model Training Begins. Markers on the timeline: Party A opts out; Party B opts out.]

  22. Role for a safe harbor?
     • Requirement to implement agreed-upon opt-out mechanism
       - E.g., robots.txt, filtering content provided by rightsholder, metadata ID
       - Requirement attaches at specific point in the data collection process (e.g., during processing)
     • Safe harbor for edge cases arising from flaws in mechanism
       - Inadvertent training on data that slipped past the agreed opt-out mechanism
       - Inadvertent training on data if opt-out occurred after data collection phase

  23. Agenda: Inescapable Technological Facts; Source-based Opt Outs; Content-based Opt Outs; Edge Cases & Safe Harbors; Impact on AI development?

  24. How will this affect AI models? Source: Shayne Longpre et al., Consent in Crisis: The Rapid Decline of the AI Data Commons, arXiv (2024)

  25. How will this affect the AI landscape? "[O]ur results show web domains are rapidly restricting crawling and use of their content for AI . . . If these rising restrictions are . . . legally enforced, the availability of high-quality pretraining sources will rapidly diminish . . . [potentially] biasing [the available] data toward older content and less fresh content." Shayne Longpre et al., Consent in Crisis: The Rapid Decline of the AI Data Commons, arXiv (2024). [Diagram labels: AI Lab; Internet]

  26. How will this affect the AI landscape? [Diagram labels: Vertically Integrated Tech Company; Owned Media Subs (e.g. TV studio, news); Social Media Platform; User-Generated Content; AI Lab; Internet]
