Why is a human downloading a file called pirating, but AI scraping called training?

18 points by nutanc a month ago · 12 comments


In this age of AI, it looks like AI has more freedoms than humans when it comes to copyright. Sites like Anna's Archive, The Pirate Bay, etc. are blocked for humans (in India, for example), and if you download and read a book, it's called piracy. But if the same book is fed to an AI for training, apparently it's fine and dandy. So Artificial Intelligence has more freedoms than Actual Intelligence?

chasing0entropy a month ago

They are the same; however, one has money to defend against the accusations.

  • aebtebeten a month ago

    Recall the "Golden Rule": those who have the gold make the rules.

    • gsf_emergency_6 a month ago

      Used to be "eine Handvoll Soldaten" (a handful of soldiers), so it's progress!

      • aebtebeten a month ago

        I got nerdsniped by this quantity, and here's more or less where I wound up after Monte Carlo sampling some rabbit holes:

        - the Praetorian Guard, famous for having been involved in many a Roman imperial coup, varied in size between 4,500 and 6,000.

        - on 18 Brumaire, Year VIII, Napoleon had at least 6,000 men at his disposal.

        - modern brigades are around 5,000.

        - the smallest successful coups since 2010 have been in Africa, with force estimates of 4,000-6,000.

        At least in the pre-drone era, "handful" has quantitatively meant at least several thousand (although it's probably true that any political component would have to liaise with only a few senior officers, and a modern brigade is composed of a handful of modern battalions).

        [note that Napoleon's was recursively a coup-within-a-coup; his political partners thought they were the brains and he was the muscle, but events proved them mistaken]

general1465 a month ago

Buy some cheap computer, like an X99 board with a Xeon from AliExpress, add a cheap GPU like a Tesla K80, and "train" your LLM on it. Now you can pirate whatever you want and you are untouchable, because every big AI company will give you lawyers free of charge: if a judge decided against you, the precedent would be against them as well.
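
A minimal sketch of what "training" on such a box might look like, assuming a small off-the-shelf model and a local plain-text corpus; "gpt2" and "books.txt" are placeholders chosen for illustration, not anything the comment specifies:

  # Minimal sketch: fine-tune a small causal LM on a local text file.
  # "gpt2" and "books.txt" are illustrative placeholders.
  from datasets import load_dataset
  from transformers import (AutoModelForCausalLM, AutoTokenizer,
                            DataCollatorForLanguageModeling, Trainer,
                            TrainingArguments)

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  tokenizer.pad_token = tokenizer.eos_token
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  dataset = load_dataset("text", data_files={"train": "books.txt"})["train"]

  def tokenize(batch):
      # Truncate so even an older 12 GB card can hold a batch.
      return tokenizer(batch["text"], truncation=True, max_length=512)

  tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

  trainer = Trainer(
      model=model,
      args=TrainingArguments(output_dir="out",
                             per_device_train_batch_size=1,
                             gradient_accumulation_steps=8,
                             num_train_epochs=1),
      train_dataset=tokenized,
      data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
  )
  trainer.train()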

ThrowawayR2 a month ago

A while back, the FSF commissioned a series of whitepapers on Copilot when it was first released. One of them, "Copyright Implications of the Use of Code Repositories to Train a Machine Learning Model", had a professor of law as its lead author, and its analysis concluded that the use of copyrighted works for training was legally defensible. The paper is here: https://www.fsf.org/licensing/copilot/copyright-implications...

_wire_ a month ago

Time to "train" on Marshall McLuhan:

See The Gutenberg Galaxy (book) (1962)

McLuhan's Wake (documentary movie, narrated by Laurie Anderson) (2002)

Re Wake: Listen to the accompanying full interviews with McLuhan's colleagues from which the documentary is drawn.

dauertewigkeit a month ago

Western politics is all about constructing these narratives that hide the hypocrisy and self-serving nature of the dominant political factions. You can see it everywhere, but this is one clear example of it.

markus_zhang a month ago

Because ordinary people don’t make calls.

ben_w a month ago

I do not take strong views of what "should" be, the following is merely my opinion on what "is".

The legal judgement in the case of Anthropic may answer your question, although with the caveat that I'm not a lawyer, that I have no legal training, and that I may be misreading what looks like plain language but which has an importantly different meaning in law.

The judgement is here: https://cases.justia.com/federal/district-courts/california/...

To quote parts of the section "overall analysis" (page 30):

  The copies used to train specific LLMs were justified as a fair use. Every factor but the nature of the copyrighted work favors this result. The technology at issue was among the most transformative many of us will see in our lifetimes.

  The downloaded pirated copies used to build a central library were not justified by a fair use. Every factor points against fair use. Anthropic employees said copies of works (pirated ones, too) would be retained “forever” for “general purpose” even after Anthropic determined they would never be used for training LLMs. A separate justification was required for each use. None is even offered here except for Anthropic’s pocketbook and convenience.

In a way, this seems to be a repeat of the "The 'L' in 'ML' is 'learning'" argument:

You are not allowed to use the photocopier in the library to make a copy of the entire book. If your local library is anything like the ones I remember back in the UK, there's even a sign right next to the photocopier telling you this.

You are in fact allowed to go to a public library, learn things from the books within, and apply that knowledge without paying anything to any copyright holder. If/once you buy a book, likewise: once it's been bought, you don't owe the copyright holder anything for having learned something from it. This is the point of a library, of education, and indeed of copyright: the word is literally the right to make a copy, as in giving authors control over who may make a copy; it is not the right to an eternal rent on what is learned by reading a copy.

(If you then over-train a model so that it does print verbatim copies, that is bad for both legal and technical reasons: legal, because it's a copy; technical, because using a neural net to do lossy compression of documents is a terrible waste of resources, and is just like humans in exactly the way that nobody has any interest in reproducing in silicon.)
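
A rough sketch of how one might check for that kind of verbatim regurgitation: generate a continuation of a known prompt and report the longest exact overlap with the original passage. The model name and the passage below are placeholders chosen for illustration, not anything from the thread or the judgement:

  # Sketch: does the model reproduce a known passage verbatim?
  from difflib import SequenceMatcher
  from transformers import pipeline

  original = ("It was the best of times, it was the worst of times, "
              "it was the age of wisdom, it was the age of foolishness...")
  prompt = original[:40]

  generator = pipeline("text-generation", model="gpt2")  # placeholder model
  output = generator(prompt, max_new_tokens=60)[0]["generated_text"]
  continuation = output[len(prompt):]  # only the newly generated text

  m = SequenceMatcher(None, continuation, original).find_longest_match(
      0, len(continuation), 0, len(original))
  print("Longest verbatim overlap:", repr(continuation[m.a:m.a + m.size]))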

Nextgrid a month ago

Same reason that when a person lies (sometimes even by omission) it's called "fraud", but when a company does it, it's just business as usual or, at worst, a "mistake" resolved by employee training.
