Stack Overflow has stopped publishing data dumps to the Internet Archive

meta.stackexchange.com

101 points by JasonPunyon a year ago · 19 comments

Noble6 a year ago

The rationale obviously points to Stack Exchange blocking AI companies from training on their content via archive.org. They go on to demand adherence to “socially responsible” AI training, which requires cash flow between AI companies and the data sources they train on.

First, and most obviously, Stack Exchange does NOT own the forum content. It has been provided for FREE by the larger developer community, and that same community regularly makes use of the AI tools which will be inhibited by this policy change. Second, Stack Exchange is questioning the integrity of archive.org by hiding the data.

Developers are the real victims here, and the audacity of Stack Exchange to demand money for work they DIDN’T do, while continuing to NOT pay their forum contributors, is peak irony.

  • fragmede a year ago

    You did read the TOS where you agreed that they DO own the content, yeah?

    • philipwhiuk a year ago

      Actually no, we agree to provide it under two licenses, one of which is CC-BY-SA. We don't give them ownership, we give them irrevocable usage rights.

      > You agree that any and all content, including without limitation any and all text, graphics, logos, tools, photographs, images, illustrations, software or source code, audio and video, animations, and product feedback (collectively, “Content”) that you provide to the public Network (collectively, “Subscriber Content”), is perpetually and irrevocably licensed to Stack Overflow on a worldwide, royalty-free, non-exclusive basis pursuant to Creative Commons licensing terms (CC BY-SA 4.0), and you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit such Subscriber Content, even if such Subscriber Content has been contributed and subsequently removed by you as reasonably necessary to, for example (without limitation):

      • abdullahkhalids a year ago

        Yes, but does that mean that SO is obligated to share the data with AI companies?

        I know that the CC-BY-SA [1] says

        > No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

        But I don't know if it is relevant here.

        [1] https://creativecommons.org/licenses/by-sa/4.0/

        • toomuchtodo a year ago

          It just means others will scrape and push into the Internet Archive (or publish torrents). They aren’t obligated, but they also have little control regardless of gating mechanisms.

      • fragmede a year ago

        That's just nitpicking on the definition of "do" and "own" though.

    • odo1242 a year ago

      They don't own the content according to the TOS, they get a license to use it (the Creative Commons Attribution-ShareAlike 4.0 license). They could still use it for AI training, but the model would have to be CC BY-SA 4.0 (not that AI companies care).

      This definitely forbids the "I will not transfer it to others without permission from Stack Overflow" checkbox, as the CC BY-SA 4.0 license says "You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits."

    • kmontrose a year ago

      The Stack Exchange TOS ( https://stackoverflow.com/legal/terms-of-service/public ) doesn't assign ownership - posters retain copyright, SO gets a non-exclusive license to it, and everybody else gets it under various CC wiki terms.

  • CamperBob2 a year ago

    > Developers are the real victims here

    Yes, with the promise of access to godlike oracles that the ancient Greeks couldn't have imagined, we're the real victims here.

binarymax a year ago

I see where they’re coming from but they need to sort out the license confusion.

Stack Exchange data really is the world's best open Q&A dataset. Far cleaner and more reliable than anything else.

But LLM trainers are going to use it no matter what. It’s not like they care about copyright or licenses.

JasonPunyon (OP) a year ago

You may remember a carbon copy of this event from a year ago. https://meta.stackexchange.com/questions/389922/june-2023-da...

Discussion from then https://news.ycombinator.com/item?id=36257523

swatcoder a year ago

Paraphrased: "Now that OpenAI is paying us for your freely contributed Creative Commons content, we share an interest in constructing their moat by making it harder for others to access both mechanically and legally"

PreInternet01 a year ago

Well, SO is now (possibly was?) owned[1] by the same group of companies[2] that failed to secure their own TLDs[3] for purely technical reasons, so, before nefarious intent, please also consider plain incompetence....

[1] https://techcrunch.com/2021/06/02/stack-overflow-acquired-by...
[2] https://www.google.com/search?q=prosus+multichoice
[3] e.g. https://www.icann.org/en/registry-agreements/terminated/mult...

precommunicator a year ago

I wonder if archives downloaded by two different people have different checksums? That would mean they have hidden a paper town (fake entry/signature) somewhere. I would be surprised if that's not the case, or will be the case.
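
A minimal sketch of how one might check this, assuming two people each have a local copy of the same dump file. The helper names and the file names in the usage note are hypothetical, and a differing hash only shows that the bytes differ, not why they differ.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_archive(path_a, path_b):
    """True if both files have identical SHA-256 digests."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_archive("dump_from_alice.7z", "dump_from_bob.7z")` (hypothetical file names) would return `False` if the two downloads differ by even one byte.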

luke-stanley a year ago

"Stack Overflow is no longer uploading the data dump to archive.org." "We would really rather users do not upload the file to archive.org or similar data pile sites." They have no way to stop people from doing that under the license. Only kind words. Since they've made it deliberately hard for people to train on, I'd be really surprised if people didn't put it on Archive.org and HuggingFace Datasets. So long as it's under the license, it should be fine, right? I am not a lawyer. What they said about access speed issues makes little sense to me, I torrented their dumps before just fine and was very happy to seed it.
