You cannot have our users' data

sourcehut.org

116 points by Tomte 11 days ago


simonw - 11 days ago

Blocking aggressive crawlers - whether or not they have anything to do with AI - makes complete sense to me. There are growing numbers of badly implemented crawlers out there that can rack up thousands of dollars in bandwidth expenses for sites like SourceHut.

Framing that as "you cannot have our users' data" feels misleading to me, especially when they presumably still support anonymous "git clone" operations.

bee_rider - 11 days ago

On the topic of licenses and LLMs: of course, we have to applaud SourceHut for at least trying to stop all the code they host from being ingested by some mechanical license-violation service. But it seems like a hard game. Ultimately the job of their site is to serve code, so they can only be so restrictive.

I wonder if anyone has tried going in the opposite direction? Something like adding to their license: “by training a machine learning model on this source code, or including data crawled from this site, you agree that your model is free to use by all, will be openly distributed, and any output generated by the model is licensed under open source terms.” (But, ya know, in bulletproof legalese). I guess most of these thieves won’t respect the bit about distributing. But at least if the model leaks or whatever, the open source community can feel free to use it without any moral conflict or legal stress.

RadiozRadioz - 11 days ago

From the Anubis docs

> Anubis uses a multi-threaded proof of work check to ensure that users' browsers are up to date and support modern standards.

This is so not cool. Further gatekeeping websites from older browsers. That is absolutely not their call to make. My choice of browser version is entirely my decision. Web standards are already a change treadmill; this type of artificial "You must be at least Internet Explorer 11" or "this website works best in Chrome" nonsense makes it much worse.

My browser is supported by your website if it implements all the things your website needs. That is the correct test. Not: "Your User-Agent is looking at me funny!" or "The bot prevention system we chose has an arbitrary preference for particular browser versions".

Just run the thing single-threaded if you have to.
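
For reference, the kind of check being discussed boils down to roughly this (a generic sketch in Python, not Anubis's actual code, which runs in the browser): the server hands out a challenge string and a difficulty, the client grinds hashes until it finds a counter that meets the target, and the server verifies the answer with a single hash.

    import hashlib

    def solve(challenge: str, difficulty_bits: int) -> int:
        """Brute-force a counter whose sha256(challenge + counter) digest
        has at least `difficulty_bits` leading zero bits."""
        target = 1 << (256 - difficulty_bits)
        counter = 0
        while True:
            digest = hashlib.sha256(f"{challenge}{counter}".encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return counter
            counter += 1

    def verify(challenge: str, counter: int, difficulty_bits: int) -> bool:
        """Cheap server-side check of the client's answer: one hash."""
        digest = hashlib.sha256(f"{challenge}{counter}".encode()).digest()
        return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

The search is embarrassingly parallel, which is presumably where the multi-threading comes in, but it is an optimization for the client, not a requirement of the scheme itself.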

dale_glass - 11 days ago

Yeah, I don't like this.

We have an Apache-licensed project. You absolutely can use it for anything you want, including AI analysis. I don't appreciate third parties deciding on their own what can be done with code that isn't theirs.

In fact I'd say AI is overall a benefit to our project: we have a large, quite complex platform, and the fact that ChatGPT sometimes manages to write correct scripts for it is quite wonderful. I think it helps new people get started.

In light of the recent GitHub discussion, I'd say I personally see this as a reason to avoid SourceHut. Sorry, but I want all the visibility I can get.

matt3210 - 11 days ago

Anubis has had great results blocking LLM agents: https://anubis.techaro.lol/

xvilka - 11 days ago

The solution is to make Git fully self-contained and encrypted, just like Fossil[1]: store issues and PRs inside the repository itself, making it a truly distributed system.

[1] https://fossil-scm.org/
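
For a rough flavor of the "issues live in the repository" idea (a made-up layout, not how Fossil or any existing git tool actually does it): each issue is just a file committed like any other, so every clone, fork, and mirror carries the full tracker.

    import json, pathlib, subprocess, time

    def file_issue(repo: pathlib.Path, title: str, body: str) -> pathlib.Path:
        """Store an issue as a plain file in the repo and commit it, so the
        tracker travels with every clone of the repository."""
        issues_dir = repo / ".issues"
        issues_dir.mkdir(exist_ok=True)
        path = issues_dir / f"{int(time.time())}.json"
        path.write_text(json.dumps(
            {"title": title, "body": body, "status": "open"}, indent=2))
        subprocess.run(["git", "add", str(path.relative_to(repo))],
                       cwd=repo, check=True)
        subprocess.run(["git", "commit", "-m", f"issue: {title}"],
                       cwd=repo, check=True)
        return path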

rvba-fr - 11 days ago

Looks like git diffs are the new gold for training LLMs: https://carper.ai/diff-models-a-new-way-to-edit-code/

sltr - 11 days ago

> a racketeer like CloudFlare

Could anyone teach me what makes this a fair characterization of Cloudflare?

mvdtnz - 10 days ago

What I don't understand is why these scrapers so aggressively scrape websites which barely change. How much value is OpenAI etc getting from hammering the ever-living shit out of my website thousands of times a day when the content changes weekly at most? I truly don't understand the tactic. Surely their resources are better spent elsewhere?
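
For what it's worth, a polite crawler re-checking a rarely changing page could use HTTP conditional requests and cost the host almost nothing. A rough sketch with the requests library (the URL is just a placeholder):

    import requests

    def fetch_if_changed(url: str, etag: str | None = None):
        """Conditional GET: the server replies 304 with no body if the page
        hasn't changed since the ETag we saw last time."""
        headers = {"If-None-Match": etag} if etag else {}
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 304:
            return None, etag                       # unchanged, nearly free for the host
        return resp.text, resp.headers.get("ETag")  # new content plus its new ETag

    # body, etag = fetch_if_changed("https://example.com/page")
    # body, etag = fetch_if_changed("https://example.com/page", etag)  # likely 304

A 304 carries no body, so even frequent re-polling would be cheap; the scrapers being complained about evidently don't bother.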

imglorp - 10 days ago

I would like to see a scraper tarpit that provides an endless stream of sub-pages, all filled with model-training poison. Enough inaccurate or inappropriate material from enough tarpits will make this practice less profitable.
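
A minimal version of that is not much code. A sketch with Flask (the route prefix, link count, and junk vocabulary are all invented for illustration):

    import random
    from flask import Flask

    app = Flask(__name__)
    WORDS = ["quantum", "lighthouse", "marmalade", "protocol", "walrus", "gradient"]

    @app.route("/trap/<int:page_id>")
    def trap(page_id: int):
        """Every page is unique junk text plus links to 20 more generated pages,
        so a crawler that ignores robots.txt never reaches the end."""
        random.seed(page_id)
        junk = " ".join(random.choices(WORDS, k=200))
        links = "".join(
            f'<li><a href="/trap/{random.randrange(10**9)}">continue</a></li>'
            for _ in range(20)
        )
        return f"<html><body><p>{junk}</p><ul>{links}</ul></body></html>"

    if __name__ == "__main__":
        app.run()

You would Disallow the /trap/ prefix in robots.txt so crawlers that do behave never see it; only the ones ignoring it end up walking an infinite graph of junk.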

RadiozRadioz - 11 days ago

How about using this to contribute an absolutely tiny number of hashes to a mining pool on behalf of the website owner, instead of just burning the energy?

frohwayyy_123 - 11 days ago

> All features work without JavaScript

Maybe they should update their bullet points...

The footnote saying "fuck you now, maybe come back later" is really encouraging.

M95D - 8 days ago

> Anubis uses a multi-threaded proof of work check to ensure that users' browsers are up to date and support modern standards.

OMG! Yet another firewall telling me what browser and OS to use.

immibis - 11 days ago

How sure are we that they're actually LLM scrapers and not just someone trying to DDoS SourceHut with plausible deniability?

sneak - 11 days ago

Pretending that published data isn’t public is a fool’s errand.

The point of a web host is to serve the users’ data to the public.

Anything else means the web host is broken.