AI-Shunning robots.txt

github.com

44 points by glynnormington 2 years ago · 91 comments

nerdjon 2 years ago

I am curious, do we have any evidence that AI is adhering to robots.txt and isn’t ignoring it since they are not technically crawling in the traditional sense?

Even if they are right now it would be a quick switch for them to just ignore it.

  • omoikane 2 years ago

    I have examples in my logs of GPTBot fetching only /robots.txt, and nothing from the same /24 block fetched anything else after that, so it seems at least that bot respects robots.txt.

    Maybe your question is "how do we know whether whatever system GPTBot feeds downstream didn't just get your content via something else that crawls your site?" I am not sure we have anything to defend against that, other than signalling via robots.txt that our content is not intended for AI use.
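The check a compliant crawler performs can be illustrated with a minimal sketch using Python's standard `urllib.robotparser` (the robots.txt contents and the example URL below are illustrative, not taken from the linked repo):

```python
from urllib.robotparser import RobotFileParser

# A minimal AI-shunning robots.txt: bar GPTBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler consults this before fetching anything else.
print(parser.can_fetch("GPTBot", "https://example.com/article"))      # False
print(parser.can_fetch("Mozilla/5.0", "https://example.com/article")) # True
```

Of course, as the thread points out, this only constrains crawlers that choose to perform the check.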

  • mrkramer 2 years ago

    Internet Archive's crawler does not respect robots.txt because they want to archive everything, not just parts of the Web. But if you are actively breaking robots.txt then your crawler will get a bad reputation and you will have an army of webmasters trying to block your crawler by any means. You can see crawling requests in your server logs; that's how you know whether they are respecting it or not.

    Imo, the best solution would be to license your content so crawlers pay a fee for crawling and using it.

    • nerdjon 2 years ago

      Well TIL that IA does not respect robots.txt.

      Does IA themselves block crawlers? It doesn't look like it according to their robots.txt, even going so far as to say "Please crawl our files."

      What would stop an actor from maliciously complying with a robots.txt file by just going to the Internet Archive instead?

      • mrkramer 2 years ago

        >Well TIL that IA does not respect robots.txt.

        At least, that's what they say[0].

        >What would stop an actor from maliciously complying with a robots.txt file by just going to the Internet Archive instead?

        Nothing; as far as I understand, scraping the public web is legal, or at least that's what courts have been saying lately. Btw, it's mind-boggling to me that after 30 years of the commercial Internet and Web, we still don't have a definitive answer to whether scraping public websites and public web content is legal or illegal.

        [0] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

        • nerdjon 2 years ago

          > Nothing; as far as I understand, scraping the public web is legal, or at least that's what courts have been saying lately. Btw, it's mind-boggling to me that after 30 years of the commercial Internet and Web, we still don't have a definitive answer to whether scraping public websites and public web content is legal or illegal.

          I was more thinking from a public perception side instead of legal, but legal would be a good question too.

          Something like, "Yeah I totally respected your robots.txt file the only reason I have your data is because I crawled IA, see they are the ones you should be mad at not us"

  • andybak 2 years ago

    This is about crawling for training data, by the look of things. Not sure if the ChatGPT browsing mode uses a different user agent, but most of the entries in that list look like crawlers.

    • nerdjon 2 years ago

      I had assumed this was related to services like ChatGPT going out and searching in response to a specific request.

      Regardless, my original question still stands. These companies have already shown a lack of care about the data they train on. So if ethics have already gone out the window, what is to stop them from ignoring this file, if they aren't already?

vouaobrasil 2 years ago

Nice. Let's all contribute to this... ideally, web hosts should provide this sort of thing by default so we can starve AI companies of training data and combine it with other strategies to put them out of business for good.

  • andybak 2 years ago

    How about AI from non-companies? Or genuinely non-profit or open projects?

    Also - out of curiosity - do you use any AI yourself?

    • vouaobrasil 2 years ago

      > How about AI from non-companies? Or genuinely non-profit or open projects?

      AI from any project will allow AI to be used commercially, and thus I oppose it. Moreover, I oppose AI on various other principles even independent of this: it further isolates people and can be used to develop other technologies that are too powerful for us to handle. In short, I believe human beings en masse are too stupid to use AI.

      > Also - out of curiosity - do you use any AI yourself?

      I do not, or at least I try my best not to. In fact, I hate AI with a passion. Obviously, there may be products here and there that have used AI that I in turn use. What can you do? But I attempt to minimize any contact I have with AI: I don't use Grammarly or any form of auto-suggest, I use an ancient phone (and I RARELY use it; I hate smartphones), I don't use AI features in software such as AI noise reduction, and I turn off all automatic features in software that may have some AI behind it.

      If I find out a website uses AI for content generation, I ban it and never visit again.

      The other day I downloaded a text editor that looked cool, but I deleted it because I realized it had an AI console (even though I never used it).

      I also work for a business and I convinced them not to use AI. We're an online magazine and it turns out the vast majority of our readers supported that decision.

      In short, I am against AI because I believe it provides virtually no benefits to humanity, only detriments.

      • nerdjon 2 years ago

        I think this is an interesting situation where “AI” has become such a general term that it has lost much of its meaning, thanks to things like ChatGPT.

        Video game AI is obviously in a different league than ChatGPT but uses the same label.

        Some AI is machine learning and some isn’t.

        I agree with a lot of what you are saying, but I think there are valid use cases of AI (from before ChatGPT) that are actually a benefit.

        You don’t use a smartphone, but autocorrect is a genuinely great addition. It doesn’t remove anything the human does and improves usability.

        On its own autocorrect isn’t going to write a story. Even the suggestions that have been added in recent years are more for human usability than anything else.

        Handwriting reading models, fall detection models, etc.

        I do think we need to separate generative AI that replaces humans from traditional AI that offers assistance.

        I know someone is going to argue, "well, ChatGPT is augmenting me by checking my code, emails, etc.," and that may be true right now, but we are kidding ourselves if we think that will be the situation long term.

        • mrkramer 2 years ago

          >I think this is an interesting situation where “AI” has become such a general term that it has lost much of its meaning, thanks to things like ChatGPT.

          AI is just a branding buzzword used by Big Tech companies... before "AI", "machine learning" and "deep learning" were the widely used buzzwords. "Blockchain" and "SaaS" are also some of the most infamous ones.

          • nerdjon 2 years ago

            I get what you are saying, but AI isn't a new term. Even when ML was a commonly used term AI was still thrown around.

            For as long as I can remember AI has been the term used for video games.

            But that is the problem when talking about "AI". Because Video Game AI isn't even Machine learning, but we also call ChatGPT AI.

            So I just caution against saying "No AI" when that could get rid of things that are nowhere near what we are currently talking about for AI.

      • rideontime 2 years ago

        Likewise, I've unsubscribed from multiple paid Patreons and Substacks as soon as they started using AI to generate content. I'd rather see an amateur MSPaint scribble than some dall-e monstrosity at the head of a newsletter.

        • vouaobrasil 2 years ago

          Exactly! Even if a person isn't that great at drawing or painting, it's so much more interesting to view their attempts because they reflect something about the person. The AI-generated fluff might look nice (even I'll admit some look interesting), but it's devoid of human soul. And there's really not much point in art if it has no communicative value that originates from living beings.

          • rideontime 2 years ago

            Yeah, I should be clear that I don't use the word "monstrosity" to suggest that all AI-generated images are ugly, but rather that they're inherently, viscerally disgusting regardless of their visual appeal. No amount of progress in the visual quality of their output will change how I feel about it.

      • andybak 2 years ago

        OK - that's fine then. I might disagree but I can't accuse you of inconsistency.

jddj 2 years ago

The named source, https://darkvisitors.com, is interesting.

cabirum 2 years ago

The crawlers can simply stop identifying themselves via custom user agent, can't they?

Also, why are "AI" crawlers worse than "normal" crawlers?

Either way, this is an exercise in futility.

  • vouaobrasil 2 years ago

    > Either way, this is an exercise in futility.

    Is it really? Every drop of opposition towards AI is, in my book, a good thing. This robots.txt thing may be a small drop, but over time public hatred for AI can build, and it might in fact be taken down. Especially outside the tech bubble, many people are ambivalent towards AI.

    Yes, in modern society we are taught to value innovation and ignore its downsides, but the more vocal its opponents are, the more those downsides will become apparent. Hopefully, it will bring the ruin of all AI companies and research.

    • cabirum 2 years ago

      I'm kind of out of the loop as to why we need to hate on AI. The bubble will burst given time, like all the other bubbles before it.

      What's needed is indifference, not hate.

      • nerdjon 2 years ago

        I think we need to be careful about assuming the AI bubble is really going to burst; there is way too much money being put into it by big players for them to just give up. I mean, Microsoft is even pushing for a new key on Windows keyboards.

        We also have to look at this bubble differently. With Crypto there was a general technical level that was required to use it that many people just did not have. I would say the same is true for most bubbles we have seen.

        AI is generally available to anyone who understands how to open a browser. Too many non-technical people have just accepted it without understanding the risks.

        Edit: You can really see just how bad this is going to get if you look at Apple. How they are deemed to be "behind" because they don't have generative AI right now and how they are so desperate that they are looking at Google.

        I really don't think this is going to burst and really all that is going to happen is even more consolidation and we will be screwed.

        • starbugs 2 years ago

          > Edit: You can really see just how bad this is going to get if you look at Apple. How they are deemed to be "behind" because they don't have generative AI right now and how they are so desperate that they are looking at Google.

          > I really don't think this is going to burst and really all that is going to happen is even more consolidation and we will be screwed.

          It's always FOMO until it's FUD.

      • starbugs 2 years ago

        > What's needed is indifference, not hate.

        Agree that hate is not the solution. But saying "no" decidedly and loudly is absolutely something that people need to relearn and I am happy to see that some do now.

        Just being indifferent is not sufficient.

        Finding and taking measures to protect yourself and others is a positive way to approach this IMO.

      • vouaobrasil 2 years ago

        I disagree. If something is truly damaging to society, we need to oppose it actively and crush it.

        Indifference is what leads people to ignore climate change. Indifference is what leads people to allow corporations to destroy communities. Indifference is what allows global capitalism to keep going.

        I am not concerned with an AI hype bubble bursting. AI already has enough power to strengthen industrial society outside the hype. Perhaps hatred is not quite the right word. Perhaps love is a better one: a love for smaller communities and more sustainability. Wasn't it Che Guevara who said, "the true revolutionary is guided by strong feelings of love"?

        Regardless, indifference is not the answer: it is passionate opposition, with a zero-tolerance policy towards AI, AI researchers, and AI companies.

    • Wissenschafter 2 years ago

      Only on Hacker News do you get the most ironic of takes: supposedly, someone who is educated and technologically literate to a high degree thinks opposition to AI is a good thing.

      Crazy world.

      • quickslowdown 2 years ago

        Crazy world when any random person on the Internet just assumes everyone else thinks the way they do. Put me solidly in the camp of educated & "technologically literate to a high degree" and also in opposition to AI.

        • Wissenschafter 2 years ago

          The two are inherently incompatible to me. I don't understand how you can hold those positions.

          To me, being against AI is pretty much 'evil'. You support humans existing in a barbaric existence like animals, suffering. I support laws to criminalize people hampering it like they are doing in Japan.

          I literally will fight against people like you.

          • ler_ 2 years ago

            Well, with AI people will still likely be suffering, except those at the top of the pyramid will be doing even better. Not so great. The AI debate is not about technology per se, but more so about seeing how gains in technology / productivity only go to the top, and extrapolating how bad that will be when greed meets the hyper-productivity of AI.

          • quickslowdown 2 years ago

            I'm not going to fight you lol, and your hyperbole means nothing to me. You strut in here making assertions and assume everyone just feels the same and is going to go along.

            Your mentality is irresponsible and exactly the problem I have with AI.

      • lewhoo 2 years ago

        The level of scrutiny and plurality of opinions towards technology in general is imho what makes a tech forum good.

        • Wissenschafter 2 years ago

          I'm not saying it's a bad thing, it's just so unexpected that every time I see it it fascinates me.

      • nerdjon 2 years ago

        > thinks opposition to AI is a good thing.

        Because we don't want nearly every job to be automated by AI?

        "Crazy world"

        • Wissenschafter 2 years ago

          It's crazy that you think we don't want our jobs automated away. The whole point of automation is to reduce human labor. Mind-boggling that people literally choose to want to labor even when it is no longer necessary.

          • krapp 2 years ago

            >Mind boggling people literally choose to want to have to labor even if it is no longer necessary.

            It is necessary, it just isn't available. We still live in a capitalist society in which anyone not a member of the capitalist class is required to labor in order to afford the necessities of survival. AI means fewer opportunities to do so, despite the requirement remaining constant. No one is choosing to labor under this system, any more than one chooses to eat, drink or sleep.

            AI is not being implemented to free the labor class from this obligation, it is being implemented to free the capitalist class from the obligation to provide the means of survival to the labor class in exchange for their labor. The end result will not be the labor classes living lives of luxury in creative and intellectual pursuit, but as much unemployment and poverty as the market can bear.

        • bembo 2 years ago

          Why wouldn't we want jobs to be automated exactly?

          • vouaobrasil 2 years ago

            Because if a job is automated, the person whose job is automated is now unemployed. Even if we have UBI, the person DOING the automating will get a disproportionate share of the resources compared with the pittance given to the person who was automated.

            Personally, I don't want my job to be automated. I write for a living and if AI takes my job, I won't get paid. I prefer to create value in the world that other people appreciate. I don't WANT to sit in a concrete cage (an apartment) and consume media, with no real purpose in society.

            Believe it or not, the majority of people in the world need to feel like they are working for something. Yes, some people will be able to find other causes (mine will be the opposition of AI), but others won't. Of course, that will mean the necessity of drugging people with media (and physical substances...why do you think marijuana is becoming legal in more places?).

            The end result is a mode of pure consumption for almost all except the elite who control all the production, and they will decide what happens with the world. Personally, I don't want that: I want land and autonomy to use it to grow food and preserve ecosystems. I want the world to be sustainable, and not just set up for the purpose of furthering technology.

            You speak of societal changes on a year-scale. I'm talking about decades and the long-term. This level of automation is bad, and won't do any favours for humanity except the ultra-rich, who will eventually perish like everyone else.

          • nunez 2 years ago

            Because every job that gets automated creates massive unemployment for those who were skilled in it

            what do you think will happen to us devs if AI gets good enough to do our jobs? Do you think our companies will keep us around because we're just so darn smart?

            What do you think is _already happening_ to in-house artists, content/technical writers, marketing analysts, and other jobs that are directly impacted by LLMs in their current form?

            • Wissenschafter 2 years ago

              I'm not a selfish asshole, just a normal asshole. Yeah, I'm going to be affected by my job also being automated, everyone will. It's not a field specific problem. It's a societal paradigm change.

          • nerdjon 2 years ago

            Well at this point in time we live in a capitalist world and people need money to survive?

            I assume you like being paid, buying things, food, etc.

            Would it be great if we lived in a utopian society where money no longer mattered? Sure! But even with AI, I see basically zero chance of that happening in any reasonable amount of time before AI destroys our society.

      • vouaobrasil 2 years ago

        Actually, I believe it was my education in pure mathematics (PhD), computer programming, and technology, as well as living in three countries and seeing the world, that allowed me to truly realize how damaging AI can really be. Believe it or not, I was heavily into technology when I was in my 20s.

        But after seeing environmental damage, reading widely in philosophy and sociology, writing about it to clarify it, I came to a different conclusion: that technology is not all it's cracked up to be, especially when it is plugged into a system of global capitalism whose ultimate aim is consumerism.

        Just think: one of the biggest companies in the world (Google/Alphabet) has as its primary goal to promote unsustainability. If that doesn't make you think, what will?

        And let me ask you this: are you so sure about technology, given that you were raised in a world that praises it like a religion? I think the fanatical religious witch hunters also thought they were right, simply because they were raised in such an environment.

        Loving technology is the default position of the rich. Is that just a coincidence?

        • Wissenschafter 2 years ago

          I'm 30; you sound older. I did my education in philosophy and physics and I'm a data engineer at a F200. I think you couldn't be more wrong, and out of touch. You sound like Kaczynski.

          I love technology because it's interesting, who cares what rich people think. I can't understand this oldhead defeatist mentality a lot of people here seem to have similar to yours.

          • vouaobrasil 2 years ago

            I love wildlife and the natural world, and technology is utterly dependent on destroying it via mining. The directions we go in shouldn't be determined based on just whether they are interesting. They also need to be constrained based on whether they are sustainable. You obviously don't care much about the negative environmental effects of technological development.

            By the way, I am not defeatist because:

            1. I think we can make great progress, only progress towards rewilding nature

            2. I only consider technology a dead-end, not humanity! I believe we can move past arbitrary technological development and discard our consumerist ways.

            • Wissenschafter 2 years ago

              A mine is just as natural as an anthill. It's your perspective that is the unnatural one.

              Humans are not separate from the nature we exist in.

              • pseudonamed 2 years ago

                Indeed, we aren't separate; that does not mean that swimming in an acid mine lake is equivalent to swimming in a clear glacial lake... one is more likely to kill you than the other.

                You mentioned studying philosophy, you're confusing your ontological and epistemological positions.

              • vouaobrasil 2 years ago

                In that case, we should just eradicate all life on this planet because any action we do is natural....

                ...or, we could actually evaluate the transition of life from natural to technological. Yes, in a philosophical sense, you are right, what we do is "natural". But then if we just say everything we do is natural, then we might as well just do nothing. But there are still meaningful distinctions between the technological human organization (even if natural) and the rest of the world, and we would do well to examine if what we are doing is really harmonious with everything-but-us -- because if not, then I reject it outright even if it is natural in the sense that you describe.

                Your wordplay is really not very impressive.

          • pseudonamed 2 years ago

            "who cares what rich people think" is an astoundingly politically naive statement.

          • beeboobaa3 2 years ago

            You sound incredibly naive. Good luck enjoying technology when all you get to do is manual labor to feed the machine. Everything else will be done "for us", with the rich people in charge of it all.

            • Wissenschafter 2 years ago

              I sound naive? Have you even read any Marx?

              • beeboobaa3 2 years ago

                Are you going to reply to the comment, or just keep slinging vague personal attacks?

                • Wissenschafter 2 years ago

                  I did reply to the comment, you didn't even ask a question. What is your point? Also, I will vaguely attack people. I don't care. Stupid people and things annoy me and I don't have a filter for it. Sue me.

      • starbugs 2 years ago

        > Supposedly someone who is educated and technologically literate to a high degree, thinks opposition to AI is a good thing.

        Ignoring that your comment is phrased in a hurtful way towards the parent which got you a downvote from me, why do you think these two are connected in any way?

  • karaterobot 2 years ago

    > Also, why are "AI" crawlers worse than "normal" crawlers?

    A search engine will index your content to bring people to it through search. An AI crawler will take your content to recapitulate it and sell it to others. Obviously it's more complicated than this, but this is how someone who wishes to use this file might see it.

    > Either way, this is an exercise in futility.

    Not necessarily disqualifying. Laws against theft are also futile, in the sense that honest people don't need them and dishonest people don't follow them, and history since at least Hammurabi has been replete with examples of such laws not stopping theft. And yet. Seems worth the calories it costs to say "for the record, I do not give my consent for what you're doing".

    • cabirum 2 years ago

      Search engines are not the beacons of holiness - they sell ads, they sell data on who searched what, they manipulate results.

      Search engines and AI things are typically owned by the same company. AIs are fed with the data collected by a search engine. The only difference is whether AI gets the data in realtime or waits for the search engine to collect another data dump.

      Fighting windmills as I see it.

      • nunez 2 years ago

        Search engines manipulate results way less aggressively than LLMs do.

belter 2 years ago

Or redirect them to poisoned material?

  • vouaobrasil 2 years ago

    That is a good idea. Maybe redirect them to massive datasets to cause the company mass embarrassment. There are already some image-modifying programs that generate poison images, and the bots could be redirected to such images...

internetter 2 years ago

This is missing a couple; one that comes to mind is `FriendlyCrawler`, which is most definitely not friendly and very likely for AI.

andybak 2 years ago

As someone who uses and benefits from the results of AI crawlers, I would only want to block crawls under very specific circumstances.

I would back a general move to block crawlers from non-open models (whatever that means and if such a thing was practical) as it might be a strong lever to encourage good behaviour.

rocky_raccoon 2 years ago

Not that I'm arguing for or against preventing access from AI crawlers, but wouldn't it make more sense to block them at a higher level, e.g. the webserver, and not even give them the choice to obey/disobey robots.txt?

  • rideontime 2 years ago

    How would you propose doing so?

    • rocky_raccoon 2 years ago

      Off the top of my head:

      - Cloudflare

      - Webserver-level user-agent blocking (Apache, nginx)

      - Application-level user-agent blocking (`if request.user_agent == 'OpenAI'`)

      None of them is ideal, since a crawler can simply change its user agent, but all of them seem like better options than robots.txt to me.
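The application-level option above can be sketched as follows; a rough illustration, assuming a plain string user agent (the crawler tokens listed are examples, not an authoritative list):

```python
# Substrings that identify some self-declared AI crawlers (illustrative only).
AI_CRAWLER_TOKENS = ("GPTBot", "CCBot", "Google-Extended", "anthropic-ai")

def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header mentions a known AI crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)

# In a request handler, one might return a 403 when this is True.
print(is_ai_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # True
print(is_ai_crawler("Mozilla/5.0 (X11; Linux x86_64)"))       # False
```

As the comment notes, any of these checks is defeated the moment a crawler stops identifying itself.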

    • adrianN 2 years ago

      We could repurpose the evil bit.

    • gtirloni 2 years ago

      Web servers can check the user-agent and block the request.

      E.g. nginx's `$http_user_agent` variable.
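For instance, a sketch of what that might look like inside an nginx `server` block (the crawler names are illustrative):

```nginx
# Reject requests whose User-Agent mentions a known AI crawler
# (case-insensitive regex match on $http_user_agent).
if ($http_user_agent ~* (GPTBot|CCBot|Google-Extended)) {
    return 403;
}
```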

CalRobert 2 years ago

Given how intertwined AI and search engines are it's hard to see how this helps aside from _maybe_ making things easier for Google, Microsoft, etc., unless you also don't want to be indexed by search engines.

bakugo 2 years ago

This makes complete sense because, as we all know, AI companies are very concerned with respecting the rights of the people they steal data from, and totally won't just ignore this.

  • frizlab 2 years ago

    At least you show intent and can then potentially prove they are not respecting your wishes. It’s better than doing nothing.

natch 2 years ago

We need AIs to know more, not less. If many people block AIs from reading their sites, AIs will just be stuffed with biased information from people pushing agendas.

  • nerdjon 2 years ago

    So the value of them will plummet? That sounds like a win for society.

    • natch 2 years ago

      Why would the value of AIs plummet if they know more?

      Or did you mean sites? Information wants to be free.

      If AI is trained only on data provided by those with agendas, you won’t want to live in that world.

      • nerdjon 2 years ago

        I am saying the opposite: if they have less data, the value of AIs will plummet, and hopefully their use will plummet too.

        That is a good thing.

        • natch 2 years ago

          Market dynamics. Since use of better AIs confers advantages, they will be improved. No set of players will be able to stop this because the incentives are so strong for others to continue using and developing them. The best we can hope for is AIs that are not misled.

          Sometimes you have to work with the tide, because fighting it is futile and even self defeating.

          • starbugs 2 years ago

            > Sometimes you have to work with the tide, because fighting it is futile and even self defeating.

            Said the fish before approaching the waterfall.

            While I agree that the incentive structure is set up in a strong way for AI to be further improved and rolled out, what's the endgame here? Who can build the most powerful centralized AI so that nearly everyone else is out of a job? And who is that going to benefit?

            I just don't get it.

            Have we all decided to "just play the game" and ignore how dumb it is?

            • natch 2 years ago

              The endgame is the singularity, which, by definition, offers no clear view of the future.

              I don’t see any effective way to fight it other than joining it. Any well intentioned steps can be easily and even unintentionally subverted by other players who have different perspectives on the ethical landscape.

              It’s not that we have decided. Others will decide for us regardless of our decision.

              • starbugs 2 years ago

                > I don’t see any effective way to fight it other than joining it.

                How do you fight it by joining it?

                • natch 2 years ago

                  Yeah I butchered that expression a bit. I mean that if we participate we at least have a higher likelihood of helping to steer things and continuing to be relevant going forward.
