Ask HN: How will Google differentiate between content written by AI and content written by humans?
ChatGPT has taken the internet by storm these past few days, flooding the web with screenshots of exchanges between testers and the AI. We can ask it for whatever we want, like writing an article or a post (don't worry, I'm the one typing behind my keyboard).
I was wondering last night how Google (not to mention the other search engines) will adapt its ranking algorithms to distinguish content written by a human from content generated by an AI.
Because if everyone starts generating articles with AIs, the web will become a pile of AI-created content built on content written by humans for humans. And we know that these models are trained on text corpora from various sources (including the web).
Where is the boundary? The more content AIs generate, the more new AIs will train on content generated by other AIs to produce new content.
We will end up reading content that humans no longer write; human-written content could become almost rare.
Content is only one of the many variables the algorithm takes into account when indexing pages, but it will be interesting to see how search engines solve this issue.
Tell me what you think.

Does it matter? Search engines work on how useful 'knowledge' is for the seeker. One of the parameters is how many others have backlinked to the same source, among various other signals. If AI content is highly cited and was found useful by many others, then I would want it at the top of the search results. The question will have to be asked in the age of misinformation and misrepresentation. As it is, AI is often trained on content from trusted third parties (or scientific articles, in many cases).

I'm just raising the question of a snowball effect, where low-quality content will generate more low-quality content. Obviously, as I said and as you point out, the reputation and backlinks of a piece of content are variables that can affect how well pages are indexed. But what if those variables are manipulated too?

Google results are already bad; ChatGPT will only accelerate the death of the current algorithm. Check out this discussion: https://news.ycombinator.com/item?id=30347719

I was thinking about this issue this morning. I wonder if anyone has tried to create a search engine that gives weight to content approved by experts, and adds smaller weight based on engagement or other factors. I think Pocket is an example of what I have in mind. It would definitely require massive resources to build, but it would be the new trusted Google.

Yes, this suggests an obligation of transparency, or the establishment of a "signature" that would identify whether content came from a human or not.

I wrote a userscript (with help from ChatGPT) that identifies whether comments on HN are written by an AI or a human. I based it on https://huggingface.co/openai-detector. It's still a little shabby and only works on HN, but I imagine this is going to be required for general Internet browsing going forward. Looks like this: https://i.imgur.com/BTt1DTh.png

Very interesting! How does it work?
It puts every comment into that GPT output detector, then colors the HN comment and writes a short note on it, like you see in the screenshot, based on a threshold: >0.7 is probably AI, >0.9 is definitely AI, and anything lower is most likely human. Most comments still appear to be human. It only becomes reliable after about 50 tokens (one token is around 4 characters), so I mark comments that are too short in gray and make no assessment on those. I've put it on https://github.com/chryzsh/GPTCommentDetector

I see. My question about how it works was more about the method you use to identify content as written by an AI, but from what I saw in your repo, you rely on a set of GPT-specific configurations that produce a similarity percentage for the content being analyzed?

As described in the repo, I just feed it to the GPT output detector. I didn't write that tool, but from my understanding they trained a GPT model to recognize itself.

Okay, cool. I had heard of the same kind of AI training at the release of DALL-E 2, where one AI was dedicated to generating the image and another AI checked whether the generated image came from an AI or not. +1 star for this repo, btw.

With pleasure; we are also working on an open-source project (called Luos engine). Giving support just by clicking on a star is a quick click with a big effect.

There isn't a hard boundary between "content written by an AI" and "content written by a human". I can say to ChatGPT: "Here are my ideas in point form, write them out as full paragraphs: ..." And ChatGPT will do that for me. So, is that content written by an AI, or content written by a human?

I'm talking more about the snowball effect this kind of practice would have in the long run: training models working on content generated in the past by AIs, injected again as a source for a new AI. Are we heading towards an impoverishment of information?
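As an aside, the thresholding the userscript author describes (>0.7 probably AI, >0.9 definitely AI, no verdict under ~50 tokens, with one token roughly 4 characters) can be sketched in a few lines. This is a minimal illustration of the labeling logic only, not the actual userscript code; the detector score itself would come from the GPT output detector, and the function names here are hypothetical:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic from the thread: one token is about 4 characters.
    return len(text) // 4

def label_comment(text: str, fake_score: float) -> str:
    """Map a detector 'fake' probability to a display label.

    Thresholds follow the scheme described in the comment above;
    fake_score is assumed to be the detector's output in [0, 1].
    """
    if estimate_tokens(text) < 50:
        return "too short"   # not enough tokens for a reliable assessment
    if fake_score > 0.9:
        return "definitely AI"
    if fake_score > 0.7:
        return "probably AI"
    return "likely human"
```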
Maybe not, if we include trusted third parties in the data sources. The question of where training data comes from will be important to monitor.

I don't know how Google will determine the difference, but I hope the technology is called "Blade Runner" (at least as an internal code name).

Haha, +1 for the idea.

Question answering out of a knowledge graph.
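The last suggestion, answering queries out of a knowledge graph rather than ranking raw web text, can be illustrated at its simplest as a lookup over (subject, predicate, object) triples. The data and function below are entirely hypothetical, just a toy sketch of the idea:

```python
# A toy knowledge graph as (subject, predicate, object) triples.
TRIPLES = [
    ("ChatGPT", "created_by", "OpenAI"),
    ("DALL-E 2", "created_by", "OpenAI"),
]

def answer(subject: str, predicate: str) -> list:
    """Answer 'what is the <predicate> of <subject>?' by triple lookup."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]
```

The appeal over ranked web results is that answers come from curated facts, not from whatever text (human- or AI-written) happens to rank well.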