Reddit sues Perplexity for allegedly ripping its content to feed AI

Reddit is suing Perplexity and three “data-scraping service providers” to “stop the industrial-scale, unlawful circumvention of data protections by a group of bad actors who will stop at nothing to get their hands on valuable copyrighted content on Reddit,” according to the complaint.

The company equates the data scraping companies — SerpApi, Oxylabs, and AWMProxy — to “would-be bank robbers” who “knowing they cannot get into the bank vault, break into the armored truck carrying the cash instead.” Reddit alleges that Perplexity is a customer of “at least one” of the data scraping companies, saying that it “will apparently do anything to get the Reddit data it desperately needs to fuel its ‘answer engine’ — that is, anything other than enter into an agreement with Reddit directly, as some of its competitors have done.”

According to the lawsuit, Reddit sent a cease-and-desist letter to Perplexity in May 2024 “demanding that it stop scraping Reddit data.” While Perplexity told Reddit at the time that it didn’t use Reddit content to train AI models and that it would respect Reddit’s robots.txt, after that letter, the volume of Reddit citations on Perplexity actually increased. Reddit also created a post that could only be crawled by Google, and “within hours,” Perplexity “produced the contents” of that post, the company says.

“The only way that Perplexity could have obtained that Reddit content and then used it in its ‘answer engine’ is if it and/or its Co-Defendants scraped Google SERPs for that Reddit content and Perplexity then quickly incorporated that data into its answer engine,” Reddit writes.

Reddit’s data — posts on all sorts of topics written by and ranked by humans — is hugely valuable for training AI models, and the company knows it; the API changes that sparked the 2023 protests were positioned as a way for the company to be compensated for that data. Reddit has struck deals with AI companies including OpenAI and Google, and it reportedly wants better ones. And Reddit has previously taken legal action against Anthropic, alleging that Anthropic’s bots accessed Reddit’s platform even after Anthropic said they wouldn’t be doing that.

“AI companies are locked in an arms race for quality human content — and that pressure has fueled an industrial-scale ‘data laundering’ economy,” Ben Lee, Reddit’s chief legal officer, says in a statement. “Scrapers bypass technological protections to steal data, then sell it to clients hungry for training material. Reddit is a prime target because it’s one of the largest and most dynamic collections of human conversation ever created.

“Defendants Oxylabs UAB, AWM Proxy, and SerpApi — a Lithuanian data scraper, a former Russian botnet, and a company that openly advertises its shady circumvention tactics — are textbook examples of this illegal behavior,” Lee says. “Unable to scrape Reddit directly, they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search. Perplexity is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself.”

“Perplexity has not yet received the lawsuit, but we will always fight vigorously for users’ rights to freely and fairly access public knowledge,” Jesse Dwyer, Perplexity’s head of communication, told The Verge in a statement when we first published this article. “Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest.”

Later on Wednesday, Perplexity published a longer statement on Reddit. “Whenever anyone asks us about content licensing, we explain that Perplexity, as an application-layer company, does not train AI models on content,” Perplexity says. “Never has. So it is impossible for us to sign a license agreement to do so. A year ago, after explaining this, Reddit insisted we pay anyway, despite lawfully accessing Reddit data. Bowing to strong arm tactics just isn’t how we do business.”

Oxylabs’ Chief Governance and Strategy Officer, Denas Grybauskas, shared this statement with The Verge:

We’ve read about Reddit’s lawsuit naming Oxylabs, along with three unrelated and unaffiliated companies. We are shocked and disappointed by this news, as Reddit has made no attempt to speak with us directly or communicate any potential concerns. Oxylabs has always been and will continue to be a pioneer and an industry leader in public data collection, and it will not hesitate to defend itself against these allegations. Oxylabs’ position is that no company should claim ownership of public data that does not belong to them. It is possible that it is just an attempt to sell the same public data at an inflated price.

Our business creates real-world value for thousands of businesses and researchers, such as those driving open-source investigations, disinformation tackling, or environmental monitoring. We strongly believe that our core business principles make the internet a better place and serve the public good.

Oxylabs provides infrastructure for compliant access to publicly available information, and we demand every customer to use our services lawfully.

Update, October 23rd: Added statements from Perplexity and Oxylabs.
