This is on purpose
The Company has intentionally added two bogus posts to every Data Dump export for every Network website. The Company has declined to comment on these posts or to explain their existence. However, with some detective work, it becomes rather obvious that they are an intentional addition.
These two posts will have the following characteristics:
- One post is a Question (Post Type = 1), published by the Community User (User ID = -1).
- One post is an Answer (Post Type = 2), published by the Community User (User ID = -1).
- The Bogus Question will be Post ID 1000000001.
- The Bogus Answer will be Post ID 1000000010.
- The Title and Body will vary for each network site. However, in all cases, the bogus post will contain factual inaccuracies, typos, or other nonsense that identifies it as a work of fiction almost immediately.
- The writing is consistently so poor/incorrect that it makes Generative AI look well-researched & carefully proofread in comparison.
- These posts are NOT present in SEDE, which is the source of the Data Dump.
What to do?
If you are consuming the Data Dump, you should explicitly exclude posts with IDs 1000000001 & 1000000010. Post IDs are monotonically increasing values, beginning at 1 and ticking up for each new post. A number of scenarios will result in numbers being "skipped" and never used, but the gigantic gap here indicates that these IDs were chosen intentionally to avoid a conflict with real data as it is generated.
More generally, if a small number of posts have IDs that are an order of magnitude larger than the rest, those posts should be considered suspicious and filtered out, as in the sketch below.
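As a minimal sketch of such a filter, assuming the standard Posts.xml layout of the Data Dump (a stream of `row` elements whose `Id`, `PostTypeId`, and `OwnerUserId` attributes carry the values described above). The percentile-based cutoff is my own illustrative heuristic, not anything the Company documents:

```python
import xml.etree.ElementTree as ET

# Known bogus post IDs injected into every site's Data Dump export.
BOGUS_IDS = {1000000001, 1000000010}

def clean_posts(path):
    """Stream rows from Posts.xml, skipping the two injected posts."""
    for _, elem in ET.iterparse(path):
        if elem.tag == "row" and int(elem.attrib["Id"]) not in BOGUS_IDS:
            yield dict(elem.attrib)
        elem.clear()  # keep memory flat while streaming a large dump

def suspicious_ids(post_ids):
    """Flag the handful of IDs an order of magnitude above the bulk of
    the distribution, compared here against the 99th percentile."""
    if len(post_ids) < 100:
        return set()
    p99 = sorted(post_ids)[int(len(post_ids) * 0.99)]
    return {pid for pid in post_ids if pid > 10 * p99}
```

Anything `suspicious_ids` flags can then be dropped alongside the two known IDs.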
But why the heck are they doing this?
The Data Dump was unchanged for years, until the Generative AI boom had a significant impact on Stack Overflow and the public network. At that point, the company began making changes to the Data Dump: first attempting to end its public distribution, then selling/monetizing site data to LLM producers, then moving the download to a "walled garden" and attaching additional terms that prohibit use for training AI models, and now injecting these bogus posts.
The company has not explained why it is doing this, but the reasoning seems to be a ham-fisted attempt at a "honeypot" to catch people using the data for commercial services without paying Stack Overflow to license it. Presumably, the company is monitoring web traffic for requests that 404 on the various URL slug formats (e.g., /q/1000000001) for the 1000000001 and 1000000010 post IDs. Additionally, they may have other monitoring looking for references to the fictional products, URLs, etc. from the content itself.
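To be clear, that is speculation on my part; but a naive version of such monitoring could be as simple as grepping access logs for the honeypot slugs. A sketch, assuming a standard combined-format access log (all names here are mine, and none of this is confirmed Company behavior):

```python
import re

# The fabricated post IDs, reachable under several slug formats.
HONEYPOT_IDS = ("1000000001", "1000000010")
PATTERN = re.compile(
    r'"GET /(?:q|a|questions|answers)/(?:%s)\b' % "|".join(HONEYPOT_IDS)
)

def honeypot_hits(log_path):
    """Yield the client IP of every request that 404'ed on a honeypot slug."""
    with open(log_path) as log:
        for line in log:
            if PATTERN.search(line) and ' 404 ' in line:
                yield line.split()[0]  # first field of a combined log: client IP
```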
Speed Trap without a Speed Limit
Even though the company has introduced their "honeypot" trap, the enforcement mechanism is unknown, and what they are enforcing is wholly unenforceable. Here's why:
I obtained my copy of the data dump from the open internet, not from Stack Overflow. The Data Dump is officially referred to as the "Creative Commons Data Dump" in Section 6 of the site TOS. The Terms of Service explicitly declare the Data Dump to be covered by the same CC BY-SA license as the posts themselves.
From time to time, Stack Overflow may make available compilations of all the Subscriber Content on the public Network (the “Creative Commons Data Dump”). The Creative Commons Data Dump is licensed under the CC BY-SA license. By downloading the Creative Commons Data Dump, you agree to be bound by the terms of that license.
Last year, when the Company moved to self-hosting the Data Dump, they added this controversial checkbox to the Data Dump download page:
I understand that this file is being provided to me for my own use and for projects that do not include training a large language model (LLM), and that should I distribute this file for the purpose of LLM training, Stack Overflow reserves the right to decline to allow me access to future downloads of this data dump.
However, they did not change the license under which the Data Dump is distributed. The license.txt contained in the download reads as follows:
All content contributed to Stack Exchange sites is licensed under the
Creative Commons CC BY-SA license (various versions, including 2.5, 3.0,
and 4.0). We also provide data for non-beta sites as part of the data
dump, which is licensed as a whole under CC BY-SA 4.0:
https://creativecommons.org/licenses/by-sa/4.0/
Some of the content may have initially been contributed under earlier
versions of the license (2.5 or 3.0):
https://creativecommons.org/licenses/by-sa/2.5/
https://creativecommons.org/licenses/by-sa/3.0/
The CC BY-SA licensing, while intentionally permissive, does require
attribution:
Attribution — You must attribute the work in the manner specified by
the author or licensor (but not in any way that suggests that they
endorse you or your use of the work). If you republish this content,
we require that you:
1. Visually indicate that the content is from the Stack Exchange site
it had originated from in some way.
2. Hyperlink directly to the original question on the source site (e.g.
https://stackoverflow.com/questions/12345).
3. Show the author name for every question and answer.
4. Hyperlink each author name directly back to their user profile page
on the source site (e.g. https://stackoverflow.com/users/123/username).
By "directly," we mean each hyperlink must point directly to our domain
in standard HTML visible even with JavaScript disabled, and not use a
tinyurl or any other form of obfuscation or redirection. Furthermore,
the links cannot be marked with the nofollow attribute.
This is about the spirit of fair attribution: to the website and, more
importantly, to the individuals who generously contributed their time
and knowledge to create that content in the first place.
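For illustration, here is what a compliant attribution line for a single post could look like. This is a minimal sketch; the helper and its wording are my own, and the URL patterns follow the examples in the license text:

```python
from html import escape

def attribution_html(site_url, question_id, title, author_name, author_id):
    """Render an attribution line meeting all four requirements: name the
    source site, link directly to the question, show the author's name,
    and link that name to their profile; plain HTML, no nofollow,
    no URL shorteners or redirects."""
    question = f'<a href="{site_url}/questions/{question_id}">{escape(title)}</a>'
    author = f'<a href="{site_url}/users/{author_id}">{escape(author_name)}</a>'
    return f'Source: {question} by {author}, from {site_url}, CC BY-SA.'

# e.g. attribution_html("https://stackoverflow.com", 12345,
#                       "Example question", "username", 123)
```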
Thus, one can obtain the data dump from elsewhere without agreeing to the additional LLM-prohibiting terms, which apply only to the user who actively downloads from the website.
Each quarter, someone downloads the data dump from Stack Overflow and re-posts it on the Internet Archive, for archival purposes and to promote open data. This is a perfectly allowed use under both the CC BY-SA license and the additional terms imposed upon the downloader by that checkbox. The binary-identical copy hosted on the Internet Archive continues to be licensed under the CC BY-SA license, but the "checkbox terms" do not travel with it.
If someone wants to skirt the "checkbox terms," they need only download the dump indirectly (i.e., via a mirror like the Internet Archive), verifying the copy is binary-identical as sketched below.
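Checking that a mirrored copy really is binary-identical to the original is straightforward. A sketch, with placeholder file paths:

```python
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Hash a large archive incrementally so it never fully loads into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder path: compare the mirror's hash against a known-good copy.
print(sha256sum("mirror/stackoverflow.com-Posts.7z"))
```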
If I choose to use, remix, and attribute the data dump per the license (i.e., by attempting to link to the fabricated Q&A), I am fully compliant with the data licensing. If my project is an LLM, let us look at what that might mean.
- Scenario A - Because I am linking to the question to provide the attribution required by the license, I am following all the legal provisions of the license. Additionally, my LLM would qualify as "Ethical AI" under the definitions used by the company, the type of LLM the Company has claimed to support.
- Scenario B - If I do not provide attribution, then my LLM is both in violation of the license terms and what Stack Overflow has deemed "unethical," the type of LLM they claim to be working to combat. However, because this scenario involves no links to the 404'ed fabricated posts, it is unclear how the Company intends to use these honeypot questions to find anyone.
Poison Data, not a honeypot
The only purpose of this seems to be to reduce the quality of the data for users of the Data Dump. It adds a processing step, requiring legitimate users to trim out the fabricated data before they use it. It is unlikely, but not impossible, that the single Q&A pair could cause problems for applications or research that use the data dump for legitimate purposes.
Poorly executed money grab
The Company began selling the Data Dump to LLM providers; OpenAI and Google were both promoted in press releases and listed as "Responsible AI partners" on the Company Partnership Page.
At the OpenSaaS conference in early 2024, Stack Overflow CEO Prashanth Chandrasekar spoke about selling data to Google Gemini and OpenAI. He also talked about how all the big AI companies are interested in buying the "Overflow AI" product, which includes both the Data Dump and API access. The Prosus Annual Report mentions that "Stack Overflow...significantly reduced losses by US$65m to US$33m". Combined with other statements by Company Leadership and analyst commentary, this suggests these data dump sales represent as much as half of the company's Annual Recurring Revenue (ARR), exceeding the revenue realized from the launches of the Enterprise & Teams products.
Paid copies of the Data Dump available through "Overflow AI" either do not contain the fabricated data, or include it as a documented "example" that those customers can exclude.
The company has not responded to this post, so it is unclear what other changes they may have made to dilute the value of the data dump by altering the accuracy of the data.