Announcing a change to the data-dump process

On July 26, Stack Overflow updated the post to include different modifications to the CC BY-SA license than the prior version. This remains a breach of the license terms, as the CC BY-SA license does not allow additional restrictions, as outlined below in this answer and in other answers.


There is a lot going on here... I'll come out and say it: after a first reading, this feels like a great big distraction, because Senior Leadership is treating the data dump like a boogeyman when that's wholly unfounded.

That said, let me try to organize my thoughts some.

Missing the July Dump deadline.

Just over a year ago, when I was still staff at the company, I was personally in the unenviable position of having been instructed by the Stack Overflow CEO to disable the Data Dump, and to not re-enable it because he wanted to end the dump. That decision ultimately snowballed until Stack Overflow made commitments to continue the data dump quarterly. Data Superstar Aaron ultimately made some improvements, and the delivery schedule was shifted to align better with quarterly boundaries. This is all excellent news for those of us who use the data dumps, and/or are proponents of equal data, and/or are defenders of the open data commitments made by and for the community.

Now, just one quarter after the company's most recent commitment to a schedule, it's shifting, again. For no reason. Apparently undoing the most recent schedule-shift by bumping (at least) a month.

WHY CAN'T THE DUMP BE POSTED TO ARCHIVE.ORG ONE MORE TIME?

There is no rational reason given as to why to delay the July dump. There are no blockers preventing the company from continuing the existing process until the new process is ready.

Stack Overflow has had plenty of time

As evidenced by the Data Dump's recent history, the company has had plenty of time to pursue changes. It certainly seems like you set an arbitrary deadline, missed it, and are now going to cause arbitrary delays because YOU did not prioritize the work well enough to meet your own deadline. I won't belabor why that's problematic.

If you want to build good will with the community, I'd suggest that the data dump promised to the community by July 31, 2024 still be delivered by that deadline, regardless of the state of the new process you want (but do not need) to use.

Stack Overflow will be violating the BY-SA license

Policing commercial use goes against the Creative Commons BY-SA license

The Creative Commons license is very clear (emphasis mine).

Share — copy and redistribute the material in any medium or format for any purpose, even commercially.

Ethical AI is a great talking point, but I'm not sure of the ethics behind preventing commercial use of something when it is legally, explicitly available for commercial use. The freedom to use it for commercial purposes obviously comes with the need to follow all license terms. The Creative Commons license explicitly says this:

The licensor cannot revoke these freedoms as long as you follow the license terms.

In essence, Stack Overflow is limited to ensuring that downstream users provide Attribution, and that the data continues to be shared under the same open license. Again, quoting from the BY-SA license:

No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

For folks reading along, this last quote from the license does NOT disallow requiring a username/password to limit access, but does prohibit any sort of DRM, "watermarking", or other means to limit use.

This license quote DOES prohibit the checkbox included in the data download mockup. By requiring a user to "agree that [they] will use this file for non-commercial use...and [they] will not transfer it without permission..." Stack Overflow is violating the license terms. Period.

A breach of the license terms results in automatic termination

From the Creative Commons FAQ:

All CC licenses are non-exclusive: creators and owners can enter into additional, different licensing arrangements for the same material at any time (often referred to as “dual-licensing” or “multi-licensing”). However, CC licenses are not revocable once granted unless there has been a breach, and even then the license is terminated only for the breaching licensee.

If Stack Overflow violates the CC BY-SA license, the users who created the content (i.e., "us", not the Company) can terminate the license granted to the Company. This could be DEVASTATING to the community. Particularly on smaller SE sites, a small set of users forcefully revoking Stack Overflow's ability to use the data under CC BY-SA could set a site back years.

I would be excited to see the Company helping to enforce BY attribution and SA share-alike licensing by others on the internet. However, it is incredibly disheartening for individual contributors to see the Company being the potential breacher, rather than defender.

The data dump is licensed under CC BY-SA explicitly.

Quoting from the Stack Overflow Terms of Service:

From time to time, Stack Overflow may make available compilations of all the Subscriber Content on the public Network (the “Creative Commons Data Dump”). The Creative Commons Data Dump is licensed under the CC BY-SA license. By downloading the Creative Commons Data Dump, you agree to be bound by the terms of that license.

Users grant Stack Overflow the CC BY-SA license. Stack Overflow uses that license to "remix" the individual posts into the compiled data dump, and appropriately covers the entire derivative work under the same CC BY-SA license. By requiring any individual to agree not to use it for commercial purposes, Stack Overflow would be violating the "no additional restrictions" clause, and be subject to automatic revocation of the CC BY-SA license from the grantor (users, authors, content creators).

"We would really rather users do not upload the file to archive.org or similar data pile sites"

I can appreciate that the company might rather this not happen. But unfortunately, the CC BY-SA license means that the Company can't restrict this. Stack Overflow could try changing the actual format, and licensing that anthology format differently, and place restrictions on that new archive product, but the content itself will always be free to be uploaded to someplace like Archive.org.

"When organizations are able to skip out on their obligations to contribute back..."

Unfortunately, this is not an obligation that is covered by the CC BY-SA license. And even more unfortunately, adding this restriction on top of the BY-SA license would be a breach of that original license, and thus Stack Overflow would be the entity in legal hot water, not the downstream users that Stack Overflow is trying to police.

Let's assume Stack Overflow proceeds...

even in a unique format to circumvent CC BY-SA...

Because many of us are technologists and software developers, I can almost guarantee that someone will create a process that:

  • downloads the Data Dump for individual, hobby use
  • ingests the data dump and reformats it as XML if necessary
  • uploads the new data in a backwards-compatible format to archive.org
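
To make the middle step concrete, here is a minimal sketch of what such a conversion might look like. It rests on assumptions: the new dump format is unknown, so the JSON-lines input here is purely hypothetical, and the field names are illustrative stand-ins modeled on the one-row-per-record XML layout of the existing dumps. The download and upload steps are left out.

```python
import json
import xml.etree.ElementTree as ET


def records_to_dump_xml(json_lines: str, root_tag: str = "posts") -> str:
    """Convert JSON-lines records into <row .../> elements under one root.

    The JSON-lines input format is a hypothetical stand-in for whatever
    format a future dump might use; it is not a documented format.
    """
    root = ET.Element(root_tag)
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Each record becomes one <row>; its fields become XML attributes.
        attrs = {key: str(value) for key, value in record.items() if value is not None}
        ET.SubElement(root, "row", attrs)
    return ET.tostring(root, encoding="unicode")


# Illustrative sample data (field names are assumptions, not a schema).
sample = "\n".join([
    json.dumps({"Id": 1, "PostTypeId": 1, "Body": "<p>Question text</p>",
                "ContentLicense": "CC BY-SA 4.0"}),
    json.dumps({"Id": 2, "PostTypeId": 2, "ParentId": 1, "Body": "<p>Answer text</p>",
                "ContentLicense": "CC BY-SA 4.0"}),
])
xml_text = records_to_dump_xml(sample)
print(xml_text)
```

The point is only that this kind of reformatting is a few dozen lines of work for anyone motivated to do it, so a format change alone is unlikely to keep the content off archive.org.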

Stop gaslighting me

It’s important to say that when you breach the agreement that you make when downloading the dumps file, we do have the option to decline to provide you with future versions of the data dumps. But we really don’t want to have to do that.

  • User-generated content is licensed under CC BY-SA to the entire world.
  • Stack Overflow compiles those CC BY-SA licensed creations into a data dump, and then licenses the product that is the data dump under the same license.
  • The "Creative Commons Data Dump," being licensed under the CC BY-SA license, permits any use, including commercial, so long as derivative works honor the BY and the SA: "Attribution" and "Share Alike" (continued BY-SA licensing).
  • By attempting to enforce a non-commercial limitation on a BY-SA licensed creation, Stack Overflow is breaching the agreement with the community.

It’s important to say that when you breach the agreement that you make when creating the dumps file, we do have the option to revoke the license we granted when we posted content on the site. But we really don’t want to have to do that.

Dual licensing

Nearly everything I said above has a big caveat... At the time of posting, users granted the entire world a CC BY-SA license, and additionally granted a second license to Stack Overflow (emphasis mine):

You agree that any and all content, including without limitation any and all text, graphics, logos, tools, photographs, images, illustrations, software or source code, audio and video, animations, and product feedback (collectively, “Content”) that you provide to the public Network (collectively, “Subscriber Content”), is perpetually and irrevocably licensed to Stack Overflow on a worldwide, royalty-free, non-exclusive basis pursuant to Creative Commons licensing....and you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit such Subscriber Content, even if such Subscriber Content has been contributed and subsequently removed by you

Stack Overflow could use their perpetual, irrevocable license to do anything they want with the data. Stack Overflow doesn't have to release the "Creative Commons Data Dump" under the CC BY-SA license -- except that it does. Stack Overflow could change the TOS to stop licensing the Data Dump under CC BY-SA, and then add any restrictions they like. This would essentially be a proprietary "box" filled with freely reusable data--folks would be bound by the Company's terms for the "box", but they could take the CC BY-SA contents out of the box, throw away the box, and use the contents of the box in any way that meets the terms of the CC BY-SA license (including putting that data into a CC BY-SA box).

However, at the end of the day, everyone in the world can use the post content however they like, so long as they continue to follow the CC BY-SA license restrictions, regardless of how they access that data.

My promise to Stack Overflow

I intend to vigorously defend my rights as a content creator on the Stack Exchange Network. I will ensure that my content (which I licensed to the entire world under CC BY-SA) continues to be used properly according to the license terms. If/when someone breaches the license terms, I will revoke that license and defend my content's copyright.