News Stack Overflow bans users en masse for rebelling against OpenAI partnership — users banned for deleting answers to prevent them being used to train...

JamesJones44

Reputable
Jan 22, 2021
704
649
5,760
There have been lawsuits over copying and pasting from StackOverflow. Granted, those cases involved someone copying and pasting code from a product with a non-permissive license to StackOverflow, and another user then taking that code from StackOverflow and pasting it into their own product, but it seems like the same risks would apply to LLMs like Gemini and ChatGPT.

The terms about how the content can be used by StackOverflow and the various other Stack Exchange sites are fairly clear that the content is covered under Creative Commons. I would be curious to know whether ownership in the way the posters want could be granted. It seems unlikely to me, but I would need a lawyer to answer that one. However, whether the content is scraped from the site or pulled from a backend database, I don't think it really matters under Creative Commons: the data is free to be seen (NOTE: I'm saying seen, not used. Used depends on where that code snippet came from and the ability to prove you didn't copy it from a licensed product).
 

bit_user

Polypheme
Ambassador
I don't really see why people are so outraged over their public posts being used like this. When you post something on a public forum, especially a privately-owned one, you're basically giving it away, along with all control over it.

It would be different if these were presumed private communications, but there was clearly no pretense of that ever being the case.
 

bit_user

Polypheme
Ambassador
All of your data is now capital for someone else because it just so happens to be sitting on their server.
It's sitting there because you put it there!

This isn't like a search or advertising giant collecting stats on people's web usage patterns, or credit reporting agencies collecting data on your spending habits, this is a site you voluntarily made an account on and posted to! Furthermore, they're not asserting control over anything of yours you didn't actively upload to them!
 

CmdrShepard

Prominent
Dec 18, 2023
262
217
560
I don't really see why people are so outraged over their public posts being used like this.
Because the original purpose of your posts was to benefit other fellow human beings, not to enrich some soulless corporate entity?
When you post something on a public forum, especially a privately-owned one, you're basically giving it away, along with all control over it.
If I am not mistaken, the established law is that you have and retain copyright over what you created and you aren't automatically giving that copyright up by publishing.

To elaborate a bit further why people are enraged:

When a developer posts an answer on SO they are giving back to the community for all those times they got an answer from there.

Nobody posting code answers on SO ever thought to include a usage license -- it now looks like they should have, but it's easy to be smart after the fact.

I don't mind SO monetizing my answers by putting ads next to them like other public platforms do.

What I do mind is a 3rd party paying a pittance (if that) for those answers in bulk and being able to use them to make a giant recurring profit by selling a service based on something I did. Even if they are paying SO, it's still theft -- they are stealing from me and using it not for one-off use like other fellow developers, but for perpetual revenue generation.

I did it for free for other people who did it for free for me. OpenAI never did anything for me for free, not to mention their name is the biggest lie ever, as neither their code nor their models are open-source.
 

bit_user

Polypheme
Ambassador
Because the original purpose of your posts was to benefit other fellow human beings, not to enrich some soulless corporate entity?
Regardless of their intent, people posting on there should've been smart enough to know what it means to use a platform like that.

If I am not mistaken, the established law is that you have and retain copyright over what you created and you aren't automatically giving that copyright up by publishing.
According to what others have said, in the ToS agreement you license your content to them under Creative Commons (?), which means giving up most of the control over what happens after that.

If you indeed retain copyright, then you're free to dual-license your content to another party or in another context, but you can't retroactively revoke the Creative Commons license on stuff you published, after it's already out there. Anyone who managed to get a copy of the content, while it was covered by Creative Commons, can continue to use it under those terms. And that includes the platform, themselves.

To elaborate a bit further why people are enraged:
Sorry, in a colloquial turn of phrase, I happened to have mischaracterized my bewilderment. I actually do understand why they're upset. I guess what I find surprising is how many didn't appreciate the implications of posting on the internet, not even to speak of a privately-owned platform. Same with Reddit. Also Github.

And don't think we're immune from that on here, either. In fact, the main concern I have with the idea of someone using these forum posts to train AI is not that it'll misappropriate my knowledge, but feeding it all the other garbage that so many people post on here! Not anyone in this thread, of course - just those other posters!
😅

What I do mind is a 3rd party paying a pittance (if that) for those answers in bulk and being able to use them to make a giant recurring profit by selling a service based on something I did.
Eh, so very little of what's in our brains is truly original. The vast majority of it we're regurgitating, just like AI. It just so happens that AI is more efficient at it than we are. Before long, it'll be better, too.
 

Findecanor

Distinguished
Apr 7, 2015
259
179
18,860
If I am not mistaken, the established law is that you have and retain copyright over what you created and you aren't automatically giving that copyright up by publishing.
IANAL, but I believe that for web sites with user-generated content, the default law most commonly practised is that when no other ULA is in place, the user-generated content is owned by both the poster and the web site it was posted on. Each owns their own copy.
In other words, they can do whatever they want with their own copy of what you have posted.

If the content is under a Creative Commons license however, ... then any copy must retain attribution to the original poster.
Here is where I can see a legal conflict: feeding content into an LLM strips away the attribution.
Content can be fed into an LLM without issue only if the content is owned by them, is explicitly Public Domain, or if they somehow found a novel way to weave attribution into the LLM (which I find unlikely).

You'd often see the argument "But it is learning. Humans are allowed to learn stuff freely. This is just machines doing it".
That argument does not hold up:
First, humans and LLMs learn differently. Humans create mental models of what they ingest and feed those interpretations into their neural networks. LLMs feed the raw text. It is actually more difficult for humans to learn raw text.
Second, even if humans do learn raw text, it is still plagiarism to publish copyrighted text without attribution, even if you recite it from memory. LLMs will do that given the right parameters, but chatbots typically have measures in place to detect when the LLM does it and stop the output from happening.
 

bit_user

Polypheme
Ambassador
First, humans and LLMs learn differently. Humans create mental models of what they ingest and feed those interpretations into their neural networks. LLMs feed the raw text. It is actually more difficult for humans to learn raw text.
That's a false distinction, IMO. The goal of LLMs is also to learn mental models, not to do rote memorization and regurgitation.
 

35below0

Commendable
Jan 3, 2024
1,145
511
1,590
It's feudalism, but with a techno touch to it.
All of your data is now capital for someone else because it just so happens to be sitting on their server.
Feudalism rested on land ownership, not capital. Capital began to matter with the industrial revolution.

Land owners HATED capitalists because they didn't inherit land (power) or have it given to them as a favor. Power shifted from land ownership to money and capitalists.

Data is not capital. Does not fit any definition of capital at all. Marx could not or would not recognize it as value. His theory was completely different. It was flawed because he was trying to treat history (incomplete, selective) using scientific methods he devised himself (also selective).
His one gift is exposing the inhumanity of unrestricted capitalism. The other gift is a false prophecy that would lead to an even more inhuman place.

Worth remembering Marx was an aristo himself. And that he had no interest in the working class succeeding in freeing itself, but rather in re-establishing a benevolent leadership to run the dictatorship for them.
Why did he not suggest abolishing classes altogether?

There. Some context for the data metaphor.


If you don't want your data to be used, don't use "free" services that harvest your data as their business model.
You can pay for email, you know.
https://www.fastmail.com/?direct=1

I am aware that it's very hard to completely avoid the BS practices of many businesses that harvest data. It's worth trying, if privacy or data management really matters.

I'm also aware that it's lol to pay for email, but don't bleat about Google if you use gmail
 

Notton

Prominent
Dec 29, 2023
380
339
560
It's sitting there because you put it there!

This isn't like a search or advertising giant collecting stats on people's web usage patterns, or credit reporting agencies collecting data on your spending habits, this is a site you voluntarily made an account on and posted to! Furthermore, they're not asserting control over anything of yours you didn't actively upload to them!
Yes, because web platforms have been famously free to operate and maintain, especially when they grow to the size of stackoverflow.
 

USAFRet

Titan
Moderator
For instance, Law.se:
https://law.meta.stackexchange.com/questions/1701/policy-chatgpt-is-banned?cb=1

"Posting GPT answers on Law Stack Exchange is temporarily banned.

This policy has been adopted from Stack Overflow's current stance to give us time to work out our own. Most of the arguments for and against the use of AI generated content for coding questions are equally applicable to legal questions."

etc, etc, etc....


"Users who disagree with having their content scraped by ChatGPT are particularly outraged by Stack Overflow's rapid flip-flop on its policy concerning generative AI. For years, the site had a standing policy that prevented the use of generative AI in writing or rewording any questions or answers posted. Moderators were allowed and encouraged to use AI-detection software when reviewing posts."
 

Li Ken-un

Distinguished
May 25, 2014
101
74
18,660
We had a great couple decades of information being freely shared. One might wonder whether this will reverse the trend and take us towards a time when access to knowledge was closely tied to your social circles and who you had connections to. As long as there is an incentive to hide content from the public, this stuff will move to private, curated members-only forums.
 

jackt

Distinguished
Feb 19, 2011
188
18
18,685
I tried to delete, and some it won't let me delete because others made an effort... and max 5 per day!
Why do they need a partnership if the code is public anyway?
 

JamesJones44

Reputable
Jan 22, 2021
704
649
5,760
I tried to delete, and some it won't let me delete because others made an effort... and max 5 per day!
Why do they need a partnership if the code is public anyway?
Scraping websites is a difficult and ever-changing problem. It also makes detecting and getting changes difficult, and if you get too fancy and try to scrape too much content at once, you could end up overwhelming the service.

Throwing some money at Prosus & Naspers (owners of StackOverflow) to get direct access to the data in their database solves all of those problems.
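
As a rough illustration of the overload concern with scraping, here is a minimal sketch of a throttled fetch loop in Python. Everything in it is hypothetical: `fetch_politely`, the `fetch` callable you pass in, and the one-second default interval are assumptions for the example, not anything StackOverflow or OpenAI is known to use.

```python
import time


def fetch_politely(urls, fetch, min_interval=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Yield (url, page) pairs, waiting at least min_interval seconds
    between the start of one request and the start of the next.

    `fetch` is any callable that takes a URL and returns its content;
    it is injected so the throttling logic stays independent of the
    HTTP library (and can be tried out without touching the network).
    """
    last_request = None
    for url in urls:
        if last_request is not None:
            # How much of the minimum interval is still left to wait?
            remaining = min_interval - (clock() - last_request)
            if remaining > 0:
                sleep(remaining)
        last_request = clock()
        yield url, fetch(url)


if __name__ == "__main__":
    # Stand-in for a real HTTP call, so the sketch runs offline.
    def fake_fetch(url):
        return f"<html>content of {url}</html>"

    for url, page in fetch_politely(["page-1", "page-2"], fake_fetch,
                                    min_interval=0.1):
        print(url, len(page))
```

Even with a throttle like this, the scraper still has to re-crawl pages to notice edits and deletions, which is part of why direct database access is the simpler route.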
 

CmdrShepard

Prominent
Dec 18, 2023
262
217
560
Oh, and Sam Altman is an *.
Regardless of their intent, people posting on there should've been smart enough to know what it means to use a platform like that.
There was a time when even mobsters had some rules and code of honor you know.
Anyone who managed to get a copy of the content, while it was covered by Creative Commons, can continue to use it under those terms. And that includes the platform, themselves.
While it is publicly posted on a platform itself it is clearly attributed to me. Once the AI model sucks it in it is no longer attributed to me -- it's "owned" by the model creator. How long before they just outright shut down SO and make you pay for OpenAI service to get the same answers you had there?
Eh, so very little of what's in our brains is truly original. The vast majority of it we're regurgitating, just like AI. It just so happens that AI is more efficient at it than we are. Before long, it'll be better, too.
Not even close because of sheer scale. You could never learn what an LLM can memorize even if you lived 1,000 years.

On the other hand, an LLM will never become an AGI -- if you had played with freely available models locally using Ollama, you would've realized that already. Just read the parameter descriptions for chat prompts:

num_predict -- Maximum number of tokens to predict when generating text
top_p -- Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words. This way, the size of the set of words (a.k.a. the number of words in the set) can dynamically increase and decrease according to the next word’s probability distribution.
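
For anyone who hasn't seen it spelled out, the top_p description above can be sketched in a few lines of Python. This is only an illustration, not Ollama's actual code: the function names `top_p_filter` and `sample_top_p` and the toy probabilities are made up for the example, and it keeps words until the cumulative probability reaches p (>=) rather than strictly exceeding it.

```python
import random


def top_p_filter(probs, p):
    """Keep the smallest set of most-likely words whose cumulative
    probability reaches p, then rescale them to sum to 1.

    `probs` maps each candidate word to its probability.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Most likely words first.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Redistribute the probability mass among the kept words.
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


def sample_top_p(probs, p, rng=random):
    """Draw one word at random from the top-p set."""
    nucleus = top_p_filter(probs, p)
    words = list(nucleus)
    weights = [nucleus[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Toy next-word distribution, invented for the example.
    toy = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
    print(top_p_filter(toy, 0.9))
    print(sample_top_p(toy, 0.9))
```

With these toy numbers and p = 0.9, the kept set is "the", "a" and "cat"; "dog" is dropped and the remaining three probabilities are rescaled to sum to 1.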

So it's statistical modelling all the way down. That's not how the human brain works, and it is wrong to attribute any intelligence to it.
 

CmdrShepard

Prominent
Dec 18, 2023
262
217
560
Feudalism rested on land ownership, not capital. Capital began to matter with the industrial revolution.
He probably meant feudalism in the sense of a few chosen rich people being born into having everything (money, status, power) while the rest of us work our asses off at tedious menial jobs and wait for their tax collectors to come and collect whatever they want whenever they want.
If you don't want your data to be used, don't use "free" services that harvest your data as their business model.
Company I work for is paying for Windows Enterprise, Azure, etc. That doesn't stop Microsoft from harvesting our data. As a matter of fact, you have to go way out of your way with GPO to lock it down and stay ever vigilant because they just keep adding new stuff to siphon your data out with every update.

The moral of the story is -- even if you pay it's not a guarantee you won't be screwed. There's no honor today and nothing is sacrosanct to those corporate MBA bastards and those vulture capitalists.
 

Rob1C

Distinguished
Jun 2, 2016
93
13
18,635
If the content is under a Creative Commons license however, ... then any copy must retain attribution to the original poster.
Here is where I can see a legal conflict: feeding content into an LLM strips away the attribution.
Content can be fed into an LLM without issue only if the content is owned by them, is explicitly Public Domain, or if they somehow found a novel way to weave attribution into the LLM (which I find unlikely).

It's part of the deal that the answer attributes the OP, see: https://stackoverflow.blog/2024/02/29/defining-socially-responsible-ai-how-we-select-api-partners/

It's like virtual Soylent Green, with the bits attributed.

While it is an about-face on prior policy, AI has improved a lot in the past few years; without moving forward (responsibly) there won't come a day when it works perfectly and without fault. If one day you can simply ask a question and get a correct AI-generated answer, then Stack Exchange will have driven its own nail in.
 

35below0

Commendable
Jan 3, 2024
1,145
511
1,590
Company I work for is paying for Windows Enterprise, Azure, etc. That doesn't stop Microsoft from harvesting our data. As a matter of fact, you have to go way out of your way with GPO to lock it down and stay ever vigilant because they just keep adding new stuff to siphon your data out with every update.
I meant non-corporate users, but I take your point.