News Linux distros ban 'tainted' AI-generated code — NetBSD and Gentoo lead the charge on forbidding AI-written code

bit_user

Titan
Ambassador
The article said:
For the most part, though, it seems like Debian is unwilling to put its foot down and outright ban generative AI from their project like Gentoo and NetBSD, which could have interesting long-term implications if Debian winds up proving the dangers of letting an AI write an OS.
Debian is a distro, not an OS. Anything they upstream to the Linux kernel would be subject to the Linux Foundation's policy. While distros sometimes maintain their own kernel patches, these tend to be fairly small and are probably things they pull from an upstream branch or that they subsequently try to upstream themselves. The main things Debian maintains are their packaging + related tools + packaging repos.

If they had a blanket "no AI content" policy, then it could have implications for what sorts of software packages could be included in their package repos. Instead, the post linked by the article said:

"Apparently we are far from a consensus on an official Debian position regarding the use of generative AI as a whole in the project. We're therefore content to use the resources we have and let each team handle the content using their own criteria"
That's really not saying much, especially if you don't know what all the teams are and their individual policies.
 

Findecanor

Distinguished
Apr 7, 2015
292
196
18,860
NetBSD is not a Linux distribution. It is a different operating system: a Unix with a long lineage. It was forked from 386BSD in 1993.

Linux is a Unix clone... Actually "Linux" is only a kernel, but a distribution also contains a "userland" of libraries and programs to make it useful: often from GNU (hence the common "GNU/Linux" moniker).
NetBSD includes its own userland.
 

bit_user

Titan
Ambassador
NetBSD is not a Linux distribution. It is a different operating system: a Unix with a long lineage. It was forked from 386BSD in 1993.

Linux is a Unix clone... Actually "Linux" is only a kernel,
If we're going to be pedantic, then Linux is not a UNIX clone! Linux inherits heavily from both UNIX and BSD, but has since developed many of its own innovations and APIs. At this point, the distinction between UNIX and BSD is pretty pointless, and Linux is pretty far beyond either. Linux is simply Linux.

a distribution also contains a "userland" of libraries and programs to make it useful: often from GNU (hence the common "GNU/Linux" moniker).
GNU contributes some core utilities, GCC, and glibc. By this point, there are non-GNU versions of pretty much all the GNU stuff, should anyone want a totally GNU-free Linux distro.

Also, GNU has their own kernel called Hurd, which has been limping along for like 30 years and, I think, is still only in a semi-working state.
 
Meanwhile, a plethora of security vulnerabilities are patched every month by humans fixing bugs and critical issues in code created by humans, and those patches can, and often do, cause other issues, from broken functionality to a bricked device.
 

CmdrShepard

Prominent
Dec 18, 2023
522
392
760
Ban or just covering themselves. When AI-generated code is submitted, it is the submitter who did the "bad thing", not them.
If they are the ones allowing the commit it is their fault if the tainted code ends up in their codebase.

The decision is sensible -- all code generated by AI is tainted because you can't prove it wasn't trained on copyrighted code. Better to be safe than sorry.
 

CmdrShepard

Prominent
Dec 18, 2023
522
392
760
Meanwhile, a plethora of security vulnerabilities are patched every month by humans fixing bugs and critical issues in code created by humans, and those patches can, and often do, cause other issues, from broken functionality to a bricked device.
This is not about the quality but about the legality of "AI" contributions. If a model was trained on copyrighted code and/or code which has a non-permissive license incompatible with whatever they need, then it's not legal to include it.
 

thisisaname

Distinguished
Feb 6, 2009
889
497
19,260
If they are the ones allowing the commit it is their fault if the tainted code ends up in their codebase.

The decision is sensible -- all code generated by AI is tainted because you can't prove it wasn't trained on copyrighted code. Better to be safe than sorry.
The problem is how do you spot this AI-generated code? All this ban does is tell people not to commit AI-generated code. It does not say how they will stop people from committing AI-generated code.
If AI-tainted code is submitted and later found out, they will blame the person who submitted it, with the line that they were told not to do it.
 
  • Like
Reactions: slightnitpick
May 19, 2024
2
0
10
If they are the ones allowing the commit it is their fault if the tainted code ends up in their codebase.

The decision is sensible -- all code generated by AI is tainted because you can't prove it wasn't trained on copyrighted code. Better to be safe than sorry.
At the same time, whatever test you apply to AI code you should also apply to human code, because it could also be tainted by the human's training on copyrighted code.
 

CmdrShepard

Prominent
Dec 18, 2023
522
392
760
At the same time, whatever test you apply to AI code you should also apply to human code, because it could also be tainted by the human's training on copyrighted code.
You people still don't get the difference between human and "AI", do you?

I'll say it once more -- it's the scale.

No human developer in their lifetime can be exposed to the same amount of copyrighted code as the AI models were during training.

Therefore, the risk of a human developer introducing copyrighted code is several orders of magnitude lower, especially given that human developers working on such projects will make a conscious effort to avoid it: unlike an LLM, they understand the concept of copyrighted code, while an LLM just spits out statistical probabilities, often without the ability to reference sources.
 
May 19, 2024
2
0
10
You people still don't get the difference between human and "AI", do you?

I'll say it once more -- it's the scale.

No human developer in their lifetime can be exposed to the same amount of copyrighted code as the AI models were during training.

Therefore, the risk of a human developer introducing copyrighted code is several orders of magnitude lower, especially given that human developers working on such projects will make a conscious effort to avoid it: unlike an LLM, they understand the concept of copyrighted code, while an LLM just spits out statistical probabilities, often without the ability to reference sources.
Scale doesn't matter. What matters is that if you judge one code source on copyright and license issues, you should judge all code sources on copyright and license issues. Just as they have plagiarism detectors for university papers, we should have them for all code submitted to all projects.

You argue that the scale of exposure matters, but I say it doesn't. Suppose someone lived on a desert island and never saw a single YouTube video, and then we gave them a camera and had them create a YouTube Short. If the work they generated collided with the copyright of somebody else's work, would we give them a pass? No, we wouldn't. They would still be guilty of copyright violation. In copyright law, intent and lack of exposure don't matter. All that matters is that the creator substantially replicated the original work.

However, I believe copyright violations are inherent in code. There is no way we can avoid them. We have a number of tools that define good code, design patterns, and code idioms, and all of these restrictions on what makes good code narrow the range of what can be generated.

The above brings us back to the need for plagiarism detectors. How can you know whether you're violating copyright if you can't test for it?
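To make concrete what such a detector might do, here is a minimal sketch: normalize a snippet into a token stream, take k-gram fingerprints, and compare the fingerprint sets. The k value, the normalization rules, and the example snippets are all arbitrary illustrative choices on my part; real tools (MOSS-style winnowing, for instance) are far more sophisticated.

import re

def normalize(source: str) -> list[str]:
    # Crude normalization: map every identifier/keyword to ID and every
    # number to NUM so that renaming variables doesn't hide a match.
    out = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]", source):
        if re.fullmatch(r"[A-Za-z_]\w*", tok):
            out.append("ID")
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return out

def fingerprints(source: str, k: int = 5) -> set:
    # All k-grams (shingles) of the normalized token stream.
    toks = normalize(source)
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def similarity(a: str, b: str, k: int = 5) -> float:
    # Jaccard similarity of the two fingerprint sets, from 0.0 to 1.0.
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    return len(fa & fb) / len(fa | fb) if fa and fb else 0.0

submitted = "for (int i = 0; i < n; i++) { total += values[i]; }"
reference = "for (int j = 0; j < count; j++) { sum += data[j]; }"
print(f"similarity: {similarity(submitted, reference):.2f}")  # 1.00 here: only the names differ

Even something this crude shows the problem isn't hopeless: structural matches survive variable renaming, and a project could at least flag suspiciously high scores for human review.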

Also, I suggest looking at the claims of plagiarism in the music, publishing, and academic-paper industries. There are numerous plagiarism cases: some intentional, some accidental, some editing mistakes such as forgetting to put in quotations and bibliographical references. Again, I think many of these plagiarism claims were unavoidable because of the limited ways of expressing music or ideas using a highly restricted set of tools.

In over 20 years of programming, I've always assumed my code violates copyright on somebody else's code somewhere, just because, as I said above, with the limited expressiveness of coding it's highly likely there will be a copyright violation.

When I solve a problem using my junior programmers, ChatGPT and Copilot, I give them the specs and descriptions of what needs to be done the same way I would with humans. I shape their code by telling them what they did wrong and almost never change their code directly. Just like with other junior programmers, I need to keep a careful eye on the code and make sure they generate the right test cases, but at the end of the day, I get code that looks like something I would've generated myself.

I think another reason people are getting cranky is that LLM systems creating code is another example of how humans are not special. History is full of examples of some other creature doing something we thought was uniquely human (tool use, language use), and somebody getting their knickers in a twist. I suggest giving up on the idea that humans are the center of any universe. We are just a collection of cells bumbling around, occasionally doing something good.

The need for humans in software development will drop significantly in the next 20 years. Start planning your exit path now. At this time, it looks like there will be a core of people generating new ideas, expressions, etc., to be fed into AI systems and made available to everybody. However, since the future tends to arrive too soon and in the wrong order, we will have AI systems that generate new ideas independently and don't need humans to solve problems.