News Generative AI Goes 'MAD' When Trained on AI-Created Data Over Five Times

That means you could poison internet content with image previews and data that carries the traits of AI-generated output, so that AIs go MAD when crawling it. So now AI owners will need to start validating the data their AIs train on. What worries me is that we hold them to no standard when it comes to the data they select for training, and that the trustworthiness of the output is very dubious, image and code generation aside.
 
This is by far the number one issue all future LLMs will run into: the lack of "clean" data for training. What's more, when you have an arms race, it's very possible for an opponent to introduce subtle poison into the data in order to manipulate the training outcome, especially if an LLM is being trained essentially by just crawling the internet on a 10Gbps link.
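Even a cheap validation pass would help. Here's a toy sketch of what I mean (entirely my own invention, not anything a real pipeline ships): hash-based dedup plus a crude repetition heuristic before a crawled page is allowed into the training set.

```python
import hashlib

seen_hashes = set()

def is_exact_duplicate(text: str) -> bool:
    # Exact-duplicate check via hashing; a real pipeline would use
    # MinHash/SimHash to catch near-duplicates too.
    h = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False

def looks_degenerate(text: str, min_unique_ratio: float = 0.3) -> bool:
    # Heavily repetitive text (common in content-farm and model-generated
    # spam) tends to have a low unique-word ratio.
    words = text.lower().split()
    return len(words) > 50 and len(set(words)) / len(words) < min_unique_ratio

def keep_for_training(text: str) -> bool:
    return not (is_exact_duplicate(text) or looks_degenerate(text))
```

None of this would stop a determined adversary, but it shows how little validation it takes to catch the laziest junk, and how much more it would take to catch deliberate poison.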
 
" ...starts outputting results that are more aligned with the mean representation of data, much like the snake devouring its own tail."

... I never knew that the ouroboros was known for outputting results that align with the mean representation of data. I thought it represented infinity.
 
Idk if this is inappropriate for comments, so if it is, please let me know.

But the first thing that came to mind is that we've found out AI-to-AI intimate contact can lead to contracting Artificial Syphilis (syphilis being the STD that makes humans go mad in the brain if left untreated).

More like inbreeding.

And we are already on *at least* the second generation of AI training AI. The overwhelming majority of online text content has been AI-generated for at least a few years now, pretty much since content farms realized the only way to win Google SEO was to grab "common knowledge" information (which is always unsourced and often incorrect), then repeatedly restate and repackage those bullet points into the different ways people might phrase a full question to a smart speaker or digital assistant. This destroys the nuance of the information, to the point that it quickly loses meaning and eventually becomes wrong and contradictory.
It's why you can't find the info you're looking for on Google anymore. It's also why Bard is essentially unusable, and already showing signs of getting worse.
 
It's not surprising, really. Your model is only as good as your data.

I do find it interesting that, in the sample images, it seems to have latched onto JPEG-style image compression artifacts and treated them as if they're part of the underlying object.

But, one thing about the article @Francisco Alexandre Pires: the primary focus seems to be on image generators, and yet you seem to conflate them with LLMs:

"In essence, training an LLM on its own (or anothers') outputs creates a convergence effect on the data that composes the LLM itself. "

Yes, the phenomenon was found to apply to LLMs, but let's be clear: image generators aren't LLMs. For instance, Stable Diffusion is a latent diffusion model, as its name implies. The paper actually used an image generator called StyleGAN-2, which is a generative adversarial network.

Please take care not to treat LLM as a shorthand for advanced neural networks, except when it actually applies.
 
So would this be a case of "Mad AI" Disease?
Mad Cow Disease is caused by a specific defective protein (AKA a prion), which propagates by causing normally folded copies of the protein to misfold in the same way, so that yet more of them are produced. It got propagated through the cattle industry feeding slaughterhouse scraps to other cows. Even cooking the scraps isn't enough to break down the prions.

So, in a way, it's not completely off the mark. However, mere cow cannibalism isn't enough. You need to introduce the defective protein into the cycle at some point.
 
Mad Cow Disease is caused by a specific defective protein (AKA a prion), which propagates by causing normally folded copies of the protein to misfold in the same way, so that yet more of them are produced. It got propagated through the cattle industry feeding slaughterhouse scraps to other cows. Even cooking the scraps isn't enough to break down the prions.

So, in a way, it's not completely off the mark. However, mere cow cannibalism isn't enough. You need to introduce the defective protein into the cycle at some point.
Prions are occasionally found in the brain matter of almost all mammals, and they do not need to be introduced: they arise when specific mutations occur in protein-coding DNA, producing an abundance of beta-sheet secondary conformations that then influence the tertiary 3D conformation of the folded protein.

My university dissertation was on the propagation of human-derived prion disease in isolated tribes of Papua New Guinea practicing ritual funeral cannibalism. The effects on the human mind are fascinating and frightening at the same time. Autopsies of members who succumbed to the prion infestation revealed that the majority of the diverse proteins found in normal brain tissue had slowly been displaced by the misfolded prion over a period of 10-50 years, physically transforming the organ into a sponge-like material by the time the disease becomes fatal.

The fact that we have known about prion diseases since the 1920s yet are still unable to find a way to reverse or slow them down leads many of my peers and me to hypothesize a scenario where over-population, over-stressed food production, and high environmental pollution could produce the variables needed for a prion to become "virulent" via animal meat consumption. And the frightening kicker is that if we don't catch it in the livestock before the human infection becomes widespread, then the only way we would know a pandemic was afoot is when a sizable portion of the human population starts showing the mental and physiological symptoms 5-10 years after infection.
Sorry for the doom and gloom, but it is a very interesting subject to me. There is hope, though: recent testing of the same tribal populations shows that the human body, after 10 generations, may finally be adapting to the presence of prions by creating a stabilized variant which resists the infectious prion it is based on.
 
My university dissertation was on the propagation of human-derived prion disease in isolated tribes of Papua New Guinea practicing ritual funeral cannibalism.
🤯

...not that I hadn't heard of the practice (wasn't there also a tribe in the Andes?), but I mean it's the first post in more than 10k that I've made on here where I've mentioned BSE and you just happen to have done a dissertation on it.

if we don’t catch it in the livestock before the human infection becomes widespread, then the only way we would know a pandemic was afoot is when a sizable portion of the human population starts showing the mental and physiological symptoms 5-10 years after infection.
After "mad cow disease" (BSE) was properly understood, I avoided eating beef for about 2 years. Then, when it became apparent that only a relatively small subset of the human population was susceptible to CJD (the human equivalent) and the cattle industry had claimed to put safeguards in place, I slowly started eating it, again.

That was about 25 years ago. Fortunately, we now have more alternative sources of protein.
 
This does not make any sense... are there some sort of 'small errors' in the images, due to the AI generation, that make this happen? Have they tried printing out the images and then rescanning them into the AI to see if that solves the issue?
Has this issue happened with AI trained on actual human-handmade artwork?
Lots of questions here.
 
🤯

...not that I hadn't heard of the practice (wasn't there also a tribe in the Andes?), but I mean it's the first post in more than 10k that I've made on here where I've mentioned BSE and you just happen to have done a dissertation on it.


After "mad cow disease" (BSE) was properly understood, I avoided eating beef for about 2 years. Then, when it became apparent that only a relatively small subset of the human population was susceptible to CJD (the human equivalent) and the cattle industry had claimed to put safeguards in place, I slowly started eating it, again.

That was about 25 years ago. Fortunately, we now have more alternative sources of protein.
I must admit I was in my teens and didn’t care too much about Mad Cow lol. It’s funny how growing up changes your priorities from “don’t care” to “let me dedicate an entire year to research it”.

I specialize in biochemistry and molecular biology; my prized peer-reviewed publication was on the effects of mutated UNC-33-like transporter proteins causing neuronal outgrowth defects (i.e., inhibition of the neuron-to-neuron connection process that normally leads to memories being established) in Alzheimer's patients, due to mislocalization and buildup of outgrowth-aiding proteins within the neuron cell instead of at the cell membrane.

Funnily enough, the Alzheimer's drug that was in the news very recently targets and reduces said protein buildup and, although it does not fix the mislocalization of outgrowth proteins, it shows efficacy in slowing the onset of Alzheimer's when caught early.
 
I must admit I was in my teens and didn’t care too much about Mad Cow lol. It’s funny how growing up changes your priorities from “don’t care” to “let me dedicate an entire year to research it”.
For me, what set it apart was understanding its (to me) novel mechanics and the risks to human health. For instance, I never worried much about hoof & mouth disease, even though a major outbreak is potentially every bit as devastating to the cattle industry.
 
This makes perfect sense. Deep learning, machine learning, AI, whatever you call it, is just the application of various optimization methods to find the best "answer" on the multidimensional surface where the data lives. It's an iterative process, hence the training methodologies we employ. The data provided only reveals a small portion of that surface, so the minimization or maximization may settle at a local extremum while missing the true one.

Feeding the model data that is already skewed to treat the local extremum as the true "answer", a.k.a. the mean, just pushes you further from the true mean of the data. It's pure statistics: average over more and more samples of a random variable and the estimate converges to the mean with a standard deviation approaching 0. I.e., it is no longer a random variable, but a constant.

The only way to avoid this is to adjust/correct the data being fed to the model. That means work, and in the progression of AI development we keep finding simpler methods to train the AI. I'm sure some researcher will come up with a solution soon.
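If you want to see that collapse in action, here's a toy simulation (my own, nothing like the paper's setup, which used image models such as StyleGAN-2): fit a Gaussian to a dataset, sample a new "training set" from the fit, refit, and repeat.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=20)   # a small "real" dataset

for gen in range(201):
    mu, sigma = data.mean(), data.std()          # "train" on the current data
    if gen % 20 == 0:
        print(f"gen {gen:3d}: mean = {mu:+.3f}, std = {sigma:.4f}")
    # The next generation trains only on samples drawn from the previous fit.
    data = rng.normal(loc=mu, scale=sigma, size=20)
```

With small samples, the fitted standard deviation takes a downward-biased random walk, so over enough generations the distribution collapses toward a single point while the mean wanders away from the true value of 5.0.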
 
are there some sort of 'small errors' in the images, due to the AI generation, that make this happen?
Yes, I think so. The generated images look reasonably convincing, but they can deviate from an actual image in small but somewhat consistent ways. These errors are amplified over successive generations of networks each trained on the output of prior generations.

It's a little like the generation loss you get when you copy an analog recording of a song. With each copy, the high frequencies might get attenuated, some resonances are amplified, hiss and distortion increase.
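Here's a quick toy loop that reproduces that effect digitally (my own sketch, assuming Pillow and numpy are installed): re-encode the same image as JPEG over and over and measure how far it drifts from the original.

```python
import io

import numpy as np
from PIL import Image

# A synthetic noise image stands in for a photo.
rng = np.random.default_rng(0)
original = Image.fromarray(rng.integers(0, 256, (128, 128, 3), dtype=np.uint8))

img = original
for copy in range(1, 51):
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=75)     # one lossy "copy"
    buf.seek(0)
    img = Image.open(buf).convert("RGB")
    if copy % 10 == 0:
        drift = np.abs(np.asarray(img, dtype=float)
                       - np.asarray(original, dtype=float)).mean()
        print(f"copy {copy:2d}: mean per-pixel drift = {drift:.2f}")
```

At a fixed quality setting the drift mostly accumulates in the first few copies and then plateaus; generative retraining is worse, because each round injects fresh, correlated errors instead of settling into a fixed point.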

Has this issue happened with AI trained on actual human-handmade artwork?
I'd expect it should apply to any sort of data. How quickly the degradation occurs will depend on things like the sophistication of the model, the number of training samples & their quality, etc.

In spite of the dire predictions, I'm not really worried. I think people won't use the output from an image generator, if it looks too fake. The majority of content we see will probably either be real images or within the first couple generations. Plus, if an image looks too fake to us, then you ought to be able to train a model to detect such images and reject them from training sets.
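A minimal sketch of that filtering idea (my own toy, using scikit-learn, with made-up feature vectors standing in for whatever a real detector would extract from the images):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for per-image feature vectors (e.g., frequency-domain statistics);
# a real detector would use richer features or a CNN.
real_feats = rng.normal(0.0, 1.0, size=(500, 64))
fake_feats = rng.normal(0.3, 1.0, size=(500, 64))   # generated: shifted stats

X = np.vstack([real_feats, fake_feats])
y = np.array([0] * 500 + [1] * 500)                 # 0 = real, 1 = generated

clf = LogisticRegression(max_iter=1000).fit(X, y)

def keep_for_training(features: np.ndarray, threshold: float = 0.5) -> bool:
    """Reject a sample if the detector thinks it was generated."""
    return clf.predict_proba(features.reshape(1, -1))[0, 1] < threshold
```

Of course, that just moves the arms race into the detector, but it only has to be good enough to keep obviously fake images out of the training pool.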
 
This may be an issue for LLMs, but it's definitely a problem for generative imaging. Luckily, there's often an invisible watermark embedded in output images (Stable Diffusion's reference scripts add one, for instance) that can be used to cull future training sets.
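As a rough sketch (the "StableDiffusionV1" payload and 'dwtDct' method match Stable Diffusion's reference scripts and the invisible-watermark package they use; the rest is my own toy):

```python
import cv2
from imwatermark import WatermarkDecoder

# Stable Diffusion v1's reference scripts embed the bytes "StableDiffusionV1"
# (17 bytes = 136 bits) using the 'dwtDct' method.
SD_PAYLOAD = b"StableDiffusionV1"
decoder = WatermarkDecoder('bytes', len(SD_PAYLOAD) * 8)

def is_sd_watermarked(path: str) -> bool:
    """Return True if an image file carries the default SD invisible watermark."""
    bgr = cv2.imread(path)
    if bgr is None:
        return False   # unreadable file; treat as unwatermarked
    try:
        return decoder.decode(bgr, 'dwtDct') == SD_PAYLOAD
    except Exception:
        return False   # the decoder can choke on tiny or odd-sized images

# Culling a (hypothetical) candidate list before training:
# clean = [p for p in candidate_paths if not is_sd_watermarked(p)]
```

It only catches images from pipelines that left the default watermark on, but it's nearly free to run over a scraped dataset.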

That said, the real issue is they just didn't spend the time to validate the data that was being fed back into the training. There are often small imperfections in outputs, and if those are added to the dataset and retrained on, they become more prominent in the next batch. Rinse and repeat, and the dataset is ruined.

In practice, I do the exact opposite of what these researchers did. If my dataset contains an underrepresented detail that I want in the set, then I use prompts to generate images that bring that detail out. Then I inspect those outputs for flaws and/or crop the flaws out, and feed those extra images back into the next training round.
 
There is a growing pool of AI applications that can fix flaws and remove unwanted details in a more or less automated fashion. If those were added to the workflow, it would slow or eliminate the problem described in this article.
 
You always have such interesting contributions (in multiple articles); thank you for devoting your time to commenting and adding to the community.
You're too kind, but thank you! I'd recommend taking my comments with a grain of salt.

Also, thank you for bringing us so many interesting and informative articles! The effort you put into your work really shows.

I'm aware that the main focus isn't on LLMs; the reason I spun the article more towards them is that they're simply the most recognizable AI systems for the average user (more people have used ChatGPT than Midjourney), and it serves as a way to both increase interest and underline how relevant this "MADness" is. I'd hope I straddled the line between the focus and the spin in an acceptable way.
I was not so much concerned about this article, but more as a forward-looking suggestion.

In this AI revolution, I believe the more people there are with a richer and more accurate understanding of the technology, the better the chances are that we can navigate it successfully. I'm not very optimistic, but we won't succeed if we don't try.

To that end, have you guys considered putting together some AI explainers, to help level up the readership's understanding of how these technologies work, what their most interesting capabilities are, and how they've truly advanced beyond anything that came before?