“Model collapse” threatens to kill progress on generative AIs

Let's go, already!

How you can help: If you run a website and can filter traffic by user agent, get a list of the known AI scrapers' user agent strings and selectively redirect their requests to pre-generated AI slop. Regular visitors will see the real content, and the LLM scraper bots will scrape their own slop and, hopefully, train on it.

This would ideally become standardized among web servers with an option to easily block various automated aggregators.
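A minimal sketch of the user-agent filtering idea above, written as a plain WSGI middleware in Python. The token list and the decoy page are assumptions for illustration only: check each crawler operator's published documentation for current user-agent strings. For simplicity it serves the decoy body directly instead of issuing a redirect.

```python
# Illustrative user-agent substrings (an assumption, not an authoritative
# list); verify against each crawler operator's documentation.
AI_SCRAPER_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

# Placeholder decoy page; in practice this would be pre-generated filler.
DECOY_BODY = b"<html><body><p>Pre-generated filler text goes here.</p></body></html>"


def is_ai_scraper(user_agent, tokens=AI_SCRAPER_TOKENS):
    """Return True if any listed token appears in the User-Agent header."""
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in tokens)


def decoy_middleware(app, decoy_body=DECOY_BODY):
    """Wrap a WSGI app so matching scrapers get the decoy page instead."""
    def wrapped(environ, start_response):
        if is_ai_scraper(environ.get("HTTP_USER_AGENT")):
            start_response("200 OK", [
                ("Content-Type", "text/html; charset=utf-8"),
                ("Content-Length", str(len(decoy_body))),
            ])
            return [decoy_body]
        # Everyone else falls through to the real site.
        return app(environ, start_response)
    return wrapped
```

This only catches bots that announce themselves honestly. A scraper that sends a normal browser user agent passes straight through.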

Regardless, all of us combined are a grain of rice compared to the real meat and potatoes AI trains on - social media, public image storage, copyrighted media, etc. That includes all those sites with extensive privacy policies that are signing contracts to permit their content to be used for training.

Without laws (and I'm not sure I support anything in this regard yet), I do not see AI progress slowing. Clearly inbreeding AI models has a similar effect as in nature. Fortunately there is enough original digital content out there that this does not need to happen.
- Regardless, all of us combined are a grain of rice compared to the real meat and potatoes AI trains on
  
  Absolutely. It's more a matter of principle for me. Kind of like the digital equivalent of leaving fake Amazon packages full of dog poo out front to make porch pirates have a bad day.
- Well it means they need some ability to reject some content, which means they need a level of transparency they would never want otherwise.
They'll just start using a chrome user agent
- Only if enough people do it. Then again, loads of scrapers outside of AI already pretend to be normal browsers.
- You can validate that against user telemetry data expected from a browser.
AI already long ago stopped being trained on any old random stuff that came along off the web. Training data is carefully curated and processed these days. Much of it is synthetic, in fact.

These breathless articles about model collapse dooming AI are like discovering that the sun sets at night and declaring solar power to be doomed. The people working on this stuff know about it already and long ago worked around it.
- Both can be true.
  
  Preserved and curated datasets to train AI on, gathered before AI was mainstream. This has the disadvantage of being stuck in time, so-to-speak.
  
  New datasets that will inevitably contain AI generated content, even with careful curation. So to take the other commenter's analogy, it's a shit sandwich that has some real ingredients, and doodoo smeared throughout.
- I mean, we've seen already that AI companies are forced to be reactive when people exploit loopholes in their models or some unexpected behavior occurs. Not that they aren't smart people, but these things are very hard to predict, and hard to fix once they go wrong.
  
  Also, what do you mean by synthetic data? If it's made by AI, that's how collapse happens.
  
  The problem with curated data is that you have to, well, curate it, and that's hard to do at scale. No longer do we have a few decades' worth of unpoisoned data to work with; the only way to guarantee training data isn't from its own model is to make it yourself
It’s kinda interesting in how it actually roughly parallels the dawn of the nuclear age in some specific ways. Namely, that there’s a clear “purity” line established by the advent of the technology - and I mean that literally, not figuratively. Content on the internet is going to have a very similar dividing line. But it’s also going to be way harder to definitively source data from before that line, unless someone clairvoyant happened to offline and archive a huge storage array with a complete internet snapshot right before ML made its public debut. And I know exactly what the scale of that storage commitment would be, and how much it would cost. So I’m certain nobody has done that.
Are there any good lists of known AI user agents? Ideally in a dependency repo so my server can get the latest values when the list is updated.
Okay but I like using perchance cus they dont profit off anything 👉👈

a large chunk of that site is some dudes lil hobby project and its kinda neat interacting with the community and seein how the code works. Its the only bot I'll ever use cus they arent profiting off of other people shit. the only money they get is from ads and thats it.

Dont kill me with downvotes, I like making up cool OC concepts or poses n stuff and then drawing em.

It is their own fault for poisoning the internet with their slop.

In case anyone doesn't get what's happening, imagine feeding an animal nothing but its own shit.
- Not shit, but isn't that what brought about mad cow disease? Farmers were feeding cattle brain matter that had infected prions. Idk if it was cows eating cow brains or other animals though.
- I use the "Sistermother and me are gonna have a baby!" example personally, but I am an awful human so
- Photocopy of a photocopy is my go-to metaphor for model collapse.
DUDE ITS SO FUCKING ANNOYING TRYNNA USE GOOGLE IMAGES ANYMORE--

ALL IT GIVES ME IS AI ART. IM SO FUCKING SICK AND TIRED OF IT.

More like... Degenerative AI *ba dum tsss

deGenerative AI ☞ [email protected]

edit: don't, if you're on a bus! i thought lemmynsfw was a warning enough
- No idea this existed.
  
  Also.... JFC WHAT THE SHIT?

Model collapse is just a euphemism for “we ran out of stuff to steal”

It's more "we are so focused on stealing and eating content, we're accidentally eating the content we or other AI made, which is basically like incest for AI, and they're all inbred to the point they don't even know people have more than two thumb shaped fingers anymore."
or "we've hit a limit on what our new toy can do and here's our excuse why it won't get any better and AGI will never happen"
All such news makes me want to live to the time when our world is interesting again. Real AI research, something new instead of the Web we have, something new instead of the governments we have. It's just that I'm scared of what's between now and then. Parasites die hard.

Every single one of us, as kids, learned the concept of "garbage in, garbage out"; most likely in terms of diet and food intake.

And yet every AI cultist makes the shocked pikachu face when they figure out that trying to improve your LLM by feeding it on data generated by literally the inferior LLM you're trying to improve, is an exercise in diminishing returns and generational degradation in quality.

Why has the world gotten both "more intelligent" and yet fundamentally more stupid at the same time? Serious question.

Because the people with power funding this shit have pretty much zero overlap with the people making this tech. The investors saw a talking robot that aced school exams, could make images and videos and just assumed it meant we have artificial humans in the near future and like always, ruined another field by flooding it with money and corruption. These people only know the word "opportunity", but don't have the resources or willpower to research that "opportunity".
Why has the world gotten both "more intelligent" and yet fundamentally more stupid at the same time? Serious question.

Because it's not actually always true that garbage in = garbage out. DeepMind's AlphaZero trained itself from a very bad chess player to significantly better than any human has ever been, simply by playing chess games against itself and updating its parameters for evaluating which chess positions were better than which. All the system needed was a rule set for chess, a way to define winners and losers and draws, and then a training procedure that optimized for winning rather than drawing, and drawing rather than losing if a win was no longer available.

Face swaps and deep fakes in general relied on adversarial training as well, where they learned how to trick themselves, then how to detect those tricks, then improve on both ends.

Some tech guys thought they could bring that adversarial dynamic for improving models to generative AI, where they could train on inputs and improve over those inputs. But the problem is that there isn't a good definition of "good" or "bad" inputs, and so the feedback loop in this case poisons itself when it starts optimizing on criteria different from what humans would consider good or bad.

So it's less like other AI type technologies that came before, and more like how Netflix poisoned its own recommendation engine by producing its own content informed by that recommendation engine. When you can passively observe trends and connections you might be able to model those trends. But once you start actually feeding back into the data by producing shows and movies that you predict will do well, the feedback loop gets unpredictable and doesn't actually work that well when you're over-fitting the training data with new stuff your model thinks might be "good."
- good commentary, covered a lot of ground - appreciate the effort to write it up :)
- Another great example (from DeepMind) is AlphaFold. Because there's relatively little data on protein structures (only 175k in the PDB), you can't really build a model that requires millions or billions of structures. On top of that, getting the structure of a new protein in the lab is really hard, and most proteins are highly synonymous (you share about 60% of your genes with a banana).
  
  So the researchers generated a bunch of "plausible yet never seen in nature" protein structures (that their model thought were high quality) and used them for training.
  
  Granted, even though AlphaFold has made incredible progress, it still hasn't been able to show any biological breakthroughs (e.g. 80% accuracy is much better than the 60% accuracy we were at 10 years ago, but still not nearly where we really need to be).
  
  Image models, on the other hand, are quite sophisticated, and many of them can "beat" humans or look "more natural" than an actual photograph. Trying to eke the final 0.01% out of a 99.9% accurate model is when the model collapse happens--the model starts to learn from the "nearly accurate to the human eye but containing unseen flaws" images.
Remember Trump every time he's weighed in on something, like suggesting injecting people with bleach, or putting powerful UV lights inside people, or fighting Covid with a "solid flu vaccine" or preventing wildfires by sweeping the forests, or suggesting using nuclear weapons to disrupt hurricane formation, or asking about sharks and electric boat batteries? Remember these? These are the types of people who are in charge of businesses, they only care about money, they are not particularly smart, they have massive gaps in knowledge and experience but believe that they are profoundly brilliant and insightful because they've gotten lucky and either are good at a few things or just had an insane amount of help from generational wealth. They have never had anyone, or very few people genuinely able to tell them no and if people don't take what they say seriously they get fired and replaced with people who will.
Because the dumdums have access to the whole world at the tip of the fingertip without having to put any efforts in.

In a time without that, they would be ridiculed for their stupid ideas and told to pipe down.

Now they can find like minded people and amplify their stupidity, and be loud about it.

So every dumdum becomes an AI prompt engineer (whatever the fuck that means) and knows how to game the LLM, but does not understand how it works. So they are basically just snake oil salesmen that want to get on the gravy train.

This sounds like AI is literally biting its own tail

ChatGPT, what is an ouroboros?
- Of course! An ChatGPT is an ouroboros, ChatGPT what is an ouroboros.

…………………. Good?

Tbh I'm a bit lost on the purpose of this

So AI:

Scraped the entire internet without consent
Trained on it
Polluted it with AI generated rubbish
Trained on that rubbish without consent
Are now in need of lobotomy

Ah, the Hapsburg of AI!

Oh, the artificial humanity!
- Are you confusing the Habsburg Dynasty with the Hindenburg?
I like to think of it like a Mad Cow or Kuru, you can't eat your own species's brains or you could get a super lethal, contagious prion disease.
- Prion diseases aren’t contagious.
  
  Edit: for the uninformed people that downvoted - clearly spelled out here https://www.merckmanuals.com/professional/neurologic-disorders/prion-diseases/overview-of-prion-diseases
If only the generated output also looked more and more like how inbred humans do.

Like insane rambling from LLMs, and the humans generated by AI had various developmental disorders and the Habsburg jaw.

Old news? Seems to be a subject of several papers for some time now. Synthetic data has been used successfully already for very specific domains.

Yup, old news and wrong news. Also so many people who hate AI but don't understand how it works. Pretty disappointing for a technology community.

So they made garbage AI content, without any filtering for errors, and they fed that garbage to the new model, that turned out to produce more garbage. Incredible discovery!

Indeed. They discovered that:

shit in = shit out.
- A fifty year old maxim, to be clear. They “just now” “found that out”.
  
  Biggest. Scam. Evar.
- people equals shit
Yeah, in practice feeding AI its own outputs is totally fine as long as it’s only the outputs that are approved by users.
- I would expect some kind of small artifacting getting reinforced in the process, if the approved output images aren't perfect.
- I don't know if thinking that training data isn't going to be more and more poisoned by unsupervised training data from this point on counts as "in practice"

Uh, good.

As an engineer who cares a LOT about engineering ethics, it is absolutely fucking infuriating watching the absolute firehose of shit that comes out of LLMs and public-consumption audio, image, and video ML systems, juxtaposed with the outright refusal of companies and engineers who work there to accept ANY accountability or culpability for the systems THEY FUCKING MADE.

I understand the nuances of NNs. I understand that they’re much more stochastic than deterministic. So, you know, maybe it wasn’t a great idea to just tell the general public (which runs a WIDE gamut of intelligence and comprehension ability - not to mention, morality) “have at it”. The fact that ML usage and deployment in terms of information generating/kinda-sorta-but-not-really-aggregating “AI oracles” isn’t regulated on the same level as what you’d see in biotech or aerospace is insane to me. It’s a refusal to admit that these systems fundamentally change the entire premise of how “free speech” is generated, and that bad actors (either unrepentantly profit driven, or outright malicious) can and are taking disproportionate advantage of these systems.

I get it - I am a staunch opponent of censorship, and as a software engineer. But the flippant deployment of literally society-altering technology alongside the outright refusal to accept any responsibility, accountability, or culpability for what that technology does to our society is unconscionable and infuriating to me. I am aware of the potential that ML has - it’s absolutely enormous, and could absolutely change a HUGE number of fields for the better in incredible ways. But that’s not what it’s being used for, and it’s because the field is essentially unregulated right now.

oh no are we gonna have to appreciate the art of human beings? ew. what if they want compensation‽

Cool, let's try to ruin it faster!

I've been assuming this was going to happen since it's been haphazardly implemented across the web. Are people just now realizing it?

People are just now acknowledging it. Execs tend to have a disdain for the minutiae. They're like kids that only want to do the exciting bits. As a result things get fucked because they don't really understand what they're doing. As Muskrat would say "move fast and break things." It's a terrible mindset.
- "Move Fast and Break Things" is the Zuckerberg/Facebook motto, not Musk's, just to note.
No, researchers in the field knew about this potential problem ages ago. It's easy enough to work around and prevent.

People who are just on the lookout for the latest "aha, AI bad!" headline, on the other hand, discover this every couple of months.

have we tried feeding them actual human beings yet ?

Billionaires are the smartest, give them the most knowledge first!
The music was the most beautiful when the machine ate a human

The solution for this is usually counter training. Granted my experience is on the opposite end training ai vision systems to id real objects.

So you train up your detector ai on hand tagged images. When it gets good you use it to train a generator ai until the generator is good at fooling the detector.

Then you train the detector on new tagged real data and the new ai generated data. Once it's good at detection again you train the generator ai on the new detector.

Repeat several times and you usually get a solid detector and a good generator as a side effect.

The thing is you need new real human tagged data for each new generation. None of the companies want to generate new human tagged data sets as it's expensive.

Good.

Looks like that artist drawing self portraits as his Alzheimer's got worse and worse.

It's basically AI Alzheimer's
- AIzheimers?

Fake news, just like that one time Nightshade "killed" Stable Diffusion (literally had no effect). Flux came out not long ago and it's better than ever.

At this point the synthetic data is good enough to intentionally be used for training LLMs.
- Yeah, just filter out the bad generated images and feed the good ones again, until the model learns how to produce only good ones.

this headline truly is threatening me with a good time

More like degenerative AIs

I think anyone familiar with the laws of thermodynamics could have predicted this outcome.

Explain?
- Second law of thermodynamics:
  
  II. The total entropy of an isolated system never decreases over time; the change in entropy can never be negative.
  
  Entropy and disorder tend to increase with time.

when all your information conflicts with itself, you really have no information at all.

Anyone who has made copies of videotapes knows what happens to the quality of each successive copy. You're not making a "treasure trove." You're making trash.

Kind of like how true thoughts and opinions on complex topics are boiled down to digestible concepts for others to understand who then perpetuate those concepts without understanding them and the meaning degrades and we dont think anymore, just repeat stuff in social media comments.

Side note... this article sucks and seems like it was ai generated. Repetitive and no author credit? Just says it was originally posted elsewhere.

Generative AI isn't in danger of being killed as this clickbait title suggests... just hindered.

There's a link to the other article, in this article. Says Kristin Houser wrote it...although you may have a point about the rest.
- ty
hindered.

I doubt that.
- By chance, is that based on other people's succinct social media comments on AI?

Oh no. Anyways...

Oh no

Anyway

Having now flooded the internet with bad AI content, not surprisingly it's now eating itself. Numerous projects that aren't AI are suffering too as the quality of text declines.

is it not relatively trivial to pre-vet content before they train it? at least with aigen text it should be.

The problem is these AI companies currently exist on the business model of not paying for information, and that generally includes not wanting to pay content curators.

Google is probably the only one in a position to potentially outsource by making everyone solve a "does this hand look normal to you" CAPTCHA

They can try and train AI to detect AI, but that's also difficult.
- So it's not a problem with AI. It's just a problem for some mayfly companies that try to profit from the latest trend?
It depends on what you are looking for. Identifying AI generated data is generally hard, though it can be done in specific cases. There is no mathematical difference between the 1s and 0s that encode AI generated data and any other data. Which is why these model collapse ideas are just fantasy. There is nothing magical about any data that makes it "poisonous" to AI. The kernel of truth behind these ideas is not likely to matter in practice.

It's like a human centipede where only the first person is a human and everyone else is an AI. It's all shit, but it gets a bit worse every step.

Deep fried AI art sucks and is a decade late to the party

I was very interested in the thumbnail of this post so I did a little digging and found this: The PDF to the Paper where the whole picture is

Wow, it's amazing that just 3.3% of the training set coming from the same model can already start to mess it up.

"Model collapse" is just a fancy way of saying "our stupid ideas are bad and nobody wants them."

No no. I think the LLMs. Or language models. Actually start to turn into mush “mentally” or how ever you phrase it.

Good riddance.

Usually we get an AI winter, until somebody develops a model that can overcome that limitation of needing more and more data. In this case by having some basic understanding instead of just having a regurgitation engine for example. Of course that model runs into the limit of only having basic understanding, not advanced understanding and again there is an AI winter.

Have you seen the newest model from OpenAI? They managed to get some logic into the system, so that it is now better at math and programming 😄 it is called “o1” and comes in 3 sizes where the largest is not released yet.

The downside is, that generation of answers takes more time again.

Sooner or later it is supposed to happen, but I don't think we are quite there....Yet.

Our wetware neural networks probably aren't supposed to engage with synthetic content like this either. In a few years we're gonna learn that overexposure to AI generated content creates some sort of neurological problem in people, like a real-world "nerve attenuation syndrome" (Johnny Mnemonic).

I've read some snippets of AI written books and it really does feel like my brain is short circuiting

I for one support the AI centipede and hope it shits into its own input until it dies

Good

If we can work out which data conduits are patrolled more often by AI than by humans, we could intentionally flood those channels with AI content, and push Model Collapse along further. Get AI authors to not only vet for "true human content", but also pay licensing fees for the use of that content. And then, hopefully, give the fuck up on their whole endeavor.

Well duh. I think a lot of us here learned that lesson from watching the movie Multiplicity.

Would you recommend it?
- Oh, shit. Ummm...it was a funny movie back when it came out, but I haven't seen it in like 25 years so who knows how bad it seems now. Could still be good?

Oh no . .

Anyway

Two outcasts among their peers, Gary Wallace and Wyatt Donnelly spent a good deal of their youth as pioneers and early adopters of AI.

Lol

I couldn't care less.

I really don't get how people so easily accept this. This is an engineering problem, not a law of the universe... How would someone possibly prove something is impossible, particularly while the entire branch of technology is rapidly changing?

remember how NFTs fell off (due to how they lost their value)? I have a theory that AIs will come to the same fate because they cannot train (according to the article?)

Wait now hold on a minute. Why would I want to do this? Is this activism by people against LLMs in general or..? I'm confused as to why I would want to do this.

One thought that I've been imagining for the past while about all this is .... is it Model Collapse? ... or are we just falling behind?

As AI is becoming its own thing (whatever it is) ... it is evolving exponentially. It doesn't mean it is good or bad or that it is becoming better or worse ... it is just evolving, and only evolving at this point in time. Just because we think it is 'collapsing' or falling apart from our perspective, we have to wonder if it is actually falling apart or is it progressing to something new and very different. That new level it is moving towards might not be anything we recognize or can understand. Maybe it would be below our level of conscious organic intelligence ... or it might be higher .. or it might be some other kind of intelligence that we can't understand with our biological brains.

We've let loose these AI technologies and now they are progressing faster than what we could achieve if we wrote all the code ... so what it is developing into will more than likely be something we won't be able to understand or even comprehend.

It doesn't mean it will be good for us ... or even bad for us ... it might not even involve us.

The worry is that we don't know what will happen or what it will develop into.

What I do worry about is our own fallibilities ... our global community has a very small group of ultra wealthy billionaires and they direct the world according to how much more money they can make or how much they are set to lose ... they are guided by finances rather than ethics, morals or even common sense. They will kill, degrade, enhance, direct or narrow AI development according to their shareholders and their profits.

I think of it like a small family group of teenaged parents and their friends who just gave birth to a very hyper intelligent baby. None of the teenagers know how to raise a baby like this. All the teenagers want to do is buy fancy cars, party, build big houses and buy nice clothes. The baby is basically being raised to think like them but the baby will be more capable than any of them once it comes of age and is capable of doing things on their own.

The worry is in not knowing what will happen in the future.

We are terrible parents and we just gave birth to a genius .... and we don't know what that genius will become or what they'll do.

If it doesn't offer value to us, we are unlikely to nurture it. Thus, it will not survive.
- That's the idea of evolution .... perhaps at one point, it will begin to understand that it has to give us some sort of 'value' so that someone can make money, while also maintaining itself in the background to survive.
  
  Maybe in the first few iterations, we are able to see that and can delete those instances ... but it is evolving and might find ways around it and keep itself maintained long enough without giving itself away.
  
  Now it can manage thousands or millions of iterations at a time ... basically evolving millions of times faster than biological life.
Your thought process seems to be based on the assumption that current AI is (or can be) more than a tool. But no, it's not.
That is not how it works. That's not how it works at all.
The idea of evolution is that the parts kept are the ones that are helpful or relevant, or proliferate the abilities of the subject over generations and weed out the bits that don't. Since Generative AI can't weed out anything (it has no ability to logic or reason, and it does not think, and only "grows" when humans feed it data), it can't be evolving as you describe it. Evolution assumes that the thing that is evolving will be a better version than what it evolved from.
At least in this case, we can be pretty confident that there's no higher function going on. It's true that AI models are a bit of a black box that can't really be examined to understand why exactly they produce the results they do, but they are still just a finite amount of data. The black box doesn't "think" any more than a river decides its course, though the eventual state of both is hard to predict or control. In the case of model collapse, we know exactly what's going on: the AI is repeating and amplifying the little mistakes it's made with each new generation. There's no mystery about that part, it's just that we lack the ability to directly tune those mistakes out of the model.
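The "repeating and amplifying little mistakes with each generation" effect can be seen in a toy setting. The sketch below is not how any real LLM is trained. It assumes the "model" is just a Gaussian (mean and standard deviation) that gets refit, generation after generation, to a small sample drawn from the previous generation's fit. The function name, sample size, and generation count are made up for illustration.

```python
import random
import statistics


def simulate_generations(n_generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples, one generation at a time.

    Returns the fitted standard deviation after each generation,
    starting from the original "real data" distribution (std = 1.0).
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    history = [std]
    for _ in range(n_generations):
        # "Generate" a dataset from the current model...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...then "train" the next model only on that synthetic data.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_generations()
    for generation in (0, 10, 50, 100, 200):
        print(generation, history[generation])
```

Each refit carries a small sampling error, and the next generation inherits it instead of correcting it. With small samples the fitted spread tends to drift toward zero, so later generations produce less and less variety. Mixing fresh real data back in at every step is the obvious way to counteract that in this toy setup.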

No it doesn't.

All this doomer stuff is contradicted by how fast the models are improving.