News AI Lie: Machines Don’t Learn Like Humans (And Don’t Have the Right To)

No offense to the author, but this indiscriminate assault on LLMs (what do math errors have to do with copyright, anyway?) reads less like a valid viewpoint and more like irrational neo-Luddism from someone too worried about losing their job to a new technology to stay objective. Not that I can't relate – my job is under threat as well – or that there aren't valid issues to be had with the technology, but I'm not going to abandon all logic and reason simply because my emotions tell me to.

The article is too long to go point by point, but in a nutshell, the author doesn't seem to realize (or rather, isn't willing to admit) that the more content these LLMs consume, the less recognizable any of it will be with regard to particular sources. And man, if you're going to put scare quotes around stuff you don't like, you should at least put them around the "unlicensed" part in your "'training data' masquerading as 'learning'" bit. Aside from the fact that training LLMs seems to fall under fair use in US copyright law, there's no copyright violation to pursue, since these models do not "copy" work in the traditional sense and merely digest it to predict outcomes. Using a large number of patterns to predict the best next word in a sequence is a far cry from reproducing a copyrighted work without permission, which is why none of the major corporations seem concerned about the wave of legal challenges this has produced.
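If you want to see what "using patterns to predict the best next word" means at the toy level, here's a deliberately dumb sketch of my own – a bigram counter, nothing like a production transformer, but the predict-the-next-word objective is the same in spirit:

```python
from collections import Counter, defaultdict

# Toy "language model": count which word follows which in a corpus,
# then predict the most frequent follower. Real LLMs learn these
# statistics with neural networks over billions of documents, but the
# training objective -- predict the next token -- is the same idea.
corpus = "the cat sat on the mat the cat ate the fish".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word: str) -> str:
    candidates = follows.get(word)
    return candidates.most_common(1)[0][0] if candidates else "<unk>"

print(predict_next("the"))  # -> "cat" (it follows "the" twice in the corpus)
```

Note that the counter stores only aggregate statistics, not any document verbatim – which is exactly the point being argued here.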

If anything, you should put the blame on the human masses for being so thoroughly unoriginal that you can train a mindless machine to believably mimic their verbal diarrhea on the fly. A smaller web with less redundancy and regurgitation and more accessible answers and solutions? The horror! Now, where do I sign up?
 
Using a large number of patterns to predict the best next word in a sequence is a far cry from reproducing a copyrighted work without permission
Indeed – compression algorithms, for example, do exactly that, and no one has a problem with them.
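To make the analogy concrete (a toy demonstration of mine, not a claim about how any specific LLM works): a general-purpose compressor like DEFLATE gets its ratio precisely by exploiting patterns it has already seen in the input.

```python
import zlib

# DEFLATE spends fewer bits on data it can predict/reuse from what it
# has already seen, so highly repetitive input shrinks dramatically.
repetitive = b"the cat sat on the mat " * 100
compressed = zlib.compress(repetitive)
print(len(repetitive), "->", len(compressed))  # 2300 -> a few dozen bytes
```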

If anything, you should put the blame on the human masses for being so thoroughly unoriginal that you can train a mindless machine to believably mimic their verbal diarrhea on the fly.
It's not that humans are that unoriginal; it's that some humans are used to putting minimal effort into their jobs, and now the same (minimal) quality can be produced by machines more cheaply.
 
A smaller web with less redundancy and regurgitation and more accessible answers and solutions? The horror! Now, where do I sign up?
This is assuredly NOT where this will all lead. If anything, it's going to be more content, generated by AI, creating billions of meaningless drivel web pages that are grammatically correct but factually highly questionable. Sure, you can go straight to whatever LLM tool you want (ChatGPT, Bard, Sydney, etc.) and generate that content yourself. It won't be any better – and potentially not any worse – than the myriad web pages that did the same thing and built a website out of it. 🤷‍♂️

I'm not personally that worried for my job, because so far no one has actually figured out a way to have AI swap graphics cards, run hundreds of benchmarks, create graphs from the data, and analyze the results. But there will be various "best graphics cards" pages that 100% crib from my work and the work of other people to create a page that offers similar information with zero testing. Smaller sites will inevitably suffer a lot.
 
because so far no one has actually figured out a way to have AI swap graphics cards, run hundreds of benchmarks, create graphs from the data, and analyze the results.
I mean, swapping a GPU in and out of a system can be done with an assembly-style arm and positioning data for where it goes, and the rest can be done with scripts for the AI to trigger.

So it's not really "not figured out" – it's more that nobody cares enough to do it, as the cost is too high.

And in the context of copyright for training LLMs: the music and film industries made that pretty clear decades ago.

You ask for permission to use the material, and if you don't get a yes, you can't use it. (And yes, people can use content by modifying it, but the LLM/AI is learning off the raw content, which it can't change, so it couldn't use that bypass.)
 
No offense to the author, but this indiscriminate assault on LLMs (what do math errors have to do with copyright, anyway?) reads less like a valid viewpoint and more like irrational neo-Luddism from someone too worried about losing their job to a new technology to stay objective. Not that I can't relate – my job is under threat as well – or that there aren't valid issues to be had with the technology, but I'm not going to abandon all logic and reason simply because my emotions tell me to.
The bots cannot exist without the training data; ergo, one might say that the entire bot is a derivative work. These issues are still being litigated in court, and we don't know exactly what the decisions will be in terms of law; the decisions might also differ from one country to another.

The IP and tech worlds have never seen anything like LLMs, so what they are doing is really unprecedented. The premise that these bots have the legal right to hoover up the data is based on the philosophical view that they are "learning like a person does." In the article, I quoted a couple of people who have literally said that the bots should be thought of as "students" or as a writer who is "inspired by" the work they read.

What I'm trying to argue -- and we'll see how the legal cases pan out -- is that morally, philosophically and technically, LLMs do not learn like people and do not think like people. We should not think of the training process the same way we think of a person learning, because that's not what is happening. It's more akin to one service scraping data from another.

Is scraping all publicly accessible data from a website legal? That has definitely not been established. LinkedIn won a lawsuit against a company that scraped its user profiles, which are all visible on the public web (https://www.socialmediatoday.com/ne...st-Court-Battle-Against-Data-Scraping/635938/). As a society, we recognize that just because a human can view something doesn't mean a machine has the right to record and copy it at scale.

Now I know someone will say "this isn't copying." But in order to create the LLM, they are taking the content, ingesting it, and using it to create a database of tokens. The company that scraped LinkedIn wasn't reproducing LinkedIn content word for word either; it was taking LinkedIn's asset and repurposing it for profit. What about Clearview AI, which scraped billions of photos from Facebook (again, pages that were visible on the public Internet)?
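To be concrete about what "ingesting content into a database of tokens" means mechanically, here's a quick sketch using OpenAI's open-source tiktoken tokenizer (the sample sentence is my own, and this shows only the ingestion step, not training):

```python
# pip install tiktoken  (OpenAI's open-source tokenizer library)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/4-era models
ids = enc.encode("Machines don't learn like humans.")
print(ids)                             # a list of integer token IDs
print([enc.decode([i]) for i in ids])  # the text fragment each ID maps back to
```

The source text goes in; integer IDs representing its pieces come out. Whether that transformation counts as "copying" is precisely what's being litigated.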
 
What I'm trying to argue -- and we'll see how the legal cases pan out -- is that morally, philosophically and technically, LLMs do not learn like people and do not think like people. We should not think of the training process the same way we think of a person learning, because that's not what is happening. It's more akin to one service scraping data from another...
I mean, just because a search engine doesn't process and index information the way a human does, doesn't mean that we should limit their access to all types of searchable data and force them to ask explicit permission for every single public result found. We had the same types of arguments when Google rose to power, and we all know how that particular battle played out.

So yes, we can all mourn the countless librarian jobs (not to mention our privacy) lost to the likes of Google and Amazon, but let's not pretend that any of us want to go back to the reality of spending an entire afternoon simply looking up a fact or finding a book in a library a few miles from home. Machine training, or "learning", is the inevitable future, and that future is already here. Just take one look at the recent Photoshop 2024 release to realize that things are never going back to what we considered normal. The way I see it, content creators have only two viable choices: adapt or perish, which is the harsh reality of all things obsolete. As for the new low-effort, regurgitated content popping up all over the web that Jarred mentioned, we've already had that for the past decade or two. Same s***, different toolset, if you ask me.
 
I mean, just because a search engine doesn't process and index information the way a human does, doesn't mean that we should limit their access to all types of searchable data and force them to ask explicit permission for every single public result found. We had the same types of arguments when Google rose to power, and we all know how that particular battle played out.
AI / LLM generation of content with no attribution is nothing like search. It's more akin to Google/Bing/etc. taking the best search results, ripping out all of the content, and presenting a summary page that never links to anyone else (except maybe Amazon and such with affiliate links). Search works because it drives traffic to the websites, and at most there's a small snippet from the linked article. That's "fair use" and it benefits both parties. LLM training only benefits one side, for profit, and that's a huge issue.
 
AI / LLM generation of content with no attribution is nothing like search. It's more akin to Google/Bing/etc. taking the best search results, ripping out all of the content, and presenting a summary page that never links to anyone else (except maybe Amazon and such with affiliate links). Search works because it drives traffic to the websites, and at most there's a small snippet from the linked article. That's "fair use" and it benefits both parties. LLM training only benefits one side, for profit, and that's a huge issue.
Not quite true, either. There may be a small part of it that relates to quoting very specific facts, like individual benchmark numbers that can be linked to a particular source, but the rest is too generic to be viewed this way. Writing a generic paragraph in the style of Stephen King – loosely speaking – is not ripping off Stephen King, and neither is quoting the 99.9% of facts known to humanity. To whom should it attribute the idea that Paris is the capital of France, for example? Because the vast majority of queries are just like that. Moreover, considering the way LLMs digest information to produce their content, I frankly can't even imagine how they could properly source it all. You say they're nothing like humans, but it's a lot like trying to attribute an idea that just popped into your head based on multiple sources and events in your life going decades back. It seems inconceivable.

It appears to me that you're looking at this from the niche position of your website and not broadly enough to see the big picture. Even when asking a related question like "what's the best gaming GPU on the market right now?", you won't get a rephrased summary of Tom's latest GPU so-and-so. What you'll get is likely an aggregate reply that stems from dozens of websites like yours (including far inferior ones), and probably an old one at that – "RTX 3090!" – since the training data for these things can lag quite a bit. The doctrine of fair use also includes the idea of "transformative purpose", which is exactly what is going on with probably at least 99.99% of ML-generated content already.

Again, it's not like I don't empathize with any of your concerns, but considering the mind-blowing opportunities presented by this new technology, I can't help but see these objections as beating a dead horse at best and standing in the way of inevitable progress at worst. I wish we could all focus on the truly amazing stuff it can and will soon do, rather than fixate on what seems to be futile doomsayer negativity that – at least in my view – will almost certainly lose in court and therefore ultimately cause the very people affected even more stress and financial trouble. I could be wrong, but that's my current take on this.
 
While I consider most AI-generated content to be "grammatically correct garbage", and definitely don't see AI as thinking the way people do, I also don't really want scraping to be outlawed, and I don't buy the "search engines are OK because they benefit the website" argument. Avram is certainly correct in analyzing how AI objectively works and what it is capable of producing (like a person who understands the worth of their tool); but also, yeah, with all these articles I do feel a bit of protectionism and defensiveness for his field coming through.

Frankly, even if it isn't outlawed, the fact that there are a lot of people willing to "poison" AI content (which is easy to do), whether for serious reasons or for the lulz, means that legality won't be the be-all and end-all. And perhaps more importantly, it seems that LLMs are running out of "fresh" data and collapsing in on themselves, both under the excess of data they do have and the cannibalization that stems from "Model Autophagy Disorder". I do genuinely think that LLMs are being pushed to do broader things than they are really capable of, and both corporations and researchers in the field refuse to admit the limitations because they'd lose funding. Just remember, folks: we've had big AI hype before, as early as the '70s IIRC, and it more or less went down in the same cycle.
 
This is assuredly NOT where this will all lead. If anything, it's going to be more content, generated by AI, creating billions of meaningless drivel web pages that are grammatically correct but factually highly questionable.
One could argue that there are already a vast number of meaningless drivel webpages out there and that grammatically correct drivel would actually represent an improvement.

I think the suggestion that AI-generated webpages should be clearly flagged as such is a good one, although surely very difficult, at best, to enforce.
 
Color me surprised... A tool that needs to be used responsibly. A hammer is a great tool, until you use it to smash someone's head. Keep that in mind always.

Ingestion of any kind (talking about large-scale data modeling in general) has to be responsible. The data needs to be curated so the ingestion doesn't produce garbage. This is not a "new thing"; it's just now being adapted to "AI" and labeled a "new problem", when it is not. There are already laws in place... well, outside the USA at least... that cover most data-ingestion scenarios, thanks to Google being the predatory data company par excellence.

As for the "human rights" thing... Maybe the same clowns that argue "Companies" should also have human rights?

Regards.
 
>Again, it's not like I don't empathize with any of your concerns, but considering the mind-blowing opportunities presented by this new technology, I can't help but see these objections as beating a dead horse at best and standing in the way of inevitable progress at worst.

The author has been posting a string of these "LLMs suck" pieces for a while now. I've been meaning to post a response like yours, but I never did. Because, one, he's the EIC of this site, and this is his bully pulpit, so I figured I'd let him have his rant. If it helps alleviate his fear of being obsoleted by AI, then it's cheap therapy for him, even if the reader has to bear the brunt of his "therapy."

And two, all this beeswax doesn't really matter much anyway. Whatever direction AI takes will not be decided or influenced by anything here. That, and everybody already has an opinion about AI, and nobody will change theirs because of some online argument. Getting engaged is just a waste of time.

But I empathize with the author. For him, it's not about influencing the outcome or anything substantive; it's just about being heard. So, again, let him have his bully pulpit.

For me, one part is the tech hobbyist, trying to see how far these LLMs can take us. Another part is an observer, gleaning trends and predicting what's next. It's an exciting time, and yes, there will be turbulence.

Speaking of LLMs, I've been meaning to ask: So wassup with THW's chatbot? There's been no update since the initial gung-ho spurt at the end of May. It's been 4 months, and Hammerbot is still as dumb as ever. Maybe the EIC should write a piece about that instead, rather than tilting at more windmills?
 
I'm not personally that worried for my job, because so far no one has actually figured out a way to have AI swap graphics cards, run hundreds of benchmarks, create graphs from the data, and analyze the results. But there will be various "best graphics cards" pages that 100% crib from my work and the work of other people to create a page that offers similar information with zero testing. Smaller sites will inevitably suffer a lot.

Was gonna say, who needs all that testing when an LLM will just hallucinate the results and create an article discussing that hallucination, complete with fanbois insisting the ravings from the void of transistors should be treated as "real"?

Seriously everyone, stop confusing science fiction actors playing anthropomorphized machinery with what LLMs are. There is no analysis, no intelligence, no thoughts; it's just picking the most likely word to follow the word it has already picked, based on the probabilities derived from all the other words in its database. In the sense of IP and copyright, it most certainly is violating IP laws, with a high chance of breaking copyright rules depending on whether there is substantive change from the original works.
 
One could argue that there are already a vast number of meaningless drivel webpages out there and that grammatically correct drivel would actually represent an improvement.

I think the suggestion that AI-generated webpages should be clearly flagged as such is a good one, although surely very difficult, at best, to enforce.

The difference is a legal one. If I were to publish an article in which I insist person A did thing B, that person could sue me, and I'd be required to defend how and what I said. If I were to publish an article that was derivative or outright plagiarism, I could be sued and would have to defend how and what I published.

If an AI just vomits drivel it hallucinated by stringing together phrases it has collected off the internet, what are the legal ramifications? The publishing company would just point to the AI service company, and the AI service company would state that it's all from the training data and they aren't liable.

The issues are twofold. First is the data ingestion: unless someone has explicitly authorized their material to be used for AI training, it is IP theft. An AI cannot scan through a Tom's article, which is owned by Future PLC / Thomas Pabst, then use those collections of words to generate similar-sounding words. This is exactly what the guys behind ChatGPT and every other "public"-facing LLM have done, and why they are lobbying as hard as possible to keep anyone from looking around. The second issue is around usage and informed consent. Currently the operating assumption is that humans publish and are responsible for material online. There absolutely needs to be a way to indicate that material was created and published by a generative AI, along with legal liability linking the publisher and the LLM for defamatory or libelous publications.
 
The issues are twofold. First is the data ingestion: unless someone has explicitly authorized their material to be used for AI training, it is IP theft. An AI cannot scan through a Tom's article, which is owned by Future PLC / Thomas Pabst...
Thomas Pabst has not owned anything to do with this site for over two decades, FYI. He sold all the rights to... well, it was originally Tech Media Networks or some such (I might have missed one owner), then that was renamed Purch, and then Future bought Purch about five years ago. Other than giving the site its original name, Thomas has done very little since perhaps 2002/2003. (I was a fan who read the site in the '90s and early aughts. Can't say I cared much for Omid when he took over...)
 
I just want to voice my opposition to these types of articles. I come here for tech enlightenment, not a paranoid rant against new tech.

I found the Stable Diffusion articles so interesting that they inspired me to set it up and play with it. I've learned a lot from that, and it started at Tom's.

I'd love to see articles on the various open-source LLMs that we can run on our gaming rigs at home: performance tests across various GPUs, how many billion parameters you can run on how much VRAM, and so on.
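For reference, the usual back-of-envelope math for the VRAM question goes like this (a rough rule of thumb of mine, not tested numbers): the weights alone need params × (bits per weight ÷ 8) bytes, plus overhead for the KV cache and runtime.

```python
def vram_estimate_gb(params_b: float, bits_per_weight: int, overhead: float = 0.2) -> float:
    """Rough floor for VRAM: weight storage plus ~20% for KV cache/runtime.

    Real usage varies with context length, batch size, and backend, so
    treat this as a starting point, not a guarantee.
    """
    weights_gb = params_b * bits_per_weight / 8  # billions of params * bytes each = GB
    return weights_gb * (1 + overhead)

for size in (7, 13, 70):
    print(f"{size}B @ 4-bit: ~{vram_estimate_gb(size, 4):.1f} GB")
# 7B ~4.2 GB, 13B ~7.8 GB, 70B ~42 GB -- hence 4-bit 7B models fitting on 6-8 GB cards
```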

Teach me something new and fun and interesting, Tom's.
 
>A legal way of flagging a website to specifically exclude AI scraping seems reasonable to me.

You can already exclude AI bots from scraping. The NYTimes is doing that, and presumably most major sites with original content are now too.
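For reference, the opt-out is just a couple of lines in robots.txt (GPTBot is OpenAI's documented crawler; CCBot is Common Crawl's, whose dumps feed many training sets):

```
# robots.txt at the site root -- honored by well-behaved crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Of course, this only works for crawlers that choose to respect robots.txt, and it does nothing about data scraped before the rule was added.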




>The issues are twofold. First is the data ingestion: unless someone has explicitly authorized their material to be used for AI training, it is IP theft.

Whether data used for AI training is "fair use" has yet to be decided in a court of law. That is coming. Your declaration is wrong, at least for now.

From gauging the sentiments of decision makers (viz. US lawmakers), I think there'll be some accommodation for fair use, simply because the competition to advance AI among many countries is so intense that no country would want to hamstring its own progress by putting up major barriers.

I think the EU has the lead in AI regulation with its AI Act. Given that the US has always tended to be more business-friendly than the EU, and especially since most major AI players are US companies, I think we can count on the EU's AI Act as the "ceiling" of potential regulation, with the US coming in somewhat below that (my opinion is that US regs will be substantially less stringent than the EU's).

This is not based on what I think "should be," but what I see as likely. Being descriptive is simpler than wading into the prescriptive pool.

>The second issue is around usage and informed consent. Currently the operating assumption is that humans publish and are responsible for material online.

This is more or less wishful thinking. It's akin to saying that anything posted on the Internet has a legal responsibility to tell "the truth." We all know that ain't true, even before AI.

>Seriously everyone, stop confusing science fiction actors playing anthropomorphized machinery with what LLMs are. There is no analysis, no intelligence, no thoughts; it's just picking the most likely word to follow the word it has already picked, based on the probabilities derived from all the other words in its database.

Yes, more declarations. But there are signs of emergent abilities in LLMs. Even if you think these aren't real, they may well become real with future advancements. That's what's exciting.



>I found the Stable Diffusion articles so interesting that they inspired me to set it up and play with it. I've learned a lot from that, and it started at Tom's.

Hey, ditto. Jarred's SD pieces convinced me to buy an Nvidia GPU to play around with it. That, and to DIY some of the open-source LLMs, courtesy of Meta.

Yeah, it's ironic that the EIC of THW would be an anti-LLM guy, given that his own site (OK, not "his" site precisely) is deploying its own LLM.

Anyway, all sites have "filler" content, and THW is no exception. I just ignore the noise.

>Teach me something new and fun and interesting, Tom's.

For AI, I think you'll have to go elsewhere.
 
It's impossible to have a useful conversation about a work that the LLM has no access to, or that it cannot discuss because of copyright.
Lately I needed to reference quotes from books, and many LLMs refused to do it or manufactured fake quotes.
 
For AI, I think you'll have to go elsewhere.
Hey, I'm still using SD for testing! And if anyone has a suggestion on another potential AI usage that will run on a wider selection of GPUs, I'm all ears. I poked at Whisper, but that only seemed to use GPU shaders for the current 'best' model (i.e. no Tensor / XMX). SD is now pretty reasonably optimized, and my most recent AMD numbers have improved quite a bit, and Intel has really shot up with the latest OpenVINO stuff.

Has anyone here poked at some of the chatbot stuff recently? I did look at one a while back, but I didn't see a way at the time to get it working on AMD or Intel GPUs. It would be cool to add another reference point for AI performance, though, on something people might actually use. Which... maybe a chatbot isn't a great choice there, but I'm not sure what else would qualify.
 
Was gonna say, who needs all that testing when an LLM will just hallucinate the results and create an article discussing that hallucination, complete with fanbois insisting the ravings from the void of transistors should be treated as "real"?
This isn't wrong at all. A while back there was a lawyer trying to use ChatGPT to make his/her work faster who trusted the information being given as factual. Needless to say, it didn't end up working out well for that lawyer.
https://www.youtube.com/watch?v=oqSYljRYDEM
 
Thomas Pabst has not owned anything to do with this site for over two decades, FYI. He sold all the rights to... well, it was originally Tech Media Networks or some such (I might have missed one owner), then that was renamed Purch, and then Future bought Purch about five years ago. Other than giving the site its original name, Thomas has done very little since perhaps 2002/2003. (I was a fan who read the site in the '90s and early aughts. Can't say I cared much for Omid when he took over...)

Yeah, I first started reading Tom's back in 2000; I never tracked who actually owns it now, just that he was the original guy. Back when Tom's would do uarch deep dives complete with die shots.
 
Is impossible to have an useful conversation about a work that the LLM has no access, or cannot argue about because of copyright.
Lately I needed to reference quotes from books, and many LLM refused to do it, or manufactured fake quotes.

Yes, the various LLM vendors have placed blocks to prevent new generative output from those sources due to the wave of litigation. It can still hallucinate, since fabricating material is not illegal, but the incoming tidal wave is making them pause. And there is no such thing as a "conversation" with an AI; you aren't talking to anyone or anything. It's just parsing your words, then guessing what the most "correct" reply is, one word or phrase at a time. Talking with an AI is no different than talking with a toaster.

Seriously, we are just anthropomorphizing code because its output is legible. It's the same human instinct that has us keeping pets and enjoying movies like Zootopia and The Lion King.
 
The bots cannot exist without the training data; ergo, one might say that the entire bot is a derivative work. These issues are still being litigated in court, and we don't know exactly what the decisions will be in terms of law; the decisions might also differ from one country to another...
No offense, but I am not sure I can trust you since I read that Nvidia RTX 20 series article in 2018. This is the article in question
 