News AI Lie: Machines Don’t Learn Like Humans (And Don’t Have the Right To)

Seriously, we are just anthropomorphizing code because its output is legible. It's the same human instinct that has us keeping pets and enjoying movies like Zootopia and The Lion King.
While I 100% agree on the code and "AI", the pets? What the hell, man? I've been around various types of animals for many years. Nearly all of them have emotions, dreams, you name it. And I'm not even talking about the smarter breeds of dogs, who basically have levels of intelligence similar to those of toddlers or autistic kids. I was thoroughly surprised to learn that even such "mindless" creatures as cats or pigeons have a whole range of emotions and intelligence in their behavior. That's not anthropomorphism. That's a statement of fact.

Oh, and I hated Zootopia, by the way.
 
This would only be true for primates; everything else is so chemically different from you that it's just you projecting your own feelings and thoughts. The neurochemistry in your brain that defines things like emotions, colors, thoughts, dreams and so forth is completely alien to non-primates. Each family has an entirely different set of neurochemical reactions that we simply cannot relate to or comprehend. Instead we just anthropomorphize everything to feel better.
 
Sorry, but that idea was discredited by scientific research decades ago.

While some of what you say is obviously valid (but then again, much of this can be said about relating to other people's emotional states as well), our brains and behaviors have a shared evolutionary history that goes back millions of years, and you only need to spend a small amount of time with a truly smart animal like a German Shepherd or an elephant to realize this, even without the studies that all confirm the same. Basic emotions like fear and happiness in particular are self-evident in most types of animals, from hamsters to cows.

I consider myself a rational and scientific-minded individual, so no, I'm not talking about reading a panting dog regulating its heat as "smiling at you". But as far as basic emotions go, even if they may experience them somewhat differently from humans, they are absolutely, undeniably, indisputably there. As a former German Shepherd owner (truly amazing creatures, even though a handful), I can easily bet my life on this fact.
 
I mean, just because a search engine doesn't process and index information the way a human does, doesn't mean that we should limit their access to all types of searchable data and force them to ask explicit permission for every single public result found. We had the same types of arguments when Google rose to power, and we all know how that particular battle played out.

U.S. courts have ruled in favor of Google's ability to scrape and cache data for search engine purposes, but that does not mean that scraping for AI is also fair use.

I wrote about this a few weeks ago. I interviewed Cornell Law Professor James Grimmelmann and he said the following:

“Scraping data is technically reproduction under copyright law,” Grimmelmann said. “The courts have pretty consistently held that scraping is allowed, at least where there's a robots.txt or robots.exclusion protocol file saying you're allowed to do this. The reasoning of the scraping decision depends upon the fair use of the downstream purpose. And if the downstream uses aren't all fair use, then the scraping itself is at least a little more questionable.”
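For readers unfamiliar with the robots.txt mechanism mentioned in the quote, here is a minimal sketch of how a crawler might check it before fetching a page. It uses Python's standard `urllib.robotparser`; the bot names `ExampleSearchBot` and `ExampleAIBot`, the rules, and the URL are all made up for illustration and do not describe any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a search crawler is allowed, an AI crawler is not.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: ExampleAIBot
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from a string, no network access
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/reviews/gpu"  # placeholder URL
search_ok = may_fetch(ROBOTS_TXT, "ExampleSearchBot", URL)
ai_ok = may_fetch(ROBOTS_TXT, "ExampleAIBot", URL)
```

Note that this only shows what a site says it permits. Whether a downstream use is fair use is a separate legal question that the file cannot answer.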

 
That .. has nothing to do with what I was saying.
It does. Obviously, there's quite a lot of subjectivity and anthropomorphizing going on, but again, emotion is a subjective experience even in humans. How can I truly know that your feeling of fear is in any way similar to mine? I can't. No one really can, and we already know that even though the biochemical reaction for fear may be the same for all of us, some people perceive fear very differently from others. But that doesn't stop us from discussing emotions on the human level, does it? Same thing with animals.

And more to the anthropomorphizing bit, scientists know very well how hard it is to gauge something as subjective as an emotion, so they come up with different ways of dealing with that. Here's a study that uses orofacial movements to study emotions in mice, and here's another one that developed a complex protocol to study emotional states in brumbies. There are ones that measure brainwaves, EEG, and so on, so it's not just interpreting visual cues through the prism of one's personal bias.

U.S. courts have ruled in favor of Google's ability to scrape and cache data for search engine purposes, but that does not mean that scraping for AI is also fair use.
My point was that LLMs and other ML models do not really scrape data in the same sense as the LinkedIn and other examples in this thread. I mentioned earlier that asking an LLM a question about the best gamer GPU on the market would result in a dated, "aggregate reply that stems from dozens of websites," but even that's not quite true. The way these things are trained, what you'd likely get would be an answer that stems not from a dozen, but from hundreds of such websites in dozens of different languages, plus tens of thousands of related discussions on public forums like Reddit (not to mention all the countless linguistic sources that were used to make the output look legible).

Because that's what these things largely do: they mindlessly parse the data and find the lowest common denominator to predict the most probable "right" answer. How do you propose they can even source something like that? Even if it were somehow possible and practical (again, you'd likely have hundreds of sources for every single query), the cost of running these things would probably multiply a thousandfold, if not more.
 
Not quite true, either. It may be a small part of it that relates to quoting very specific facts, like individual benchmark numbers that may be linked to a particular source, but the rest of it is too generic to be viewed this way. Writing a generic paragraph in the style of Stephen King – loosely speaking – is not ripping off Stephen King, and neither is quoting 99.9% of the facts known to humans. To whom should it attribute the idea that Paris is the capital of France, for example? Because the vast majority of the queries are just like that. Moreover, considering the way LLMs digest information to produce their content, I frankly can't even imagine how they could properly source it all.

We don't have to imagine what proper sourcing would look like. Bing Chat does a really good job of it right now. ChatGPT and Google Bard / SGE refused to cite sources not because they can't do it technically but because they want to maintain the illusion that their bots are being creative. They know what the sources of their information are. They are just burying them. Google SGE has "related links" which are two clicks away from the content and, if you take the time to click on them, you'll see word-for-word where the answers come from.

You say they're nothing like humans, but it's a lot like trying to attribute an idea that just popped up in your head that is based on multiple sources and events in your life going decades back. It seems inconceivable.
You're right. Humans don't know where all their ideas came from and some of it just came from life experience or hearing something a million times. Machines do know where every fact they cite came from and it was ALWAYS someone else's work.
It appears to me that you're looking at this from the niche position of your website and not broadly enough to see the big picture. Even when asking a related question like "what's the best gamer GPU on the market right now?", you won't get a rephrased summary from Tom's latest GPU so-and-so. What you'll get is likely an aggregate reply that stems from dozens of websites like yours (well, including inferior ones), and probably an old one at that – "RTX 3090!" – since the training data for these things can lag quite a bit. The doctrine of fair use also includes the idea of "transformative purpose", which is exactly what is going on with probably at least 99.99% of ML generated content already.

The question of whether it's transformative is an open question. Their entire model could be considered a derivative work. The issue is that, for many things, they've taken information from so many different places that they can effectively launder the theft. One of the tests for fair use is whether the derivative work competes in the marketplace with the original source material. Clearly, Google SGE, Bing Chat, Bard and ChatGPT all compete with the copyrighted materials they are ingesting. They are absolutely harming the market for the original materials they took from.

One thing I point out is that the aggregate replies are really poor quality, because they are a plagiarism stew of content from reputable, disreputable and heavily biased sources alike (ex: using content from a company that makes or sells a product as objective buying advice about that product, using content from a PAC when providing facts about a controversial issue). The stew might give them a better "fair use" argument -- look, we stole from so many sources and mixed them all together so everything is good -- and better plausible deniability about plagiarism. However, it's a terrible user experience because the reader can't interrogate the bias of the sources and one sentence of the answer might disagree with another sentence (see my example when I asked SGE about concert taping).

You're right; the answers are usually poor quality. And, in a fair marketplace of ideas, readers would recognize that and eschew the AI's advice. However, many people don't know enough to know that the AIs are wrong. You and I know that the RTX 3090 is last-gen, but many searchers do not. In the case of SGE, for example, Google is leveraging its monopoly position to put this poor-quality advice on the first page of results, above advice written by reputable human experts. It's not even close to a fair competition when they take your own content, use it to create an inferior product and then use their control of the SERP page to put that inferior product front and center above you.

If a human wrote a derivative article and didn't cite sources, as long as they mixed enough ideas together without directly copying words and phrases, courts would probably consider it fair use (FWIW, the person would have to do more creative mixing than the bots do now). Also, bot authors aren't people and shouldn't be entitled to the same right to "own" knowledge that people are (this is a key point of my article). And that low-quality, human writer wouldn't rank very high in search, because they don't have E-E-A-T (expertise, experience, authority and trust), which is a key ranking signal for Google. However, Google doesn't apply the E-E-A-T standard to its own bot, which it puts on top. Bots are held to a much lower quality and reputational standard than people.


 
We don't have to imagine what proper sourcing would look like. Bing Chat does a really good job of it right now...
That's not really true, though. Bing Chat has plenty of issues with its "sourcing", which seems more like crude guesstimation than the real thing. I've read a good article on this that I can't find, but here are a couple of anecdotal examples off the top of my quick Google search:

Google Bard and Bing Chat made it look like I shared fake news
False Citations / Links not saying what Bing Chat insists they say

I wouldn't be surprised if they have no sources at all, and what they do is simply run internal Bing searches for the key phrases in their Bing Chat output to generate their source lists. That's not proper referencing, and it has potential legal ramifications of its own.

You're right. Humans don't know where all their ideas came from and some of it just came from life experience or hearing something a million times. Machines do know where every fact they cite came from and it was ALWAYS someone else's work.
That's only true in the broadest possible sense, as every idea they regurgitate comes from a myriad of various sources, not just one or even ten.

I mean, even though they don't train like people, this analogy with humans is very much appropriate here.

Imagine asking me to reference every single claim or idea I just wrote in this thread. Not just those few links I threw in to demonstrate some of my claims, but every single statement, every word even. I don't exactly recall when or how I came to my realization about emotions in animals, and it was probably a very gradual, prolonged process anyway. Hell, I couldn't even come up with my source for Bing Chat sourcing above, and that was something I experienced recently. So even if I theoretically could come up with the appropriate set of memories, it wouldn't necessarily be accurate or complete, and it would take an extraordinary amount of brain power just to do it right.

The LLMs, regardless of how different they may be, work in a very similar fashion. Their makers may have the exact list of every single source that was fed into the model, but how the model interprets these sources and how many – if any – of them were used to generate some particular piece of output is pretty much anyone's guess. The model itself doesn't track the process, only the results, same as the human brain. So the only way they could reliably reference their sources is through meticulous logging of every single step of the model's training process and trying to reverse engineer that inordinate amount of data. But then, just as with my human example, it would be an enormous computational task. It would likely result in an inconceivably large number of secondary and tertiary sources, each containing only a couple of words relevant to the query, which wouldn't be of any use to anyone.
 
Let me try a different approach. Imagine a basic random phrase generator. Now, you can know and list every single book and other verbal source you fed into it, but what happens next? It doesn't randomly search a random source at a random place and copy one of its sentences. No, it indexes every single word used by all of these sources and adds weight to the most popular interactions between these words, making sure that the random combination of words it spews out has at least some semblance of authenticity.

So then, imagine this primitive program running a rand() on its database and coming up with some purely random nonsense like "how to be gray". And then imagine the authors of How to Win Friends and Influence People, 50 Shades of Gray, and Hamlet ("To be or not to be?") – not to mention the Tolkien estate claiming that it took the name of Gandalf the Grey and Americanized it – all coming out after it and asking for royalties for every occurrence of these words produced by this program.

That may sound silly, but that's a lot like what we have going on here right now. The only thing that is really different is the scale and complexity of the program, but the way you can visualize and track its sourcing is a lot more similar than you may think.

Even if it's something more immediately recognizable like "to be or not to be", which almost certainly looks like a direct quote, it may not necessarily be so. This particular phrase has 29,900,000 results on Google alone, so there's a good chance the program simply gave a higher weight - that is, a higher probability of appearance over something like "to be or boogie woogie", which has zero results – to this combination of words solely from the abundance of it in its sources. In which case, tracing it back to Shakespeare would be a stretch and a whole separate task – which is what it would seem Bing Chat is doing with their "sources".
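To make the thought experiment concrete, here is a minimal sketch of such a phrase generator, written as a word-pair (bigram) model. It is a toy under stated assumptions, not how any production LLM works: the `SOURCES` list, the function names and the fixed random seed are all invented for illustration.

```python
import random
from collections import Counter, defaultdict

# Stand-in "sources"; a real system would ingest vastly more text.
SOURCES = [
    "how to win friends and influence people",
    "to be or not to be",
    "fifty shades of gray",
    "gandalf the gray",
]


def train_bigrams(texts):
    """Count how often each word follows another across all source texts."""
    follows = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def generate(follows, start, length, rng):
    """Walk the table, picking each next word in proportion to its count."""
    words = [start]
    for _ in range(length - 1):
        options = follows.get(words[-1])
        if not options:  # dead end: nothing ever followed this word
            break
        candidates = list(options)
        weights = [options[word] for word in candidates]
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)


follows = train_bigrams(SOURCES)
phrase = generate(follows, "how", 4, random.Random(0))
```

After training, `follows` holds only word-pair counts. Nothing in it records which source contributed which pair, so the pair "to be" carries a count of 2 here without any pointer back to Hamlet, which is the crux of the attribution problem described above.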
 
Wow - what a disconnected jumble of irrational thinking.
I'm certainly not going to waste my time picking it apart.

If someone is interested in this topic - just go to Wikipedia and read "fair use".

Obviously "fair use" is implemented so as to balance the interests of all parties to the best results as a whole.
If there is a problem with someone scanning copyrighted material using a pencil and paper or whatever, US laws can be modified to best optimize how information is created and utilized.

Copyright (read the Wikipedia article) is a very important set of rules that can be modified to fit the current environment. If technological advances warrant - laws can be modified.
 
i mean swapping gpu from a system can be done with an assembly style arm & positioning data of where it goes & rest can be done with scripts for the ai to activate.

not really "not figured out" and more nobody cares enough to as cost too high.

and on context of copyright to train LLM....music & film industry made that pretty clear decades ago.

You ask for permission to use it and if you dont get a yes you can't. (and yes ppl can use content by modifying it, but the llm/ai is learning off the raw content which it cant change so couldnt use that bypass)
It's more complicated than that unless someone is preloading the arm for you. If not, the arm must pick up the GPU, find the pin board, properly align that with the chassis slots and insert the card. This also assumes the chassis is fixed; if it's not, then it must also locate the correct PCIe slot, ensure RAM clearance, etc. Then if the PC doesn't POST, something has to diagnose that. Dell and those types get away with it because there are only a couple of configurations they have to worry about. A random card from a vendor sent in for review surely isn't going to fit a generic arm sweep with simple positioning.

You are correct about the cost though. I worked in robotics for automotive manufacturing my first couple of years out of college, which was heavy on automation, and these types of scenario automations are costly to build. I'm not sure a blog site is going to invest millions in doing so just to publish benchmarks.
 
>>For AI, I think you'll have to go elsewhere.

>Hey, I'm still using SD for testing! And if anyone has a suggestion on another potential AI usage that will run on a wider selection of GPUs, I'm all ears.

Hey Jarred, the reason I said the above isn't because of what you are or aren't doing, nor is it because of your EIC's ongoing personal crusade against LLMs. It's because AI topics are a poor fit for THW's present scope, which mainly concerns PC hardware for DIYers.

Your SD pieces were inspiring, not because of the GPU benchmarks for SD--although those are definitely helpful--but because you are showing what's possible for DIYers with respect to SD in particular, and generative AI in general. Their best aspect is as a how-to, even though that isn't their primary goal.

>Which... maybe a chatbot isn't a great choice there, but I'm not sure what else would qualify.

You see the constraint. You can only write about AI applications as pertaining to choosing the "best" PC hardware, and chatbots don't fit that bill. But AI topics are worthy for their own sake, not as a subordinate of hardware topics.

As a tech hobbyist and a DIYer, I'm interested in learning more about AI (LLMs for now) and what I can do with it on a PC. It's an emerging field, and we're just at the starting point.

I've been in this hobby long enough to remember a time when there was no such thing as a GPU (I started with an 80286 clone, so the '287 math coprocessor was the closest equivalent). AI and LLMs are limited to data centers for now, but just as GPUs are now de rigueur on PCs, I foresee a time not too long from now when all PCs will sport dedicated AI accelerators. That's where the fun is, not obsessing about FPS counts. For me, arguments about "best GPUs" or "best CPUs" are boring and passe.

I respect your writing. You are one of the few tech writers who knows his stuff, who keeps an even keel, and who is eloquent enough to express himself clearly. My suggestion is that you should branch out into more fruitful opportunities. Writing pieces on "AI for PC DIYers" would be a great start--perhaps on here, perhaps elsewhere.
 
The way these things are trained, what you'd likely get would be an answer that stems not from a dozen, but from hundreds of such websites in dozens of different languages, and plus another tens of thousands of related discussions on public forums like Reddit (not to mention all the countless linguistic sources that were used to make the output look legible).
Even if it were from dozens of websites, that still feels like theft to me, but it's not that many. We're talking 3 to 5 web pages for an answer actually. On Google SGE, if you open the related links you can literally track back every statement it makes to one of maybe 5 pages. Here's a screen shot I annotated from the search "thinkpad x13 amd review." And there were five web pages many of which had whole sentences taken.

Also, here's a screen shot of "best graphics card." It is taking its data from the following web pages right now for this answer. That plus listings from Google shopping. Those are its sources and the text it was "inspired by" is highlighted in each. Not dozens or hundreds of pages in different languages: 3 web pages and not the best 3 for the topic. Even if you don't think that Tom's Hardware should be one of the sources, I don't think Binary Tides, Quora and eBuyer are the best 3 sources of unbiased, expert information on this topic.



 

Attachments: KFqekpfLQoVXjiMmF8TiPf-1200-80.png, best-gpu-sge.png
How do you know that? For certain, I mean? Your screenshot assumes it "steals" from Tom's because Tom's has "comfortable keyboard" in its review. But if you google "Lenovo Thinkpad X13" "comfortable keyboard", you get 63,200 Google results. You also take something like "excels in productivity" as potential plagiarism, when the actual quote, "designed for productivity", has 23,400 results. And that probably doesn't include countless user reviews on some websites and some other sources.

Since I only have a very superficial knowledge of neural networks (especially ones of this magnitude), I'm not saying that you are necessarily wrong. I'm just saying that you may be mistaken about at least some aspects of it and therefore confusing correlation with causation. This is what I was talking about with confirmation bias and Bing Chat. From my understanding of how these neural networks process, evaluate, and organize data (and again, I have some perfunctory knowledge but I'm not an expert), it is more likely that this is simply a mish-mash of all Lenovo Thinkpad X13 reviews (likely including last-gen and possibly translated ones as well) that bears some similarity to some of them than that this quote was taken verbatim from Tom's and intentionally rephrased to avoid copyright infringement.

As for the sources provided by Google, again, I'm not even sure that they have anything to do with the models to begin with. There is a good chance that their algorithm simply looks up relevant keywords in the output and presents the top domains that have the most occurrences of the phrase - as well as some sponsored links - as the "sources". "best graphics card" site:quora.com has 16,200 results, roughly twice as many as Reddit (9,460) or Tom's (6,780). As for the other two links in your example, I'm not familiar with either, but the fact that one of them is a retailer and the other is simply a non-player in the global traffic rankings makes me think that these are both sponsored links masquerading as sources.

As you can see, I'm not defending either Google or Microsoft. If my assumptions about them making up the sources on the fly and presenting ads as sources are correct, that's not exactly vindication by any means. What I'm trying to say here is that we need more tangible facts, more objective research, and a hell of a lot less guesswork when it comes to how exactly these companies train and deploy their models before making up our minds and reaching for the pitchforks.
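The "sources generated after the fact" hypothesis above can be sketched in a few lines. This is purely an illustration of the guess, not a description of what Google or Microsoft actually do: the page names, page texts and the word-pair overlap score are all hypothetical.

```python
def ngrams(text, n=2):
    """Return the set of n-word sequences in text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def rank_pages_by_overlap(output, pages, n=2):
    """Rank pages by how many word n-grams they share with the generated output."""
    target = ngrams(output, n)
    scores = {name: len(target & ngrams(body, n)) for name, body in pages.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up pages standing in for an index of the web.
PAGES = {
    "review-a.example": "the laptop has a comfortable keyboard and long battery life",
    "review-b.example": "a comfortable keyboard and a bright screen",
    "forum-c.example": "my old desktop is loud",
}
OUTPUT = "this laptop has a comfortable keyboard and is designed for productivity"

ranking = rank_pages_by_overlap(OUTPUT, PAGES)
```

Both review pages score above zero because they share common phrases such as "comfortable keyboard" with the output, even though nothing here says the output was derived from either of them. A lookup like this would produce plausible-looking "sources" from phrase frequency alone, which is why overlap by itself cannot settle the question either way.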
 
Very often, if you look at the text Google SGE uses, it is word-for-word traceable to the sources, and it even deep links to and highlights what it took from. It's unashamed. Look at the screen shots I just shared below. I tried "best ssd for PS5." It's very clear that it took the top 2 bullets from a page on the Shortcut and the third bullet point from a page on MakeUseOf. Could it have gotten similar ideas from other sites? Yes. But this is where it did get them from, as they are almost word-for-word copied and Google even links to them (if you click the open link to reveal them).
 

Attachments: sge-ssd.png, sge-ssd1.png, ssd-sge-2.png
What's sort of hilarious here is that for "Best SSD for PS5" it's pulling generic information about SSDs from a relatively unrelated article to provide a bullet point. And that's a fundamental problem with AI. "SSDs Have a Long Lifespan" isn't really a critical factor in picking an SSD for the PS5, nor is "How to Estimate the Remaining Life of Your SSD." That the article was part of the training data doesn't mean it's directly relevant.

Best Potpourri for PS5 SSDs!

When choosing an SSD for the PS5, because you have to use an SSD and nothing else will suffice, Bard / SGE would like to remind you that durability is a key factor, because all SSDs are very reliable! Isn't that amazing? Shouldn't everyone now go search for a PS5 SSD based on the fact that SSDs are durable and reliable?
 
Very often if you look at the text Google SGE uses it is word-for-word traceable to the sources and it even deep links to and highlights what it took from. It's unashamed. Look at the screen shots I just shared below...

What's sort of hilarious here is that for "Best SSD for PS5" it's pulling generic information about SSDs from a relatively unrelated article to provide a bullet point. And that's a fundamental problem with AI. "SSDs Have a Long Lifespan" isn't really a critical factor in picking an SSD for the PS5, nor is "How to Estimate the Remaining Life of Your SSD." That the article was part of the training data doesn't mean it's directly relevant...

Yeah, that's. . .something else. Where the "sources" in the previous example looked more like random advertising than actual references, they do appear to provide proper sourcing this time. And where the earlier example appeared too generic and more like what I'd expect from an unguided LLM, this one does make it seem likelier that they took their info from these links. The "PCIe Gen 4.0 (x4) M.2 NVME" and "5500 mb/s" bits do pop up in many other sources, including the official PlayStation page, but the exact phrasing is much closer to the linked sources this time. On the other hand, if you look at the actual contents, it appears to be pure nonsense... It's a mess I fully expect from a brainless LLM, but that makes the question of sources and attribution very murky.

The only two guesses I can come up with off the top of my head are that either Google has additional custom content delivery algorithms on top of the neural network content that take the text directly off the pages (similar to the page summaries on its regular searches), or that this is still a case of one big coincidence, which is something you are going to get a lot with LLMs due to their "lowest common denominator" nature.

Both explanations have problems. For the first one, it's that the end result is just too nonsensical to be ripped off intentionally. Neither the form factor nor storage space requirements are included, but there is a generic durability bit that makes no sense. The second one (and I may be biased because I just argued for it, but I do try to stay objective) probably still makes more sense to me, but I'm not nearly as dismissive as I was earlier. If you look at this thing from this perspective, what does it do? It lists the top three factors for choosing an SSD (still no capacity, though) and then it tries to fill them with specific data related to PlayStation 5. "Must be" is a more popular and preferable substitute for "required", and the leftover durability bit is filled with generic info. Then each of the generated bullet points is googled, and voilà, you have your "sources" that look much more convincing this time around.

Is that what happened? Honestly, I have no idea at this point. It's not like I trust Google to do the right thing, and the conflicting nature of these small examples makes me more confused than anything else. We need somebody to research this thing properly, using thousands of inquiries and cross-referencing all their output both with the provided sources and the similar content found on the web.
 
Copyright is about the right to distribute copies, not the right to collect copies. Anyone is fully free to download the entire internet and store it in a database, so similarly everyone should be free to download the internet and store it in the form of a neural net. Humans get sued all the time for violating copyright by regurgitating something they learned online (see: the music artists claiming their song was ripped off by another artist). Whether or not the models learn like humans do does not materially affect the question of whether there's a copyright violation in training them or in using them, nor does it affect the question of whether exceptions need to be made on fair use grounds.

Now, here's my fear: the copyright camp wins a Pyrrhic victory. They lose the battle to prevent training on copyrighted materials, but win the right to enforce controls on how the models are distributed and used: every API call to an LLM must be accompanied by a lookup into a fingerprinting database to check that the model isn't reproducing copyrighted material. As a consequence, open models would need to become closed models that are paywalled. This would potentially lead to a bleak future where all the powerful AIs are paywalled and owned by a few tech giants. A world like that seems worse off even than today's world, because the last thing we need is yet another way for tech giants to become more powerful.

That bleak future presumes that LLMs are on a roadmap to become digital minds equal to or surpassing human minds, and to become essential tools of power. Those in the stochastic parrot camp are not worried about that future, I would guess, as they don't believe this is where things are headed. But, on the other hand, every single researcher actually training next-gen LLMs seems to believe we are indeed heading to such a future. So, the question becomes this: are LLMs like the internet, hyped and mocked in the beginning but ultimately deeply transformative to everything we do, with the skeptics proven wrong, or are they like crypto, equally hyped and mocked, but ultimately amounting to a big ball of nothing, proving the skeptics right? Time will tell.
 