News: Nvidia being sued by writers for unauthorized use of their works in generative AI training

"We marvelled at our own magnificence as we gave birth to AI" - Morpheus, The Matrix.

As the world watches the beginning of AI, we are still sitting on our asses trying to figure out what's going on and what to do about it.

IMO, any issues that arise from AI, copyright or otherwise, exist solely because there are no regulations, laws, or guidelines governing its use. Something needs to be done. ASAP.
 
And consider that there will be those people who will object (for whatever reasons) to what books etc. are "fed" into AI.

Much in the manner that many people are currently trying to do (for whatever reasons) with books for humans - especially human children. Who often already know more than we think.

So if a particular book is never fed into an AI, and the AI writes that book, is that a copyright violation?
 
I want to give AI the same freedom to learn/train that a human has and see what happens. If it then regurgitates the information word for word (Harvard University, anyone?) instead of coming up with its own OG thoughts, then yes, 100% sue them (or their creators), but not before.
They already have it.

No one is really suing AI over its output; it's the input that's the problem, due to AI consuming copyrighted works without paying a licensing fee or even buying a copy of the work in question.

That's already a crime when humans do it, but some people are trying to argue that the crime simply ceases to exist when it's AI stealing the work and not a flesh-and-blood human.
 
My understanding is the copyrighted works were paid for or acquired through means that are normally free, i.e., any human could obtain the same material legally. The issue is the authors are claiming the material was for human consumption; they don't want it being used to train AI because of what it MIGHT (commercial profit) be used for.

If the works were stolen, in the same sense we would call it stealing if a human rather than an AI were involved, then yes, this is wrong.
 
That doesn't appear to be the claim they're making:

"21. Much of the material in NVIDIA’s training dataset, however, comes from copyrighted
works—including books written by Plaintiffs and Class members—that were copied by NVIDIA without consent, without credit, and without compensation."

Here, Ars Technica has a link to the suit: https://cdn.arstechnica.net/wp-content/uploads/2024/03/Nazemian-v-Nvidia-complaint-3-8-2024.pdf
 
That's the confusing part, and where we are interpreting this differently. This is what the court is for.

Did they make a copy of the material and then feed those copies into the AI for learning, or...
Are they implying the act of the AI learning the material is the "copy"?

I read a book, I "copy" the material to my brain... The human brain can't provide perfect recall, well not yet anyways (Dune?); AI can provide perfect recall, or a "copy"... This needs clarity.
 
24. The Pile is a training dataset curated by a research organization called EleutherAI. In
December 2020, EleutherAI introduced this dataset in a paper called “The Pile: An 800GB Dataset of Diverse Text for Language Modeling”3 (the “EleutherAI Paper”).

25. According to the EleutherAI Paper, one of the components of The Pile is a collection of
books called Books3. The EleutherAI Paper reveals that the Books3 dataset comprises 108 gigabytes of data, or approximately 12% of the dataset, making it the third largest component of The Pile by size.

26. The EleutherAI Paper further describes the contents of Books3:

"Books3 is a dataset of books derived from a copy of the contents of the
Bibliotik private tracker … Bibliotik consists of a mix of fiction and
nonfiction books and is almost an order of magnitude larger than our next
largest book dataset (BookCorpus2). We included Bibliotik because
books are invaluable for long-range context modeling research and
coherent storytelling."

27. Bibliotik is one of a number of notorious “shadow library” websites that also includes
Library Genesis (aka LibGen), Z-Library (aka B-ok), Sci-Hub, and Anna’s Archive. These shadow libraries have long been of interest to the AI-training community because they host and distribute vast quantities of unlicensed copyrighted material. For that reason, these shadow libraries also violate the U.S. Copyright Act.

28. The person who assembled the Books3 dataset, Shawn Presser, has confirmed in public
statements that it represents “all of Bibliotik” and contains approximately 196,640 books.

29. Plaintiffs’ copyrighted books listed in Exhibit A are among the works in the Books3
dataset. Below, these books are referred to as the Infringed Works.

30. Until October 2023, the Books3 dataset was available from Hugging Face. At that time,
the Books3 dataset was removed with a message that it “is defunct and no longer accessible due to reported copyright infringement.”

31. In sum, NVIDIA has admitted training its NeMo Megatron models on a copy of The
Pile dataset. Therefore, NVIDIA necessarily also trained its NeMo Megatron models on a copy of Books3, because Books3 is part of The Pile. Certain books written by Plaintiffs are part of Books3— including the Infringed Works—and thus NVIDIA necessarily trained its NeMo Megatron models on one or more copies of the Infringed Works, thereby directly infringing the copyrights of the Plaintiffs.

34. To train the NeMo Megatron language models, NVIDIA copied The Pile dataset. The
Pile dataset includes the Books3 dataset, which includes the Infringed Works. NVIDIA made multiple
copies of the Books3 dataset while training the NeMo Megatron models.

35. Plaintiffs and the Class members never authorized NVIDIA to make copies of their
Infringed Works, make derivative works, publicly display copies (or derivative works), or distribute copies (or derivative works). All those rights belong exclusively to Plaintiffs under the U.S. Copyright Act.

36. NVIDIA made multiple copies of the Infringed Works during the training of the NeMo
Megatron models without Plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act.

On information and belief, NVIDIA has continued to make copies of the Infringed Works for training other models.

__________________________________________________________________________________________________

Those are the relevant bits of the suit.

Best as I can tell, NVIDIA fed a copy of the works to their LLM from a third-party library that, according to the plaintiffs, acquired its copy of the books illegally.

They then copied said copy to use as training data for other LLMs the company is developing.

But of course, that's what they claim; until it actually comes to court, it's anyone's guess as to how valid their claims are.
 
If a human reads a book and memorizes every sentence, then later writes those sentences out and recreates the work to sell, the human is still guilty of copyright violation. Who and what is doing the copying and distribution is irrelevant; only that copyrighted material was copied and distributed without a valid license.
You do not understand how AI works or the underlying scraping used to create these datasets. The output of the AI is not a perfect facsimile of the data it was trained on, which is why fair use seems to be applicable. If typically inaccessible and copyrighted data exists publicly to be scraped into these datasets, then the IP owners need to go after the people who originally made the data public, not the companies who scraped the data after the fact.
 
Just to clarify and to be certain: IP being "Intellectual Property".

= = = =

Is using Robinson Crusoe (Defoe) as a reference for surviving on a deserted island an applicable "fair use" for a survival manual?

What if the AI recommends cannibalism as a solution?

No way for "dinner's" relatives to go after Daniel Defoe or his descendants or publishers...

Agree. Seems absurd.

Yet here we are.
 
I might be whispering in the wind here, but the principle of Fair Use doesn’t actually exist outside the US legal system.

There’s fair dealing, explicit use, and other location-specific legislation.
 
There is no tool being handed over, period; these rulings already exist as case precedent. People are always trying to make a buck, and it's easier to do that if they can shortcut the process and use other people's works. The commercialization aspect is what crushes any attempt at justifying using copyrighted material without a license.

I even linked the legal framework of the Fair Use doctrine; it's very easy to understand the four-factor test the courts have created. Of all the factors, purpose and economic harm are weighed the most. Was the unlicensed use commercial or educational? Did the unlicensed use have a negative economic impact on the copyright holder? If the answers are "commercial" and "yes," then it's a slam-dunk case.
'Fair Use' (and copyright in general) deals with distribution of copies. Training a NN is not distribution of copies. If the training dataset is distributed and contains copyrighted works, then there's an issue, but trying to use copyright to fight distribution of not-even-close-to-copies is where you end up far away from all existing case law.
You may argue that the initial gathering of publicly available images, and the creation of even an internal, non-distributed training dataset, is not covered under fair use, but there is actually case law in opposition to this, e.g. Kelly v. Arriba Soft Corp., where it was upheld that a search engine can download entire full-resolution images and index them (and even distribute thumbnails of them) without infringing copyright.

::EDIT:: Oh, and "commercial" has absolutely nothing whatsoever to do with fair use. You can be non-commercially infringing (e.g. if you record a movie onto VHS and then hand a copy to a friend, that's infringement) or commercially non-infringing (e.g. the thumbnail example above).
The entire "but humans can blah" is also a dead argument, as there exists precedent on transformative and derivative works. If a human reads a book and memorizes every sentence, then later writes those sentences out and recreates the work to sell, the human is still guilty of copyright violation. Who and what is doing the copying and distribution is irrelevant; only that copyrighted material was copied and distributed without a valid license.
Despite being a straw man, this is also wrong: if a human reads a copyrighted work and creates a different work, there is no copyright violation. JK Rowling cannot sue you for writing any story about wizards even if you read Harry Potter beforehand. Likewise, Disney cannot sue you for copyright infringement for drawing a mouse in any art style, or for drawing your own character in a 'Disney style', but solely for drawing specifically a Disney character. NNs are not photobashing or cutting and pasting sentences and paragraphs from some archive of source images or texts. That is fundamentally not how they function: you can download an entire trained NN, comb through it, and not find a single image or line of text.
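To make that last point concrete, here's a deliberately tiny, hypothetical sketch (a throwaway bigram counter, nothing like how a production LLM is trained): the "trained model" that comes out the other end is just a table of numbers, and the training text does not survive in it verbatim.

```python
# Toy bigram "model": count character transitions in a text, then
# normalize each row into probabilities. The learned state ("weights")
# is purely numeric; the training text is not stored in it.

text = "the quick brown fox jumps over the lazy dog"
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}

# "Training": accumulate transition counts for each adjacent character pair.
counts = [[0.0] * len(chars) for _ in chars]
for a, b in zip(text, text[1:]):
    counts[idx[a]][idx[b]] += 1

# Normalize rows to probabilities (guarding rows for characters with no successor).
weights = [[c / (sum(row) or 1) for c in row] for row in counts]

# Combing through the trained state finds floats, not the source text.
assert "quick" not in str(weights)
assert all(isinstance(w, float) for row in weights for w in row)
```

The same holds structurally for a real NN: its checkpoint is tensors of floating-point weights. Whether a large enough model can nonetheless reproduce training text verbatim at inference time is, of course, exactly what suits like this one argue about.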
I might be whispering in the wind here, but the principle of Fair Use doesn’t actually exist outside the US legal system.
'Fair dealing' is how most other countries refer to that concept under copyright legislation. The US is the odd one out.


::EDIT:: And before someone trots out the lazy "AI BRO" argument in lieu of thinking: my opinion is that the current wave of MLNNs has very limited application, and the 'AI boom' will end once everyone figures out nobody actually wants these LLMs, and more so that nobody wants to pay for using them. Large Language Models have the least applicability, at best producing the crappiest tier of marketing copy, and only with extensive copyediting, plus a very limited usage in turning correctly formatted and structured pseudocode into actual code (and turning poor pseudocode into crap code). Image processing models have broad applicability, but did so before the recent 'AI boom' and have been in active use for many years anyway (e.g. multi-exposure and multi-lens image synthesis in smartphones). Image generation has some utility for artists (as Photoshop did, and more generally as photography did), but the idea that somehow it will 'end art' is about as valid as every single other time that has been claimed of a new analog or digital tool.
I am far more concerned about the impact of the corporate manufactured panic (and subsequent legislative capture) over MLNN tools than I am over the MLNN tools themselves.
 
Referencing: "I am far more concerned about the impact of the corporate manufactured panic (and subsequent legislative capture) over MLNN tools than I am over the MLNN tools themselves."

Good point.

I will add political panic and the usual FUD (fear, uncertainty, doubt) factors so favored by mass media and online sites as a matter of concern as well.

[I just turned off the morning news, full disclosure, because one of the immediate headline stories was so trivial and generally meaningless that the talking heads lost what little credibility they had. Pure FUD. Wondering if trained AI can do better....]

= = = =

Was not aware of the Fair Dealing with respect to the concepts involved.
 
You also paid for every single book you read that wasn't provided to you for free by someone who did pay, or that was out of copyright.

You already paid the authors their due when you bought their book or the library you read them from did when THEY bought it from the publisher.

NVIDIA didn't.

We can certainly argue that AI learning is no different from human learning, but if that's the case, then it should be subject to the exact same laws we are subject to, and that means AI companies need to pay for the content their AI is digesting, just as we have to.
Following your argument... where did Nvidia obtain the texts? Presumably they didn't use hackers to illegally obtain them from a protected source. I'll make a leap and assume the works were published in a way that didn't bother the authors before LLMs came along, and then Nvidia came along and accessed those sources.

It seems the issue may not be that Nvidia "stole" the works, but instead that it has used them in ways that weren't explicitly granted by the authors.
 
Following your argument.. where did nvidia obtain the texts?
I posted this up thread.
 
Feed lawbooks and law-related databases (all properly purchased or licensed) into an AI being trained to understand copyright, infringement, and Fair Use laws, etc.

Anyone want to guess how that ends....?
^^^ This right here is the most human thing AI could do... LOL
 
There are already cases relating to humans writing songs that sound a lot like songs other humans wrote. For example, in 1997 The Rolling Stones added a credit for k.d. lang to their song 'Anybody Seen My Baby' after Keith Richards was informed he may have unintentionally styled the song on the k.d. lang classic "Constant Craving". Maybe he did, or maybe it was a coincidence. Either way, it doesn't matter, since they are close enough. If humans have to own up to it (as in money), then so should A.I. and its handlers.
 
Right, but since the companies involved are all US-based, we are sticking to arguments based on the US legal system. AI bros are going to do mental gymnastics to justify openly stealing other people's property to make "the next new thing" work better. All the LLM makers have already admitted they fed anything and everything they could get their hands on into "training data". LLMs tend to work better the larger the quantity and variety of data fed into the model, so the makers took a "steal now, settle later" approach to quickly capitalize on the massive increase in market valuation happening to anything "AI" related.
 
All the LLM companies needed to do was acquire permission to use the content to train their AIs. Agree to whatever licensing scheme or royalty structure is needed and go from there. Instead they straight up stole everything they could get their hands on. Every book, every news article, every piece of code, everything. Fed it all into the learning models to teach the AI, with the result that the AI can now regurgitate every last one of those items verbatim.

No amount of "it's the future, man" or "the wealthy AI corporations are less evil than the evil greedy wealthy non-AI corporations" is going to get around US copyright law.
 
If I download a Nike logo as a consequence of seeing it in my browser, is that stealing? No. It's also not trademark infringement. I can possess any number of trademarked images and copyrighted materials as long as I do not improperly use them, and even if I did, until a cease and desist order is levied against me or I lose a court case regarding my possession or use, there is nothing that can be done about it.
 
Thank you. That explains it quite well, and point 27 drives it home for me.
 
Actually, it is. Seriously, if you download a Nike logo and then use that logo to make money, they would hit you with a trademark violation so fast it would make your head spin.

Then again, if you can't distinguish between copyright and trademark, there's not much hope you can effectively understand the law as it applies to them.