News: Google Wants AI Scraping to Be 'Fair Use.' Will That Fly in Court?

I think gigantic companies like them intend to pay a relatively small sum if they get fined over AI-training-related issues, but they do not intend to stop this kind of training. Unless executives get "collected" like they were in Full Tilt Poker's case (Poker's Black Friday), or regulations take a much stricter approach, the direction of the corporate world will not really change.
They have very good law specialists, and the profit coming from the training will likely exceed the fines big time.

Also, regulation-wise, pulling the plug on AI before small entities or individuals get access to very powerful AI tools might even benefit the mega and giga companies, as real breakthroughs would then most likely still be achieved "only" by them. But maybe it is already too late, and too much of the genie is out of the bottle.

All in all, I would like to see AI improve and move forward, as it is a very exciting field, but I would like to see more emphasis put on grounded research, with fewer controversies like this and fewer showboating use cases for AI.
 
Scraping copyrighted content may be theft. A lot of this is largely untested.

Data itself is trickier to argue as theft. It really depends. Facts themselves are not copyrightable under US law. Specific *presentations* of them may be.
 
This has already been tested in the courts. Sony's lawyers argued that possessing MP3 files is equal to stealing; consumers argued that a copy is not stealing because it's only a copy. This is the exact same argument, but this time coming from a company rather than a consumer: the claim that copying data into an AI database is not stealing but merely "scraping", i.e. replicating the original data.
 
The solution is simple. If the model's input is substantially data acquired through 'fair use', then all generated outputs must be labeled as available for 'fair use' by others and made available for others to use, not hidden behind a firewall.

If we aren't careful, Google will effectively own everything.
 
What do you think would happen if I tried this? I stroll into a bank and see a wad of cash within arm’s reach behind an unoccupied teller window. I grab the dough and start walking out the door with it when a police officer, very rudely, stops me. “I’m entitled to take this money,” I say. “Because nobody at the bank told me not to.”

More like going into a newsagent's and taking photographs of the newspapers.
 
Google should have to live by the same fair use standards they put on their content creators. Given how ridiculous YouTube and search can be with filtering out content or complying with obviously bogus DMCA take-down requests, Google should also have to comb through their datasets to aggressively apply every complaint. It's only fair!
 
If, e.g., Google were smart, they would realize that they are digging a hole under their own feet by seriously considering putting professional content creators on the sideline. The "AI output" would quickly become repetitive and outdated, possibly even ending up digesting just whatever some propaganda network puts out.

"AI as librarian" would certainly make more sense. I'd even go for "personal assistant". In example, I would ask PAI to compile a table containing links to all published works by Avram Pitch, containing "AI". Right now, I would have to do a search manually, including specific commands to not have the results be flooded with stuff I am not looking for specifically right now. And having PAI, which uses some personal data from me to actually improve my experience - personal data, such as what my preference for file format of the table is, which I can tell-it/customize - that would seem way cooler. And it would arguably be the actual next logical step in the development of the "web-search experience".

And such an experience may perhaps not have as broad an appeal as a "mysterious oracle" does, or sound like as much of a financial avenue as creating a corporate environment that users are tied to, where their daily schedule is determined by an algorithm, and where they are eventually told to crush another billionaire's fiefdom. But it would feel more like Web 3.0, opening the door to more quality.

For example, with the mentioned table, and some additions to it, it would save some time when writing an article in a non-English language that uses linked quotes, which may help to create interest in web stuff overall. And that could create more traffic for various sites and services, which are then more likely to be able to afford full-time positions, which in turn could mean e.g. more news articles.

Meanwhile, PAIs could help each reader filter for what they are individually interested in, instead of continuing the quest for an ultimate algorithm that caters to everyone while not necessarily being driven by anything beyond the web usage of a relatively small group of power users, who do nothing but sit in their basements day and night flooding the net with their output. Which is good for them, but there are also poorer citizens who may not even have a smartphone to type on. So an algorithm assuming that the power users represent what the majority says and likes makes the online experience, in some cases, quite an estranged one.
 
Google is beyond ruthless in preventing bots from scraping anything off Google, but they give themselves the right to scrape whatever they want. Rank hypocrites!
 
If you don't want your public data scraped, don't make it public in the first place.

Though if you lock everything behind subscription/sign-in walls, most of that stuff won't be indexed anywhere, nobody who isn't already a member of your private club will be able to find it, and even then they'd have to remember which private club a given thing they are searching for might reside in.

Content hosts and creators waging war against search engines and researchers scraping their stuff may be digging their own graves by making their content basically disappear from the face of the Earth.
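For context, the usual opt-out signal for scraping today is robots.txt, which well-behaved crawlers are supposed to check before fetching a page (whether AI training crawlers honor it is exactly what's in dispute). Here is a minimal sketch using Python's standard urllib.robotparser; the bot name and URLs are placeholders.

```python
# Minimal sketch: how a well-behaved scraper checks robots.txt before fetching.
# The bot name and URLs below are placeholders, not any real crawler's identifiers.
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(robots_url: str, user_agent: str, page_url: str) -> bool:
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # downloads and parses the site's robots.txt
    return rp.can_fetch(user_agent, page_url)

if __name__ == "__main__":
    ok = allowed_to_fetch("https://example.com/robots.txt",
                          "ExampleTrainingBot",
                          "https://example.com/articles/some-page.html")
    print("fetch allowed:", ok)
```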
 
This has already been tested in the courts. Sony's lawyers argued that possessing MP3 files is equal to stealing; consumers argued that a copy is not stealing because it's only a copy. This is the exact same argument, but this time coming from a company rather than a consumer: the claim that copying data into an AI database is not stealing but merely "scraping", i.e. replicating the original data.

You're assuming an extremely overbroad set of facts. Copyright infringement relies on a lot of statutory language and precedent; "because broad decision X happened, Y" is extraordinarily facile. Exactly zero lawyers for any of the AI companies in question will ever suggest "copying is not stealing" as a defense, which is about one step above Charlie Kelly's Bird Law from It's Always Sunny.

There are elements of a copyright infringement case that have not been directly tested with AI scraping. For example, the Ninth Circuit has adopted a two-prong test for improper appropriation: the copyright holder of a work that has allegedly been appropriated has to prove substantial similarity (as can be seen in Three Boys Music Corp. v. Bolton and Mattel v. MGA).
 
If you don't want your public data scraped, don't make it public in the first place.

Though if you lock everything behind subscription/sign-in walls, most of that stuff won't be indexed anywhere, nobody who isn't already a member of your private club will be able to find it, and even then they'd have to remember which private club a given thing they are searching for might reside in.

Content hosts and creators waging war against search engines and researchers scraping their stuff may be digging their own graves by making their content basically disappear from the face of the Earth.
The future of the internet.
 
Google better be careful, or their Bard AI is going to get permanently banned by getting too many copyright strikes from their YouTube AI.

But really, I think the correct choice for creators to protect themselves is to upload their entire website to YouTube's Content ID system, where "fair use" doesn't exist.
 
Scraping copyrighted content may be theft. A lot of this is largely untested.

Data itself is trickier to argue as theft. It really depends. Facts themselves are not copyrightable under US law. Specific *presentations* of them may be.

If that were true then none of us would be able to look at anything on the internet; scraping is looking at something and keeping it 'in mind', or in storage in this case, for future use.

If that future use is a 100% copy, or even a pretty close copy that you try to pass off as the real thing, then you might go to jail; but if you use it to inspire your own distinct type of art or whatever, then it's fair use.

If you look at it from the side of the law, then the court has to find a big enough difference, in something, that makes AI looking at stuff different from people looking at stuff.
 
Scraping? Where have I heard that term used before in this context? Oh THAT'S RIGHT! TORRENTS!

Unreal how Google thinks it can steal whatever copyright-protected info it wants, but if I download an MP3 file via torrent, my ISP calls me and lets me know I'm potentially committing a federal offense.

Just sick of this two-tiered system...
 
Imitation is not plagiarism.

Technology always takes jobs, and artists should not be exempt. Today a single calculation engineer does the work of dozens of engineers, and engineers are not demanding a ban on computers.
 
Scraping? Where have I heard that term used before in this context? Oh THAT'S RIGHT! TORRENTS!

Unreal how Google thinks it can steal whatever copyright-protected info it wants, but if I download an MP3 file via torrent, my ISP calls me and lets me know I'm potentially committing a federal offense.

Just sick of this two-tiered system...

You shouldn't have heard the term used before in this context, because scraping does not refer to simply sharing an entire copyrighted work. The issue here is training algorithms on certain aspects of copyrighted works, and whether the resulting outputs are different enough, or contain enough new material, to constitute a new original work rather than a derivative one. Other than the fact that they both fall under intellectual property law, there's almost no similarity between the two issues.

Just because it's copyright infringement to offer "Oh, Pretty Woman" by Roy Orbison via torrent for others to download does not mean it's copyright infringement to utilize elements of it in other works (this specifically was the issue at hand in Campbell v. Acuff-Rose Music). The extent to which AI usage is infringement is largely untested.
 
More like going into a newsagent's and taking photographs of the newspapers.
More like buying a newspaper, reading the newspaper, then telling someone else what you read in the news when they ask you.

Anyone who thinks handing corporations the legal tools to say "you looked at our work before creating your work, therefore we own your work / you owe us cash" is a good idea needs to take another look at the history of copyright law. Do you really believe Disney et al. will pinky-swear to only use such tools against corporate AI models, and not against literally everyone they think they can make a buck from?
 
Of course Google is trying this in Australia first.
Australia's governments have a long tradition of being pushovers for corporate interests.

That they are doing this in Australia, and not in the EU (which is also in the process of looking into AI laws), is a sign of ill intent.
 
Of course Google is trying this in Australia first.
If you require that everyone pays for accessing publicly available data, you make it cost-prohibitive for anyone to run search engines and make the content impossible for most people to find. It isn't going to benefit anybody in the long run.

They will need to come up with some other accommodation. On the AI side of things, that would likely include some form of heuristics to reduce how much and how often the AI regurgitates training data verbatim, or to have it annotate any such quotes with credits so anyone using the AI doesn't mistakenly claim the AI's output as their own.
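As a rough illustration of that kind of heuristic: flag any long word-for-word overlap between a generated answer and known source text so it can be quoted and credited rather than passed off as original. This is a toy n-gram check under those assumptions, not anything any vendor is known to ship.

```python
# Toy heuristic: flag long verbatim overlaps between generated text and a
# known source so they can be credited instead of presented as original.
def verbatim_overlaps(generated: str, source: str, min_words: int = 8) -> list[str]:
    gen_words = generated.split()
    src_text = " ".join(source.split())  # normalize whitespace in the source
    hits = []
    i = 0
    while i + min_words <= len(gen_words):
        window = " ".join(gen_words[i:i + min_words])
        if window in src_text:
            hits.append(window)
            i += min_words  # skip past the flagged span
        else:
            i += 1
    return hits

# Any returned spans would be annotated with a credit/link to the source
# rather than emitted as if they were the model's own words.
```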