Computers, Privacy & the Constitution

PedroLondonoFirstPaper 3 - 06 May 2024 - Main.PedroLondono
Line: 1 to 1
 
META TOPICPARENT name="FirstPaper"
Line: 7 to 7
 -- By PedroLondono - 28 Feb 2024
Changed:
<
<

In 2023, prominent generative AI developers were hit with three significant copyright infringement lawsuits. In September, the Authors Guild sued OpenAI for massive infringement of protected works in the training of its large language model (LLM), GPT. In November, the Authors Guild filed another class action, this time against OpenAI and Microsoft, on the same grounds as the first suit. In December, The New York Times filed a copyright infringement lawsuit against OpenAI and Microsoft. These lawsuits are likely to challenge the viability of generative AI models as we now know them. This essay explores the possible outcomes of these lawsuits, explains why the training processes of these LLMs have blatantly violated the copyright statute, and concludes that the only way out for these big AI developers is for judges to stretch the fair use doctrine.
>
>

Between September and December 2023, OpenAI and Microsoft were hit with three significant lawsuits alleging copyright infringement in the training of the large language models (LLMs) behind their generative artificial intelligence (gen AI) products. These lawsuits are likely to challenge the viability of generative AI models as we now know them. This essay explores the arguments made in the lawsuits and their potential outcomes, and considers whether these language models could be sustained through the Creative Commons licensing system used by the Wikimedia Foundation.
 
Deleted:
<
<

GENERATIVE ARTIFICIAL INTELLIGENCE

 
Changed:
<
<

In June 2018, OpenAI first introduced its generative AI LLM, called “Generative Pre-trained Transformer” (GPT). Its premise is not complex: users write prompts requesting something specific (summarizing a book, revising a text, drafting an email, etc.), and the language model delivers what the user requested in seconds, as an “output”. Though some argue that generative AI systems such as ChatGPT are a technological revolution and the result of the utmost technical advances, others contend that they simply put into practice data gathering, organizing and prediction processes that have been around for many years (Moglen and Choudhari).
>
>

GENERATIVE ARTIFICIAL INTELLIGENCE AND THE COPYRIGHT ISSUE

 
Changed:
<
<
Perhaps you should have checked how Mishi's name is spelled.
>
>

In June 2018, OpenAI first introduced its generative AI LLM, called Generative Pre-trained Transformer (GPT). Its premise is not complex: users write prompts requesting something specific, and the language model delivers what the user requested in seconds, as an output. Though some argue that generative AI systems such as ChatGPT are a technological revolution and the result of the utmost technical advances, others contend that they simply put into practice data gathering, organizing and prediction processes that have been around for many years, and that the only thing that makes them different today is that there is much more data available to “feed” them, given that individuals have been giving away their own information freely, and sometimes unconsciously, over the past decades (Moglen and Choudhary).
 
Added:
>
>
One key element of these generative LLMs is their training process. Much as human beings learn throughout their lives, LLMs are trained on data and information provided to them, so that they can analyze, process and use that information in what is called the machine learning process. Based on the information disclosed by OpenAI, its LLMs are trained on datasets comprising publicly available texts and information from the internet, licensed content from third parties, and user-generated information. However, based on the assertions made by the Plaintiffs (the Authors Guild and The New York Times) in their complaints, it is very likely that these datasets also included copyrighted works used without a license from their owners.
 
Changed:
<
<
In any case, the key element underlying these generative LLMs is their training process. Much as human beings learn throughout their lives, LLMs are trained on data and information provided to them, so that they can analyze, process and use that information in what is called the “machine learning” process.
>
>
In that sense, the plaintiffs argue that the unauthorized copying of their works to “feed” these LLMs amounts to a massive reproduction of copyrighted works. This unauthorized reproduction would violate the copyright owners’ exclusive rights under Section 106 of the US Copyright Act. Training a machine may not, per se, amount to copyright infringement, just as a human being does not infringe an IP right merely by reading books, poems and encyclopedias, or by looking at paintings and sculptures, and learning from what they see. Nonetheless, the plaintiffs allege that during the training process there was a physical act of copying the copyrighted works into a software system, that is, the making of a reproduction of the original without the authorization of the right holder. Similar to what happened in the Google Books case, Authors Guild v. Google, this reproduction does violate Section 106 of the Copyright Act.
 
Changed:
<
<

INTELLECTUAL PROPERTY ISSUE

>
>

FAIR USE AND CREATIVE COMMONS - POSSIBLE OUTCOMES

 
Changed:
<
<

Copyright is a legal tool that promotes the progress of the arts by securing limited exclusive rights in an author’s work. Section 106 of the US Copyright Act sets out the exclusive rights to which the copyright owner is entitled. Chief among these is the right to reproduce the copyrighted work, or to authorize third parties to reproduce it.
>
>

US courts have often applied the fair use doctrine to excuse copyright infringement and shield defendants from liability under specific circumstances, especially when it comes to big and powerful tech companies, as happened in Authors Guild v. Google (supra) regarding the Google Books project. Codified in §107 of the Copyright Act, the fair use doctrine sets out circumstances under which the violation of an exclusive right is not considered an infringement, enabling certain uses of copyrighted works for specific purposes. As established in the statute, the fair use defense is determined by taking into account four factors: (i) the purpose and character of the use, (ii) the nature of the copyrighted work, (iii) the amount and substantiality of the portion used in relation to the copyrighted work as a whole, and (iv) the effect of the use upon the potential market for or value of the copyrighted work (Henderson et al.).
 
Changed:
<
<
Based on the information disclosed by OpenAI, its LLMs are trained on datasets comprising publicly available texts and information from the internet, licensed content from third parties, and user-generated information. However, based on the assertions made by the Plaintiffs (the Authors Guild and The New York Times) in their complaints, it is very likely that these datasets also included copyrighted works used without a license from their owners.
>
>
These four factors are non-exhaustive, and fair use is an equitable doctrine under which a judge may uphold the defense even if not all four factors are shown to weigh in the defendant’s favor (Balganesh et al.). Based on the Google Books precedent, it would not be surprising if the judges determined that, under the first fair use factor and policy reasons along the lines of “promoting technological developments”, the Defendants are not liable for copyright infringement (Lemley and Casey).
 
Changed:
<
<
As with a copying machine or scanner, the most feasible way in which OpenAI could have fed copyrighted works to its LLMs is by reproducing them in a tangible medium, which its language model then accessed to begin training. Taking these works and reproducing them (in whatever way they did it) without the copyright owners’ authorization clearly violates §106(1) of the Copyright Act. The actions filed by the Plaintiffs are all grounded on the alleged massive infringement of their reproduction right, and seek relief from OpenAI and Microsoft for this unauthorized use.
>
>
However, even if the courts do not find fair use and hold that OpenAI and Microsoft infringed, the Creative Commons BY-SA (Attribution-ShareAlike) license used by the Wikimedia Foundation could serve these big tech companies and give them sufficient data to train their models. This license allows anyone to use, share, and adapt Wikipedia content for any purpose, as long as they provide proper attribution to the original authors and release any derivative works under the same license (Creative Commons). This fosters collaboration, knowledge sharing, and the creation of derivative works while ensuring that the original creators receive credit for their contributions. Furthermore, this licensing scheme gives users access to millions of works that they can use without paying any compulsory license fee or other remuneration to their authors.
 
Changed:
<
<

FAIR USE – THE DEFENDANT’S SAFE HARBOR


Unless discovery in these proceedings proves otherwise, it seems fairly clear that in training their LLMs the Defendants contravened §106(1) of the Copyright Act. Nonetheless, US courts have often applied the fair use doctrine to excuse copyright infringement and shield defendants from liability under specific circumstances (especially when it comes to big and powerful tech companies, as happened in the Authors Guild v. Google case regarding the Google Books project).

Codified in §107 of the Copyright Act, the fair use doctrine sets out circumstances under which the violation of an exclusive right is not considered an infringement, enabling certain uses of copyrighted works for specific purposes. As established in the statute, the fair use defense is determined by taking into account four factors: (i) the purpose and character of the use, (ii) the nature of the copyrighted work, (iii) the amount and substantiality of the portion used in relation to the copyrighted work as a whole, and (iv) the effect of the use upon the potential market for or value of the copyrighted work (Henderson et al.). These four factors are non-exhaustive, and fair use is an equitable doctrine under which a judge may uphold the defense even if not all four factors are shown to weigh in the defendant’s favor (Balganesh et al.).

Over the years, case law has focused the first fair use factor on the “transformativeness” of the infringing work (Leval). In its recent decision in Warhol v. Goldsmith, the Supreme Court stated that the assessment of fair use should focus on whether the allegedly infringing work has a sufficiently transformative use and purpose vis-à-vis the copyrighted work.

In this case, deep pockets and sophisticated lawyering can persuade courts that the defendants have a safe harbor under the fair use doctrine. Based on the Google Books precedent, it would not be surprising if the judges determined that, under the first fair use factor and policy reasons along the lines of “promoting technological developments” (just as in the legendary Sony case), the Defendants are not liable for copyright infringement. This outcome, though far-fetched and a stretch of the fair use doctrine where the other factors weigh heavily against fair use, may be the only way in which generative AI systems will survive over time, even if their alleged genius is all based on false promises (Chomsky et al.). Otherwise, a ruling for the Plaintiffs would force compulsory licensing schemes for the training of LLMs, which could potentially cost billions and deter investment in their development.

The route to improvement here is to edit stringently the profuse introductory material. We need only a couple of sentences to understand the claim that training models on copyrighted material constitutes infringement.

But the legal analysis that follows needs to be expanded, or rather reconsidered, because it is wrong. Making a copy in the course of training might be infringement under non-US copyright law, but it isn't correct to say that transient copying in the course of otherwise permitted activity is infringement. Nor is the fair use analysis any good. You missed the issues. Given that you had a starting point in the op-ed Mishi and I wrote, I'm sure a second try, based on the Authors' Guild complaint and a look at Wikipedia's Creative Commons BY-SA license would get you much closer.

>
>
Moreover, these licenses could also serve the defendants as a defense in these cases, if they actually trained their LLMs on the millions of works available under Creative Commons BY-SA licenses. However, it will be interesting to see how ChatGPT and other gen AI systems comply with the “ShareAlike” portion of these licenses, since the burden users bear in exchange for benefitting from these free licenses is that the source code of the models must be made freely available to the public, under the same terms on which the Creative Commons BY-SA works are available (Moglen, supra).
 
You are entitled to restrict access to your paper if you want to. But we all derive immense benefit from reading one another's work, and I hope you won't feel the need unless the subject matter is personal and its disclosure would be harmful or undesirable.

PedroLondonoFirstPaper 2 - 20 Apr 2024 - Main.EbenMoglen
Line: 1 to 1
 
META TOPICPARENT name="FirstPaper"
Line: 13 to 13
 
In June 2018, OpenAI first introduced its generative AI LLM, called “Generative Pre-trained Transformer” (GPT). Its premise is not complex: users write prompts requesting something specific (summarizing a book, revising a text, drafting an email, etc.), and the language model delivers what the user requested in seconds, as an “output”. Though some argue that generative AI systems such as ChatGPT are a technological revolution and the result of the utmost technical advances, others contend that they simply put into practice data gathering, organizing and prediction processes that have been around for many years (Moglen and Choudhari).
Added:
>
>
Perhaps you should have checked how Mishi's name is spelled.

 In any case, the key element underlying these generative LLMs is their training process. Much as human beings learn throughout their lives, LLMs are trained on data and information provided to them, so that they can analyze, process and use that information in what is called the “machine learning” process.

INTELLECTUAL PROPERTY ISSUE

Line: 33 to 38
 In this case, deep pockets and sophisticated lawyering can persuade courts that the defendants have a safe harbor under the fair use doctrine. Based on the Google Books precedent, it would not be surprising if the judges determined that, under the first fair use factor and policy reasons along the lines of “promoting technological developments” (just as in the legendary Sony case), the Defendants are not liable for copyright infringement. This outcome, though far-fetched and a stretch of the fair use doctrine where the other factors weigh heavily against fair use, may be the only way in which generative AI systems will survive over time, even if their alleged genius is all based on false promises (Chomsky et al.). Otherwise, a ruling for the Plaintiffs would force compulsory licensing schemes for the training of LLMs, which could potentially cost billions and deter investment in their development.
Added:
>
>
The route to improvement here is to edit stringently the profuse introductory material. We need only a couple of sentences to understand the claim that training models on copyrighted material constitutes infringement.

But the legal analysis that follows needs to be expanded, or rather reconsidered, because it is wrong. Making a copy in the course of training might be infringement under non-US copyright law, but it isn't correct to say that transient copying in the course of otherwise permitted activity is infringement. Nor is the fair use analysis any good. You missed the issues. Given that you had a starting point in the op-ed Mishi and I wrote, I'm sure a second try, based on the Authors' Guild complaint and a look at Wikipedia's Creative Commons BY-SA license would get you much closer.

 
You are entitled to restrict access to your paper if you want to. But we all derive immense benefit from reading one another's work, and I hope you won't feel the need unless the subject matter is personal and its disclosure would be harmful or undesirable. To restrict access to your paper simply delete the "#" character on the next two lines:

PedroLondonoFirstPaper 1 - 29 Feb 2024 - Main.PedroLondono
Line: 1 to 1
Added:
>
>
META TOPICPARENT name="FirstPaper"

GENERATIVE AI MODELS: POTENTIAL MASSIVE COPYRIGHT INFRINGEMENT

-- By PedroLondono - 28 Feb 2024


In 2023, prominent generative AI developers were hit with three significant copyright infringement lawsuits. In September, the Authors Guild sued OpenAI for massive infringement of protected works in the training of its large language model (LLM), GPT. In November, the Authors Guild filed another class action, this time against OpenAI and Microsoft, on the same grounds as the first suit. In December, The New York Times filed a copyright infringement lawsuit against OpenAI and Microsoft. These lawsuits are likely to challenge the viability of generative AI models as we now know them. This essay explores the possible outcomes of these lawsuits, explains why the training processes of these LLMs have blatantly violated the copyright statute, and concludes that the only way out for these big AI developers is for judges to stretch the fair use doctrine.

GENERATIVE ARTIFICIAL INTELLIGENCE


In June 2018, OpenAI first introduced its generative AI LLM, called “Generative Pre-trained Transformer” (GPT). Its premise is not complex: users write prompts requesting something specific (summarizing a book, revising a text, drafting an email, etc.), and the language model delivers what the user requested in seconds, as an “output”. Though some argue that generative AI systems such as ChatGPT are a technological revolution and the result of the utmost technical advances, others contend that they simply put into practice data gathering, organizing and prediction processes that have been around for many years (Moglen and Choudhari).

In any case, the key element underlying these generative LLMs is their training process. Much as human beings learn throughout their lives, LLMs are trained on data and information provided to them, so that they can analyze, process and use that information in what is called the “machine learning” process.

INTELLECTUAL PROPERTY ISSUE


Copyright is a legal tool that promotes the progress of the arts by securing limited exclusive rights in an author’s work. Section 106 of the US Copyright Act sets out the exclusive rights to which the copyright owner is entitled. Chief among these is the right to reproduce the copyrighted work, or to authorize third parties to reproduce it.

Based on the information disclosed by OpenAI, its LLMs are trained on datasets comprising publicly available texts and information from the internet, licensed content from third parties, and user-generated information. However, based on the assertions made by the Plaintiffs (the Authors Guild and The New York Times) in their complaints, it is very likely that these datasets also included copyrighted works used without a license from their owners.

As with a copying machine or scanner, the most feasible way in which OpenAI could have fed copyrighted works to its LLMs is by reproducing them in a tangible medium, which its language model then accessed to begin training. Taking these works and reproducing them (in whatever way they did it) without the copyright owners’ authorization clearly violates §106(1) of the Copyright Act. The actions filed by the Plaintiffs are all grounded on the alleged massive infringement of their reproduction right, and seek relief from OpenAI and Microsoft for this unauthorized use.

FAIR USE – THE DEFENDANT’S SAFE HARBOR


Unless discovery in these proceedings proves otherwise, it seems fairly clear that in training their LLMs the Defendants contravened §106(1) of the Copyright Act. Nonetheless, US courts have often applied the fair use doctrine to excuse copyright infringement and shield defendants from liability under specific circumstances (especially when it comes to big and powerful tech companies, as happened in the Authors Guild v. Google case regarding the Google Books project).

Codified in §107 of the Copyright Act, the fair use doctrine sets out circumstances under which the violation of an exclusive right is not considered an infringement, enabling certain uses of copyrighted works for specific purposes. As established in the statute, the fair use defense is determined by taking into account four factors: (i) the purpose and character of the use, (ii) the nature of the copyrighted work, (iii) the amount and substantiality of the portion used in relation to the copyrighted work as a whole, and (iv) the effect of the use upon the potential market for or value of the copyrighted work (Henderson et al.). These four factors are non-exhaustive, and fair use is an equitable doctrine under which a judge may uphold the defense even if not all four factors are shown to weigh in the defendant’s favor (Balganesh et al.).

Over the years, case law has focused the first fair use factor on the “transformativeness” of the infringing work (Leval). In its recent decision in Warhol v. Goldsmith, the Supreme Court stated that the assessment of fair use should focus on whether the allegedly infringing work has a sufficiently transformative use and purpose vis-à-vis the copyrighted work.

In this case, deep pockets and sophisticated lawyering can persuade courts that the defendants have a safe harbor under the fair use doctrine. Based on the Google Books precedent, it would not be surprising if the judges determined that, under the first fair use factor and policy reasons along the lines of “promoting technological developments” (just as in the legendary Sony case), the Defendants are not liable for copyright infringement. This outcome, though far-fetched and a stretch of the fair use doctrine where the other factors weigh heavily against fair use, may be the only way in which generative AI systems will survive over time, even if their alleged genius is all based on false promises (Chomsky et al.). Otherwise, a ruling for the Plaintiffs would force compulsory licensing schemes for the training of LLMs, which could potentially cost billions and deter investment in their development.


You are entitled to restrict access to your paper if you want to. But we all derive immense benefit from reading one another's work, and I hope you won't feel the need unless the subject matter is personal and its disclosure would be harmful or undesirable. To restrict access to your paper simply delete the "#" character on the next two lines:

Note: TWiki has strict formatting rules for preference declarations. Make sure you preserve the three spaces, asterisk, and extra space at the beginning of these lines. If you wish to give access to any other users simply add them to the comma separated ALLOWTOPICVIEW list.


Revision 3r3 - 06 May 2024 - 17:51:39 - PedroLondono
Revision 2r2 - 20 Apr 2024 - 15:15:50 - EbenMoglen
Revision 1r1 - 29 Feb 2024 - 00:13:33 - PedroLondono
All material on this collaboration platform is the property of the contributing authors.
All material marked as authored by Eben Moglen is available under the license terms CC-BY-SA version 4.