Last week, Australia’s Productivity Commission released an interim report that leans towards recommending a text and data mining (TDM) exception to copyright infringement, similar to the EU TDM exceptions – potentially allowing AI to be trained on copyright works without compensation to creators.

Yet a recent report commissioned by the EU Parliament’s Committee on Legal Affairs concludes:

The current EU text and data mining exception was not designed to accommodate the expressive and synthetic nature of generative AI training and its application to such systems risks distorting the purpose and limits of EU copyright exceptions.

The EU Parliamentary Report observes that while copyright law has proved durable and adaptable in the face of technological change (for example, the introduction of the photocopier), “generative AI presents an unprecedented test of scale, opacity and economic impact”. The report cautions that “[w]ithout timely reform, the EU risks legal uncertainty, market concentration and cultural homogenisation”.

To TDM or not to TDM?

The report notes that copyright works are relevantly reproduced in the process of creating an AI training corpus and training the models – raising the key question whether such reproduction is permitted under a copyright exception.

The EU’s 2019 Directive on Copyright and related rights in the Digital Single Market (CDSM Directive) introduced two TDM exceptions. The exception most relevant to AI permits anyone to use copyright works in TDM without the creator’s permission or compensation, provided the rightsholder has not expressly ‘opted out’, such as by using machine-readable means.

The report argues that this exception was not designed to permit the free use of copyright material in AI training, for the following reasons.

First, the original intent of the TDM exceptions was to promote data-driven research and innovation, not to allow commercial use of content at the scale and complexity, or with the technological architecture, of modern AI training. The exceptions also distinguish between commercial and non-commercial TDM, a distinction which fails to accommodate the many hybrid public-private R&D models in the AI industry with appropriate legal certainty.

Second, ‘traditional’ TDM, as contemplated by the current copyright exceptions, involves automated analytical techniques used to extract patterns, trends or correlations from large datasets. While AI training also involves identifying patterns in the training data, the purpose is to encode those patterns to synthesise new outputs which reproduce the style, structure and composition of the training data, enabling outputs that can closely resemble original creative works. In short, ‘traditional’ TDM finds existing patterns, while AI synthesises new expressions – AI is not mining but (absent a licence) is likely engaged in infringing reproduction.

The report gives the following example:

If an AI tool is applied to restore or reconstruct a damaged artwork (for example, in a cultural heritage context), the goal is knowledge extraction. But if the same system is used to produce new commercial artworks in the style of a known artist, it may involve appropriation of protected material. In both cases, reproduction occurs – but only in the first case might that reproduction fall within the scope of the TDM exception due to its strictly analytical objective.

Third, widespread reliance by commercial AI developers on TDM is inconsistent with both the restrictive approach taken by the CJEU to copyright exceptions and the three-step test. The three-step test is the fundamental principle for all exceptions and limitations to copyright established by the Berne Convention and codified in the TRIPS Agreement and InfoSoc Directive. It requires that any exception to creators’ rights:

  • Applies only to certain special cases: by contrast, “large-scale ingestion of expressive works for AI training is no longer a special case – it is becoming a systematic industry practice”.

  • Does not conflict with the normal exploitation of the work: the report says the ability of generative models to replicate the style, structure or substance of protected works means that the AI output, even if it is not a literal copy, serves as a functional equivalent, fulfilling the same user demand as the licensed channels from which the creator would have derived income.

  • Does not unreasonably prejudice the legitimate interests of the rightsholder: the scale of the free use by for-profit AI companies unreasonably prejudices creators’ legitimate interests, including because it often occurs without transparency or consent.

Fourth, the report argues that the obligation for creators to opt out if they wish to avoid the free use of their works for commercial purposes under the relevant TDM exception is ineffective and inappropriate. While robots.txt provides a means by which creators can instruct bots not to copy their works, the standard is not universally recognised or complied with by those scraping the internet for vast volumes of data (although the recent EU General Purpose AI Code requires that signatories comply with widely used machine-readable instructions from creators).
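By way of illustration (this example is ours, not the report’s), a rightsholder hosting their own work might express a machine-readable opt-out in a robots.txt file along the following lines. GPTBot and CCBot are the user-agent tokens used by OpenAI’s and Common Crawl’s scrapers respectively; as the report stresses, honouring these directives is entirely voluntary.

```
# Hypothetical robots.txt expressing a TDM opt-out against AI crawlers.
# The named user-agents are real crawler tokens, but compliance is
# voluntary -- robots.txt is an instruction, not a lock.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Ordinary crawlers (e.g. search indexing) remain welcome.
User-agent: *
Allow: /
```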

In any event, a lot of protected content is not uploaded by the creators themselves but by third parties, denying many creators the opportunity to mark the content with their opt-out decision. Opting out also typically means less visibility, which imposes a binary choice on creators between visibility and unremunerated copying, and excludes legitimate intermediate positions such as permitting citation and reference without allowing replication in training. There is a disadvantage for AI developers too: the opt-out mechanism may render training datasets less useful, because the material will be incomplete.

The report sets out more fundamental objections to an approach which requires creators to take action to protect their rights:

Under the Berne Convention, the enjoyment and exercise of copyright shall not be subject to any formality. A system that places the burden on authors to actively reserve their rights – using machine-readable opt-outs or technical protocols – risks conflicting with this foundational principle of international copyright law. Moreover, the current opt-out regime presupposes a level of technical literacy, awareness and infrastructural capacity that many small creators do not possess.

The report recommends pivoting to a permissions-based ‘opt-in’ regime for TDM, which would be more consistent with the exclusive rights structure of copyright law.

Is there a reproduction in the AI model itself?

The report argues that, even if the TDM exception allows copying for training, the end result of the training process is that a ‘reproduction’ of the creative works is captured within the AI model itself – a reproduction which is itself a copyright infringement not covered by the EU TDM exception:

During training, the entirety of the content, including stylistic and structural elements, is encoded in what is called a vector space, a kind of compressed internal representation that allows the model to later compute new outputs that echo the original content. These internal vector mappings do not merely analyse data – they encode it in a way that facilitates synthetic reproduction. In copyright terms, this represents a form of reproduction, not just analysis.

These representations of the original creative expression stored in the model’s weights are not human-readable. But the report points out that under EU copyright law, reproduction does not require human readability or pixel-perfect duplication: it is sufficient that the act of reproduction enables subsequent outputs to exploit the expressive content of the original work.
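The point can be made concrete with a deliberately trivial sketch (ours, not the report’s). Below, a toy character-level bigram model stands in for the ‘vector space’ the report describes: its parameters are just a matrix of transition counts, unreadable as text, yet they suffice to regenerate output closely tracking the training passage.

```python
# A toy stand-in (our illustration) for the report's point that model
# parameters can encode expressive content, not merely analyse it.
import numpy as np

training_text = "the quick brown fox jumps over the lazy dog"
chars = sorted(set(training_text))
idx = {c: i for i, c in enumerate(chars)}

# "Training": the text's sequential structure is encoded into a
# parameter matrix -- not human-readable as text.
weights = np.zeros((len(chars), len(chars)))
for a, b in zip(training_text, training_text[1:]):
    weights[idx[a], idx[b]] += 1

# "Generation": the parameters alone can walk back out expression that
# closely tracks the original, without consulting the training text.
rng = np.random.default_rng(0)
c, out = "t", ["t"]
for _ in range(60):
    row = weights[idx[c]]
    if row.sum() == 0:  # dead end (the final character of the text)
        break
    c = chars[rng.choice(len(chars), p=row / row.sum())]
    out.append(c)
print("".join(out))
```

A real model’s high-dimensional weights are vastly more capable than this counting matrix, but the structural point is the same: the internal representation enables regeneration, which is what the report says takes the activity beyond mere analysis.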

Similar arguments were made in the US Copyright Office report on AI training.

For these reasons, the report also reaches the same view as the US Copyright Office on one of the key debates in this area – rejecting the argument that the use of training material by the AI model is similar to human learning and therefore non-infringing.

Transparency is not enough

The EU’s AI Act introduces a requirement for providers of general-purpose AI models to “draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office”.

The report says that while transparency may appear to empower creators, “a closer analysis reveals that this approach is structurally inadequate and fails to meaningfully address the real obstacles faced by individual creators”. A creator must still navigate a cumbersome mechanism merely to discover that their work has already been incorporated into a training dataset. Disclosure alone does not address the underlying issues of legal uncertainty, lawful use and equitable remuneration.

The report also notes that transparency may ultimately disadvantage developers if it leads to substantially higher TDM opt-out rates. Given the vast volumes of data required for training AI, developers would face either negotiating many individual licences – delaying development – or relying on a substantially reduced training dataset, impacting the model’s capabilities.

Alternative solutions to balancing developers’ and creators’ concerns

The debate over AI training and copyright often conflates two separate questions:

  • Permission: should AI developers need a licence or consent from a creator to use their work in training? As noted, securing individual clearances from creators at scale would be a challenge. The report notes that creators’ moral rights are also emerging as a regulatory pressure point in this context.

  • Compensation: should AI developers pay compensation to creators for use of their works in AI training? Creators point to the structural ‘value gap’ that has emerged between the commercial benefits accrued by AI developers and the lack of financial return for the human creators whose works underpin these systems. This echoes the ‘value gap’ debate in the digital platform context, “where creators provide the raw materials, but intermediaries capture the economic value”.

These two questions can be addressed in different combinations. Reliance on the current EU TDM exception is a “without permission/without compensation” approach that is also fraught with legal uncertainty for the reasons mentioned above. Statutory licences which authorise use of copyright works, such as by governments or schools, are “without permission/with compensation” because the work can be used without individually negotiated licences but a fee is still payable to rightsholders, often determined by a court or tribunal at a fair level. Voluntary licensing by individual creators or extended collective licensing via collecting societies are “with permission/with compensation”.

The report says that the current general TDM exception presents creators with an “all or nothing” choice:

Under the current framework, authors face a stark binary choice: they may deploy technological protection measures to ‘opt out’ entirely – thus sterilising their works from inclusion in any automated analysis – or passively allow unfettered use of their creations, with no right to be informed or compensated when their labour fuels multimillion-dollar AI products. There is no intermediary route by which a creator can expressly grant permission for AI training while negotiating fair payment or attribution. In economic terms, this legal vacuum is compounded by a profound asymmetry in bargaining power.

The report notes that there is an emerging trend of private licensing agreements between AI developers and content providers, driven in part by the legal uncertainty surrounding the application of the TDM exception. This “with permission/with compensation” approach, however, is likely to be feasible only for the larger publishers and creators, and risks leaving less commercially visible works unlicensed and unremunerated.

The report outlines several alternative “without permission/with compensation” approaches:

  • Statutory licence for machine-learning purposes: any commercial use of copyright works to train generative AI would automatically trigger a mandatory licence, addressing the AI developers’ concerns about the burden of negotiating licences, coupled with an unwaivable right to economic remuneration for rightsholders. Remuneration rates would be negotiated through collective bargaining by collecting societies or set by a regulator or tribunal. Collecting societies would distribute the aggregated proceeds to individual rightsholders in accordance with the type of allocation mechanisms they use today for statutory licence revenue.

  • An output-oriented ‘AI levy’ on providers of generative systems: any commercial AI service whose outputs reach a threshold of human-like substitutability would pay a lump-sum levy, calculated, for example, as a percentage of turnover, user subscriptions or volume of generated content. The pooled funds would be distributed by collecting societies to support authors’ livelihoods, finance training programmes and underwrite new creative projects (a toy calculation is sketched below).
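The mechanics of both proposals reduce to familiar collecting-society arithmetic. A minimal sketch follows, assuming an invented 3% turnover rate and invented usage shares (the report specifies neither rates nor allocation keys):

```python
# Hypothetical figures throughout: the report names possible levy bases
# (turnover, subscriptions, output volume) but prescribes no rates.

def ai_levy(annual_turnover_eur: float, rate: float = 0.03) -> float:
    """Lump-sum levy calculated as a percentage of provider turnover."""
    return annual_turnover_eur * rate

def distribute(pool: float, shares: dict[str, float]) -> dict[str, float]:
    """Pro-rata distribution by a collecting society, mirroring the
    allocation mechanisms used today for statutory licence revenue."""
    total = sum(shares.values())
    return {creator: pool * w / total for creator, w in shares.items()}

pool = ai_levy(500_000_000)  # EUR 500m turnover -> EUR 15m pool
print(distribute(pool, {"author_a": 2.0, "author_b": 1.0, "author_c": 1.0}))
# {'author_a': 7500000.0, 'author_b': 3750000.0, 'author_c': 3750000.0}
```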

While the report acknowledges that there are challenges with these approaches and makes no conclusive recommendations of its own, it argues that the building blocks of a fairer balance between developers and creators lie within familiar concepts in copyright, such as use of collecting societies: “the core challenge is not to reinvent copyright, but to preserve its integrity through principled evolution”. Copyright law should be understood “not as a barrier to innovation, but as a vehicle for ensuring that innovation remains ethically grounded and socially legitimate”.

Conclusion

The report’s strongest plea is for AI developers and creators to recognise that they need each other:

The large-scale, uncompensated use of human literary and artistic works in AI training risks eroding the right to fair remuneration – an essential mechanism for sustaining creative labour in the digital era. Fair compensation is not only a matter of distributive justice, but also of safeguarding the long-term vitality of human expression, including the forms of creativity that may eventually be enhanced through AI-assisted tools.