Tech companies battle content creators over use of copyrighted material to train AI models

Publishers want action on unauthorized use of their work, but tech firms say training on the material doesn't violate copyright

The ChatGPT app is seen on an iPhone in New York, Thursday, May 18, 2023. Some creators and AI companies are at odds over the use of copyrighted material to train AI models. (Richard Drew/The Associated Press)

Canadian creators and publishers want the government to do something about the unauthorized and usually unreported use of their content to train generative artificial intelligence systems.

But AI companies maintain that using the material to train their systems doesn't violate copyright, and say limiting its use would stymie the development of AI in Canada.

The two sides are making their cases in recently published submissions to a consultation on copyright and AI being undertaken by the federal government as it considers how Canada's copyright laws should address the emergence of generative AI systems like OpenAI's ChatGPT.

Generative AI can create text, images, videos and computer code based on a simple prompt, but to do that, the systems must first study vast amounts of existing content.

In its submission to the government, Access Copyright argued that most, and potentially all, large language models "are currently profiting from unauthorized use and reproduction of copyright protected works."

Creators angry over copyright 'black box'

That use is taking place in a "black box," according to Access Copyright, which represents writers, visual artists and publishers.

"Rightsholders know it is happening, but due to the information asymmetry between themselves and AI platforms, they cannot determine who is conducting the activity, with whose works, and have no mechanism to stop it from happening." 

Music Canada, which represents the country's major record labels, said a fake AI-generated song that mimicked the voices of Drake and The Weeknd last year "made one thing abundantly clear: AI models and systems have already ingested massive amounts of proprietary datasets without authorization from the source of the data or rightsholders."

The Writers' Guild of Canada asked the government to start by implementing basic disclosure and reporting obligations. It said developers know exactly which works are being mined and how they are being used, while creators have none of that information.

Some organizations have signed licensing deals with AI companies. But the Canadian Authors Association said rightsholders face "immense obstacles" in licensing their content "because they are being kept in the dark as to which of their works are being used" by which companies.

It asked Canada to clarify that text and data mining are subject to copyright laws.

Numerous lawsuits are underway in the United States over the use of copyrighted materials by generative AI systems, including one launched this week by the world's biggest record labels against two AI music generators.

Disparity in information a problem, artists say

The Canadian Media Producers Association said legal cases illustrate the problem posed by a lack of transparency, citing one case in which the AI company argued the rightsholder couldn't proceed with the infringement allegation unless they could specify the exact work used for training.

"Rightsholders will also undoubtedly face similar evidentiary issues as many datasets used to train Generative AI systems are purportedly destroyed after the initial training is complete," it said.

The group said it's an issue that "demands immediate attention" and asked the government to implement transparency requirements.

But AI companies maintain the kind of transparency rightsholders are asking for isn't realistic.

Microsoft told the government that training large-scale AI systems involves "vast volumes" of data, and that companies shouldn't have to keep records of that data or disclose the content used for training.

"It would not be feasible to record such information and any such requirement would inhibit AI development," it said.

The company argued it is not "copyright infringement to analyze works and learn concepts and facts."

Google said AI training is already exempted under existing copyright law, but that the government should adopt an exemption to make that explicit.

Google said requiring permission to use content for training purposes would expose competitively sensitive information and "would effectively block the development and use of large language models and other types of cutting-edge AI."

It also said AI developers don't have access to accurate information about copyright status. 

"In fact, there is no such source of truth anywhere in the world. Thus, complying with disclosure rules may simply prove impossible from the start." 

Canadian AI company Cohere said using content for training AI systems works similarly to how an individual reads books to become more informed.

The company said the process doesn't violate copyright, and argued that needs to be clear in the law. Otherwise, "Canada's ambitions to be the home of world-leading AI companies and ecosystems" could be undermined.

The Council of Canadian Innovators, which represents the Canadian tech sector, said disclosure requirements would harm smaller companies rather than their Big Tech rivals. It warned this would "seriously hamper the potential of Canadian companies to scale significantly."