In OpenAI Copyright Lawsuits, Discovery Complications Likely to Take Center Stage

A recent class-action lawsuit might compel OpenAI to reveal the highly secret corpus of datasets its large language models are trained on. But the process of getting there isn’t going to be easy.

[Editor’s Note: This article first appeared in the July 21, 2023 issue of Legaltech News, a publication of American Lawyer Media.]

By Isha Marathe
Image: Kaylee Walstad, EDRM

Earlier this month, three authors, including comedian Sarah Silverman, sued OpenAI and Meta over dual claims of copyright infringement, alleging that the companies’ generative AI software scraped their data without consent. The lawsuit came after Silverman’s legal team asked ChatGPT to summarize excerpts of her book “The Bedwetter,” which the chatbot successfully did. Then it was prompted to reproduce the copyright management information that went along with the published work, which it failed to do.

Until now, a significant burden in most copyright infringement cases against generative AI tools—largely from artists—has been the task of proving the model, also known as a large language model (LLM), was trained on a specific work.

By ChatGPT’s own admission, that challenge is somewhat overcome in the latest class action against OpenAI.

However, for IP owners looking to navigate the advent of generative AI tools, the development raises the question: do the datasets LLMs are trained on qualify for trade secret protections? And if so, what challenges might that bring to the discovery process?

For e-discovery experts and attorneys, the authors’ suits have the potential to complicate discovery. But if successful, they also might create new guidelines for how LLMs are dealt with in the discovery process, they told Legaltech News.

Mary Mack, CEO and chief legal technologist at the Electronic Discovery Reference Model (EDRM), said she believes ChatGPT’s responses summarizing Silverman’s book are integral to the case, and are likely enough to “bring [OpenAI] to court.”

“I do think that they are able to demonstrate that their work is in the corpus not only crawled, but it was also indexed and put into the tool. And, it looks like they didn’t bring the copyrighted metadata with [the data itself].”

Mary Mack, EDRM CEO and chief legal technologist

Read the entire article here.


  • Isha Marathe

    Isha is a New York-based legal technology reporter for Legaltech News, covering developments in privacy law, e-discovery and cybersecurity. Meanwhile, she is attempting to hit every hole-in-the-wall restaurant from Brooklyn to at least south of Houston.
