A recent class-action lawsuit might compel OpenAI to reveal the highly secret corpus of datasets on which its large language models are trained. But the process of getting there isn’t going to be easy.
[Editor’s Note: This article first appeared in the July 21, 2023 issue of LegalTech News from Law.com, a publication of American Lawyer Media.]
Earlier this month, three authors, including comedian Sarah Silverman, sued OpenAI and Meta over dual claims of copyright infringement, alleging that the companies’ generative AI software scraped the authors’ data without consent. The lawsuit came after Silverman’s legal team asked ChatGPT to summarize excerpts of her book “The Bedwetter,” which the chatbot successfully did. Then, it was prompted to reproduce the copyright management information that went along with the published work, which it failed to do.
Until now, a significant burden in most copyright infringement cases against generative AI tools—largely from artists—has been the task of proving the model, also known as a large language model (LLM), was trained on a specific work.
By ChatGPT’s own admission, that challenge is somewhat overcome in the latest class action against OpenAI.
However, for IP owners looking to navigate the advent of generative AI tools, the development raises the question: Do the datasets that LLMs are trained on qualify for trade secret protections? And if so, what challenges might that bring to the discovery process?
For e-discovery experts and attorneys, the authors’ suits have the potential to complicate discovery. But if successful, they also might create new guidelines for how LLMs are dealt with in the discovery process, they told Legaltech News.
Mary Mack, CEO and chief legal technologist at the Electronic Discovery Reference Model (EDRM), said she believes ChatGPT’s responses summarizing Silverman’s book are integral to the case, and are likely enough to “bring [OpenAI] to court.”