Marathe: Generative AI is Trained on Online E-discovery Resources. Here’s What That Means for E-discovery

The e-discovery resources included in Google’s C4 dataset may be best confined to content generation purposes, as opposed to serving as research tools.

[Editor’s Note: Legaltech News’ Isha Marathe wrote an analysis of the eDiscovery resources ingested by ChatGPT, highlighting Rob Robinson, Dr. Maura R. Grossman and Ralph Losey.]

Even for the experts, there is much left to learn about generative artificial intelligence, primarily by virtue of its being so new and so fast-evolving.

However, there’s at least one point of unanimous agreement across all sectors when it comes to the technology: What comes out depends strongly on what goes in.

And while the black box where the magic happens (i.e., the machine’s reasoning and decision-making process) has yet to be cracked, more information about the “what goes in,” or the input, is beginning to emerge.

In the e-discovery sector specifically, Rob Robinson, the managing director of ComplexDiscovery, outlined a list of 55 e-discovery-centric resource domains that are included in Google’s Colossal Clean Crawled Corpus (C4) dataset. The C4 dataset is a large collection of web pages drawn from crawls (that is, automated scans and indexing of the web) by the Common Crawl project, and it serves as a vital bedrock of information to train large language models (LLMs) like OpenAI’s GPT models, Microsoft’s Bing chatbot and Google’s Bard, to name a few.

The inclusion of e-discovery resources in the C4 dataset attracted substantial attention from the industry. Indeed, the dataset lists popular online e-discovery blogs, training resources, federal government advisories pertinent to the sector and even Legaltech News pages, among many other resources.

Ultimately, the benefit the e-discovery industry will derive from this powerful technology will depend, to a large degree, on the data included in the C4 project. So, what do e-discovery professionals think of the 55 e-discovery-specific entries Robinson noted?

For one, the C4 data proves that today’s generative AI is limited to being an e-discovery content generation tool, and not a research tool. What the future holds, though, is anyone’s guess.

It’s All About the Lingo (read the rest of the article here.)

Author

  • Mary Mack

    Mary Mack is the CEO and Chief Legal Technologist for EDRM. Mary was the co-editor of the Thomson Reuters West Treatise, eDiscovery for Corporate Counsel for 10 years and the co-author of A Process of Illumination: the Practical Guide to Electronic Discovery. She holds the CISSP among her certifications.