Marathe: Generative AI is Trained on Online E-discovery Resources. Here’s What That Means for E-discovery

The e-discovery resources included in Google’s C4 dataset may be best suited to content generation rather than to use as research tools.

[Editor’s Note: Legaltech News’ Isha Marathe wrote an analysis of the eDiscovery resources ingested by ChatGPT, highlighting Rob Robinson, Dr. Maura R. Grossman and Ralph Losey.]

(Image Credit: local_doctor/Adobe Stock)

Even for the experts, there is much left to learn about generative artificial intelligence, primarily because it is so new and evolving so fast.

However, there is at least one point of unanimous agreement across all sectors when it comes to the technology: What comes out depends strongly on what goes in.

And while the black box where the magic happens (i.e., the machine’s reasoning and decision-making process) has yet to be cracked, more information about the “what goes in,” or the input, is beginning to emerge.

In the e-discovery sector specifically, Rob Robinson, the managing director of ComplexDiscovery, outlined a list of 55 e-discovery-centric resource domains that are included in Google’s Colossal Clean Crawled Corpus (C4) dataset. The C4 dataset is a large collection of Web pages crawled (that is, analyzed and indexed) by the Common Crawl project, and it serves as a bedrock of information for training large language models (LLMs) like OpenAI’s GPT models, Microsoft’s Bing chatbot and Google’s Bard, to name a few.

The inclusion of e-discovery resources in the C4 dataset attracted substantial attention from the industry. Indeed, the C4 dataset lists popular online e-discovery blogs, training resources, federal government advisories pertinent to the sector and even Legaltech News pages, among many other resources.

Ultimately, the benefit the e-discovery industry will derive from this powerful technology will depend, to a large degree, on the data included in the C4 project. So, what do e-discovery professionals think of the 55 e-discovery-specific entries Robinson noted?

For one, the C4 data proves that today’s generative AI is limited to being an e-discovery content generation tool, and not a research tool. What the future holds, though, is anyone’s guess.

It’s All About the Lingo

(Read the rest of the article here.)

Author

Mary Mack

Mary Mack is the CEO and Chief Legal Technologist for EDRM. For 10 years, Mary was the co-editor of the Thomson Reuters West treatise eDiscovery for Corporate Counsel, and she is the co-author of A Process of Illumination: The Practical Guide to Electronic Discovery. She holds the CISSP among her certifications.
