The e-discovery resources included in Google’s C4 dataset may be best confined to content generation, as opposed to serving as research tools.
[Editor’s Note: Legaltech News’ Isha Marathe wrote an analysis of the eDiscovery resources ingested by ChatGPT, highlighting Rob Robinson, Dr. Maura R. Grossman and Ralph Losey.]
Even for the experts, there is much left to learn about generative artificial intelligence, primarily by virtue of its being so new and so fast-evolving.
However, there’s at least one point of agreement across all sectors when it comes to the technology: What comes out depends strongly on what goes in.
And while the black box where the magic happens (i.e., the machine’s reasoning and decision-making process) has yet to be cracked, more information about the “what goes in,” or the input, is beginning to emerge.
In the e-discovery sector specifically, Rob Robinson, managing director of ComplexDiscovery, outlined a list of 55 e-discovery-centric resource domains that are included in Google’s Colossal Clean Crawled Corpus (C4) dataset. The C4 dataset is a large collection of Web pages crawled (that is, analyzed and indexed) by the Common Crawl project, and it serves as a bedrock of information for training large language models (LLMs) like OpenAI’s GPT models, Microsoft’s Bing chatbot and Google’s Bard, to name a few.
The inclusion of e-discovery resources in the C4 dataset attracted substantial attention from the industry. Indeed, the C4 dataset lists popular online e-discovery blogs, training resources, federal government advisories pertinent to the sector and even Legaltech News pages, among many other resources.
Ultimately, the benefit the e-discovery industry will derive from this powerful technology will depend, to a large degree, on the data included in the C4 project. So, what do e-discovery professionals think of the 55 e-discovery-specific entries Robinson noted?
For one, the C4 data proves that today’s generative AI is limited to being an e-discovery content generation tool, and not a research tool. What the future holds, though, is anyone’s guess.