Exploring the Inclusion of eDiscovery-Centric Resources in the Google C4 Dataset: A Highly Selective Search

[EDRM Editor’s Note: This article was first published here on April 26, 2023 and EDRM is grateful to Rob Robinson, editor and managing director of ComplexDiscovery, for permission to republish.]

[ComplexDiscovery Editor’s Background Note: The impact of organizations and entities on the output from Large Language Models (LLMs) can be more significant than one might initially anticipate. In some instances, specific resources within an industry can considerably influence how LLMs process and respond to information. One example of this influence can be observed by examining the Google C4 Dataset and searching for a non-comprehensive selection of domains from 55 eDiscovery-centric websites. While this exploration only offers a snapshot of selected resources from a non-all-inclusive list, it may provide valuable context for those evaluating the resource impact on LLMs and also highlight some tools that can help better understand the content populating LLMs. This deeper understanding can, in turn, contribute to shedding light on how selected eDiscovery resources may play a substantial role in shaping the knowledge and responses generated by LLMs – a role much more significant (or less important) than one might think.]

Industry Backgrounder

ComplexDiscovery*

Large language models, such as those developed by Google and OpenAI, are becoming increasingly sophisticated and pervasive across industries. One area where these models are being applied is the eDiscovery ecosystem, which contains touchpoints ranging from cybersecurity and information governance to legal discovery. This article explores, at a very high level, the inclusion of selected eDiscovery-centric resources in the Google C4 Dataset. It also discusses why this exploration may benefit professionals working in the eDiscovery ecosystem.

Google’s C4 Dataset and its Relevance to eDiscovery

Understanding the Google C4 Dataset

Google’s C4 (Colossal Clean Crawled Corpus) project aims to create a large, diverse dataset for training large language models. The dataset is built from web pages crawled by the Common Crawl project and then filtered and cleaned, with the standard version retaining primarily English-language text. Google’s C4 Dataset serves as a foundation for developing more accurate and sophisticated language models that can understand and generate human-like text.

The C4 Dataset from Google contains approximately 750 GB of cleaned text data derived from Common Crawl web pages. This large-scale dataset is used for training and improving large language models, such as Google’s T5 family of models.

Common Crawl is a nonprofit initiative that crawls and archives publicly available web content and makes the results openly available. This vast repository of web-crawled data is invaluable for training large language models, as it provides a diverse and extensive source of text in multiple languages. Common Crawl supplies the raw material for the C4 Dataset, which makes it central to the dataset's quality and usefulness for AI research.

The role of large language models in eDiscovery

Large language models can potentially revolutionize the eDiscovery process by automating tasks ranging from document review to review reporting. These models can analyze vast amounts of data quickly and efficiently, identify relevant information, and generate insightful summaries or responses. As a result, they can save time, reduce costs, and improve the accuracy of eDiscovery outcomes.

Inclusion of eDiscovery-centric resources in the C4 Dataset

The presence of eDiscovery resources in the C4 Dataset is crucial for ensuring the accuracy and relevance of large language model outputs in the eDiscovery context. By training on high-quality eDiscovery resources, the models can better understand the domain-specific language, concepts, and best practices, leading to more reliable and valuable results for eDiscovery professionals.

ComplexDiscovery’s Non-Comprehensive List of eDiscovery Resources and Its Significance

Introduction to ComplexDiscovery’s resource listing

On March 9, 2023, ComplexDiscovery published a non-comprehensive list of potentially helpful eDiscovery-centric resources. The list, which spans resources ranging from analyst and research firms to industry associations and blogs, was designed to serve as a simple starting point for individuals seeking information related to eDiscovery.

Selection of resources from ComplexDiscovery’s list for analysis

Given the manageable size of this resource listing and the direct or indirect relevance of each listed resource to the eDiscovery ecosystem, ComplexDiscovery created a truncated listing from an initial grouping of 100+ resources and used the top-level domain names of those resources to search the C4 Dataset. The truncation removed duplicate entries for multiple resources hosted on the same domain, as well as resources not available at the time of the Google C4 Dataset snapshot, and resulted in a list of 55 resource domains.
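The domain reduction described above can be sketched in a few lines of Python. This is a minimal illustration only, not the process ComplexDiscovery actually used: the function names `to_domain` and `unique_domains` are hypothetical, and the URL paths in the sample list are invented for the example. The sketch reduces each resource URL to its host name and keeps one entry per domain.

```python
from urllib.parse import urlparse


def to_domain(url: str) -> str:
    """Return the lowercase host of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def unique_domains(urls):
    """Collapse resource URLs to unique domains, keeping first-seen order."""
    seen = {}
    for url in urls:
        # setdefault keeps the first URL seen for each domain.
        seen.setdefault(to_domain(url), url)
    return list(seen)


# Hypothetical resource URLs; the paths are invented for illustration.
resources = [
    "https://www.jdsupra.com/topics/ediscovery/",
    "https://www.jdsupra.com/legalnews/",
    "https://edrm.net/resources/",
    "https://complexdiscovery.com/",
]

domains = unique_domains(resources)
print(domains)
```

Because the sketch keeps the full host name, a resource hosted on a subdomain (for example, ENISA.Europa.eu) remains distinct from other resources under the same parent domain.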

Top-level domain names search against the C4 Dataset

The objective of searching the top-level domain names of the selected resources within the C4 Dataset was to explore how a very targeted snapshot of eDiscovery resources might be represented in the C4 Dataset. This information on the representation of selected resources may help gauge how these resources are being used to train Google’s large language models in responding to inquiries and prompts related to eDiscovery.

The results of top-level domain name searches of 55 eDiscovery-centric resources are provided in the following table, as extracted from the C4 Dataset search capability resource featured in the Washington Post article titled “Inside the Secret List of Websites That Make AI Like ChatGPT Sound Smart.” The data is reported based on database rank, tokens, and the percentage of all tokens. The aggregated results for the selected resources below showcase the prevalence of content from these resources in the C4 Dataset.

Table: Selected eDiscovery Resources and the C4 Dataset

Resource Category (ComplexDiscovery)	Resource	Domain Searched	Rank	Tokens (Rounded)	Percent of All Tokens
Analyst, Research, and Review Firms	G2	G2.com	152	16,000,000	0.01%
Analyst, Research, and Review Firms	Capterra	Capterra.com	216	13,000,000	0.008%
News, Announcement, and Commentary Resources	Lexology	Lexology.com	519	8,100,000	0.005%
Analyst, Research, and Review Firms	Software Advice	SoftwareAdvice.com	730	6,300,000	0.004%
Associations, Consortiums, and Groups	IAPP (International Association of Privacy Professionals)	IAPP.org	5,236	1,900,000	0.001%
News, Announcement, and Commentary Resources	JD Supra	JDSupra.com	5,274	1,800,000	0.001%
News, Announcement, and Commentary Resources	Legaltech News	Law.com	5,898	1,700,000	0.001%
Information and Research Resources	NIST (National Institute of Standards and Technology)	NIST.gov	5,920	1,700,000	0.001%
Analyst, Research, and Review Firms	TrustRadius	TrustRadius.com	6,958	1,500,000	0.001%
Information and Research Resources	Cybersecurity Legal Task Force (American Bar Association)	AmericanBar.org	8,266	1,300,000	0.0009%
Information and Research Resources	FTC Premerger Notification Program (Federal Trade Commission)	FTC.gov	10,959	1,100,000	0.0007%
Analyst, Research, and Review Firms	Gartner	Gartner.com	19,166	720,000	0.0005%
Industry Blogs	eDiscovery Team (Ralph Losey)	E-DiscoveryTeam.com	29,362	530,000	0.0003%
Analyst, Research, and Review Firms	IDC	IDC.com	41,812	400,000	0.0003%
Analyst, Research, and Review Firms	Forrester	Forrester.com	42,218	400,000	0.0003%
News, Announcement, and Commentary Resources	LawSites	LawSitesblog.com	63,769	290,000	0.0002%
Analyst, Research, and Review Firms	Chambers and Partners	Chambers.com	77,729	250,000	0.0002%
Industry Blogs	Artificial Lawyer (Richard Tromans)	ArtificialLawyer.com	85,162	230,000	0.0001%
Educational Training and Resources	E-Discovery Team Training	e-DiscoveryTeamTraining.com	93,748	210,000	0.0001%
News, Announcement, and Commentary Resources	LexBlog	LexBlog.com	110,534	180,000	0.0001%
News, Announcement, and Commentary Resources	LegalIT Insider	LegalTechnology.com	122,034	170,000	0.0001%
eDiscovery Provider Websites	Relativity	Relativity.com	145,664	150,000	0.00009%
Industry Blogs	eDisclosure Information Project (Chris Dale)	ChrisDaleOxford.com	187,731	120,000	0.00008%
News, Announcement, and Commentary Resources	Legal IT Professionals	LegalITProfessionals.com	220,976	100,000	0.00007%
Information and Research Resources	ENISA (European Union Agency for Cybersecurity)	ENISA.Europa.eu	271,149	85,000	0.00005%
Associations, Consortiums, and Groups	EDRM (Electronic Discovery Reference Model)	EDRM.net	293,316	79,000	0.00005%
eDiscovery Provider Websites	IPRO	IPROTech.com	299,993	77,000	0.00005%
Associations, Consortiums, and Groups	Women in eDiscovery	WomenineDiscovery.org	303,379	77,000	0.00005%
eDiscovery Provider Websites	Nuix	Nuix.com	323,733	72,000	0.00005%
eDiscovery Provider Websites	Epiq	EpiqGlobal.com	387,082	61,000	0.00004%
Analyst, Research, and Review Firms	ComplexDiscovery	ComplexDiscovery.com	445,248	53,000	0.00003%
Associations, Consortiums, and Groups	ACEDS (Association of Certified E-Discovery Specialists)	ACEDS.org	470,275	50,000	0.00003%
Industry Blogs	Hanzo Blog (Hanzo)	Hanzo.co	486,348	49,000	0.00003%
eDiscovery Provider Websites	Exterro	Exterro.com	508,502	46,000	0.00003%
Associations, Consortiums, and Groups	The Sedona Conference (TSC)	TheSedonaConference.org	508,617	46,000	0.00003%
Industry Blogs	Ball In Your Court (Craig Ball)	CraigBall.net	602,359	39,000	0.00002%
eDiscovery Provider Websites	Disco	CSDisco.com	747,835	31,000	0.00002%
eDiscovery Provider Websites	HaystackID	HaystackID.com	763,781	30,000	0.00002%
Information and Research Resources	International Cyber Law in Practice: Interactive Toolkit (NATO CCDCOE)	CCDCOE.org	818,082	28,000	0.00002%
eDiscovery Provider Websites	Logikcull	Logikcull.com	838,778	27,000	0.00002%
eDiscovery Provider Websites	Lexbe	Lexbe.com	894,973	26,000	0.00002%
Associations, Consortiums, and Groups	ILTA (International Legal Technology Association)	ILTAnet.org	929,143	24,000	0.00002%
eDiscovery Provider Websites	Lighthouse	LighthouseGlobal.com	1,049,929	21,000	0.00001%
eDiscovery Provider Websites	KLDiscovery	KLDiscovery.com	1,064,262	21,000	0.00001%
Information and Research Resources	GDPR (General Data Protection Regulation) (European Union)	GDPR.eu	1,089,043	20,000	0.00001%
Associations, Consortiums, and Groups	CLOC (Corporate Legal Operations Consortium)	CLOC.org	1,200,575	18,000	0.00001%
Industry Blogs	Ride the Lightning (Sharon Nelson)	SenseiEnt.com	1,222,763	18,000	0.00001%
Information and Research Resources	EDPB (European Data Protection Board)	EDPB.Europa.eu	1,306,894	17,000	0.00001%
Associations, Consortiums, and Groups	ARMA International	Arma.org	1,321,946	16,000	0.00001%
Industry Blogs	The Cowen Group (David Cowen)	CowenGroup.com	1,637,480	13,000	0.000008%
Industry Blogs	eDiscovery Assistant Blog (Kelly Twigger)	eDiscoveryAssistant.com	1,757,035	12,000	0.000007%
Educational Training and Resources	Nordic Institute for Interoperability Solutions	NIIS.org	2,609,572	7,000	0.000004%
Industry Blogs	Reveal Blog (George Socha and Cat Casey)	RevealData.com	5,437,005	2,100	0.000001%
Associations, Consortiums, and Groups	GICLI (The Government Investigations & Civil Litigation Institute)	GICLI.org	10,772,422	330	0.0000002%
eDiscovery Provider Websites	L2 Services	L2Services.net	13,335,285	110	0.00000007%

Source: ComplexDiscovery and the Washington Post
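For readers who want to work with the table, the following Python sketch shows one way to aggregate token counts by resource category and to approximate the "Percent of All Tokens" column. It uses only a handful of rows copied from the table. The total token count of roughly 156 billion is an assumption that does not appear in this article; it is included only because it yields percentages close to those reported. The function names are hypothetical.

```python
from collections import defaultdict

# Assumption: total token count of the C4 snapshot, roughly 156 billion.
# This figure is not stated in the article.
ASSUMED_TOTAL_TOKENS = 156_000_000_000

# A few rows copied from the table: (category, domain, rounded tokens).
rows = [
    ("Analyst, Research, and Review Firms", "G2.com", 16_000_000),
    ("Analyst, Research, and Review Firms", "Capterra.com", 13_000_000),
    ("News, Announcement, and Commentary Resources", "Lexology.com", 8_100_000),
    ("News, Announcement, and Commentary Resources", "JDSupra.com", 1_800_000),
    ("Associations, Consortiums, and Groups", "EDRM.net", 79_000),
]


def percent_of_all_tokens(tokens, total=ASSUMED_TOTAL_TOKENS):
    """Express a token count as a percentage of the assumed dataset total."""
    return 100.0 * tokens / total


def tokens_by_category(table_rows):
    """Sum rounded token counts for each resource category."""
    totals = defaultdict(int)
    for category, _domain, tokens in table_rows:
        totals[category] += tokens
    return dict(totals)


category_totals = tokens_by_category(rows)

# Print categories from largest to smallest token count.
for category, tokens in sorted(category_totals.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {tokens:,} tokens ({percent_of_all_tokens(tokens):.4f}%)")
```

Because the table's token counts are rounded, any totals or percentages derived this way are approximations and should be read as such.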

Implications of eDiscovery Resource Representation in the C4 Dataset

Identifying potential biases and limitations

By analyzing the representation of eDiscovery resources in the C4 Dataset, professionals in the eDiscovery ecosystem can identify potential biases and limitations in the data used to train large language models. This knowledge may enable them to make more informed decisions about the reliability and applicability of AI-generated outputs in their work.

Enhancing the quality and diversity of data used to train large language models

Understanding the inclusion of eDiscovery resources in the C4 Dataset can also help researchers and developers improve the quality and diversity of data used to train large language models. By incorporating a more comprehensive range of eDiscovery-centric resources, models may become better equipped to generate more accurate and relevant responses in the eDiscovery context.

Addressing the needs of cybersecurity, information governance, and legal discovery professionals

By exploring the eDiscovery resources represented in the C4 Dataset, developers can better understand the needs of cybersecurity, information governance, and legal discovery professionals. This insight may allow them to fine-tune large language models to better address the unique challenges and requirements of the eDiscovery ecosystem, ultimately leading to more useful AI-generated outputs for these professionals.

Encouraging transparency in AI development

Highlighting the inclusion of eDiscovery-centric resources in the C4 Dataset emphasizes the importance of transparency in AI development. By understanding the data sources used to train large language models, professionals in the eDiscovery ecosystem may be able to better evaluate the reliability of AI-generated outputs and make more informed decisions about adopting and integrating them into their work and workflows.

Conclusion

This high-level exploration of selected eDiscovery-centric resources in the Google C4 Dataset has meaningful implications for professionals in the eDiscovery ecosystem. Analyzing the representation of selected resources in the dataset may help identify potential biases and limitations, enhance the quality and diversity of data used to train large language models, and encourage transparency in AI development. It may also highlight, with context, resources that may have more influence than you would think on shaping LLM-driven answers to prompts and queries. As large language models continue to evolve and become more integrated into the eDiscovery ecosystem, understanding their data sources and potential limitations will be crucial in ensuring their successful application and adoption.

*Assisted by GAI and LLM Technologies

Source: ComplexDiscovery

Author

Rob Robinson

Rob Robinson is a technology marketer who has held senior leadership positions with multiple top-tier data and legal technology providers. He writes frequently on technology and marketing topics and publishes regularly on ComplexDiscovery.com, of which he is the Managing Director.