
[EDRM Editor’s Note: The opinions and positions are those of Craig Ball. This article is republished with permission and was first published on August 15, 2025.]
Early this century, when I was gaining a reputation as a trial lawyer who understood e-discovery and digital forensics, I was hired to work as the lead computer forensic examiner for plaintiffs in a headline-making case involving a Houston-based company called Enron. It was a heady experience.
Today, everywhere you turn in e-discovery, Enron is still with us. Not the company that went down in flames more than two decades ago, but the Enron Email Corpus, the industry’s default demo dataset.
Type in “Ken Lay” or “Andy Fastow,” hit search, and watch the results roll in. For vendors, it’s the easy choice: free, legal, and familiar. But for 2025, it’s also frozen in time—benchmarking the future of discovery against the technological equivalent of a rotary phone. Or, now that AOL has lately retired its dial-up service, benchmarking it against a 56K modem.
How Enron Became Everyone’s Test Data
When Enron collapsed in 2001 amid accounting fraud and market-manipulation scandals, the U.S. Federal Energy Regulatory Commission (FERC) launched a sweeping investigation into abuses during the Western U.S. energy crisis. As part of that probe, FERC collected huge volumes of internal Enron email.
In 2003, in an extraordinary act of transparency, FERC made a subset of those emails public as part of its docket. Some messages were removed at employees’ request; all attachments were stripped.
The dataset got a second life when Carnegie Mellon University’s School of Computer Science downloaded the FERC release, cleaned and structured it into individual mailboxes, and published it for research. That CMU version contains roughly half a million messages from about 150 Enron employees.
A few years later, the Electronic Discovery Reference Model (EDRM)—where I serve as General Counsel—stepped in to make the corpus more accessible to the legal tech world. EDRM curated, repackaged, and hosted improved versions, including PST-structured mailboxes and more comprehensive metadata. Even after CMU stopped hosting it, EDRM kept it available for years, ensuring that anyone building or testing e-discovery tools had a free, legal dataset to use. [Note: EDRM no longer hosts the Enron corpus, but for those who like hunting antiques, you may find it (or parts of it) at CMU, Enrondata.org, Kaggle.com and, no joke, The Library of Congress].
Because it’s there, lawful, and easy, Enron became—and regrettably remains—the de facto benchmark in our industry.
Why Enron Endures
Its virtues are obvious:
- Free and lawful to use
- Large enough to exercise search and analytics tools
- Real corporate communications with all their messy quirks
- Familiar to the point of being an industry standard
But those virtues are also the trap. The data is from 2001—before smartphones, Teams, Slack, Zoom, linked attachments, and nearly every other element that makes modern email review challenging.
In 2025, running Enron through a discovery platform is like driving a Formula One race car on cobblestone streets.
Why There’ll Never Be Another Enron
The conditions that produced the Enron corpus are gone. Back then, there was:
- A spectacular corporate collapse (OK, still plenty of these)
- Bankruptcy, eliminating any ongoing operations to protect
- A regulator willing to make internal corporate communications public, and most crucially,
- A legal environment without GDPR, CCPA, or modern privacy expectations
Post-2000 privacy laws, heightened sensitivity to privilege and aggressive litigation strategies make a wholesale release of modern corporate mailboxes virtually impossible. Even massive corporate implosions now end in settlements, with data returned, destroyed, or locked behind protective orders.
Chances are we will never see the likes of the Enron email release.
When ‘Safe’ Isn’t Good Enough
That leaves us with a problem. The only large, lawful, realistic corporate dataset we all share is woefully out of date. The “safe” choice—Enron—is the wrong choice if you want to see how a 2025 platform handles contemporary realities:
- Cloud-hosted mail systems
- Embedded chat and meeting content
- Mobile-generated messages
- Linked attachments
- Mixed MIME types and non-standard encodings
- Encryption, redactions, and multifile attachments
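One rough way to check whether a test corpus exercises any of these features is to profile its messages. The sketch below is illustrative only and uses Python's standard `email` package. The names `profile_message`, `profile_corpus` and `build_sample`, and the list of hosts treated as signs of a linked attachment, are assumptions made for this example and are not drawn from any e-discovery platform.

```python
import re
from collections import Counter
from email import message_from_bytes, policy
from email.message import EmailMessage

# Assumed heuristic, not an industry rule: hosts that often serve linked attachments.
LINK_PATTERN = re.compile(
    r"https?://[^\s\"'<>]*(?:sharepoint\.com|drive\.google\.com|dropbox\.com)[^\s\"'<>]*",
    re.IGNORECASE,
)


def profile_message(raw: bytes) -> dict:
    """Summarize MIME types, charsets, attachments and cloud links in one message."""
    msg = message_from_bytes(raw, policy=policy.default)
    mime_types = Counter()
    charsets = set()
    attachments = 0
    linked_urls = 0
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        mime_types[part.get_content_type()] += 1
        charset = part.get_content_charset()
        if charset:
            charsets.add(charset)
        if part.is_attachment():
            attachments += 1
        elif part.get_content_maintype() == "text":
            try:
                text = part.get_content()
            except (LookupError, UnicodeDecodeError):
                continue  # unknown or broken encoding; skip the link scan
            linked_urls += len(LINK_PATTERN.findall(text))
    return {
        "mime_types": dict(mime_types),
        "charsets": sorted(charsets),
        "attachments": attachments,
        "linked_urls": linked_urls,
    }


def profile_corpus(raw_messages) -> Counter:
    """Count how many messages show each feature across a collection."""
    totals = Counter()
    for raw in raw_messages:
        report = profile_message(raw)
        totals["messages"] += 1
        totals["with_attachments"] += report["attachments"] > 0
        totals["with_linked_urls"] += report["linked_urls"] > 0
        totals["with_html"] += "text/html" in report["mime_types"]
    return totals


def build_sample() -> bytes:
    """Build a made-up message with one embedded file and one cloud link."""
    msg = EmailMessage()
    msg["From"] = "ana@example.com"
    msg["To"] = "raj@example.com"
    msg["Subject"] = "Q3 deck"
    msg.set_content("See https://contoso.sharepoint.com/doc/123 for the deck.")
    msg.add_attachment(b"%PDF-1.4", maintype="application",
                       subtype="pdf", filename="q3.pdf")
    return msg.as_bytes()


if __name__ == "__main__":
    print(profile_message(build_sample()))
```

Run against a 2001-era collection whose attachments were stripped, a profile like this would be expected to report few or no attachments and no cloud links, which is the gap the list above describes.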
Teaching e-discovery and digital evidence at the University of Texas Schools of Law, Computer Science and Information Science, I sidestep that by using something more modern and manageable: the John Podesta emails. Yes, they were stolen and released without consent. Yes, that’s ethically and legally fraught. But they’re from 2015, structured as PSTs with full headers and attachments, and a suitable size for students—around 50,000 messages. In a controlled educational setting, with disclaimers, they supply a realistic glimpse of the formats, quirks, and challenges of modern email collections that Enron simply can’t provide.
I wouldn’t expect vendors to demo with Podesta considering the optics, risks and size, but it underscores the problem: the most realistic training data is often the least “safe” to use.
A Practical Path Forward
If we can’t have another lawful, massive corporate email release, we need to stop pretending Enron is a valid stand-in and build datasets that reflect the present. That means:
- Synthetic corpora: purpose-built to mimic modern corporate environments, with realistic metadata, message formats, and attachment types. One example of this is the Avocado Research Email Collection available from the Linguistic Data Consortium, but it, too, is old data and too costly and restricted for general use.
- FOIA-based collections: modern government email releases converted into structured, searchable corpora.
- Anonymized donor data: corporate partners willing to contribute sanitized, non-privileged communications for research.
- Blended corpora: combinations of lawful sources to produce realistic size, variety, and complexity.
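To make the synthetic option concrete, here is a minimal sketch of generating one reply thread with modern traits: threading headers, plain-text and HTML bodies, a linked file, and an embedded attachment. It uses only Python's standard library. The addresses, subjects, file-share URL and the `synth_thread` function are all hypothetical, and a usable corpus would need far richer content and metadata than this.

```python
import random
from datetime import datetime, timedelta, timezone
from email.message import EmailMessage
from email.utils import format_datetime

# Hypothetical people and subjects; a real generator would draw on much larger pools.
PEOPLE = ["ana@example.com", "raj@example.com", "li@example.com"]
SUBJECTS = ["Budget review", "Vendor contract", "Launch checklist"]


def synth_thread(rng: random.Random, start: datetime, length: int = 3) -> list:
    """Return a list of EmailMessage objects forming one reply chain."""
    messages = []
    parent_id = None
    subject = rng.choice(SUBJECTS)
    for i in range(length):
        sender, recipient = rng.sample(PEOPLE, 2)
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = recipient
        msg["Subject"] = subject if i == 0 else f"RE: {subject}"
        msg["Date"] = format_datetime(start + timedelta(minutes=17 * i))
        # Seeded IDs keep the corpus reproducible from run to run.
        message_id = f"<{rng.getrandbits(64):016x}@example.com>"
        msg["Message-ID"] = message_id
        if parent_id:
            msg["In-Reply-To"] = parent_id
            msg["References"] = parent_id
        # A linked attachment: the file lives elsewhere and only the URL travels.
        link = f"https://files.example.com/share/{rng.randrange(10**6):06d}"
        msg.set_content(f"Draft is here: {link}\n")
        msg.add_alternative(f'<p>Draft is <a href="{link}">here</a>.</p>',
                            subtype="html")
        if i == 0:
            # An embedded attachment on the first message only.
            msg.add_attachment(b"%PDF-1.4 placeholder", maintype="application",
                               subtype="pdf", filename="draft.pdf")
        parent_id = message_id
        messages.append(msg)
    return messages


if __name__ == "__main__":
    rng = random.Random(2025)
    opened = datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)
    for message in synth_thread(rng, opened):
        print(message["Date"], message["Subject"], message.get_content_type())
```

Seeding the random generator means the same thread can be regenerated on demand, so different tools could be compared on identical input.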
These won’t have the mystique of a real scandal’s raw inboxes, but they will give us a far better sense of how today’s tools handle today’s ESI challenges.
Time to Log Off
Enron doesn’t need to vanish. It’s part of our field’s history, and for researchers in some settings it still has value. But for evaluating modern discovery software, it’s a dinosaur from the dial-up era, and it’s past time we stopped pretending otherwise.
The next time you see Enron in a demo, ask: If your tool can only shine on data from 2001, how will it perform on what I’m dealing with today? That’s the real benchmark.
Postscript: After I penned this, I asked ChatGPT, “Has anyone published advocating for retirement of the Enron corpus in ediscovery?” It pointed to a 2019 post on the CloudNine Software blog titled “The Enron Data Set is No Longer a Representative Test Data Set: eDiscovery Best Practices.” The post carries no byline, but my guess is it’s the work of my dear friend and blogger extraordinaire, Doug Austin from Houston. Credit where it’s due.
Read the original article here.
Assisted by GAI and LLM Technologies per EDRM GAI and LLM Policy.