Native or Not? Rethinking Public E-Mail Corpora for E-Discovery (Redux, 2013→2025)

[EDRM Editor’s Note: The opinions and positions are those of Craig Ball. This article is republished with permission and was first published on August 16, 2025.]

Yesterday, I found myself in a spirited exchange with a colleague about whether the e-discovery community has suitable replacements for the Enron e-mail corpora¹—now more than two decades old—as a “sandbox” for testing tools and training students. I argued that the quality of the data matters: native or near-native e-mail collections remain essential to test processing and review workflows in ways that mirror real-world litigation.

The back-and-forth reminded me that, unlike forensic examiners or service providers, e-discovery lawyers may not know or care much about the nature of electronically stored information until it finds its way to a review tool. I get that. If your interest in e-mail is in testing AI coding tools, you’re laser-focused on text and maybe a handful of metadata fields; but if your focus is on the integrity and authenticity of evidence, or on perfecting processing tools, the originating native or near-native form of the corpus matters more.

July 2, 2013 Post: What is Native Production for E-Mail?

Recently [remember, it’s 2013], I’ve weighed in on disputes where the parties were fighting over whether the e-mail production was sufficiently “native” to comply with the court’s orders to produce natively. In one matter, the question was whether Gmail could be produced in a native format, and in another, the parties were at odds about what forms are native to Microsoft Exchange e-mail. In each instance, I saw two answers: the technically correct one and the helpful one.

I am a vocal proponent of native production for e-discovery. Native is complete. Native is functional. Native is inherently searchable. Native costs less. I’ve explored these advantages in other writings and will spare you that here. But when I speak of “native” production in the context of databases, I am using a generic catchall term to describe electronic forms with superior functionality and completeness, notwithstanding the common need in e-discovery to produce less than all of a collection of ESI.

It’s a Database

When we deal with e-mail in e-discovery, we are usually dealing with database content. Microsoft Exchange, an e-mail server application, is a database. Microsoft Outlook, an e-mail client application, is a database. Gmail, a SaaS webmail application, is a database. Lotus Domino, Lotus Notes, Yahoo! Mail, Hotmail and Novell GroupWise—they were all databases. It’s important to understand this at the outset because if you think of e-mail as a collection of discrete objects (like paper letters in a manila folder), you’re going to have trouble understanding why defining the “native” form of production for e-mail isn’t as simple as many imagine.

Native in Transit: Text per a Protocol

E-mail is one of the oldest computer networking applications. Before people were sharing printers, and long before the internet was a household word, people were sending e-mail across networks. That early e-mail was plain text, also called ASCII text or 7-bit (because you need just seven bits of data, one less than a byte, to represent each ASCII character). In those days, there were no attachments, no pictures, not even simple enhancements like bold, italic or underline.

Early e-mail was something of a free-for-all, implemented differently by different systems. So the fledgling internet community circulated proposals seeking a standard. They stuck with plain text in order that older messaging systems could talk to newer systems. These proposals were called Requests for Comment or RFCs, and they came into widespread use as much by convention as by adoption (the internet being a largely anarchic realm). The RFCs lay out the form an e-mail should adhere to in order to be compatible with e-mail systems.

The RFCs concerning e-mail have gone through several major revisions since the first one circulated in 1973. The latest protocol revision is called RFC 5322 (2008), which made obsolete RFC 2822 (2001) and its predecessor, RFC 822 (1982). Another series of RFCs (RFC 2045-47, RFC 4288-89 and RFC 2049), collectively called Multipurpose Internet Mail Extensions or MIME, address ways to graft text enhancements, foreign language character sets and multimedia content onto plain text emails. These RFCs establish the form of the billions upon billions of e-mail messages that cross the internet.

So, if you asked me to state the native form of an e-mail as it traversed the Internet between mail servers, I’d likely answer, “plain text (7-bit ASCII) adhering to RFC 5322 and MIME.” In my experience, this is the same as saying “.EML format,” and it can be functionally the same as the MHT format, but only if the content of each message adheres strictly to the RFC and MIME protocols listed above. You can even change the file extension of a properly formatted message from EML to MHT and back in order to open the file in a browser or in a mail client like Outlook 2010. Try it. If you want to see what the native “plain text in transit” format looks like, change the extension from .EML to .TXT and open the file in Windows Notepad.
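
To make the “plain text in transit” point concrete, here is a minimal Python sketch using the standard library’s email package. The addresses, subject and attachment bytes are invented for illustration. It builds a message with a binary attachment, serializes it as it might be written to an .EML file, and checks that every byte of the result falls within the 7-bit ASCII range.

```python
from email.message import EmailMessage


def build_sample_eml() -> bytes:
    """Return a serialized RFC 5322/MIME message (all values are made up)."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Quarterly numbers"
    msg["Date"] = "Tue, 02 Jul 2013 09:30:00 -0500"
    msg["Message-ID"] = "<sample-1@example.com>"
    msg.set_content("Draft attached.\n")
    # Arbitrary binary payload; MIME base64-encodes it into plain text.
    msg.add_attachment(
        bytes(range(256)),
        maintype="application",
        subtype="octet-stream",
        filename="data.bin",
    )
    return msg.as_bytes()


eml_bytes = build_sample_eml()
# True when no byte needs the eighth bit, i.e. the file is 7-bit ASCII.
is_seven_bit = all(b < 128 for b in eml_bytes)
print(is_seven_bit)
print(eml_bytes.decode("ascii")[:300])  # headers and the start of the body
```

Even though the attachment contains every possible byte value, the serialized message is still readable in a text editor because MIME carries the binary content as base64 text.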

The appealing feature of producing e-mail in exactly the same format in which the message traversed the internet is that it’s a form that holds the entire content of the message (header, message bodies and encoded attachments), and it’s a form that’s about as compatible as it gets in the e-mail universe.²
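
As a rough illustration of how a single transit-format file holds header, body and encoded attachments together, the following Python sketch parses a serialized message and lists its leaf MIME parts. The sample message and the helper name summarize_parts are assumptions made for this example, not part of any e-discovery tool.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def summarize_parts(raw: bytes) -> list:
    """List (content type, filename, decoded size) for each leaf MIME part."""
    msg = message_from_bytes(raw, policy=policy.default)
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no content of their own
        payload = part.get_payload(decode=True) or b""
        rows.append((part.get_content_type(), part.get_filename(), len(payload)))
    return rows


# Invented sample: one text body plus one four-byte binary attachment.
sample = EmailMessage()
sample["From"] = "alice@example.com"
sample["To"] = "bob@example.com"
sample["Subject"] = "Contract draft"
sample["Date"] = "Tue, 02 Jul 2013 09:30:00 -0500"
sample.set_content("See attached.\n")
sample.add_attachment(
    b"\x00\x01\x02\x03",
    maintype="application",
    subtype="octet-stream",
    filename="blob.bin",
)

parts = summarize_parts(sample.as_bytes())
for row in parts:
    print(row)
```

The attachment comes back at its original four bytes after decoding, which is the sense in which the transit form “holds the entire content of the message.”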

Unfortunately, the form of an e-mail in transit is often incomplete in terms of metadata it acquires upon receipt that may have probative or practical value; and the format in transit isn’t native to the most commonly-used e-mail server and client applications, like Microsoft Exchange and Outlook. It’s from these applications–these databases–that e-mail is collected in e-discovery.

Outlook and Exchange³

Microsoft Outlook and Microsoft Exchange are database applications that talk to each other using a protocol (machine language) called MAPI, for Messaging Application Programming Interface. Microsoft Exchange is an e-mail server application that supports functions like contact management, calendaring, to do lists and other productivity tools. Microsoft Outlook is an e-mail client application that accesses the contents of a user’s account on the Exchange Server and may synchronize such content with local (i.e., retained by the user) container files supporting offline operation. If you can read your Outlook e-mail without a network connection, you have a local storage file.

Outlook: The native format for data stored locally by Outlook is a file or files with the extension PST or OST. Henceforth, I’m going to speak only of PSTs, but know that either variant may be seen. PSTs are container files. They hold collections of e-mail—typically stored in multiple folders—as well as content supporting other Outlook features. The native PST found locally on the hard drive of a custodian’s machine will hold all of the Outlook content that the custodian can see when not connected to the e-mail server.

Because Outlook is a database application designed for managing messaging, it goes well beyond simply receiving messages and displaying their content. Outlook begins by taking messages apart and using the constituent information to populate various fields in a database. What we see as an e-mail message using Outlook is actually a report queried from a database. The native form of Outlook e-mail carries these fields and adds metadata not present in the transiting message. The added metadata fields include such information as the name of the folder in which the e-mail resides, whether the e-mail was read or flagged and its date and time of receipt. Moreover, because Outlook is designed to “speak” directly to Exchange using their own MAPI protocol, messages between Exchange and Outlook carry MAPI metadata not present in the “generic” RFC 5322 messaging. Whether this MAPI metadata is superfluous or invaluable depends upon what questions may arise concerning the provenance and integrity of the message. Most of the time, you won’t miss it. Now and then, you’ll be lost without it.
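
The distinction between the transiting message and the store-side metadata can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical model: the StoredItem class and its field names are invented for illustration and do not reflect the actual PST or MAPI property structures, which are far richer.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.message import EmailMessage


@dataclass
class StoredItem:
    """Hypothetical mail-store record: the transit message plus store-side fields."""
    transit: EmailMessage
    folder: str
    is_read: bool
    flagged: bool
    received_at: datetime


def export_as_eml(item: StoredItem) -> bytes:
    """Serialize only the transit message; store-side fields are left behind."""
    return item.transit.as_bytes()


msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Status"
msg["Date"] = "Tue, 02 Jul 2013 09:30:00 -0500"
msg.set_content("Please review.\n")

item = StoredItem(
    transit=msg,
    folder="Inbox/Project Falcon",  # invented folder name
    is_read=True,
    flagged=False,
    received_at=datetime(2013, 7, 2, 14, 31, tzinfo=timezone.utc),
)

eml = export_as_eml(item)
# The folder path lives only in the store record, not in the exported text.
folder_survives = item.folder.encode("ascii") in eml
print(folder_survives)
```

In this toy model, anything recorded only in the store (folder, read status, flag) disappears on export unless it is captured separately, for example in a load file.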

Because Microsoft Outlook is so widely used, its PST file format is widely supported by applications designed to view, process and search e-mail. Moreover, the complex structure of a PST is so well understood that many commercial applications can parse PSTs into single message formats or assemble single messages into PSTs. Accordingly, it’s feasible to produce responsive messaging in a PST format while excluding messages that are non-responsive or privileged. It’s also feasible to construct a production PST without calendar content, contacts, to do lists and the like. You’d be hard pressed to find a better form of production for Exchange/Outlook messaging. Here, I’m defining “better” in terms of completeness and functionality, not compatibility with your ESI review tools.

MSGs: There’s little room for debate that the PST or OST container files are the native forms of data storage and interchange for a collection of messages (and other content) from Microsoft Outlook. But is there a native format for individual messages from Outlook, like the RFC 5322 format discussed above? The answer isn’t clear cut. On the one hand, if you were to drag a single message from Outlook to your Windows desktop, Outlook would create that message in its proprietary MSG format. The MSG format holds the complete content of its RFC 5322 cousin plus additional metadata; but it lacks information (like foldering data) that’s contained within a PST. It’s not “native” in the sense that it’s not a format that Outlook uses day-to-day; but it’s an export format that holds more message metadata unique to Outlook. All we can say is that the MSG file is a highly compatible near-native format for individual Outlook messages–more complete than the transiting e-mail and less complete than the native PST. Though it’s encoded in a proprietary Microsoft format (i.e., it’s not plain text), the MSG format is so ubiquitous that, like PSTs, many applications support it as a standard format for moving messages between applications.

Exchange: The native format for data housed in an Exchange server is its database file, prosaically called the Exchange Database and sporting the file extension .EDB. The EDB holds the account content for everyone in the mail domain; so unless the case is the exceedingly rare one that warrants production of all the e-mail, attachments, contacts and calendars for every user, no litigant hands over their EDB.

It may be possible to create an EDB that contains only messaging from selected custodians (and excludes privileged and non-responsive content) such that you could really, truly produce in a native form. But, I’ve never seen it done that way, and I can’t think of anything to commend it over simpler approaches.

So, if you’re not going to produce in the “true” native format of EDB, the desirable alternatives left to you are properly called “near-native,” meaning that they preserve the requisite content and essential functionality of the native form, but aren’t the native form. If an alternate form doesn’t preserve content and functionality, you can call it whatever you want. I lean toward “garbage,” but to each his own.

E-mail is a species of ESI that doesn’t suffer as mightily as, say, Word documents or Excel spreadsheets when produced in non-native forms. If one were meticulous in their text extraction, exacting in their metadata collection and careful in their load file construction, one could produce Exchange content in a way that’s sufficiently complete and utile as to make a departure from the native less problematic—assuming, of course, that one produces the attachments in their native forms. That’s a lot of “ifs,” and what will emerge is sure to be incompatible with e-mail client applications and native review tools.

Litmus Test: Perhaps we have the makings of a litmus test to distinguish functional near-native forms from dysfunctional forms like TIFF images and load files: Can the form produced be imported into common e-mail client or server applications?

You have to admire the simplicity of such a test. If the e-mail produced is so distorted that not even e-mail programs can recognize it as e-mail, that’s a fair and objective indication that the form of production has strayed too far from its native origins.
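
The real test is whether an actual e-mail client or server will import the production. As a crude, automatable proxy, one might ask whether a file at least parses as an RFC 5322 message carrying the two header fields the standard requires (an originator and an origination date). The sketch below assumes Python’s standard email parser as a stand-in for a mail client; the function name and sample inputs are invented.

```python
from email import message_from_bytes, policy

# RFC 5322 requires only an origination date and an originator field.
REQUIRED_HEADERS = ("From", "Date")


def looks_like_email(raw: bytes) -> bool:
    """Rough proxy for the litmus test: parses cleanly and has required headers."""
    msg = message_from_bytes(raw, policy=policy.default)
    if msg.defects:
        return False  # the parser flagged structural problems
    return all(msg[name] is not None for name in REQUIRED_HEADERS)


well_formed = (
    b"From: alice@example.com\r\n"
    b"Date: Tue, 02 Jul 2013 09:30:00 -0500\r\n"
    b"Subject: Hello\r\n"
    b"\r\n"
    b"Body text.\r\n"
)
# Extracted text of the kind that accompanies an image production.
extracted_text = b"Bates No. ABC000123\nThis page is OCR text from a TIFF image.\n"

print(looks_like_email(well_formed))
print(looks_like_email(extracted_text))
```

Passing this check does not prove that a form is adequate, but failing it is a strong hint that the form has strayed far from anything an e-mail program would recognize.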

Gmail

The question whether it’s feasible to produce Gmail in its native form triggered an order by U.S. Magistrate Judge Mark J. Dinsmore in a case styled, Keaton v. Hannum, 2013 U.S. Dist. LEXIS 60519 (S.D. Ind. Apr. 29, 2013). It’s a seamy, sad suit brought pro se by an attorney named Keaton against both his ex-girlfriend, Christine Zook, and the cops who arrested Keaton for stalking Zook. It got my attention because the court cited a blog post I made three years ago.⁴

The Court wrote:

Zook has argued that she cannot produce her Gmail files in a .pst format because no native format exists for Gmail (i.e., Google) email accounts. The Court finds this to be incorrect based on Exhibit 2 provided by Zook in her Opposition Brief. [Dkt. 92 at Ex. 2 (Ball, Craig: Latin: To Bring With You Under Penalty of Punishment, EDD Update (Apr. 17, 2010)).] Exhibit 2 explains that, although Gmail does not support a “Save As” feature to generate a single message format or PST, the messages can be downloaded to Outlook and saved as .eml or.msg files, or, as the author did, generate a PDF Portfolio – “a collection of multiple files in varying format that are housed in a single, viewable and searchable container.” [Id.] In fact, Zook has already compiled most of her archived Gmail emails between her and Keaton in a .pst format when Victim.pst was created. It is not impossible to create a “native” file for Gmail emails.
Id. at 3.

I’m gratified when a court cites my work, and here, I’m especially pleased that the Court took an enlightened approach to “native” forms in the context of e-mail discovery. Of course, one strictly defining “native” to exclude near-native forms might be aghast at the loose lingo; but the more important takeaway from the decision is the need to strive for the most functional and complete forms when true native is out-of-reach or impractical.

Gmail is a giant database in a Google data center someplace (or in many places). I’m sure I don’t know what the native file format for cloud-based Gmail might be. Mere mortals don’t get to peek at the guts of Google. But, I’m also sure that it doesn’t matter, because even if I could name the native file format, I couldn’t obtain that format, nor could I faithfully replicate its functionality locally.⁵

Since I can’t get “true” native, how can I otherwise mirror the completeness and functionality of native Gmail? After all, a litigant doesn’t seek native forms for grins. A litigant seeks native forms to secure the unique benefits native brings, principally functionality and completeness.⁶

There are a range of options for preserving a substantial measure of the functionality and completeness of Gmail. One would be to produce in Gmail.

HUH?!?!

Yes, you could conceivably open a fresh Gmail account for production, populate it with responsive messages and turn over the access credentials for same to the requesting party. That’s probably as close to true native as you can get (though some metadata will change), and it flawlessly mirrors the functionality of the source. Still, it’s not what most people expect or want. It’s certainly not a form they can pull into their favorite e-discovery review tool.

Alternatively, as the Court noted in Keaton v. Hannum, an IMAP⁷ capture to a PST format (using Microsoft Outlook or a collection tool) is a practical alternative. The resultant PST won’t look or work exactly like Gmail (i.e., messages won’t thread in the same way and flagging will be different); but it will supply a large measure of the functionality and completeness of the Gmail source. Plus, it’s a form that lends itself to many downstream processing options.

So, What’s the native form of that e-mail?

Which answer do you want: the technically correct one or the helpful one? No one is a bigger proponent of native production than I am; but I’m finding that litigants can get so caught up in the quest for native that they lose sight of what truly matters.

Where e-mail is concerned, we should be less captivated by the term “native” and more concerned with specifying the actual form or forms that are best suited to supporting what we need and want to do with the data. That means understanding the differences between the forms (e.g., what information they convey and their compatibility with review tools), not just demanding native like it’s a brand name.

When I seek “native” for a Word document or an Excel spreadsheet, it’s because I recognize that the entire native file—and only the native file—supports the level of completeness and functionality I need, a level that can’t be fairly replicated in any other form. But when I seek native production of e-mail, I don’t expect to receive the entire “true” native file. I understand that responsive and privileged messages must be segregated from the broader collection and that there are a variety of near native forms in which the responsive subset can be produced so as to closely mirror the completeness and functionality of the source.

When it comes to e-mail, what matters most is getting all the important information within and about the message in a fielded form that doesn’t completely destroy its character as an e-mail message.

So let’s not get too literal about native forms when it comes to e-mail. Don’t seek native to prove a point. Seek native to prove your case.

2025 Reflections

The Enron corpus has been invaluable as an openly available, large-scale dataset, but it’s long overdue for replacement: its PSTs reflect early 2000s Outlook conventions, the metadata is frozen in amber, and the corporate context is one most students know only, if at all, as a case study in turn-of-the-century corporate scandal.

What’s emerged in the years since Enron are alternative corpora—Jeb Bush’s gubernatorial e-mails, the Podesta collection from WikiLeaks, opioid litigation e-mails hosted at UCSF, and smaller thematic sets assembled by academics. Each has its value, but they also illustrate key shortcomings:

Image-based or static renditions: Many are posted only as PDFs or HTML tables, stripped of transport headers and lacking the artifacts real processing engines must contend with.
Synthetic reconstructions: Some have been rebuilt from text or CSV extracts into pseudo-RFC 5322 messages. These are useful for text mining, but they are not “native” in the sense that preserves fidelity to client/server behavior, MIME boundaries, or system-generated metadata.
Custodian/context gaps: Unlike Enron’s multi-custodian dataset, newer collections often lack the richness of interwoven mailboxes and shared infrastructure that make Enron so good for threading, deduplication, and analytics testing.
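
To see why stripped headers matter for threading and deduplication, consider this simplified Python sketch. It keys deduplication on the Message-ID header and links replies to parents through In-Reply-To. Commercial tools typically rely on more robust signals (content hashes, the References header, normalized subjects), so treat this as an illustration under stated assumptions, with invented sample messages.

```python
from collections import defaultdict
from email import message_from_string, policy


def dedupe_and_thread(raw_messages):
    """Drop repeated Message-IDs, then map each parent ID to its reply IDs."""
    seen = {}
    for raw in raw_messages:
        msg = message_from_string(raw, policy=policy.default)
        mid = msg["Message-ID"]
        if mid is None:
            continue  # no key to work with; a real tool would fall back to hashing
        seen.setdefault(str(mid).strip(), msg)

    replies = defaultdict(list)
    for mid, msg in seen.items():
        parent = msg["In-Reply-To"]
        if parent is not None:
            replies[str(parent).strip()].append(mid)
    return list(seen), dict(replies)


original = (
    "Message-ID: <a@example.com>\n"
    "From: ann@example.com\n"
    "Date: Tue, 02 Jul 2013 09:30:00 -0500\n"
    "Subject: Budget\n\nFirst message.\n"
)
reply = (
    "Message-ID: <b@example.com>\n"
    "In-Reply-To: <a@example.com>\n"
    "From: bob@example.com\n"
    "Date: Tue, 02 Jul 2013 10:05:00 -0500\n"
    "Subject: Re: Budget\n\nReply.\n"
)

# The original appears twice, as it would in two custodians' mailboxes.
unique_ids, threads = dedupe_and_thread([original, reply, original])
print(unique_ids)
print(threads)
```

A corpus posted as PDFs or rebuilt from text extracts often lacks exactly these header fields, leaving a tool nothing to key on but subject lines and body text.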

Meanwhile, contemporary systems and concerns pose fresh challenges:

Cloud archives (O365/Google Workspace): Modern discovery often turns on exported .PST or .MBOX files from cloud tenants, with quirks like truncated headers, missing BCCs, or OCR-derived attachments.
Security and privacy: The very features that make corpora valuable for teaching (authenticity, richness, discoverability) also make them risky to host without redaction. That risk has chilled the release of modern equivalents.
AI model training: With generative AI in the mix, corpora aren’t just used for testing search terms anymore; they’re fodder for fine-tuning and evaluation of AI models—making their integrity and authenticity even more critical.
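
For readers who haven’t handled an MBOX export, the sketch below uses Python’s standard mailbox module to write a small MBOX file and read it back. Because MBOX has no innate notion of folders, the example follows the common workaround of naming one file per folder (here, Inbox.mbox). The messages, addresses and helper names are invented for illustration.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def make_message(subject: str) -> EmailMessage:
    """Build a throwaway message (all values are made up)."""
    msg = EmailMessage()
    msg["From"] = "custodian@example.com"
    msg["To"] = "counsel@example.com"
    msg["Subject"] = subject
    msg["Date"] = "Sat, 16 Aug 2025 10:00:00 -0500"
    msg.set_content("Body text.\n")
    return msg


def write_folder_mbox(directory: str, folder: str, messages) -> str:
    """Write messages to <folder>.mbox so the file name records the folder."""
    path = os.path.join(directory, f"{folder}.mbox")
    box = mailbox.mbox(path)
    for m in messages:
        box.add(m)
    box.flush()
    box.close()
    return path


def read_subjects(path: str) -> list:
    """Return the Subject of every message stored in an MBOX file."""
    box = mailbox.mbox(path)
    try:
        return [str(m["Subject"]) for m in box]
    finally:
        box.close()


with tempfile.TemporaryDirectory() as tmp:
    inbox_path = write_folder_mbox(
        tmp, "Inbox", [make_message("Hold notice"), make_message("Re: Hold notice")]
    )
    subjects = read_subjects(inbox_path)
    mbox_name = os.path.basename(inbox_path)

print(mbox_name, subjects)
```

Opening the resulting file in a text editor shows the messages stored one after another as plain text, each introduced by a line beginning with “From ”.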

Closing Thought

Our challenge in settling on new test corpora is to balance authenticity with accessibility. A corpus should let practitioners experience the headaches of creating load files, threading, de-duplicating, character encoding, and other issues touched on in my 2019 Processing in E-Discovery Primer. In yesterday’s post, I called the Enron e-mail corpus “a dinosaur from the dial-up era.” Yet, until we solve the legal, privacy and logistical hurdles of releasing true native or near-native corpora, the Enron set may remain our decrepit yardstick.

As we seek new corpora to displace Enron, please esteem the litmus test I posited so long ago: Does the corpus conform to a fulsome RFC 5322 form such that it could be imported into common e-mail client or server applications? Not that we need or want to do that, but if we cannot–if the e-mail corpus is so distorted that e-mail programs cannot recognize it as e-mail–the corpus is not good enough. The differences materially impact the reliability of, e.g., processing, threading and deduplication. It’s a distinction that may seem trivial when looking east from review, the ass end of e-discovery; but cast your gaze back across the vast westward expanse of the EDRM and the quality and integrity of the test corpus become decisive.

Notes

¹ In this context, a corpus (plural: corpora) means a large, representative body of documents (e.g., e-mail) collected and preserved in something close to its native form. Think of it as the “test dataset” that lets us see whether processing, search, review, and analytics tools work on the messy, artifact-laden data lawyers actually face in practice. A true corpus isn’t just text; it carries the headers, metadata, and quirks that make or break authenticity, privilege, and context.
² There’s even an established format for storing multiple RFC 5322 messages in a container format called MBOX. The MBOX format was described in 2005 in RFC 4155, and though it reflects a simple, reliable way to group e-mails in a sequence for storage, it lacks the innate ability to memorialize mail features we now take for granted, like message foldering. A common workaround is to create a single MBOX file named to correspond to each folder whose contents it holds (e.g., Inbox.mbox).
³ 2025 note: The 2013 discussion of locally stored Outlook PSTs and Exchange has diminished relevance since Microsoft’s shift to the Cloud in M365.
⁴ With a tip of the hat to Josh Gilliland, the blogger behind Bow Tie Law, who brought the Keaton decision to my attention.
⁵ It was once possible to create complete, offline replications of Gmail using a technology called Gears; however, Google discontinued support of Gears some time ago. Gears’ successor, called “Gmail Offline for Chrome,” limits its offline collection to just a month’s worth of Gmail, making it a complete non-starter for e-discovery. Moreover, neither of these approaches employs true native forms as each was designed to support a different computing environment. [2025 note: Today’s Gmail Offline settings allow up to 90 days of sync, with attachments, but still lack fidelity and depth compared to true native-form exports. None of these methods satisfies the litmus test for “native” format compliance.]
⁶ 2025 note: Savvy readers may wonder why Gmail “Takeout” isn’t mentioned in the original July 2013 post. That omission isn’t an oversight: Google’s Takeout tool didn’t support Gmail export until December 5, 2013, when MBOX downloads of mail (and Calendar data) were first made available. At the time of writing in mid-2013, “native” Gmail export was not an option for end users. I wrote about Google Takeout for Gmail in 2014, and frequently since then.
⁷ IMAP (for Internet Message Access Protocol) is another way that e-mail client and server applications can talk to one another. The latest version of IMAP is described in RFC 3501. IMAP is not a form of e-mail storage; it is a means by which the structure (i.e., foldering) of webmail collections can be replicated in local mail client applications like Microsoft Outlook. Another way that mail clients communicate with mail servers is the Post Office Protocol or POP; however, POP is limited in important ways, including in its inability to collect messages stored outside a user’s Inbox. Further, POP does not replicate foldering. Outlook “talks” to Exchange servers using MAPI and to other servers and webmail services using IMAP (or via POP, if IMAP is not supported). [2025 note: Though RFC 3501 once defined the standard for IMAP (IMAP4rev1), it has since been obsoleted by RFC 9051 (IMAP4rev2), which has been the current version of the protocol since 2021.]

Assisted by GAI and LLM Technologies per EDRM GAI and LLM Policy.

Author

Craig Ball

Craig Ball is a Texas trial lawyer, computer forensic examiner, law professor and noted authority on electronic evidence. He limits his practice to serving as a court-appointed special master and consultant in computer forensics and electronic discovery and has served as the Special Master or testifying expert in computer forensics and electronic discovery in some of the most challenging and celebrated cases in the U.S. Craig is also EDRM’s General Counsel and a key contributor to many EDRM projects.
