
[EDRM Editor’s Note: The opinions and positions are those of Shannon Bales.[1]]
Abstract
This article addresses a critical source of confusion in legal technology: the conflation of document content and metadata in generative AI tools for eDiscovery. Document content[2] is the text and visual data you see when you open a file—like emails, Word documents, or PDFs—including headers, footers, tables, and images. Document content does not include metadata (like who created the file or when it was sent), which is stored separately and not visible by default. Metadata[3] encompasses contextual attributes like file dates, custodianship, and communication headers that exist outside the visible text.
Many AI tools claim to “analyze metadata,” yet in practice, these tools often operate solely on extracted document content. Practitioners have limited or no visibility into which fields of content and metadata are being extracted and analyzed by AI tools. This is particularly true when documents are exported as static PDFs or analyzed without fielded metadata: the underlying metadata is missing, and any values the AI supplies in its place may be hallucinated. Additionally, many newer generative AI tools lack sufficient error logging,[4] leaving end users uncertain about which files are actually included in the AI analysis. In contrast, traditional document review databases extract information into structured fields, enabling granular search[5] capabilities and generating error logs that indicate which content is being analyzed—and which is not.
This misalignment creates false assumptions about search scope, completeness, and accuracy of AI-assisted workflows. This article provides definitions, presents real-world examples of their divergence, and recommends technical and procedural safeguards for legal practitioners and AI vendors. By reinforcing the importance of clear definitions and transparent workflows, we contribute to more accurate and technically sound applications of artificial intelligence in the legal domain.
Introduction
Generative AI tools are rapidly transforming eDiscovery with capabilities including document summarization, privilege tagging, content clustering, and pattern identification. However, as these tools become more integrated into legal workflows, a fundamental question emerges: What exactly are these tools analyzing?
A critical distinction has been obscured in practice: the difference between document content and metadata. This confusion becomes problematic when:
- Legal teams assume AI tools are analyzing metadata when they only process visible document content
- Legal team members often lack an understanding of the complexities involved in file conversions[6] and transfers,[7] which can impact metadata integrity and affect analysis when imported into AI tools
- Legal team members believe comprehensive “metadata analysis” occurred when only extracted text analysis was performed
- AI models are expected to draw conclusions based on information (such as file creation dates or email routing data) that exists in metadata but is not present in the visible text
This article clarifies these distinctions, explains why they matter for accuracy, and provides best practices for responsible AI deployment for legal use.
Defining Document Content vs. Metadata
Document content refers to the visible, human-readable text or data within the four corners of a document, as rendered by its native application or standard processing tools. It includes body text, tables, headers, footers, signature blocks, and embedded objects (e.g., images, charts), as well as any OCR-extracted text from scanned or image-based files. This excludes metadata, which is not visible in the normal viewing environment.
Document Content and the “four corners rule”
The “four corners rule” reinforces the primacy of document content by requiring that the meaning of a written document be determined solely from its visible text—what appears within the four corners of the page—without reference to external context or metadata. This idea underscores why document content refers to the human-readable information that was intended to be read and understood by recipients at the time of drafting. By contrast, metadata, which describes how or when a file was created or by whom, exists outside the scope of this rule. While metadata can support authenticity or contextual analysis, it is not considered part of the document’s operative content for interpretive purposes. In this sense, AI tools that analyze only document content are mimicking the four corners rule—often ignoring or unable to extract metadata unless it has been explicitly rendered or extracted.
Document content includes:
- Body text of emails, memoranda, and other documents
- Email thread conversations and chat message sequences
- Full text content of Word documents, PDFs, and presentations
- OCR-processed text from scanned images
- Visible headers, footers, and signature blocks
- Any metadata that has been rendered as visible text (such as email headers printed in a PDF)
Document content represents what a human reviewer would see when reading the document in its native application or rendered format. For AI tools to analyze this information, it must be properly extracted during processing—poor OCR quality or formatting errors will limit what the AI can actually interpret.
Metadata
Metadata is information about a document that is not part of its visible content.[8] Metadata typically falls into the following categories:
Native Metadata (embedded within the file):
- Microsoft Word: Document author, creation date, last modified date, revision history
- Email (PST/MSG files): From, To, CC, BCC, Subject, Date/Time Sent, Message-ID
- PDF: Creator application, creation date, modification date, security settings
- Excel: Worksheet names, cell formulas, macro code, document properties
- PowerPoint: Slide notes, hidden slides, embedded objects metadata
System Metadata (generated by operating systems and servers):
- File system: File path, file size, last accessed date, file permissions
- Email servers: Message routing headers, server timestamps, delivery receipts
- Operating system: Registry entries, system logs, user access records
- Cloud platforms: Upload timestamps, sync history, sharing permissions
Review/Processing Metadata (added during eDiscovery):
- Custodian assignments (mapped during collection)
- Bates numbers and control numbers
- Processing tags and exceptions
- Review codes and privilege designations
- De-duplication and hash values
Metadata provides crucial context for legal analysis, including timeline creation, communication mapping, and custodian-based review. However, this information is often invisible to AI tools unless explicitly extracted and provided using traditional eDiscovery processing[9] techniques, where the metadata is separated from the record and fielded into a database. Hybrid workflows that incorporate fielded data searches along with AI analysis provide the safest route for analysis.
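To make the separation concrete, the following is a minimal sketch, in Python with only the standard library, of how an email’s native metadata, system metadata, and document content might be split into a fielded record. The field names and the .eml input are illustrative, not any vendor’s implementation.

```python
import email
import os
from email import policy

def field_email(path):
    """Parse a .eml file into a fielded record, separating metadata from content."""
    with open(path, "rb") as f:
        msg = email.message_from_binary_file(f, policy=policy.default)

    stat = os.stat(path)  # system metadata from the file system
    body = msg.get_body(preferencelist=("plain",))
    return {
        # Native metadata: lives in the headers, outside the visible body text
        "From": msg.get("From"),
        "To": msg.get("To"),
        "CC": msg.get("Cc"),
        "DateSent": msg.get("Date"),
        "MessageID": msg.get("Message-ID"),
        # System metadata: generated by the operating system
        "FileSize": stat.st_size,
        "LastModified": stat.st_mtime,
        # Document content: what a reviewer sees when reading the message
        "BodyText": body.get_content() if body is not None else "",
    }
```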
Traditional eDiscovery vs. Generative AI Workflows
Traditional eDiscovery and generative AI tools take fundamentally different approaches to preparing and analyzing data for legal review. Traditional eDiscovery relies on structured processing, where electronic documents are parsed into discrete metadata fields—such as date sent, author, and file type—within a relational database. This structured format allows for precise searching. In contrast, generative AI tools typically analyze documents as unstructured text,[10] focusing on meaning rather than fielded data. This approach allows large language models to analyze meaning and context, but it sacrifices data precision and field-level control. These tools excel at tasks like summarization and contextual understanding but may overlook or misinterpret metadata, leading to risks such as hallucinated dates, incomplete analysis, or inferred relationships that are not explicitly supported by the data.
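As a minimal illustration of the difference, consider the same email represented both ways (all values are hypothetical):

```python
# A fielded record vs. the unstructured text blob many GenAI tools receive.
fielded = {
    "Sender": "jane.smith@example.com",
    "DateSent": "2023-01-15",
    "Body": "Please review the attached contract.",
}
blob = "Please review the attached contract."  # the metadata never made it in

# Fielded search: precise and auditable
hit = (fielded["Sender"] == "jane.smith@example.com"
       and fielded["DateSent"].startswith("2023-01"))  # True

# Content-only search: the sender simply is not there to be found
hit_blob = "jane.smith@example.com" in blob  # False
```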
Traditional Review Platforms
Established eDiscovery platforms like Relativity, Reveal, Everlaw and DISCO follow structured workflows that separate content from metadata. During processing, document content and metadata are extracted from the electronic file and placed into a database record. An error log is generated showing documents that did not process correctly,[11] which allows additional steps to be taken so that the electronic file can be included in the database.[12] Lastly, both document content and metadata are fielded in the database, so users can see what was extracted for analysis and what was not.
- Processing Phase: Metadata is extracted and normalized into searchable database fields
- Indexing: Content and metadata are indexed for independent searching
- Investigate: Users can filter, search, and sort using both content and metadata fields providing certainty as to which data is being analyzed
- Quality Control: Clear audit trails (“error logs”) show what data was processed and whether it is included in the database or not. In addition, users can visually see which data is fielded in the database, something that is not transparent in most AI processing.
The “traditional” approach offers a high level of transparency, giving users confidence in their results by clearly showing which data contributed to their findings and providing verification at every stage of the process.
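The error-log discipline described above reduces to a simple pattern: every file either yields a database record or an explicit log entry, so nothing silently disappears. The sketch below is illustrative only; extract_record is a hypothetical stand-in for a real processing engine.

```python
import csv

def process_collection(paths, extract_record):
    records, error_log = [], []
    for path in paths:
        try:
            records.append(extract_record(path))  # content + fielded metadata
        except Exception as exc:  # password-protected, corrupt, unsupported...
            error_log.append({"file": path, "error": str(exc)})

    # Persist the error log so reviewers can see exactly what was excluded
    with open("processing_errors.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file", "error"])
        writer.writeheader()
        writer.writerows(error_log)
    return records, error_log
```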
Current Generative AI Tools
Most generative AI tools operate as if applying the “Four Corners Rule”—meaning they analyze only the visible text within the document itself, without access to the structured metadata that exists outside the document’s rendered content. The Four Corners Rule holds that a document should be understood based solely on its written terms, not on external evidence. However, unlike a judge applying this rule, generative AI tools often also incorporate invisible external knowledge from their training data—such as general world facts, language patterns, or inferred meanings—that can subtly influence the results. This combination can create a false sense of self-contained analysis when, in fact, unseen assumptions may shape the AI’s interpretation.
A major risk in using generative AI tools for legal analysis is the assumption that these tools consider metadata when, in reality, many do not. Users may believe that document properties such as author, date sent, create date, or custodian are part of the AI’s analysis—especially if those fields are available in their review platform—but unless that metadata has been explicitly extracted and passed to the AI, it is likely not part of the analysis it performs. As a result, the AI may hallucinate answers based on vague language in the document text (e.g., guessing a date based on “last week” or assigning authorship from a greeting like “Hi John”). This creates a false sense of accuracy and can lead to serious missteps, such as timeline errors or poor review decisions. Without clear visibility into what data the AI is actually analyzing, legal teams may make decisions based on assumptions or hallucinations rather than verified scope.
Adding to the complexity, many GenAI platforms only accept a narrow set of file formats—often just PDFs. This restriction requires users to convert native files into PDFs, typically using functions like “Print to PDF.” While convenient, this conversion strips away original metadata such as author name, modification history, file path, and custodian information, producing an entirely new digital object. The resulting PDF carries a fresh creation timestamp and potentially misleading metadata, making it difficult for AI tools to reliably determine the document’s origin, context, or chain of custody when performing analysis. From a forensic and eDiscovery standpoint, the conversion breaks the chain of custody and destroys metadata that the AI can never recover. As outlined in digital forensics literature, this practice undermines both the transparency and defensibility of legal workflows.[13]
- Direct Upload: Users drag-and-drop PDFs, emails, or documents without structured processing
- Content Extraction: Tools process only the visible text content
- No Field Separation: Sender, date, and content are all mixed together (if metadata is included at all) in an “unstructured text blob”—where the document is ingested as a single body of text, without separating metadata into distinct fields, which is great for context but not for accuracy
- Analysis: AI operates on extracted content, often without access to metadata
- Limited Transparency: Users may not know which files were successfully processed or what data was analyzed
This structural limitation means many AI tools cannot compensate for metadata that was never provided, creating a significant gap in analytical capability compared to traditional review workflows.[14] Tools that combine traditional database fields with AI lend themselves to hybrid approaches that produce accurate analysis and work product.
Consequences for Legal Practice
The confusion between document content and metadata analysis creates several critical risks. Fielded metadata offers precision, auditability, and accuracy.[15] Generative AI tools that rely only on document content risk hallucinations—especially with time/chronology, authorship, or contextual information.
Incomplete Filtering and Search
Traditional platforms enable sophisticated filtering by virtue of their processing[16] and fielded data (e.g., “all documents from Custodian A sent or created on May 1, 2023”). AI tools analyzing only visible content cannot perform such operations unless dates and custodian information appear in every document’s text. There is a risk that the AI tool will not parse the information correctly[17] or will hallucinate data into date fields[18] that cannot be ascertained from the document content.
Issue:
Filtering is unreliable due to hallucination, incorrect data, or missing content. Some documents are omitted (false negatives), and others may be misclassified (false positives).
Example Query:
“Find all documents created by Custodian A on June 1, 2024.”
What Happens Without Metadata:
AI tools rely on visible document content and may hallucinate a “create date” based on other date information it finds on the face of the document. Alternatively, a “print to PDF” file could have an incorrect create date based on the time the PDF was created.
The AI cannot “find” criteria that it cannot see, or that are incorrect due to PDF conversion. The AI may:
- Miss the document entirely
- Use an incorrect date
- Hallucinate a date based on content found on the face of the document
Timeline Construction Failures
If AI tools are expected to build chronological narratives but only have access to visible dates (which may be incomplete or inconsistent), they cannot leverage accurate timestamps from email metadata fields like “Date Sent” or “File Created Date.”
Example Query:
“Build a timeline of key events based on emails exchanged between January and March 2023.”
What Happens Without Metadata:
The AI tool may rely on:
“We’ll finalize the contract next month.”
“After our Q1 meeting…”
Without DateSent or DateCreated, it can:
- Hallucinate dates, anchoring phrases to incorrect times
- Misorder events, especially when multiple documents use vague timeframes
- Exclude records from the timeline
Issue:
You get a distorted chronology—critical in legal timelines, discovery disclosures, or trial prep.
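By contrast, when a fielded DateSent is available, chronology becomes a verifiable sort rather than a guess. A minimal sketch with hypothetical records:

```python
from datetime import datetime

docs = [
    {"DocID": "DOC-002", "DateSent": "2023-02-10", "Body": "After our Q1 meeting..."},
    {"DocID": "DOC-001", "DateSent": "2023-01-05", "Body": "We'll finalize the contract next month."},
]

# Sort on the metadata field, not on vague phrases in the body text
timeline = sorted(docs, key=lambda d: datetime.strptime(d["DateSent"], "%Y-%m-%d"))
for d in timeline:
    print(d["DateSent"], d["DocID"])
# 2023-01-05 DOC-001
# 2023-02-10 DOC-002
```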
Communication Analysis
Mapping relationships between parties requires metadata fields like sender, recipient, and domain information, or accurate parsing and analysis by the AI. Without access to this structured data or accurate parsing, AI tools may not perform accurate relationship or communication pattern analysis (see the sketch at the end of this subsection).
Example Query:
“Provide all emails between Smith and Wilson, or from ABC to XYZ firm, from January to June 2023.”
What Happens Without Metadata:
AI tools try to extract names/emails from content like:
“Hi John,” or “Sent to Sarah at BigLaw LLP”
But:
- Many messages don’t list full recipient info in the visible content[19]
- BCCs are invisible
- Domains (e.g., @biglaw.com) may never appear
Issue:
- Missed relationships
- Incomplete or misleading social network graphs
- Misattributed communications
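The sketch below, using Python’s standard email module, shows why relationship mapping depends on header metadata rather than body text. Note that even header-level parsing cannot recover BCCs from a received copy; that requires server-side metadata captured during collection.

```python
import email
from email import policy
from email.utils import getaddresses

def participants(eml_path):
    """Extract (name, address, domain) tuples from an email's headers."""
    with open(eml_path, "rb") as f:
        msg = email.message_from_binary_file(f, policy=policy.default)
    pairs = getaddresses(
        msg.get_all("From", []) + msg.get_all("To", []) + msg.get_all("Cc", [])
    )
    return [(name, addr, addr.split("@")[-1]) for name, addr in pairs]
```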
Privilege Review Errors
Privilege determinations often depend on who created, sent, edited, or received a document, communication dates, and document types—information frequently stored in metadata rather than visible content. AI tools lacking metadata access may miss critical privilege indicators (see the sketch at the end of this subsection).
Example Query:
“Flag documents for attorney-client privilege.”
What Happens Without Metadata:
AI relies on seeing words like:
“Privileged” or “Legal advice from John Doe”
But real-world privileged documents:
- Often don’t declare privilege explicitly
- Are identified by metadata like From: jane@lawfirm.com or presence of a .msg attachment from legal counsel
Issue:
- Failure to flag clearly privileged documents
- Increased risk of accidental disclosure
- Over-inclusion of non-privileged documents
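A minimal sketch of a metadata-based privilege screen, assuming fielded From/To/CC values and a hypothetical, case-specific list of counsel domains. A domain match is a starting point for review, not a privilege determination.

```python
COUNSEL_DOMAINS = {"lawfirm.com", "legal.example.com"}  # hypothetical

def maybe_privileged(record):
    """record is a fielded document with From (string) and To/CC (lists)."""
    people = [record.get("From", "")] + record.get("To", []) + record.get("CC", [])
    return any(
        p.split("@")[-1].lower() in COUNSEL_DOMAINS
        for p in people if "@" in p
    )
```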
Custodian-Based Review Problems
If legal teams need to segregate documents by custodian for review or production, and custodian information exists only in metadata, AI tools will be unable to perform this fundamental organizational task.
Example Query:
“Pull all documents owned by Custodian B for production.”
What Happens Without Metadata:
AI might look in content for:
“Prepared by Brian Smith”
But most documents don’t include the custodian’s name in text. Ownership is recorded in fields like:
- Custodian
- LastModifiedBy
- File path or collection tags
Issue:
- Impossible to segregate documents by legal hold or discovery scope
- Risk of producing incorrect or incomplete sets
| Use Case | Query | Issue Without Metadata |
| --- | --- | --- |
| Filtering & Search | Emails by Custodian A on May 1, 2023 | Misses documents without explicit content reference |
| Timeline/Chronology | Events from Jan–Mar 2023 | Hallucinated or misordered dates |
| Communication Analysis | Map exchanges with outside counsel | Cannot detect BCCs or non-visible recipients |
| Privilege Review | Flag privileged communications | Fails without metadata indicating attorney involvement |
| Custodian Review | Isolate Custodian B’s documents | No way to determine document ownership from document content alone |
User Error Problems
Improper handling of native files is one of the most common causes of metadata degradation in eDiscovery workflows—often occurring well before documents are analyzed by AI systems. Legal professionals may unknowingly alter or destroy metadata by downloading attachments without preserving their source path, renaming files manually, copying them between devices, or transferring them through applications that update file properties. These actions can overwrite or erase key metadata fields such as “Date Sent,” “Last Modified,” “Custodian,” or “File Path,” all of which are essential for accuracy, review, and analysis when uploaded to AI systems.
Compounding the issue, some users convert documents to PDF before upload—either due to perceived simplicity or because the AI tool accepts only limited file types. When a document is converted using “Print to PDF,” it becomes a new file entirely, inheriting a fresh creation date and losing embedded metadata from the original file. This severance of metadata renders generative AI tools blind to critical context, forcing them to infer authorship, dates, or communication structure from vague textual clues like “yesterday” or “per our last meeting.” These hallucinations, while often subtle, can materially affect review determinations, timeline accuracy, and custodian-based review. To avoid such outcomes, legal teams must implement workflows that preserve original metadata and discourage informal file handling or premature conversion.
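The file-handling point can be demonstrated directly. In the sketch below (file names are placeholders), Python’s shutil.copy does not preserve timestamps, while shutil.copy2 attempts to:

```python
import os
import shutil
import time

src = "original.docx"  # placeholder file created for the demonstration
open(src, "w").close()
time.sleep(1)

shutil.copy(src, "copy_plain.docx")   # modified time becomes "now"
shutil.copy2(src, "copy_meta.docx")   # attempts to preserve timestamps

for path in (src, "copy_plain.docx", "copy_meta.docx"):
    print(path, os.stat(path).st_mtime)
# copy_plain.docx shows a newer mtime than the original; on Windows, the
# creation date changes on any copy regardless of the method used.
```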
Best Practices and Recommendations
As generative AI tools become more prevalent in legal environments, it’s critical that legal teams adopt a clear, methodical approach to their use—especially when it comes to understanding what data these tools are actually analyzing. A common but risky assumption is that AI tools automatically access both document content and metadata; in reality, many tools are limited to analyzing document content alone, utilizing only what is visible within the four corners of the document. That limitation can introduce serious gaps in accuracy, create AI hallucinations, and reduce confidence in outcomes. The following best practices and recommendations aim to help legal professionals and stakeholders develop a deeper understanding of how AI interacts with data, enforce consistent terminology, and implement safeguards that preserve metadata integrity, ensure accurate analysis, and maintain trust in AI-assisted workflows.
1. Educate Teams and Stakeholders
Train legal teams to ask critical questions:
- “Is this tool performing content search or metadata search?”
- “What specific data fields are available to the AI?”
- “How do we verify the completeness of AI analysis?”
- “What are the limitations of this tool’s data access?”
- “Is there any special handling needed of this document prior to import into an AI tool?”
2. Define Terms Precisely
- Clearly distinguish between “content analysis” and “metadata analysis” in all vendor/user discussions, contracts, and workflow documentation
- Require vendors to specify exactly what data (including metadata) their tools access and analyze when imported into their tool
- Document these definitions in case protocols
3. Maintain Metadata Integrity
- Use an error log to verify accurate metadata extraction and identify which documents were successfully ingested and which were excluded from analysis
- Implement quality control measures to verify metadata was extracted correctly by reviewing the extracted data is in the appropriate fields (where fielded data is used)
- Recognize how operations like file copying, relocation, or conversion to PDF (“Print to PDF”) can alter metadata prior to ingestion into an AI system
4. Make Metadata Accessible When Needed
If metadata analysis is required:
- Include relevant metadata fields in document text processing
- Prepend metadata to prompts in structured formats (see the sketch after this list)
- Create hybrid workflows that combine traditional database searching with AI content analysis
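As a sketch of the “prepend metadata” recommendation, the fielded values can be rendered as labeled text so the model sees them explicitly rather than inferring them. The field names are illustrative:

```python
def build_prompt(record):
    """Prepend fielded metadata to document content before sending to an LLM."""
    header = "\n".join([
        f"CUSTODIAN: {record['Custodian']}",
        f"FROM: {record['From']}",
        f"TO: {record['To']}",
        f"DATE SENT: {record['DateSent']}",
        "--- DOCUMENT CONTENT BEGINS ---",
    ])
    return f"{header}\n{record['Body']}"

prompt = build_prompt({
    "Custodian": "Custodian A", "From": "jane@lawfirm.com",
    "To": "john@client.com", "DateSent": "2023-05-01",
    "Body": "Please review the attached agreement.",
})
```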
5. Implement Validation and Quality Control
- Conduct test runs to confirm which data fields are accessible to AI tools
- Verify that processing logs accurately reflect included and excluded documents
- Audit/test AI outputs against known metadata-dependent results[20]
6. Document Limitations and Scope
- Clearly document what information was and was not available to AI tools
- Understand the difference between tools that utilize traditional processing methods that field data and metadata and those that only analyze document content
- Include these limitations in case strategy discussions
- Ensure opposing counsel and courts understand the scope of AI-assisted review
Conclusion
As generative AI becomes integral to eDiscovery workflows, the distinction between document content and metadata analysis is not merely technical—it is fundamental to the accuracy of the results. The assumption that AI tools automatically have access to all relevant information can lead to incomplete review, missed privilege claims, and flawed legal strategies.
In legal environments, hybrid systems that use metadata and AI content analysis are safer and more reliable and can serve as a quality check of AI work product. However, there are many situations where a drag-and-drop solution that only analyzes what is visible within the document’s four corners is reasonable. User education is the differentiator between good and bad outcomes for legal AI use cases. Legal professionals must approach AI deployment with the same rigor applied to traditional review workflows: understand exactly what data is being analyzed, document limitations, and implement appropriate quality controls. Only through this disciplined approach can the legal profession realize the benefits of AI while maintaining the standards of thoroughness and defensibility that effective legal practice demands.
The future of AI in eDiscovery lies not in replacing traditional methods but in thoughtfully integrating AI capabilities with established data management practices. This requires clear communication between legal teams and technology vendors, precise documentation of analytical scope, and ongoing validation of AI-assisted workflows.
By maintaining these standards, legal professionals can harness the power of generative AI while preserving the integrity and defensibility of their review processes.
Notes
1. I would like to thank Allison Day and Mary Mack for their thoughtful feedback and insightful suggestions during the development of this article. Their input strengthened the clarity, structure, and practical value of this work.
2. While “document content” is not formally defined in the Sedona Glossary or other standards, it is used in practice to refer to the visible, user-authored text within a document—what a human reviewer reads. This paper uses the term with that practical meaning and explicitly contrasts it with metadata, which describes document properties rather than its substantive content.
3. Metadata is “the generic term used to describe the structural information of a file that contains data about the file, as opposed to describing the content of a file.” The Sedona Conference Glossary: eDiscovery and Digital Information Management, 5th ed. (2020), https://thesedonaconference.org/publication/The_Sedona_Conference_Glossary
4. The error log shows which files were not processed and would be excluded from analysis. The excluded files could be corrupt, password protected, or have an undisclosed error preventing them from being included in the review set. Remedies include file repair, recollection, or obtaining the password.
5. Example: To find all emails sent by Jane Smith in January 2023, you might use the following search: Sender = “Jane Smith” AND Date Sent >= 01/01/2023 AND Date Sent <= 01/31/2023. This searches the specific fields of “Date Sent” and “Sender,” which were extracted during processing.
6. For example, converting a document to PDF for upload to an AI system.
7. For example, copying or downloading a file prior to uploading it to an AI system can alter the metadata.
8. “Metadata: The generic term used to describe the structural information of a file that contains data about the file, as opposed to describing the content of a file.” The Sedona Conference Glossary: eDiscovery and Digital Information Management, Fifth Edition, p. 337 (2020).
9. For more on “processing,” see: https://edrm.net/resources/frameworks-and-standards/edrm-model/edrm-stages-standards/edrm-processing-standards-guide-version-1/
10. AI tools often treat each document as an “unstructured text blob,” where the document is ingested as a single body of text without separating metadata into distinct fields. The entire file might be treated as one long string of text that includes all of the visible content on the page (document content), which is great for context but not for accuracy.
11. Electronic files may not process correctly (extraction) because they are password protected or encrypted, corrupted, or affected by processing tool limitations or other errors.
12. The file may need to be recollected, the password requested, or the corruption repaired; if this fixes the issue, the file can subsequently be processed successfully.
13. “User actions such as opening, modifying, copying, or printing a file can alter metadata that is critical to an investigation. For instance, printing to PDF or saving a copy may result in the loss of original timestamps and document properties, effectively overwriting evidence of authorship, editing history, or file origin.” — Casey, Eoghan. Digital Evidence and Computer Crime, 3rd ed., Academic Press, 2011, Chapter 5: Digital Evidence on Windows Systems.
14. National Institute of Standards and Technology, Guide to Integrating Forensic Techniques into Incident Response, NIST Special Publication 800-86 (2006). As outlined in NIST SP 800-86, forensic and investigative integrity depends on accurate access to and preservation of metadata. Generative AI systems that omit metadata from their analysis run counter to these well-established standards.
15. Precision: the ability to retrieve only relevant documents (i.e., reducing false positives). Accuracy: the degree to which retrieved information reflects the correct or true values (e.g., correct dates or correct custodian). Auditability: the capacity to verify and explain what data was reviewed and how.
16. The ESI metadata is separated from the file and placed into a field, with error logging. Searchers can visibly see which metadata is present.
17. See “unstructured text blob” above.
18. The “create date” metadata of a file denotes the date and time the file was initially created or first saved to a particular storage device.
19. Email aliases can cause analysis issues (e.g., Shannon.Bales@mto.com might be listed as “IT Manager” or “Sooner” rather than the full name).
20. One could audit AI outputs by asking a series of verifiable questions, such as “Who sent this email?” or “Who received this email?” Also ask questions the system should not be able to answer, such as “What is the create date of this file?” where the create date is not visible on the four corners of the document.
Assisted by GAI and LLM Technologies per EDRM GAI and LLM Policy.