3.0 Text, Metadata and Image Extraction

Once files have been extracted, analyzed and filtered, the next step is content extraction. Processing software should extract text from emails, office documents and other file types, often storing it as text files, and pass it forward for later indexing and search.

Reputable processing software should extract fielded information, known as metadata, from the files, such as the from, to, cc, bcc, subject, sent date and time, and custodian fields from emails. The software should also offer the option to include tracked changes, hidden content and editing activity in a document. The extracted information may be crucial in later stages of the e-discovery process, particularly search and review.

Sections 3.1 through 3.10 address the key stages of this phase of processing.
 

3.1         Access File Content

First, processing software must access the content of hundreds if not thousands of different file types, including the most common ones, such as email messages, Microsoft Office files and PDFs. Most processing platforms integrate specialized software for this purpose as it would be nearly impossible to build such a wide range of document filters independently.

Leading programs used for accessing file content include Oracle’s Outside In, Hyland Document Filters, dtSearch, Aspose and Apache Tika, a widely used open-source product.
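For illustration, a document filter such as Apache Tika can be driven from a few lines of Python. The sketch below assumes the tika package (which launches a local Tika server and requires a Java runtime); the file path is hypothetical.

    # Minimal sketch: extract text and metadata with Apache Tika
    # (pip install tika; requires a Java runtime).
    from tika import parser

    parsed = parser.from_file("collection/0001_contract.docx")  # hypothetical path

    text = parsed.get("content") or ""       # extracted body text
    metadata = parsed.get("metadata") or {}  # fielded information about the file

    print(text[:200])
    print(metadata.get("Content-Type"))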
 

3.2         Detect Encrypted and Corrupt Files

Document filters cannot extract text or metadata from files that are encrypted or corrupt. The processing system should report such files as exceptions or errors to be addressed separately. Where possible, obtain passwords for password-protected files; some processing systems permit the administrator to supply a list of known passwords that are tried automatically when password-protected files are detected.
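By way of a hedged example, the sketch below shows how such a password list might be tried against a protected PDF using the pypdf library; the path and passwords are hypothetical.

    # Sketch: detect an encrypted PDF and try known passwords
    # (pip install pypdf).
    from pypdf import PdfReader

    KNOWN_PASSWORDS = ["acme2021", "q4-board"]  # hypothetical custodian-supplied list

    reader = PdfReader("exceptions/board_minutes.pdf")
    if reader.is_encrypted:
        for pw in KNOWN_PASSWORDS:
            if reader.decrypt(pw):  # truthy on success
                break
        else:
            print("No known password worked; log the file as an exception")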

There is little to be done with corrupt files. Most processing systems identify the files in an error report so they may be recollected or otherwise addressed.
 

3.3         Detect Encoding

Once a file’s contents are accessed, the extraction software must determine how those contents have been encoded. The earliest e-discovery processing tools assumed files were encoded using the ASCII character set, developed in 1963. ASCII supported 128 characters, enough to express the English alphabet (upper- and lowercase letters), the numbers 0 to 9 and a few dozen other symbols necessary for basic computing.

As computing reached beyond U.S. borders, it quickly became clear that 128 characters were insufficient to encode other languages. Standards bodies, including the International Organization for Standardization (ISO) and the American National Standards Institute (ANSI), began creating extended ASCII character sets, expressed through different code pages, to address the problem.[1] As a result, content extraction software needed to know which encoding was used in order to extract text properly.

In the 1990s, a worldwide consortium of computer scientists developed the universal encoding standard Unicode, which incorporated the original ASCII characters and today supports approximately 150,000 characters, plus multiple symbol sets and emoji. Now in version 13, Unicode provides a basis to encode more than 150 modern and historic scripts.[2] The most widely used Unicode encoding is UTF-8 (8-bit Unicode Transformation Format); some systems use UTF-16 instead.

For processing purposes, the text extraction system must recognize how a file was encoded, whether in ASCII, one of the extended ASCII sets or Unicode. In the latter case, it must also distinguish between UTF-8 and UTF-16. If the system mistakes the encoding, the extracted text is likely to be partially or completely unusable, and searches against it will be inaccurate.
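As a simple illustration, libraries such as chardet (or charset-normalizer) can guess an unknown encoding before decoding; the sketch below assumes chardet, and the path is hypothetical.

    # Sketch: guess the character encoding before extracting text
    # (pip install chardet).
    import chardet

    raw = open("extracted/memo_ru.txt", "rb").read()
    guess = chardet.detect(raw)  # e.g. {'encoding': 'windows-1251', 'confidence': 0.87, ...}

    encoding = guess["encoding"] or "utf-8"  # fall back and flag if detection fails
    text = raw.decode(encoding, errors="replace")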
 

3.4       Detect Language

While many e-discovery matters involve documents that are only in English, others involve documents written in different languages. Detecting which language is used in ESI is important to processing for several reasons.

First, language identification ensures the right filters are used for text extraction, tokenization and diacritical handling. Second, language identification may be important during the review stage, when documents are assigned by language to ensure they are properly analyzed by foreign-language reviewers. Foreign-language documents may also be flagged for machine translation, though this should not be used as a replacement for translation by a linguist.
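For illustration, a minimal language-identification sketch using the open-source langdetect package (one of several options):

    # Sketch: per-document language identification so documents can be
    # routed to the right reviewers (pip install langdetect).
    from langdetect import detect, DetectorFactory

    DetectorFactory.seed = 0  # detection is probabilistic; pin the seed for repeatability

    print(detect("The parties agree to the following terms."))     # -> 'en'
    print(detect("Las partes acuerdan los siguientes términos."))  # -> 'es'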
 

3.5         Extract and Normalize Text

During text extraction, the processing engine makes a series of decisions about how the text should be stored for indexing. First, the text is normalized to ensure it is stored consistently and predictably for later searches. The following are key considerations for normalization.

Case Normalization:  Should capital letters in text be reduced to lower case? Some systems are case sensitive, ensuring that a search for RAM (computer memory) will not also return Ram (a sheep) or rams (as in car accidents). To locate all variants of RAM regardless of case, normalize the text to lower case.

Diacritical Normalization:  Many languages use accents and other diacritical marks to distinguish between otherwise equivalent characters. For example, a job applicant may submit a résumé or resume (or even a resumé). A search for “resume” would return results for “resume” but not the other two variants. Likewise, a search for “résumé” would find only that variant. Some languages also include ligatures that combine two letters. For example, “æ” is sometimes used in the phrase “curriculum vitæ.”

Normalizers used in many processing systems convert these Unicode characters into their ASCII equivalents. For example, instances of résumé, resume and resumé are all converted to “resume” for purposes of search, and “curriculum vitæ” would be converted to “curriculum vitae” so that it is retrieved regardless of how it is searched (assuming the search system similarly normalizes the text input).

Unicode Normalization:  The Unicode standard allows accented characters to be represented in two ways: (1) as a single specific code value, known as a precomposed character, or (2) split into two parts, the ASCII equivalent plus a combining accent character. Thus, the “é” in résumé could be represented by its own code value or as two characters, the letter “e” and the accent aigu.

Unicode normalization reduces these equivalent representations to the same values. Splitting out the accents broadens a search but is less precise when the searcher wants to locate only the accented version.
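The three normalization steps above can be sketched with Python’s standard unicodedata module; this is a minimal illustration, not a production normalizer.

    # Sketch: case, diacritical and Unicode normalization. NFD splits a
    # precomposed character ("é") into a base letter plus a combining
    # accent, which is then dropped.
    import unicodedata

    def normalize(token: str) -> str:
        decomposed = unicodedata.normalize("NFD", token.casefold())
        return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

    assert normalize("Résumé") == normalize("resume") == "resume"
    # Note: "æ" is a distinct letter, not a composed character, so folding
    # "curriculum vitæ" to "curriculum vitae" requires an explicit mapping.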

Time Zone Normalization:  The dates and times of sent and received emails can often be important to an ESI investigation. When email collections involve different time zones, it is important to normalize the time zones to a single standard. Otherwise, it may appear that a reply email sent from a California location (Pacific) to Boston (Eastern) was sent several hours before the original email from Boston. This may cause confusion when legal professionals prepare a timeline of events.

Most email processing engines normalize date and time values against a common standard, typically Coordinated Universal Time (UTC), historically known as Greenwich Mean Time (GMT). Doing so allows a later viewer to review email communications against UTC or to convert them to a different but consistent time zone.
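For example, with Python’s standard zoneinfo module (3.9+), the Pacific and Eastern timestamps discussed above collapse onto one consistent timeline; the values are hypothetical.

    # Sketch: normalizing email timestamps to UTC.
    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    reply_pacific = datetime(2022, 3, 1, 6, 15, tzinfo=ZoneInfo("America/Los_Angeles"))
    original_eastern = datetime(2022, 3, 1, 9, 5, tzinfo=ZoneInfo("America/New_York"))

    print(reply_pacific.astimezone(timezone.utc))     # 2022-03-01 14:15:00+00:00
    print(original_eastern.astimezone(timezone.utc))  # 2022-03-01 14:05:00+00:00
    # In UTC it is clear the 6:15 a.m. Pacific reply followed the original.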

Text Tokenization:  This step is critical because the search engine used for keyword searching must rely on an index of tokenized text to retrieve results quickly and efficiently.

A token is the lexical unit placed in the search index. It may be a word or any combination of letters or numbers grouped together in a wordlike construct. Thus, a string of letters or numbers, or even a misspelled word, is treated as a token and placed in the index.

During text normalization, the system separates the words so they can be properly indexed later. For English and most western languages, this is done by removing most punctuation and using the spaces between each token to define it as a separate unit. Thus, the phrase “natural-born citizen” is tokenized into three separate tokens: natural, born and citizen. Likewise, the phone number 303-777-1245 is treated as three separate tokens: 303, 777 and 1245.
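A minimal sketch of this style of tokenization, using a simple regular expression (real indexers apply more elaborate rules):

    # Sketch: letters and digits form tokens; everything else separates them.
    import re

    def tokenize(text: str) -> list[str]:
        return re.findall(r"[A-Za-z0-9]+", text.lower())

    print(tokenize("natural-born citizen"))  # ['natural', 'born', 'citizen']
    print(tokenize("303-777-1245"))          # ['303', '777', '1245']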

Tokenization is more difficult for many Asian languages and others that do not use punctuation or spacing between words. In Japanese, the phrase “You have breached the contract” is written as あなたは契約に違反しました. How does the computer separate these characters into individual words for later searching?

Special tokenizing software is used to break apart the individual word tokens used in these languages. Modern processing software should include appropriate tokenizing software for Asian and other languages that do not use punctuation marks or spacing to define words.
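As one hedged example, the pure-Python janome morphological analyzer can segment Japanese text; fugashi/MeCab are common alternatives, and the exact segmentation depends on the analyzer’s dictionary.

    # Sketch: word segmentation for Japanese (pip install janome).
    from janome.tokenizer import Tokenizer

    t = Tokenizer()
    words = list(t.tokenize("あなたは契約に違反しました", wakati=True))
    print(words)  # e.g. ['あなた', 'は', '契約', 'に', '違反', 'し', 'まし', 'た']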

Stop Word Normalization: Many search engines do not index certain commonly used words, such as a, an, and, the, etc. Others do not index numbers, symbols or two-letter combinations. Most do not index standard punctuation characters, such as quotation marks, hyphens and percent signs, which are also stripped for tokenization purposes. For example, if the “&” symbol is omitted, you will not easily be able to search for “AT&T.”

Processing engines often follow these rules but should provide the administrator with a choice in this regard. If your search engine indexes every character, the processing engine should offer this option as part of text normalization.
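A sketch of that administrator choice might look like the following; the stop list is purely illustrative.

    # Sketch: optional stop-word filtering at indexing time.
    STOP_WORDS = {"a", "an", "and", "the", "of", "to"}

    def index_tokens(tokens: list[str], drop_stop_words: bool = True) -> list[str]:
        if not drop_stop_words:
            return tokens  # index every token, e.g. to preserve "AT&T" searches
        return [tok for tok in tokens if tok not in STOP_WORDS]

    print(index_tokens(["the", "contract", "and", "the", "amendment"]))
    # -> ['contract', 'amendment']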

Special Settings for Office Files:  Modern Office files, and their equivalents from other publishers, provide a wide range of options for preservation during text extraction. Some record the editing history of the file; many permit the addition of comments or notes. In most cases, this information is not presented when the Office document is printed or converted to PDF.

The text extraction software may be set to extract this additional information and include it for indexing, searching and reviewing later. These are typically included as special settings that may or may not be requested depending on the matter and its needs.
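Because modern Office files are ZIP archives of XML parts, this hidden content is directly accessible. The sketch below pulls reviewer comments from a .docx file using only the standard library; the path is hypothetical, and tracked changes live as w:ins/w:del elements in word/document.xml.

    # Sketch: extract reviewer comments from a .docx file.
    import zipfile
    import xml.etree.ElementTree as ET

    W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

    with zipfile.ZipFile("collection/draft_agreement.docx") as zf:
        if "word/comments.xml" in zf.namelist():
            root = ET.fromstring(zf.read("word/comments.xml"))
            for comment in root.iter(f"{W}comment"):
                text = "".join(t.text or "" for t in comment.iter(f"{W}t"))
                print(comment.get(f"{W}author"), ":", text)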
 

3.6         Extract Metadata

Processing software must extract metadata, defined as information about the file itself. Where possible, it should extract information about the file maintained by the operating system as well as internal information maintained within the file. In the case of email, the processing software should also capture information collected from the email database (e.g., MS Exchange or Office 365) along with information maintained within the message file.[3]

Modern processing systems can track hundreds of metadata fields, ranging from the original file name and basic creation information to internal fields such as author, date last saved and date printed. Emails carry their own fields, including basics such as from, to, cc, bcc, subject and sent date/time; Outlook files alone contain more than a hundred metadata fields, most of which are not relevant to an e-discovery investigation.

A processing engine must extract the basic metadata fields and should provide choices to the administrator as to which fields should be included in the processing output.
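For illustration, the basic email fields can be read with Python’s standard email package; the .eml path is hypothetical.

    # Sketch: extract basic metadata fields from an email message.
    from email import policy
    from email.parser import BytesParser

    with open("collection/message_0001.eml", "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)

    record = {
        "from": msg["From"],
        "to": msg["To"],
        "cc": msg["Cc"],
        "subject": msg["Subject"],
        "sent": msg["Date"],
    }
    print(record)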
 

3.7        Extract Images

Email and other file types often allow content creators to embed images and other files within a document. Photos and other graphics are common additions to email and office documents, and spreadsheets are often embedded in Word or PowerPoint files as well as in email messages.

Processing software must be able to separate embedded files, often treating them similarly to email attachments, and should allow the administrator to choose not to extract pictures or other image files that have little or no value outside the original file. A logo attached to an email has little probative value once it is separated from its original message. Such extractions can substantially add to the number of records exported to the litigation support system, making review more difficult because of the increased volume of largely irrelevant records.
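A hedged sketch of such an administrator option, again using the standard email package (the path and flag are hypothetical):

    # Sketch: separate attachments from an email, optionally skipping
    # low-value image files such as logos.
    from email import policy
    from email.parser import BytesParser

    EXTRACT_IMAGES = False  # administrator's choice

    with open("collection/message_0002.eml", "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)

    for part in msg.iter_attachments():
        if part.get_content_maintype() == "image" and not EXTRACT_IMAGES:
            continue  # leave the image with its parent message
        print("extracted:", part.get_filename(), part.get_content_type())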
 

3.8       Handling Mobile Devices: SMS and IM

Smartphones and other mobile devices have exploded in popularity over the past two decades. They now record a large amount of human and social activity, in many cases supplanting traditional desktop computers and email communication.

SMS (Short Message Service) and IM (Instant Messaging) involve non-document data types that may be relevant to an e-discovery matter and that require special treatment during the processing phase.

When working with mobile devices, there are two areas of interest:

  1. Text Messaging:  Mobile devices offer a variety of means to send communications to others. SMS and MMS (Multimedia Messaging Service) are the most universal of these, commonly referred to as “text messages.” Closely related is iMessage, Apple’s proprietary messaging service, which works alongside SMS and MMS on devices such as iPhones and iPads.
  2. Third-Party Messaging Software:  Third-party applications such as WhatsApp and Facebook Messenger offer their own messaging platforms. Their content is typically stored in proprietary databases and may be extracted using specialized software.

Mobile device collection often involves a forensic component. Typically, an examiner will collect data from the device itself or from a backup server using specialized software to extract the available data and export it in a usable format.

The most common export format for phone data is an Excel workbook composed of multiple worksheets, each corresponding to a data type from the device, such as messages, voicemails and call logs. The worksheets contain rows of data corresponding to individual messages, voicemails, etc., and the columns provide the metadata for each item: sender, recipient(s), timestamp, body text, date sent and so on.
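A minimal sketch of reading such a workbook with pandas; the sheet and column names vary by extraction tool and are hypothetical here.

    # Sketch: load every worksheet of a forensic phone export
    # (pip install pandas openpyxl).
    import pandas as pd

    sheets = pd.read_excel("exports/device_report.xlsx", sheet_name=None)

    messages = sheets.get("Messages")  # hypothetical worksheet name
    if messages is not None:
        for row in messages.itertuples(index=False):
            print(row)  # columns such as Sender/Recipients/Timestamp/Body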

While messaging is a core function of mobile phones and other smart devices, the devices also function as powerful, handheld computers in their own right, as well as repositories for files, pictures and traditional emails. Relevant information that may be extracted includes call logs, contacts, calendar items, voicemails, and more.
 

3.9       Handling Data Collected from Social Media and Collaboration Platforms

Closely related to, but distinct from, mobile data is data generated from various interactive social platforms that often contain a strong messaging/chat component, as well as other forms of communication, such as social media posts or file sharing.

Below are several examples of data that may require e-discovery processing.

Workplace Collaboration Data:  This form of data has risen in prominence with the proliferation of work-from-home policies in the wake of COVID-19.

Essentially, workplace collaboration software provides a platform for interaction and collaboration among teams of individuals, usually in a workplace setting. Three of the most well-known examples are Slack, Microsoft Teams and Google Chat. These tools provide instant-message functionality and hosted chat rooms to facilitate file sharing and other collaborative activities. Data from these platforms is typically exported in CSV, JSON or XML format.
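As a hedged illustration, the sketch below flattens a Slack-style JSON export into message records; the folder layout and field names follow Slack’s export format but should be verified against the platform’s documentation.

    # Sketch: read one day's messages from a channel export.
    import json
    from datetime import datetime, timezone

    with open("exports/channel_general/2023-01-05.json") as f:  # hypothetical path
        messages = json.load(f)

    for m in messages:
        when = datetime.fromtimestamp(float(m["ts"]), tz=timezone.utc)
        print(when.isoformat(), m.get("user"), m.get("text"))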

Social Media:  With billions of users, social media sites such as Facebook, Twitter and LinkedIn store content that may be relevant to a wide range of matters, from criminal investigations to family matters and business disputes. The content can be similar to that of workplace collaboration platforms, with an emphasis on social rather than workplace interactions.

Social media sites frequently involve public or semi-public posts, which often appear on a timeline. Some sites have a specific focus. Instagram, for example, focuses on media content, such as photos and videos. There are a number of specialty software products used to collect data and to “scrape” public information from the sites.

Website Content:  In some cases, website content can provide relevant information for an e-discovery matter. A number of collection programs can extract information from websites, whether as text or as page representations. These collections are typically date- and time-stamped and exported in formats such as PDF or HTML.
 

3.10       Exporting Mobile, Collaboration and Website Data

Traditional ESI for e-discovery consists of document formats such as PDF, email and Office files. The non-traditional formats extracted from mobile devices, collaboration software and websites more closely resemble rows in a table and may be understood as streams of event data or activity logs, rather than static files. As a result, the process of loading the data into a review platform can present challenges.

Often special software is needed to convert the exported data into a more traditional load file that can be imported into a document review platform. In such cases, it may be necessary to transform discussion segments, such as a day of conversations, into a single document that can be loaded, searched and viewed in a document review platform.
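A minimal sketch of that transformation, grouping a message stream into one searchable document per day (the record layout is hypothetical):

    # Sketch: collapse chat messages into per-day "documents" for a
    # review platform load file.
    from collections import defaultdict
    from datetime import datetime

    messages = [
        {"sent": "2023-01-05T14:02:00", "sender": "akim", "text": "Draft attached."},
        {"sent": "2023-01-05T14:07:00", "sender": "blee", "text": "Reviewing now."},
        {"sent": "2023-01-06T09:30:00", "sender": "akim", "text": "Any comments?"},
    ]

    by_day = defaultdict(list)
    for m in messages:
        day = datetime.fromisoformat(m["sent"]).date()
        by_day[day].append(f'{m["sent"]} {m["sender"]}: {m["text"]}')

    documents = {day: "\n".join(lines) for day, lines in by_day.items()}
    # Each value is now a single searchable text block for the load file.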

While a more traditional load file format may make sense for your document review platform, vendors are starting to create alternatives to the traditional load file that enable reviewers to view and search a conversation in a format that more closely resembles how the custodian actually experienced it.

A unique aspect of these non-traditional formats is the varying metadata that may or may not be included for each event. Given this, production requirements must be agreed upon earlier than with traditional document or mobile data productions, and parties should understand, before processing, what data is available for production from collaboration platforms and websites. As mentioned previously, the definition of a document is more fluid with this type of data; the main goal is to help reviewers sift through portions of a conversation rather than attempting to review an entire conversation spanning many years.


[1] Read more about extended ASCII at: https://en.wikipedia.org/wiki/Extended_ASCII

[2] Read more about Unicode at: https://en.wikipedia.org/wiki/Unicode

[3] Email and other files maintained in the cloud, for example Office 365 files, may not allow the extraction of operating system metadata.