1.0 ESI Ingestion and File Extraction

Ingestion is the first step in data processing. A processing system should have the ability to ingest a variety of ESI types such as emails, office documents (word processing, spreadsheets and slides), instant messages, social media, and audio and video files, as well as a variety of container formats, which are often used to package the collected files.

Container files are a type of compressed file created by such programs as WinZip or WinRAR and are often identified by their three-letter extensions (e.g., zip or rar). Container files are commonly used to make files easier to transport. When a container file is exploded or unzipped, it returns exact copies of the original compressed files.

Forensic software programs also create container files to make exact images of a drive or part of a drive for evidentiary or preservation purposes. These programs have their own proprietary formats with file extensions such as FTK or E01. Ultimately, a processing engine must be able to extract files from different kinds of container files so they can move further along the processing chain.

The basic steps in ingestion and file extraction are:

1.1 Receive and Extract Data from Common Container Formats

The first step is to explode the container files by extracting and identifying their contents. For ZIP and RAR files, the system must apply the matching decompression algorithm to extract the contents. The same is true for forensic containers. In some cases, the forensic software itself may be used to extract container file content.

To receive and extract data, processing software should be able to:

Extract data from common file transport container formats such as Zip and RAR
Extract data in a recursive manner until all containers within containers have been addressed
Identify and report on encrypted container files
Extract data from forensic collection formats such as FTK, L01, DD, E01 and AFF and
Make a record of container files and their contents for chain of custody purposes.
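
The requirements above can be illustrated with a minimal sketch, assuming ZIP containers only and using Python's standard zipfile module. The function name explode_zip, the status labels and the manifest layout are hypothetical and chosen only for illustration. Note that Office Open XML files (docx, xlsx, pptx) are themselves ZIP packages, so a production system would likely treat them as documents rather than recurse into them.

```python
import io
import zipfile


def explode_zip(data, container="root.zip", manifest=None):
    """Recursively walk a ZIP held in memory.

    Returns manifest rows of (container path, member name, status).
    """
    if manifest is None:
        manifest = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Bit 0 of the general-purpose flag marks an encrypted member.
            if info.flag_bits & 0x1:
                manifest.append((container, info.filename, "encrypted"))
                continue
            payload = zf.read(info)
            if zipfile.is_zipfile(io.BytesIO(payload)):
                # A container inside a container: record it, then recurse.
                manifest.append((container, info.filename, "container"))
                explode_zip(payload, container + "/" + info.filename, manifest)
            else:
                manifest.append((container, info.filename, "extracted"))
    return manifest


def make_zip(members):
    """Build a small in-memory ZIP for the demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return buf.getvalue()


inner = make_zip({"memo.txt": b"hello"})
outer = make_zip({"notes.txt": b"hi", "inner.zip": inner})
rows = explode_zip(outer, "outer.zip")
for row in rows:
    print(row)
```

Each row ties an extracted item back to the container it came from, which is the kind of record a chain of custody log would need. RAR and forensic image formats are not covered by the standard library and would require separate tooling.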

1.2 Identify and Extract Content from Email Containers

Email collections are typically transported in a container file known as a PST (personal storage table) or OST for local temporary copies, and NSF for Lotus Notes.^[1] PSTs are created to hold Microsoft Exchange files such as Outlook emails, contacts, tasks and calendar items. To properly extract this information, the processing system must apply the appropriate encoding schema to segregate messages and other individual items to successfully extract their contents.

Individual emails are mini containers: along with the body of the message and its associated metadata, they often contain attachments and embedded objects. For example, Microsoft Outlook’s email export format typically uses the .msg extension. Other standard Internet mail formats include eml, emlx, and mbx.^[2] Gmail, for example, uses the eml extension.

To identify and extract content, processing software should be able to:

Continue recursion until all message content and attachments are extracted, including standard container files attached to messages
Track and report on family relationships, e.g. the attachments contained in an individual email message
Extract embedded objects^[3] from email messages and display them as attachments
Extract inline images at the operator’s request
Distinguish between recurring logos and other inline images and
Extract text from OLE objects (Object Linking and Embedding) such as smart art graphics or icons within office files.
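
As an illustration of family tracking, here is a minimal sketch for a single eml-style message using Python's standard email package. The function name extract_family, the document identifier scheme (parent id plus a numeric suffix) and the record fields are assumptions made for this example. PST, OST, NSF and msg files need other libraries and are not handled here, and attachments that are themselves messages or containers would need a further recursive pass.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_family(raw, parent_id):
    """Return one record for the message and one per attachment."""
    msg = message_from_bytes(raw, policy=policy.default)
    family = [{"id": parent_id, "parent": None, "name": str(msg["Subject"])}]
    for n, part in enumerate(msg.iter_attachments(), start=1):
        content = part.get_payload(decode=True) or b""
        family.append({
            "id": parent_id + "." + str(n),
            "parent": parent_id,          # link each child back to its email
            "name": part.get_filename(),
            "size": len(content),
        })
    return family


# Build a sample message with one attachment for the demonstration.
ATTACHMENT = b"col1,col2\n1,2\n"
sample = EmailMessage()
sample["Subject"] = "Q3 report"
sample["From"] = "a@example.com"
sample["To"] = "b@example.com"
sample.set_content("See attached.")
sample.add_attachment(ATTACHMENT, maintype="application",
                      subtype="octet-stream", filename="q3.csv")

family = extract_family(sample.as_bytes(), "DOC0001")
for record in family:
    print(record)
```

Storing the parent identifier on every child record is one simple way to keep the family relationship available for later reporting and deduplication.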

1.3 Identify Other Basic File Types

It is imperative that the processing system recognize the various file types received for ingestion. Office files must be properly identified; image files must be treated as images. Audio and video files must also be properly addressed. Misidentification of file types can quickly lead to processing failures.

File identification is part art and part science. In the Windows world, the computer’s operating system usually identifies file types (and opens them with the appropriate program) based on their file extension, e.g., .doc or .docx for Word files and .xls or .xlsx for Excel files.

This can be misleading. A file can be renamed with any file extension, which may be done by a malefactor to hide a file’s identity. As a result, other operating systems, including macOS and Linux, as well as many Windows utilities, do not rely on the file extension for identification. Instead, the information in the file itself is used to identify the file.^[4] Many file formats place a binary file signature in the first few bytes of a file to identify its type. In cases where the software cannot confidently identify the file type, the system should report the file as part of an error listing.

To identify file types, processing software should be able to:

Correctly identify file types based on multiple factors including header information, MIME types and file extensions and
Identify common Office file formats.
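
Signature-based identification can be sketched as follows. This is a minimal example, not a complete detector: the SIGNATURES list covers only a handful of well-known leading-byte signatures, and the EXPECTED table, the identify function and its status labels are hypothetical names introduced for illustration.

```python
import os

# Leading-byte signatures for a few well-known formats (not exhaustive).
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF8", "gif"),
    (b"PK\x03\x04", "zip"),                          # also DOCX, XLSX, PPTX
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),   # legacy DOC, XLS, PPT, MSG
]

# Signature family each extension is expected to carry (illustrative table).
EXPECTED = {
    ".pdf": "pdf", ".png": "png", ".jpg": "jpeg", ".jpeg": "jpeg",
    ".gif": "gif", ".zip": "zip", ".docx": "zip", ".xlsx": "zip",
    ".pptx": "zip", ".doc": "ole2", ".xls": "ole2", ".ppt": "ole2",
    ".msg": "ole2",
}


def identify(name, header):
    """Compare the signature in the file's first bytes with its extension.

    Returns (detected family or None, status), where status is
    "ok", "mismatch" or "unidentified".
    """
    detected = next(
        (kind for magic, kind in SIGNATURES if header.startswith(magic)), None
    )
    expected = EXPECTED.get(os.path.splitext(name)[1].lower())
    if detected is None:
        return None, "unidentified"      # candidate for the error listing
    if expected is not None and expected != detected:
        return detected, "mismatch"      # extension disagrees with content
    return detected, "ok"


print(identify("report.pdf", b"%PDF-1.7\n"))
print(identify("holiday.jpg", b"PK\x03\x04\x14\x00"))
print(identify("unknown.bin", b"\x00\x01\x02\x03"))
```

A ZIP signature alone cannot distinguish a plain archive from an Office Open XML document, which is one reason the guidelines call for combining header information with MIME types and file extensions.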

1.4 Scan for Viruses

Ingested files should be scanned for viruses at an early stage. Infected files should be quarantined or removed from the system so they do not adversely affect the processing system, infect other data or be passed to other systems during later stages of the e-discovery process.

Ultimately, pre-processing or processing software should include virus protection that scans files by:

Working from a regularly updated virus signature database published by one or more reputable virus protection vendors
Quarantining virus infected files for later handling
Allowing virus infected files to be removed from the system or safely deleted and
Reporting on files quarantined for virus issues.

Many processing systems do not include virus scanning as an integral part of the processing software. Rather, the expectation is that an operator will maintain and run virus scanning software separately before loading the data into the system. This approach is intended to keep the virus scanning software continuously updated, which is required to detect recent viruses.

At the least, if the processing software includes a virus protection component, there should be a mechanism to ensure that the virus signature files are current. A virus signature file is a data file of known malware patterns that security software uses to detect and identify malware.

1.5 Hash Files for Identification and Comparison

Hashing is a process used to create a “digital fingerprint” for each individual file. This hash value can be used to identify duplicate files so that redundant copies can be removed prior to review.

Hashing uses an algorithm that analyzes a file’s contents to calculate an effectively unique file identifier. For example, a 256-bit hash value, written as 64 hexadecimal characters, may resemble this:

5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8

The algorithm is designed so that the hash value changes significantly when even a single byte of data is changed. This allows the processing system to confidently remove duplicate files without inadvertently removing similar but non-duplicate files. It can also be used to determine whether a file’s contents were altered at a later stage of the process. If a claim is made that a file has been altered, hash values from the original and the suspect copy can be easily compared.
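
The sensitivity to small changes can be seen in a short sketch using Python's standard hashlib module. The two sample byte strings are arbitrary and differ in a single character.

```python
import hashlib

original = b"The quick brown fox jumps over the lazy dog"
altered = b"The quick brown fox jumps over the lazy cog"  # one byte changed

hash_original = hashlib.sha256(original).hexdigest()
hash_altered = hashlib.sha256(altered).hexdigest()

# Count how many of the 64 hexadecimal characters differ between the two.
differing = sum(a != b for a, b in zip(hash_original, hash_altered))

print(hash_original)
print(hash_altered)
print(differing, "of", len(hash_original), "hex characters differ")
```

Comparing the two printed values shows that the fingerprints bear no visible resemblance to each other, even though the inputs are nearly identical.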

There are different hashing algorithms available for use in a processing system. One of the earliest was the MD5 (message digest five) algorithm, which created a 128-bit fingerprint. The secure hash algorithm (SHA) was developed by the National Security Agency, and the current versions use 256 bits (SHA-256) or 512 bits (SHA-512) to create a digital fingerprint.^[5]

The key is to use the same hash algorithm on both files when determining whether documents are duplicates or whether a file was later changed. Although it is not mathematically impossible, the chance of two non-identical files having the same hash value is extremely low.

To hash all files received, processing software should be able to:

Create a hash value for each file using industry-standard hash algorithms, such as MD5 or SHA-256;
Store the hash values for chain of custody purposes with links to the corresponding hashed files; and
Validate the files received for processing against the files delivered after processing.
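
A minimal sketch of these three capabilities follows, assuming the files sit in an ordinary folder and that a simple mapping of relative path to hash value is an acceptable manifest. The function names hash_file, build_manifest and validate are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, algorithm="sha256"):
    """Map each file's path (relative to root) to its hash value."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hash_file(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def validate(received, delivered):
    """Return the paths whose hash is missing or different after processing."""
    return sorted(n for n, h in received.items() if delivered.get(n) != h)


with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.txt").write_bytes(b"alpha")
    Path(folder, "b.txt").write_bytes(b"beta")
    received = build_manifest(folder)
    Path(folder, "b.txt").write_bytes(b"beta!")   # simulate a later change
    changed = validate(received, build_manifest(folder))
    print(changed)
```

The same algorithm is used for both manifests, since hash values produced by different algorithms cannot be compared with one another.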

Some products may also create a “family” hash to provide an option to deduplicate at the family level and ensure that when you remove duplicates, the families stay intact.
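
One possible way to compute a family hash is sketched below. This is an assumption for illustration only: products differ in how they build family hashes, and here the family hash is simply a SHA-256 over the member hashes taken in order (parent first, then attachments). The names family_hash and dedupe_families are hypothetical.

```python
import hashlib


def family_hash(member_hashes):
    """Combine member hashes (parent first, then attachments in order)."""
    digest = hashlib.sha256()
    for value in member_hashes:
        digest.update(bytes.fromhex(value))
    return digest.hexdigest()


def dedupe_families(families):
    """Keep the first family seen for each family hash.

    families maps a family id to its ordered list of member hashes.
    Whole families are kept or dropped together, never split.
    """
    kept = {}
    for family_id, hashes in families.items():
        kept.setdefault(family_hash(hashes), family_id)
    return sorted(kept.values())


def sha(data):
    """Helper for the demonstration: SHA-256 of a byte string."""
    return hashlib.sha256(data).hexdigest()


families = {
    "E1": [sha(b"message body"), sha(b"attachment A")],
    "E2": [sha(b"message body"), sha(b"attachment A")],   # duplicate of E1
    "E3": [sha(b"message body"), sha(b"attachment B")],   # same email, other file
}
unique_families = dedupe_families(families)
print(unique_families)
```

In this sketch E3 survives even though its parent email matches E1, because deduplicating at the family level compares the whole family rather than individual members.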

Using hash file values to remove duplicates along with system and program files is discussed in the next section.

1.6 Create an Exceptions List

During this initial phase of processing, files may fail for a variety of reasons, including encryption, corruption, false identification or removal due to virus concerns. These files should be preserved, quarantined, or otherwise identified to maintain a proper chain of custody.

Many systems offer exception reports that can be retrieved at this stage of the process or later as the need arises. Exception reporting applies to every phase of processing, although these guidelines do not discuss the other phases. It is a critical part of the processing phase, and its contents should be available through the end of the discovery process. Exception reports should include, but not be limited to, the error or exception, its location, and its status (resolved, ignored, retried, unresolvable, etc.).
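
The minimum contents of an exception report can be sketched as a small record type. The class names, field names and CSV layout below are hypothetical; the status values are taken from the list in the paragraph above.

```python
import csv
import io
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    RESOLVED = "resolved"
    IGNORED = "ignored"
    RETRIED = "retried"
    UNRESOLVABLE = "unresolvable"


@dataclass
class ProcessingException:
    error: str        # what went wrong, e.g. "encrypted" or "corrupt"
    location: str     # where the file sits, e.g. a container path
    status: Status    # current handling status


def exception_report(records):
    """Render exception records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["error", "location", "status"])
    for record in records:
        writer.writerow([record.error, record.location, record.status.value])
    return buf.getvalue()


report = exception_report([
    ProcessingException("encrypted", "outer.zip/secret.zip", Status.UNRESOLVABLE),
    ProcessingException("corrupt", "mail.pst/msg0042", Status.RETRIED),
])
print(report)
```

Keeping the status as a separate field allows the same record to be updated as an exception moves from, say, retried to resolved over the life of the matter.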

^[1] We note that some cloud-based email systems are beginning to offer in-place ingestion and processing. In such a case, the ingestion stage remains, but the data may not be physically removed from its original cloud environment. Rather, the system acquires the data directly (ingestion) and processes it in the same environment. The interim stage of extracting the data and moving it through a PST or ZIP may be avoided.

^[2] See https://en.wikipedia.org/wiki/Email for more information on email formats.

^[3] Embedded objects are typically file attachments embedded using Base64 encoding or object linking and embedding (OLE, an early Microsoft format).

^[4] One commonly used mechanism is media type (MIME) detection. Multipurpose Internet Mail Extensions (MIME) is a seminal Internet standard that enables the grafting of text enhancements, foreign language character sets (Unicode) and multimedia content (e.g., photos, video, sounds and machine code) onto plain text emails. Virtually all email travels in MIME format. For more information on this, see Ball at 22.

^[5] Hashing is a unidirectional process that can never work backwards to retrieve the original data. Learn more about file verification and hashing at: https://en.wikipedia.org/wiki/File_verification.