Last updated May 14, 2015
DRAFT – This is a draft document. Please comment.
EDRM’s Processing Standards Guide continues to be open for public comment. Comments can be sent to us via email at firstname.lastname@example.org.
At a point in the e-discovery lifecycle (“Lifecycle”) following preservation, identification and collection, it often becomes necessary to “process” data before it can be moved to the next steps of the Lifecycle. Some primary goals of processing are to discern at an item level exactly what data is contained in the universe submitted; to record all item-level metadata as it existed prior to processing; and to enable defensible reduction of data by “selecting” only appropriate items to move forward to review. All of this must happen with strict adherence to process auditing, quality control, analysis and validation, and chain-of-custody considerations.
Data may arrive at the processing stage in various formats which then need to be restored before subsequent work can be done (tapes, backups, etc.); individual files and e-mail may need to be extracted from container files (PST, NSF, zip, rar, etc.); and certain types of data may need to be converted to facilitate further processing (legacy mail formats; legacy file formats). During these processing stages individual items are cataloged and their associated metadata is captured.
Rarely is it necessary to review all items that are submitted for processing. A number of data reduction opportunities are usually available. Processing is further broken into four main sub-processes, namely: Assessment; Preparation; Selection; and Output. Assessment may allow for a determination that certain data need not move forward; Preparation involves performing activities against the data which will later allow for specific item-level selection to occur (extraction, indexing, hashing, etc.); Selection involves de-duplication; searching; and analytical methods for choosing specific items which will be moved forward; Output allows for transport of reviewable items to the next phases of the Lifecycle.
This resource is designed to help readers learn more about the intricacies of processing data for e-discovery and ask better questions, the answers to which can improve their approach to this challenging aspect of e-discovery. We attempt here to inform the reader of the key considerations and concerns that arise when ESI is processed in preparation for discovery or investigation.
There are various tools available to process data for e-discovery purposes, each with its own feature set, strengths and weaknesses. Selected examples include, in alphabetical order, Ipro Tech’s eCapture and Allegro, kCura Relativity Processing, Lexis/Nexis’s LAW PreDiscovery and Early Data Analyser, and Nuix eDiscovery, but new software tools are constantly being introduced into the marketplace. The principles, best practices and examples used in this guide are meant to be application-agnostic and extend across all platforms.
- Greg Houston, kCura, principal author
- Julie Brown, Vorys, Sater, Seymour & Pease LLP
- Angela Bunting, Nuix
- Sean Byrne, Nuix
- Corinne Cartwright, Ricoh
- Adrienne Johnson
- Michael Lappin, Nuix
- Ralph Rostas
- Ashley Smith
- Tiana Van Dyk, Burnet, Duckworth & Palmer LLP
- Andy Ward, Nuix
When processing data it is important to remember that you may be opening files from a variety of sources on your computer system. This creates a level of vulnerability to virus files. Although there are anti-virus programs that can scan and monitor new files, these programs can conflict with processing tools. There are various methods to still allow processing to occur correctly while keeping your system virus-free.
One best practice is to keep your systems separate from each other. If the processing computers are hosted on a separate network that cannot reach the main business network, any malicious activity caused by virus-infected data will affect only the processing computers.
Another important note is that anti-virus programs generally alter data as they remove or disable viruses. Thus, using anti-virus software prior to or during processing can compromise the integrity of the data being processed. A better approach is therefore to avoid using anti-virus programs prior to or during processing and instead use virus protection only when moving files from processing servers to other locations. Ultimately, if you are unable to turn off anti-virus programs, make sure to perform backups of your files before processing or moving data.
A container file is a single file that contains one or more other files. Container files are used for various purposes, including transporting multiple files in one package, compressing files to save space, and conveniently attaching multiple files to an email. Examples of container files include:
- Containers used to compress loose files such as ZIP, RAR and 7Z.
- Email containers such as PST, NSF, OST and EDB that are used for storage by email systems.
- Forensic containers such as E01, AD1, DD and L01 that are used by computer imaging software to preserve a forensic copy of a computer system.
When considering how to process container files, it is useful to keep the following considerations in mind:
- Inclusion – Container files themselves generally should not be sent to review. After extraction there is no probative value in examining the container file itself, such as a PST. However, the emails and attachments inside the PST should be considered for review.
- Numbering – While container files may be important for day-to-day processes, they might not be used when producing documents and therefore they do not need to be included as part of a family group.
- Placeholders – Often it is not necessary to know that files were grouped together, and no data is associated with the container file itself, only with the files inside it. For that reason the database often holds no reference to the container beyond the file path details of the extracted files. If you want a reference to the container file, to note the grouping or original storage format of the files, a placeholder can be created.
- Size of data – Costs associated with processing data can quickly become substantial; generally this happens when data sizes increase after container files are uncompressed during processing. Therefore it is important to know whether processing fees are calculated based on the sizes of files before they are uncompressed or after.
- Exception Handling – Container files that are encrypted, corrupted, or have other terminal issues must be identified by the processing tool so that any issues can be reported and remediated. Please see the Exception handling section for further detail.
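The size and exception considerations above can be checked before a container is expanded. The following is a minimal sketch for ZIP files only, using Python's standard `zipfile` module; the function name `summarize_zip` and the sample file are hypothetical, and other container types (PST, NSF, forensic images) need their own tooling.

```python
import io
import zipfile


def summarize_zip(source):
    """Return size totals and encrypted member names for a ZIP container.

    `source` may be a file path or a file-like object.
    """
    compressed = 0
    expanded = 0
    encrypted = []
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            compressed += info.compress_size
            expanded += info.file_size
            # Bit 0 of the general-purpose flag marks an encrypted member.
            if info.flag_bits & 0x1:
                encrypted.append(info.filename)
    return {
        "compressed_bytes": compressed,
        "expanded_bytes": expanded,
        "encrypted": encrypted,
    }


# Build a small in-memory ZIP (hypothetical content) to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("memo.txt", "quarterly results " * 500)
buffer.seek(0)

summary = summarize_zip(buffer)
print(summary)
```

Comparing `compressed_bytes` with `expanded_bytes` gives an early indication of how much a data set will grow, which matters when fees are based on expanded size.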
Metadata is the underlying data that is part of, and descriptive of, the electronic files created during the course of business. In other words, it is data that describes other data. Metadata comes in two main types: properties and derived metadata.
Properties: Properties are fields of descriptive data about the structure and contents of the electronic files that have been collected. Different file types contain different properties. For instance, file-level metadata includes fields such as “Date Last Modified” or “Date Created”, and picture files may contain descriptive EXIF metadata, which may include information about the device that created the picture and the location where the picture was taken. Other common metadata attributes include “author”, file type, file size, “subject”, and “CC” (MAPI-Display-CC) for carbon copies or “BCC” (MAPI-Display-BCC) for blind carbon copies.
Derived Metadata: While properties exist within the physical files that were collected, derived metadata is created by e-discovery software to help better describe and categorize electronic files. An example of derived metadata is the MD5 field for each file, which is calculated by the software during processing. Other examples of derived metadata include “Custodian”, “Docid”, and “Parent ID”.
It is common for a party in litigation to request that a “standard list” of metadata fields be exchanged as part of a production. Consider which fields are desired by the parties before processing data. It is important to remember that there are a vast number of metadata properties associated with the hundreds of thousands of types of files that can be processed into most tools. Each of these fields has a potential purpose during the different stages of a legal or investigative process, and may be applicable in the proper context. It is important to understand which fields your processing tool can read, analyze and export.
Numbering or Item Identification
During processing, individual files are extracted from container files and attachments and embedded objects are extracted from individual files. Each extracted item is saved as a separate file.
Each extracted file is assigned a unique numeric or alpha-numeric identifier (referred to as DocId). Document identifiers typically consist of an alphabetical prefix followed by a unique number. The prefix might be the client name, custodian name or initials, but other protocols are used as well.
Although attachments and embedded objects are extracted as separate files, those separate files should maintain a link or relationship back to their parent files.
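As an illustration of the numbering scheme described above, the sketch below assigns prefixed identifiers and records each extracted item's parent. The class and function names, the `ABC` prefix and the seven-digit width are all hypothetical choices, not a prescribed protocol.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Item:
    doc_id: str
    name: str
    parent_id: Optional[str] = None  # None marks a top-level (parent) item


class DocIdAllocator:
    """Hands out identifiers made of a prefix and a zero-padded number."""

    def __init__(self, prefix, width=7):
        self.prefix = prefix
        self.width = width
        self._counter = count(1)

    def next_id(self):
        return f"{self.prefix}{next(self._counter):0{self.width}d}"


def register_family(allocator, parent_name, attachment_names):
    """Number a parent first, then its attachments, linking each to the parent."""
    parent = Item(allocator.next_id(), parent_name)
    children = [
        Item(allocator.next_id(), name, parent.doc_id)
        for name in attachment_names
    ]
    return [parent] + children


allocator = DocIdAllocator("ABC")
family = register_family(allocator, "message.msg", ["report.docx", "logo.png"])
for item in family:
    print(item.doc_id, item.name, item.parent_id)
```

Numbering the parent before its attachments keeps each family contiguous, which makes the parent link easy to verify by eye.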
It is common to have multiple copies of a file in a dataset. To minimize duplicative work, secondary copies can be suppressed so that only one copy moves forward. A file’s contents are used to compute an identifier that acts as the fingerprint of the file; this identifier is the file hash. Any variation in the file, down to a single character, produces a different hash. For further information on file hashes, see the sections on Hash Values and File Hash Analysis Basics, below. When duplicate files are found using a well-known hashing algorithm (such as MD5, SHA1 or SHA256), they can be suppressed rather than removed or otherwise deleted. Duplicates still need to be accounted for, but usually do not have to be produced or promoted to the review platform.
Deduplication can be done on a “per custodian” or global basis. This is an important consideration and should be agreed upon by both parties before deduplication operations begin; objections are common when the choice between custodian and global deduplication is not settled prior to production. Global deduplication is often sufficient when email is at issue, as long as the original message with its sender and all recipients is viewable. It is recommended to track in the database all custodians related to an item. Global deduplication usually offers significantly higher rates of deduplication, allowing for faster production times and smaller data volumes.
HELPFUL HINT: The common custodian field is an important one that should always be populated and preserved. First, your opponent may want this information. Second, should the scope of a collection change, having this field will mitigate risk and reduce the need to reprocess data.
Data Reduction/Culling Techniques
Reducing the data set during the processing stage can be low risk and highly beneficial. Let’s consider two primary methods for reducing the data set before Analysis and Review begin: de-NISTing and MIME-type filtering.
Excluding Files Based on De-NISTing
In today’s digital world, most software applications contain hundreds, often thousands, of files known as system files. These are common files and are easily identified by their consistent and known hash values (see Hash values below for more information). Typical examples of system files are:
- Dynamic Link Library; Microsoft shared library – .dll
- Executable files – .exe
- Command files containing commands to be issued to the operating system – .com
By contrast, common examples of user-generated files – files often of potential interest – include:
- Microsoft Word documents – .doc, .docx
- Microsoft Excel spreadsheets – .xls, .xlsx
- Microsoft PowerPoint presentations – .ppt, .pptx
- Microsoft Outlook e-mail files – .pst
- Lotus Notes e-mail files – .nsf
Many system files are standard files installed by programs such as Microsoft Windows or Office. Because of this, these files contain known hash values and can be easily identified and removed from any given data set based on these known values.
The National Institute of Standards and Technology (NIST) (www.nist.gov) compiles a list of hash values for files of these types. NIST does this with a sub-project called the NSRL or National Software Reference Library, generally referred to by e-discovery practitioners as “the NIST List”:
The National Software Reference Library (NSRL) is designed to collect software from various sources and incorporate file profiles computed from this software into a Reference Data Set (RDS) of information. The RDS can be used by law enforcement, government, and industry organizations to review files on a computer by matching file profiles in the RDS. This will help alleviate much of the effort involved in determining which files are important as evidence on computers or file systems that have been seized as part of criminal investigations.
The RDS is a collection of digital signatures of known, traceable software applications.
This list can be used to remove common system files presumed to be irrelevant to almost all litigation matters. De-NISTing removes these files from the set of usable documents, which can dramatically reduce the size of a collection and the time needed to review ESI.
Processing tools should be used to conduct de-NISTing for most matters. However, only top-level (parent) items should be de-NISTed, so as not to break apart document families. For example, an item that appears on the NIST List but is attached to an email should not be removed by de-NISTing, because it is not a top-level file.
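The top-level rule can be expressed as a simple filter. This is a sketch under stated assumptions: `known_hashes` stands in for a set of hash values loaded from the NIST List (the loading step is omitted), and the item dictionaries with `doc_id`, `parent_id` and `md5` keys are a hypothetical data layout, not any tool's actual schema.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the uppercase hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest().upper()


def denist(items, known_hashes):
    """Split items into (kept, suppressed) using a set of known hash values.

    Only top-level items (parent_id is None) are eligible for suppression,
    so attachments stay with their parent documents.
    """
    kept, suppressed = [], []
    for item in items:
        if item["parent_id"] is None and item["md5"] in known_hashes:
            suppressed.append(item)
        else:
            kept.append(item)
    return kept, suppressed


# Hypothetical data: the byte strings stand in for real file contents.
system_file = md5_hex(b"stand-in for a known system file")
known_hashes = {system_file}  # in practice, loaded from the reference list

items = [
    {"doc_id": "ABC0000001", "parent_id": None, "md5": system_file},
    {"doc_id": "ABC0000002", "parent_id": None, "md5": md5_hex(b"email body")},
    {"doc_id": "ABC0000003", "parent_id": "ABC0000002", "md5": system_file},
]

kept, suppressed = denist(items, known_hashes)
print([item["doc_id"] for item in kept])        # ABC0000002 and ABC0000003
print([item["doc_id"] for item in suppressed])  # ABC0000001
```

The third item matches a known hash but is an attachment, so it stays with its parent email.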
Excluding Files Based on MIME Types
Though using the NIST List will remove many non-usable files, it is not comprehensive. Because of the large number of applications in existence and the ever-growing number of new ones being developed, many program files are not yet included on the NIST List. This means that a de-NISTed data set can still contain non-usable files. The number of overlooked files also depends on each individual system, as systems can be configured with many different versions of operating systems and program files.
One way to further reduce the data set is to use MIME-type classifications to remove system files. A MIME type is a common name for a type of file. For instance, a Windows Dynamic Link Library (DLL) file is almost certainly not relevant to most e-discovery matters, so it can be safely excluded from review. A MIME type differs from a file extension in that it is verified against the file’s header rather than taken from its name. Keep in mind that, like de-NISTing, MIME-type filtering should only be used at the top (parent) level to avoid breaking apart document family relationships.
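To show why a header check is more reliable than an extension, the sketch below identifies a file from its leading bytes. The three signatures are well-known ones, but the table is far from complete, the MIME labels are illustrative, and the function name `sniff_type` is hypothetical; production tools use much larger signature databases.

```python
# A few well-known leading byte sequences ("magic numbers").
SIGNATURES = [
    (b"MZ", "application/x-msdownload"),  # Windows executables and DLLs
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),   # also the outer format of .docx/.xlsx
]


def sniff_type(header: bytes) -> str:
    """Classify a file from its first bytes, ignoring its name entirely."""
    for magic, mime in SIGNATURES:
        if header.startswith(magic):
            return mime
    return "application/octet-stream"  # unknown


# A program file renamed to look like a PDF is still recognized by its header.
renamed_name = "report.pdf"
renamed_header = b"MZ\x90\x00\x03\x00\x00\x00"
print(renamed_name, "->", sniff_type(renamed_header))
print("real PDF ->", sniff_type(b"%PDF-1.4\n"))
```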
A computer file’s digital signature can be viewed as a digital fingerprint, also known as a hash value. For practical purposes, files with different contents have different hash values. If two files have the same hash value, they are considered to be duplicates.
File Hash Analysis Basics
MD5 (message-digest algorithm 5) is a widely used cryptographic hash function producing a 128-bit (16-byte) hash value, typically expressed as a 32-digit hexadecimal number. It was designed by Ronald Rivest in 1991. It is often called an electronic fingerprint because, for practical purposes, it identifies a stream of data or a file. The odds of two arbitrary files having the same MD5 are 1 in 2^128, which is, more graphically, about 1 in 340,282,366,920,938,000,000,000,000,000,000,000,000. Needless to say, when two files have matching MD5 values, there is an extremely high confidence factor in stating the contents of the two files are identical.
The idea behind this algorithm is to take arbitrary data (text or binary[1. A binary file is a computer file that is not a text file; it may contain any type of data, encoded in binary form for computer storage and processing purposes. Binary files are usually thought of as being a sequence of bytes, which means the binary digits (bits) are grouped in eights.]) as an input and generate a fixed size “hash value” as the output. The input data can be of any size or length, but the output “hash value” size is always fixed.
An MD5 hash is simply a 32-digit hexadecimal number, such as the following:
A Sample MD5 Hash: e4d909c290d0fb1ca068ffaddf22cbd0
For practical purposes this hash is unique to a file’s contents, irrespective of its size and type. That means two .exe files with different contents will not have the same MD5 hash even if they are of the same type and size. The MD5 hash can therefore be used to identify a file by its contents.
Characteristics of a hash value are:
- It is deterministic; the hash value which is generated for a given message remains the same no matter how many times it is calculated
- It returns a bit string of specific size (the hash value)
- It is easy to compute the hash value for any given message
- It is not feasible to generate a message that has a given hash value
- It is not feasible to change a message without changing the hash value
- It is not feasible to find two different messages with the same hash value
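Several of these characteristics can be observed directly with Python's standard `hashlib` module. The sample strings below are hypothetical; the sketch shows determinism, the fixed 32-digit output, sensitivity to a one-character change, and the size of the 128-bit value space.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a byte string as 32 hexadecimal digits."""
    return hashlib.md5(data).hexdigest()


original = md5_hex(b"Quarterly forecast, draft 1")
repeat = md5_hex(b"Quarterly forecast, draft 1")
edited = md5_hex(b"Quarterly forecast, draft 2")  # one character differs

print(original == repeat)   # deterministic: same input, same hash
print(original == edited)   # a one-character change alters the hash
print(len(original), len(md5_hex(b"x" * 1_000_000)))  # always 32 digits

# Number of possible 128-bit hash values.
possible_values = 2 ** 128
print(f"{possible_values:,}")
```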
What Affects the MD5 Hash?
If a single character were to change and the data were fed back through the MD5 hash algorithm, the resulting hash value would change as well. This could be any change in the characteristics of the document.
Virtually any change to a file, including an accidental one, will cause its MD5 hash value to change; therefore the MD5 hash is used to verify the integrity of files. Typically, MD5 is used to verify that a file has not been changed as a result of a faulty file transfer, a disk error or any other type of change to the file. The following example illustrates this point.
File Name: File Hash Analysis.docx
File Path: C:\Users\rrostas\Documents\Documentation\File Hash Analysis.docx
Created Date: 12/3/2013 8:41:55 AM
Last Accessed: 12/3/2013 8:41:55 AM
Last Modified: 12/3/2013 8:41:55 AM
File Size: 19932
CRC32 Digest: 3B6B26C2
MD5 Hash: E6A0A941658254D152AE405BAEA9EA1C
The hash value above is that of the document in its draft phase, as saved at the Last Modified time shown. After a further edit and save, the same file reports the following values:
File Name: File Hash Analysis.docx
File Path: C:\Users\rrostas\Documents\Documentation\File Hash Analysis.docx
Created Date: 12/3/2013 8:41:55 AM
Last Accessed: 12/3/2013 8:48:43 AM
Last Modified: 12/3/2013 8:48:43 AM
File Size: 19957
CRC32 Digest: 0C348099
MD5 Hash: 7248BEB15FA633E1A8524DA062D45F73
There is a common misconception that moving a file by drag and drop or cut and paste will change the file’s hash value. The hash value does not change because the file itself has not been internally manipulated; that is, the file was not opened and altered. What changes with cut and paste or drag and drop is the system metadata. System metadata fields include the date and time accessed, date and time modified, and date and time created. These can change; however, they are stored by the file system rather than inside the file, so they do not affect the MD5 hash.
We have to be careful with other metadata fields that are internal to certain file types. An example is any MS Word document. Internal metadata fields are stored within the Microsoft Word document itself, so changes to them can change the MD5 hash. These fields include “Modified” and “Accessed” dates. Typically they should duplicate the metadata maintained by the operating system. The “Creation” date can be and often is different, because a Word document will keep its internal creation date even when the file is copied to a new name. Other internal fields include revisions, versioning, template utilization, “Printed”, “Last saved by”, “Revision number” and “Total editing time”. These are just a sample; they are listed here to show the numerous internal fields that can change and affect the MD5 hash value of the file. The hash value will not change just from opening the file. An edit of some sort must be made to change the hash value.
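The difference between system metadata and file content can be demonstrated with a temporary file. In this sketch, which uses only the Python standard library and hypothetical file names, a copy whose timestamps have been changed keeps the same MD5, while a one-byte edit to the content changes it.

```python
import hashlib
import os
import shutil
import tempfile


def file_md5(path):
    """Return the MD5 digest of a file's content."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "original.txt")
    copied = os.path.join(workdir, "copy.txt")
    with open(original, "wb") as handle:
        handle.write(b"File Hash Analysis draft")

    # Copy the file, then force different accessed/modified timestamps.
    shutil.copy(original, copied)
    os.utime(copied, (1_000_000_000, 1_000_000_000))

    mtimes_differ = os.path.getmtime(original) != os.path.getmtime(copied)
    same_after_copy = file_md5(original) == file_md5(copied)

    # Now change the content itself by appending a single byte.
    with open(copied, "ab") as handle:
        handle.write(b".")
    same_after_edit = file_md5(original) == file_md5(copied)

print("timestamps differ:", mtimes_differ)
print("hash equal after copy:", same_after_copy)
print("hash equal after edit:", same_after_edit)
```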
Hash Value Creation
Relativity calculates the SHA256 hash in a standard way: all the bits and bytes that make the content of the file are involved in hash calculation. Metadata is excluded from the hash value for loose files. Relativity then compares this hash to other loose files to identify duplicates.
The following is the standard method for computing a checksum for large and small files:
- Open the file.
- Read 8k blocks from the file.
- Pass each block into an MD5/SHA1/SHA256 collator, which uses the corresponding standard algorithm to accumulate the values until the final block of the file is read. The final checksum is derived.
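The three steps above map closely onto Python's standard `hashlib` interface. This is a generic sketch of block-wise hashing, not any vendor's implementation; the function names are hypothetical and the 8 KB block size follows the description above.

```python
import hashlib
import io

BLOCK_SIZE = 8 * 1024  # 8k blocks, as described above


def hash_stream(stream, algorithm="sha256", block_size=BLOCK_SIZE):
    """Feed a binary stream to a hash object one block at a time."""
    digest = hashlib.new(algorithm)
    while True:
        block = stream.read(block_size)
        if not block:  # an empty read means the final block has been consumed
            break
        digest.update(block)
    return digest.hexdigest()


def hash_file(path, algorithm="sha256"):
    """Open a file in binary mode and hash its content block by block."""
    with open(path, "rb") as handle:
        return hash_stream(handle, algorithm)


# The block-wise result matches hashing the whole content in one call.
data = b"x" * 20_000
print(hash_stream(io.BytesIO(data)) == hashlib.sha256(data).hexdigest())
```

Reading in blocks keeps memory use constant, so the same routine works for a small document and a multi-gigabyte container.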
The Processing engine generates four different SHA256 hashes:
- Body hash – takes the text of the body of the e-mail and generates a hash
- Header hash – takes the message time, subject, author’s name and e-mail, and generates a hash
- Recipient hash – takes the recipient’s name and emails and generates a hash
- Attachment hash – takes each SHA256 hash of each attachment and hashes the SHA256 hashes together
The following is the process for computing Email HeaderHash:
- A Unicode string containing Subject<crlf>SenderName<crlf>SenderEMail<crlf>ClientSubmitTime is constructed
- ClientSubmitTime is formatted with: m/d/yyyy hh:mm:ss AM/PM
- A SHA256 hash is derived from the above
The following is a constructed string: RE: Your last email<crlf>Robert Simpson<crlf>robert@kcura.com<crlf>10/4/2010 05:42:01 PM
The following is the process for computing Email RecipientHash:
- A Unicode string is constructed by looping through each recipient in the email and appending RecipientName<space>RecipientEMail<crlf> for each one
- Once the loop completes, the SHA256 hash is computed from the string
The following is an example of a constructed recipient string of two recipients: Russell Scarcella email@example.com<crlf>Kristen Vercellino firstname.lastname@example.org<crlf>
The following is the process for computing Email MessageBodyHash:
- If the PR_BODY tag is present in the MSG, capture it into a Unicode string
- If the PR_BODY tag is not present, get the native body from the PR_RTF_COMPRESSED tag and either convert the HTML or the RTF to Unicode text
- Construct a SHA256 hash from the above string
The following is the process for computing Email AttachmentHash:
- Compute the loose file standard SHA256 hash from each attachment
- Encode the hash in a Unicode string as a string of hexadecimal numbers without <crlf> separators
- Construct a SHA256 hash from the composed string
The following is an example of constructed string of two attachments: 80D03318867DB05E40E20CE10B7C8F511B1D0B9F336EF2C787CC3D51B9E26BC9974C9D2C0EEC0F515C770B8282C87C1E8F957FAF34654504520A7ADC2E0E23EA
In all email scenarios, the following is the process for deriving a SHA256 from a Unicode string:
- The string is converted to a byte array of UTF8 values
- The resulting array of bytes is fed to a standard SHA256 subroutine which computes the SHA256 hash of the UTF8 byte array
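The string-building and hashing steps for email can be sketched as follows. This is an illustration under assumptions rather than the tool's actual code: it assumes the header string is the subject, sender name, sender email and submit time joined by CRLF (consistent with the example string, which begins with the subject), that digests are written as uppercase hexadecimal, and all function names are hypothetical. Its output should not be expected to match any particular product.

```python
import hashlib
from datetime import datetime

CRLF = "\r\n"


def sha256_of_text(text: str) -> str:
    """Convert a string to UTF-8 bytes and return its SHA256 as uppercase hex."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest().upper()


def format_submit_time(moment: datetime) -> str:
    """Format as m/d/yyyy hh:mm:ss AM/PM."""
    hour12 = moment.hour % 12 or 12
    suffix = "AM" if moment.hour < 12 else "PM"
    return (f"{moment.month}/{moment.day}/{moment.year} "
            f"{hour12:02d}:{moment.minute:02d}:{moment.second:02d} {suffix}")


def header_hash(subject, sender_name, sender_email, submit_time):
    fields = [subject, sender_name, sender_email, format_submit_time(submit_time)]
    return sha256_of_text(CRLF.join(fields))


def recipient_hash(recipients):
    """recipients: list of (name, email) pairs, in message order."""
    text = "".join(f"{name} {email}{CRLF}" for name, email in recipients)
    return sha256_of_text(text)


def attachment_hash(attachment_payloads):
    """Hash each attachment, concatenate the hex digests, then hash the result."""
    digests = [hashlib.sha256(p).hexdigest().upper() for p in attachment_payloads]
    return sha256_of_text("".join(digests))


sent = datetime(2010, 10, 4, 17, 42, 1)
print(format_submit_time(sent))
print(header_hash("RE: Your last email", "Robert Simpson", "robert@kcura.com", sent))
print(attachment_hash([b"first attachment", b"second attachment"]))
```

Because each component is hashed separately, two copies of a message that differ only in, say, their recipient list can be told apart without comparing the full text.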
Global deduplication involves comparing hash values for incoming documents against all other documents present in a database. Different software tools use different fields for hashing files; therefore results can vary depending on the tool used and the settings selected.
The advantage of this method is that no duplicates remain anywhere in the database.
Custodial deduplication differs in that it compares hash values for incoming documents only against documents in the database for the selected custodian. As with global deduplication, results can vary depending on the tool used and the settings selected.
This method leaves one copy of a document with each custodian who held it, but not multiple copies within a single custodian’s data.
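The two scopes differ only in the key used to detect a duplicate. The following minimal sketch uses hypothetical document IDs, custodians and hash labels; it also records every custodian seen for each hash, so that information is not lost when copies are suppressed.

```python
from collections import defaultdict


def deduplicate(items, scope="global"):
    """Return (primary doc IDs, suppressed doc IDs, custodians per hash).

    items: iterable of (doc_id, custodian, hash_value) in processing order.
    scope: "global" compares across the database; "custodian" compares
    only within each custodian's documents.
    """
    primaries = {}
    suppressed = []
    custodians_by_hash = defaultdict(set)
    for doc_id, custodian, hash_value in items:
        custodians_by_hash[hash_value].add(custodian)
        key = hash_value if scope == "global" else (custodian, hash_value)
        if key in primaries:
            suppressed.append(doc_id)  # suppressed, not deleted
        else:
            primaries[key] = doc_id
    return list(primaries.values()), suppressed, dict(custodians_by_hash)


items = [
    ("D1", "Alice", "h1"),
    ("D2", "Bob", "h1"),
    ("D3", "Alice", "h1"),
    ("D4", "Bob", "h2"),
]

global_kept, global_suppressed, custodians = deduplicate(items, "global")
cust_kept, cust_suppressed, _ = deduplicate(items, "custodian")
print(global_kept, global_suppressed)  # ['D1', 'D4'] ['D2', 'D3']
print(cust_kept, cust_suppressed)      # ['D1', 'D2', 'D4'] ['D3']
print(sorted(custodians["h1"]))        # ['Alice', 'Bob']
```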
Time Zone Considerations
One of the fundamental characteristics of Electronically Stored Information (ESI) is time zone. Most electronic data[2. Some applications store the time zone/location of the user (e.g., Bloomberg). In these instances, special processing and/or conversion may be required.] is stored in UTC (Coordinated Universal Time[3. Coordinated Universal Time (UTC): Primary time standard by which the world regulates clocks and time. Time zones around the world are expressed as positive or negative offsets from UTC. For example, 3:00 a.m. Mountain Standard Time = 10:00 UTC – 7.]). The user’s operating system uses regional settings on the user’s system to convert the UTC time to the user’s local time zone. In order to avoid discrepancies caused by custodians who travel between multiple time zones, or projects with custodians in multiple time zones, normalization[4. Normalization: The process of reformatting data so that it is stored in a standardized form, such as setting the date and time stamp of all ESI data for a matter to a specific zone, often UTC, to be used for de-duplication.] is needed.
Consider for a moment what would happen if we were to process data under different time zones.
Two employees (Custodian A and Custodian B) are key subjects in a lawsuit. Custodian A resides in New York, and Custodian B resides in Los Angeles. Their laptops are forensically imaged[5. Forensic Image: An exact bit-stream copy of all electronic data on a device, performed in a manner that ensures that the information is not altered. (NIST IR 7298 Revision 2, Glossary of Key Information Security Terms)], and their data is processed for Relativity hosting. The Houston, TX based attorney instructs the processing team to handle the data in the “time zone for the custodian”. Without normalization, this instruction will cause significant problems when determining timelines of communications for emails sent to and from the custodians, which in turn may affect the review and production of the processed data.
During deduplication, date and time metadata are key fields. In the above example, if Custodian A and Custodian B have copies of the same email sent from a third party on Sunday, March 9, 2014 at 7am UTC, the two copies would not deduplicate. The metadata for Custodian A’s email was extracted using Eastern time zone settings (3am EDT, or UTC -4). The metadata for Custodian B’s email was offset to Pacific time zone settings (Saturday, March 8th at 11pm PST, or UTC -8). If date and time metadata is used to identify duplicates, both copies would be seen as unique. The result is two copies of the same email showing different received dates and times. As an added complication, the email was sent at the time of year when the change to Daylight Saving Time occurs in most of the US. If the daylight saving offset were not included in the offset calculation, an additional mismatch would occur. Because not all states and countries observe daylight saving time, another layer of complexity exists.
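The mismatch in this example, and the effect of normalization, can be reproduced with Python's standard `datetime` module. The sketch uses fixed offsets (UTC-4 for EDT, UTC-8 for PST) taken from the example above; a real pipeline would use a time zone database so that daylight saving rules are applied automatically.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets taken from the example above.
EDT = timezone(timedelta(hours=-4), "EDT")
PST = timezone(timedelta(hours=-8), "PST")

# The same email as recorded under each custodian's local settings.
custodian_a = datetime(2014, 3, 9, 3, 0, tzinfo=EDT)
custodian_b = datetime(2014, 3, 8, 23, 0, tzinfo=PST)


def local_display(moment):
    """The date/time text as it would appear without normalization."""
    return moment.strftime("%Y-%m-%d %H:%M")


def normalized(moment):
    """The date/time text after converting to UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")


# Compared as local text, the two copies look like different emails.
print(local_display(custodian_a), "vs", local_display(custodian_b))
# Normalized to UTC, they are recognized as the same moment.
print(normalized(custodian_a), "vs", normalized(custodian_b))
```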
Second, consider the impact on the review. Not only do you have multiple instances of the same document that survived deduplication, but without a standard or normalized date/time field the review team cannot run searches or sort documents for the purposes of creating a chronology. Finally, perhaps the most critically important but often overlooked consideration is uniformity between parties. If, for example, one side decided to process and subsequently produce data in PST (Pacific Standard Time), and the other side decided to process data in EST (Eastern Standard Time), the production sets would have a three-hour discrepancy, leading to confusion and possible discovery protocol disputes.
Having a normalized and standard time zone for all data processed is a critical aspect of data processing, but that is not to suggest that other time zones cannot also be displayed in the review environment or that another time zone cannot be used as the base time zone for processing. Both of these options can and should be explored depending on the matter; however, there are a few things to consider in both situations.
- If additional time zone offsets are displayed during the review, it is important that the review team understand which time zone is/will be displayed on any images for production. It is important that a single time zone is selected so that a chronology can be created across custodians/time zones easily.
- In some cases, both sides agree to process and produce data in a time zone other than UTC, and that is perfectly acceptable. Remember, date/time information is stored in UTC; it is simply the workstation settings that offset the files to a particular time zone. Let us consider a case where the subject company and all of the employees are based in New York, NY, counsel is in New York, NY and the case is filed in NY court. Does it make sense to process data using UTC? Probably not. It may make more sense for this case to process everything using Eastern Time.
In both instances, transparency and normalization are key. Some processing applications allow the technician to include the time zone in the date/time displayed on any images. While this setting is not always available, this simple inclusion can address a lot of questions when reviewing data. In addition, some processing applications will provide the time zone (for example, EDT) as a field value, which can be used by the review team to determine chronology for emails and can be requested as a field to include in production deliveries.
Document sets often contain many different types of files. Not all of them can be reviewed, nor do they need to be. Creating reports based on the processed data can help streamline review and increase productivity by identifying files that do not need to be included in the review set. In many cases, reports can allow the first-pass review to eliminate many files missed by de-NISTing and deduplication.
Reports can be used as part of the culling process. For example, summarizing date ranges or tallying custodians can yield information that can help identify missing data, or data that does not need to be passed on to reviewers. Reports can also provide insight into the number of documents that are responsive to a certain search term, making them an important step in creating and revising search term lists.
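A summary of this kind can be produced with a few lines of code once item-level metadata has been captured. The records, field names and the `summarize` function below are hypothetical, standing in for whatever a processing tool exports.

```python
from collections import Counter
from datetime import date


def summarize(items):
    """Tally items by file type and custodian and report the date range."""
    dates = [item["modified"] for item in items]
    return {
        "by_type": Counter(item["extension"] for item in items),
        "by_custodian": Counter(item["custodian"] for item in items),
        "date_range": (min(dates), max(dates)),
    }


# Hypothetical exported metadata.
items = [
    {"custodian": "Custodian A", "extension": ".docx", "modified": date(2013, 12, 3)},
    {"custodian": "Custodian A", "extension": ".xlsx", "modified": date(2014, 1, 15)},
    {"custodian": "Custodian B", "extension": ".docx", "modified": date(2014, 3, 9)},
    {"custodian": "Custodian B", "extension": ".dwg", "modified": date(2012, 6, 30)},
]

report = summarize(items)
print(report["by_type"].most_common())
print(report["by_custodian"])
print(report["date_range"])
```

A date range that starts later than expected for a custodian, or a file type that needs proprietary software, is the sort of finding worth raising before review begins.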
Another way reports can be implemented is to summarize data to send back to the client for input. An example would be to provide the client with a list of the files and file types that need proprietary software for review. Often the client can provide different versions of those files which can easily be reviewed in standard programs, such as Adobe.
When processing files, you will often encounter password-protected items. The ideal situation is to receive unprotected files. However, if you do encounter them during the processing stage, many types of software offer the option to enter a password and retry the file. Passwords can be requested by providing the client with a list of the protected items once the tool has identified them.
Other options include password-cracking software. Depending on the native file type and the encryption method, off-the-shelf products exist for password cracking. If files still cannot be opened, a list should be provided to indicate the protected files. This list can be reviewed alongside file locations to determine whether it is necessary to take the matter further and have an expert work to open the files.
Extraction of Embedded Images
Emails often contain images that are incorporated into the signature line. This is often a corporate logo or sometimes a design element around a signature. Processing software can mistake these items for embedded objects that need to be extracted as separate documents. Some processing software can detect this and only extract images that are true attachments. Pictures attached to an email that were photographed or created separately from the email can be extracted as their own files. The key is not to extract items added as part of a signature block or stationery; doing so creates separate documents and pollutes the database with extra documents that should not be separated.
Processing and Problem Files
What are Exceptions?
In eDiscovery, exceptions are documents that cannot be completely interpreted or understood by the processing tool and that therefore require reporting and/or remediation.
Types of Exceptions
Corrupt files have structural problems which prevent them from being opened or manipulated even in their native application. File corruption can be caused by numerous factors such as network transmission errors, errors in the medium where files were stored (e.g. bad sectors on a hard drive) or unexpected termination of the software that was being used to edit the file (e.g. a power failure).
When handling corrupt file exceptions, the first course of action usually is to investigate the possibility of obtaining a replacement. If a replacement copy is not available, depending on the nature of the case and how critical the corrupt file is, attempting to repair the file may be a viable option (e.g. recovering a corrupt mailbox). Alternatively, the corrupt file can be excluded from processing and delivered in native format. In any case, the exception should be logged and all steps taken should be thoroughly documented.
Unsupported files are files that do not support common e-Discovery actions such as text and metadata extraction. For example, system files such as executables and dynamic link libraries are typically unsupported file types. Depending on the type of file and the processing tool, some unsupported files can be text-stripped, a more rudimentary form of extraction that may still offer some value over having no extracted text at all.
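A minimal sketch of rudimentary text stripping, assuming the approach is simply to pull runs of printable ASCII characters out of the raw bytes (similar in spirit to the Unix `strings` utility). The function name and the minimum run length of four characters are assumptions; processing tools may strip text differently and may also handle other encodings.

```python
import re

def strip_text(data, min_length=4):
    """Return runs of printable ASCII at least min_length characters long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return [match.decode("ascii") for match in re.findall(pattern, data)]

# Binary content with two readable fragments and one run that is too short.
sample = b"\x00\x01Hello world\x00\xffab\x00Invoice 42\x00"
fragments = strip_text(sample)
```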
Encrypted files are files that were protected by a password, via digital rights management (DRM) or other encryption schemes. Encrypted files can be single documents such as Microsoft Office files or PDFs, or encrypted containers such as TrueCrypt volumes.
Attorneys may be able to obtain the passwords for the encrypted files in the data set. If passwords are not available, they can often be discovered by strategically reviewing neighbor documents or by attempting to crack the passwords. In either case, the decrypted files can be loaded into the processing tool as completely new files, or they can be overlaid in place to maintain their family relationships within the current database.
In general, any exception encountered while processing a container file, such as a mail container or forensic container, is considered a critical exception. Failing to process even one NSF file due to corruption, encryption, etc. could result in missing hundreds of thousands of emails. In most workflows, any and all container exceptions should be reported and remediated with high priority.
Exception Handling: How Should Exceptions be Tracked, Handled and Reported?
The processing software should provide the following mechanisms for exception tracking, handling and reporting:
- All encountered exceptions should be logged. The log files should contain detailed information about the exceptions, such as the full file path, file name, hash value and a description of the exception. These logs should be sent to the sponsoring attorney, who decides whether or not to pursue replacements, passwords or static images from the software where the file originated.
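The log contents listed above could be captured along the following lines. This is a sketch under stated assumptions: the column names, the CSV layout, the use of SHA-256 (an MD5 value could be recorded the same way with `hashlib.md5`), and the set of container extensions used to flag critical exceptions are all illustrative choices, not requirements of this guide.

```python
import csv
import hashlib
import io
import os

FIELDS = ["full_path", "file_name", "sha256", "description", "critical"]
# Assumed container extensions; adjust to the formats present in the matter.
CONTAINER_EXTENSIONS = {".pst", ".nsf", ".zip", ".rar"}

def exception_record(full_path, data, description):
    """Build one log row for a file that raised a processing exception."""
    extension = os.path.splitext(full_path)[1].lower()
    return {
        "full_path": full_path,
        "file_name": os.path.basename(full_path),
        "sha256": hashlib.sha256(data).hexdigest(),
        "description": description,
        # Container exceptions are flagged for high-priority remediation.
        "critical": extension in CONTAINER_EXTENSIONS,
    }

def write_exception_log(records, stream):
    """Write the exception rows as CSV with a header line."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)

records = [
    exception_record("custodian_a/mail/archive.nsf", b"\x00bad", "Corrupt container"),
    exception_record("custodian_a/docs/budget.xlsx", b"locked", "Password protected"),
]
log = io.StringIO()
write_exception_log(records, log)
```

In practice the stream would be a file opened with `newline=""`, and the resulting log is what would be sent to the sponsoring attorney.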
Extracted Text (part of Fields from Processing)
Files should be processed to include extracted text. For any files from which the processing system is unable to extract text, e.g., non-searchable PDF files, the files should be imaged and then processed through an OCR generator. Any images created for generating OCR should be loaded to and maintained in the review platform.
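The routing decision described above can be sketched as a simple check on the extracted text. The record keys (`doc_id`, `extracted_text`) and the threshold of one non-whitespace character are assumptions for the example; a real workflow might use a higher threshold or per-page checks.

```python
def route_for_ocr(items, min_chars=1):
    """Split items into those with usable extracted text and those needing OCR."""
    searchable, needs_ocr = [], []
    for item in items:
        text = (item.get("extracted_text") or "").strip()
        if len(text) >= min_chars:
            searchable.append(item["doc_id"])
        else:
            # No usable text: image the file and send the images to OCR.
            needs_ocr.append(item["doc_id"])
    return searchable, needs_ocr

items = [
    {"doc_id": "DOC-001", "extracted_text": "Quarterly results attached."},
    {"doc_id": "DOC-002", "extracted_text": ""},      # e.g. scanned PDF
    {"doc_id": "DOC-003", "extracted_text": None},    # extraction failed
]
searchable, needs_ocr = route_for_ocr(items)
```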
Note: See Item Numbering Identification for more information on numbering images created for OCR generation.
Sample Metadata Field Listing
It is common for a party in litigation to request that a list of metadata fields be exchanged as part of a production. A list of typical metadata fields appears below, divided into a “Commonly Used” section and an “Other Metadata Fields To Consider” section. The important thing to remember is that this list serves only as an example of a list that can be agreed upon between parties. There are millions of metadata properties associated with the hundreds of thousands of types of files that can be processed into most tools. Each of these fields has a potential purpose and may be applicable in the proper context, so it is important to choose a processing tool that extracts all properties associated with a file. The range of available metadata fields can be determined only by actually processing files, which suggests using some form of sampling to identify which metadata fields ought to be processed for a particular project or matter. Where a meet-and-confer process is provided for, it can be beneficial to use it to discuss what metadata will be exchanged.
Commonly Used Fields Resulting from Processing
|Processing Field Name||Description|
|Attachment Document ID(s)||Attachment document IDs of all child items in family group, delimited by semicolon, only present on parent items.|
|Custodian||Custodian associated with (or assigned to) the processing set during processing.|
|Document ID||Unique identifier of the document as it relates to the exported numbering scheme to be used in review.|
|File Name||Subject of the document. If the document is an email, this field contains the email subject. If the document is not an email, this field contains the document’s file name.|
|GUID||Unique identifier of the document in the processing engine database.|
|Hash (MD5 and/or SHA)||Identifying value of an electronic record that is used for de-duplication during processing.|
|Native Path||The path to the native file for a record to be loaded for review.|
|Original Path||Folder structure and path to file from the original location identified during processing.|
|Parent Document ID||Document ID of the parent document. This field is only available on child items.|
|Text Path||The path to the extracted text and/or OCR for a record to be loaded for review.|
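To illustrate how the Hash and Custodian fields in the table above can work together, the following sketch hashes each file's content and keeps one record per hash while collecting every custodian who held a copy. It is a simplified sketch for loose files only: the tuple layout and document IDs are invented, SHA-256 is used here although MD5 is also common, and email is often de-duplicated on a hash of selected fields rather than raw bytes.

```python
import hashlib

def deduplicate(items):
    """Keep the first item seen for each content hash.

    items is an iterable of (doc_id, custodian, data) tuples.
    Returns a dict keyed by hash with the retained doc_id and the list of
    all custodians holding an identical copy.
    """
    unique = {}
    for doc_id, custodian, data in items:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in unique:
            unique[digest] = {"doc_id": doc_id, "custodians": [custodian]}
        elif custodian not in unique[digest]["custodians"]:
            unique[digest]["custodians"].append(custodian)
    return unique

items = [
    ("DOC-001", "Custodian A", b"same content"),
    ("DOC-002", "Custodian B", b"same content"),
    ("DOC-003", "Custodian A", b"different content"),
]
unique = deduplicate(items)
```

Recording every custodian alongside the retained copy preserves the information that is otherwise lost when duplicates are removed globally.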
Other Metadata Fields To Consider From Processing
|Processing Field Name||Description|
|Attachment Name(s)||Attachment file names of all child items in a family group, delimited by semicolon, only present on parent items.|
|Common Custodians||The list of all custodians who have a copy of this same file. Can be used in conjunction with global de-duplication to assist with data minimization, rather than choosing to de-duplicate on a custodial basis.|
|Container Extension||Document extension of the container file in which the document originated, if applicable.|
|Container ID||Unique identifier of the container file in which the document originated, if applicable. This is used to identify or group files that came from the same container.|
|Container Name||Name of the container file in which the document originated, if applicable.|
|Contains Embedded Files||The yes/no indicator of whether a file such as a Microsoft Word document has additional files embedded within it.|
|Conversation||Normalized subject of email messages. This is the subject line of the email after removing the RE and FW that are added by the system when emails are forwarded or replied to.|
|Conversation Index||Email thread created by the email system. This is a 44-character string of numbers and letters that is created in the initial email and has 10 characters added for each reply or forward of an email.|
|Corrected Document Extension||Character extension of the file as a result of file header analysis during processing.|
|Date Created||Date and time a file was created.|
|Date Last Modified||Date and time a file was last modified.|
|Date Last Printed||Date and time that the file was last printed, if applicable.|
|Date Received||Date and time that the email message was received (according to original time zones). This applies to communications only; this field is not populated for most loose files.|
|Date Sent||Date and time that the email message was sent (according to original time zones). This applies to communications only; this field is not populated for most loose files.|
|Document Class||This field can be one of Email, Edoc, or Attachment.|
|Domains (Email BCC)||Domains of ‘Blind Carbon Copy’ recipients of the email message. See the Note below.|
|Domains (Email CC)||Domains of ‘Carbon Copy’ recipients of the email message. See the Note below.|
|Domains (Email From)||Domains of Originator of the email message. See the Note below.|
|Domains (Email To)||Domains of ‘To’ recipients of the email message. See the Note below.|
|Email BCC||Recipients of ‘Blind Carbon Copies’ of the email message.|
|Email CC||Recipients of ‘Carbon Copies’ of the email message.|
|Email From||Originator of the email message.|
|Email To||List of recipients or addressees of the email message.|
|File Size||Generally a decimal number indicating the size in bytes of a file.|
|File Type or Kind||The type of native file loaded into the system- for example email, spreadsheet, presentation, calendar item, etc. Less detailed than mime-type.|
|Has Hidden Data||Indication of the existence of hidden document data such as hidden text in a Word document, hidden columns, rows, or worksheets in Excel, or slide notes in PowerPoint.|
|Importance||Notation created for email messages to note a higher level of importance than other email messages added by the email originator.|
|Last Accessed Date/Time||The date and time at which the loose file was last accessed.|
|Last Saved Date/Time||The internal value entered for the date and time at which a document was last saved.|
|Level||Numeric value indicating how deeply nested the document is within the family. The higher the number, the deeper the document is nested.|
|Message Header||The full string of values contained in an email message header.|
|Mime-type||Description that represents the file type to the Windows Operating System. Examples are Adobe Portable Document Format, Microsoft Word 97 – 2003 Document, or Microsoft Office Word Open XML Format.|
|Near-duplicate ID(s)||Indicates the control numbers of any record(s) that are near-duplicates of the file.|
|Number of Attachments||Number of files attached to a parent document.|
|OCR Text||The yes/no indicator of whether the extracted text field contains OCR text.|
|Original Document Extension||Character extension of the file as received by the processing engine.|
|Originating Processing Set||The processing set in which the document was processed.|
|Encrypted||Indicates the documents that were password protected. It contains the value ‘Decrypted’ if the password was identified, ‘Encrypted’ if the password was not identified, or no value if the file was not password protected.|
|Primary Date or Item Date||Date taken from Email Sent Date, Email Received Date, or Last Modified Date, in that order of precedence.|
|Processing Errors||Any associated errors that occurred on the document during processing. This field is a link to the associated Processing Errors record.|
|Processing||The intake of file information and links to files for use in a database that provides a collaboration environment.|
|File system||The area of the computer system that provides organization and storage of information and programmatic functions.|
|Virus protection||An active application that guards against infection from a computer virus and alters files it finds problematic.|
|Cleansing||Removal of data from file.|
|Network||A central linked group of computers which [xx]|
|Custodian||A person who owns data.|
|Container file||File that holds multiple other files generally for compression or security.|
|Normalization||Equalization across a dataset to make all things consistent.|
|Uncompressed||Files extracted from container files are uncompressed and the file size is consistent with the original size before adding to the container file and compressing.|
|Unprocessable||Files that cannot be opened or read for extracting the metadata and inserting into a database.|
|Metadata||File information related to various aspects of a file generated by the file system and not user created.|
|Alpha-numeric identifier||Unique identifier that contains letters and numbers combined.|
|Doc id||Document identifier is a unique name for each file in the system.|
|Tiffing||Slang term for creating a TIFF image file from the native version of a file.|
|Tifs||A file format that is a static image of a file.|
|deNIST||Refers to the National Institute of Standards and Technology (NIST), which publishes a list of known system files that belong to a standard Windows installation or other software. These files are not client generated and are removed during the processing phase.|
|Hash values||Algorithm-generated identifiers, based on file information, that are used to identify duplicates.|
|MD5 Hash||An algorithm used to calculate a hash value.|
|Compression||The saving of a file in a reduced file size using a container file so that it can be smaller for purposes of easier transport among devices. It also often includes encryption of files.|
|Loose File||A stand-alone file that is not part of something else, such as a group of files or an attachment to an email.|
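Two of the derived fields in the metadata listing earlier in this section, Conversation and Primary Date or Item Date, follow simple rules that can be sketched in code. The sketch below assumes that only the prefixes RE, FW and FWD are stripped and that missing dates are represented as `None`; the record keys are hypothetical, and actual tools may recognize more prefixes (including localized ones).

```python
import re
from datetime import datetime

# Leading reply/forward prefixes, possibly repeated: "RE: FW: subject".
_PREFIXES = re.compile(r"^\s*(?:(?:re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def normalize_subject(subject):
    """Remove reply and forward prefixes to get the Conversation value."""
    return _PREFIXES.sub("", subject).strip()

def primary_date(item):
    """Pick Sent, then Received, then Last Modified, in that order."""
    for field in ("date_sent", "date_received", "date_last_modified"):
        value = item.get(field)
        if value is not None:
            return value
    return None

email_item = {
    "date_sent": None,
    "date_received": datetime(2014, 5, 1, 9, 0),
    "date_last_modified": datetime(2014, 6, 1, 12, 30),
}
conversation = normalize_subject("RE: FW: Q3 budget")
item_date = primary_date(email_item)
```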
Potential Future Topics
- Language Identification – Sean
- Special Considerations: Parallel Processing and Extraction – Sean
- Project Intake or Details Form – Sena
- Native Support does not use MAPI – Greg
- EML Files – Sean
- Mime vs Text File ID and Extraction – Greg
- Processing Lotus Notes email – Sean
- Support for Non-Email Databases Output of Lotus Notes – Sean
- MHT, Rich Text and HTML – Corinne
- VCF, ICX Formats – Greg
- Other Folders Field – Greg
- Identifying PII – Sean
- ECA – Greg