2.0 Initial Filtering

Depending on the nature of the matter, the scope of the collection can vary greatly, from narrow and targeted to broad and overly inclusive. As a result, processing software must be able to filter files by type, date range or other operator-specified criteria. With broad collections, processing software should offer the ability to remove or hold collected but unwanted files rather than moving them to the next stage of processing.

Sections 2.1 through 2.3 discuss the types of filters a processing system should include.
 

2.1 Identify System and Program Files Based on the NIST List

System and program files[1] are rarely useful in e-discovery and are often removed (or at least not promoted further) during processing. They are easily identified by comparing the hash value of each file with an extensive hash list maintained by the National Institute of Standards and Technology (“NIST”).[2] If a file's hash value matches an entry on the list, the file can safely be identified as a system or program file.

NIST files are typically withheld from further processing because they do not contain discoverable content and are not useful in the e-discovery process. This process is known as DeNISTing, and it reduces the volume of data to be hosted and later reviewed.
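The hash comparison described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names, the sample file names and the use of SHA-1 are assumptions, and the small `known_hashes` set is a hypothetical stand-in for the NIST list, which in practice would be loaded from the published reference data.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of raw file content as uppercase hex."""
    return hashlib.sha1(data).hexdigest().upper()


def denist(files: dict[str, bytes], known_hashes: set[str]) -> tuple[list[str], list[str]]:
    """Split files into (promoted, withheld) by comparing each hash to a known list."""
    promoted, withheld = [], []
    for name, content in sorted(files.items()):
        if sha1_hex(content) in known_hashes:
            withheld.append(name)   # matches the list: treat as a system or program file
        else:
            promoted.append(name)   # no match: continue to the next processing stage
    return promoted, withheld


# Hypothetical stand-in for the reference hash list (assumption for this sketch).
system_file = b"pretend operating system library"
known_hashes = {sha1_hex(system_file)}

files = {
    "kernel32.dll": system_file,
    "memo.txt": b"Quarterly forecast draft",
}

promoted, withheld = denist(files, known_hashes)
print("Promoted:", promoted)
print("Withheld:", withheld)
```

Withheld files are kept in a separate list rather than deleted, which mirrors the practice of holding unwanted files instead of discarding them.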
 

2.2 Identify and Remove Duplicate Files

As previously discussed, duplicate files may be identified by hash values. If two files have the same hash value, their content is, for practical purposes, identical. The process of removing or withholding duplicate files from further processing is known as “deduping” or “deduplication.”

In some cases, deduplication by custodian is performed by identifying and removing or withholding all but one copy of each document maintained by the custodian. This reduces the volume of documents associated with that custodian, avoids repetitive review, and ensures the custodian is associated with the file that may have evidentiary significance.

In other cases, deduplication is performed across all custodians in a process known as global deduping. This process leaves one copy of the file to be promoted while withholding the rest.
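The difference between the two approaches comes down to the key used to decide whether a file has been seen before. The sketch below is illustrative only: the `dedupe` function, its `scope` parameter, the sample records and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def dedupe(records, scope="global"):
    """Return (kept, withheld) lists of (custodian, path).

    records: iterable of (custodian, path, content_bytes).
    scope:   "custodian" keeps one copy per custodian;
             "global" keeps one copy across all custodians.
    """
    seen = set()
    kept, withheld = [], []
    for custodian, path, content in records:
        digest = hashlib.sha256(content).hexdigest()
        # The dedup key is the hash alone (global) or custodian plus hash.
        key = digest if scope == "global" else (custodian, digest)
        if key in seen:
            withheld.append((custodian, path))
        else:
            seen.add(key)
            kept.append((custodian, path))
    return kept, withheld


# Hypothetical sample collection.
records = [
    ("Alice", "/alice/report.docx", b"final report"),
    ("Alice", "/alice/backup/report.docx", b"final report"),
    ("Bob", "/bob/report.docx", b"final report"),
    ("Bob", "/bob/notes.txt", b"meeting notes"),
]

by_custodian_kept, by_custodian_withheld = dedupe(records, scope="custodian")
global_kept, global_withheld = dedupe(records, scope="global")

print(len(by_custodian_kept), "files kept with custodian-level deduping")
print(len(global_kept), "files kept with global deduping")
```

In this sample, custodian-level deduping keeps a copy of the report for both Alice and Bob, while global deduping keeps only the first copy encountered.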

With either approach to deduplication, it is customary to accompany the promoted file with information, delivered in what is known as a load file, that shows where else the file appears in the larger collection. This information should include two fields, “ALL CUSTODIANS” and “ALL SOURCE PATHS,” that identify the custodians who held copies and the locations where the copies reside. The load file may ultimately be loaded into a litigation support database. This information should be updated every time new data is processed.
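As a rough illustration of how the two fields might be assembled, the sketch below groups every copy of a file by hash and writes one row per unique file. The field names “ALL CUSTODIANS” and “ALL SOURCE PATHS” come from the text; the `HASH` column, the `build_load_rows` function, the `"; "` separator and the CSV layout are assumptions, since load file formats vary between systems.

```python
import csv
import hashlib
import io


def build_load_rows(records):
    """Group copies by content hash and list every custodian and source path."""
    by_hash = {}
    for custodian, path, content in records:
        digest = hashlib.sha256(content).hexdigest()
        entry = by_hash.setdefault(digest, {"custodians": [], "paths": []})
        if custodian not in entry["custodians"]:
            entry["custodians"].append(custodian)
        entry["paths"].append(path)
    return [
        {
            "HASH": digest,
            "ALL CUSTODIANS": "; ".join(entry["custodians"]),
            "ALL SOURCE PATHS": "; ".join(entry["paths"]),
        }
        for digest, entry in by_hash.items()
    ]


# Hypothetical sample collection.
records = [
    ("Alice", "/alice/report.docx", b"final report"),
    ("Bob", "/bob/report.docx", b"final report"),
    ("Bob", "/bob/notes.txt", b"meeting notes"),
]

rows = build_load_rows(records)

# Write the rows as delimited text, as a simple stand-in for a load file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["HASH", "ALL CUSTODIANS", "ALL SOURCE PATHS"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Rebuilding the rows from the full set of records each time new data is processed is one simple way to keep both fields current.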
 

2.3 Filtering by Date Range or File Types

Excluding files by criteria such as date range or file type is another method of reducing the volume of files processed for review. If the matter has a defined date range, it may be prudent, depending on the issues involved, to exclude files outside that range from further review.

Processing software should include a master date field in order to filter consistently. The master date (or DocDate) may be drawn from different date and time fields depending on the source file, e.g. the SentDate for email, the LastSavedDate, etc.
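One way to picture a master date is as the first available value from an ordered list of source fields. The sketch below assumes that approach; the precedence order, the `master_date` and `in_range` helpers and the sample metadata are hypothetical, and a real system may apply different rules for each file type.

```python
from datetime import datetime
from typing import Optional

# Hypothetical precedence: prefer SentDate (email), then LastSavedDate.
DATE_FIELD_PRECEDENCE = ("SentDate", "LastSavedDate")


def master_date(metadata: dict) -> Optional[datetime]:
    """Return the first populated date field, or None if no date is available."""
    for field in DATE_FIELD_PRECEDENCE:
        value = metadata.get(field)
        if value is not None:
            return value
    return None


def in_range(metadata: dict, start: datetime, end: datetime) -> bool:
    """True when the master date falls within the inclusive range."""
    doc_date = master_date(metadata)
    return doc_date is not None and start <= doc_date <= end


# Hypothetical sample metadata.
email = {"SentDate": datetime(2019, 3, 15, 9, 30)}
spreadsheet = {"LastSavedDate": datetime(2021, 7, 1, 16, 0)}
undated = {}

start, end = datetime(2019, 1, 1), datetime(2019, 12, 31, 23, 59, 59)

for label, meta in [("email", email), ("spreadsheet", spreadsheet), ("undated", undated)]:
    print(label, in_range(meta, start, end))
```

In this sketch, files without any usable date fall outside the range. Holding such files for manual review instead of excluding them is an equally plausible design choice.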

Processing software should permit identification and inclusion (or exclusion) of files by date range and file type, with the understanding that the system identifies file types through an appropriate process that goes beyond reliance on file extensions.
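One common way to go beyond file extensions is to inspect the leading bytes of a file for a known signature. The sketch below is an assumption about how such a check might look, not a description of any particular product: the `detect_type` function and its labels are hypothetical, and the signature table covers only a handful of well-known formats.

```python
# Leading-byte signatures for a few well-known formats (deliberately incomplete).
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip-container"),                          # also used by .docx, .xlsx
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-container"),   # legacy Office files
]


def detect_type(content: bytes) -> str:
    """Identify a file by its leading bytes, ignoring its name and extension."""
    for magic, label in SIGNATURES:
        if content.startswith(magic):
            return label
    return "unknown"


# Hypothetical sample: a PDF that has been renamed with a .txt extension.
files = {
    "invoice.txt": b"%PDF-1.7\n% sample content",
    "notes.txt": b"Plain text meeting notes",
}

detected = {name: detect_type(content) for name, content in files.items()}
for name, label in detected.items():
    print(name, "->", label)
```

A file type filter built on `detected` would treat `invoice.txt` as a PDF despite its extension. Container formats would need further inspection to distinguish, for example, a word processing document from an ordinary archive.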


[1] System files are files associated with and used by a computer’s operating system, e.g. Windows, Linux or Mac OS. Program files are those associated with and used by the applications we run on computers.

[2] See https://www.nist.gov/itl/ssd/software-quality-group/national-software-reference-library-nsrl.