4.0 Processing Output

The goal of e-discovery processing is to convert selected files, images, text and metadata into a format appropriate for loading into litigation support software by delivering one or more load files, along with the associated native, image and text files that result from processing. In some cases, native files are converted into an image format and delivered with the natives.

Many of the standard steps for delivering processing output are optional, depending on the needs of the matter. The following are typical options in e-discovery processing software:

4.1 Keyword and Metadata Filtering

In some cases, legal professionals may want to further reduce the volume of files being promoted for review by filtering on metadata criteria such as date ranges, custodians, recipients, and subjects. Keyword searches for terms and phrases in document or message text that bear on a party's claims and defenses can likewise be used to filter and reduce data volume.

Metadata searches can be run against the processing database. Keyword search requires that the processing software index the text extracted from processed files. Often, the search engine used for this purpose will be similar to the one used during the search and review phase.
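The two kinds of culling search described above can be sketched in a few lines. This is a minimal illustration and not how any particular processing product works: the record fields (custodian, sent, subject), the sample data and the function names are all hypothetical, and the "index" is a simple in-memory inverted index with no stemming, phrase or proximity support.

```python
import re
from collections import defaultdict
from datetime import date

# Hypothetical processing-database rows; field names are illustrative only.
records = [
    {"id": 1, "custodian": "jsmith", "sent": date(2020, 3, 14), "subject": "Q1 forecast"},
    {"id": 2, "custodian": "mlee", "sent": date(2019, 11, 2), "subject": "Lunch"},
    {"id": 3, "custodian": "jsmith", "sent": date(2021, 6, 30), "subject": "Forecast revisions"},
]

# Hypothetical text extracted from each processed file, keyed by record id.
extracted_text = {
    1: "Attached is the Q1 forecast.",
    2: "Lunch at noon?",
    3: "Revised forecast attached.",
}

def metadata_filter(rows, custodians, start, end):
    """Keep rows for the given custodians whose sent date falls within [start, end]."""
    return [r for r in rows if r["custodian"] in custodians and start <= r["sent"] <= end]

def build_index(texts):
    """Build a simple inverted index: lowercase token -> set of record ids."""
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index

index = build_index(extracted_text)
in_scope = metadata_filter(records, {"jsmith"}, date(2020, 1, 1), date(2020, 12, 31))
keyword_ids = index.get("forecast", set())

# Records that pass both the metadata filter and the keyword search.
hits = [r for r in in_scope if r["id"] in keyword_ids]
```

The metadata filter needs only the database rows, while the keyword search depends on the index having been built first, which mirrors the distinction drawn in the paragraph above.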

In most cases, the assignment is to run relatively simple metadata and keyword searches that can be used to safely cull the processing output before promoting it to the next EDRM stage. Culling searches reduce the volume of files for review and analysis. Note that keyword searches can miss relevant documents for a variety of reasons, including poor search construction, misspellings of key terms, and the use of acronyms, synonyms or code names not included in the search syntax. Thus, there is always a tradeoff between what data scientists refer to as “precision” (the share of retrieved documents that are relevant) and “recall” (the share of relevant documents that are retrieved) when using keyword search to find relevant information or to reduce the review population.
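The precision and recall tradeoff can be made concrete with a small calculation. The sketch below assumes that the set of truly relevant documents is known, which in practice is only estimated through sampling; the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a search result set.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: the search returns four documents, two of which
# are among the five documents that are actually relevant.
search_hits = {1, 2, 3, 4}
truly_relevant = {2, 3, 5, 6, 7}
precision, recall = precision_recall(search_hits, truly_relevant)
```

Here half of what the search returned is relevant, yet the search found only two of the five relevant documents. Broadening the terms would usually raise recall at the cost of precision, and narrowing them would do the reverse.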

4.2 Developing Load Files

Litigation support software used during the review phase needs metadata for a number of reasons. Metadata is used to provide information about the document or message to reviewers. The resulting metadata fields are used to filter searches and to sort results. The database must include appropriate information to link each record to the underlying native, image and text documents that are output during processing. This allows the reviewer to not only search against fields and text but to review that information together on a computer screen.

Thus, the output of a processing system must include a load file that provides important information about each document and also facilitates data and file loading into the system. There are a number of standard load files used in e-discovery processing including:

  • Concordance load file (DAT and LFP/OPT for images)
  • Other Custom Delimited Files[1]
  • Microsoft Access database file (MDB)
  • Summation load file format and
  • EDRM load file format.

These standard formats may not be sufficient for all types of processing output, particularly SMS and IM collections. Such messages often include emojis and embedded pictures that might not be rendered consistently with the original content. Likewise, the conversational format may be lost if the system tries to render the data in a traditional page format. As a result, different processing companies are developing new load file types for these conversational formats.

Ultimately, a good processing system will allow administrators to choose between different types of load file formats based on the chosen litigation support software.
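As an illustration of what a delimited load file contains, the sketch below builds a small Concordance-style DAT in memory. It assumes the commonly cited default delimiters (ASCII 020 as field separator, ASCII 254 "þ" as text qualifier, ASCII 174 "®" as the in-field newline substitute); the receiving platform's specification should be checked before relying on them. The field names (BEGDOC, CUSTODIAN, NATIVE_PATH, TEXT_PATH) and the sample row are hypothetical.

```python
FIELD_SEP = chr(20)   # assumed default field separator (ASCII 020)
QUOTE = chr(254)      # assumed default text qualifier (þ)
NEWLINE = chr(174)    # assumed default in-field newline substitute (®)

def dat_line(values):
    """Wrap each value in the qualifier and join values with the field separator."""
    cleaned = [str(v).replace("\r\n", "\n").replace("\n", NEWLINE) for v in values]
    return FIELD_SEP.join(f"{QUOTE}{v}{QUOTE}" for v in cleaned)

def build_dat(fields, rows):
    """Return DAT text: one header line followed by one line per record."""
    lines = [dat_line(fields)]
    lines += [dat_line([row.get(f, "") for f in fields]) for row in rows]
    return "\r\n".join(lines) + "\r\n"

# Hypothetical field list and record; the paths link the database record
# to the native and text files delivered alongside the load file.
fields = ["BEGDOC", "CUSTODIAN", "NATIVE_PATH", "TEXT_PATH"]
rows = [
    {
        "BEGDOC": "ABC0000001",
        "CUSTODIAN": "Smith, John",
        "NATIVE_PATH": "NATIVES\\ABC0000001.msg",
        "TEXT_PATH": "TEXT\\ABC0000001.txt",
    }
]
dat_text = build_dat(fields, rows)
```

Because the separator and qualifier are characters that rarely occur in real data, a value such as "Smith, John" passes through intact, which is the problem with comma delimiters that footnote [1] describes.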

4.3 Converting Native Files to Images

Some litigation support software requires files be converted to images for viewing. There are several standard image formats available, including:

  • Single-page TIFFs: Each file holds one page image, so a multipage document produces multiple files. No text is included.
  • Multi-page TIFFs: Each file holds multiple page images but does not include text. These are typically not used in litigation support software.
  • PDFs: This is a multipage, color file format which may include text information and is used in many modern litigation support systems.
  • JPEGs or PNGs: These are typically used for color images in systems that rely on TIFFs as the basic image format.

In the early days of e-discovery, single-page TIFFs were the standard. Today, many systems prefer PDFs or even a near-native format such as SVG (Scalable Vector Graphics) because they can display color and the image files are compressed.

4.4 Creating Text Files

Many litigation support software systems require a separate text file for each image or native file output. The review system then indexes these text files to make them searchable. Once users retrieve a document through search, the associated image or native rendition of the document is analyzed for relevance to the inquiry.

4.5 OCR Image Files

With the exception of some types of PDF files, most images do not have extractable text and are not keyword searchable. As a result, many processing systems include the ability to run optical character recognition (OCR) on files so that text can be extracted for later search.

While OCR does not always capture image text correctly, modern OCR software does a good job of correctly identifying text from scanned images. At the least, OCR’d text is better than having nothing.

4.6 File Names

File naming is an important output function for processing software. While each file has an original name, it is common to rename files to correspond with an assigned control number.[2] Control numbers are often issued consecutively as files are processed, which may provide some information as to file origin and proximity to other files. Many processors include a text prefix or suffix to provide further information about a file’s origin or purpose.
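A minimal sketch of consecutive control numbering follows. The prefix "ABC", the seven-digit zero padding and the record layout are assumptions chosen for illustration; actual numbering schemes are set by the matter or the production protocol. The original file name is kept in each record, as footnote [2] recommends.

```python
from pathlib import PurePath

def assign_control_numbers(original_names, prefix="ABC", start=1, width=7):
    """Assign consecutive, zero-padded control numbers in processing order."""
    records = []
    for offset, name in enumerate(original_names):
        control = f"{prefix}{start + offset:0{width}d}"
        records.append(
            {
                "control_number": control,
                "original_name": name,  # preserved as processing metadata
                "output_name": control + PurePath(name).suffix,
            }
        )
    return records

# Hypothetical file names, listed in the order they were processed.
numbered = assign_control_numbers(["Budget v2.xlsx", "memo.docx", "notes"])
```

Zero padding keeps the output names in the same order whether they are sorted as numbers or as text, which helps when files are browsed outside the review database.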

Many call these IDs “Bates numbers.” Bates refers to a popular brand of hand-operated numbering stamp that was used to identify paper documents being produced. Computers typically overlay these numbers on document images for production purposes while inserting them in a field for the associated database record.

4.7 Family Relationships

As mentioned earlier, many email files act as containers for their attachments. Processing software must extract these attachments, which can include additional container files, each of which must be exploded recursively. In doing so, the software must number the attachments consecutively and keep a record of the control numbers for the parent email and its attachments.

When files are output, it is important to have a record showing which files were part of an email or container family.

Most litigation support software will preserve the family relationship such that a reviewer can review emails by family and, in some cases, tag both the parent email message and its attachments at one time.
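The recursive extraction and numbering described in this section can be sketched as a depth-first walk. The nested dictionary standing in for an email, the "ABC" prefix and the parent_id and family_id field names are all hypothetical; real processing tools read the attachments from the container files themselves.

```python
from itertools import count

def number_family(item, counter, parent_id=None, family_id=None, rows=None):
    """Number an item and its nested attachments consecutively, depth first.

    Each output row records the item's own control number, its immediate
    parent, and the control number of the top-level parent (the family).
    """
    if rows is None:
        rows = []
    control = f"ABC{next(counter):07d}"
    family_id = family_id or control
    rows.append(
        {
            "control_number": control,
            "name": item["name"],
            "parent_id": parent_id,
            "family_id": family_id,
        }
    )
    for child in item.get("children", []):
        number_family(child, counter, control, family_id, rows)
    return rows

# Hypothetical email with two attachments, one of which is itself a container.
email = {
    "name": "RE: contract.msg",
    "children": [
        {"name": "report.pdf"},
        {"name": "archive.zip", "children": [{"name": "a.docx"}, {"name": "b.xlsx"}]},
    ],
}
family_rows = number_family(email, count(1))
```

Sharing one counter across the whole walk keeps a family's control numbers contiguous, and the family_id column gives review software a single value to group on.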

4.8 Email Threading

Many email messages are part of a larger conversation, involving the original message and one or more replies. When an email is sent to multiple recipients, the number of replies and further replies held by the recipients can be large and spread throughout the collection. Review of these files can become repetitive and inefficient.

Many processing systems include the ability to analyze certain components of an email message to determine whether it is part of a larger conversation. If so, it will provide links to the larger conversation so that it can be reviewed in an integrated manner.
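One way to link messages into conversations is to follow the standard Message-ID and In-Reply-To header fields back to the first message in the chain. The sketch below assumes those headers have already been extracted into dictionaries with hypothetical keys; it is a simplification, since production threading engines typically also consider the References header, normalized subjects and quoted body text.

```python
from collections import defaultdict

def thread_root(message_id, parent_of):
    """Follow In-Reply-To links upward until no further parent is known."""
    seen = set()
    while parent_of.get(message_id) and message_id not in seen:
        seen.add(message_id)  # guards against malformed, circular headers
        message_id = parent_of[message_id]
    return message_id

def group_threads(messages):
    """Group message ids by the root message of their conversation."""
    parent_of = {m["message_id"]: m.get("in_reply_to") for m in messages}
    threads = defaultdict(list)
    for m in messages:
        threads[thread_root(m["message_id"], parent_of)].append(m["message_id"])
    return dict(threads)

# Hypothetical extracted headers: b replies to a, c replies to b, d stands alone.
messages = [
    {"message_id": "<a@example.com>", "in_reply_to": None},
    {"message_id": "<b@example.com>", "in_reply_to": "<a@example.com>"},
    {"message_id": "<c@example.com>", "in_reply_to": "<b@example.com>"},
    {"message_id": "<d@example.com>", "in_reply_to": None},
]
threads = group_threads(messages)
```

If a reply points to a message that was never collected, the missing message's ID still serves as the thread key, so the surviving replies are grouped together.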

4.9 Near Deduplication

Some processing systems offer the ability to identify files that are similar in content although they do not match using the hash algorithms described above. In some cases, they are similar but for a slight change in a metadata value, perhaps in a message header. In other cases, the body text may differ because the documents are highly similar drafts of the same content.

Grouping these documents together through links or other reference information can be valuable in processing because the files can later be reviewed, and sometimes tagged, as a group. Doing so promotes review efficiency, particularly when compared to the prospect of reviewing each file separately. The potential for inconsistent tagging is reduced as well.
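To show why near duplicates escape hash matching yet can still be detected, the sketch below compares two drafts using word shingles and Jaccard similarity. This is one common textbook technique, not necessarily what any given processing system uses; the shingle size of three words, the 0.5 threshold and the sample sentences are assumptions for illustration.

```python
import hashlib
import re

def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical drafts that differ by a single word.
draft_1 = "The parties agree to settle all claims for the sum of ten thousand dollars."
draft_2 = "The parties agree to settle all claims for the sum of twelve thousand dollars."

# A one-word change produces completely different hash values...
same_hash = (
    hashlib.sha256(draft_1.encode("utf-8")).hexdigest()
    == hashlib.sha256(draft_2.encode("utf-8")).hexdigest()
)

# ...but most of the shingles are still shared between the two drafts.
similarity = jaccard(shingles(draft_1), shingles(draft_2))
NEAR_DUP_THRESHOLD = 0.5  # hypothetical cutoff; tuned per matter in practice
is_near_duplicate = similarity >= NEAR_DUP_THRESHOLD
```

Comparing every pair of documents this way does not scale to large collections, so real systems generally rely on approximations to the same idea to find candidate pairs efficiently.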

[1] These are often referred to as CSV files, a reference to a delimiter format involving comma-separated values. In practice, using a comma to separate values can cause problems because commas can occur within field values. For example, "Smith, John" may be the value in a Name field. The better practice is to use other delimiters, e.g. pipes (|) or carets (^), that are unlikely to appear naturally in the data.

[2] In such cases, it is important to save the original file name as part of the processing metadata.